Setting things up

Installation

[1]:
%pip install AutoCarver[jupyter]

California Housing Prices Data

In this example notebook, we will use the California Housing Prices dataset.

The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.

It comprises housing-related metrics for districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more. This makes it a valuable resource for exploring the relationships between features and for predicting median house values (a continuous regression task).

[3]:
import pandas as pd

from sklearn import datasets

# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)

# conversion to pandas
housing_data = housing["data"]
housing_data[housing["target_names"][0]] = housing["target"]

# Display the first few rows of the dataset
housing_data.head()
[3]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Target type and Carver selection

[4]:
target = "MedHouseVal"

housing_data[target].describe()
[4]:
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target "MedHouseVal" is a continuous target of type float64 used in a regression task. Hence we will use AutoCarver.ContinuousCarver and AutoCarver.selectors.RegressionSelector in the following code blocks.

Data Sampling

[5]:
from sklearn.model_selection import train_test_split

# random split; the target is continuous, so no stratification is applied
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)
[6]:
# checking the mean target value per dataset
train_set[target].mean(), dev_set[target].mean()
[6]:
(2.0666362048018514, 2.072459655020552)

Picking up columns to Carve

[7]:
train_set.head()
[7]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
5088 0.9809 19.0 3.187726 1.129964 726.0 2.620939 33.98 -118.28 1.214
17096 4.2232 33.0 6.189696 1.086651 1015.0 2.377049 37.46 -122.23 3.637
5617 3.5488 42.0 4.821577 1.095436 1044.0 4.331950 33.79 -118.26 2.056
20060 1.6469 24.0 4.274194 1.048387 1686.0 4.532258 35.87 -119.26 0.476
895 3.9909 14.0 4.608303 1.089350 2738.0 2.471119 37.54 -121.96 2.360
[8]:
# column data types
train_set.dtypes
[8]:
MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object

All features are quantitative continuous features, with the exception of Latitude and Longitude, which are geographical features (not supported by AutoCarver as is). All other features will be added to the list of quantitative_features.

[9]:
# lists of features per data type
quantitative_features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]
qualitative_features = []
ordinal_features = []

# user-specified ordering for ordinal features
values_orders = {}

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used by Discretizers:

  • For quantitative features, it defines the number of quantiles used to initially discretize the features.

  • For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.

[10]:
min_freq = 0.05

Tip: min_freq should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
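To make the link between min_freq and the number of quantiles concrete, here is a minimal sketch using only the Python standard library. It assumes the number of quantiles is round(1 / min_freq), which is consistent with the 20 raw modalities of about 5 % each shown in the fitting log, but it is not AutoCarver's own code; the data is synthetic.

```python
import random
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

min_freq = 0.05
n_quantiles = round(1 / min_freq)  # assumption: 20 bins of about 5 % each

# synthetic stand-in for a quantitative feature (not the housing data)
rng = random.Random(0)
values = [rng.gauss(0, 1) for _ in range(10_000)]

# 19 inner cut points delimit 20 intervals of the form (a, b]
edges = quantiles(values, n=n_quantiles, method="inclusive")

# bisect_left assigns x to the first interval whose upper bound is >= x
bins = Counter(bisect_left(edges, x) for x in values)
frequencies = [bins[i] / len(values) for i in range(n_quantiles)]
print(len(edges), min(frequencies), max(frequencies))
```

With a smaller min_freq, the same feature is cut into more, thinner intervals, which is why fitting gets slower and the result less robust.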

Desired number of modalities

The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit on the number of modalities per consecutive combination of modalities.

[11]:
max_n_mod = 5

Tip: max_n_mod should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
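The cost of a larger max_n_mod can be estimated by counting consecutive groupings. The sketch below assumes that a grouping is any split of the ordered raw modalities into 2 to max_n_mod consecutive groups; the helper names are illustrative and not part of AutoCarver. Under that assumption the counts match the progress bars of the fitting log: 5035 combinations for 20 raw modalities and 4047 for the 19 raw modalities of HouseAge.

```python
from itertools import combinations
from math import comb

def n_consecutive_groupings(n_modalities, max_n_mod):
    """Number of ways to cut ordered modalities into 2..max_n_mod consecutive groups."""
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, max_n_mod + 1))

def consecutive_groupings(modalities, max_n_mod):
    """Yield every grouping of ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for k in range(2, max_n_mod + 1):
        # choosing k-1 cut positions among the n-1 gaps defines k consecutive groups
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]

print(n_consecutive_groupings(20, 5))  # 5035
print(n_consecutive_groupings(19, 5))  # 4047
print(list(consecutive_groupings(["a", "b", "c"], 3)))
```

Raising max_n_mod from 5 to 7 with 20 raw modalities would add comb(19, 5) + comb(19, 6) groupings to test, which illustrates why higher values are slower.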

Association metric

The attribute sort_by allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by Carvers.

[12]:
# Optional for ContinuousCarver, the implemented metric is "kruskal"
sort_by = "kruskal"
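For intuition about what "kruskal" measures, the following is a hand-rolled sketch of the tie-corrected Kruskal-Wallis H statistic, where each group holds the target values of one grouped modality. It is a textbook formula written for illustration; how AutoCarver computes the metric internally is not shown in this notebook, and kruskal_h and the toy data are assumptions. In practice, scipy.stats.kruskal provides a tested implementation.

```python
from collections import Counter

def kruskal_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    values = [v for group in groups for v in group]
    n = len(values)
    counts = Counter(values)

    # tied values share the mean of the ranks they occupy (ranks start at 1)
    rank_of, start = {}, 0
    for value in sorted(counts):
        rank_of[value] = start + (counts[value] + 1) / 2
        start += counts[value]

    # H grows when the rank sums of the groups differ more than chance would allow
    rank_term = sum(sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)

    # correction factor for ties; it is 0 only when all values are identical
    tie_correction = 1 - sum(c**3 - c for c in counts.values()) / (n**3 - n)
    if tie_correction == 0:
        raise ValueError("all values are identical, H is undefined")
    return h / tie_correction

# toy target values split by two candidate groupings of the same feature
separated = [[1.0, 1.2, 1.1], [2.0, 2.3, 2.1], [3.5, 3.1, 3.9]]
mixed = [[1.0, 2.3, 3.5], [2.0, 1.2, 3.1], [1.1, 2.1, 3.9]]
print(kruskal_h(separated) > kruskal_h(mixed))  # True
```

A grouping whose modalities separate the target well gets a higher H, so it is ranked earlier among the candidate combinations.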

Grouping NaNs

The attribute dropna allows one to choose whether or not numpy.nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with numpy.nan.

[13]:
dropna = False  # anyway, there are no numpy.nan in this dataset

Optional attributes

Minimal frequency per carved modality

The attribute min_freq_mod allows one to choose the minimum frequency per output modality. It is used by Carvers in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to min_freq/2.

[14]:
min_freq_mod = None  # None keeps the default (min_freq/2); 0.05 would require at least 5 % of observations per output modality in train and dev sets

Type of output carved features

The attribute output_dtype allows one to choose the output type:

  • Use "float" for numeric output, i.e. the rank of each modality (default)

  • Use "str" for string output

[15]:
output_dtype = "float"  # "str"

Fitting AutoCarver

  • First, all quantitative features are discretized:

    1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.05)

    2. Using OrdinalDiscretizer to group any remaining under-represented values (less frequent than min_freq/2=0.025) with their closest modality

  • Second, all features are carved following this recipe:

    1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step

    2. Grouping modalities: all consecutive combinations of modalities are applied to train_set

    3. Computing associations: the association metric (sort_by="kruskal") is computed with the provided target train_set[target]

    4. Combinations are sorted in descending order by association value

    5. Testing robustness: the first combination that satisfies all of the following is selected:

      • Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq_mod)

      • Distinct target rates per consecutive modalities on train_set and dev_set

      • No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)

    6. (Optional) If requested via dropna=True, and if the feature contains any numpy.nan, all combinations of modalities with numpy.nan are applied to train_set and steps 3. and 4. are run

    7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
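The three checks of step 5 can be sketched as follows. This is a simplified illustration, not AutoCarver's implementation: is_viable is a hypothetical helper, and "distinct target rates" is read here as strict inequality between consecutive modalities, which is an assumption. The figures are copied from the fitting log of this notebook.

```python
def is_viable(train, dev, min_freq_mod):
    """train/dev: ordered lists of (target_rate, frequency), one pair per modality."""
    for pairs in (train, dev):
        # every modality must be frequent enough
        if any(freq < min_freq_mod for _, freq in pairs):
            return False
        # consecutive modalities must have distinct target rates
        rates = [rate for rate, _ in pairs]
        if any(a == b for a, b in zip(rates, rates[1:])):
            return False

    # modalities must be ranked the same way by target rate in both samples
    def order(pairs):
        return sorted(range(len(pairs)), key=lambda i: pairs[i][0])

    return order(train) == order(dev)

min_freq_mod = 0.05 / 2  # default: min_freq / 2

# carved distribution of Population (train, dev)
pop_train = [(1.9859, 0.0501), (2.1464, 0.2008), (2.2113, 0.0492), (2.0433, 0.5498), (2.0250, 0.1501)]
pop_dev = [(1.9012, 0.0530), (2.1679, 0.2087), (2.1765, 0.0490), (2.0607, 0.5390), (2.0084, 0.1502)]
print(is_viable(pop_train, pop_dev, min_freq_mod))  # True

# first three raw modalities of AveBedrms, taken in isolation
bed_train = [(2.0684, 0.0500), (2.0735, 0.0500), (2.2167, 0.0501)]
bed_dev = [(2.0416, 0.0539), (2.2043, 0.0527), (2.0997, 0.0482)]
print(is_viable(bed_train, bed_dev, min_freq_mod))  # False: ranking differs on dev
```

The second example fails because the second and third modalities swap places on the dev sample, which is the kind of inversion the robustness test is meant to catch.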

[16]:
from AutoCarver import ContinuousCarver

# instantiating AutoCarver
auto_carver = ContinuousCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
 - [OrdinalDiscretizer] Fit ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
------


------
[AutoCarver] Fit AveBedrms (1/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 9.400e-01 2.0684 0.0500
9.400e-01 < x <= 9.672e-01 2.0735 0.0500
9.672e-01 < x <= 9.832e-01 2.2167 0.0501
9.832e-01 < x <= 9.958e-01 2.1706 0.0499
9.958e-01 < x <= 1.007e+00 2.1310 0.0500
1.007e+00 < x <= 1.015e+00 2.2358 0.0500
1.015e+00 < x <= 1.025e+00 2.1668 0.0500
1.025e+00 < x <= 1.033e+00 2.2102 0.0500
1.033e+00 < x <= 1.041e+00 2.1295 0.0500
1.041e+00 < x <= 1.050e+00 2.1548 0.0500
1.050e+00 < x <= 1.058e+00 2.1238 0.0500
1.058e+00 < x <= 1.067e+00 2.1025 0.0500
1.067e+00 < x <= 1.077e+00 2.0704 0.0500
1.077e+00 < x <= 1.088e+00 2.0664 0.0501
1.088e+00 < x <= 1.100e+00 2.1118 0.0499
1.100e+00 < x <= 1.116e+00 1.9937 0.0500
1.116e+00 < x <= 1.138e+00 1.9405 0.0500
1.138e+00 < x <= 1.174e+00 1.7990 0.0500
1.174e+00 < x <= 1.273e+00 1.9162 0.0500
1.273e+00 < x 1.6515 0.0500
X_dev distribution
target_rate frequency
2.0416 0.0539
2.2043 0.0527
2.0997 0.0482
2.1835 0.0487
2.2628 0.0552
2.1619 0.0480
2.2295 0.0567
2.1690 0.0493
2.1581 0.0528
2.1202 0.0476
2.1039 0.0452
2.1595 0.0509
2.1037 0.0521
2.0662 0.0484
2.0487 0.0489
1.9543 0.0467
1.8871 0.0484
1.8680 0.0499
1.8371 0.0465
1.7182 0.0498
Grouping modalities   : 100%|██████████| 5035/5035 [00:03<00:00, 1304.49it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 286.82it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 1.058e+00 2.1528 0.5500
1.058e+00 < x <= 1.100e+00 2.0878 0.2000
1.100e+00 < x <= 1.138e+00 1.9671 0.0999
1.138e+00 < x <= 1.273e+00 1.8575 0.1000
1.273e+00 < x 1.6515 0.0500
X_dev distribution
target_rate frequency
2.1597 0.5583
2.0954 0.2004
1.9201 0.0951
1.8531 0.0964
1.7182 0.0498
------


------
[AutoCarver] Fit AveRooms (2/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 3.441e+00 1.9126 0.0500
3.441e+00 < x <= 3.794e+00 1.8286 0.0500
3.794e+00 < x <= 4.055e+00 1.8169 0.0500
4.055e+00 < x <= 4.279e+00 1.8418 0.0500
4.279e+00 < x <= 4.459e+00 1.7529 0.0500
4.459e+00 < x <= 4.621e+00 1.7915 0.0500
4.621e+00 < x <= 4.791e+00 1.8214 0.0500
4.791e+00 < x <= 4.939e+00 1.7685 0.0500
4.939e+00 < x <= 5.087e+00 1.7466 0.0500
5.087e+00 < x <= 5.232e+00 1.7717 0.0500
5.232e+00 < x <= 5.383e+00 1.8664 0.0500
5.383e+00 < x <= 5.531e+00 1.8472 0.0500
5.531e+00 < x <= 5.694e+00 1.9199 0.0500
5.694e+00 < x <= 5.860e+00 1.9910 0.0500
5.860e+00 < x <= 6.058e+00 2.0870 0.0500
6.058e+00 < x <= 6.273e+00 2.1908 0.0500
6.273e+00 < x <= 6.542e+00 2.4050 0.0500
6.542e+00 < x <= 6.949e+00 2.6874 0.0500
6.949e+00 < x <= 7.652e+00 3.1129 0.0500
7.652e+00 < x 3.1718 0.0500
X_dev distribution
target_rate frequency
1.8659 0.0518
1.8728 0.0505
1.7627 0.0524
1.8020 0.0543
1.7223 0.0552
1.6802 0.0452
1.7707 0.0530
1.8030 0.0443
1.8209 0.0523
1.8326 0.0437
1.7923 0.0550
1.9388 0.0514
1.9465 0.0501
2.0248 0.0468
2.1049 0.0483
2.2239 0.0490
2.4339 0.0467
2.7667 0.0468
3.1001 0.0548
3.2429 0.0483
Grouping modalities   : 100%|██████████| 5035/5035 [00:04<00:00, 1134.64it/s]
Computing associations: 100%|██████████| 5035/5035 [00:21<00:00, 229.90it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 5.531e+00 1.8138 0.6000
5.531e+00 < x <= 5.860e+00 1.9554 0.0999
5.860e+00 < x <= 6.273e+00 2.1389 0.1000
6.273e+00 < x <= 6.542e+00 2.4050 0.0500
6.542e+00 < x 2.9907 0.1501
X_dev distribution
target_rate frequency
1.8055 0.6092
1.9844 0.0969
2.1649 0.0973
2.4339 0.0467
3.0420 0.1499
------


------
[AutoCarver] Fit MedInc (3/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 1.602e+00 1.1102 0.0500
1.602e+00 < x <= 1.905e+00 1.1285 0.0500
1.905e+00 < x <= 2.151e+00 1.2198 0.0500
2.151e+00 < x <= 2.355e+00 1.3171 0.0500
2.355e+00 < x <= 2.568e+00 1.3817 0.0500
2.568e+00 < x <= 2.737e+00 1.5409 0.0500
2.737e+00 < x <= 2.975e+00 1.6159 0.0500
2.975e+00 < x <= 3.143e+00 1.6906 0.0499
3.143e+00 < x <= 3.323e+00 1.8232 0.0500
3.323e+00 < x <= 3.539e+00 1.9059 0.0500
3.539e+00 < x <= 3.729e+00 2.0076 0.0502
3.729e+00 < x <= 3.974e+00 2.0271 0.0498
3.974e+00 < x <= 4.179e+00 2.1456 0.0500
4.179e+00 < x <= 4.461e+00 2.2433 0.0500
4.461e+00 < x <= 4.757e+00 2.3621 0.0501
4.757e+00 < x <= 5.116e+00 2.3986 0.0499
5.116e+00 < x <= 5.545e+00 2.6438 0.0500
5.545e+00 < x <= 6.155e+00 2.9324 0.0500
6.155e+00 < x <= 7.316e+00 3.4592 0.0500
7.316e+00 < x 4.3784 0.0500
X_dev distribution
target_rate frequency
1.1017 0.0509
1.0410 0.0502
1.2407 0.0501
1.2919 0.0506
1.4676 0.0536
1.5605 0.0417
1.6280 0.0584
1.7519 0.0471
1.8443 0.0504
1.8500 0.0498
2.0040 0.0533
2.0890 0.0502
2.1641 0.0505
2.2700 0.0540
2.3768 0.0439
2.5087 0.0479
2.6814 0.0483
2.9805 0.0479
3.3748 0.0530
4.3748 0.0483
Grouping modalities   : 100%|██████████| 5035/5035 [00:04<00:00, 1256.22it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 285.45it/s]
Testing robustness    :   0%|          | 0/5035 [00:01<?, ?it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 2.568e+00 1.2314 0.2500
2.568e+00 < x <= 3.323e+00 1.6676 0.2000
3.323e+00 < x <= 4.461e+00 2.0659 0.2499
4.461e+00 < x <= 6.155e+00 2.5843 0.2000
6.155e+00 < x 3.9191 0.1000
X_dev distribution
target_rate frequency
1.2315 0.2554
1.6984 0.1976
2.0779 0.2578
2.6424 0.1879
3.8516 0.1013
------


------
[AutoCarver] Fit AveOccup (4/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 1.870e+00 2.7122 0.0500
1.870e+00 < x <= 2.067e+00 2.6633 0.0500
2.067e+00 < x <= 2.225e+00 2.3373 0.0500
2.225e+00 < x <= 2.338e+00 2.3080 0.0500
2.338e+00 < x <= 2.432e+00 2.1976 0.0500
2.432e+00 < x <= 2.513e+00 2.2064 0.0500
2.513e+00 < x <= 2.595e+00 2.1736 0.0500
2.595e+00 < x <= 2.668e+00 2.1862 0.0500
2.668e+00 < x <= 2.743e+00 2.1378 0.0500
2.743e+00 < x <= 2.820e+00 2.1902 0.0500
2.820e+00 < x <= 2.898e+00 2.1824 0.0500
2.898e+00 < x <= 2.984e+00 2.0741 0.0500
2.984e+00 < x <= 3.073e+00 2.0255 0.0501
3.073e+00 < x <= 3.171e+00 1.9914 0.0498
3.171e+00 < x <= 3.282e+00 1.8992 0.0500
3.282e+00 < x <= 3.425e+00 1.8926 0.0500
3.425e+00 < x <= 3.607e+00 1.7085 0.0500
3.607e+00 < x <= 3.877e+00 1.5666 0.0500
3.877e+00 < x <= 4.325e+00 1.4505 0.0500
4.325e+00 < x 1.4294 0.0500
X_dev distribution
target_rate frequency
2.7684 0.0484
2.5334 0.0435
2.3989 0.0542
2.3641 0.0533
2.2272 0.0546
2.2969 0.0489
2.3179 0.0508
2.0793 0.0467
2.1847 0.0521
2.1752 0.0504
2.0762 0.0533
2.0535 0.0501
2.0535 0.0528
1.9477 0.0458
1.8397 0.0449
1.8861 0.0514
1.7301 0.0448
1.6200 0.0499
1.4423 0.0527
1.4596 0.0515
Grouping modalities   : 100%|██████████| 5035/5035 [00:03<00:00, 1272.47it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 294.55it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 2.067e+00 2.6878 0.1000
2.067e+00 < x <= 2.898e+00 2.2133 0.4500
2.898e+00 < x <= 3.425e+00 1.9766 0.2500
3.425e+00 < x <= 3.877e+00 1.6375 0.1000
3.877e+00 < x 1.4400 0.1000
X_dev distribution
target_rate frequency
2.6573 0.0919
2.2376 0.4642
1.9594 0.2450
1.6721 0.0947
1.4509 0.1042
------


------
[AutoCarver] Fit Population (5/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 3.530e+02 1.9859 0.0501
3.530e+02 < x <= 5.140e+02 2.1616 0.0501
5.140e+02 < x <= 6.270e+02 2.1117 0.0501
6.270e+02 < x <= 7.150e+02 2.2819 0.0497
7.150e+02 < x <= 7.930e+02 2.0335 0.0509
7.930e+02 < x <= 8.640e+02 2.2113 0.0492
8.640e+02 < x <= 9.380e+02 2.0772 0.0498
9.380e+02 < x <= 1.015e+03 2.1386 0.0500
1.015e+03 < x <= 1.091e+03 2.0430 0.0503
1.091e+03 < x <= 1.170e+03 2.0506 0.0496
1.170e+03 < x <= 1.264e+03 2.0870 0.0505
1.264e+03 < x <= 1.354e+03 2.0195 0.0497
1.354e+03 < x <= 1.464e+03 2.0004 0.0502
1.464e+03 < x <= 1.583e+03 2.1102 0.0498
1.583e+03 < x <= 1.729e+03 2.0346 0.0500
1.729e+03 < x <= 1.908e+03 1.9139 0.0499
1.908e+03 < x <= 2.152e+03 2.0006 0.0500
2.152e+03 < x <= 2.563e+03 2.0707 0.0500
2.563e+03 < x <= 3.297e+03 1.9614 0.0500
3.297e+03 < x 2.0428 0.0500
X_dev distribution
target_rate frequency
1.9012 0.0530
2.1915 0.0520
2.1706 0.0523
2.1062 0.0514
2.2019 0.0531
2.1765 0.0490
2.2025 0.0506
2.1329 0.0553
2.1744 0.0437
2.1319 0.0480
1.9939 0.0534
2.0096 0.0465
1.9569 0.0465
1.9756 0.0504
2.0815 0.0496
2.0272 0.0461
1.9789 0.0487
1.9355 0.0496
2.0714 0.0518
2.0157 0.0487
Grouping modalities   : 100%|██████████| 5035/5035 [00:05<00:00, 981.35it/s]
Computing associations: 100%|██████████| 5035/5035 [00:16<00:00, 300.33it/s]
Testing robustness    :   1%|          | 41/5035 [00:00<01:53, 43.83it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 3.530e+02 1.9859 0.0501
3.530e+02 < x <= 7.930e+02 2.1464 0.2008
7.930e+02 < x <= 8.640e+02 2.2113 0.0492
8.640e+02 < x <= 2.152e+03 2.0433 0.5498
2.152e+03 < x 2.0250 0.1501
X_dev distribution
target_rate frequency
1.9012 0.0530
2.1679 0.2087
2.1765 0.0490
2.0607 0.5390
2.0084 0.1502
------


------
[AutoCarver] Fit HouseAge (6/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 8.000e+00 2.1158 0.0537
8.000e+00 < x <= 1.200e+01 1.8220 0.0477
1.200e+01 < x <= 1.500e+01 1.8590 0.0613
1.500e+01 < x <= 1.600e+01 2.0358 0.0393
1.600e+01 < x <= 1.800e+01 1.9013 0.0596
1.800e+01 < x <= 2.000e+01 1.9399 0.0468
2.000e+01 < x <= 2.200e+01 2.0134 0.0404
2.200e+01 < x <= 2.500e+01 2.1055 0.0705
2.500e+01 < x <= 2.600e+01 2.0977 0.0300
2.600e+01 < x <= 2.800e+01 2.0218 0.0475
2.800e+01 < x <= 3.100e+01 2.0439 0.0682
3.100e+01 < x <= 3.300e+01 2.0275 0.0575
3.300e+01 < x <= 3.400e+01 2.1189 0.0328
3.400e+01 < x <= 3.500e+01 2.0204 0.0395
3.500e+01 < x <= 3.700e+01 2.0750 0.0687
3.700e+01 < x <= 3.900e+01 2.0212 0.0361
3.900e+01 < x <= 4.200e+01 2.0013 0.0450
4.200e+01 < x <= 4.500e+01 2.1301 0.0485
4.500e+01 < x 2.4785 0.1072
X_dev distribution
target_rate frequency
2.0205 0.0526
1.7827 0.0443
1.8780 0.0556
1.9208 0.0335
1.9484 0.0652
1.9517 0.0470
2.1141 0.0421
2.1179 0.0759
2.0888 0.0299
2.2138 0.0443
1.9546 0.0664
2.0512 0.0565
2.1979 0.0346
2.1762 0.0408
2.0747 0.0659
1.9885 0.0388
2.0394 0.0508
2.0015 0.0489
2.4651 0.1069
Grouping modalities   : 100%|██████████| 4047/4047 [00:02<00:00, 1571.72it/s]
Computing associations: 100%|██████████| 4047/4047 [00:14<00:00, 287.71it/s]
Testing robustness    :   2%|▏         | 91/4047 [00:01<01:03, 62.56it/s]

 - [AutoCarver] Carved distribution
X distribution
  target_rate frequency
x <= 2.200e+01 1.9494 0.3486
2.200e+01 < x <= 2.600e+01 2.1032 0.1005
2.600e+01 < x <= 3.300e+01 2.0324 0.1732
3.300e+01 < x <= 4.500e+01 2.0628 0.2705
4.500e+01 < x 2.4785 0.1072
X_dev distribution
target_rate frequency
1.9447 0.3403
2.1097 0.1058
2.0560 0.1672
2.0736 0.2798
2.4651 0.1069
------

AutoCarver analysis

Carving Summary

[17]:
auto_carver.summary()
[17]:
label content
feature dtype
AveBedrms float 0 [x <= 1.058e+00]
float 1 [1.058e+00 < x <= 1.100e+00]
float 2 [1.100e+00 < x <= 1.138e+00]
float 3 [1.138e+00 < x <= 1.273e+00]
float 4 [1.273e+00 < x]
AveOccup float 0 [x <= 2.067e+00]
float 1 [2.067e+00 < x <= 2.898e+00]
float 2 [2.898e+00 < x <= 3.425e+00]
float 3 [3.425e+00 < x <= 3.877e+00]
float 4 [3.877e+00 < x]
AveRooms float 0 [x <= 5.531e+00]
float 1 [5.531e+00 < x <= 5.860e+00]
float 2 [5.860e+00 < x <= 6.273e+00]
float 3 [6.273e+00 < x <= 6.542e+00]
float 4 [6.542e+00 < x]
HouseAge float 0 [x <= 2.200e+01]
float 1 [2.200e+01 < x <= 2.600e+01]
float 2 [2.600e+01 < x <= 3.300e+01]
float 3 [3.300e+01 < x <= 4.500e+01]
float 4 [4.500e+01 < x]
MedInc float 0 [x <= 2.568e+00]
float 1 [2.568e+00 < x <= 3.323e+00]
float 2 [3.323e+00 < x <= 4.461e+00]
float 3 [4.461e+00 < x <= 6.155e+00]
float 4 [6.155e+00 < x]
Population float 0 [x <= 3.530e+02]
float 1 [3.530e+02 < x <= 7.930e+02]
float 2 [7.930e+02 < x <= 8.640e+02]
float 3 [8.640e+02 < x <= 2.152e+03]
float 4 [2.152e+03 < x]
  • As requested with output_dtype="float", output labels are the integer ranks of the modalities

  • For quantitative feature Population, the selected combination of modalities groups populations as follows:

    • modality 0: lower or equal to 353 people (content==["x <= 3.530e+02"])

    • modality 1: greater than 353 people and lower or equal to 793 people (content==["3.530e+02 < x <= 7.930e+02"])

    • modality 2: greater than 793 people and lower or equal to 864 people (content==["7.930e+02 < x <= 8.640e+02"])

    • modality 3: greater than 864 people and lower or equal to 2152 people (content==["8.640e+02 < x <= 2.152e+03"])

    • modality 4: greater than 2152 people (content==["2.152e+03 < x"])
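Reading the summary this way amounts to a lookup in a sorted list of upper bounds. The sketch below reproduces that lookup for Population with the standard library; population_modality is a hypothetical helper written for illustration (it ignores NaNs) and is not how auto_carver.transform is implemented.

```python
from bisect import bisect_left

# upper bounds of modalities 0 to 3 for Population, read from the summary
population_edges = [353, 793, 864, 2152]

def population_modality(value):
    """Rank of the (a, b] interval that contains value."""
    return bisect_left(population_edges, value)

for value in (322, 353, 354, 2401):
    print(value, "->", population_modality(value))
# 322 -> 0
# 353 -> 0
# 354 -> 1
# 2401 -> 4
```

bisect_left is used rather than bisect_right because the intervals are closed on the right: a district of exactly 353 people belongs to modality 0.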

Detailed overview of tested combinations

[18]:
auto_carver.history("AveRooms").head(50)
[18]:
combination kruskal viability viability_message grouping_nan
0 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1465.999414 None [Raw X distribution] False
1 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1417.935973 True [Combination robust between X and X_dev] False
2 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1417.241563 None [Not checked] False
3 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1416.624227 None [Not checked] False
4 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1416.389183 None [Not checked] False
5 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1415.929817 None [Not checked] False
6 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1414.416406 None [Not checked] False
7 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1413.546297 None [Not checked] False
8 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1413.104659 None [Not checked] False
9 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1412.980977 None [Not checked] False
10 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1412.129278 None [Not checked] False
11 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1411.946836 None [Not checked] False
12 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1411.669231 None [Not checked] False
13 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1411.423709 None [Not checked] False
14 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1410.635089 None [Not checked] False
15 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1410.209599 None [Not checked] False
16 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1410.111962 None [Not checked] False
17 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... 1409.864040 None [Not checked] False
18 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1409.475866 None [Not checked] False
19 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1408.897852 None [Not checked] False
20 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1408.770297 None [Not checked] False
21 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1408.716395 None [Not checked] False
22 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1408.663351 None [Not checked] False
23 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... 1408.552293 None [Not checked] False
24 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1407.686931 None [Not checked] False
25 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1407.539008 None [Not checked] False
26 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1407.458550 None [Not checked] False
27 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1407.443142 None [Not checked] False
28 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1407.404649 None [Not checked] False
29 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1407.351604 None [Not checked] False
30 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1407.302436 None [Not checked] False
31 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.784261 None [Not checked] False
32 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1406.318805 None [Not checked] False
33 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.305763 None [Not checked] False
34 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.227261 None [Not checked] False
35 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.131395 None [Not checked] False
36 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.089851 None [Not checked] False
37 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1406.017105 None [Not checked] False
38 [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... 1405.990689 None [Not checked] False
39 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.986129 None [Not checked] False
40 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.949694 None [Not checked] False
41 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.611353 None [Not checked] False
42 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... 1405.604135 None [Not checked] False
43 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.322169 None [Not checked] False
44 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.314975 None [Not checked] False
45 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1405.260370 None [Not checked] False
46 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... 1404.738243 None [Not checked] False
47 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1404.705359 None [Not checked] False
48 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... 1404.702540 None [Not checked] False
49 [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... 1404.472730 None [Not checked] False
[19]:
auto_carver.history("Population")["viability_message"][2]
[19]:
['X_dev: inversion of target rates per modality']
  • The most associated combination of feature Population (the first tested out, where viability_message!=["Raw X distribution"]) did not pass the viability tests. When looking in viability_message:

    • "X_dev: inversion of target rates per modality": target rates (mean values of MedHouseVal per grouped modality) are not ranked the same between train_set and dev_set

  • For feature Population, the 42nd combination is the first to pass the tests:

    • viability_message==["Combination robust between X and X_dev"]

    • Kruskal-Wallis’ H with MedHouseVal is 29.050321 for this combination

    • Following combinations (less associated with the target) were not tested: viability_message==["Not checked"]

  • For all combinations, grouping_nan==False means that it is not a combination in which NaNs are grouped with other modalities (as requested with dropna=False)

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[20]:
import json

# storing as json file
with open('continuous_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)

Loading

Carvers can safely be loaded from a .json file.

[21]:
import json

from AutoCarver import load_carver

# loading json file
with open('continuous_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))

Applying AutoCarver

[22]:
dev_set_processed = auto_carver.transform(dev_set)
[23]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[23]:
AveBedrms AveRooms MedInc AveOccup Population HouseAge
0.0 0.558280 0.609219 0.255432 0.091897 0.052995 0.340282
1.0 0.200382 0.096888 0.197592 0.464181 0.208749 0.105843
2.0 0.095126 0.097328 0.257780 0.245009 0.049031 0.167205
3.0 0.096447 0.046682 0.187904 0.094686 0.539049 0.279800
4.0 0.049765 0.149883 0.101292 0.104228 0.150176 0.106870

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using ContinuousCarver, hence all features are qualitative.

[24]:
features = qualitative_features + quantitative_features + ordinal_features

Number of features to select

The attribute n_best allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[25]:
n_best = 6  # here the number of features is low, RegressionSelector will only be used to compute useful statistics

Using Selectors

[27]:
from AutoCarver.selectors import RegressionSelector

# select the most target associated qualitative features
feature_selector = RegressionSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
------
[Selector] Selecting from qualitative features: ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
---

 - [Selector] Association between X and y
  dtype pct_nan pct_mode mode kruskal_measure
MedInc float64 0.0000 0.2500 0.0000 6207.6768
AveRooms float64 0.0000 0.6000 0.0000 1417.9360
AveOccup float64 0.0000 0.4500 1.0000 1026.3004
AveBedrms float64 0.0000 0.5500 0.0000 346.0749
HouseAge float64 0.0000 0.3486 0.0000 164.2102
Population float64 0.0000 0.5498 3.0000 29.0503

 - [Selector] Association between X and y, filtered for inter-feature assocation
  dtype pct_nan pct_mode mode kruskal_measure
MedInc float64 0.0000 0.2500 0.0000 6207.6768
AveRooms float64 0.0000 0.6000 0.0000 1417.9360
AveOccup float64 0.0000 0.4500 1.0000 1026.3004
AveBedrms float64 0.0000 0.5500 0.0000 346.0749
HouseAge float64 0.0000 0.3486 0.0000 164.2102
Population float64 0.0000 0.5498 3.0000 29.0503

 - [Selector] Selected qualitative features: ['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population']
------

  • Feature MedInc is the most associated with the target MedHouseVal:

    • Kruskal-Wallis’ H value is kruskal_measure=6207.67678

    • It has 0 % of NaNs (pct_nan=0.0)

    • Its mode, 0, represents 25 % of observed data (pct_mode=0.2500)

  • Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
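As a reading aid for the mode and pct_mode columns, here is a small sketch of how such statistics can be derived from a carved column. The helper mode_share and the toy labels are illustrative assumptions, not the selector's code.

```python
from collections import Counter

def mode_share(labels):
    """Return (mode, share of observations equal to the mode)."""
    mode, count = Counter(labels).most_common(1)[0]
    return mode, count / len(labels)

# toy carved column with 20 observations spread over 5 modalities
toy_labels = [0.0] * 6 + [1.0] * 4 + [2.0] * 5 + [3.0] * 3 + [4.0] * 2
print(mode_share(toy_labels))  # (0.0, 0.3)
```

A pct_mode close to 1 would indicate a feature dominated by a single modality, which is what a threshold on over-represented values is meant to filter out.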