Setting things up
Installation
[1]:
%pip install AutoCarver[jupyter]
California Housing Prices Data
In this example notebook, we will use the California Housing Prices dataset.
The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.
It comprises housing-related metrics for various districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more. This makes it a valuable resource for exploring the relationships between different features and predicting median house values (continuous regression).
[3]:
import pandas as pd
from sklearn import datasets
# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)
# features as a pandas DataFrame, with the target appended as a column
housing_data = housing["data"]
housing_data[housing["target_names"][0]] = housing["target"]
# Display the first few rows of the dataset
housing_data.head()
[3]:
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
Target type and Carver selection
[4]:
target = "MedHouseVal"
housing_data[target].describe()
[4]:
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
Name: MedHouseVal, dtype: float64
The target "MedHouseVal" is a continuous target of type float64 used in a regression task. Hence we will use AutoCarver.ContinuousCarver and AutoCarver.selectors.RegressionSelector in the following code blocks.
Data Sampling
[5]:
from sklearn.model_selection import train_test_split
# random train/dev split (not stratified: the target is continuous)
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)
[6]:
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[6]:
(2.0666362048018514, 2.072459655020552)
Picking up columns to Carve
[7]:
train_set.head()
[7]:
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 5088 | 0.9809 | 19.0 | 3.187726 | 1.129964 | 726.0 | 2.620939 | 33.98 | -118.28 | 1.214 |
| 17096 | 4.2232 | 33.0 | 6.189696 | 1.086651 | 1015.0 | 2.377049 | 37.46 | -122.23 | 3.637 |
| 5617 | 3.5488 | 42.0 | 4.821577 | 1.095436 | 1044.0 | 4.331950 | 33.79 | -118.26 | 2.056 |
| 20060 | 1.6469 | 24.0 | 4.274194 | 1.048387 | 1686.0 | 4.532258 | 35.87 | -119.26 | 0.476 |
| 895 | 3.9909 | 14.0 | 4.608303 | 1.089350 | 2738.0 | 2.471119 | 37.54 | -121.96 | 2.360 |
[8]:
# column data types
train_set.dtypes
[8]:
MedInc float64
HouseAge float64
AveRooms float64
AveBedrms float64
Population float64
AveOccup float64
Latitude float64
Longitude float64
MedHouseVal float64
dtype: object
All features are quantitative continuous features, with the exception of Latitude and Longitude, which are geographical features (not supported by AutoCarver as is). All other features will be added to the list of quantitative_features.
[9]:
# lists of features per data type
quantitative_features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]
qualitative_features = []
ordinal_features = []
# user-specified ordering for ordinal features
values_orders = {}
Using AutoCarver
AutoCarver settings
Representativeness of modalities
The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used by Discretizers:
For quantitative features, it defines the number of quantiles used to initially discretize the features.
For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.
[10]:
min_freq = 0.05
Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
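As a rough illustration of how min_freq translates into an initial number of quantiles, the sketch below applies pandas.qcut to synthetic data. It is not AutoCarver's ContinuousDiscretizer (which also keeps track of over-represented values); the synthetic series and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

min_freq = 0.05
n_quantiles = round(1 / min_freq)  # 20 quantiles for min_freq=0.05

# synthetic, skewed feature (assumption: stands in for a real column)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(size=10_000), name="x")

# quantile discretization: each bucket holds about min_freq of the rows
buckets = pd.qcut(values, q=n_quantiles)
frequencies = buckets.value_counts(normalize=True, sort=False)
print(frequencies.round(4))
```

With min_freq = 0.05 this yields 20 buckets of roughly 5 % each, which is the shape of the raw distributions printed in the fitting log of this notebook.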
Desired number of modalities
The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit of the number of modalities per consecutive combination of modalities.
[11]:
max_n_mod = 5
Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
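The cost of a larger max_n_mod can be estimated by counting combinations. The sketch below assumes that a combination is a split of n ordered modalities into k consecutive groups, with k between 2 and max_n_mod. The helper name is hypothetical. This count reproduces the totals shown in the progress bars of the fitting log (5035 for 20 initial modalities, 4047 for 19), but it is a reading of that log rather than a description of the library's internals.

```python
from math import comb


def n_consecutive_combinations(n_modalities: int, max_n_mod: int) -> int:
    """Count splits of ordered modalities into 2..max_n_mod consecutive groups."""
    # choosing k-1 cut points among the n-1 gaps yields k consecutive groups
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, max_n_mod + 1))


print(n_consecutive_combinations(20, 5))  # 5035
print(n_consecutive_combinations(19, 5))  # 4047
print(n_consecutive_combinations(20, 7))  # grows quickly with max_n_mod
```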
Association metric
The attribute sort_by allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by Carvers.
[12]:
# Optional for ContinuousCarver, the implemented metric is "kruskal"
sort_by = "kruskal"
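To give an intuition of what the "kruskal" metric measures, here is a minimal sketch that computes Kruskal-Wallis' H with scipy.stats.kruskal on a synthetic continuous target grouped by modality. The data and variable names are made up for illustration, and this is not how AutoCarver computes the metric internally.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# synthetic continuous target, split by the modality of a discretized feature
target_by_modality = [
    rng.normal(loc=1.2, scale=0.5, size=300),  # modality 0
    rng.normal(loc=2.0, scale=0.5, size=300),  # modality 1
    rng.normal(loc=3.9, scale=0.5, size=300),  # modality 2
]
# same target values, randomly reassigned: modalities carry no information
shuffled = np.array_split(rng.permutation(np.concatenate(target_by_modality)), 3)

h_informative, _ = kruskal(*target_by_modality)
h_shuffled, _ = kruskal(*shuffled)
print(f"informative grouping: H={h_informative:.1f}")
print(f"shuffled grouping:    H={h_shuffled:.1f}")
```

The more the target distribution differs between modalities, the higher H is, which is why combinations are sorted by decreasing H.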
Grouping NaNs
The attribute dropna allows one to choose whether or not numpy.nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with numpy.nan.
[13]:
dropna = False # anyway, there are no numpy.nan in this dataset
Optional attributes
Minimal frequency per carved modality
The attribute min_freq_mod allows one to choose the minimum frequency per output modality. It is used by Carvers in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to min_freq/2.
[14]:
min_freq_mod = None # defaults to min_freq/2 (here 0.025); e.g. 0.05 would require at least 5 % of observations per output modality in train and dev sets
Type of output carved features
The attribute output_dtype allows one to choose the output type:
Use "float" for integer output (default)
Use "str" for string output
[15]:
output_dtype = "float" # "str"
Fitting AutoCarver
First, all quantitative features are discretized:
Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.05)
Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2=0.025) to be grouped with its closest modality
Second, all features are carved following this recipe:
The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
Grouping modalities: all consecutive combinations of modalities are applied to train_set
Computing associations: the association metric (sort_by="kruskal") is computed with the provided target train_set[target]
Combinations are sorted in descending order by association value
Testing robustness: finds the first combination that checks the following:
    Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq_mod)
    Distinct target rates per consecutive modalities on train_set and dev_set
    No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
(Optional) If requested via dropna=True, and if any, all combinations of modalities with numpy.nan are applied to train_set and steps 3. and 4. are run
The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
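The three robustness checks can be sketched in plain Python. The function below is a simplified, hypothetical stand-in (its name and signature are not part of AutoCarver), applied to the target rates and frequencies of the carved AveRooms distribution copied from the fitting log below.

```python
def is_viable(train_rates, dev_rates, train_freqs, dev_freqs, min_freq_mod):
    """Simplified versions of the three checks of the robustness test."""
    # 1. every modality is frequent enough on both samples
    frequent = all(f >= min_freq_mod for f in train_freqs + dev_freqs)
    # 2. consecutive modalities have distinct target rates on both samples
    distinct = all(
        a != b
        for rates in (train_rates, dev_rates)
        for a, b in zip(rates, rates[1:])
    )

    # 3. modalities are ranked identically by target rate on both samples
    def ranking(rates):
        return sorted(range(len(rates)), key=rates.__getitem__)

    same_order = ranking(train_rates) == ranking(dev_rates)
    return frequent and distinct and same_order


# carved AveRooms distribution, copied from the log (train_set, then dev_set)
train_rates = [1.8138, 1.9554, 2.1389, 2.4050, 2.9907]
dev_rates = [1.8055, 1.9844, 2.1649, 2.4339, 3.0420]
train_freqs = [0.6000, 0.0999, 0.1000, 0.0500, 0.1501]
dev_freqs = [0.6092, 0.0969, 0.0973, 0.0467, 0.1499]

print(is_viable(train_rates, dev_rates, train_freqs, dev_freqs, min_freq_mod=0.025))  # True
# swapping two dev rates creates an inversion between the samples
inverted_dev = [1.9844, 1.8055, 2.1649, 2.4339, 3.0420]
print(is_viable(train_rates, inverted_dev, train_freqs, dev_freqs, min_freq_mod=0.025))  # False
```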
[16]:
from AutoCarver import ContinuousCarver
# initiating AutoCarver
auto_carver = ContinuousCarver(
quantitative_features=quantitative_features,
qualitative_features=qualitative_features,
ordinal_features=ordinal_features,
values_orders=values_orders,
min_freq=min_freq,
min_freq_mod=min_freq_mod,
max_n_mod=max_n_mod,
dropna=dropna,
sort_by=sort_by,
output_dtype=output_dtype,
verbose=True, # showing statistics
copy=True, # whether or not to return a copy of the input dataset
)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
[Discretizer] Fit Quantitative Features
---
- [ContinuousDiscretizer] Fit ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
- [OrdinalDiscretizer] Fit ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
------
------
[AutoCarver] Fit AveBedrms (1/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 9.400e-01 | 2.0684 | 0.0500 |
| 9.400e-01 < x <= 9.672e-01 | 2.0735 | 0.0500 |
| 9.672e-01 < x <= 9.832e-01 | 2.2167 | 0.0501 |
| 9.832e-01 < x <= 9.958e-01 | 2.1706 | 0.0499 |
| 9.958e-01 < x <= 1.007e+00 | 2.1310 | 0.0500 |
| 1.007e+00 < x <= 1.015e+00 | 2.2358 | 0.0500 |
| 1.015e+00 < x <= 1.025e+00 | 2.1668 | 0.0500 |
| 1.025e+00 < x <= 1.033e+00 | 2.2102 | 0.0500 |
| 1.033e+00 < x <= 1.041e+00 | 2.1295 | 0.0500 |
| 1.041e+00 < x <= 1.050e+00 | 2.1548 | 0.0500 |
| 1.050e+00 < x <= 1.058e+00 | 2.1238 | 0.0500 |
| 1.058e+00 < x <= 1.067e+00 | 2.1025 | 0.0500 |
| 1.067e+00 < x <= 1.077e+00 | 2.0704 | 0.0500 |
| 1.077e+00 < x <= 1.088e+00 | 2.0664 | 0.0501 |
| 1.088e+00 < x <= 1.100e+00 | 2.1118 | 0.0499 |
| 1.100e+00 < x <= 1.116e+00 | 1.9937 | 0.0500 |
| 1.116e+00 < x <= 1.138e+00 | 1.9405 | 0.0500 |
| 1.138e+00 < x <= 1.174e+00 | 1.7990 | 0.0500 |
| 1.174e+00 < x <= 1.273e+00 | 1.9162 | 0.0500 |
| 1.273e+00 < x | 1.6515 | 0.0500 |
| target_rate | frequency |
|---|---|
| 2.0416 | 0.0539 |
| 2.2043 | 0.0527 |
| 2.0997 | 0.0482 |
| 2.1835 | 0.0487 |
| 2.2628 | 0.0552 |
| 2.1619 | 0.0480 |
| 2.2295 | 0.0567 |
| 2.1690 | 0.0493 |
| 2.1581 | 0.0528 |
| 2.1202 | 0.0476 |
| 2.1039 | 0.0452 |
| 2.1595 | 0.0509 |
| 2.1037 | 0.0521 |
| 2.0662 | 0.0484 |
| 2.0487 | 0.0489 |
| 1.9543 | 0.0467 |
| 1.8871 | 0.0484 |
| 1.8680 | 0.0499 |
| 1.8371 | 0.0465 |
| 1.7182 | 0.0498 |
Grouping modalities : 100%|██████████| 5035/5035 [00:03<00:00, 1304.49it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 286.82it/s]
Testing robustness : 0%| | 0/5035 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.058e+00 | 2.1528 | 0.5500 |
| 1.058e+00 < x <= 1.100e+00 | 2.0878 | 0.2000 |
| 1.100e+00 < x <= 1.138e+00 | 1.9671 | 0.0999 |
| 1.138e+00 < x <= 1.273e+00 | 1.8575 | 0.1000 |
| 1.273e+00 < x | 1.6515 | 0.0500 |
| target_rate | frequency |
|---|---|
| 2.1597 | 0.5583 |
| 2.0954 | 0.2004 |
| 1.9201 | 0.0951 |
| 1.8531 | 0.0964 |
| 1.7182 | 0.0498 |
------
------
[AutoCarver] Fit AveRooms (2/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 3.441e+00 | 1.9126 | 0.0500 |
| 3.441e+00 < x <= 3.794e+00 | 1.8286 | 0.0500 |
| 3.794e+00 < x <= 4.055e+00 | 1.8169 | 0.0500 |
| 4.055e+00 < x <= 4.279e+00 | 1.8418 | 0.0500 |
| 4.279e+00 < x <= 4.459e+00 | 1.7529 | 0.0500 |
| 4.459e+00 < x <= 4.621e+00 | 1.7915 | 0.0500 |
| 4.621e+00 < x <= 4.791e+00 | 1.8214 | 0.0500 |
| 4.791e+00 < x <= 4.939e+00 | 1.7685 | 0.0500 |
| 4.939e+00 < x <= 5.087e+00 | 1.7466 | 0.0500 |
| 5.087e+00 < x <= 5.232e+00 | 1.7717 | 0.0500 |
| 5.232e+00 < x <= 5.383e+00 | 1.8664 | 0.0500 |
| 5.383e+00 < x <= 5.531e+00 | 1.8472 | 0.0500 |
| 5.531e+00 < x <= 5.694e+00 | 1.9199 | 0.0500 |
| 5.694e+00 < x <= 5.860e+00 | 1.9910 | 0.0500 |
| 5.860e+00 < x <= 6.058e+00 | 2.0870 | 0.0500 |
| 6.058e+00 < x <= 6.273e+00 | 2.1908 | 0.0500 |
| 6.273e+00 < x <= 6.542e+00 | 2.4050 | 0.0500 |
| 6.542e+00 < x <= 6.949e+00 | 2.6874 | 0.0500 |
| 6.949e+00 < x <= 7.652e+00 | 3.1129 | 0.0500 |
| 7.652e+00 < x | 3.1718 | 0.0500 |
| target_rate | frequency |
|---|---|
| 1.8659 | 0.0518 |
| 1.8728 | 0.0505 |
| 1.7627 | 0.0524 |
| 1.8020 | 0.0543 |
| 1.7223 | 0.0552 |
| 1.6802 | 0.0452 |
| 1.7707 | 0.0530 |
| 1.8030 | 0.0443 |
| 1.8209 | 0.0523 |
| 1.8326 | 0.0437 |
| 1.7923 | 0.0550 |
| 1.9388 | 0.0514 |
| 1.9465 | 0.0501 |
| 2.0248 | 0.0468 |
| 2.1049 | 0.0483 |
| 2.2239 | 0.0490 |
| 2.4339 | 0.0467 |
| 2.7667 | 0.0468 |
| 3.1001 | 0.0548 |
| 3.2429 | 0.0483 |
Grouping modalities : 100%|██████████| 5035/5035 [00:04<00:00, 1134.64it/s]
Computing associations: 100%|██████████| 5035/5035 [00:21<00:00, 229.90it/s]
Testing robustness : 0%| | 0/5035 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 5.531e+00 | 1.8138 | 0.6000 |
| 5.531e+00 < x <= 5.860e+00 | 1.9554 | 0.0999 |
| 5.860e+00 < x <= 6.273e+00 | 2.1389 | 0.1000 |
| 6.273e+00 < x <= 6.542e+00 | 2.4050 | 0.0500 |
| 6.542e+00 < x | 2.9907 | 0.1501 |
| target_rate | frequency |
|---|---|
| 1.8055 | 0.6092 |
| 1.9844 | 0.0969 |
| 2.1649 | 0.0973 |
| 2.4339 | 0.0467 |
| 3.0420 | 0.1499 |
------
------
[AutoCarver] Fit MedInc (3/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.602e+00 | 1.1102 | 0.0500 |
| 1.602e+00 < x <= 1.905e+00 | 1.1285 | 0.0500 |
| 1.905e+00 < x <= 2.151e+00 | 1.2198 | 0.0500 |
| 2.151e+00 < x <= 2.355e+00 | 1.3171 | 0.0500 |
| 2.355e+00 < x <= 2.568e+00 | 1.3817 | 0.0500 |
| 2.568e+00 < x <= 2.737e+00 | 1.5409 | 0.0500 |
| 2.737e+00 < x <= 2.975e+00 | 1.6159 | 0.0500 |
| 2.975e+00 < x <= 3.143e+00 | 1.6906 | 0.0499 |
| 3.143e+00 < x <= 3.323e+00 | 1.8232 | 0.0500 |
| 3.323e+00 < x <= 3.539e+00 | 1.9059 | 0.0500 |
| 3.539e+00 < x <= 3.729e+00 | 2.0076 | 0.0502 |
| 3.729e+00 < x <= 3.974e+00 | 2.0271 | 0.0498 |
| 3.974e+00 < x <= 4.179e+00 | 2.1456 | 0.0500 |
| 4.179e+00 < x <= 4.461e+00 | 2.2433 | 0.0500 |
| 4.461e+00 < x <= 4.757e+00 | 2.3621 | 0.0501 |
| 4.757e+00 < x <= 5.116e+00 | 2.3986 | 0.0499 |
| 5.116e+00 < x <= 5.545e+00 | 2.6438 | 0.0500 |
| 5.545e+00 < x <= 6.155e+00 | 2.9324 | 0.0500 |
| 6.155e+00 < x <= 7.316e+00 | 3.4592 | 0.0500 |
| 7.316e+00 < x | 4.3784 | 0.0500 |
| target_rate | frequency |
|---|---|
| 1.1017 | 0.0509 |
| 1.0410 | 0.0502 |
| 1.2407 | 0.0501 |
| 1.2919 | 0.0506 |
| 1.4676 | 0.0536 |
| 1.5605 | 0.0417 |
| 1.6280 | 0.0584 |
| 1.7519 | 0.0471 |
| 1.8443 | 0.0504 |
| 1.8500 | 0.0498 |
| 2.0040 | 0.0533 |
| 2.0890 | 0.0502 |
| 2.1641 | 0.0505 |
| 2.2700 | 0.0540 |
| 2.3768 | 0.0439 |
| 2.5087 | 0.0479 |
| 2.6814 | 0.0483 |
| 2.9805 | 0.0479 |
| 3.3748 | 0.0530 |
| 4.3748 | 0.0483 |
Grouping modalities : 100%|██████████| 5035/5035 [00:04<00:00, 1256.22it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 285.45it/s]
Testing robustness : 0%| | 0/5035 [00:01<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.568e+00 | 1.2314 | 0.2500 |
| 2.568e+00 < x <= 3.323e+00 | 1.6676 | 0.2000 |
| 3.323e+00 < x <= 4.461e+00 | 2.0659 | 0.2499 |
| 4.461e+00 < x <= 6.155e+00 | 2.5843 | 0.2000 |
| 6.155e+00 < x | 3.9191 | 0.1000 |
| target_rate | frequency |
|---|---|
| 1.2315 | 0.2554 |
| 1.6984 | 0.1976 |
| 2.0779 | 0.2578 |
| 2.6424 | 0.1879 |
| 3.8516 | 0.1013 |
------
------
[AutoCarver] Fit AveOccup (4/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.870e+00 | 2.7122 | 0.0500 |
| 1.870e+00 < x <= 2.067e+00 | 2.6633 | 0.0500 |
| 2.067e+00 < x <= 2.225e+00 | 2.3373 | 0.0500 |
| 2.225e+00 < x <= 2.338e+00 | 2.3080 | 0.0500 |
| 2.338e+00 < x <= 2.432e+00 | 2.1976 | 0.0500 |
| 2.432e+00 < x <= 2.513e+00 | 2.2064 | 0.0500 |
| 2.513e+00 < x <= 2.595e+00 | 2.1736 | 0.0500 |
| 2.595e+00 < x <= 2.668e+00 | 2.1862 | 0.0500 |
| 2.668e+00 < x <= 2.743e+00 | 2.1378 | 0.0500 |
| 2.743e+00 < x <= 2.820e+00 | 2.1902 | 0.0500 |
| 2.820e+00 < x <= 2.898e+00 | 2.1824 | 0.0500 |
| 2.898e+00 < x <= 2.984e+00 | 2.0741 | 0.0500 |
| 2.984e+00 < x <= 3.073e+00 | 2.0255 | 0.0501 |
| 3.073e+00 < x <= 3.171e+00 | 1.9914 | 0.0498 |
| 3.171e+00 < x <= 3.282e+00 | 1.8992 | 0.0500 |
| 3.282e+00 < x <= 3.425e+00 | 1.8926 | 0.0500 |
| 3.425e+00 < x <= 3.607e+00 | 1.7085 | 0.0500 |
| 3.607e+00 < x <= 3.877e+00 | 1.5666 | 0.0500 |
| 3.877e+00 < x <= 4.325e+00 | 1.4505 | 0.0500 |
| 4.325e+00 < x | 1.4294 | 0.0500 |
| target_rate | frequency |
|---|---|
| 2.7684 | 0.0484 |
| 2.5334 | 0.0435 |
| 2.3989 | 0.0542 |
| 2.3641 | 0.0533 |
| 2.2272 | 0.0546 |
| 2.2969 | 0.0489 |
| 2.3179 | 0.0508 |
| 2.0793 | 0.0467 |
| 2.1847 | 0.0521 |
| 2.1752 | 0.0504 |
| 2.0762 | 0.0533 |
| 2.0535 | 0.0501 |
| 2.0535 | 0.0528 |
| 1.9477 | 0.0458 |
| 1.8397 | 0.0449 |
| 1.8861 | 0.0514 |
| 1.7301 | 0.0448 |
| 1.6200 | 0.0499 |
| 1.4423 | 0.0527 |
| 1.4596 | 0.0515 |
Grouping modalities : 100%|██████████| 5035/5035 [00:03<00:00, 1272.47it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 294.55it/s]
Testing robustness : 0%| | 0/5035 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.067e+00 | 2.6878 | 0.1000 |
| 2.067e+00 < x <= 2.898e+00 | 2.2133 | 0.4500 |
| 2.898e+00 < x <= 3.425e+00 | 1.9766 | 0.2500 |
| 3.425e+00 < x <= 3.877e+00 | 1.6375 | 0.1000 |
| 3.877e+00 < x | 1.4400 | 0.1000 |
| target_rate | frequency |
|---|---|
| 2.6573 | 0.0919 |
| 2.2376 | 0.4642 |
| 1.9594 | 0.2450 |
| 1.6721 | 0.0947 |
| 1.4509 | 0.1042 |
------
------
[AutoCarver] Fit Population (5/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 3.530e+02 | 1.9859 | 0.0501 |
| 3.530e+02 < x <= 5.140e+02 | 2.1616 | 0.0501 |
| 5.140e+02 < x <= 6.270e+02 | 2.1117 | 0.0501 |
| 6.270e+02 < x <= 7.150e+02 | 2.2819 | 0.0497 |
| 7.150e+02 < x <= 7.930e+02 | 2.0335 | 0.0509 |
| 7.930e+02 < x <= 8.640e+02 | 2.2113 | 0.0492 |
| 8.640e+02 < x <= 9.380e+02 | 2.0772 | 0.0498 |
| 9.380e+02 < x <= 1.015e+03 | 2.1386 | 0.0500 |
| 1.015e+03 < x <= 1.091e+03 | 2.0430 | 0.0503 |
| 1.091e+03 < x <= 1.170e+03 | 2.0506 | 0.0496 |
| 1.170e+03 < x <= 1.264e+03 | 2.0870 | 0.0505 |
| 1.264e+03 < x <= 1.354e+03 | 2.0195 | 0.0497 |
| 1.354e+03 < x <= 1.464e+03 | 2.0004 | 0.0502 |
| 1.464e+03 < x <= 1.583e+03 | 2.1102 | 0.0498 |
| 1.583e+03 < x <= 1.729e+03 | 2.0346 | 0.0500 |
| 1.729e+03 < x <= 1.908e+03 | 1.9139 | 0.0499 |
| 1.908e+03 < x <= 2.152e+03 | 2.0006 | 0.0500 |
| 2.152e+03 < x <= 2.563e+03 | 2.0707 | 0.0500 |
| 2.563e+03 < x <= 3.297e+03 | 1.9614 | 0.0500 |
| 3.297e+03 < x | 2.0428 | 0.0500 |
| target_rate | frequency |
|---|---|
| 1.9012 | 0.0530 |
| 2.1915 | 0.0520 |
| 2.1706 | 0.0523 |
| 2.1062 | 0.0514 |
| 2.2019 | 0.0531 |
| 2.1765 | 0.0490 |
| 2.2025 | 0.0506 |
| 2.1329 | 0.0553 |
| 2.1744 | 0.0437 |
| 2.1319 | 0.0480 |
| 1.9939 | 0.0534 |
| 2.0096 | 0.0465 |
| 1.9569 | 0.0465 |
| 1.9756 | 0.0504 |
| 2.0815 | 0.0496 |
| 2.0272 | 0.0461 |
| 1.9789 | 0.0487 |
| 1.9355 | 0.0496 |
| 2.0714 | 0.0518 |
| 2.0157 | 0.0487 |
Grouping modalities : 100%|██████████| 5035/5035 [00:05<00:00, 981.35it/s]
Computing associations: 100%|██████████| 5035/5035 [00:16<00:00, 300.33it/s]
Testing robustness : 1%| | 41/5035 [00:00<01:53, 43.83it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 3.530e+02 | 1.9859 | 0.0501 |
| 3.530e+02 < x <= 7.930e+02 | 2.1464 | 0.2008 |
| 7.930e+02 < x <= 8.640e+02 | 2.2113 | 0.0492 |
| 8.640e+02 < x <= 2.152e+03 | 2.0433 | 0.5498 |
| 2.152e+03 < x | 2.0250 | 0.1501 |
| target_rate | frequency |
|---|---|
| 1.9012 | 0.0530 |
| 2.1679 | 0.2087 |
| 2.1765 | 0.0490 |
| 2.0607 | 0.5390 |
| 2.0084 | 0.1502 |
------
------
[AutoCarver] Fit HouseAge (6/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 8.000e+00 | 2.1158 | 0.0537 |
| 8.000e+00 < x <= 1.200e+01 | 1.8220 | 0.0477 |
| 1.200e+01 < x <= 1.500e+01 | 1.8590 | 0.0613 |
| 1.500e+01 < x <= 1.600e+01 | 2.0358 | 0.0393 |
| 1.600e+01 < x <= 1.800e+01 | 1.9013 | 0.0596 |
| 1.800e+01 < x <= 2.000e+01 | 1.9399 | 0.0468 |
| 2.000e+01 < x <= 2.200e+01 | 2.0134 | 0.0404 |
| 2.200e+01 < x <= 2.500e+01 | 2.1055 | 0.0705 |
| 2.500e+01 < x <= 2.600e+01 | 2.0977 | 0.0300 |
| 2.600e+01 < x <= 2.800e+01 | 2.0218 | 0.0475 |
| 2.800e+01 < x <= 3.100e+01 | 2.0439 | 0.0682 |
| 3.100e+01 < x <= 3.300e+01 | 2.0275 | 0.0575 |
| 3.300e+01 < x <= 3.400e+01 | 2.1189 | 0.0328 |
| 3.400e+01 < x <= 3.500e+01 | 2.0204 | 0.0395 |
| 3.500e+01 < x <= 3.700e+01 | 2.0750 | 0.0687 |
| 3.700e+01 < x <= 3.900e+01 | 2.0212 | 0.0361 |
| 3.900e+01 < x <= 4.200e+01 | 2.0013 | 0.0450 |
| 4.200e+01 < x <= 4.500e+01 | 2.1301 | 0.0485 |
| 4.500e+01 < x | 2.4785 | 0.1072 |
| target_rate | frequency |
|---|---|
| 2.0205 | 0.0526 |
| 1.7827 | 0.0443 |
| 1.8780 | 0.0556 |
| 1.9208 | 0.0335 |
| 1.9484 | 0.0652 |
| 1.9517 | 0.0470 |
| 2.1141 | 0.0421 |
| 2.1179 | 0.0759 |
| 2.0888 | 0.0299 |
| 2.2138 | 0.0443 |
| 1.9546 | 0.0664 |
| 2.0512 | 0.0565 |
| 2.1979 | 0.0346 |
| 2.1762 | 0.0408 |
| 2.0747 | 0.0659 |
| 1.9885 | 0.0388 |
| 2.0394 | 0.0508 |
| 2.0015 | 0.0489 |
| 2.4651 | 0.1069 |
Grouping modalities : 100%|██████████| 4047/4047 [00:02<00:00, 1571.72it/s]
Computing associations: 100%|██████████| 4047/4047 [00:14<00:00, 287.71it/s]
Testing robustness : 2%|▏ | 91/4047 [00:01<01:03, 62.56it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.200e+01 | 1.9494 | 0.3486 |
| 2.200e+01 < x <= 2.600e+01 | 2.1032 | 0.1005 |
| 2.600e+01 < x <= 3.300e+01 | 2.0324 | 0.1732 |
| 3.300e+01 < x <= 4.500e+01 | 2.0628 | 0.2705 |
| 4.500e+01 < x | 2.4785 | 0.1072 |
| target_rate | frequency |
|---|---|
| 1.9447 | 0.3403 |
| 2.1097 | 0.1058 |
| 2.0560 | 0.1672 |
| 2.0736 | 0.2798 |
| 2.4651 | 0.1069 |
------
AutoCarver analysis
Carving Summary
[17]:
auto_carver.summary()
[17]:
| feature | dtype | label | content |
|---|---|---|---|
| AveBedrms | float | 0 | [x <= 1.058e+00] |
| | float | 1 | [1.058e+00 < x <= 1.100e+00] |
| | float | 2 | [1.100e+00 < x <= 1.138e+00] |
| | float | 3 | [1.138e+00 < x <= 1.273e+00] |
| | float | 4 | [1.273e+00 < x] |
| AveOccup | float | 0 | [x <= 2.067e+00] |
| | float | 1 | [2.067e+00 < x <= 2.898e+00] |
| | float | 2 | [2.898e+00 < x <= 3.425e+00] |
| | float | 3 | [3.425e+00 < x <= 3.877e+00] |
| | float | 4 | [3.877e+00 < x] |
| AveRooms | float | 0 | [x <= 5.531e+00] |
| | float | 1 | [5.531e+00 < x <= 5.860e+00] |
| | float | 2 | [5.860e+00 < x <= 6.273e+00] |
| | float | 3 | [6.273e+00 < x <= 6.542e+00] |
| | float | 4 | [6.542e+00 < x] |
| HouseAge | float | 0 | [x <= 2.200e+01] |
| | float | 1 | [2.200e+01 < x <= 2.600e+01] |
| | float | 2 | [2.600e+01 < x <= 3.300e+01] |
| | float | 3 | [3.300e+01 < x <= 4.500e+01] |
| | float | 4 | [4.500e+01 < x] |
| MedInc | float | 0 | [x <= 2.568e+00] |
| | float | 1 | [2.568e+00 < x <= 3.323e+00] |
| | float | 2 | [3.323e+00 < x <= 4.461e+00] |
| | float | 3 | [4.461e+00 < x <= 6.155e+00] |
| | float | 4 | [6.155e+00 < x] |
| Population | float | 0 | [x <= 3.530e+02] |
| | float | 1 | [3.530e+02 < x <= 7.930e+02] |
| | float | 2 | [7.930e+02 < x <= 8.640e+02] |
| | float | 3 | [8.640e+02 < x <= 2.152e+03] |
| | float | 4 | [2.152e+03 < x] |
As requested with output_dtype="float", output labels are the integer ranks of modalities
For quantitative feature Population, the selected combination of modalities groups populations as follows:
modality 0: lower than or equal to 353 people (content==["x <= 3.530e+02"])
modality 1: greater than 353 people and lower than or equal to 793 people (content==["3.530e+02 < x <= 7.930e+02"])
modality 2: greater than 793 people and lower than or equal to 864 people (content==["7.930e+02 < x <= 8.640e+02"])
modality 3: greater than 864 people and lower than or equal to 2152 people (content==["8.640e+02 < x <= 2.152e+03"])
modality 4: higher than 2152 people (content==["2.152e+03 < x"])
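The mapping from a raw Population value to its carved label can be reproduced by hand with the bin edges listed in the summary. The sketch below uses the standard library's bisect module. The helper name is hypothetical, and this is an illustration of the right-closed intervals, not a replacement for auto_carver.transform.

```python
from bisect import bisect_left

# upper bounds of the first four Population modalities, from the summary above
population_edges = [353, 793, 864, 2152]


def population_label(value: float) -> float:
    """Rank of the carved modality; intervals are closed on the right."""
    # bisect_left counts the edges strictly below value, so 353 -> 0 and 354 -> 1
    return float(bisect_left(population_edges, value))


for value in (322, 353, 496, 864, 2401):
    print(value, "->", population_label(value))
```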
Detailed overview of tested combinations
[18]:
auto_carver.history("AveRooms").head(50)
[18]:
| | combination | kruskal | viability | viability_message | grouping_nan |
|---|---|---|---|---|---|
| 0 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1465.999414 | None | [Raw X distribution] | False |
| 1 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1417.935973 | True | [Combination robust between X and X_dev] | False |
| 2 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1417.241563 | None | [Not checked] | False |
| 3 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1416.624227 | None | [Not checked] | False |
| 4 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1416.389183 | None | [Not checked] | False |
| 5 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1415.929817 | None | [Not checked] | False |
| 6 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1414.416406 | None | [Not checked] | False |
| 7 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1413.546297 | None | [Not checked] | False |
| 8 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1413.104659 | None | [Not checked] | False |
| 9 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1412.980977 | None | [Not checked] | False |
| 10 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1412.129278 | None | [Not checked] | False |
| 11 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1411.946836 | None | [Not checked] | False |
| 12 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1411.669231 | None | [Not checked] | False |
| 13 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1411.423709 | None | [Not checked] | False |
| 14 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1410.635089 | None | [Not checked] | False |
| 15 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1410.209599 | None | [Not checked] | False |
| 16 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1410.111962 | None | [Not checked] | False |
| 17 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... | 1409.864040 | None | [Not checked] | False |
| 18 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1409.475866 | None | [Not checked] | False |
| 19 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1408.897852 | None | [Not checked] | False |
| 20 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1408.770297 | None | [Not checked] | False |
| 21 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1408.716395 | None | [Not checked] | False |
| 22 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1408.663351 | None | [Not checked] | False |
| 23 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... | 1408.552293 | None | [Not checked] | False |
| 24 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1407.686931 | None | [Not checked] | False |
| 25 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1407.539008 | None | [Not checked] | False |
| 26 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1407.458550 | None | [Not checked] | False |
| 27 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1407.443142 | None | [Not checked] | False |
| 28 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1407.404649 | None | [Not checked] | False |
| 29 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1407.351604 | None | [Not checked] | False |
| 30 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1407.302436 | None | [Not checked] | False |
| 31 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.784261 | None | [Not checked] | False |
| 32 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1406.318805 | None | [Not checked] | False |
| 33 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.305763 | None | [Not checked] | False |
| 34 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.227261 | None | [Not checked] | False |
| 35 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.131395 | None | [Not checked] | False |
| 36 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.089851 | None | [Not checked] | False |
| 37 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1406.017105 | None | [Not checked] | False |
| 38 | [[x <= 3.441e+00], [3.441e+00 < x <= 3.794e+00... | 1405.990689 | None | [Not checked] | False |
| 39 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.986129 | None | [Not checked] | False |
| 40 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.949694 | None | [Not checked] | False |
| 41 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.611353 | None | [Not checked] | False |
| 42 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... | 1405.604135 | None | [Not checked] | False |
| 43 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.322169 | None | [Not checked] | False |
| 44 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.314975 | None | [Not checked] | False |
| 45 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1405.260370 | None | [Not checked] | False |
| 46 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... | 1404.738243 | None | [Not checked] | False |
| 47 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1404.705359 | None | [Not checked] | False |
| 48 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00],... | 1404.702540 | None | [Not checked] | False |
| 49 | [[x <= 3.441e+00, 3.441e+00 < x <= 3.794e+00, ... | 1404.472730 | None | [Not checked] | False |
[19]:
auto_carver.history("Population")["viability_message"][2]
[19]:
['X_dev: inversion of target rates per modality']
The most associated combination of feature Population (the first tested out, where viability_message!=["Raw X distribution"]) did not pass the viability tests. When looking in viability_message:
"X_dev: inversion of target rates per modality": target rates (mean values of MedHouseVal per grouped modality) are not ranked the same between train_set and dev_set
For feature Population, the 42nd combination is the first to pass the tests: viability_message==["Combination robust between X and X_dev"]
Kruskal-Wallis’ H with MedHouseVal is 29.050321 for this combination
Following combinations (less associated with the target) were not tested: viability_message==["Not checked"]
For all combinations, grouping_nan==False means that it is not a combination in which NaNs are being grouped with other modalities (as requested with dropna=False)
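The same reading of the history can be done programmatically. The sketch below builds a toy DataFrame with the same column names as the history table (the rows and values are made up for illustration) and retrieves the first viable combination.

```python
import pandas as pd

# toy stand-in for auto_carver.history(...): same columns, made-up rows
history = pd.DataFrame(
    {
        "kruskal": [31.2, 30.9, 30.1, 29.05],
        "viability": [None, False, False, True],
        "viability_message": [
            ["Raw X distribution"],
            ["X_dev: inversion of target rates per modality"],
            ["X_dev: inversion of target rates per modality"],
            ["Combination robust between X and X_dev"],
        ],
    }
)

# rows with a viability value are the combinations that were actually tested
tested = history[history["viability"].notna()]
# combinations are sorted by association, so the first viable row is selected
selected = tested[tested["viability"].eq(True)].iloc[0]
print(f"{len(tested)} combinations tested")
print(f"selected H={selected['kruskal']}: {selected['viability_message']}")
```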
Saving and Loading AutoCarver
Saving
All Carvers can safely be stored as a .json file.
[20]:
import json
# storing as json file
with open('continuous_carver.json', 'w') as my_carver_json:
json.dump(auto_carver.to_json(), my_carver_json)
Loading
Carvers can safely be loaded from a .json file.
[21]:
import json
from AutoCarver import load_carver
# loading json file
with open('continuous_carver.json', 'r') as my_carver_json:
auto_carver = load_carver(json.load(my_carver_json))
Applying AutoCarver
[22]:
dev_set_processed = auto_carver.transform(dev_set)
[23]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[23]:
| | AveBedrms | AveRooms | MedInc | AveOccup | Population | HouseAge |
|---|---|---|---|---|---|---|
| 0.0 | 0.558280 | 0.609219 | 0.255432 | 0.091897 | 0.052995 | 0.340282 |
| 1.0 | 0.200382 | 0.096888 | 0.197592 | 0.464181 | 0.208749 | 0.105843 |
| 2.0 | 0.095126 | 0.097328 | 0.257780 | 0.245009 | 0.049031 | 0.167205 |
| 3.0 | 0.096447 | 0.046682 | 0.187904 | 0.094686 | 0.539049 | 0.279800 |
| 4.0 | 0.049765 | 0.149883 | 0.101292 | 0.104228 | 0.150176 | 0.106870 |
Feature Selection
Selectors settings
Features to select from
Here all features have been carved using ContinuousCarver, hence all features are qualitative.
[24]:
features = qualitative_features + quantitative_features + ordinal_features
Number of features to select
The attribute n_best allows one to choose the number of features to be selected per data type (quantitative and qualitative).
[25]:
n_best = 6 # here the number of features is low, RegressionSelector will only be used to compute useful statistics
Using Selectors
[27]:
from AutoCarver.selectors import RegressionSelector
# select the most target associated qualitative features
feature_selector = RegressionSelector(
qualitative_features=features,
n_best=n_best,
verbose=True, # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
------
[Selector] Selecting from qualitative features: ['AveBedrms', 'AveRooms', 'MedInc', 'AveOccup', 'Population', 'HouseAge']
---
- [Selector] Association between X and y
| | dtype | pct_nan | pct_mode | mode | kruskal_measure |
|---|---|---|---|---|---|
| MedInc | float64 | 0.0000 | 0.2500 | 0.0000 | 6207.6768 |
| AveRooms | float64 | 0.0000 | 0.6000 | 0.0000 | 1417.9360 |
| AveOccup | float64 | 0.0000 | 0.4500 | 1.0000 | 1026.3004 |
| AveBedrms | float64 | 0.0000 | 0.5500 | 0.0000 | 346.0749 |
| HouseAge | float64 | 0.0000 | 0.3486 | 0.0000 | 164.2102 |
| Population | float64 | 0.0000 | 0.5498 | 3.0000 | 29.0503 |
- [Selector] Association between X and y, filtered for inter-feature assocation
| | dtype | pct_nan | pct_mode | mode | kruskal_measure |
|---|---|---|---|---|---|
| MedInc | float64 | 0.0000 | 0.2500 | 0.0000 | 6207.6768 |
| AveRooms | float64 | 0.0000 | 0.6000 | 0.0000 | 1417.9360 |
| AveOccup | float64 | 0.0000 | 0.4500 | 1.0000 | 1026.3004 |
| AveBedrms | float64 | 0.0000 | 0.5500 | 0.0000 | 346.0749 |
| HouseAge | float64 | 0.0000 | 0.3486 | 0.0000 | 164.2102 |
| Population | float64 | 0.0000 | 0.5498 | 3.0000 | 29.0503 |
- [Selector] Selected qualitative features: ['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population']
------
Feature MedInc is the most associated with the target MedHouseVal:
Kruskal-Wallis’ H value is kruskal_measure=6207.67678
It has 0 % of NaNs (pct_nan=0.0)
Its mode, 0, represents 25 % of observed data (pct_mode=0.2500)
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
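When no filtering threshold is set, the selection shown above amounts to ranking features by their association with the target and keeping the n_best first. A minimal sketch of that ranking step, using the kruskal_measure values copied from the selector output (this is an illustration, not RegressionSelector's implementation):

```python
# Kruskal-Wallis' H per carved feature, copied from the selector output above
kruskal_measure = {
    "AveBedrms": 346.0749,
    "AveRooms": 1417.9360,
    "MedInc": 6207.6768,
    "AveOccup": 1026.3004,
    "Population": 29.0503,
    "HouseAge": 164.2102,
}
n_best = 6

# rank features by decreasing association with the target, keep the n_best first
ranked = sorted(kruskal_measure, key=kruskal_measure.get, reverse=True)
best_features = ranked[:n_best]
print(best_features)
```

The resulting order matches the list of selected qualitative features printed by the selector.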