Setting things up

About this notebook

This notebook shows how to preprocess the California Housing Prices dataset with the ContinuousCarver pipeline. ContinuousCarver is a Python tool that discretizes features, whether quantitative or qualitative, so as to maximize their association with a continuous target. Our goal is to prepare the dataset for a continuous regression task: predicting median house values.

The California Housing Prices dataset describes housing districts through features such as median income, house age, average number of rooms and bedrooms, population, occupancy, and location. ContinuousCarver can discretize both quantitative and qualitative features, so that each one is represented by a small number of groups suited to a regression model.

Throughout this notebook, we’ll walk through ContinuousCarver’s discretization pipeline step by step: choosing its settings, fitting it, analyzing the carved features, saving and loading the Carver, and adjusting the resulting groups.

The result is a set of discretized features whose relationship with the target is consistent between the training and development samples, ready to be used in continuous regression models.

Installation

[1]:

# %pip install AutoCarver[jupyter]

[2]:

import sys
import os

print(os.listdir('../../../../../../AutoCarver'))

sys.path.append('../../../../../../AutoCarver')
sys.path.append('../../../../../../AutoCarver/AutoCarver')
sys.path.append('../../../../../../AutoCarver/AutoCarver/discretizers')
sys.path.append('../../../../../../AutoCarver/AutoCarver/discretizers/utils')
import AutoCarver

['.coverage', '.git', '.github', '.gitignore', '.ipynb_checkpoints', '.pytest_cache', '.readthedocs.yaml', 'AutoCarver', 'AutoCarver.egg-info', 'dist', 'docs', 'LICENSE', 'pyproject.toml', 'README.md', 'requirements.txt', 'setup.cfg', 'setup.py', 'tests', 'test_package.ipynb']

California Housing Prices Data

In this example notebook, we will use the California Housing Prices dataset.

The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.

Comprising housing-related metrics for various districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more, the California Housing Prices dataset is a valuable resource for exploring the relationships between different features and predicting the median house values (continuous regression).

[3]:

from sklearn import datasets

# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)

# features are already a pandas DataFrame (as_frame=True); append the target as a column
housing_data = housing["data"]
housing_data[housing["target_names"][0]] = housing["target"]

# Display the first few rows of the dataset
housing_data.head()

[3]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Target type and Carver selection

[4]:

target = "MedHouseVal"

housing_data[target].describe()

[4]:

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target "MedHouseVal" is continuous (dtype float64), which makes this a regression task. Hence we will use AutoCarver.ContinuousCarver and AutoCarver.selectors.RegressionSelector in the following code blocks.

Data Sampling

[5]:

from sklearn.model_selection import train_test_split

# random (non-stratified) split: the target is continuous and no stratify argument is passed
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)

[6]:

# checking the mean target value per dataset
train_set[target].mean(), dev_set[target].mean()

[6]:

(2.0666362048018514, 2.072459655020552)

Picking up columns to Carve

[7]:

train_set.head()

[7]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
5088	0.9809	19.0	3.187726	1.129964	726.0	2.620939	33.98	-118.28	1.214
17096	4.2232	33.0	6.189696	1.086651	1015.0	2.377049	37.46	-122.23	3.637
5617	3.5488	42.0	4.821577	1.095436	1044.0	4.331950	33.79	-118.26	2.056
20060	1.6469	24.0	4.274194	1.048387	1686.0	4.532258	35.87	-119.26	0.476
895	3.9909	14.0	4.608303	1.089350	2738.0	2.471119	37.54	-121.96	2.360

[8]:

# column data types
train_set.dtypes

[8]:

MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object

All features are quantitative continuous features, with the exception of Latitude and Longitude, which are geographical features (not supported by AutoCarver as is). All other features will be added to the list of quantitative_features.

[9]:

# lists of features per data type
quantitative_features = ["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"]
qualitative_features = []
ordinal_features = []

# user-specified ordering for ordinal features
values_orders = {}

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used by Discretizers:

For quantitative features, it defines the number of quantiles used to initially discretize the features.
For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.

[10]:

min_freq = 0.05

Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
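To make the link between min_freq and the initial quantiles concrete, here is a minimal sketch using only the standard library. It assumes the number of quantiles is round(1 / min_freq), which agrees with the 20 raw modalities printed in the fitting output of this notebook for min_freq=0.05. The helper initial_quantile_thresholds is hypothetical and is not AutoCarver's implementation.

```python
import statistics


def initial_quantile_thresholds(values, min_freq):
    """Return (number of bins, cut points) for a quantile discretization.

    Hypothetical helper: assumes the number of bins is round(1 / min_freq).
    """
    n_quantiles = round(1 / min_freq)
    # statistics.quantiles returns the n - 1 cut points between n equal-sized bins
    cuts = statistics.quantiles(values, n=n_quantiles, method="inclusive")
    return n_quantiles, cuts


# toy feature: the integers 1..100
n_quantiles, cuts = initial_quantile_thresholds(range(1, 101), min_freq=0.05)
print(n_quantiles, len(cuts))  # 20 bins, 19 cut points
```

A smaller min_freq yields more, narrower bins, which is why the tip above describes low values as slower and less robust.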

Desired number of modalities

The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit on the number of modalities per consecutive combination of modalities.

[11]:

max_n_mod = 5

Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
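The cost of a larger max_n_mod can be estimated by counting the consecutive combinations it allows. The sketch below assumes that a combination splits the ordered raw modalities into 2 to max_n_mod consecutive groups; under that assumption the counts match the totals shown by the progress bars in the fitting output of this notebook (5035 for 20 raw modalities, 4047 for the 19 raw modalities of HouseAge). Function names are illustrative, not part of AutoCarver.

```python
from itertools import combinations
from math import comb


def count_consecutive_groupings(n_modalities, max_n_mod):
    """Count ways to merge ordered modalities into 2..max_n_mod consecutive groups."""
    # choosing k groups means choosing k - 1 cut points among the n - 1 gaps
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, max_n_mod + 1))


def consecutive_groupings(modalities, n_groups):
    """Yield every split of `modalities` into `n_groups` consecutive groups."""
    n = len(modalities)
    for cuts in combinations(range(1, n), n_groups - 1):
        bounds = (0, *cuts, n)
        yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


print(count_consecutive_groupings(20, max_n_mod=5))  # 5035
print(count_consecutive_groupings(19, max_n_mod=5))  # 4047
print(list(consecutive_groupings(["a", "b", "c"], 2)))
```

With 20 raw modalities, raising max_n_mod from 5 to 7 would grow the count from 5035 to 43795 under the same assumption, which illustrates why higher values are slower.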

Association metric

The attribute sort_by allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by Carvers.

[12]:

# Optional for ContinuousCarver, the implemented metric is "kruskal"
sort_by = "kruskal"
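For intuition about the "kruskal" metric, here is a minimal pure-Python sketch of the Kruskal-Wallis H statistic: observations are pooled and ranked, and H grows when the groups (here, the modalities of a candidate combination) have clearly different rank sums. This is the textbook formula without tie correction; whether AutoCarver computes it exactly this way is an assumption. In practice, scipy.stats.kruskal provides a tie-corrected implementation.

```python
def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of samples (no tie correction)."""
    # pool all observations, remembering which group each one came from
    pooled = sorted((value, index) for index, group in enumerate(groups) for value in group)
    n_total = len(pooled)

    # sum of ranks per group, using the average rank over runs of tied values
    rank_sums = [0.0] * len(groups)
    position = 0
    while position < n_total:
        end = position
        while end + 1 < n_total and pooled[end + 1][0] == pooled[position][0]:
            end += 1
        average_rank = (position + end) / 2 + 1  # ranks are 1-based
        for _, group_index in pooled[position : end + 1]:
            rank_sums[group_index] += average_rank
        position = end + 1

    weighted = sum(rank_sum**2 / len(group) for rank_sum, group in zip(rank_sums, groups))
    return 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# target values of three toy modalities, perfectly separated: H is about 7.2
separated = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# the same values shuffled across modalities: H is much lower
mixed = kruskal_wallis_h([[1, 5, 9], [2, 6, 7], [3, 4, 8]])
print(separated, mixed)
```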

Grouping NaNs

The attribute dropna allows one to choose whether or not numpy.nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with numpy.nan.

[13]:

dropna = False  # anyway, there are no numpy.nan in this dataset

Optional attributes

Minimal frequency per carved modality

The attribute min_freq_mod allows one to choose the minimum frequency per output modality. It is used by Carvers in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to min_freq/2.

[14]:

min_freq_mod = None  # None uses the default (min_freq/2); e.g. 0.05 would require at least 5 % of observations per output modality in train and dev sets

Type of output carved features

The attribute output_dtype allows one to choose the output type:

Use "float" for numeric output, i.e. the rank of each modality (default)
Use "str" for string output

[15]:

output_dtype = "float"  # "str"

Fitting AutoCarver

First, all quantitative features are discretized:
1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.05)
2. Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2=0.025) to be grouped with its closest modality
Second, all features are carved following this recipe:
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (sort_by="kruskal") is computed with the provided target train_set[target]
4. Combinations are sorted in descending order by association value
5. Testing robustness: finds the first combination that checks the following:
  - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq_mod)
  - Distinct target rates per consecutive modalities on train_set and dev_set
  - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with numpy.nan are applied to train_set and steps 3. and 4. are run
7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
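The three robustness tests of step 5 can be sketched as follows. This is a simplified illustration, not AutoCarver's code: the function names are hypothetical, "distinct target rates" is reduced to a strict inequality between neighbouring modalities, and the numbers are copied from the carved Population distribution printed in the fitting output of this notebook.

```python
def rank_order(target_rates):
    """Indices of modalities sorted by increasing target rate."""
    return sorted(range(len(target_rates)), key=target_rates.__getitem__)


def is_viable(train, dev, min_freq, min_freq_mod=None):
    """Apply the three robustness tests to (target_rates, frequencies) pairs."""
    if min_freq_mod is None:
        min_freq_mod = min_freq / 2  # default stated in this notebook
    for rates, frequencies in (train, dev):
        # 1. every modality must be frequent enough
        if any(frequency < min_freq_mod for frequency in frequencies):
            return False
        # 2. consecutive modalities must have distinct target rates
        if any(left == right for left, right in zip(rates, rates[1:])):
            return False
    # 3. modalities must be ranked identically by target rate on both samples
    return rank_order(train[0]) == rank_order(dev[0])


# (target rates, frequencies) of the carved Population feature
train = ([1.9859, 2.1464, 2.2113, 2.0433, 2.0250], [0.0501, 0.2008, 0.0492, 0.5498, 0.1501])
dev = ([1.9012, 2.1679, 2.1765, 2.0607, 2.0084], [0.0530, 0.2087, 0.0490, 0.5390, 0.1502])
# same dev sample with two target rates swapped, to simulate an inversion
inverted_dev = ([1.9012, 2.1765, 2.1679, 2.0607, 2.0084], dev[1])

print(is_viable(train, dev, min_freq=0.05))           # True
print(is_viable(train, inverted_dev, min_freq=0.05))  # False
```

Note that the same combination would be rejected with min_freq_mod=0.05, since one of its modalities covers only 4.92 % of train_set.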

[16]:

from AutoCarver import ContinuousCarver

# initializing AutoCarver
auto_carver = ContinuousCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['MedInc', 'AveBedrms', 'Population', 'AveRooms', 'HouseAge', 'AveOccup']
 - [OrdinalDiscretizer] Fit ['MedInc', 'AveBedrms', 'Population', 'AveRooms', 'HouseAge', 'AveOccup']
------


------
[AutoCarver] Fit MedInc (1/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 1.602e+00	1.1102	0.0500
1.602e+00 < x <= 1.905e+00	1.1285	0.0500
1.905e+00 < x <= 2.151e+00	1.2198	0.0500
2.151e+00 < x <= 2.355e+00	1.3171	0.0500
2.355e+00 < x <= 2.568e+00	1.3817	0.0500
2.568e+00 < x <= 2.737e+00	1.5409	0.0500
2.737e+00 < x <= 2.975e+00	1.6159	0.0500
2.975e+00 < x <= 3.143e+00	1.6906	0.0499
3.143e+00 < x <= 3.323e+00	1.8232	0.0500
3.323e+00 < x <= 3.539e+00	1.9059	0.0500
3.539e+00 < x <= 3.729e+00	2.0076	0.0502
3.729e+00 < x <= 3.974e+00	2.0271	0.0498
3.974e+00 < x <= 4.179e+00	2.1456	0.0500
4.179e+00 < x <= 4.461e+00	2.2433	0.0500
4.461e+00 < x <= 4.757e+00	2.3621	0.0501
4.757e+00 < x <= 5.116e+00	2.3986	0.0499
5.116e+00 < x <= 5.545e+00	2.6438	0.0500
5.545e+00 < x <= 6.155e+00	2.9324	0.0500
6.155e+00 < x <= 7.316e+00	3.4592	0.0500
7.316e+00 < x	4.3784	0.0500

X_dev distribution
target_rate	frequency
1.1017	0.0509
1.0410	0.0502
1.2407	0.0501
1.2919	0.0506
1.4676	0.0536
1.5605	0.0417
1.6280	0.0584
1.7519	0.0471
1.8443	0.0504
1.8500	0.0498
2.0040	0.0533
2.0890	0.0502
2.1641	0.0505
2.2700	0.0540
2.3768	0.0439
2.5087	0.0479
2.6814	0.0483
2.9805	0.0479
3.3748	0.0530
4.3748	0.0483

Grouping modalities   : 100%|██████████| 5035/5035 [00:03<00:00, 1271.51it/s]
Computing associations: 100%|██████████| 5035/5035 [00:17<00:00, 288.37it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 2.568e+00	1.2314	0.2500
2.568e+00 < x <= 3.323e+00	1.6676	0.2000
3.323e+00 < x <= 4.461e+00	2.0659	0.2499
4.461e+00 < x <= 6.155e+00	2.5843	0.2000
6.155e+00 < x	3.9191	0.1000

X_dev distribution
target_rate	frequency
1.2315	0.2554
1.6984	0.1976
2.0779	0.2578
2.6424	0.1879
3.8516	0.1013

------


------
[AutoCarver] Fit AveBedrms (2/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 9.400e-01	2.0684	0.0500
9.400e-01 < x <= 9.672e-01	2.0735	0.0500
9.672e-01 < x <= 9.832e-01	2.2167	0.0501
9.832e-01 < x <= 9.958e-01	2.1706	0.0499
9.958e-01 < x <= 1.007e+00	2.1310	0.0500
1.007e+00 < x <= 1.015e+00	2.2358	0.0500
1.015e+00 < x <= 1.025e+00	2.1668	0.0500
1.025e+00 < x <= 1.033e+00	2.2102	0.0500
1.033e+00 < x <= 1.041e+00	2.1295	0.0500
1.041e+00 < x <= 1.050e+00	2.1548	0.0500
1.050e+00 < x <= 1.058e+00	2.1238	0.0500
1.058e+00 < x <= 1.067e+00	2.1025	0.0500
1.067e+00 < x <= 1.077e+00	2.0704	0.0500
1.077e+00 < x <= 1.088e+00	2.0664	0.0501
1.088e+00 < x <= 1.100e+00	2.1118	0.0499
1.100e+00 < x <= 1.116e+00	1.9937	0.0500
1.116e+00 < x <= 1.138e+00	1.9405	0.0500
1.138e+00 < x <= 1.174e+00	1.7990	0.0500
1.174e+00 < x <= 1.273e+00	1.9162	0.0500
1.273e+00 < x	1.6515	0.0500

X_dev distribution
target_rate	frequency
2.0416	0.0539
2.2043	0.0527
2.0997	0.0482
2.1835	0.0487
2.2628	0.0552
2.1619	0.0480
2.2295	0.0567
2.1690	0.0493
2.1581	0.0528
2.1202	0.0476
2.1039	0.0452
2.1595	0.0509
2.1037	0.0521
2.0662	0.0484
2.0487	0.0489
1.9543	0.0467
1.8871	0.0484
1.8680	0.0499
1.8371	0.0465
1.7182	0.0498

Grouping modalities   : 100%|██████████| 5035/5035 [00:04<00:00, 1236.63it/s]
Computing associations: 100%|██████████| 5035/5035 [00:18<00:00, 274.27it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 1.058e+00	2.1528	0.5500
1.058e+00 < x <= 1.100e+00	2.0878	0.2000
1.100e+00 < x <= 1.138e+00	1.9671	0.0999
1.138e+00 < x <= 1.273e+00	1.8575	0.1000
1.273e+00 < x	1.6515	0.0500

X_dev distribution
target_rate	frequency
2.1597	0.5583
2.0954	0.2004
1.9201	0.0951
1.8531	0.0964
1.7182	0.0498

------


------
[AutoCarver] Fit Population (3/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 3.530e+02	1.9859	0.0501
3.530e+02 < x <= 5.140e+02	2.1616	0.0501
5.140e+02 < x <= 6.270e+02	2.1117	0.0501
6.270e+02 < x <= 7.150e+02	2.2819	0.0497
7.150e+02 < x <= 7.930e+02	2.0335	0.0509
7.930e+02 < x <= 8.640e+02	2.2113	0.0492
8.640e+02 < x <= 9.380e+02	2.0772	0.0498
9.380e+02 < x <= 1.015e+03	2.1386	0.0500
1.015e+03 < x <= 1.091e+03	2.0430	0.0503
1.091e+03 < x <= 1.170e+03	2.0506	0.0496
1.170e+03 < x <= 1.264e+03	2.0870	0.0505
1.264e+03 < x <= 1.354e+03	2.0195	0.0497
1.354e+03 < x <= 1.464e+03	2.0004	0.0502
1.464e+03 < x <= 1.583e+03	2.1102	0.0498
1.583e+03 < x <= 1.729e+03	2.0346	0.0500
1.729e+03 < x <= 1.908e+03	1.9139	0.0499
1.908e+03 < x <= 2.152e+03	2.0006	0.0500
2.152e+03 < x <= 2.563e+03	2.0707	0.0500
2.563e+03 < x <= 3.297e+03	1.9614	0.0500
3.297e+03 < x	2.0428	0.0500

X_dev distribution
target_rate	frequency
1.9012	0.0530
2.1915	0.0520
2.1706	0.0523
2.1062	0.0514
2.2019	0.0531
2.1765	0.0490
2.2025	0.0506
2.1329	0.0553
2.1744	0.0437
2.1319	0.0480
1.9939	0.0534
2.0096	0.0465
1.9569	0.0465
1.9756	0.0504
2.0815	0.0496
2.0272	0.0461
1.9789	0.0487
1.9355	0.0496
2.0714	0.0518
2.0157	0.0487

Grouping modalities   : 100%|██████████| 5035/5035 [00:03<00:00, 1387.23it/s]
Computing associations: 100%|██████████| 5035/5035 [00:16<00:00, 298.47it/s]
Testing robustness    :   1%|          | 41/5035 [00:01<02:41, 30.84it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 3.530e+02	1.9859	0.0501
3.530e+02 < x <= 7.930e+02	2.1464	0.2008
7.930e+02 < x <= 8.640e+02	2.2113	0.0492
8.640e+02 < x <= 2.152e+03	2.0433	0.5498
2.152e+03 < x	2.0250	0.1501

X_dev distribution
target_rate	frequency
1.9012	0.0530
2.1679	0.2087
2.1765	0.0490
2.0607	0.5390
2.0084	0.1502

------


------
[AutoCarver] Fit AveRooms (4/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 3.441e+00	1.9126	0.0500
3.441e+00 < x <= 3.794e+00	1.8286	0.0500
3.794e+00 < x <= 4.055e+00	1.8169	0.0500
4.055e+00 < x <= 4.279e+00	1.8418	0.0500
4.279e+00 < x <= 4.459e+00	1.7529	0.0500
4.459e+00 < x <= 4.621e+00	1.7915	0.0500
4.621e+00 < x <= 4.791e+00	1.8214	0.0500
4.791e+00 < x <= 4.939e+00	1.7685	0.0500
4.939e+00 < x <= 5.087e+00	1.7466	0.0500
5.087e+00 < x <= 5.232e+00	1.7717	0.0500
5.232e+00 < x <= 5.383e+00	1.8664	0.0500
5.383e+00 < x <= 5.531e+00	1.8472	0.0500
5.531e+00 < x <= 5.694e+00	1.9199	0.0500
5.694e+00 < x <= 5.860e+00	1.9910	0.0500
5.860e+00 < x <= 6.058e+00	2.0870	0.0500
6.058e+00 < x <= 6.273e+00	2.1908	0.0500
6.273e+00 < x <= 6.542e+00	2.4050	0.0500
6.542e+00 < x <= 6.949e+00	2.6874	0.0500
6.949e+00 < x <= 7.652e+00	3.1129	0.0500
7.652e+00 < x	3.1718	0.0500

X_dev distribution
target_rate	frequency
1.8659	0.0518
1.8728	0.0505
1.7627	0.0524
1.8020	0.0543
1.7223	0.0552
1.6802	0.0452
1.7707	0.0530
1.8030	0.0443
1.8209	0.0523
1.8326	0.0437
1.7923	0.0550
1.9388	0.0514
1.9465	0.0501
2.0248	0.0468
2.1049	0.0483
2.2239	0.0490
2.4339	0.0467
2.7667	0.0468
3.1001	0.0548
3.2429	0.0483

Grouping modalities   : 100%|██████████| 5035/5035 [00:03<00:00, 1325.64it/s]
Computing associations: 100%|██████████| 5035/5035 [00:18<00:00, 271.23it/s]
Testing robustness    :   0%|          | 0/5035 [00:00<?, ?it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 5.531e+00	1.8138	0.6000
5.531e+00 < x <= 5.860e+00	1.9554	0.0999
5.860e+00 < x <= 6.273e+00	2.1389	0.1000
6.273e+00 < x <= 6.542e+00	2.4050	0.0500
6.542e+00 < x	2.9907	0.1501

X_dev distribution
target_rate	frequency
1.8055	0.6092
1.9844	0.0969
2.1649	0.0973
2.4339	0.0467
3.0420	0.1499

------


------
[AutoCarver] Fit HouseAge (5/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 8.000e+00	2.1158	0.0537
8.000e+00 < x <= 1.200e+01	1.8220	0.0477
1.200e+01 < x <= 1.500e+01	1.8590	0.0613
1.500e+01 < x <= 1.600e+01	2.0358	0.0393
1.600e+01 < x <= 1.800e+01	1.9013	0.0596
1.800e+01 < x <= 2.000e+01	1.9399	0.0468
2.000e+01 < x <= 2.200e+01	2.0134	0.0404
2.200e+01 < x <= 2.500e+01	2.1055	0.0705
2.500e+01 < x <= 2.600e+01	2.0977	0.0300
2.600e+01 < x <= 2.800e+01	2.0218	0.0475
2.800e+01 < x <= 3.100e+01	2.0439	0.0682
3.100e+01 < x <= 3.300e+01	2.0275	0.0575
3.300e+01 < x <= 3.400e+01	2.1189	0.0328
3.400e+01 < x <= 3.500e+01	2.0204	0.0395
3.500e+01 < x <= 3.700e+01	2.0750	0.0687
3.700e+01 < x <= 3.900e+01	2.0212	0.0361
3.900e+01 < x <= 4.200e+01	2.0013	0.0450
4.200e+01 < x <= 4.500e+01	2.1301	0.0485
4.500e+01 < x	2.4785	0.1072

X_dev distribution
target_rate	frequency
2.0205	0.0526
1.7827	0.0443
1.8780	0.0556
1.9208	0.0335
1.9484	0.0652
1.9517	0.0470
2.1141	0.0421
2.1179	0.0759
2.0888	0.0299
2.2138	0.0443
1.9546	0.0664
2.0512	0.0565
2.1979	0.0346
2.1762	0.0408
2.0747	0.0659
1.9885	0.0388
2.0394	0.0508
2.0015	0.0489
2.4651	0.1069

Grouping modalities   : 100%|██████████| 4047/4047 [00:02<00:00, 1400.93it/s]
Computing associations: 100%|██████████| 4047/4047 [00:14<00:00, 273.20it/s]
Testing robustness    :   2%|▏         | 91/4047 [00:01<01:00, 64.85it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 2.200e+01	1.9494	0.3486
2.200e+01 < x <= 2.600e+01	2.1032	0.1005
2.600e+01 < x <= 3.300e+01	2.0324	0.1732
3.300e+01 < x <= 4.500e+01	2.0628	0.2705
4.500e+01 < x	2.4785	0.1072

X_dev distribution
target_rate	frequency
1.9447	0.3403
2.1097	0.1058
2.0560	0.1672
2.0736	0.2798
2.4651	0.1069

------


------
[AutoCarver] Fit AveOccup (6/6)
---

 - [AutoCarver] Raw distribution

X distribution
	target_rate	frequency
x <= 1.870e+00	2.7122	0.0500
1.870e+00 < x <= 2.067e+00	2.6633	0.0500
2.067e+00 < x <= 2.225e+00	2.3373	0.0500
2.225e+00 < x <= 2.338e+00	2.3080	0.0500
2.338e+00 < x <= 2.432e+00	2.1976	0.0500
2.432e+00 < x <= 2.513e+00	2.2064	0.0500
2.513e+00 < x <= 2.595e+00	2.1736	0.0500
2.595e+00 < x <= 2.668e+00	2.1862	0.0500
2.668e+00 < x <= 2.743e+00	2.1378	0.0500
2.743e+00 < x <= 2.820e+00	2.1902	0.0500
2.820e+00 < x <= 2.898e+00	2.1824	0.0500
2.898e+00 < x <= 2.984e+00	2.0741	0.0500
2.984e+00 < x <= 3.073e+00	2.0255	0.0501
3.073e+00 < x <= 3.171e+00	1.9914	0.0498
3.171e+00 < x <= 3.282e+00	1.8992	0.0500
3.282e+00 < x <= 3.425e+00	1.8926	0.0500
3.425e+00 < x <= 3.607e+00	1.7085	0.0500
3.607e+00 < x <= 3.877e+00	1.5666	0.0500
3.877e+00 < x <= 4.325e+00	1.4505	0.0500
4.325e+00 < x	1.4294	0.0500

X_dev distribution
target_rate	frequency
2.7684	0.0484
2.5334	0.0435
2.3989	0.0542
2.3641	0.0533
2.2272	0.0546
2.2969	0.0489
2.3179	0.0508
2.0793	0.0467
2.1847	0.0521
2.1752	0.0504
2.0762	0.0533
2.0535	0.0501
2.0535	0.0528
1.9477	0.0458
1.8397	0.0449
1.8861	0.0514
1.7301	0.0448
1.6200	0.0499
1.4423	0.0527
1.4596	0.0515

Grouping modalities   : 100%|██████████| 5035/5035 [00:04<00:00, 1088.53it/s]
Computing associations: 100%|██████████| 5035/5035 [00:18<00:00, 269.98it/s]
Testing robustness    :   0%|          | 0/5035 [00:01<?, ?it/s]


 - [AutoCarver] Carved distribution

X distribution
	target_rate	frequency
x <= 2.067e+00	2.6878	0.1000
2.067e+00 < x <= 2.898e+00	2.2133	0.4500
2.898e+00 < x <= 3.425e+00	1.9766	0.2500
3.425e+00 < x <= 3.877e+00	1.6375	0.1000
3.877e+00 < x	1.4400	0.1000

X_dev distribution
target_rate	frequency
2.6573	0.0919
2.2376	0.4642
1.9594	0.2450
1.6721	0.0947
1.4509	0.1042

------

AutoCarver analysis

Carving Summary

[17]:

auto_carver.summary()

[17]:

		label	content
feature	dtype
AveBedrms	float	0	[x <= 1.058e+00]
	float	1	[1.058e+00 < x <= 1.100e+00]
	float	2	[1.100e+00 < x <= 1.138e+00]
	float	3	[1.138e+00 < x <= 1.273e+00]
	float	4	[1.273e+00 < x]
AveOccup	float	0	[x <= 2.067e+00]
	float	1	[2.067e+00 < x <= 2.898e+00]
	float	2	[2.898e+00 < x <= 3.425e+00]
	float	3	[3.425e+00 < x <= 3.877e+00]
	float	4	[3.877e+00 < x]
AveRooms	float	0	[x <= 5.531e+00]
	float	1	[5.531e+00 < x <= 5.860e+00]
	float	2	[5.860e+00 < x <= 6.273e+00]
	float	3	[6.273e+00 < x <= 6.542e+00]
	float	4	[6.542e+00 < x]
HouseAge	float	0	[x <= 2.200e+01]
	float	1	[2.200e+01 < x <= 2.600e+01]
	float	2	[2.600e+01 < x <= 3.300e+01]
	float	3	[3.300e+01 < x <= 4.500e+01]
	float	4	[4.500e+01 < x]
MedInc	float	0	[x <= 2.568e+00]
	float	1	[2.568e+00 < x <= 3.323e+00]
	float	2	[3.323e+00 < x <= 4.461e+00]
	float	3	[4.461e+00 < x <= 6.155e+00]
	float	4	[6.155e+00 < x]
Population	float	0	[x <= 3.530e+02]
	float	1	[3.530e+02 < x <= 7.930e+02]
	float	2	[7.930e+02 < x <= 8.640e+02]
	float	3	[8.640e+02 < x <= 2.152e+03]
	float	4	[2.152e+03 < x]

As requested with output_dtype="float", output labels are the ranks of the modalities, stored as numbers.
For quantitative feature Population, the selected combination of modalities groups populations as follows:
- modality 0: lower or equal to 353 people (content==["x <= 3.530e+02"])
- modality 1: greater than 353 people and lower or equal to 793 people (content==["3.530e+02 < x <= 7.930e+02"])
- modality 2: greater than 793 people and lower or equal to 864 people (content==["7.930e+02 < x <= 8.640e+02"])
- modality 3: greater than 864 people and lower or equal to 2152 people (content==["8.640e+02 < x <= 2.152e+03"])
- modality 4: greater than 2152 people (content==["2.152e+03 < x"])
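The mapping from a raw value to its label can be reproduced with a binary search over the upper thresholds of the groups. The sketch below is an illustration with the standard library, not the way AutoCarver applies its discretization; modality_label is a hypothetical helper and the thresholds are those of the carved Population feature.

```python
from bisect import bisect_left
from math import inf

# upper thresholds of the carved Population groups, as listed in the summary
population_thresholds = [353.0, 793.0, 864.0, 2152.0, inf]


def modality_label(value, upper_thresholds):
    """Rank of the first group whose upper threshold is >= value."""
    return float(bisect_left(upper_thresholds, value))


for population in (322.0, 353.0, 496.0, 864.0, 2401.0):
    print(population, "->", modality_label(population, population_thresholds))
```

Because each group includes its upper threshold, a district of exactly 353 people falls in modality 0 and one of exactly 864 people in modality 2.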

Detailed overview of tested combinations

[18]:

auto_carver.history("Population").head(50)

[18]:

	combination	kruskal	viability	viability_message	grouping_nan
0	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	61.608669	None	[Raw X distribution]	False
1	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	38.595886	False	[X_dev: inversion of target rates per modality]	False
2	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	37.509588	False	[X_dev: inversion of target rates per modality]	False
3	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	36.950627	False	[X_dev: inversion of target rates per modality]	False
4	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	36.516263	False	[X_dev: inversion of target rates per modality]	False
5	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	36.285155	False	[X_dev: inversion of target rates per modality]	False
6	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.834724	False	[X_dev: inversion of target rates per modality]	False
7	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.722272	False	[X_dev: inversion of target rates per modality]	False
8	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.697342	False	[X_dev: inversion of target rates per modality]	False
9	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.694201	False	[X_dev: inversion of target rates per modality]	False
10	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.604611	False	[X_dev: inversion of target rates per modality]	False
11	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.496366	False	[X_dev: inversion of target rates per modality]	False
12	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.465201	False	[X_dev: inversion of target rates per modality]	False
13	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.396630	False	[X_dev: inversion of target rates per modality]	False
14	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.367801	False	[X_dev: inversion of target rates per modality]	False
15	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.352993	False	[X_dev: inversion of target rates per modality]	False
16	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.347358	False	[X_dev: inversion of target rates per modality]	False
17	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	35.290339	False	[X_dev: inversion of target rates per modality]	False
18	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	34.054800	False	[X_dev: inversion of target rates per modality]	False
19	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	32.727624	False	[X_dev: inversion of target rates per modality]	False
20	[[x <= 3.530e+02, 3.530e+02 < x <= 5.140e+02, ...	32.081343	False	[X_dev: inversion of target rates per modality]	False
21	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	31.944529	False	[X_dev: inversion of target rates per modality]	False
22	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	31.862812	False	[X_dev: inversion of target rates per modality]	False
23	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	31.616060	False	[X_dev: inversion of target rates per modality]	False
24	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	31.159450	False	[X_dev: inversion of target rates per modality]	False
25	[[x <= 3.530e+02, 3.530e+02 < x <= 5.140e+02, ...	30.754167	False	[X_dev: inversion of target rates per modality]	False
26	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	30.410371	False	[X_dev: inversion of target rates per modality]	False
27	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	30.390508	False	[X_dev: inversion of target rates per modality]	False
28	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	30.294638	False	[X_dev: inversion of target rates per modality]	False
29	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	30.234690	False	[X_dev: inversion of target rates per modality]	False
30	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	30.047886	False	[X_dev: inversion of target rates per modality]	False
31	[[x <= 3.530e+02, 3.530e+02 < x <= 5.140e+02, ...	29.971072	False	[X_dev: inversion of target rates per modality]	False
32	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.964223	False	[X_dev: inversion of target rates per modality]	False
33	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.710897	False	[X_dev: inversion of target rates per modality]	False
34	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.543128	False	[X_dev: inversion of target rates per modality]	False
35	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.463808	False	[X_dev: inversion of target rates per modality]	False
36	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.409778	False	[X_dev: inversion of target rates per modality]	False
37	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.297466	False	[X_dev: inversion of target rates per modality]	False
38	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.199974	False	[X_dev: inversion of target rates per modality]	False
39	[[x <= 3.530e+02, 3.530e+02 < x <= 5.140e+02, ...	29.185993	False	[X_dev: inversion of target rates per modality]	False
40	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.157786	False	[X_dev: inversion of target rates per modality]	False
41	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.073034	False	[X_dev: inversion of target rates per modality]	False
42	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	29.050321	True	[Combination robust between X and X_dev]	False
43	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.949148	None	[Not checked]	False
44	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.887201	None	[Not checked]	False
45	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.861708	None	[Not checked]	False
46	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.842197	None	[Not checked]	False
47	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.839507	None	[Not checked]	False
48	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.822334	None	[Not checked]	False
49	[[x <= 3.530e+02], [3.530e+02 < x <= 5.140e+02...	28.795493	None	[Not checked]	False

[19]:

auto_carver.history("Population")["viability_message"][2]

[19]:

['X_dev: inversion of target rates per modality']

The most associated combination of feature Population (the first one tested, where viability_message!=["Raw X distribution"]) did not pass the viability tests. Looking at viability_message:
- "X_dev: inversion of target rates per modality": target rates (mean values of MedHouseVal per grouped modality) are not ranked the same between train_set and dev_set
For feature Population, the 42nd combination is the first to pass the tests:
- viability_message==["Combination robust between X and X_dev"]
- Kruskal-Wallis’ H with MedHouseVal is 29.050321 for this combination
- Following combinations (less associated with the target) were not tested: viability_message==["Not checked"]
For all combinations, grouping_nan==False means that NaNs are not grouped with other modalities in the combination (as requested with dropna=False)
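The pattern visible in this history (a run of rejected combinations, one accepted, the rest unchecked) follows from scanning candidates by decreasing association and stopping at the first viable one. Here is a minimal sketch of that loop on toy records; the record layout, the "robust" flag and the function first_viable are illustrative assumptions, not AutoCarver's internals.

```python
def first_viable(candidates, is_viable):
    """Scan candidates by decreasing association; stop testing once one is viable."""
    history = []
    chosen = None
    for candidate in sorted(candidates, key=lambda c: c["kruskal"], reverse=True):
        if chosen is None:
            viable = is_viable(candidate)
            history.append({**candidate, "viability": viable})
            if viable:
                chosen = candidate
        else:
            # weaker candidates are recorded but never tested
            history.append({**candidate, "viability": None})
    return chosen, history


# toy candidates: "robust" stands in for the outcome of the viability tests
candidates = [
    {"combination": "A", "kruskal": 38.6, "robust": False},
    {"combination": "B", "kruskal": 29.05, "robust": True},
    {"combination": "C", "kruskal": 28.95, "robust": True},
    {"combination": "D", "kruskal": 37.5, "robust": False},
]
chosen, history = first_viable(candidates, is_viable=lambda c: c["robust"])
print(chosen["combination"], [row["viability"] for row in history])
```

Candidate C would also be robust, but it is never tested because B, which is more associated with the target, already passed.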

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[20]:

import json

# storing as json file
with open('continuous_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)

Loading

Carvers can safely be loaded from a .json file.

[21]:

import json

from AutoCarver import load_carver

# loading json file
with open('continuous_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))

Applying AutoCarver

[22]:

dev_set_processed = auto_carver.transform(dev_set)

[23]:

dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

[23]:

	MedInc	AveBedrms	Population	AveRooms	HouseAge	AveOccup
0.0	0.255432	0.558280	0.052995	0.609219	0.340282	0.091897
1.0	0.197592	0.200382	0.208749	0.096888	0.105843	0.464181
2.0	0.257780	0.095126	0.049031	0.097328	0.167205	0.245009
3.0	0.187904	0.096447	0.539049	0.046682	0.279800	0.094686
4.0	0.101292	0.049765	0.150176	0.149883	0.106870	0.104228

Updating/Adjusting AutoCarver

Updating group thresholds/values

Let’s say one wants to adjust the values of a carved quantitative feature for better readability. Take feature "Population" for example: one could prefer group thresholds rounded to the nearest ten.

[24]:

feature = "Population"
thresholds = auto_carver.values_orders[feature]
print(f"Upper thresholds for groups of feature {feature}: {thresholds}")

Upper thresholds for groups of feature Population: [353.0, 793.0, 864.0, 2152.0, inf]

[25]:

# rounding each threshold to the nearest ten
for threshold in thresholds[:-1]:  # be careful not to round np.inf: the last threshold must be kept as is
    rounded_threshold = 10 * round(threshold / 10, 0)
    auto_carver.update_discretizer(feature, "replace", threshold, rounded_threshold)

auto_carver.summary(feature)

[25]:

		label	content
feature	dtype
Population	float	0	[x <= 3.500e+02]
	float	1	[3.500e+02 < x <= 7.900e+02]
	float	2	[7.900e+02 < x <= 8.600e+02]
	float	3	[8.600e+02 < x <= 2.150e+03]
	float	4	[2.150e+03 < x]

Using ContinuousCarver.update_discretizer() we managed to round the thresholds of our groups. Be careful with changes to Carvers as no viability checks are performed.
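
Since no viability checks are performed, it can help to verify rounded thresholds by hand before applying them. The following is a minimal sketch of the rounding arithmetic alone, independent of AutoCarver; the helper name round_to_nearest_ten is hypothetical. It also checks that the thresholds stay strictly increasing, because two thresholds rounded to the same value would collapse a group.

```python
from math import inf, isinf

# upper thresholds printed for feature Population
thresholds = [353.0, 793.0, 864.0, 2152.0, inf]


def round_to_nearest_ten(thresholds):
    """Rounds finite thresholds to the nearest ten, leaving infinite ones as is."""
    return [t if isinf(t) else 10 * round(t / 10, 0) for t in thresholds]


def is_strictly_increasing(values):
    """True when each value is strictly greater than the previous one."""
    return all(low < high for low, high in zip(values, values[1:]))


rounded = round_to_nearest_ten(thresholds)
print(rounded, is_strictly_increasing(rounded))

# two close thresholds end up on the same rounded value: a group would disappear
collided = round_to_nearest_ten([793.0, 794.0, inf])
print(collided, is_strictly_increasing(collided))
```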

Grouping built modalities

Let’s say one thinks there are too many modalities for carved feature "Population". Using ContinuousCarver.update_discretizer() we can easily group two existing groups.

[26]:

feature = "Population"
thresholds = auto_carver.values_orders[feature]
print(f"Upper thresholds for groups of feature {feature}: {thresholds}")

Upper thresholds for groups of feature Population: [350.0, 790.0, 860.0, 2150.0, inf]

Threshold values 790 and 860 seem quite close compared to 350 and 2150. Let’s group them:

[27]:

auto_carver.update_discretizer(feature, "group", 790, 860)  # be careful to respect the real number ordering, otherwise labels and discretizations will be wrong
auto_carver.summary(feature)

[27]:

		label	content
feature	dtype
Population	float	0	[x <= 3.500e+02]
	float	1	[3.500e+02 < x <= 8.600e+02]
	float	2	[8.600e+02 < x <= 2.150e+03]
	float	3	[2.150e+03 < x]
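
The summaries above read as a list of upper thresholds, where a value belongs to the first group whose threshold it does not exceed. The sketch below illustrates that reading with the standard bisect module and shows that merging two groups amounts to dropping the lower threshold. The helpers label_of and merge_groups are hypothetical and only mimic the displayed intervals; they are not AutoCarver internals.

```python
from bisect import bisect_left
from math import inf


def label_of(value, upper_thresholds):
    """Index of the first group whose upper threshold is >= value."""
    return bisect_left(upper_thresholds, value)


def merge_groups(upper_thresholds, lower, upper):
    """Merges every group ending in [lower, upper) into the group ending at upper."""
    if lower >= upper:
        raise ValueError("lower must be strictly smaller than upper")
    return [t for t in upper_thresholds if not (lower <= t < upper)]


before = [350.0, 790.0, 860.0, 2150.0, inf]
after = merge_groups(before, 790, 860)
print(after)

# a population of 800 moves from the third group (label 2) to the second (label 1)
print(label_of(800, before), label_of(800, after))
```

The ordering check mirrors the warning in the cell above: passing the thresholds in the wrong order would silently produce wrong labels.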

Adding new/unexpected modalities

Let’s say a new modality or missing values appear when using Carvers on new data. One can easily add these modalities to an existing group of one’s choice, once again using ContinuousCarver.update_discretizer().

[28]:

feature = "Population"
thresholds = auto_carver.values_orders[feature]
print(f"Group thresholds for feature {feature}: {thresholds}")

Group thresholds for feature Population: [350.0, 860.0, 2150.0, inf]

Let’s add missing values nan to the highest group (upper threshold at inf):

[29]:

from numpy import nan, inf

auto_carver.update_discretizer(feature, "group", nan, inf)  # be careful to respect the real number ordering, otherwise labels and discretizations will be wrong
auto_carver.summary(feature)

[29]:

		label	content
feature	dtype
Population	float	0	[x <= 3.500e+02]
	float	1	[3.500e+02 < x <= 8.600e+02]
	float	2	[8.600e+02 < x <= 2.150e+03]
	float	3	[2.150e+03 < x, __NAN__]

nan has been added to the highest modality as the default missing value in Carvers (str_nan="__NAN__"). For the sake of the example, we will simulate some randomly missing values in the feature and apply the ContinuousCarver discretization:

[30]:

# inserting missing values
dev_set_with_nan = dev_set.copy()
dev_set_with_nan.loc[dev_set_with_nan.sample(frac=0.1).index, feature] = nan

[31]:

# processing datasets
dev_set_processed = auto_carver.transform(dev_set)
dev_set_with_nan_processed = auto_carver.transform(dev_set_with_nan)

[32]:

print("Distribution of Population that has no missing values (raw dev_set)\n", dev_set_processed[feature].value_counts(dropna=False, normalize=True).sort_index())
print("Distribution of Population that has missing values (dev_set_with_nan)\n", dev_set_with_nan_processed[feature].value_counts(dropna=False, normalize=True).sort_index())

Distribution of Population that has no missing values (raw dev_set)
 Population
0.0    0.052554
1.0    0.254844
2.0    0.542278
3.0    0.150323
Name: proportion, dtype: float64
Distribution of Population that has missing values (dev_set_with_nan)
 Population
0.0    0.047123
1.0    0.230916
2.0    0.489430
3.0    0.232531
Name: proportion, dtype: float64

As requested, the missing values have been grouped with the modality labeled 3!
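
The two distributions can be cross-checked with a little arithmetic. Since 10 % of the rows were set to nan at random, each group should keep roughly 90 % of its original share, and the group that receives the missing values (label 3) should gain about 10 points. The sketch below reuses the proportions printed above; the small gaps between expected and observed shares come from the random sampling.

```python
nan_fraction = 0.1  # share of rows set to nan with sample(frac=0.1)
nan_label = 3       # group that nan was added to

# proportions printed above
without_nan = {0: 0.052554, 1: 0.254844, 2: 0.542278, 3: 0.150323}
with_nan = {0: 0.047123, 1: 0.230916, 2: 0.489430, 3: 0.232531}

# each group keeps (1 - nan_fraction) of its share, nan_label also receives the nans
expected = {label: share * (1 - nan_fraction) for label, share in without_nan.items()}
expected[nan_label] += nan_fraction

for label in sorted(expected):
    print(f"label {label}: expected {expected[label]:.4f}, observed {with_nan[label]:.4f}")
```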

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using ContinuousCarver, hence all features are qualitative.

[33]:

features = qualitative_features + quantitative_features + ordinal_features

Number of features to select

The attribute n_best allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[34]:

n_best = 6  # here the number of features is low, RegressionSelector will only be used to compute useful statistics

Using Selectors

[35]:

from AutoCarver.selectors import RegressionSelector

# select the qualitative features most associated with the target
feature_selector = RegressionSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])

------
[Selector] Selecting from qualitative features: ['MedInc', 'AveBedrms', 'Population', 'AveRooms', 'HouseAge', 'AveOccup']
---

 - [Selector] Association between X and y

	dtype	pct_mode	mode	kruskal_measure
MedInc	float64	0.2500	0.0000	6207.6768
AveRooms	float64	0.6000	0.0000	1417.9360
AveOccup	float64	0.4500	1.0000	1026.3004
AveBedrms	float64	0.5500	0.0000	346.0749
HouseAge	float64	0.3486	0.0000	164.2102
Population	float64	0.5498	3.0000	29.0503


 - [Selector] Association between X and y, filtered for inter-feature assocation

	dtype	pct_mode	mode	kruskal_measure
MedInc	float64	0.2500	0.0000	6207.6768
AveRooms	float64	0.6000	0.0000	1417.9360
AveOccup	float64	0.4500	1.0000	1026.3004
AveBedrms	float64	0.5500	0.0000	346.0749
HouseAge	float64	0.3486	0.0000	164.2102
Population	float64	0.5498	3.0000	29.0503


 - [Selector] Selected qualitative features: ['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population']
------

Feature MedInc is the most associated with the target MedHouseVal:
- Kruskal-Wallis’ H value is kruskal_measure=6207.67678
- It has 0 % of NaNs (pct_nan=0.0)
- Its mode, 0, represents 25 % of observed data (pct_mode=0.2500)
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
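
When no filter is set, the selection shown above amounts to ranking features by their association with the target and keeping the first n_best. The sketch below reproduces that ranking from the kruskal_measure values printed by the Selector. The helper select_n_best is hypothetical, and it ignores the inter-feature association and over-representation filters a Selector can apply.

```python
# kruskal_measure values printed by the Selector
kruskal_measure = {
    "MedInc": 6207.6768,
    "AveBedrms": 346.0749,
    "Population": 29.0503,
    "AveRooms": 1417.9360,
    "HouseAge": 164.2102,
    "AveOccup": 1026.3004,
}


def select_n_best(measures, n_best):
    """Keeps the n_best features with the highest association measure."""
    ranked = sorted(measures, key=measures.get, reverse=True)
    return ranked[:n_best]


print(select_n_best(kruskal_measure, n_best=6))
print(select_n_best(kruskal_measure, n_best=3))
```

With n_best=6 every feature is kept and only the ordering is informative; a lower value such as n_best=3 would keep the three most associated features.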

What’s next?

Thanks to Carvers all of your features are now optimally processed for your regression task!
As a final step towards your model, Selectors are handy tools for target-optimal data pre-selection, so make sure to check out the Selectors Examples!

Well done!

Your commitment to achieving optimal results in continuous regression tasks shines through in your meticulous use of AutoCarver’s ContinuousCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The ContinuousCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in continuous regression tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.