Setting things up

About this notebook

This notebook shows how to preprocess the California Housing Prices dataset with AutoCarver's ContinuousCarver pipeline. ContinuousCarver discretizes features, whether quantitative or qualitative, so as to maximize their association with a continuous target. The goal here is to prepare the dataset for a continuous regression task: predicting housing prices.

The dataset describes Californian housing districts through features such as median income, housing age, average numbers of rooms and bedrooms, population, and location. ContinuousCarver can discretize both quantitative and qualitative features; in this dataset, all carved features are quantitative.

The sections below walk through ContinuousCarver's discretization pipeline step by step: choosing its settings, fitting it on a training sample with a dev sample to check robustness, reading its summary and history, then selecting features and fitting a regression model on the carved data.

Installation

[1]:

# %pip install AutoCarver[jupyter]

California Housing Prices Data

In this example notebook, we will use the California Housing Prices dataset.

The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.

Comprising housing-related metrics for various districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more, the California Housing Prices dataset is a valuable resource for exploring the relationships between different features and predicting the median house values (continuous regression).

[1]:

from sklearn import datasets

# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)

# conversion to pandas
housing_data = housing["data"]
housing_data[housing["target_names"][0]] = housing["target"]

# Display the first few rows of the dataset
housing_data.head()

[1]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Target type and Carver selection

[2]:

target = "MedHouseVal"

housing_data[target].describe()

[2]:

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target "MedHouseVal" is a continuous target of type float64, so this is a regression task. Hence we will use AutoCarver.ContinuousCarver and AutoCarver.selectors.RegressionSelector in the following code blocks.

Data Sampling

[3]:

from sklearn.model_selection import train_test_split

# random split: train_test_split is called without stratify, so sampling is not stratified
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)

# checking target mean per dataset
train_set[target].mean(), dev_set[target].mean()

[3]:

(np.float64(2.0666362048018514), np.float64(2.072459655020552))
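The split above draws rows at random, and the two target means end up close anyway. When a closer match of the target distribution is wanted, one common option is to stratify on quantile bins of the continuous target. The sketch below is a minimal illustration on synthetic data: the frame `df`, its columns `x` and `y`, and the choice of 10 bins are all assumptions made for the example, not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# synthetic stand-in for a dataframe with a continuous target (hypothetical data)
rng = np.random.default_rng(42)
df = pd.DataFrame({"x": rng.normal(size=1000), "y": rng.lognormal(size=1000)})

# bin the continuous target into deciles, only to drive the stratification
target_bins = pd.qcut(df["y"], q=10, labels=False)

train_df, dev_df = train_test_split(
    df, test_size=0.33, random_state=42, stratify=target_bins
)

# share of each target decile in both samples (should stay close to 10 %)
train_share = target_bins.loc[train_df.index].value_counts(normalize=True)
dev_share = target_bins.loc[dev_df.index].value_counts(normalize=True)
print(train_share.round(3).sort_index())
print(dev_share.round(3).sort_index())
```

The bins are used for the split only; the model still sees the original continuous target.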

Picking up columns to Carve

[4]:

train_set.head()

[4]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
5088	0.9809	19.0	3.187726	1.129964	726.0	2.620939	33.98	-118.28	1.214
17096	4.2232	33.0	6.189696	1.086651	1015.0	2.377049	37.46	-122.23	3.637
5617	3.5488	42.0	4.821577	1.095436	1044.0	4.331950	33.79	-118.26	2.056
20060	1.6469	24.0	4.274194	1.048387	1686.0	4.532258	35.87	-119.26	0.476
895	3.9909	14.0	4.608303	1.089350	2738.0	2.471119	37.54	-121.96	2.360

[5]:

# column data types
train_set.dtypes

[5]:

MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object

All features are quantitative continuous features, with the exception of Latitude and Longitude, which are geographical features (not supported by AutoCarver as is). All other features will be passed to Features through its numericals argument.

[6]:

from AutoCarver import Features

# lists of features per data type
features = Features(numericals=["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"])
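The list of column names can also be derived from the dtypes rather than typed by hand. The following is a small sketch on a two-row stand-in frame that copies the column names shown above; the variable names `frame`, `excluded` and `numericals` are illustrative only.

```python
import pandas as pd

# two-row stand-in with the same columns as the housing dataframe
frame = pd.DataFrame(
    {
        "MedInc": [8.3252, 8.3014],
        "HouseAge": [41.0, 21.0],
        "AveRooms": [6.984127, 6.238137],
        "AveBedrms": [1.023810, 0.971880],
        "Population": [322.0, 2401.0],
        "AveOccup": [2.555556, 2.109842],
        "Latitude": [37.88, 37.86],
        "Longitude": [-122.23, -122.22],
        "MedHouseVal": [4.526, 3.585],
    }
)

target = "MedHouseVal"
# geographical columns and the target are left out of the carved features
excluded = ["Latitude", "Longitude", target]

# keep numeric columns, in their original order, minus the excluded ones
numericals = [
    col for col in frame.select_dtypes(include="number").columns if col not in excluded
]
print(numericals)
```

The resulting list can then be passed as the numericals argument used in the cell above.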

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used:

For quantitative features, to define the number of quantiles used to initially discretize the features.
For qualitative features, to define the threshold under which a modality is grouped to either a default value or its closest modality.

[7]:

min_freq = 0.1

Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
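To get a feel for what min_freq implies for quantitative features, the sketch below runs a plain quantile discretization with pandas. It relies on an assumption inferred from the raw distributions logged during fitting in this notebook, where min_freq=0.1 yields 20 bins of about 5 % each: the number of quantiles is taken as round(2 / min_freq). That rule is a reading of the logs, not a documented formula, and `pd.qcut` only stands in for the library's own discretizer.

```python
import numpy as np
import pandas as pd

min_freq = 0.1

# assumption inferred from this notebook's logs (20 raw bins for min_freq=0.1),
# not taken from the library documentation
n_quantiles = round(2 / min_freq)

# synthetic skewed feature (hypothetical data)
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(size=10_000))

# plain quantile discretization: each bin holds about 1 / n_quantiles of the rows
raw_bins = pd.qcut(values, q=n_quantiles)
frequencies = raw_bins.value_counts(normalize=True, sort=False)
print(frequencies.round(4))
```

A lower min_freq produces more, narrower starting bins, which is why the tip above describes it as slower and less robust.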

Desired number of modalities

The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit on the number of modalities per consecutive combination of modalities.

[8]:

max_n_mod = 4

Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
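The speed trade-off in the tip above follows from simple counting: cutting n ordered raw modalities into k consecutive groups can be done in C(n-1, k-1) ways. The sketch below tallies these for 20 raw modalities. It assumes, for illustration, that every group count from 2 up to max_n_mod is a candidate; the library may prune or order the search differently.

```python
from math import comb


def n_consecutive_groupings(n_raw: int, n_groups: int) -> int:
    """Number of ways to cut n_raw ordered modalities into n_groups consecutive groups."""
    return comb(n_raw - 1, n_groups - 1)


n_raw = 20  # number of raw modalities after quantile discretization

totals = {}
for max_n_mod in (3, 4, 5, 7):
    # assumption: all group counts from 2 to max_n_mod are candidates
    totals[max_n_mod] = sum(
        n_consecutive_groupings(n_raw, k) for k in range(2, max_n_mod + 1)
    )
    print(f"max_n_mod={max_n_mod}: {totals[max_n_mod]} candidate combinations")
```

The count grows quickly with max_n_mod, and a larger search space also leaves more room for combinations that fit the training sample but not the dev sample.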

Grouping NaNs

The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-nan values, and then test out all possible combinations with nan.

[9]:

dropna = False  # anyway, there are no nan in this dataset

Type of output carved features

The attribute ordinal_encoding allows one to choose the output type:

Use True for integer output of ranked modalities (default)
Use False for string output of modalities

[10]:

ordinal_encoding = True

Fitting AutoCarver

First, all quantitative features are discretized:
1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)
2. Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2) to be grouped with its closest modality
Second, all features are carved following this recipe, for all classes of train_set[target] (except one):
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (Kruskal-Wallis’ statistic, by default) is computed with the provided train_set[target]
4. Combinations are sorted in descending order by association value
5. Testing robustness: finds the first combination that checks the following:
  - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)
  - Distinct target rates per consecutive modalities on train_set and dev_set
  - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run
7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
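Steps 2 to 4 of the recipe above can be sketched in a few lines. This is an illustration on synthetic data, not the library's implementation: it enumerates consecutive groupings with itertools, scores each with `scipy.stats.kruskal`, and sorts by the statistic. The robustness checks of step 5 are left out, and all names (`raw`, `y`, `consecutive_groupings`, `group_labels`) are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# synthetic data: 6 ordered raw modalities (0..5) whose target mean rises in three steps
n_raw = 6
raw = rng.integers(0, n_raw, size=3000)
y = rng.normal(loc=np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])[raw], scale=0.5)


def consecutive_groupings(n_modalities, n_groups):
    """Yield the cut points of every split of ordered modalities into consecutive groups."""
    yield from combinations(range(1, n_modalities), n_groups - 1)


def group_labels(raw_values, cuts):
    """Map raw modality indices to group indices, given sorted cut points."""
    return np.searchsorted(cuts, raw_values, side="right")


max_n_mod = 3
scored = []
for n_groups in range(2, max_n_mod + 1):
    for cuts in consecutive_groupings(n_raw, n_groups):
        labels = group_labels(raw, cuts)
        samples = [y[labels == group] for group in range(n_groups)]
        # association between the grouped feature and the continuous target
        scored.append((kruskal(*samples).statistic, cuts))

# most associated combination first
scored.sort(reverse=True)
best_h, best_cuts = scored[0]
print(best_cuts, round(float(best_h), 1))
```

On this data the best-scoring cut points recover the three steps built into the target. In the real pipeline the sorted list is then walked until a combination passes the robustness tests on both samples.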

[11]:

from AutoCarver import ContinuousCarver
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# initiating AutoCarver
auto_carver = ContinuousCarver(
    features=features,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    config=ProcessingConfig(dropna=dropna, ordinal_encoding=ordinal_encoding, verbose=True, copy=True),
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
--- [QuantitativeDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [ContinuousDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [OrdinalDiscretizer] Fit Features(['HouseAge'])
------

---------
------ [ContinuousCarver] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
--- [ContinuousCarver] Fit Quantitative('MedInc') (1/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.60e+00	1.1102	0.0500	692
1.60e+00 < x <= 1.91e+00	1.1285	0.0500	691
1.91e+00 < x <= 2.15e+00	1.2198	0.0500	692
2.15e+00 < x <= 2.35e+00	1.3171	0.0500	691
2.35e+00 < x <= 2.57e+00	1.3817	0.0500	691
2.57e+00 < x <= 2.74e+00	1.5409	0.0500	692
2.74e+00 < x <= 2.98e+00	1.6159	0.0500	692
2.98e+00 < x <= 3.14e+00	1.6906	0.0499	690
3.14e+00 < x <= 3.32e+00	1.8232	0.0500	692
3.32e+00 < x <= 3.54e+00	1.9059	0.0500	691
3.54e+00 < x <= 3.73e+00	2.0076	0.0502	694
3.73e+00 < x <= 3.97e+00	2.0271	0.0498	689
3.97e+00 < x <= 4.18e+00	2.1456	0.0500	691
4.18e+00 < x <= 4.46e+00	2.2433	0.0500	691
4.46e+00 < x <= 4.76e+00	2.3621	0.0501	693
4.76e+00 < x <= 5.12e+00	2.3986	0.0499	690
5.12e+00 < x <= 5.54e+00	2.6438	0.0500	691
5.54e+00 < x <= 6.16e+00	2.9324	0.0500	692
6.16e+00 < x <= 7.32e+00	3.4592	0.0500	691
7.32e+00 < x	4.3784	0.0500	692

X_dev distribution
target_mean	frequency	count
1.1017	0.0509	347
1.0410	0.0502	342
1.2407	0.0501	341
1.2919	0.0506	345
1.4676	0.0536	365
1.5605	0.0417	284
1.6280	0.0584	398
1.7519	0.0471	321
1.8443	0.0504	343
1.8500	0.0498	339
2.0040	0.0533	363
2.0890	0.0502	342
2.1641	0.0505	344
2.2700	0.0540	368
2.3768	0.0439	299
2.5087	0.0479	326
2.6814	0.0483	329
2.9805	0.0479	326
3.3748	0.0530	361
4.3748	0.0483	329

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 2.57e+00	1.2314	0.2500	3457
2.57e+00 < x <= 3.97e+00	1.8016	0.3500	4840
3.97e+00 < x <= 5.54e+00	2.3587	0.2499	3456
5.54e+00 < x	3.5900	0.1501	2075

X_dev distribution
target_mean	frequency	count
1.2315	0.2554	1740
1.8222	0.3509	2390
2.3953	0.2446	1666
3.5721	0.1491	1016

--- [ContinuousCarver] Fit Quantitative('HouseAge') (2/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 8.00e+00	2.1158	0.0537	742
8.00e+00 < x <= 1.20e+01	1.8220	0.0477	659
1.20e+01 < x <= 1.50e+01	1.8590	0.0613	847
1.50e+01 < x <= 1.80e+01	1.9547	0.0989	1367
1.80e+01 < x <= 2.20e+01	1.9739	0.0871	1205
2.20e+01 < x <= 2.50e+01	2.1055	0.0705	975
2.50e+01 < x <= 2.80e+01	2.0512	0.0775	1072
2.80e+01 < x <= 3.10e+01	2.0439	0.0682	943
3.10e+01 < x <= 3.30e+01	2.0275	0.0575	795
3.30e+01 < x <= 3.50e+01	2.0651	0.0722	999
3.50e+01 < x <= 3.70e+01	2.0750	0.0687	950
3.70e+01 < x <= 4.20e+01	2.0102	0.0811	1121
4.20e+01 < x <= 4.50e+01	2.1301	0.0485	670
4.50e+01 < x	2.4785	0.1072	1483

X_dev distribution
target_mean	frequency	count
2.0205	0.0526	358
1.7827	0.0443	302
1.8780	0.0556	379
1.9391	0.0986	672
2.0285	0.0891	607
2.1179	0.0759	517
2.1634	0.0743	506
1.9546	0.0664	452
2.0512	0.0565	385
2.1862	0.0755	514
2.0747	0.0659	449
2.0174	0.0895	610
2.0015	0.0489	333
2.4651	0.1069	728

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 2.20e+01	1.9494	0.3486	4820
2.20e+01 < x <= 3.70e+01	2.0623	0.4147	5734
3.70e+01 < x <= 4.50e+01	2.0550	0.1295	1791
4.50e+01 < x	2.4785	0.1072	1483

X_dev distribution
target_mean	frequency	count
1.9447	0.3403	2318
2.0964	0.4144	2823
2.0118	0.1384	943
2.4651	0.1069	728

--- [ContinuousCarver] Fit Quantitative('AveRooms') (3/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 3.44e+00	1.9126	0.0500	692
3.44e+00 < x <= 3.79e+00	1.8286	0.0500	691
3.79e+00 < x <= 4.06e+00	1.8169	0.0500	692
4.06e+00 < x <= 4.28e+00	1.8418	0.0500	691
4.28e+00 < x <= 4.46e+00	1.7529	0.0500	691
4.46e+00 < x <= 4.62e+00	1.7915	0.0500	692
4.62e+00 < x <= 4.79e+00	1.8214	0.0500	691
4.79e+00 < x <= 4.94e+00	1.7685	0.0500	691
4.94e+00 < x <= 5.09e+00	1.7466	0.0500	692
5.09e+00 < x <= 5.23e+00	1.7717	0.0500	691
5.23e+00 < x <= 5.38e+00	1.8664	0.0500	691
5.38e+00 < x <= 5.53e+00	1.8472	0.0500	692
5.53e+00 < x <= 5.69e+00	1.9199	0.0500	691
5.69e+00 < x <= 5.86e+00	1.9910	0.0500	691
5.86e+00 < x <= 6.06e+00	2.0870	0.0500	692
6.06e+00 < x <= 6.27e+00	2.1908	0.0500	691
6.27e+00 < x <= 6.54e+00	2.4050	0.0500	691
6.54e+00 < x <= 6.95e+00	2.6874	0.0500	692
6.95e+00 < x <= 7.65e+00	3.1129	0.0500	691
7.65e+00 < x	3.1718	0.0500	692

X_dev distribution
target_mean	frequency	count
1.8659	0.0518	353
1.8728	0.0505	344
1.7627	0.0524	357
1.8020	0.0543	370
1.7223	0.0552	376
1.6802	0.0452	308
1.7707	0.0530	361
1.8030	0.0443	302
1.8209	0.0523	356
1.8326	0.0437	298
1.7923	0.0550	375
1.9388	0.0514	350
1.9465	0.0501	341
2.0248	0.0468	319
2.1049	0.0483	329
2.2239	0.0490	334
2.4339	0.0467	318
2.7667	0.0468	319
3.1001	0.0548	373
3.2429	0.0483	329

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 5.69e+00	1.8220	0.6500	8988
5.69e+00 < x <= 6.27e+00	2.0896	0.1500	2074
6.27e+00 < x <= 6.95e+00	2.5463	0.1000	1383
6.95e+00 < x	3.1424	0.1000	1383

X_dev distribution
target_mean	frequency	count
1.8162	0.6593	4491
2.1194	0.1442	982
2.6006	0.0935	637
3.1670	0.1031	702

--- [ContinuousCarver] Fit Quantitative('AveBedrms') (4/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 9.4000e-01	2.0684	0.0500	692
9.4000e-01 < x <= 9.6724e-01	2.0735	0.0500	691
9.6724e-01 < x <= 9.8319e-01	2.2167	0.0501	693
9.8319e-01 < x <= 9.9576e-01	2.1706	0.0499	690
9.9576e-01 < x <= 1.0066e+00	2.1310	0.0500	692
1.0066e+00 < x <= 1.0154e+00	2.2358	0.0500	691
1.0154e+00 < x <= 1.0247e+00	2.1668	0.0500	691
1.0247e+00 < x <= 1.0331e+00	2.2102	0.0500	692
1.0331e+00 < x <= 1.0412e+00	2.1295	0.0500	691
1.0412e+00 < x <= 1.0495e+00	2.1548	0.0500	691
1.0495e+00 < x <= 1.0576e+00	2.1238	0.0500	692
1.0576e+00 < x <= 1.0665e+00	2.1025	0.0500	691
1.0665e+00 < x <= 1.0768e+00	2.0704	0.0500	691
1.0768e+00 < x <= 1.0878e+00	2.0664	0.0501	693
1.0878e+00 < x <= 1.1003e+00	2.1118	0.0499	690
1.1003e+00 < x <= 1.1161e+00	1.9937	0.0500	691
1.1161e+00 < x <= 1.1382e+00	1.9405	0.0500	691
1.1382e+00 < x <= 1.1738e+00	1.7990	0.0500	692
1.1738e+00 < x <= 1.2732e+00	1.9162	0.0500	691
1.2732e+00 < x	1.6515	0.0500	692

X_dev distribution
target_mean	frequency	count
2.0416	0.0539	367
2.2043	0.0527	359
2.0997	0.0482	328
2.1835	0.0487	332
2.2628	0.0552	376
2.1619	0.0480	327
2.2295	0.0567	386
2.1690	0.0493	336
2.1581	0.0528	360
2.1202	0.0476	324
2.1039	0.0452	308
2.1595	0.0509	347
2.1037	0.0521	355
2.0662	0.0484	330
2.0487	0.0489	333
1.9543	0.0467	318
1.8871	0.0484	330
1.8680	0.0499	340
1.8371	0.0465	317
1.7182	0.0498	339

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 1.058e+00	2.1528	0.5500	7606
1.058e+00 < x <= 1.100e+00	2.0878	0.2000	2765
1.100e+00 < x <= 1.138e+00	1.9671	0.0999	1382
1.138e+00 < x	1.7888	0.1501	2075

X_dev distribution
target_mean	frequency	count
2.1597	0.5583	3803
2.0954	0.2004	1365
1.9201	0.0951	648
1.8072	0.1462	996

--- [ContinuousCarver] Fit Quantitative('Population') (5/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 3.53e+02	1.9859	0.0501	693
3.53e+02 < x <= 5.14e+02	2.1616	0.0501	693
5.14e+02 < x <= 6.27e+02	2.1117	0.0501	693
6.27e+02 < x <= 7.15e+02	2.2819	0.0497	687
7.15e+02 < x <= 7.93e+02	2.0335	0.0509	704
7.93e+02 < x <= 8.64e+02	2.2113	0.0492	681
8.64e+02 < x <= 9.38e+02	2.0772	0.0498	689
9.38e+02 < x <= 1.02e+03	2.1386	0.0500	692
1.02e+03 < x <= 1.09e+03	2.0430	0.0503	696
1.09e+03 < x <= 1.17e+03	2.0506	0.0496	686
1.17e+03 < x <= 1.26e+03	2.0870	0.0505	698
1.26e+03 < x <= 1.35e+03	2.0195	0.0497	687
1.35e+03 < x <= 1.46e+03	2.0004	0.0502	694
1.46e+03 < x <= 1.58e+03	2.1102	0.0498	688
1.58e+03 < x <= 1.73e+03	2.0346	0.0500	691
1.73e+03 < x <= 1.91e+03	1.9139	0.0499	690
1.91e+03 < x <= 2.15e+03	2.0006	0.0500	691
2.15e+03 < x <= 2.56e+03	2.0707	0.0500	692
2.56e+03 < x <= 3.30e+03	1.9614	0.0500	691
3.30e+03 < x	2.0428	0.0500	692

X_dev distribution
target_mean	frequency	count
1.9012	0.0530	361
2.1915	0.0520	354
2.1706	0.0523	356
2.1062	0.0514	350
2.2019	0.0531	362
2.1765	0.0490	334
2.2025	0.0506	345
2.1329	0.0553	377
2.1744	0.0437	298
2.1319	0.0480	327
1.9939	0.0534	364
2.0096	0.0465	317
1.9569	0.0465	317
1.9756	0.0504	343
2.0815	0.0496	338
2.0272	0.0461	314
1.9789	0.0487	332
1.9355	0.0496	338
2.0714	0.0518	353
2.0157	0.0487	332

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 6.27e+02	2.0864	0.1503	2079
6.27e+02 < x <= 8.64e+02	2.1743	0.1498	2072
8.64e+02 < x <= 2.15e+03	2.0433	0.5498	7602
2.15e+03 < x	2.0250	0.1501	2075

X_dev distribution
target_mean	frequency	count
2.0867	0.1572	1071
2.1618	0.1536	1046
2.0607	0.5390	3672
2.0084	0.1502	1023

--- [ContinuousCarver] Fit Quantitative('AveOccup') (6/6)
 [ContinuousCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.870e+00	2.7122	0.0500	692
1.870e+00 < x <= 2.067e+00	2.6633	0.0500	691
2.067e+00 < x <= 2.225e+00	2.3373	0.0500	692
2.225e+00 < x <= 2.338e+00	2.3080	0.0500	691
2.338e+00 < x <= 2.432e+00	2.1976	0.0500	691
2.432e+00 < x <= 2.513e+00	2.2064	0.0500	692
2.513e+00 < x <= 2.595e+00	2.1736	0.0500	691
2.595e+00 < x <= 2.668e+00	2.1862	0.0500	691
2.668e+00 < x <= 2.743e+00	2.1378	0.0500	692
2.743e+00 < x <= 2.820e+00	2.1902	0.0500	691
2.820e+00 < x <= 2.898e+00	2.1824	0.0500	691
2.898e+00 < x <= 2.984e+00	2.0741	0.0500	692
2.984e+00 < x <= 3.073e+00	2.0255	0.0501	693
3.073e+00 < x <= 3.171e+00	1.9914	0.0498	689
3.171e+00 < x <= 3.282e+00	1.8992	0.0500	692
3.282e+00 < x <= 3.425e+00	1.8926	0.0500	691
3.425e+00 < x <= 3.607e+00	1.7085	0.0500	691
3.607e+00 < x <= 3.877e+00	1.5666	0.0500	692
3.877e+00 < x <= 4.325e+00	1.4505	0.0500	691
4.325e+00 < x	1.4294	0.0500	692

X_dev distribution
target_mean	frequency	count
2.7684	0.0484	330
2.5334	0.0435	296
2.3989	0.0542	369
2.3641	0.0533	363
2.2272	0.0546	372
2.2969	0.0489	333
2.3179	0.0508	346
2.0793	0.0467	318
2.1847	0.0521	355
2.1752	0.0504	343
2.0762	0.0533	363
2.0535	0.0501	341
2.0535	0.0528	360
1.9477	0.0458	312
1.8397	0.0449	306
1.8861	0.0514	350
1.7301	0.0448	305
1.6200	0.0499	340
1.4423	0.0527	359
1.4596	0.0515	351

 [ContinuousCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 2.22e+00	2.5709	0.1501	2075
2.22e+00 < x <= 3.07e+00	2.1681	0.5001	6915
3.07e+00 < x <= 3.61e+00	1.8729	0.1998	2763
3.61e+00 < x	1.4822	0.1501	2075

X_dev distribution
target_mean	frequency	count
2.5615	0.1461	995
2.1836	0.5129	3494
1.8527	0.1869	1273
1.5056	0.1541	1050

AutoCarver analysis

Carving Summary

[20]:

auto_carver.summary

[20]:

					content	target_mean	frequency	dropped	dropped_reason
feature	count	kruskal	n_mod	label
Quantitative('MedInc')	3457.0	6037.182135	4	0	x <= 2.57e+00	1.231421	0.250000	False	None
	4840.0	6037.182135	4	1	2.57e+00 < x <= 3.97e+00	1.801562	0.350014	False	None
	3456.0	6037.182135	4	2	3.97e+00 < x <= 5.54e+00	2.358660	0.249928	False	None
	2075.0	6037.182135	4	3	5.54e+00 < x	3.590040	0.150058	False	None
Quantitative('HouseAge')	4820.0	160.599610	4	0	x <= 2.20e+01	1.949361	0.348568	False	None
	5734.0	160.599610	4	1	2.20e+01 < x <= 3.70e+01	2.062306	0.414666	False	None
	1791.0	160.599610	4	2	3.70e+01 < x <= 4.50e+01	2.055043	0.129520	False	None
	1483.0	160.599610	4	3	4.50e+01 < x	2.478542	0.107246	False	None
Quantitative('AveRooms')	8988.0	1401.052572	4	0	x <= 5.69e+00	1.821999	0.649986	False	None
	2074.0	1401.052572	4	1	5.69e+00 < x <= 6.27e+00	2.089595	0.149986	False	None
	1383.0	1401.052572	4	2	6.27e+00 < x <= 6.95e+00	2.546315	0.100014	False	None
	1383.0	1401.052572	4	3	6.95e+00 < x	3.142406	0.100014	False	None
Quantitative('AveBedrms')	7606.0	320.789845	4	0	x <= 1.058e+00	2.152832	0.550043	False	None
	2765.0	320.789845	4	1	1.058e+00 < x <= 1.100e+00	2.087773	0.199957	False	None
	1382.0	320.789845	4	2	1.100e+00 < x <= 1.138e+00	1.967066	0.099942	False	None
	2075.0	320.789845	4	3	1.138e+00 < x	1.788831	0.150058	False	None
Quantitative('Population')	2079.0	16.109709	4	0	x <= 6.27e+02	2.086394	0.150347	False	None
	2072.0	16.109709	4	1	6.27e+02 < x <= 8.64e+02	2.174297	0.149841	False	None
	7602.0	16.109709	4	2	8.64e+02 < x <= 2.15e+03	2.043255	0.549754	False	None
	2075.0	16.109709	4	3	2.15e+03 < x	2.024995	0.150058	False	None
Quantitative('AveOccup')	2075.0	991.408301	4	0	x <= 2.22e+00	2.570888	0.150058	False	None
	6915.0	991.408301	4	1	2.22e+00 < x <= 3.07e+00	2.168126	0.500072	False	None
	2763.0	991.408301	4	2	3.07e+00 < x <= 3.61e+00	1.872867	0.199812	False	None
	2075.0	991.408301	4	3	3.61e+00 < x	1.482183	0.150058	False	None

As requested with ordinal_encoding=True, output labels are integers that rank the modalities.
For quantitative feature Population, the selected combination of modalities groups populations as follows (bounds are rounded by the display):
- label 0: lower than or equal to about 627 people (content="x <= 6.27e+02")
- label 1: greater than 627 people and lower than or equal to 864 people (content="6.27e+02 < x <= 8.64e+02")
- label 2: greater than 864 people and lower than or equal to 2150 people (content="8.64e+02 < x <= 2.15e+03")
- label 3: greater than 2150 people (content="2.15e+03 < x")
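The mapping from raw values to integer labels can be reproduced by hand with `pd.cut`, which helps when checking the carved output. The sketch below uses the Population values of the first five training rows displayed earlier and the bounds shown in the summary. Those bounds are rounded by the display, so values very close to a bound could land differently than in the fitted carver.

```python
import numpy as np
import pandas as pd

# Population of the first five training rows shown earlier in this notebook
population = pd.Series(
    [726.0, 1015.0, 1044.0, 1686.0, 2738.0],
    index=[5088, 17096, 5617, 20060, 895],
)

# upper bounds as displayed in the summary (rounded, hence approximate)
bins = [-np.inf, 627.0, 864.0, 2150.0, np.inf]

# right-closed intervals match the "a < x <= b" convention of the summary;
# labels=False returns the integer rank of each interval
labels = pd.cut(population, bins=bins, labels=False, right=True)
print(labels)
```

These labels agree with the Population column of the processed training set displayed further down in this notebook.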

Detailed overview of tested combinations

[21]:

features["AveOccup"].history.head(7)

[21]:

	info	kruskal	combination	n_mod	dropna	train	viable	dev
0	Raw distribution (n_mod=20>max_n_mod=4)	1062.072498	{'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0...	20	False	NaN	NaN	NaN
1	Not viable	994.514410	{'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0...	4	False	{'viable': True, 'info': ''}	False	{'viable': False, 'info': 'Non-representative ...
2	Not viable	994.504665	{'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0...	4	False	{'viable': True, 'info': ''}	False	{'viable': False, 'info': 'Non-representative ...
3	Not viable	991.504255	{'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0...	4	False	{'viable': True, 'info': ''}	False	{'viable': False, 'info': 'Non-representative ...
4	Best for kruskal and max_n_mod=4	991.408301	{'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0...	4	False	{'viable': True, 'info': ''}	True	{'viable': True, 'info': ''}

[22]:

features["AveOccup"].history.dev[1]

[22]:

{'viable': False, 'info': 'Non-representative modality for min_freq=10.00%'}

The most associated combination of feature AveOccup (the first one tested, where info!="Raw distribution") did not pass the viability tests. When looking in history.dev:
- "Non-representative modality for min_freq=10.00%": tells us that a modality is not frequent enough on dev_set, although the combination is viable on train_set
For feature AveOccup, the 4th combination is the first to pass the tests:
- viable=True
- info="Best for kruskal and max_n_mod=4"
- Kruskal-Wallis’ H with MedHouseVal is 991.408301 for this combination
- Following combinations (less associated with the target) were not tested: info="Not checked"
For all combinations, dropna=False means that nans are not grouped with other modalities in that combination (as requested with dropna=False)
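The viability tests discussed above can be approximated with a short function. This is a sketch of the three checks listed in the fitting recipe (representativeness, distinct target means, same ordering on both samples), using the min_freq/2 threshold stated there. The function names and returned messages are hypothetical and do not reproduce the library's own output.

```python
def target_order(stats):
    """Indices of modalities sorted by increasing target mean."""
    return sorted(range(len(stats)), key=lambda i: stats[i][0])


def is_viable(train, dev, min_freq):
    """Check per-modality (target_mean, frequency) pairs given in modality order."""
    threshold = min_freq / 2  # threshold as described in the fitting recipe
    for name, stats in (("train", train), ("dev", dev)):
        if any(freq < threshold for _, freq in stats):
            return False, f"non-representative modality on {name}"
        means = [mean for mean, _ in stats]
        if any(a == b for a, b in zip(means, means[1:])):
            return False, f"identical consecutive target means on {name}"
    if target_order(train) != target_order(dev):
        return False, "target means ordered differently on train and dev"
    return True, ""


# carved AveOccup distribution, copied from the fitting logs of this notebook
ave_occup_train = [(2.5709, 0.1501), (2.1681, 0.5001), (1.8729, 0.1998), (1.4822, 0.1501)]
ave_occup_dev = [(2.5615, 0.1461), (2.1836, 0.5129), (1.8527, 0.1869), (1.5056, 0.1541)]
print(is_viable(ave_occup_train, ave_occup_dev, min_freq=0.1))

# hypothetical dev sample where the first two modalities swap their ordering
inverted_dev = [(2.10, 0.15), (2.50, 0.50), (1.85, 0.20), (1.50, 0.15)]
print(is_viable(ave_occup_train, inverted_dev, min_freq=0.1))
```

The first call accepts the combination that the carver selected; the second shows the kind of train/dev inversion that would make a combination be skipped.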

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[23]:

auto_carver.save("continuous_carver.json")

Loading

Carvers can safely be loaded from a .json file.

[24]:

from AutoCarver import ContinuousCarver

# loading json file
auto_carver = ContinuousCarver.load('continuous_carver.json')

Applying AutoCarver

[25]:

dev_set_processed = auto_carver.transform(dev_set)

[27]:

dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

[27]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup
0.0	0.255432	0.340282	0.659278	0.558280	0.157223	0.146066
1.0	0.350851	0.414416	0.144157	0.200382	0.153553	0.512918
2.0	0.244568	0.138432	0.093511	0.095126	0.539049	0.186876
3.0	0.149149	0.106870	0.103053	0.146213	0.150176	0.154140

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using ContinuousCarver, hence all features are qualitative.

Number of features to select

The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[28]:

n_best_per_type = 6

Using Selectors

[29]:

from AutoCarver.selectors import RegressionSelector
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# select the most target associated qualitative features
feature_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set_processed, train_set_processed[target]).selected_features
best_features

 [RegressionSelector] Selected Qualitative Features

	feature	Mode	KruskalEtaSquaredMeasure	KruskalEtaSquaredRank	TschuprowtFilter	TschuprowtWith
0	Quantitative('MedInc')	0.3500	0.4365	0.0000	0.0000	itself
2	Quantitative('AveRooms')	0.6500	0.1011	1.0000	0.3854	MedInc
5	Quantitative('AveOccup')	0.5001	0.0715	2.0000	0.1620	AveRooms
3	Quantitative('AveBedrms')	0.5500	0.0230	3.0000	0.1395	MedInc
1	Quantitative('HouseAge')	0.4147	0.0114	4.0000	0.1345	AveRooms
4	Quantitative('Population')	0.5498	0.0009	5.0000	0.1464	HouseAge

[29]:

Features(['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population'])
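The TschuprowtFilter column in the selector output measures association between two carved features. For reference, the textbook Tschuprow's T is sqrt(chi2 / (n * sqrt((r-1)(c-1)))), where chi2 is the chi-square statistic of the r x c contingency table. The sketch below implements that textbook formula with NumPy; whether the library applies any correction on top of it is not confirmed here, and `tschuprow_t` is a hypothetical helper name.

```python
import numpy as np


def tschuprow_t(x, y):
    """Textbook Tschuprow's T between two categorical 1-D arrays (each with 2+ categories)."""
    x_codes = np.unique(np.asarray(x), return_inverse=True)[1]
    y_codes = np.unique(np.asarray(y), return_inverse=True)[1]
    n_rows, n_cols = int(x_codes.max()) + 1, int(y_codes.max()) + 1

    # contingency table of observed counts
    observed = np.zeros((n_rows, n_cols))
    np.add.at(observed, (x_codes, y_codes), 1)

    # expected counts under independence, then the chi-square statistic
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()

    return float(np.sqrt(chi2 / (n * np.sqrt((n_rows - 1) * (n_cols - 1)))))


# two toy carved features with 4 modalities each (hypothetical data)
a = np.tile([0, 1, 2, 3], 48)    # 0, 1, 2, 3, 0, 1, ...
b = np.repeat([0, 1, 2, 3], 48)  # 0, 0, ..., 1, 1, ...

print(tschuprow_t(a, a))  # identical features: maximal association
print(tschuprow_t(a, b))  # balanced, unrelated features: no association
```

T ranges from 0 (independence) to 1 (perfect association between features with the same number of modalities), which makes values such as those in the table comparable across feature pairs.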

[30]:

train_set_processed[best_features].head()

[30]:

	MedInc	AveRooms	AveOccup	AveBedrms	HouseAge	Population
5088	0.0	0.0	1.0	2.0	0.0	1.0
17096	2.0	1.0	1.0	1.0	1.0	2.0
5617	1.0	0.0	3.0	1.0	2.0	2.0
20060	0.0	0.0	3.0	0.0	1.0	2.0
895	2.0	0.0	1.0	1.0	0.0	3.0

Feature MedInc is the most associated with the target MedHouseVal:
- Its Kruskal-Wallis eta-squared is KruskalEtaSquaredMeasure=0.4365 (KruskalEtaSquaredRank=0), for a Kruskal-Wallis’ H of 6037.1821 in the carving summary
- It has no NaNs (this dataset contains none)
- Its mode represents 35 % of observed data (Mode=0.3500)
Feature AveRooms is strongly associated with feature MedInc:
- Tschuprow’s T value is TschuprowtFilter=0.3854 for TschuprowtWith=MedInc
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
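The selector table reports KruskalEtaSquaredMeasure, while the carving summary reports the Kruskal-Wallis H statistic. The two can be linked with the usual eta-squared effect size for Kruskal-Wallis, eta2 = (H - k + 1) / (n - k), with k modalities and n observations. The sketch below applies it to the H values and counts from the carving summary; it happens to reproduce the table values to four decimals, but that the library computes the measure exactly this way is an assumption.

```python
def kruskal_eta_squared(h, n_groups, n_obs):
    """Eta-squared effect size derived from a Kruskal-Wallis H statistic."""
    return (h - n_groups + 1) / (n_obs - n_groups)


# number of training rows: sum of the MedInc modality counts in the carving summary
n_train = 3457 + 4840 + 3456 + 2075

# Kruskal-Wallis H per carved feature, copied from the carving summary (4 modalities each)
kruskal_h = {
    "MedInc": 6037.182135,
    "AveRooms": 1401.052572,
    "AveOccup": 991.408301,
    "AveBedrms": 320.789845,
    "HouseAge": 160.599610,
    "Population": 16.109709,
}

eta_squared = {
    name: round(kruskal_eta_squared(h, 4, n_train), 4) for name, h in kruskal_h.items()
}
print(eta_squared)
```

Unlike H, which grows with the sample size, eta-squared stays between 0 and 1 and can be read as a share of rank variance explained, which makes features easier to compare.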

Modeling

Fitting model on train data

[31]:

from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(train_set_processed[best_features], train_set_processed[target])

[31]:

XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=None, num_parallel_tree=None, ...)

Saving model

[24]:

model.save_model("regression_xgboost.json")

Prediction on dev dataset and performance

[32]:

from sklearn.metrics import root_mean_squared_error

dev_pred = model.predict(dev_set_processed[best_features])
root_mean_squared_error(dev_set_processed[target], dev_pred)

[32]:

0.7807707240408291
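An RMSE is easier to judge against a naive baseline. A constant prediction equal to the training mean scores roughly the standard deviation of the target, which is about 1.15 on the full dataset according to the describe() output earlier. The sketch below shows the comparison on tiny hypothetical arrays; `rmse` is a hand-written helper, and the numbers are not results from this notebook.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# hypothetical stand-ins for the train target, dev target and model predictions
y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_dev = np.array([1.5, 2.5, 3.5])
y_pred = np.array([1.4, 2.7, 3.3])

# baseline: always predict the training mean
baseline_score = rmse(y_dev, np.full_like(y_dev, y_train.mean()))
model_score = rmse(y_dev, y_pred)

print(f"baseline RMSE: {baseline_score:.4f}")
print(f"model RMSE:    {model_score:.4f}")
```

With the real data, the same comparison would use train_set_processed[target].mean() as the constant prediction on the dev sample.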

What’s next?

Thanks to Carvers all of your features are now optimally processed for your regression task!
As a final step towards your model, Selectors can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out Selectors Examples!

Well done!

Your commitment to achieving optimal results in continuous regression tasks shines through in your meticulous use of AutoCarver’s ContinuousCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The ContinuousCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in continuous regression tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.