Setting things up

About this notebook

This notebook shows how to preprocess the California Housing Prices dataset with AutoCarver’s ContinuousCarver. ContinuousCarver is a Python tool that discretizes features, whether quantitative or qualitative, so as to maximize their association with a continuous target. Our goal is to prepare the dataset for a continuous regression task: predicting median house values.

The dataset describes California housing districts through features such as median income, median house age, average numbers of rooms and bedrooms, population, and location. All of them are quantitative, so only the quantitative side of ContinuousCarver is exercised here.

Throughout the notebook, we walk through each stage of ContinuousCarver’s discretization pipeline and look at how features such as median income or average occupancy are grouped into a small number of robust modalities.

After carving, we select the features most associated with the target and fit a regression model on them.

Let’s dive in.

Installation

[1]:
# %pip install AutoCarver[jupyter]

California Housing Prices Data

In this example notebook, we will use the California Housing Prices dataset.

The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.

Comprising housing-related metrics for various districts in California, such as median house value, median income, housing median age, average rooms, average bedrooms, population, households, and more, the California Housing Prices dataset is a valuable resource for exploring the relationships between different features and predicting the median house values (continuous regression).

[1]:
from sklearn import datasets

# Load dataset directly from sklearn
housing = datasets.fetch_california_housing(as_frame=True)

# features are already a pandas DataFrame (as_frame=True); append the target column
housing_data = housing["data"].copy()
housing_data[housing["target_names"][0]] = housing["target"]

# Display the first few rows of the dataset
housing_data.head()
[1]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Target type and Carver selection

[2]:
target = "MedHouseVal"

housing_data[target].describe()
[2]:
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target "MedHouseVal" is a continuous target of type float64 used in a regression task. Hence we will use AutoCarver.ContinuousCarver and AutoCarver.selectors.RegressionSelector in the following code blocks.
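The reasoning above (a float target with many distinct values calls for a regression setup) can be sketched as a small heuristic. This is not part of AutoCarver: the function name `guess_task_type` and the `max_classes` cutoff are assumptions made purely for illustration.

```python
def guess_task_type(target_values, max_classes=10):
    """Heuristically classify a target as 'binary', 'multiclass' or 'continuous'.

    max_classes is an arbitrary, hypothetical cutoff chosen for illustration.
    """
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary"
    all_integral = all(float(v).is_integer() for v in distinct)
    if all_integral and len(distinct) <= max_classes:
        return "multiclass"
    return "continuous"


# MedHouseVal values of the first five rows displayed earlier
sample_target = [4.526, 3.585, 3.521, 3.413, 3.422]
print(guess_task_type(sample_target))  # continuous
```

A "continuous" answer corresponds to the ContinuousCarver and RegressionSelector pairing used in this notebook.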

Data Sampling

[3]:
from sklearn.model_selection import train_test_split

# random split (the target is continuous, so no stratification is applied)
train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)

# checking the target mean per dataset
train_set[target].mean(), dev_set[target].mean()
[3]:
(np.float64(2.0666362048018514), np.float64(2.072459655020552))

Picking up columns to Carve

[4]:
train_set.head()
[4]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude MedHouseVal
5088 0.9809 19.0 3.187726 1.129964 726.0 2.620939 33.98 -118.28 1.214
17096 4.2232 33.0 6.189696 1.086651 1015.0 2.377049 37.46 -122.23 3.637
5617 3.5488 42.0 4.821577 1.095436 1044.0 4.331950 33.79 -118.26 2.056
20060 1.6469 24.0 4.274194 1.048387 1686.0 4.532258 35.87 -119.26 0.476
895 3.9909 14.0 4.608303 1.089350 2738.0 2.471119 37.54 -121.96 2.360
[5]:
# column data types
train_set.dtypes
[5]:
MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object

All features are quantitative continuous features, with the exception of Latitude and Longitude, which are geographical features (not supported by AutoCarver as is). All other features are passed to Features as numericals.

[6]:
from AutoCarver import Features

# lists of features per data type
features = Features(numericals=["MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population", "AveOccup"])

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq sets the minimum frequency of each basic modality. It is used:

  • For quantitative features, to define the number of quantiles used for the initial discretization.

  • For qualitative features, to define the threshold under which a modality is grouped with either a default value or its closest modality.

[7]:
min_freq = 0.1

Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
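To make the link between min_freq and the number of quantiles concrete, here is a minimal sketch on synthetic data using only the standard library. The relation `n_quantiles = 1 / (min_freq / 2)` is an assumption inferred from the fitting log of this notebook, where min_freq=0.1 produces 20 raw bins of roughly 5 % each; it is not taken from AutoCarver's code.

```python
import random
import statistics

min_freq = 0.1
# Assumption inferred from the fitting log: raw bins hold about min_freq / 2 of the data
n_quantiles = round(1 / (min_freq / 2))

random.seed(0)
# synthetic, right-skewed sample standing in for a feature such as MedInc
values = [random.lognormvariate(1.2, 0.5) for _ in range(10_000)]

# n_quantiles - 1 cut points split the sample into n_quantiles equal-frequency bins
cut_points = statistics.quantiles(values, n=n_quantiles)

# share of observations at or below the first cut point
first_bin_freq = sum(v <= cut_points[0] for v in values) / len(values)
print(n_quantiles, len(cut_points), round(first_bin_freq, 2))
```

A smaller min_freq gives more, narrower raw bins, which is why it is slower and more precise but less robust.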

Desired number of modalities

The attribute max_n_mod sets the maximum number of modalities per carved feature. Carvers use it as the upper limit on the number of modalities in each consecutive combination of modalities.

[8]:
max_n_mod = 4

Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
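The speed trade-off in the tip can be illustrated by counting how many consecutive groupings exist for a given max_n_mod. The sketch below assumes 20 raw modalities (as in the fitting log of this notebook) and counts groupings into 2 to max_n_mod consecutive groups. Whether AutoCarver enumerates exactly this set is an assumption, and `n_consecutive_groupings` is a hypothetical helper.

```python
from math import comb


def n_consecutive_groupings(n_raw, max_n_mod):
    """Count ways to merge n_raw ordered modalities into 2..max_n_mod consecutive groups."""
    # choosing k - 1 cut positions among the n_raw - 1 gaps yields k consecutive groups
    return sum(comb(n_raw - 1, k - 1) for k in range(2, max_n_mod + 1))


for max_n_mod in (3, 4, 7):
    print(max_n_mod, n_consecutive_groupings(20, max_n_mod))
# 3 190
# 4 1159
# 7 43795
```

The count grows quickly with max_n_mod, which is consistent with higher values being slower.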

Grouping NaNs

The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-nan values, and then test out all possible combinations with nan.

[9]:
dropna = False  # there are no NaNs in this dataset anyway

Type of output carved features

The attribute ordinal_encoding allows one to choose the output type:

  • Use True for integer output of ranked modalities (default)

  • Use False for string output of modalities

[10]:
ordinal_encoding = True
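The two output types can be mimicked with a few lines of standard-library Python. The bin edges and labels below are copied (rounded as displayed) from the MedInc carving reported later in this notebook. The `encode` helper is hypothetical and only imitates the shape of the output, not AutoCarver's implementation.

```python
from bisect import bisect_left

# upper bounds of the carved MedInc bins, as displayed in the fitting log
edges = [2.57, 3.97, 5.54]
labels = [
    "x <= 2.57e+00",
    "2.57e+00 < x <= 3.97e+00",
    "3.97e+00 < x <= 5.54e+00",
    "5.54e+00 < x",
]


def encode(value, ordinal_encoding=True):
    """Return the rank of the bin containing value, or its string label."""
    rank = bisect_left(edges, value)  # bins are right-closed: x <= edge
    return rank if ordinal_encoding else labels[rank]


print(encode(0.9809), "|", encode(0.9809, ordinal_encoding=False))  # 0 | x <= 2.57e+00
print(encode(4.2232), "|", encode(4.2232, ordinal_encoding=False))  # 2 | 3.97e+00 < x <= 5.54e+00
```

The two MedInc values used here are those of rows 5088 and 17096 of train_set.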

Fitting AutoCarver

  • First, all quantitative features are discretized:

    1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)

    2. Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2), which are grouped with their closest modality

  • Second, all features are carved following this recipe:

    1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step

    2. Grouping modalities: all consecutive combinations of modalities are applied to train_set

    3. Computing associations: the association metric (Kruskal-Wallis’ statistic, by default) is computed with the provided train_set[target]

    4. Combinations are sorted in descending order by association value

    5. Testing robustness: finds the first combination that checks the following:

      • Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)

      • Distinct target means per consecutive modalities on train_set and dev_set

      • No inversion of target means between train_set and dev_set (same ordering of modalities by target mean)

    6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run

    7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
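Steps 2 to 4 of the recipe above (grouping, computing associations, sorting) can be sketched in pure Python on toy data. This is an illustration under stated assumptions, not AutoCarver's implementation: the helpers `average_ranks`, `kruskal_h` and `consecutive_groupings` are hypothetical, the H statistic omits the tie correction, and the robustness tests of step 5 are left out.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kruskal_h(samples):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [v for sample in samples for v in sample]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, offset = 0.0, 0
    for sample in samples:
        rank_sum = sum(ranks[offset:offset + len(sample)])
        total += rank_sum ** 2 / len(sample)
        offset += len(sample)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def consecutive_groupings(n_bins, max_groups):
    """Yield every split of bins 0..n_bins-1 into 2..max_groups consecutive groups."""
    for k in range(2, max_groups + 1):
        for cuts in combinations(range(1, n_bins), k - 1):
            bounds = (0, *cuts, n_bins)
            yield [list(range(a, b)) for a, b in zip(bounds, bounds[1:])]


# toy target values observed in each of 5 ordered raw bins
bins = [[1.0, 1.2], [1.1, 1.3], [2.0, 2.2], [2.1, 2.3], [3.5, 3.7]]

scored = []
for grouping in consecutive_groupings(len(bins), max_groups=3):
    samples = [[v for b in group for v in bins[b]] for group in grouping]
    scored.append((kruskal_h(samples), grouping))

# most associated grouping first
scored.sort(key=lambda item: item[0], reverse=True)
best_h, best_grouping = scored[0]
print(best_grouping, round(best_h, 4))  # [[0, 1], [2, 3], [4]] 7.8545
```

In the full recipe, the sorted candidates would then be tested one by one for robustness on train_set and dev_set, and the first viable one would be kept.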

[11]:
from AutoCarver import ContinuousCarver
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# initiating AutoCarver
auto_carver = ContinuousCarver(
    features=features,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    config=ProcessingConfig(dropna=dropna, ordinal_encoding=ordinal_encoding, verbose=True, copy=True),
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
--- [QuantitativeDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [ContinuousDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
 - [OrdinalDiscretizer] Fit Features(['HouseAge'])
------

---------
------ [ContinuousCarver] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
--- [ContinuousCarver] Fit Quantitative('MedInc') (1/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.60e+00 1.1102 0.0500 692
1.60e+00 < x <= 1.91e+00 1.1285 0.0500 691
1.91e+00 < x <= 2.15e+00 1.2198 0.0500 692
2.15e+00 < x <= 2.35e+00 1.3171 0.0500 691
2.35e+00 < x <= 2.57e+00 1.3817 0.0500 691
2.57e+00 < x <= 2.74e+00 1.5409 0.0500 692
2.74e+00 < x <= 2.98e+00 1.6159 0.0500 692
2.98e+00 < x <= 3.14e+00 1.6906 0.0499 690
3.14e+00 < x <= 3.32e+00 1.8232 0.0500 692
3.32e+00 < x <= 3.54e+00 1.9059 0.0500 691
3.54e+00 < x <= 3.73e+00 2.0076 0.0502 694
3.73e+00 < x <= 3.97e+00 2.0271 0.0498 689
3.97e+00 < x <= 4.18e+00 2.1456 0.0500 691
4.18e+00 < x <= 4.46e+00 2.2433 0.0500 691
4.46e+00 < x <= 4.76e+00 2.3621 0.0501 693
4.76e+00 < x <= 5.12e+00 2.3986 0.0499 690
5.12e+00 < x <= 5.54e+00 2.6438 0.0500 691
5.54e+00 < x <= 6.16e+00 2.9324 0.0500 692
6.16e+00 < x <= 7.32e+00 3.4592 0.0500 691
7.32e+00 < x 4.3784 0.0500 692
X_dev distribution
target_mean frequency count
1.1017 0.0509 347
1.0410 0.0502 342
1.2407 0.0501 341
1.2919 0.0506 345
1.4676 0.0536 365
1.5605 0.0417 284
1.6280 0.0584 398
1.7519 0.0471 321
1.8443 0.0504 343
1.8500 0.0498 339
2.0040 0.0533 363
2.0890 0.0502 342
2.1641 0.0505 344
2.2700 0.0540 368
2.3768 0.0439 299
2.5087 0.0479 326
2.6814 0.0483 329
2.9805 0.0479 326
3.3748 0.0530 361
4.3748 0.0483 329
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.57e+00 1.2314 0.2500 3457
2.57e+00 < x <= 3.97e+00 1.8016 0.3500 4840
3.97e+00 < x <= 5.54e+00 2.3587 0.2499 3456
5.54e+00 < x 3.5900 0.1501 2075
X_dev distribution
target_mean frequency count
1.2315 0.2554 1740
1.8222 0.3509 2390
2.3953 0.2446 1666
3.5721 0.1491 1016
--- [ContinuousCarver] Fit Quantitative('HouseAge') (2/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 8.00e+00 2.1158 0.0537 742
8.00e+00 < x <= 1.20e+01 1.8220 0.0477 659
1.20e+01 < x <= 1.50e+01 1.8590 0.0613 847
1.50e+01 < x <= 1.80e+01 1.9547 0.0989 1367
1.80e+01 < x <= 2.20e+01 1.9739 0.0871 1205
2.20e+01 < x <= 2.50e+01 2.1055 0.0705 975
2.50e+01 < x <= 2.80e+01 2.0512 0.0775 1072
2.80e+01 < x <= 3.10e+01 2.0439 0.0682 943
3.10e+01 < x <= 3.30e+01 2.0275 0.0575 795
3.30e+01 < x <= 3.50e+01 2.0651 0.0722 999
3.50e+01 < x <= 3.70e+01 2.0750 0.0687 950
3.70e+01 < x <= 4.20e+01 2.0102 0.0811 1121
4.20e+01 < x <= 4.50e+01 2.1301 0.0485 670
4.50e+01 < x 2.4785 0.1072 1483
X_dev distribution
target_mean frequency count
2.0205 0.0526 358
1.7827 0.0443 302
1.8780 0.0556 379
1.9391 0.0986 672
2.0285 0.0891 607
2.1179 0.0759 517
2.1634 0.0743 506
1.9546 0.0664 452
2.0512 0.0565 385
2.1862 0.0755 514
2.0747 0.0659 449
2.0174 0.0895 610
2.0015 0.0489 333
2.4651 0.1069 728
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.20e+01 1.9494 0.3486 4820
2.20e+01 < x <= 3.70e+01 2.0623 0.4147 5734
3.70e+01 < x <= 4.50e+01 2.0550 0.1295 1791
4.50e+01 < x 2.4785 0.1072 1483
X_dev distribution
target_mean frequency count
1.9447 0.3403 2318
2.0964 0.4144 2823
2.0118 0.1384 943
2.4651 0.1069 728
--- [ContinuousCarver] Fit Quantitative('AveRooms') (3/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 3.44e+00 1.9126 0.0500 692
3.44e+00 < x <= 3.79e+00 1.8286 0.0500 691
3.79e+00 < x <= 4.06e+00 1.8169 0.0500 692
4.06e+00 < x <= 4.28e+00 1.8418 0.0500 691
4.28e+00 < x <= 4.46e+00 1.7529 0.0500 691
4.46e+00 < x <= 4.62e+00 1.7915 0.0500 692
4.62e+00 < x <= 4.79e+00 1.8214 0.0500 691
4.79e+00 < x <= 4.94e+00 1.7685 0.0500 691
4.94e+00 < x <= 5.09e+00 1.7466 0.0500 692
5.09e+00 < x <= 5.23e+00 1.7717 0.0500 691
5.23e+00 < x <= 5.38e+00 1.8664 0.0500 691
5.38e+00 < x <= 5.53e+00 1.8472 0.0500 692
5.53e+00 < x <= 5.69e+00 1.9199 0.0500 691
5.69e+00 < x <= 5.86e+00 1.9910 0.0500 691
5.86e+00 < x <= 6.06e+00 2.0870 0.0500 692
6.06e+00 < x <= 6.27e+00 2.1908 0.0500 691
6.27e+00 < x <= 6.54e+00 2.4050 0.0500 691
6.54e+00 < x <= 6.95e+00 2.6874 0.0500 692
6.95e+00 < x <= 7.65e+00 3.1129 0.0500 691
7.65e+00 < x 3.1718 0.0500 692
X_dev distribution
target_mean frequency count
1.8659 0.0518 353
1.8728 0.0505 344
1.7627 0.0524 357
1.8020 0.0543 370
1.7223 0.0552 376
1.6802 0.0452 308
1.7707 0.0530 361
1.8030 0.0443 302
1.8209 0.0523 356
1.8326 0.0437 298
1.7923 0.0550 375
1.9388 0.0514 350
1.9465 0.0501 341
2.0248 0.0468 319
2.1049 0.0483 329
2.2239 0.0490 334
2.4339 0.0467 318
2.7667 0.0468 319
3.1001 0.0548 373
3.2429 0.0483 329
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 5.69e+00 1.8220 0.6500 8988
5.69e+00 < x <= 6.27e+00 2.0896 0.1500 2074
6.27e+00 < x <= 6.95e+00 2.5463 0.1000 1383
6.95e+00 < x 3.1424 0.1000 1383
X_dev distribution
target_mean frequency count
1.8162 0.6593 4491
2.1194 0.1442 982
2.6006 0.0935 637
3.1670 0.1031 702
--- [ContinuousCarver] Fit Quantitative('AveBedrms') (4/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 9.4000e-01 2.0684 0.0500 692
9.4000e-01 < x <= 9.6724e-01 2.0735 0.0500 691
9.6724e-01 < x <= 9.8319e-01 2.2167 0.0501 693
9.8319e-01 < x <= 9.9576e-01 2.1706 0.0499 690
9.9576e-01 < x <= 1.0066e+00 2.1310 0.0500 692
1.0066e+00 < x <= 1.0154e+00 2.2358 0.0500 691
1.0154e+00 < x <= 1.0247e+00 2.1668 0.0500 691
1.0247e+00 < x <= 1.0331e+00 2.2102 0.0500 692
1.0331e+00 < x <= 1.0412e+00 2.1295 0.0500 691
1.0412e+00 < x <= 1.0495e+00 2.1548 0.0500 691
1.0495e+00 < x <= 1.0576e+00 2.1238 0.0500 692
1.0576e+00 < x <= 1.0665e+00 2.1025 0.0500 691
1.0665e+00 < x <= 1.0768e+00 2.0704 0.0500 691
1.0768e+00 < x <= 1.0878e+00 2.0664 0.0501 693
1.0878e+00 < x <= 1.1003e+00 2.1118 0.0499 690
1.1003e+00 < x <= 1.1161e+00 1.9937 0.0500 691
1.1161e+00 < x <= 1.1382e+00 1.9405 0.0500 691
1.1382e+00 < x <= 1.1738e+00 1.7990 0.0500 692
1.1738e+00 < x <= 1.2732e+00 1.9162 0.0500 691
1.2732e+00 < x 1.6515 0.0500 692
X_dev distribution
target_mean frequency count
2.0416 0.0539 367
2.2043 0.0527 359
2.0997 0.0482 328
2.1835 0.0487 332
2.2628 0.0552 376
2.1619 0.0480 327
2.2295 0.0567 386
2.1690 0.0493 336
2.1581 0.0528 360
2.1202 0.0476 324
2.1039 0.0452 308
2.1595 0.0509 347
2.1037 0.0521 355
2.0662 0.0484 330
2.0487 0.0489 333
1.9543 0.0467 318
1.8871 0.0484 330
1.8680 0.0499 340
1.8371 0.0465 317
1.7182 0.0498 339
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 1.058e+00 2.1528 0.5500 7606
1.058e+00 < x <= 1.100e+00 2.0878 0.2000 2765
1.100e+00 < x <= 1.138e+00 1.9671 0.0999 1382
1.138e+00 < x 1.7888 0.1501 2075
X_dev distribution
target_mean frequency count
2.1597 0.5583 3803
2.0954 0.2004 1365
1.9201 0.0951 648
1.8072 0.1462 996
--- [ContinuousCarver] Fit Quantitative('Population') (5/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 3.53e+02 1.9859 0.0501 693
3.53e+02 < x <= 5.14e+02 2.1616 0.0501 693
5.14e+02 < x <= 6.27e+02 2.1117 0.0501 693
6.27e+02 < x <= 7.15e+02 2.2819 0.0497 687
7.15e+02 < x <= 7.93e+02 2.0335 0.0509 704
7.93e+02 < x <= 8.64e+02 2.2113 0.0492 681
8.64e+02 < x <= 9.38e+02 2.0772 0.0498 689
9.38e+02 < x <= 1.02e+03 2.1386 0.0500 692
1.02e+03 < x <= 1.09e+03 2.0430 0.0503 696
1.09e+03 < x <= 1.17e+03 2.0506 0.0496 686
1.17e+03 < x <= 1.26e+03 2.0870 0.0505 698
1.26e+03 < x <= 1.35e+03 2.0195 0.0497 687
1.35e+03 < x <= 1.46e+03 2.0004 0.0502 694
1.46e+03 < x <= 1.58e+03 2.1102 0.0498 688
1.58e+03 < x <= 1.73e+03 2.0346 0.0500 691
1.73e+03 < x <= 1.91e+03 1.9139 0.0499 690
1.91e+03 < x <= 2.15e+03 2.0006 0.0500 691
2.15e+03 < x <= 2.56e+03 2.0707 0.0500 692
2.56e+03 < x <= 3.30e+03 1.9614 0.0500 691
3.30e+03 < x 2.0428 0.0500 692
X_dev distribution
target_mean frequency count
1.9012 0.0530 361
2.1915 0.0520 354
2.1706 0.0523 356
2.1062 0.0514 350
2.2019 0.0531 362
2.1765 0.0490 334
2.2025 0.0506 345
2.1329 0.0553 377
2.1744 0.0437 298
2.1319 0.0480 327
1.9939 0.0534 364
2.0096 0.0465 317
1.9569 0.0465 317
1.9756 0.0504 343
2.0815 0.0496 338
2.0272 0.0461 314
1.9789 0.0487 332
1.9355 0.0496 338
2.0714 0.0518 353
2.0157 0.0487 332
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 6.27e+02 2.0864 0.1503 2079
6.27e+02 < x <= 8.64e+02 2.1743 0.1498 2072
8.64e+02 < x <= 2.15e+03 2.0433 0.5498 7602
2.15e+03 < x 2.0250 0.1501 2075
X_dev distribution
target_mean frequency count
2.0867 0.1572 1071
2.1618 0.1536 1046
2.0607 0.5390 3672
2.0084 0.1502 1023
--- [ContinuousCarver] Fit Quantitative('AveOccup') (6/6)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.870e+00 2.7122 0.0500 692
1.870e+00 < x <= 2.067e+00 2.6633 0.0500 691
2.067e+00 < x <= 2.225e+00 2.3373 0.0500 692
2.225e+00 < x <= 2.338e+00 2.3080 0.0500 691
2.338e+00 < x <= 2.432e+00 2.1976 0.0500 691
2.432e+00 < x <= 2.513e+00 2.2064 0.0500 692
2.513e+00 < x <= 2.595e+00 2.1736 0.0500 691
2.595e+00 < x <= 2.668e+00 2.1862 0.0500 691
2.668e+00 < x <= 2.743e+00 2.1378 0.0500 692
2.743e+00 < x <= 2.820e+00 2.1902 0.0500 691
2.820e+00 < x <= 2.898e+00 2.1824 0.0500 691
2.898e+00 < x <= 2.984e+00 2.0741 0.0500 692
2.984e+00 < x <= 3.073e+00 2.0255 0.0501 693
3.073e+00 < x <= 3.171e+00 1.9914 0.0498 689
3.171e+00 < x <= 3.282e+00 1.8992 0.0500 692
3.282e+00 < x <= 3.425e+00 1.8926 0.0500 691
3.425e+00 < x <= 3.607e+00 1.7085 0.0500 691
3.607e+00 < x <= 3.877e+00 1.5666 0.0500 692
3.877e+00 < x <= 4.325e+00 1.4505 0.0500 691
4.325e+00 < x 1.4294 0.0500 692
X_dev distribution
target_mean frequency count
2.7684 0.0484 330
2.5334 0.0435 296
2.3989 0.0542 369
2.3641 0.0533 363
2.2272 0.0546 372
2.2969 0.0489 333
2.3179 0.0508 346
2.0793 0.0467 318
2.1847 0.0521 355
2.1752 0.0504 343
2.0762 0.0533 363
2.0535 0.0501 341
2.0535 0.0528 360
1.9477 0.0458 312
1.8397 0.0449 306
1.8861 0.0514 350
1.7301 0.0448 305
1.6200 0.0499 340
1.4423 0.0527 359
1.4596 0.0515 351
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.22e+00 2.5709 0.1501 2075
2.22e+00 < x <= 3.07e+00 2.1681 0.5001 6915
3.07e+00 < x <= 3.61e+00 1.8729 0.1998 2763
3.61e+00 < x 1.4822 0.1501 2075
X_dev distribution
target_mean frequency count
2.5615 0.1461 995
2.1836 0.5129 3494
1.8527 0.1869 1273
1.5056 0.1541 1050

AutoCarver analysis

Carving Summary

[20]:
auto_carver.summary
[20]:
feature                      count      kruskal  n_mod  label  content                     target_mean  frequency  dropped  dropped_reason
Quantitative('MedInc')      3457.0  6037.182135      4      0  x <= 2.57e+00                  1.231421   0.250000  False    None
                            4840.0  6037.182135      4      1  2.57e+00 < x <= 3.97e+00       1.801562   0.350014  False    None
                            3456.0  6037.182135      4      2  3.97e+00 < x <= 5.54e+00       2.358660   0.249928  False    None
                            2075.0  6037.182135      4      3  5.54e+00 < x                   3.590040   0.150058  False    None
Quantitative('HouseAge')    4820.0   160.599610      4      0  x <= 2.20e+01                  1.949361   0.348568  False    None
                            5734.0   160.599610      4      1  2.20e+01 < x <= 3.70e+01       2.062306   0.414666  False    None
                            1791.0   160.599610      4      2  3.70e+01 < x <= 4.50e+01       2.055043   0.129520  False    None
                            1483.0   160.599610      4      3  4.50e+01 < x                   2.478542   0.107246  False    None
Quantitative('AveRooms')    8988.0  1401.052572      4      0  x <= 5.69e+00                  1.821999   0.649986  False    None
                            2074.0  1401.052572      4      1  5.69e+00 < x <= 6.27e+00       2.089595   0.149986  False    None
                            1383.0  1401.052572      4      2  6.27e+00 < x <= 6.95e+00       2.546315   0.100014  False    None
                            1383.0  1401.052572      4      3  6.95e+00 < x                   3.142406   0.100014  False    None
Quantitative('AveBedrms')   7606.0   320.789845      4      0  x <= 1.058e+00                 2.152832   0.550043  False    None
                            2765.0   320.789845      4      1  1.058e+00 < x <= 1.100e+00     2.087773   0.199957  False    None
                            1382.0   320.789845      4      2  1.100e+00 < x <= 1.138e+00     1.967066   0.099942  False    None
                            2075.0   320.789845      4      3  1.138e+00 < x                  1.788831   0.150058  False    None
Quantitative('Population')  2079.0    16.109709      4      0  x <= 6.27e+02                  2.086394   0.150347  False    None
                            2072.0    16.109709      4      1  6.27e+02 < x <= 8.64e+02       2.174297   0.149841  False    None
                            7602.0    16.109709      4      2  8.64e+02 < x <= 2.15e+03       2.043255   0.549754  False    None
                            2075.0    16.109709      4      3  2.15e+03 < x                   2.024995   0.150058  False    None
Quantitative('AveOccup')    2075.0   991.408301      4      0  x <= 2.22e+00                  2.570888   0.150058  False    None
                            6915.0   991.408301      4      1  2.22e+00 < x <= 3.07e+00       2.168126   0.500072  False    None
                            2763.0   991.408301      4      2  3.07e+00 < x <= 3.61e+00       1.872867   0.199812  False    None
                            2075.0   991.408301      4      3  3.61e+00 < x                   1.482183   0.150058  False    None
  • As requested with ordinal_encoding=True, output labels are integers of modalities

  • For quantitative feature Population, the selected combination of modalities groups populations as follows:

    • label 0: lower than or equal to 627 people (content="x <= 6.27e+02")

    • label 1: greater than 627 people and lower than or equal to 864 people (content="6.27e+02 < x <= 8.64e+02")

    • label 2: greater than 864 people and lower than or equal to 2150 people (content="8.64e+02 < x <= 2.15e+03")

    • label 3: greater than 2150 people (content="2.15e+03 < x")
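When the content strings of the summary need to be reused as numbers, they can be parsed into bounds. The helper below is a hypothetical convenience written for this notebook's label format ("x <= a", "a < x <= b", "a < x"); it is not an AutoCarver function.

```python
import math


def parse_interval(content):
    """Turn a label such as '6.27e+02 < x <= 8.64e+02' into (low, high) bounds."""
    low, high = -math.inf, math.inf
    left, _, right = content.partition("x")
    if "<" in left:  # a lower bound precedes x
        low = float(left.replace("<", "").strip())
    if "<=" in right:  # an upper bound follows x
        high = float(right.replace("<=", "").strip())
    return low, high


# content labels of Population, copied from the carving summary
population_contents = [
    "x <= 6.27e+02",
    "6.27e+02 < x <= 8.64e+02",
    "8.64e+02 < x <= 2.15e+03",
    "2.15e+03 < x",
]
for label, content in enumerate(population_contents):
    print(label, parse_interval(content))
```

Each interval is open on the left and closed on the right, matching the "a < x <= b" notation of the labels.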

Detailed overview of tested combinations

[21]:
features["AveOccup"].history.head(7)
[21]:
info kruskal combination n_mod dropna train viable dev
0 Raw distribution (n_mod=20>max_n_mod=4) 1062.072498 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 20 False NaN NaN NaN
1 Not viable 994.514410 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False {'viable': True, 'info': ''} False {'viable': False, 'info': 'Non-representative ...
2 Not viable 994.504665 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False {'viable': True, 'info': ''} False {'viable': False, 'info': 'Non-representative ...
3 Not viable 991.504255 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False {'viable': True, 'info': ''} False {'viable': False, 'info': 'Non-representative ...
4 Best for kruskal and max_n_mod=4 991.408301 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False {'viable': True, 'info': ''} True {'viable': True, 'info': ''}
[22]:
features["AveOccup"].history.dev[1]
[22]:
{'viable': False, 'info': 'Non-representative modality for min_freq=10.00%'}
  • The most associated combination of feature AveOccup (the first one tested, where info!="Raw distribution") did not pass the viability tests. Looking at history.dev:

    • "Non-representative modality for min_freq=10.00%" tells us that at least one modality of this combination is under-represented on dev_set, although the combination is viable on train_set

  • For feature AveOccup, the 4th combination is the first to pass the tests:

    • viable=True

    • info="Best for kruskal and max_n_mod=4"

    • Kruskal-Wallis’ H with MedHouseVal is 991.408301 for this combination

    • Following combinations (less associated with the target) were not tested: info="Not checked"

  • For all combinations, dropna=False means that nans are not grouped with other modalities in the combination (as requested with dropna=False)
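The viability tests recorded in history can be approximated as follows. This is a simplified sketch: `check_viability`, `ranking` and the returned info strings are hypothetical, and the min_freq/2 threshold is the one stated in the fitting recipe. The viable example uses the carved HouseAge frequencies and target means from the fitting log; the failing example uses made-up dev frequencies.

```python
def ranking(target_means):
    """Indices of modalities sorted by increasing target mean."""
    return sorted(range(len(target_means)), key=lambda i: target_means[i])


def check_viability(train, dev, min_freq):
    """train and dev are dicts holding 'frequency' and 'target_mean' lists, one entry per modality."""
    threshold = min_freq / 2  # threshold stated in the fitting recipe
    for name, sample in (("train", train), ("dev", dev)):
        if min(sample["frequency"]) < threshold:
            return {"viable": False, "info": f"Non-representative modality on {name}"}
    if ranking(train["target_mean"]) != ranking(dev["target_mean"]):
        return {"viable": False, "info": "Inversion of target means between train and dev"}
    return {"viable": True, "info": ""}


# carved HouseAge distribution, copied from the fitting log
train = {"frequency": [0.3486, 0.4147, 0.1295, 0.1072], "target_mean": [1.9494, 2.0623, 2.0550, 2.4785]}
dev = {"frequency": [0.3403, 0.4144, 0.1384, 0.1069], "target_mean": [1.9447, 2.0964, 2.0118, 2.4651]}
print(check_viability(train, dev, min_freq=0.1))

# made-up dev sample in which the last modality is too rare
rare_dev = {"frequency": [0.34, 0.41, 0.21, 0.04], "target_mean": dev["target_mean"]}
print(check_viability(train, rare_dev, min_freq=0.1))
```

The HouseAge example passes because modalities 1 and 2 swap places relative to their labels in the same way on both samples, so the ordering by target mean is identical on train and dev.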

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[23]:
auto_carver.save("continuous_carver.json")

Loading

Carvers can safely be loaded from a .json file.

[24]:
from AutoCarver import ContinuousCarver

# loading json file
auto_carver = ContinuousCarver.load('continuous_carver.json')

Applying AutoCarver

[25]:
dev_set_processed = auto_carver.transform(dev_set)
[27]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[27]:
MedInc HouseAge AveRooms AveBedrms Population AveOccup
0.0 0.255432 0.340282 0.659278 0.558280 0.157223 0.146066
1.0 0.350851 0.414416 0.144157 0.200382 0.153553 0.512918
2.0 0.244568 0.138432 0.093511 0.095126 0.539049 0.186876
3.0 0.149149 0.106870 0.103053 0.146213 0.150176 0.154140

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using ContinuousCarver, hence all features are qualitative.

Number of features to select

The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[28]:
n_best_per_type = 6

Using Selectors

[29]:
from AutoCarver.selectors import RegressionSelector
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# select the most target associated qualitative features
feature_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set_processed, train_set_processed[target]).selected_features
best_features
 [RegressionSelector] Selected Qualitative Features
  feature Nan Mode KruskalEtaSquaredMeasure KruskalEtaSquaredRank TschuprowtFilter TschuprowtWith
0 Quantitative('MedInc') 0.0000 0.3500 0.4365 0.0000 0.0000 itself
2 Quantitative('AveRooms') 0.0000 0.6500 0.1011 1.0000 0.3854 MedInc
5 Quantitative('AveOccup') 0.0000 0.5001 0.0715 2.0000 0.1620 AveRooms
3 Quantitative('AveBedrms') 0.0000 0.5500 0.0230 3.0000 0.1395 MedInc
1 Quantitative('HouseAge') 0.0000 0.4147 0.0114 4.0000 0.1345 AveRooms
4 Quantitative('Population') 0.0000 0.5498 0.0009 5.0000 0.1464 HouseAge
[29]:
Features(['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population'])
[30]:
train_set_processed[best_features].head()
[30]:
MedInc AveRooms AveOccup AveBedrms HouseAge Population
5088 0.0 0.0 1.0 2.0 0.0 1.0
17096 2.0 1.0 1.0 1.0 1.0 2.0
5617 1.0 0.0 3.0 1.0 2.0 2.0
20060 0.0 0.0 3.0 0.0 1.0 2.0
895 2.0 0.0 1.0 1.0 0.0 3.0
  • Feature MedInc is the most associated with the target MedHouseVal:

    • Its Kruskal-Wallis eta-squared is KruskalEtaSquaredMeasure=0.4365

    • It has 0 % of NaNs (Nan=0.0000)

    • Its mode represents 35 % of observed data (Mode=0.3500)

  • Feature AveRooms is strongly associated with feature MedInc:

    • Tschuprow’s T value is TschuprowtFilter=0.3854 for TschuprowtWith=MedInc

  • Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
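The two measures reported by the selector can be written out explicitly. The sketch below uses the usual eta-squared effect size derived from Kruskal-Wallis' H, eta² = (H − k + 1) / (n − k), and the textbook definition of Tschuprow's T. Applied to the H value and counts of the carving summary, the eta-squared formula reproduces the KruskalEtaSquaredMeasure column, but that AutoCarver computes it exactly this way is an assumption. Both function names are hypothetical.

```python
from math import sqrt


def kruskal_eta_squared(h, n_groups, n_obs):
    """Eta-squared effect size derived from a Kruskal-Wallis H statistic."""
    return (h - n_groups + 1) / (n_obs - n_groups)


def tschuprow_t(table):
    """Tschuprow's T for a contingency table given as a list of rows of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    return sqrt(chi2 / (n * sqrt((len(table) - 1) * (len(col_totals) - 1))))


# MedInc: H and modality counts copied from the carving summary
n_train = 3457 + 4840 + 3456 + 2075
print(round(kruskal_eta_squared(6037.182135, 4, n_train), 4))  # 0.4365

# toy 2x2 tables: perfectly associated, then independent
print(tschuprow_t([[10, 0], [0, 10]]))  # 1.0
print(tschuprow_t([[5, 5], [5, 5]]))  # 0.0
```

Eta-squared measures the association between a carved feature and the continuous target, while Tschuprow's T measures the association between two carved features.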

Modeling

Fitting model on train data

[31]:
from xgboost import XGBRegressor

model = XGBRegressor()
model.fit(train_set_processed[best_features], train_set_processed[target])
[31]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             feature_weights=None, gamma=None, grow_policy=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=None, max_bin=None, max_cat_threshold=None,
             max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
             max_leaves=None, min_child_weight=None, missing=nan,
             monotone_constraints=None, multi_strategy=None, n_estimators=None,
             n_jobs=None, num_parallel_tree=None, ...)

Saving model

[24]:
model.save_model("regression_xgboost.json")

Prediction on dev dataset and performance

[32]:
from sklearn.metrics import root_mean_squared_error

dev_pred = model.predict(dev_set_processed[best_features])
root_mean_squared_error(dev_set_processed[target], dev_pred)
[32]:
0.7807707240408291

What’s next?

  • Thanks to Carvers all of your features are now optimally processed for your regression task!

  • As a final step towards your model, Selectors can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out Selectors Examples!

Well done!

Your commitment to achieving optimal results in continuous regression tasks shines through in your meticulous use of AutoCarver’s ContinuousCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The ContinuousCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in continuous regression tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.