Setting things up

About this notebook

In this notebook we use RegressionSelector (https://autocarver.readthedocs.io/en/latest/selectors.html#regression-tasks) to quickly rank and select the features most associated with a continuous target, here the median house value of the California Housing dataset. Unlike a full carving pass, the selector is a lightweight, association-centric step: it scores every feature against the target, ranks them, and drops those too correlated with a better-ranked feature.

RegressionSelector scores every feature exactly (no sampling) yet stays fast: each measure is computed for all features of a type in a single vectorized pass. By default it uses Spearman’s rho for quantitative features and the Kruskal-Wallis eta-squared effect size for qualitative ones (the latter via a reversed test, since the target is continuous).
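
To make the default quantitative measure concrete, here is a minimal, dependency-free sketch of Spearman's rho as Pearson's r computed on average ranks. The helper names (average_ranks, pearson, spearman_rho) are illustrative and are not AutoCarver functions; the library's vectorized implementation may well differ. The five (MedInc, MedHouseVal) pairs are the first rows of the dataset loaded further down.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next sorted value is tied with this one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's r between two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))


# first five rows of the California Housing frame
med_inc = [8.3252, 8.3014, 7.2574, 5.6431, 3.8462]
med_house_val = [4.526, 3.585, 3.521, 3.413, 3.422]

rho = spearman_rho(med_inc, med_house_val)
print(round(rho, 4))
```

Because only the ranks matter, rho is insensitive to monotone rescaling of either variable, which is one reason it is a reasonable default for skewed features such as income or population.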

Installation

[1]:
# %pip install AutoCarver[jupyter]

California Housing data

The California Housing dataset ships with scikit-learn. Each row is a census block group; the target MedHouseVal is the median house value (in $100,000s) — a continuous regression target.

[1]:
from sklearn import datasets

housing = datasets.fetch_california_housing(as_frame=True).frame
housing.head()
[1]:
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422

Target type and Selector selection

[2]:
target = "MedHouseVal"

housing[target].describe()
[2]:
count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target MedHouseVal is a continuous float64 used in a regression task. Hence we use AutoCarver.selectors.RegressionSelector in the following code blocks.

Deriving a few qualitative features

The raw dataset is entirely numeric. To also illustrate qualitative feature selection, we derive two qualitative features from the geographic and age columns:

  • Region — an unordered categorical built from the latitude/longitude quadrant (NW, NE, SW, SE).

  • HouseAgeBand — an ordinal built from tertiles of HouseAge (recent < established < old).

[3]:
import numpy as np
import pandas as pd

housing["Region"] = (
    np.where(housing["Latitude"] >= housing["Latitude"].median(), "N", "S")
    + np.where(housing["Longitude"] >= housing["Longitude"].median(), "E", "W")
)
housing["HouseAgeBand"] = pd.qcut(
    housing["HouseAge"], 3, labels=["recent", "established", "old"]
).astype(str)

housing[["Region", "HouseAgeBand"]].value_counts()
[3]:
Region  HouseAgeBand
NW      recent          3741
SE      established     3650
        old             3238
NW      established     3120
        old             2974
SE      recent          2942
NE      recent           276
SW      established      227
        recent           179
NE      established      161
SW      old               77
NE      old               55
Name: count, dtype: int64
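
The counts above also show that the two derived features are not independent: NW leans towards recent housing and SE towards older housing. As a rough, dependency-free sketch, the strength of that association can be summarised with Tschuprow's T, the statistic behind the Tschuprow filter used for qualitative features later in this notebook. The helper tschuprow_t is illustrative and not part of AutoCarver; it applies the textbook definition T = sqrt(chi2 / (n * sqrt((r - 1) * (c - 1)))) to the full-dataset counts printed above.

```python
from math import sqrt

# Region x HouseAgeBand counts copied from the value_counts() output above
counts = {
    "NW": {"recent": 3741, "established": 3120, "old": 2974},
    "NE": {"recent": 276, "established": 161, "old": 55},
    "SW": {"recent": 179, "established": 227, "old": 77},
    "SE": {"recent": 2942, "established": 3650, "old": 3238},
}


def tschuprow_t(table):
    """Tschuprow's T for a contingency table given as {row: {col: count}}."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    n = sum(table[r][c] for r in rows for c in cols)
    row_totals = {r: sum(table[r][c] for c in cols) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}

    # Pearson chi-squared: squared gap between observed and expected counts
    chi2 = 0.0
    for r in rows:
        for c in cols:
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected

    return sqrt(chi2 / (n * sqrt((len(rows) - 1) * (len(cols) - 1))))


t_region_age = tschuprow_t(counts)
print(round(t_region_age, 4))
```

On these counts T comes out at roughly 0.08, a weak association. The selector computes its filter on the training split, so the value it reports need not match this full-dataset figure exactly.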

Data sampling

[4]:
from sklearn.model_selection import train_test_split

train_set, dev_set = train_test_split(housing, test_size=0.33, random_state=42)
train_set.shape, dev_set.shape
[4]:
((13828, 11), (6812, 11))

Setting up Features to select

We declare the quantitative, categorical and ordinal features to select from. MedInc, AveRooms, AveBedrms, Population and AveOccup are quantitative; Region is categorical; HouseAgeBand is ordinal (its ordering is provided explicitly).

[5]:
from AutoCarver import Features

features = Features(
    numericals=["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup"],
    categoricals=["Region"],
    ordinals={"HouseAgeBand": ["recent", "established", "old"]},
)
features
[5]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])

Feature selection

Selector settings

Number of features to select

n_best_per_type sets how many features to keep per data type (quantitative and qualitative).

[6]:
n_best_per_type = 3

Using the Selector with default measures

With no measures/filters provided, RegressionSelector uses its defaults:

  • Spearman’s rho ranks each quantitative feature against the target,

  • Kruskal-Wallis eta-squared ranks each qualitative feature (reversed test: the feature defines the groups, the continuous target is ranked),

  • NaN / Mode gates discard degenerate features, and Spearman/Tschuprow filters drop redundant ones.
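
For the qualitative measure, one common way to turn the Kruskal-Wallis H statistic into an effect size is eta-squared = (H - k + 1) / (n - k), with k groups and n observations. The sketch below assumes that definition; whether AutoCarver applies exactly this formula, or clips negative values to zero, is not confirmed here. The helper names and the toy target values per group are hypothetical.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H statistic with the usual tie correction."""
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)  # the continuous target is ranked as a whole
    weighted_rank_sums, start = 0.0, 0
    for group in groups:  # the qualitative feature defines the groups
        rank_sum = sum(ranks[start:start + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * weighted_rank_sums - 3 * (n + 1)
    ties = sum(count ** 3 - count for count in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))


def eta_squared(groups):
    """Effect size (H - k + 1) / (n - k); an assumed definition, see lead-in."""
    n = sum(len(group) for group in groups)
    k = len(groups)
    return (kruskal_h(groups) - k + 1) / (n - k)


# hypothetical target values grouped by a two-level qualitative feature
target_by_group = {
    "N": [1.2, 1.5, 1.9, 2.1],
    "S": [2.8, 3.1, 3.5, 4.0],
}
effect = eta_squared(list(target_by_group.values()))
print(round(effect, 4))
```

With perfectly separated groups, as in this toy input, the effect size is large (about 0.72). When the groups overlap completely, this formula can go slightly negative, which is why implementations often floor it at zero.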

Behavioral toggles such as verbose live in ProcessingConfig, exactly as for the carvers.

[7]:
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig
from AutoCarver import RegressionSelector

feature_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set, train_set[target]).selected_features
best_features
 [RegressionSelector] Selected Quantitative Features
   feature                        Nan    Mode  SpearmanMeasure  SpearmanRank  SpearmanFilter  SpearmanWith
0  Quantitative('MedInc')      0.0000  0.0027           0.6765        0.0000          0.0000  itself
1  Quantitative('AveRooms')    0.0000  0.0013           0.2557        1.0000          0.6398  MedInc
4  Quantitative('AveOccup')    0.0000  0.0017          -0.2552        2.0000         -0.0390  MedInc
2  Quantitative('AveBedrms')   0.0000  0.0132          -0.1277        3.0000         -0.2550  MedInc
3  Quantitative('Population')  0.0000  0.0014           0.0044        4.0000          0.2377  AveOccup
 [RegressionSelector] Selected Qualitative Features
   feature                     Nan    Mode  KruskalEtaSquaredMeasure  KruskalEtaSquaredRank  TschuprowtFilter  TschuprowtWith
0  Categorical('Region')    0.0000  0.4798                    0.0459                 0.0000            0.0000  itself
1  Ordinal('HouseAgeBand')  0.0000  0.3486                    0.0047                 1.0000            0.0842  Region
[7]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveOccup'])
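
The SpearmanRank column in the verbose output is consistent with sorting features by the absolute value of their Spearman measure and keeping the first n_best_per_type. The sketch below reproduces that reading from the printed values only. It is an interpretation of the table, not the library's code, and it ignores the filter step, which removed nothing in this run.

```python
# Spearman's rho per quantitative feature, copied from the verbose output above
spearman = {
    "MedInc": 0.6765,
    "AveRooms": 0.2557,
    "AveBedrms": -0.1277,
    "Population": 0.0044,
    "AveOccup": -0.2552,
}
n_best_per_type = 3

# rank by strength of association, ignoring the sign of the correlation
ranked = sorted(spearman, key=lambda name: abs(spearman[name]), reverse=True)
best_quantitative = ranked[:n_best_per_type]

print(ranked)
print(best_quantitative)
```

Note that AveOccup outranks AveBedrms even though both are negatively correlated with the target: under this reading the sign does not affect the rank.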

After fit, the selector exposes the selected Features as selected_features; feature_selector.transform(train_set) returns train_set restricted to the selected columns. Each feature also carries the computed statistics, for example the reversed Kruskal-Wallis eta-squared used to rank the qualitative feature Region:

[8]:
features("Region").measures
[8]:
{'Nan': {'value': np.float64(0.0),
  'threshold': 1.0,
  'valid': np.True_,
  'info': {'higher_is_better': False,
   'correlation_with': 'itself',
   'is_default': True,
   'is_absolute': False}},
 'Mode': {'value': np.float64(0.4797512293896442),
  'threshold': 1.0,
  'valid': np.True_,
  'info': {'higher_is_better': False,
   'correlation_with': 'itself',
   'is_default': True,
   'is_absolute': False}},
 'KruskalEtaSquaredMeasure': {'value': 0.04593188463880526,
  'threshold': 0.0,
  'valid': True,
  'info': {'higher_is_better': True,
   'correlation_with': 'target',
   'is_default': False,
   'is_absolute': False}},
 'KruskalEtaSquaredRank': {'value': 0,
  'threshold': -1,
  'valid': True,
  'info': {'is_default': False, 'higher_is_better': False}}}
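
The nested dictionary is easy to condense for reporting. As a small sketch, the snippet below works on a trimmed, plain-Python copy of the output above (the info sub-dictionaries are omitted and NumPy scalars are written as ordinary floats and booleans) and keeps one rounded value per statistic.

```python
# trimmed copy of the dictionary printed above, using plain Python values
region_measures = {
    "Nan": {"value": 0.0, "threshold": 1.0, "valid": True},
    "Mode": {"value": 0.4797512293896442, "threshold": 1.0, "valid": True},
    "KruskalEtaSquaredMeasure": {
        "value": 0.04593188463880526,
        "threshold": 0.0,
        "valid": True,
    },
    "KruskalEtaSquaredRank": {"value": 0, "threshold": -1, "valid": True},
}

# one rounded value per statistic, matching the columns of the verbose table
summary = {
    name: round(float(stats["value"]), 4)
    for name, stats in region_measures.items()
}
# a feature passes only if every statistic is flagged as valid
all_valid = all(stats["valid"] for stats in region_measures.values())

print(summary)
print(all_valid)
```

The rounded values match the Region row of the qualitative table printed by the selector.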

Optional: choosing the measures and filters

Measures and filters are interchangeable: provide your own to change how features are ranked and de-correlated. See the available measures and filters.

Here we:

  • rank quantitative features with Pearson’s r (instead of Spearman),

  • keep the Kruskal-Wallis eta-squared for qualitative features,

  • drop features with more than 30% missing values (NaN) or 30% outliers (Zscore),

  • de-correlate with Pearson (quantitative) and Tschuprow’s T (qualitative) filters.

[9]:
from AutoCarver.selectors import (
    KruskalEtaSquaredMeasure,
    NanMeasure,
    PearsonMeasure,
    ZscoreOutlierMeasure,
    PearsonFilter,
    TschuprowtFilter,
)

measures = [
    NanMeasure(threshold=0.3),
    ZscoreOutlierMeasure(threshold=0.3),
    PearsonMeasure(),
    KruskalEtaSquaredMeasure(),
]
filters = [PearsonFilter(threshold=0.25), TschuprowtFilter(threshold=0.25)]
[10]:
custom_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    measures=measures,
    filters=filters,
    config=ProcessingConfig(verbose=True),
)
custom_selector.fit(train_set, train_set[target]).selected_features
 [RegressionSelector] Selected Quantitative Features
   feature                       Mode     Nan  ZScore  PearsonMeasure  PearsonRank  PearsonFilter  PearsonWith
0  Quantitative('MedInc')      0.0027  0.0000  0.0163          0.6884       0.0000         0.0000  itself
2  Quantitative('AveBedrms')   0.0132  0.0000  0.0076         -0.0489       1.0000        -0.0713  MedInc
3  Quantitative('Population')  0.0014  0.0000  0.0158         -0.0244       2.0000        -0.0716  AveBedrms
4  Quantitative('AveOccup')    0.0017  0.0000  0.0004         -0.0206       3.0000         0.0759  Population
1  Quantitative('AveRooms')    0.0013  0.0000  0.0064          0.1520          nan         0.3234  MedInc
 [RegressionSelector] Selected Qualitative Features
   feature                    Mode     Nan  KruskalEtaSquaredMeasure  KruskalEtaSquaredRank  TschuprowtFilter  TschuprowtWith
0  Categorical('Region')    0.4798  0.0000                    0.0459                 0.0000            0.0000  itself
1  Ordinal('HouseAgeBand')  0.3486  0.0000                    0.0047                 1.0000            0.0842  Region
[10]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveBedrms', 'Population'])
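
The custom run differs from the default one mainly because PearsonFilter(threshold=0.25) removes AveRooms, whose reported correlation with the better-ranked MedInc is 0.3234. The sketch below re-derives the quantitative selection from the printed PearsonFilter column alone. It assumes the filter compares the absolute correlation with the threshold, which is consistent with the table but not confirmed by it.

```python
# PearsonFilter column of the verbose output above, listed in rank order
# (AveRooms comes last because its rank is reported as nan)
pearson_filter = {
    "MedInc": 0.0000,
    "AveBedrms": -0.0713,
    "Population": -0.0716,
    "AveOccup": 0.0759,
    "AveRooms": 0.3234,
}
threshold = 0.25
n_best_per_type = 3

# keep features whose correlation with a better-ranked feature is weak enough
kept = [name for name, corr in pearson_filter.items() if abs(corr) <= threshold]
dropped = [name for name in pearson_filter if name not in kept]
selected = kept[:n_best_per_type]

print(dropped)
print(selected)
```

Under that assumption the result matches the selector's output: AveRooms is filtered out, and the three best remaining quantitative features are MedInc, AveBedrms and Population.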

What’s next?

  • You’ve selected the features most associated with your regression target!

  • Head over to the Carvers Examples — in particular the Continuous Regression example — to maximize the predictive power of the selected features.