Setting things up

About this notebook

In this notebook we use `RegressionSelector <https://autocarver.readthedocs.io/en/latest/selectors.html#regression-tasks>`__ to quickly rank and select the features most associated with a continuous target — here the median house value of the California Housing dataset. Unlike a full carving pass, the selector is a lightweight, association-centric step: it scores every feature against the target, ranks them, and drops those too correlated with a better-ranked feature.
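The score, rank and de-correlate loop described above can be sketched in plain pandas. This is an illustration of the idea only, not AutoCarver's implementation: the helper `rank_and_decorrelate`, its `threshold` and `n_best` parameters, the ranking by absolute value and the synthetic columns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def rank_and_decorrelate(X, y, threshold, n_best):
    """Hypothetical helper: keep up to n_best columns of X, best-ranked first,
    skipping any column too correlated with an already kept one."""
    ranks_X, ranks_y = X.rank(), y.rank()
    # Spearman association with the target (Pearson on ranks), best first
    scores = ranks_X.corrwith(ranks_y).abs().sort_values(ascending=False)
    # pairwise Spearman association between features
    pairwise = ranks_X.corr().abs()
    kept = []
    for name in scores.index:
        if len(kept) == n_best:
            break
        # a feature is redundant if it exceeds the threshold with a kept one
        if all(pairwise.loc[name, other] <= threshold for other in kept):
            kept.append(name)
    return kept


# synthetic data (assumption): one strong feature, a noisy copy of it,
# a weaker but distinct feature, and pure noise
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
X = pd.DataFrame(
    {
        "strong": signal + rng.normal(scale=0.2, size=1000),
        "near_copy": signal + rng.normal(scale=0.5, size=1000),
        "weak": 0.5 * signal + rng.normal(size=1000),
        "noise": rng.normal(size=1000),
    }
)
y = pd.Series(signal + rng.normal(scale=0.2, size=1000))

selected = rank_and_decorrelate(X, y, threshold=0.8, n_best=2)
print(selected)
```

Here `near_copy` ranks second against the target but is skipped because it is too correlated with `strong`, so the weaker yet non-redundant `weak` takes the second slot.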

RegressionSelector scores every feature exactly (no sampling) yet stays fast: each measure is computed for all features of a type in a single vectorized pass. By default it uses Spearman’s rho for quantitative features and the Kruskal-Wallis eta-squared effect size for qualitative ones (the latter via a reversed test, since the target is continuous).
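As a rough guide to what the two default measures compute, here is a sketch using pandas and SciPy. It is not the library's code: the function names are hypothetical, and the eta-squared estimator `(H - k + 1) / (n - k)` is a common convention that is assumed here rather than confirmed for AutoCarver.

```python
import numpy as np
import pandas as pd
from scipy import stats


def spearman_rho(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho, computed as Pearson's r on average ranks."""
    return float(x.rank().corr(y.rank()))


def kruskal_eta_squared(groups: pd.Series, y: pd.Series) -> float:
    """Eta-squared derived from the Kruskal-Wallis H statistic.

    The feature defines the groups and the continuous target is ranked.
    Assumes the estimator (H - k + 1) / (n - k), clipped at 0.
    """
    samples = [values.to_numpy() for _, values in y.groupby(groups)]
    h_stat, _ = stats.kruskal(*samples)
    n, k = len(y), len(samples)
    return max(0.0, float((h_stat - k + 1) / (n - k)))


# synthetic data (assumption): a monotone, non-linear relationship
rng = np.random.default_rng(0)
x = pd.Series(rng.uniform(1, 10, size=300))
y = np.log(x) + rng.normal(scale=0.05, size=300)
groups = pd.Series(np.where(x > 5, "high", "low"))

rho = spearman_rho(x, y)
eta2 = kruskal_eta_squared(groups, y)
print(round(rho, 3), round(eta2, 3))
```

Both measures work on ranks, so they are insensitive to monotone transformations of the target such as a log.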

Installation

[1]:

# %pip install AutoCarver[jupyter]

California Housing data

The California Housing dataset ships with scikit-learn. Each row is a census block group; the target MedHouseVal is the median house value (in $100,000s) — a continuous regression target.

[1]:

from sklearn import datasets

housing = datasets.fetch_california_housing(as_frame=True).frame
housing.head()

[1]:

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude	MedHouseVal
0	8.3252	41.0	6.984127	1.023810	322.0	2.555556	37.88	-122.23	4.526
1	8.3014	21.0	6.238137	0.971880	2401.0	2.109842	37.86	-122.22	3.585
2	7.2574	52.0	8.288136	1.073446	496.0	2.802260	37.85	-122.24	3.521
3	5.6431	52.0	5.817352	1.073059	558.0	2.547945	37.85	-122.25	3.413
4	3.8462	52.0	6.281853	1.081081	565.0	2.181467	37.85	-122.25	3.422

Target type and Selector selection

[2]:

target = "MedHouseVal"

housing[target].describe()

[2]:

count    20640.000000
mean         2.068558
std          1.153956
min          0.149990
25%          1.196000
50%          1.797000
75%          2.647250
max          5.000010
Name: MedHouseVal, dtype: float64

The target MedHouseVal is a continuous float64 used in a regression task. Hence we use AutoCarver.selectors.RegressionSelector in the following code blocks.

Deriving a few qualitative features

The raw dataset is entirely numeric. To also illustrate qualitative feature selection, we derive two categorical features from the geographic and age columns:

Region — an unordered categorical built from the latitude/longitude quadrant (NW, NE, SW, SE).
HouseAgeBand — an ordinal built from tertiles of HouseAge (recent < established < old).

[3]:

import numpy as np
import pandas as pd

housing["Region"] = (
    np.where(housing["Latitude"] >= housing["Latitude"].median(), "N", "S")
    + np.where(housing["Longitude"] >= housing["Longitude"].median(), "E", "W")
)
housing["HouseAgeBand"] = pd.qcut(
    housing["HouseAge"], 3, labels=["recent", "established", "old"]
).astype(str)

housing[["Region", "HouseAgeBand"]].value_counts()

[3]:

Region  HouseAgeBand
NW      recent          3741
SE      established     3650
        old             3238
NW      established     3120
        old             2974
SE      recent          2942
NE      recent           276
SW      established      227
        recent           179
NE      established      161
SW      old               77
NE      old               55
Name: count, dtype: int64

Data sampling

[4]:

from sklearn.model_selection import train_test_split

train_set, dev_set = train_test_split(housing, test_size=0.33, random_state=42)
train_set.shape, dev_set.shape

[4]:

((13828, 11), (6812, 11))

Setting up Features to select

We declare the quantitative, categorical and ordinal features to select from. MedInc, AveRooms, AveBedrms, Population and AveOccup are quantitative; Region is categorical; HouseAgeBand is ordinal (its ordering is provided explicitly).

[5]:

from AutoCarver import Features

features = Features(
    numericals=["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup"],
    categoricals=["Region"],
    ordinals={"HouseAgeBand": ["recent", "established", "old"]},
)
features

[5]:

Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])

Feature selection

Selector settings

Number of features to select

n_best_per_type sets how many features to keep per data type (quantitative and qualitative).

[6]:

n_best_per_type = 3

Using the Selector with default measures

With no measures/filters provided, RegressionSelector uses its defaults:

Spearman’s rho ranks each quantitative feature against the target,
Kruskal-Wallis eta-squared ranks each qualitative feature (reversed test: the feature defines the groups, the continuous target is ranked),
NaN / Mode gates discard degenerate features, and Spearman/Tschuprow filters drop redundant ones.
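The NaN and Mode gates can be pictured as two simple shares compared with a threshold. The sketch below is an assumption about their spirit, not the library's code: `nan_share`, `mode_share`, the strict `<` comparison and the thresholds 0.5 and 0.9 are chosen for illustration.

```python
import numpy as np
import pandas as pd


def nan_share(x: pd.Series) -> float:
    """Share of missing values in the column."""
    return float(x.isna().mean())


def mode_share(x: pd.Series) -> float:
    """Share of the most frequent value among non-missing values."""
    counts = x.value_counts(normalize=True)  # NaN is dropped by default
    return float(counts.iloc[0]) if len(counts) else float("nan")


# toy frame (assumption): two degenerate columns and one healthy column
frame = pd.DataFrame(
    {
        "constant": ["a"] * 10,
        "mostly_missing": [1.0] + [np.nan] * 9,
        "balanced": list("abcde") * 2,
    }
)

report = pd.DataFrame(
    {"nan": frame.apply(nan_share), "mode": frame.apply(mode_share)}
)
# a column passes when both shares stay below their (illustrative) thresholds
report["valid"] = (report["nan"] < 0.5) & (report["mode"] < 0.9)
print(report)
```

Only `balanced` passes: `constant` has a single value, and `mostly_missing` is 90% empty.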

Behavioral toggles such as verbose live in ProcessingConfig, exactly as for the carvers.

[7]:

from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig
from AutoCarver import RegressionSelector

feature_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set, train_set[target]).selected_features
best_features

 [RegressionSelector] Selected Quantitative Features

	feature	Mode	SpearmanMeasure	SpearmanRank	SpearmanFilter	SpearmanWith
0	Quantitative('MedInc')	0.0027	0.6765	0.0000	0.0000	itself
1	Quantitative('AveRooms')	0.0013	0.2557	1.0000	0.6398	MedInc
4	Quantitative('AveOccup')	0.0017	-0.2552	2.0000	-0.0390	MedInc
2	Quantitative('AveBedrms')	0.0132	-0.1277	3.0000	-0.2550	MedInc
3	Quantitative('Population')	0.0014	0.0044	4.0000	0.2377	AveOccup

 [RegressionSelector] Selected Qualitative Features

	feature	Nan	Mode	KruskalEtaSquaredMeasure	KruskalEtaSquaredRank	TschuprowtFilter	TschuprowtWith
0	Categorical('Region')	0.0000	0.4798	0.0459	0.0000	0.0000	itself
1	Ordinal('HouseAgeBand')	0.0000	0.3486	0.0047	1.0000	0.0842	Region

[7]:

Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveOccup'])

selected_features holds the selected Features; equivalently, feature_selector.transform(train_set) returns train_set restricted to the selected columns. Each feature also carries its computed statistics, for example the reversed Kruskal-Wallis eta-squared used to rank the qualitative feature Region:

[8]:

features("Region").measures

[8]:

{'Nan': {'value': np.float64(0.0),
  'threshold': 1.0,
  'valid': np.True_,
  'info': {'higher_is_better': False,
   'correlation_with': 'itself',
   'is_default': True,
   'is_absolute': False}},
 'Mode': {'value': np.float64(0.4797512293896442),
  'threshold': 1.0,
  'valid': np.True_,
  'info': {'higher_is_better': False,
   'correlation_with': 'itself',
   'is_default': True,
   'is_absolute': False}},
 'KruskalEtaSquaredMeasure': {'value': 0.04593188463880526,
  'threshold': 0.0,
  'valid': True,
  'info': {'higher_is_better': True,
   'correlation_with': 'target',
   'is_default': False,
   'is_absolute': False}},
 'KruskalEtaSquaredRank': {'value': 0,
  'threshold': -1,
  'valid': True,
  'info': {'is_default': False, 'higher_is_better': False}}}
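The nested dictionary is easier to scan as a table. The sketch below flattens it with pandas; to stay runnable without AutoCarver it uses a literal copy of the output above (values rounded, `info` reduced to `higher_is_better`), and it assumes the attribute is a plain dict shaped as displayed.

```python
import pandas as pd

# literal copy of the output above, rounded for readability
measures = {
    "Nan": {"value": 0.0, "threshold": 1.0, "valid": True,
            "info": {"higher_is_better": False}},
    "Mode": {"value": 0.4798, "threshold": 1.0, "valid": True,
             "info": {"higher_is_better": False}},
    "KruskalEtaSquaredMeasure": {"value": 0.0459, "threshold": 0.0, "valid": True,
                                 "info": {"higher_is_better": True}},
    "KruskalEtaSquaredRank": {"value": 0, "threshold": -1, "valid": True,
                              "info": {"higher_is_better": False}},
}

# one row per measure, one column per statistic
summary = pd.DataFrame.from_dict(
    {
        name: {
            "value": entry["value"],
            "threshold": entry["threshold"],
            "valid": bool(entry["valid"]),
            "higher_is_better": entry["info"]["higher_is_better"],
        }
        for name, entry in measures.items()
    },
    orient="index",
)
print(summary)
```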

Optional: choosing the measures and filters

Measures and filters are the swappable parts of the selector: provide your own to change how features are ranked and de-correlated. See the available measures and filters.

Here we:

rank quantitative features with Pearson’s r (instead of Spearman),
keep the Kruskal-Wallis eta-squared for qualitative features,
drop features with more than 30% missing values (NaN) or 30% outliers (Zscore),
de-correlate with Pearson (quantitative) and Tschuprow’s T (qualitative) filters.
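To make the outlier gate concrete, here is one way a z-score outlier share could be computed. This is a sketch under assumptions: the `|z| > 3` cutoff and the helper name are illustrative, and the exact rule used by ZscoreOutlierMeasure may differ.

```python
import pandas as pd


def zscore_outlier_share(x: pd.Series, cutoff: float = 3.0) -> float:
    """Share of values lying more than `cutoff` standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return float((z.abs() > cutoff).mean())


# toy column (assumption): 98 ordinary values and 2 extreme ones
x = pd.Series([0.0, 1.0] * 49 + [100.0, 100.0])

share = zscore_outlier_share(x)
passes_gate = share < 0.3  # same 30% limit as in the list above
print(share, passes_gate)
```

With a 30% limit the gate only rejects features dominated by extreme values; a handful of outliers, as here, leaves the feature in play.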

[9]:

from AutoCarver.selectors import (
    KruskalEtaSquaredMeasure,
    NanMeasure,
    PearsonMeasure,
    ZscoreOutlierMeasure,
    PearsonFilter,
    TschuprowtFilter,
)

measures = [
    NanMeasure(threshold=0.3),
    ZscoreOutlierMeasure(threshold=0.3),
    PearsonMeasure(),
    KruskalEtaSquaredMeasure(),
]
filters = [PearsonFilter(threshold=0.25), TschuprowtFilter(threshold=0.25)]

[10]:

custom_selector = RegressionSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    measures=measures,
    filters=filters,
    config=ProcessingConfig(verbose=True),
)
custom_selector.fit(train_set, train_set[target]).selected_features

 [RegressionSelector] Selected Quantitative Features

	feature	Mode	ZScore	PearsonMeasure	PearsonRank	PearsonFilter	PearsonWith
0	Quantitative('MedInc')	0.0027	0.0163	0.6884	0.0000	0.0000	itself
2	Quantitative('AveBedrms')	0.0132	0.0076	-0.0489	1.0000	-0.0713	MedInc
3	Quantitative('Population')	0.0014	0.0158	-0.0244	2.0000	-0.0716	AveBedrms
4	Quantitative('AveOccup')	0.0017	0.0004	-0.0206	3.0000	0.0759	Population
1	Quantitative('AveRooms')	0.0013	0.0064	0.1520	nan	0.3234	MedInc

 [RegressionSelector] Selected Qualitative Features

	feature	Mode	Nan	KruskalEtaSquaredMeasure	KruskalEtaSquaredRank	TschuprowtFilter	TschuprowtWith
0	Categorical('Region')	0.4798	0.0000	0.0459	0.0000	0.0000	itself
1	Ordinal('HouseAgeBand')	0.3486	0.0000	0.0047	1.0000	0.0842	Region

[10]:

Features(['Region', 'HouseAgeBand', 'MedInc', 'AveBedrms', 'Population'])
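The TschuprowtFilter column measures how strongly two qualitative features are associated. As a sketch of the statistic, the block below computes Tschuprow's T from the Region by HouseAgeBand counts printed earlier for the full dataset. The helper is hypothetical, and because the selector was fitted on the training split only, the result should be in the same ballpark as the 0.0842 reported above rather than identical to it.

```python
import numpy as np
import pandas as pd


def tschuprow_t(table: pd.DataFrame) -> float:
    """Tschuprow's T from a contingency table of counts."""
    observed = table.to_numpy(dtype=float)
    n = observed.sum()
    # expected counts under independence of rows and columns
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_rows, n_cols = observed.shape
    return float(np.sqrt(chi2 / (n * np.sqrt((n_rows - 1) * (n_cols - 1)))))


# counts taken from the value_counts output on the full dataset
counts = pd.DataFrame(
    {
        "recent": [3741, 2942, 276, 179],
        "established": [3120, 3650, 161, 227],
        "old": [2974, 3238, 55, 77],
    },
    index=["NW", "SE", "NE", "SW"],
)

t = tschuprow_t(counts)
redundant = t > 0.25  # threshold passed to TschuprowtFilter above
print(round(t, 4), redundant)
```

The association stays well below the 0.25 threshold, which is consistent with both qualitative features being kept.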

What’s next?

You’ve selected the features most associated with your regression target!
Head over to the Carvers Examples — in particular the Continuous Regression example — to maximize the predictive power of the selected features.