Setting things up
About this notebook
In this notebook we use `RegressionSelector <https://autocarver.readthedocs.io/en/latest/selectors.html#regression-tasks>`__ to quickly rank and select the features most associated with a continuous target — here the median house value of the California Housing dataset. Unlike a full carving pass, the selector is a lightweight, association-centric step: it scores every feature against the target, ranks them, and drops those too correlated with a better-ranked feature.
RegressionSelector scores every feature exactly (no sampling) yet stays fast: each measure is computed for all features of a type in a single vectorized pass. By default it uses Spearman’s rho for quantitative features and the Kruskal-Wallis eta-squared effect size for qualitative ones (the latter via a reversed test, since the target is continuous).
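To make the default quantitative measure concrete, here is a minimal pure-Python sketch of Spearman's rho, computed as Pearson's r on average ranks. The helper names (average_ranks, pearson_r, spearman_rho) are illustrative and not part of AutoCarver, whose vectorized implementation may differ. The five MedInc/MedHouseVal pairs are the first rows of the dataset loaded further down.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of 1-based positions start+1..end+1
        for position in range(start, end + 1):
            ranks[order[position]] = tied_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return covariance / (spread_x * spread_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# First five rows of California Housing: MedInc against MedHouseVal
med_inc = [8.3252, 8.3014, 7.2574, 5.6431, 3.8462]
med_house_val = [4.526, 3.585, 3.521, 3.413, 3.422]
rho = spearman_rho(med_inc, med_house_val)
print(round(rho, 4))  # 0.9
```

Five rows are far too few to be meaningful; the point is only that rho depends on the ordering of the values, not on their scale.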
Installation
[1]:
# %pip install AutoCarver[jupyter]
California Housing data
The California Housing dataset ships with scikit-learn. Each row is a census block group; the target MedHouseVal is the median house value (in $100,000s) — a continuous regression target.
[1]:
from sklearn import datasets
housing = datasets.fetch_california_housing(as_frame=True).frame
housing.head()
[1]:
| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
Target type and Selector selection
[2]:
target = "MedHouseVal"
housing[target].describe()
[2]:
count 20640.000000
mean 2.068558
std 1.153956
min 0.149990
25% 1.196000
50% 1.797000
75% 2.647250
max 5.000010
Name: MedHouseVal, dtype: float64
The target MedHouseVal is a continuous float64 used in a regression task. Hence we use AutoCarver.selectors.RegressionSelector in the following code blocks.
Deriving a few qualitative features
The raw dataset is entirely numeric. To also illustrate qualitative feature selection, we derive two categorical features from the geographic and age columns:
Region: an unordered categorical built from the latitude/longitude quadrant (NW, NE, SW, SE).
HouseAgeBand: an ordinal built from tertiles of HouseAge (recent < established < old).
[3]:
import numpy as np
import pandas as pd
housing["Region"] = (
np.where(housing["Latitude"] >= housing["Latitude"].median(), "N", "S")
+ np.where(housing["Longitude"] >= housing["Longitude"].median(), "E", "W")
)
housing["HouseAgeBand"] = pd.qcut(
housing["HouseAge"], 3, labels=["recent", "established", "old"]
).astype(str)
housing[["Region", "HouseAgeBand"]].value_counts()
[3]:
Region HouseAgeBand
NW recent 3741
SE established 3650
old 3238
NW established 3120
old 2974
SE recent 2942
NE recent 276
SW established 227
recent 179
NE established 161
SW old 77
NE old 55
Name: count, dtype: int64
Data sampling
[4]:
from sklearn.model_selection import train_test_split
train_set, dev_set = train_test_split(housing, test_size=0.33, random_state=42)
train_set.shape, dev_set.shape
[4]:
((13828, 11), (6812, 11))
Setting up Features to select
We declare the quantitative, categorical and ordinal features to select from. MedInc, AveRooms, AveBedrms, Population and AveOccup are quantitative; Region is categorical; HouseAgeBand is ordinal (its ordering is provided explicitly).
[5]:
from AutoCarver import Features
features = Features(
numericals=["MedInc", "AveRooms", "AveBedrms", "Population", "AveOccup"],
categoricals=["Region"],
ordinals={"HouseAgeBand": ["recent", "established", "old"]},
)
features
[5]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])
Feature selection
Selector settings
Number of features to select
n_best_per_type sets how many features to keep per data type (quantitative and qualitative).
[6]:
n_best_per_type = 3
Using the Selector with default measures
With no measures/filters provided, RegressionSelector uses its defaults:
Spearman’s rho ranks each quantitative feature against the target,
Kruskal-Wallis eta-squared ranks each qualitative feature (reversed test: the feature defines the groups, the continuous target is ranked),
NaN/Mode gates discard degenerate features, and Spearman/Tschuprow filters drop redundant ones.
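The reversed Kruskal-Wallis test in the list above can be sketched as follows: the qualitative feature defines the groups, the continuous target is ranked, and the H statistic is turned into an effect size. The conversion used here, eta² = (H - k + 1) / (n - k), is a common convention and an assumption about what the library computes. The function names and the toy data are illustrative only.

```python
from collections import Counter, defaultdict


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for position in range(start, end + 1):
            ranks[order[position]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kruskal_eta_squared(groups, target):
    """Eta-squared effect size from the Kruskal-Wallis H statistic.

    Assumes at least two groups, more rows than groups, and a non-constant target.
    """
    n = len(target)
    rank_sums = defaultdict(float)
    sizes = Counter(groups)
    for group, rank in zip(groups, average_ranks(target)):
        rank_sums[group] += rank
    k = len(sizes)
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / sizes[g] for g in sizes) - 3 * (n + 1)
    # standard tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n)
    ties = Counter(target)
    h /= 1 - sum(t**3 - t for t in ties.values()) / (n**3 - n)
    return (h - k + 1) / (n - k)


# Hypothetical toy data: the two groups are perfectly separated by the target
groups = ["N", "N", "N", "S", "S", "S"]
target = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
eta_squared = kruskal_eta_squared(groups, target)
```

With this convention a feature whose groups do not separate the target at all can score slightly below zero, so small values such as those reported for HouseAgeBand should be read as "close to no association".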
Behavioral toggles such as verbose live in ProcessingConfig, exactly as for the carvers.
[7]:
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig
from AutoCarver import RegressionSelector
feature_selector = RegressionSelector(
features=features,
n_best_per_type=n_best_per_type,
config=ProcessingConfig(verbose=True), # displays statistics
)
best_features = feature_selector.fit(train_set, train_set[target]).selected_features
best_features
[RegressionSelector] Selected Quantitative Features
| | feature | Nan | Mode | SpearmanMeasure | SpearmanRank | SpearmanFilter | SpearmanWith |
|---|---|---|---|---|---|---|---|
| 0 | Quantitative('MedInc') | 0.0000 | 0.0027 | 0.6765 | 0.0000 | 0.0000 | itself |
| 1 | Quantitative('AveRooms') | 0.0000 | 0.0013 | 0.2557 | 1.0000 | 0.6398 | MedInc |
| 4 | Quantitative('AveOccup') | 0.0000 | 0.0017 | -0.2552 | 2.0000 | -0.0390 | MedInc |
| 2 | Quantitative('AveBedrms') | 0.0000 | 0.0132 | -0.1277 | 3.0000 | -0.2550 | MedInc |
| 3 | Quantitative('Population') | 0.0000 | 0.0014 | 0.0044 | 4.0000 | 0.2377 | AveOccup |
[RegressionSelector] Selected Qualitative Features
| | feature | Nan | Mode | KruskalEtaSquaredMeasure | KruskalEtaSquaredRank | TschuprowtFilter | TschuprowtWith |
|---|---|---|---|---|---|---|---|
| 0 | Categorical('Region') | 0.0000 | 0.4798 | 0.0459 | 0.0000 | 0.0000 | itself |
| 1 | Ordinal('HouseAgeBand') | 0.0000 | 0.3486 | 0.0047 | 1.0000 | 0.0842 | Region |
[7]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveOccup'])
After fitting, selected_features holds the selected Features; equivalently, feature_selector.transform(train_set) returns train_set restricted to the selected columns. Each feature also carries the computed statistics, for example the reversed Kruskal-Wallis eta-squared used to rank the qualitative Region:
[8]:
features("Region").measures
[8]:
{'Nan': {'value': np.float64(0.0),
'threshold': 1.0,
'valid': np.True_,
'info': {'higher_is_better': False,
'correlation_with': 'itself',
'is_default': True,
'is_absolute': False}},
'Mode': {'value': np.float64(0.4797512293896442),
'threshold': 1.0,
'valid': np.True_,
'info': {'higher_is_better': False,
'correlation_with': 'itself',
'is_default': True,
'is_absolute': False}},
'KruskalEtaSquaredMeasure': {'value': 0.04593188463880526,
'threshold': 0.0,
'valid': True,
'info': {'higher_is_better': True,
'correlation_with': 'target',
'is_default': False,
'is_absolute': False}},
'KruskalEtaSquaredRank': {'value': 0,
'threshold': -1,
'valid': True,
'info': {'is_default': False, 'higher_is_better': False}}}
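A dictionary shaped like the one above is easy to flatten for reporting. The sketch below works on a hand-copied, abridged version of that output (NumPy scalars replaced by plain Python values, the info entries dropped); summarize is a hypothetical helper, not an AutoCarver function.

```python
# Abridged copy of the measures reported for Region above
region_measures = {
    "Nan": {"value": 0.0, "threshold": 1.0, "valid": True},
    "Mode": {"value": 0.4797512293896442, "threshold": 1.0, "valid": True},
    "KruskalEtaSquaredMeasure": {"value": 0.04593188463880526, "threshold": 0.0, "valid": True},
    "KruskalEtaSquaredRank": {"value": 0, "threshold": -1, "valid": True},
}


def summarize(measures):
    """Return one (name, rounded value, valid) row per measure."""
    return [
        (name, round(float(stats["value"]), 4), bool(stats["valid"]))
        for name, stats in measures.items()
    ]


rows = summarize(region_measures)
# names of the measures the feature failed, if any
failed = [name for name, _, valid in rows if not valid]
for row in rows:
    print(row)
```

Here failed is empty, which is consistent with Region appearing among the selected features.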
Optional: choosing the measures and filters
The measures and filters are the swappable part of the selector: provide your own to change how features are ranked and de-correlated. See the available measures and filters.
Here we:
rank quantitative features with Pearson’s r (instead of Spearman),
keep the Kruskal-Wallis eta-squared for qualitative features,
drop features with more than 30% missing values (NaN) or 30% outliers (Zscore),
de-correlate with Pearson (quantitative) and Tschuprow’s T (qualitative) filters.
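As a rough illustration of the outlier gate in the list above, the sketch below computes the share of values whose absolute z-score exceeds a cutoff. The cutoff of 3 and the use of the population standard deviation are assumptions; the library's exact outlier definition is not stated in this notebook. The function name and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outlier_share(values, z_cutoff=3.0):
    """Share of values lying more than z_cutoff standard deviations from the mean."""
    center, spread = mean(values), pstdev(values)
    if spread == 0:
        return 0.0  # a constant feature has no z-score outliers
    outliers = sum(abs((value - center) / spread) > z_cutoff for value in values)
    return outliers / len(values)


# Hypothetical feature: nineteen ordinary values and one extreme value
values = [1.0] * 19 + [100.0]
share = zscore_outlier_share(values)
print(share)  # 0.05
keep_feature = share <= 0.3  # mirrors a 30% threshold
```

Under these assumptions a threshold of 0.3 is very permissive: a feature is discarded only if nearly a third of its values are extreme.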
[9]:
from AutoCarver.selectors import (
KruskalEtaSquaredMeasure,
NanMeasure,
PearsonMeasure,
ZscoreOutlierMeasure,
PearsonFilter,
TschuprowtFilter,
)
measures = [
NanMeasure(threshold=0.3),
ZscoreOutlierMeasure(threshold=0.3),
PearsonMeasure(),
KruskalEtaSquaredMeasure(),
]
filters = [PearsonFilter(threshold=0.25), TschuprowtFilter(threshold=0.25)]
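For intuition about what a Tschuprow's T threshold of 0.25 means, the sketch below computes T for Region against HouseAgeBand from the contingency counts printed earlier in this notebook (full dataset). It uses the textbook definition T = sqrt(chi² / (n · sqrt((r - 1)(c - 1)))); whether the library applies any correction on top of that is an assumption. tschuprow_t is an illustrative helper, not the AutoCarver implementation.

```python
from math import sqrt

# Region x HouseAgeBand counts copied from the value_counts output (full dataset)
counts = {
    "NW": {"recent": 3741, "established": 3120, "old": 2974},
    "SE": {"recent": 2942, "established": 3650, "old": 3238},
    "NE": {"recent": 276, "established": 161, "old": 55},
    "SW": {"recent": 179, "established": 227, "old": 77},
}


def tschuprow_t(table):
    """Tschuprow's T for a contingency table given as {row: {column: count}}."""
    rows = list(table)
    columns = list(next(iter(table.values())))
    total = sum(table[r][c] for r in rows for c in columns)
    row_totals = {r: sum(table[r][c] for c in columns) for r in rows}
    column_totals = {c: sum(table[r][c] for r in rows) for c in columns}
    chi_squared = 0.0
    for r in rows:
        for c in columns:
            expected = row_totals[r] * column_totals[c] / total
            chi_squared += (table[r][c] - expected) ** 2 / expected
    return sqrt(chi_squared / (total * sqrt((len(rows) - 1) * (len(columns) - 1))))


t_value = tschuprow_t(counts)
is_redundant = t_value > 0.25  # would the 0.25 filter treat the pair as redundant?
print(round(t_value, 4), is_redundant)
```

The value lands well below 0.25 and close to the 0.0842 the selector reports on the training split, so both qualitative features survive the filter.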
[10]:
custom_selector = RegressionSelector(
features=features,
n_best_per_type=n_best_per_type,
measures=measures,
filters=filters,
config=ProcessingConfig(verbose=True),
)
custom_selector.fit(train_set, train_set[target]).selected_features
[RegressionSelector] Selected Quantitative Features
| | feature | Mode | Nan | ZScore | PearsonMeasure | PearsonRank | PearsonFilter | PearsonWith |
|---|---|---|---|---|---|---|---|---|
| 0 | Quantitative('MedInc') | 0.0027 | 0.0000 | 0.0163 | 0.6884 | 0.0000 | 0.0000 | itself |
| 2 | Quantitative('AveBedrms') | 0.0132 | 0.0000 | 0.0076 | -0.0489 | 1.0000 | -0.0713 | MedInc |
| 3 | Quantitative('Population') | 0.0014 | 0.0000 | 0.0158 | -0.0244 | 2.0000 | -0.0716 | AveBedrms |
| 4 | Quantitative('AveOccup') | 0.0017 | 0.0000 | 0.0004 | -0.0206 | 3.0000 | 0.0759 | Population |
| 1 | Quantitative('AveRooms') | 0.0013 | 0.0000 | 0.0064 | 0.1520 | nan | 0.3234 | MedInc |
[RegressionSelector] Selected Qualitative Features
| | feature | Mode | Nan | KruskalEtaSquaredMeasure | KruskalEtaSquaredRank | TschuprowtFilter | TschuprowtWith |
|---|---|---|---|---|---|---|---|
| 0 | Categorical('Region') | 0.4798 | 0.0000 | 0.0459 | 0.0000 | 0.0000 | itself |
| 1 | Ordinal('HouseAgeBand') | 0.3486 | 0.0000 | 0.0047 | 1.0000 | 0.0842 | Region |
[10]:
Features(['Region', 'HouseAgeBand', 'MedInc', 'AveBedrms', 'Population'])
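The quantitative result above can be reproduced with a small greedy procedure: rank features by absolute association, drop any feature too correlated with a better-ranked one that was kept, then keep the best n. This is a sketch of the behavior as it can be read off the table, not the library's code. greedy_select is a hypothetical helper, and feature pairs whose correlation the table does not report are treated as 0.0, which is an illustrative assumption.

```python
# Pearson's r with the target, copied from the PearsonMeasure column above
associations = {
    "MedInc": 0.6884,
    "AveRooms": 0.1520,
    "AveBedrms": -0.0489,
    "Population": -0.0244,
    "AveOccup": -0.0206,
}

# Pairwise correlations reported in the PearsonFilter / PearsonWith columns
reported = {
    frozenset(("AveRooms", "MedInc")): 0.3234,
    frozenset(("AveBedrms", "MedInc")): -0.0713,
    frozenset(("Population", "AveBedrms")): -0.0716,
    frozenset(("AveOccup", "Population")): 0.0759,
}


def correlation(first, second):
    """Reported correlation of a pair; unreported pairs default to 0.0 (assumption)."""
    return reported.get(frozenset((first, second)), 0.0)


def greedy_select(associations, correlation, threshold, n_best):
    """Keep the n_best features by |association| that are not redundant with a kept one."""
    ranked = sorted(associations, key=lambda name: abs(associations[name]), reverse=True)
    kept = []
    for name in ranked:
        if all(abs(correlation(name, better)) <= threshold for better in kept):
            kept.append(name)
    return kept[:n_best]


selected = greedy_select(associations, correlation, threshold=0.25, n_best=3)
print(selected)  # ['MedInc', 'AveBedrms', 'Population']
```

AveRooms has the second strongest association with the target, but its correlation of 0.3234 with MedInc exceeds the 0.25 threshold, which would explain why it carries no rank in the table and why AveBedrms and Population take the remaining slots.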
What’s next?
You’ve selected the features most associated with your regression target!
Head over to the Carvers Examples — in particular the Continuous Regression example — to maximize the predictive power of the selected features.