{ "cells": [ { "cell_type": "markdown", "id": "3d6bfa70", "metadata": {}, "source": [ "# Setting things up\n", "\n", "## About this notebook\n", "\n", "In this notebook we use [`RegressionSelector`](https://autocarver.readthedocs.io/en/latest/selectors.html#regression-tasks) to quickly rank and select the features most associated with a **continuous** target — here the median house value of the California Housing dataset. Unlike a full carving pass, the selector is a lightweight, association-centric step: it scores every feature against the target, ranks them, and drops those too correlated with a better-ranked feature.\n", "\n", "`RegressionSelector` scores **every** feature exactly (no sampling) yet stays fast: each measure is computed for all features of a type in a single vectorized pass. By default it uses **Spearman's rho** for quantitative features and the **Kruskal-Wallis eta-squared** effect size for qualitative ones (the latter via a reversed test, since the target is continuous)." ] }, { "cell_type": "markdown", "id": "25639069", "metadata": {}, "source": [ "## Installation" ] }, { "cell_type": "code", "execution_count": 1, "id": "64e9d727", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:14.027810Z", "iopub.status.busy": "2026-06-14T00:59:14.026764Z", "iopub.status.idle": "2026-06-14T00:59:14.033781Z", "shell.execute_reply": "2026-06-14T00:59:14.032779Z" } }, "outputs": [], "source": [ "# %pip install AutoCarver[jupyter]" ] }, { "cell_type": "markdown", "id": "841d8358", "metadata": {}, "source": [ "## California Housing data\n", "\n", "The [California Housing dataset](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) ships with `scikit-learn`. Each row is a census block group; the target `MedHouseVal` is the median house value (in $100,000s) — a continuous regression target." 
] }, { "cell_type": "code", "execution_count": 1, "id": "d737ce96", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:14.038284Z", "iopub.status.busy": "2026-06-14T00:59:14.037241Z", "iopub.status.idle": "2026-06-14T00:59:15.479375Z", "shell.execute_reply": "2026-06-14T00:59:15.479375Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
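<table>\n",
"<thead><tr><th></th><th>MedInc</th><th>HouseAge</th><th>AveRooms</th><th>AveBedrms</th><th>Population</th><th>AveOccup</th><th>Latitude</th><th>Longitude</th><th>MedHouseVal</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>8.3252</td><td>41.0</td><td>6.984127</td><td>1.023810</td><td>322.0</td><td>2.555556</td><td>37.88</td><td>-122.23</td><td>4.526</td></tr>\n",
"<tr><th>1</th><td>8.3014</td><td>21.0</td><td>6.238137</td><td>0.971880</td><td>2401.0</td><td>2.109842</td><td>37.86</td><td>-122.22</td><td>3.585</td></tr>\n",
"<tr><th>2</th><td>7.2574</td><td>52.0</td><td>8.288136</td><td>1.073446</td><td>496.0</td><td>2.802260</td><td>37.85</td><td>-122.24</td><td>3.521</td></tr>\n",
"<tr><th>3</th><td>5.6431</td><td>52.0</td><td>5.817352</td><td>1.073059</td><td>558.0</td><td>2.547945</td><td>37.85</td><td>-122.25</td><td>3.413</td></tr>\n",
"<tr><th>4</th><td>3.8462</td><td>52.0</td><td>6.281853</td><td>1.081081</td><td>565.0</td><td>2.181467</td><td>37.85</td><td>-122.25</td><td>3.422</td></tr>\n",
"</tbody>\n",
"</table>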
\n", "
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n", "1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n", "2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n", "3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n", "4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n", "\n", " Longitude MedHouseVal \n", "0 -122.23 4.526 \n", "1 -122.22 3.585 \n", "2 -122.24 3.521 \n", "3 -122.25 3.413 \n", "4 -122.25 3.422 " ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import datasets\n", "\n", "housing = datasets.fetch_california_housing(as_frame=True).frame\n", "housing.head()" ] }, { "cell_type": "markdown", "id": "ff2edc47", "metadata": {}, "source": [ "## Target type and Selector selection" ] }, { "cell_type": "code", "execution_count": 2, "id": "d9ca0754", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.482610Z", "iopub.status.busy": "2026-06-14T00:59:15.481574Z", "iopub.status.idle": "2026-06-14T00:59:15.488694Z", "shell.execute_reply": "2026-06-14T00:59:15.488694Z" } }, "outputs": [ { "data": { "text/plain": [ "count 20640.000000\n", "mean 2.068558\n", "std 1.153956\n", "min 0.149990\n", "25% 1.196000\n", "50% 1.797000\n", "75% 2.647250\n", "max 5.000010\n", "Name: MedHouseVal, dtype: float64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target = \"MedHouseVal\"\n", "\n", "housing[target].describe()" ] }, { "cell_type": "markdown", "id": "b4dd4e95", "metadata": {}, "source": [ "The target `MedHouseVal` is a continuous `float64` used in a regression task. Hence we use `AutoCarver.selectors.RegressionSelector` in the following code blocks." ] }, { "cell_type": "markdown", "id": "524fd69c", "metadata": {}, "source": [ "## Deriving a few qualitative features\n", "\n", "The raw dataset is entirely numeric. 
To also illustrate **qualitative** feature selection, we derive two categorical features from the geographic and age columns:\n", "\n", "* `Region` — an unordered **categorical** built from the latitude/longitude quadrant (`NW`, `NE`, `SW`, `SE`).\n", "* `HouseAgeBand` — an **ordinal** built from tertiles of `HouseAge` (`recent` < `established` < `old`)." ] }, { "cell_type": "code", "execution_count": 3, "id": "085e9437", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.492313Z", "iopub.status.busy": "2026-06-14T00:59:15.490850Z", "iopub.status.idle": "2026-06-14T00:59:15.507706Z", "shell.execute_reply": "2026-06-14T00:59:15.507165Z" } }, "outputs": [ { "data": { "text/plain": [ "Region HouseAgeBand\n", "NW recent 3741\n", "SE established 3650\n", " old 3238\n", "NW established 3120\n", " old 2974\n", "SE recent 2942\n", "NE recent 276\n", "SW established 227\n", " recent 179\n", "NE established 161\n", "SW old 77\n", "NE old 55\n", "Name: count, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "\n", "housing[\"Region\"] = (\n", " np.where(housing[\"Latitude\"] >= housing[\"Latitude\"].median(), \"N\", \"S\")\n", " + np.where(housing[\"Longitude\"] >= housing[\"Longitude\"].median(), \"E\", \"W\")\n", ")\n", "housing[\"HouseAgeBand\"] = pd.qcut(\n", " housing[\"HouseAge\"], 3, labels=[\"recent\", \"established\", \"old\"]\n", ").astype(str)\n", "\n", "housing[[\"Region\", \"HouseAgeBand\"]].value_counts()" ] }, { "cell_type": "markdown", "id": "02440537", "metadata": {}, "source": [ "## Data sampling" ] }, { "cell_type": "code", "execution_count": 4, "id": "7f17dbd1", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.510711Z", "iopub.status.busy": "2026-06-14T00:59:15.510711Z", "iopub.status.idle": "2026-06-14T00:59:15.546893Z", "shell.execute_reply": "2026-06-14T00:59:15.546893Z" } }, "outputs": [ { "data": { 
"text/plain": [ "((13828, 11), (6812, 11))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "train_set, dev_set = train_test_split(housing, test_size=0.33, random_state=42)\n", "train_set.shape, dev_set.shape" ] }, { "cell_type": "markdown", "id": "36046fc8", "metadata": {}, "source": [ "## Setting up Features to select\n", "\n", "We declare the quantitative, categorical and ordinal features to select from. `MedInc`, `AveRooms`, `AveBedrms`, `Population` and `AveOccup` are quantitative; `Region` is categorical; `HouseAgeBand` is ordinal (its ordering is provided explicitly)." ] }, { "cell_type": "code", "execution_count": 5, "id": "1e1575b5", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.549472Z", "iopub.status.busy": "2026-06-14T00:59:15.549472Z", "iopub.status.idle": "2026-06-14T00:59:15.873847Z", "shell.execute_reply": "2026-06-14T00:59:15.872757Z" } }, "outputs": [ { "data": { "text/plain": [ "Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AutoCarver import Features\n", "\n", "features = Features(\n", " numericals=[\"MedInc\", \"AveRooms\", \"AveBedrms\", \"Population\", \"AveOccup\"],\n", " categoricals=[\"Region\"],\n", " ordinals={\"HouseAgeBand\": [\"recent\", \"established\", \"old\"]},\n", ")\n", "features" ] }, { "cell_type": "markdown", "id": "a7171733", "metadata": {}, "source": [ "# Feature selection\n", "## Selector settings" ] }, { "cell_type": "markdown", "id": "cc12d73d", "metadata": {}, "source": [ "### Number of features to select\n", "\n", "`n_best_per_type` sets how many features to keep **per data type** (quantitative and qualitative)." 
] }, { "cell_type": "code", "execution_count": 6, "id": "ebb829eb", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.877506Z", "iopub.status.busy": "2026-06-14T00:59:15.876419Z", "iopub.status.idle": "2026-06-14T00:59:15.881480Z", "shell.execute_reply": "2026-06-14T00:59:15.880454Z" } }, "outputs": [], "source": [ "n_best_per_type = 3" ] }, { "cell_type": "markdown", "id": "a09d4b5a", "metadata": {}, "source": [ "## Using the Selector with default measures\n", "\n", "With no `measures`/`filters` provided, `RegressionSelector` uses its defaults:\n", "\n", "* **Spearman's rho** ranks each quantitative feature against the target,\n", "* **Kruskal-Wallis eta-squared** ranks each qualitative feature (reversed test: the feature defines the groups, the continuous target is ranked),\n", "* `NaN` / `Mode` gates discard degenerate features, and Spearman/Tschuprow filters drop redundant ones.\n", "\n", "Behavioral toggles such as `verbose` live in `ProcessingConfig`, exactly as for the carvers." ] }, { "cell_type": "code", "execution_count": 7, "id": "cd50899f", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:15.884499Z", "iopub.status.busy": "2026-06-14T00:59:15.884499Z", "iopub.status.idle": "2026-06-14T00:59:16.002786Z", "shell.execute_reply": "2026-06-14T00:59:16.002786Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [RegressionSelector] Selected Quantitative Features \n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
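<table>\n",
"<thead><tr><th></th><th>feature</th><th>Nan</th><th>Mode</th><th>SpearmanMeasure</th><th>SpearmanRank</th><th>SpearmanFilter</th><th>SpearmanWith</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Quantitative('MedInc')</td><td>0.0000</td><td>0.0027</td><td>0.6765</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>1</th><td>Quantitative('AveRooms')</td><td>0.0000</td><td>0.0013</td><td>0.2557</td><td>1.0000</td><td>0.6398</td><td>MedInc</td></tr>\n",
"<tr><th>4</th><td>Quantitative('AveOccup')</td><td>0.0000</td><td>0.0017</td><td>-0.2552</td><td>2.0000</td><td>-0.0390</td><td>MedInc</td></tr>\n",
"<tr><th>2</th><td>Quantitative('AveBedrms')</td><td>0.0000</td><td>0.0132</td><td>-0.1277</td><td>3.0000</td><td>-0.2550</td><td>MedInc</td></tr>\n",
"<tr><th>3</th><td>Quantitative('Population')</td><td>0.0000</td><td>0.0014</td><td>0.0044</td><td>4.0000</td><td>0.2377</td><td>AveOccup</td></tr>\n",
"</tbody>\n",
"</table>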
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " [RegressionSelector] Selected Qualitative Features \n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
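<table>\n",
"<thead><tr><th></th><th>feature</th><th>Nan</th><th>Mode</th><th>KruskalEtaSquaredMeasure</th><th>KruskalEtaSquaredRank</th><th>TschuprowtFilter</th><th>TschuprowtWith</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Categorical('Region')</td><td>0.0000</td><td>0.4798</td><td>0.0459</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>1</th><td>Ordinal('HouseAgeBand')</td><td>0.0000</td><td>0.3486</td><td>0.0047</td><td>1.0000</td><td>0.0842</td><td>Region</td></tr>\n",
"</tbody>\n",
"</table>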
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Features(['Region', 'HouseAgeBand', 'MedInc', 'AveRooms', 'AveOccup'])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig\n", "from AutoCarver import RegressionSelector\n", "\n", "feature_selector = RegressionSelector(\n", "    features=features,\n", "    n_best_per_type=n_best_per_type,\n", "    config=ProcessingConfig(verbose=True),  # displays statistics\n", ")\n", "best_features = feature_selector.fit(train_set, train_set[target]).selected_features\n", "best_features" ] }, { "cell_type": "markdown", "id": "4f4a79bb", "metadata": {}, "source": [ "After `fit`, the selector's `selected_features` attribute holds the selected `Features`; equivalently, `feature_selector.transform(train_set)` returns `train_set` restricted to the selected columns. Each feature also carries the computed statistics, for example the reversed Kruskal-Wallis eta-squared used to rank the qualitative `Region`:" ] }, { "cell_type": "code", "execution_count": 8, "id": "5734afef", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:16.005177Z", "iopub.status.busy": "2026-06-14T00:59:16.005177Z", "iopub.status.idle": "2026-06-14T00:59:16.011320Z", "shell.execute_reply": "2026-06-14T00:59:16.011320Z" } }, "outputs": [ { "data": { "text/plain": [ "{'Nan': {'value': np.float64(0.0),\n", " 'threshold': 1.0,\n", " 'valid': np.True_,\n", " 'info': {'higher_is_better': False,\n", " 'correlation_with': 'itself',\n", " 'is_default': True,\n", " 'is_absolute': False}},\n", " 'Mode': {'value': np.float64(0.4797512293896442),\n", " 'threshold': 1.0,\n", " 'valid': np.True_,\n", " 'info': {'higher_is_better': False,\n", " 'correlation_with': 'itself',\n", " 'is_default': True,\n", " 'is_absolute': False}},\n", " 'KruskalEtaSquaredMeasure': {'value': 0.04593188463880526,\n", " 'threshold': 0.0,\n", " 'valid': True,\n", " 'info': 
{'higher_is_better': True,\n", " 'correlation_with': 'target',\n", " 'is_default': False,\n", " 'is_absolute': False}},\n", " 'KruskalEtaSquaredRank': {'value': 0,\n", " 'threshold': -1,\n", " 'valid': True,\n", " 'info': {'is_default': False, 'higher_is_better': False}}}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features(\"Region\").measures" ] }, { "cell_type": "markdown", "id": "13fa78ac", "metadata": {}, "source": [ "## *Optional:* choosing the measures and filters\n", "\n", "The measures and filters are swappable: provide your own to change how features are ranked and de-correlated. See the available [measures](https://autocarver.readthedocs.io/en/latest/selectors.html#association-measures-x-by-y) and [filters](https://autocarver.readthedocs.io/en/latest/selectors.html#association-filters-x-by-x).\n", "\n", "Here we:\n", "\n", "* rank quantitative features with **Pearson's r** (instead of Spearman's rho),\n", "* keep the **Kruskal-Wallis eta-squared** for qualitative features,\n", "* drop features with more than 30% missing values (`NanMeasure`) or more than 30% outliers (`ZscoreOutlierMeasure`),\n", "* de-correlate with **Pearson** (quantitative) and **Tschuprow's T** (qualitative) filters, each with a 0.25 threshold." 
] }, { "cell_type": "code", "execution_count": 9, "id": "a8b9f136", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:16.014336Z", "iopub.status.busy": "2026-06-14T00:59:16.014336Z", "iopub.status.idle": "2026-06-14T00:59:16.019105Z", "shell.execute_reply": "2026-06-14T00:59:16.018599Z" } }, "outputs": [], "source": [ "from AutoCarver.selectors import (\n", " KruskalEtaSquaredMeasure,\n", " NanMeasure,\n", " PearsonMeasure,\n", " ZscoreOutlierMeasure,\n", " PearsonFilter,\n", " TschuprowtFilter,\n", ")\n", "\n", "measures = [\n", " NanMeasure(threshold=0.3),\n", " ZscoreOutlierMeasure(threshold=0.3),\n", " PearsonMeasure(),\n", " KruskalEtaSquaredMeasure(),\n", "]\n", "filters = [PearsonFilter(threshold=0.25), TschuprowtFilter(threshold=0.25)]" ] }, { "cell_type": "code", "execution_count": 10, "id": "ce803f20", "metadata": { "execution": { "iopub.execute_input": "2026-06-14T00:59:16.022306Z", "iopub.status.busy": "2026-06-14T00:59:16.021308Z", "iopub.status.idle": "2026-06-14T00:59:16.077330Z", "shell.execute_reply": "2026-06-14T00:59:16.077330Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [RegressionSelector] Selected Quantitative Features \n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
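<table>\n",
"<thead><tr><th></th><th>feature</th><th>Mode</th><th>Nan</th><th>ZScore</th><th>PearsonMeasure</th><th>PearsonRank</th><th>PearsonFilter</th><th>PearsonWith</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Quantitative('MedInc')</td><td>0.0027</td><td>0.0000</td><td>0.0163</td><td>0.6884</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>2</th><td>Quantitative('AveBedrms')</td><td>0.0132</td><td>0.0000</td><td>0.0076</td><td>-0.0489</td><td>1.0000</td><td>-0.0713</td><td>MedInc</td></tr>\n",
"<tr><th>3</th><td>Quantitative('Population')</td><td>0.0014</td><td>0.0000</td><td>0.0158</td><td>-0.0244</td><td>2.0000</td><td>-0.0716</td><td>AveBedrms</td></tr>\n",
"<tr><th>4</th><td>Quantitative('AveOccup')</td><td>0.0017</td><td>0.0000</td><td>0.0004</td><td>-0.0206</td><td>3.0000</td><td>0.0759</td><td>Population</td></tr>\n",
"<tr><th>1</th><td>Quantitative('AveRooms')</td><td>0.0013</td><td>0.0000</td><td>0.0064</td><td>0.1520</td><td>nan</td><td>0.3234</td><td>MedInc</td></tr>\n",
"</tbody>\n",
"</table>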
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " [RegressionSelector] Selected Qualitative Features \n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
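<table>\n",
"<thead><tr><th></th><th>feature</th><th>Mode</th><th>Nan</th><th>KruskalEtaSquaredMeasure</th><th>KruskalEtaSquaredRank</th><th>TschuprowtFilter</th><th>TschuprowtWith</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Categorical('Region')</td><td>0.4798</td><td>0.0000</td><td>0.0459</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>1</th><td>Ordinal('HouseAgeBand')</td><td>0.3486</td><td>0.0000</td><td>0.0047</td><td>1.0000</td><td>0.0842</td><td>Region</td></tr>\n",
"</tbody>\n",
"</table>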
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Features(['Region', 'HouseAgeBand', 'MedInc', 'AveBedrms', 'Population'])" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "custom_selector = RegressionSelector(\n", " features=features,\n", " n_best_per_type=n_best_per_type,\n", " measures=measures,\n", " filters=filters,\n", " config=ProcessingConfig(verbose=True),\n", ")\n", "custom_selector.fit(train_set, train_set[target]).selected_features" ] }, { "cell_type": "markdown", "id": "e37c6f03", "metadata": {}, "source": [ "## What's next?\n", "\n", "* You've selected the features most associated with your regression target!\n", "* Head over to the [Carvers Examples](https://autocarver.readthedocs.io/en/latest/carvers_examples.html) — in particular the Continuous Regression example — to maximize the predictive power of the selected features." ] } ], "metadata": { "kernelspec": { "display_name": "AutoCarver", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.15" } }, "nbformat": 4, "nbformat_minor": 5 }