{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting things up\n",
"\n",
"## About this notebook\n",
"\n",
"This notebook shows how to preprocess the California Housing Prices dataset with ``ContinuousCarver``, the AutoCarver pipeline that discretizes features so as to maximize their association with a continuous target. ``ContinuousCarver`` handles both quantitative and qualitative features. The goal here is to prepare the dataset for a continuous regression task: predicting median house values.\n",
"\n",
"The dataset describes California housing districts with features such as median income, house age, average numbers of rooms and bedrooms, population, occupancy and location. ``ContinuousCarver`` is used to discretize the quantitative features so that they are well suited to regression models.\n",
"\n",
"The notebook walks through the discretization pipeline step by step: choosing the settings, fitting on a training set, and checking that the carved features remain robust on a development set.\n",
"\n",
"\n",
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %pip install AutoCarver[jupyter]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## California Housing Prices Data\n",
"\n",
"In this example notebook, we will use the California Housing Prices dataset.\n",
"\n",
"The California Housing Prices dataset is a well-known dataset in the field of machine learning and statistics. It provides information about housing districts in California and is frequently used for regression analysis and predictive modeling tasks.\n",
"\n",
"It contains housing-related metrics for each district: median income, median house age, average rooms, average bedrooms, population, average occupancy, latitude and longitude. The target is the median house value, which makes this a continuous regression task."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n",
"0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n",
"1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n",
"2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n",
"3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n",
"4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n",
"\n",
" Longitude MedHouseVal \n",
"0 -122.23 4.526 \n",
"1 -122.22 3.585 \n",
"2 -122.24 3.521 \n",
"3 -122.25 3.413 \n",
"4 -122.25 3.422 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn import datasets\n",
"\n",
"# Load dataset directly from sklearn\n",
"housing = datasets.fetch_california_housing(as_frame=True)\n",
"\n",
"# features are already a pandas DataFrame (as_frame=True); add the target as a column\n",
"housing_data = housing[\"data\"]\n",
"housing_data[housing[\"target_names\"][0]] = housing[\"target\"]\n",
"\n",
"# Display the first few rows of the dataset\n",
"housing_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Target type and Carver selection"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 20640.000000\n",
"mean 2.068558\n",
"std 1.153956\n",
"min 0.149990\n",
"25% 1.196000\n",
"50% 1.797000\n",
"75% 2.647250\n",
"max 5.000010\n",
"Name: MedHouseVal, dtype: float64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target = \"MedHouseVal\"\n",
"\n",
"housing_data[target].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The target ``\"MedHouseVal\"`` is a continuous target of type ``float64``, used in a regression task. Hence we will use ``AutoCarver.ContinuousCarver`` and ``AutoCarver.selectors.RegressionSelector`` in the following code blocks."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Sampling"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(2.0666362048018514, 2.072459655020552)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# random split (no stratification: the target is continuous)\n",
"train_set, dev_set = train_test_split(housing_data, test_size=0.33, random_state=42)\n",
"\n",
"# checking the target mean per dataset\n",
"train_set[target].mean(), dev_set[target].mean()"
]
},
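{
"cell_type": "markdown",
"metadata": {},
"source": [
"``train_test_split`` draws a plain random split here, which is enough to get close target means on both sets. If a stratified split on a continuous target is wanted, one common approach is to bin the target into quantiles and pass the bins to ``stratify``. The cell below is a minimal sketch on synthetic data: ``toy`` and ``y_bins`` are hypothetical names, and this step is not part of the original pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# synthetic data standing in for a dataset with a continuous target (hypothetical)\n",
"rng = np.random.default_rng(0)\n",
"toy = pd.DataFrame({'feature': rng.normal(size=1_000), 'y': rng.gamma(2.0, size=1_000)})\n",
"\n",
"# bin the continuous target into deciles, then stratify the split on those bins\n",
"y_bins = pd.qcut(toy['y'], q=10, labels=False)\n",
"toy_train, toy_dev = train_test_split(toy, test_size=0.33, random_state=42, stratify=y_bins)\n",
"\n",
"# target means of both sets, for comparison\n",
"toy_train['y'].mean(), toy_dev['y'].mean()"
]
},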
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Picking columns to carve"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n",
"5088 0.9809 19.0 3.187726 1.129964 726.0 2.620939 33.98 \n",
"17096 4.2232 33.0 6.189696 1.086651 1015.0 2.377049 37.46 \n",
"5617 3.5488 42.0 4.821577 1.095436 1044.0 4.331950 33.79 \n",
"20060 1.6469 24.0 4.274194 1.048387 1686.0 4.532258 35.87 \n",
"895 3.9909 14.0 4.608303 1.089350 2738.0 2.471119 37.54 \n",
"\n",
" Longitude MedHouseVal \n",
"5088 -118.28 1.214 \n",
"17096 -122.23 3.637 \n",
"5617 -118.26 2.056 \n",
"20060 -119.26 0.476 \n",
"895 -121.96 2.360 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"MedInc float64\n",
"HouseAge float64\n",
"AveRooms float64\n",
"AveBedrms float64\n",
"Population float64\n",
"AveOccup float64\n",
"Latitude float64\n",
"Longitude float64\n",
"MedHouseVal float64\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# column data types\n",
"train_set.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All features are continuous quantitative features, with the exception of ``Latitude`` and ``Longitude``, which are geographical features (not supported by ``AutoCarver`` as is). All other features are listed as ``quantitatives`` in the ``Features`` object below."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from AutoCarver import Features\n",
"\n",
"# lists of features per data type\n",
"features = Features(quantitatives=[\"MedInc\", \"HouseAge\", \"AveRooms\", \"AveBedrms\", \"Population\", \"AveOccup\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using AutoCarver"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AutoCarver settings\n",
"\n",
"### Representativeness of modalities\n",
"\n",
"The attribute ``min_freq`` sets the minimum frequency of each basic modality. It is used:\n",
"\n",
"- For quantitative features, to define the number of quantiles used for the initial discretization.\n",
"\n",
"- For qualitative features, to define the threshold below which a modality is grouped with either a default value or its closest modality."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"min_freq = 0.1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)"
]
},
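{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effect of ``min_freq`` on quantitative features can be pictured with a plain quantile binning. The cell below is a minimal sketch on synthetic data using ``pandas.qcut``; it is not AutoCarver's internal code. The names ``toy_values``, ``min_freq_example`` and ``n_quantiles`` are hypothetical, and the choice of ``2 / min_freq`` bins is an assumption made only to mirror the bins of roughly 5% each that appear in the raw distributions of the fit output further down, where ``min_freq = 0.1``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# synthetic quantitative feature (hypothetical data)\n",
"rng = np.random.default_rng(0)\n",
"toy_values = pd.Series(rng.normal(size=10_000))\n",
"\n",
"min_freq_example = 0.1\n",
"# assumption: 2 / min_freq quantiles, i.e. bins of about min_freq / 2 each\n",
"n_quantiles = round(2 / min_freq_example)\n",
"\n",
"# quantile binning, then the observed frequency of each bin\n",
"toy_bins = pd.qcut(toy_values, q=n_quantiles)\n",
"toy_bins.value_counts(normalize=True).sort_index()"
]
},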
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Desired number of modalities\n",
"\n",
"The attribute ``max_n_mod`` sets the maximum number of modalities per carved feature. **Carvers** use it as the upper limit on the number of modalities in each consecutive combination of modalities."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"max_n_mod = 4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Grouping NaNs\n",
"\n",
"The attribute ``dropna`` allows one to choose whether or not ``nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-``nan`` values, and then test out all possible combinations with ``nan``."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"dropna = False  # this dataset contains no NaN anyway"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Output type of carved features\n",
"\n",
"The attribute ``ordinal_encoding`` allows one to choose the output type:\n",
"\n",
"* Use ``True`` for integer output of ranked modalities (default)\n",
"* Use ``False`` for string output of modalities"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"ordinal_encoding = True"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Fitting AutoCarver\n",
"\n",
"* First, all quantitative features are discretized:\n",
" 1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq``)\n",
" 2. Using ``OrdinalDiscretizer`` to group any remaining under-represented values (less frequent than ``min_freq/2``) with their closest modality\n",
"\n",
"* Second, all features are carved following this recipe, for all classes of ``train_set[target]`` (except one):\n",
" 1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step\n",
" 2. Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``\n",
" 3. Computing associations: the association metric (Kruskal-Wallis' statistic, by default) is computed with the provided ``train_set[target]``\n",
" 4. Combinations are sorted in descending order by association value\n",
" 5. Testing robustness: the first combination that satisfies all of the following is selected:\n",
" - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq/2``)\n",
" - Distinct target rates per consecutive modalities on ``train_set`` and ``dev_set``\n",
" - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)\n",
" 6. (Optional) If requested via ``dropna=True`` and the feature contains ``nan``, all combinations of modalities with ``nan`` are applied to ``train_set`` and steps 3 and 4 are run\n",
" 7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step"
]
},
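{
"cell_type": "markdown",
"metadata": {},
"source": [
"Steps 2 to 4 of the carving recipe above can be sketched on synthetic data. The cell below is an illustration under stated assumptions, not AutoCarver's implementation: ``consecutive_groupings`` and ``kruskal_statistic`` are hypothetical helpers, and the association is computed with ``scipy.stats.kruskal``. Cutting 20 ordered modalities into 2 to 4 consecutive groups gives 19 + 171 + 969 = 1159 groupings, which is the total shown by the progress bars in the output below for features with 20 raw modalities and ``max_n_mod = 4``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import combinations\n",
"from math import comb\n",
"\n",
"import numpy as np\n",
"from scipy.stats import kruskal\n",
"\n",
"\n",
"def consecutive_groupings(n_modalities, max_groups):\n",
"    # hypothetical helper: every way to cut ordered modalities into 2..max_groups consecutive groups\n",
"    for n_groups in range(2, max_groups + 1):\n",
"        # a grouping is a choice of n_groups - 1 cut points among the n_modalities - 1 gaps\n",
"        yield from combinations(range(1, n_modalities), n_groups - 1)\n",
"\n",
"\n",
"# 20 raw modalities and at most 4 groups: enumeration and closed form agree\n",
"print(sum(1 for _ in consecutive_groupings(20, 4)), sum(comb(19, k) for k in range(1, 4)))\n",
"\n",
"# synthetic data: 8 ordered modalities and a target that loosely increases with them\n",
"rng = np.random.default_rng(0)\n",
"toy_codes = rng.integers(0, 8, size=2_000)\n",
"toy_target = toy_codes + rng.normal(scale=3, size=2_000)\n",
"\n",
"\n",
"def kruskal_statistic(cuts):\n",
"    # hypothetical helper: Kruskal-Wallis H statistic of the target across the groups defined by cuts\n",
"    group_of_row = np.searchsorted(cuts, toy_codes, side='right')\n",
"    return kruskal(*(toy_target[group_of_row == g] for g in range(len(cuts) + 1))).statistic\n",
"\n",
"\n",
"# rank all groupings into 2 or 3 groups by descending association and show the top three\n",
"ranked = sorted(consecutive_groupings(8, 3), key=kruskal_statistic, reverse=True)\n",
"ranked[:3]"
]
},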
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"------\n",
"--- [QuantitativeDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])\n",
" - [ContinuousDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])\n",
" - [OrdinalDiscretizer] Fit Features(['HouseAge'])\n",
"------\n",
"\n",
"---------\n",
"------ [ContinuousCarver] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup'])\n",
"--- [ContinuousCarver] Fit Quantitative('MedInc') (1/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 1.60e+00</th><td>1.1102</td><td>0.0500</td></tr>\n",
"<tr><th>1.60e+00 &lt; x &lt;= 1.91e+00</th><td>1.1285</td><td>0.0500</td></tr>\n",
"<tr><th>1.91e+00 &lt; x &lt;= 2.15e+00</th><td>1.2198</td><td>0.0500</td></tr>\n",
"<tr><th>2.15e+00 &lt; x &lt;= 2.35e+00</th><td>1.3171</td><td>0.0500</td></tr>\n",
"<tr><th>2.35e+00 &lt; x &lt;= 2.57e+00</th><td>1.3817</td><td>0.0500</td></tr>\n",
"<tr><th>2.57e+00 &lt; x &lt;= 2.74e+00</th><td>1.5409</td><td>0.0500</td></tr>\n",
"<tr><th>2.74e+00 &lt; x &lt;= 2.98e+00</th><td>1.6159</td><td>0.0500</td></tr>\n",
"<tr><th>2.98e+00 &lt; x &lt;= 3.14e+00</th><td>1.6906</td><td>0.0499</td></tr>\n",
"<tr><th>3.14e+00 &lt; x &lt;= 3.32e+00</th><td>1.8232</td><td>0.0500</td></tr>\n",
"<tr><th>3.32e+00 &lt; x &lt;= 3.54e+00</th><td>1.9059</td><td>0.0500</td></tr>\n",
"<tr><th>3.54e+00 &lt; x &lt;= 3.73e+00</th><td>2.0076</td><td>0.0502</td></tr>\n",
"<tr><th>3.73e+00 &lt; x &lt;= 3.97e+00</th><td>2.0271</td><td>0.0498</td></tr>\n",
"<tr><th>3.97e+00 &lt; x &lt;= 4.18e+00</th><td>2.1456</td><td>0.0500</td></tr>\n",
"<tr><th>4.18e+00 &lt; x &lt;= 4.46e+00</th><td>2.2433</td><td>0.0500</td></tr>\n",
"<tr><th>4.46e+00 &lt; x &lt;= 4.76e+00</th><td>2.3621</td><td>0.0501</td></tr>\n",
"<tr><th>4.76e+00 &lt; x &lt;= 5.12e+00</th><td>2.3986</td><td>0.0499</td></tr>\n",
"<tr><th>5.12e+00 &lt; x &lt;= 5.54e+00</th><td>2.6438</td><td>0.0500</td></tr>\n",
"<tr><th>5.54e+00 &lt; x &lt;= 6.16e+00</th><td>2.9324</td><td>0.0500</td></tr>\n",
"<tr><th>6.16e+00 &lt; x &lt;= 7.32e+00</th><td>3.4592</td><td>0.0500</td></tr>\n",
"<tr><th>7.32e+00 &lt; x</th><td>4.3784</td><td>0.0500</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1.1017</td><td>0.0509</td></tr>\n",
"<tr><td>1.0410</td><td>0.0502</td></tr>\n",
"<tr><td>1.2407</td><td>0.0501</td></tr>\n",
"<tr><td>1.2919</td><td>0.0506</td></tr>\n",
"<tr><td>1.4676</td><td>0.0536</td></tr>\n",
"<tr><td>1.5605</td><td>0.0417</td></tr>\n",
"<tr><td>1.6280</td><td>0.0584</td></tr>\n",
"<tr><td>1.7519</td><td>0.0471</td></tr>\n",
"<tr><td>1.8443</td><td>0.0504</td></tr>\n",
"<tr><td>1.8500</td><td>0.0498</td></tr>\n",
"<tr><td>2.0040</td><td>0.0533</td></tr>\n",
"<tr><td>2.0890</td><td>0.0502</td></tr>\n",
"<tr><td>2.1641</td><td>0.0505</td></tr>\n",
"<tr><td>2.2700</td><td>0.0540</td></tr>\n",
"<tr><td>2.3768</td><td>0.0439</td></tr>\n",
"<tr><td>2.5087</td><td>0.0479</td></tr>\n",
"<tr><td>2.6814</td><td>0.0483</td></tr>\n",
"<tr><td>2.9805</td><td>0.0479</td></tr>\n",
"<tr><td>3.3748</td><td>0.0530</td></tr>\n",
"<tr><td>4.3748</td><td>0.0483</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 1158/1159 [00:00<00:00, 1300.95it/s]\n",
"Computing associations: 100%|██████████| 1159/1159 [00:04<00:00, 280.31it/s]\n",
"Testing robustness : 0%| | 0/1159 [00:00, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 2.57e+00</th><td>1.2314</td><td>0.2500</td></tr>\n",
"<tr><th>2.57e+00 &lt; x &lt;= 3.97e+00</th><td>1.8016</td><td>0.3500</td></tr>\n",
"<tr><th>3.97e+00 &lt; x &lt;= 5.54e+00</th><td>2.3587</td><td>0.2499</td></tr>\n",
"<tr><th>5.54e+00 &lt; x</th><td>3.5900</td><td>0.1501</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1.2315</td><td>0.2554</td></tr>\n",
"<tr><td>1.8222</td><td>0.3509</td></tr>\n",
"<tr><td>2.3953</td><td>0.2446</td></tr>\n",
"<tr><td>3.5721</td><td>0.1491</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [ContinuousCarver] Fit Quantitative('HouseAge') (2/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 8.00e+00</th><td>2.1158</td><td>0.0537</td></tr>\n",
"<tr><th>8.00e+00 &lt; x &lt;= 1.20e+01</th><td>1.8220</td><td>0.0477</td></tr>\n",
"<tr><th>1.20e+01 &lt; x &lt;= 1.50e+01</th><td>1.8590</td><td>0.0613</td></tr>\n",
"<tr><th>1.50e+01 &lt; x &lt;= 1.60e+01</th><td>2.0358</td><td>0.0393</td></tr>\n",
"<tr><th>1.60e+01 &lt; x &lt;= 1.80e+01</th><td>1.9013</td><td>0.0596</td></tr>\n",
"<tr><th>1.80e+01 &lt; x &lt;= 2.00e+01</th><td>1.9399</td><td>0.0468</td></tr>\n",
"<tr><th>2.00e+01 &lt; x &lt;= 2.20e+01</th><td>2.0134</td><td>0.0404</td></tr>\n",
"<tr><th>2.20e+01 &lt; x &lt;= 2.50e+01</th><td>2.1055</td><td>0.0705</td></tr>\n",
"<tr><th>2.50e+01 &lt; x &lt;= 2.60e+01</th><td>2.0977</td><td>0.0300</td></tr>\n",
"<tr><th>2.60e+01 &lt; x &lt;= 2.80e+01</th><td>2.0218</td><td>0.0475</td></tr>\n",
"<tr><th>2.80e+01 &lt; x &lt;= 3.10e+01</th><td>2.0439</td><td>0.0682</td></tr>\n",
"<tr><th>3.10e+01 &lt; x &lt;= 3.30e+01</th><td>2.0275</td><td>0.0575</td></tr>\n",
"<tr><th>3.30e+01 &lt; x &lt;= 3.40e+01</th><td>2.1189</td><td>0.0328</td></tr>\n",
"<tr><th>3.40e+01 &lt; x &lt;= 3.50e+01</th><td>2.0204</td><td>0.0395</td></tr>\n",
"<tr><th>3.50e+01 &lt; x &lt;= 3.70e+01</th><td>2.0750</td><td>0.0687</td></tr>\n",
"<tr><th>3.70e+01 &lt; x &lt;= 3.90e+01</th><td>2.0212</td><td>0.0361</td></tr>\n",
"<tr><th>3.90e+01 &lt; x &lt;= 4.20e+01</th><td>2.0013</td><td>0.0450</td></tr>\n",
"<tr><th>4.20e+01 &lt; x &lt;= 4.50e+01</th><td>2.1301</td><td>0.0485</td></tr>\n",
"<tr><th>4.50e+01 &lt; x</th><td>2.4785</td><td>0.1072</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2.0205</td><td>0.0526</td></tr>\n",
"<tr><td>1.7827</td><td>0.0443</td></tr>\n",
"<tr><td>1.8780</td><td>0.0556</td></tr>\n",
"<tr><td>1.9208</td><td>0.0335</td></tr>\n",
"<tr><td>1.9484</td><td>0.0652</td></tr>\n",
"<tr><td>1.9517</td><td>0.0470</td></tr>\n",
"<tr><td>2.1141</td><td>0.0421</td></tr>\n",
"<tr><td>2.1179</td><td>0.0759</td></tr>\n",
"<tr><td>2.0888</td><td>0.0299</td></tr>\n",
"<tr><td>2.2138</td><td>0.0443</td></tr>\n",
"<tr><td>1.9546</td><td>0.0664</td></tr>\n",
"<tr><td>2.0512</td><td>0.0565</td></tr>\n",
"<tr><td>2.1979</td><td>0.0346</td></tr>\n",
"<tr><td>2.1762</td><td>0.0408</td></tr>\n",
"<tr><td>2.0747</td><td>0.0659</td></tr>\n",
"<tr><td>1.9885</td><td>0.0388</td></tr>\n",
"<tr><td>2.0394</td><td>0.0508</td></tr>\n",
"<tr><td>2.0015</td><td>0.0489</td></tr>\n",
"<tr><td>2.4651</td><td>0.1069</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 986/987 [00:00<00:00, 1312.66it/s]\n",
"Computing associations: 100%|██████████| 987/987 [00:03<00:00, 288.78it/s]\n",
"Testing robustness : 1%| | 6/987 [00:00<00:02, 335.58it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 2.20e+01</th><td>1.9494</td><td>0.3486</td></tr>\n",
"<tr><th>2.20e+01 &lt; x &lt;= 2.60e+01</th><td>2.1032</td><td>0.1005</td></tr>\n",
"<tr><th>2.60e+01 &lt; x &lt;= 4.50e+01</th><td>2.0509</td><td>0.4437</td></tr>\n",
"<tr><th>4.50e+01 &lt; x</th><td>2.4785</td><td>0.1072</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1.9447</td><td>0.3403</td></tr>\n",
"<tr><td>2.1097</td><td>0.1058</td></tr>\n",
"<tr><td>2.0670</td><td>0.4470</td></tr>\n",
"<tr><td>2.4651</td><td>0.1069</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [ContinuousCarver] Fit Quantitative('AveRooms') (3/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 3.44e+00</th><td>1.9126</td><td>0.0500</td></tr>\n",
"<tr><th>3.44e+00 &lt; x &lt;= 3.79e+00</th><td>1.8286</td><td>0.0500</td></tr>\n",
"<tr><th>3.79e+00 &lt; x &lt;= 4.06e+00</th><td>1.8169</td><td>0.0500</td></tr>\n",
"<tr><th>4.06e+00 &lt; x &lt;= 4.28e+00</th><td>1.8418</td><td>0.0500</td></tr>\n",
"<tr><th>4.28e+00 &lt; x &lt;= 4.46e+00</th><td>1.7529</td><td>0.0500</td></tr>\n",
"<tr><th>4.46e+00 &lt; x &lt;= 4.62e+00</th><td>1.7915</td><td>0.0500</td></tr>\n",
"<tr><th>4.62e+00 &lt; x &lt;= 4.79e+00</th><td>1.8214</td><td>0.0500</td></tr>\n",
"<tr><th>4.79e+00 &lt; x &lt;= 4.94e+00</th><td>1.7685</td><td>0.0500</td></tr>\n",
"<tr><th>4.94e+00 &lt; x &lt;= 5.09e+00</th><td>1.7466</td><td>0.0500</td></tr>\n",
"<tr><th>5.09e+00 &lt; x &lt;= 5.23e+00</th><td>1.7717</td><td>0.0500</td></tr>\n",
"<tr><th>5.23e+00 &lt; x &lt;= 5.38e+00</th><td>1.8664</td><td>0.0500</td></tr>\n",
"<tr><th>5.38e+00 &lt; x &lt;= 5.53e+00</th><td>1.8472</td><td>0.0500</td></tr>\n",
"<tr><th>5.53e+00 &lt; x &lt;= 5.69e+00</th><td>1.9199</td><td>0.0500</td></tr>\n",
"<tr><th>5.69e+00 &lt; x &lt;= 5.86e+00</th><td>1.9910</td><td>0.0500</td></tr>\n",
"<tr><th>5.86e+00 &lt; x &lt;= 6.06e+00</th><td>2.0870</td><td>0.0500</td></tr>\n",
"<tr><th>6.06e+00 &lt; x &lt;= 6.27e+00</th><td>2.1908</td><td>0.0500</td></tr>\n",
"<tr><th>6.27e+00 &lt; x &lt;= 6.54e+00</th><td>2.4050</td><td>0.0500</td></tr>\n",
"<tr><th>6.54e+00 &lt; x &lt;= 6.95e+00</th><td>2.6874</td><td>0.0500</td></tr>\n",
"<tr><th>6.95e+00 &lt; x &lt;= 7.65e+00</th><td>3.1129</td><td>0.0500</td></tr>\n",
"<tr><th>7.65e+00 &lt; x</th><td>3.1718</td><td>0.0500</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1.8659</td><td>0.0518</td></tr>\n",
"<tr><td>1.8728</td><td>0.0505</td></tr>\n",
"<tr><td>1.7627</td><td>0.0524</td></tr>\n",
"<tr><td>1.8020</td><td>0.0543</td></tr>\n",
"<tr><td>1.7223</td><td>0.0552</td></tr>\n",
"<tr><td>1.6802</td><td>0.0452</td></tr>\n",
"<tr><td>1.7707</td><td>0.0530</td></tr>\n",
"<tr><td>1.8030</td><td>0.0443</td></tr>\n",
"<tr><td>1.8209</td><td>0.0523</td></tr>\n",
"<tr><td>1.8326</td><td>0.0437</td></tr>\n",
"<tr><td>1.7923</td><td>0.0550</td></tr>\n",
"<tr><td>1.9388</td><td>0.0514</td></tr>\n",
"<tr><td>1.9465</td><td>0.0501</td></tr>\n",
"<tr><td>2.0248</td><td>0.0468</td></tr>\n",
"<tr><td>2.1049</td><td>0.0483</td></tr>\n",
"<tr><td>2.2239</td><td>0.0490</td></tr>\n",
"<tr><td>2.4339</td><td>0.0467</td></tr>\n",
"<tr><td>2.7667</td><td>0.0468</td></tr>\n",
"<tr><td>3.1001</td><td>0.0548</td></tr>\n",
"<tr><td>3.2429</td><td>0.0483</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 1158/1159 [00:00<00:00, 1378.37it/s]\n",
"Computing associations: 100%|██████████| 1159/1159 [00:03<00:00, 289.89it/s]\n",
"Testing robustness : 1%| | 7/1159 [00:00<00:03, 304.14it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 5.23e+00</th><td>1.8053</td><td>0.5000</td></tr>\n",
"<tr><th>5.23e+00 &lt; x &lt;= 5.86e+00</th><td>1.9061</td><td>0.2000</td></tr>\n",
"<tr><th>5.86e+00 &lt; x &lt;= 6.54e+00</th><td>2.2275</td><td>0.1500</td></tr>\n",
"<tr><th>6.54e+00 &lt; x</th><td>2.9907</td><td>0.1501</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1.7933</td><td>0.5028</td></tr>\n",
"<tr><td>1.9208</td><td>0.2033</td></tr>\n",
"<tr><td>2.2521</td><td>0.1440</td></tr>\n",
"<tr><td>3.0420</td><td>0.1499</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [ContinuousCarver] Fit Quantitative('AveBedrms') (4/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 9.4000e-01</th><td>2.0684</td><td>0.0500</td></tr>\n",
"<tr><th>9.4000e-01 &lt; x &lt;= 9.6724e-01</th><td>2.0735</td><td>0.0500</td></tr>\n",
"<tr><th>9.6724e-01 &lt; x &lt;= 9.8319e-01</th><td>2.2167</td><td>0.0501</td></tr>\n",
"<tr><th>9.8319e-01 &lt; x &lt;= 9.9576e-01</th><td>2.1706</td><td>0.0499</td></tr>\n",
"<tr><th>9.9576e-01 &lt; x &lt;= 1.0066e+00</th><td>2.1310</td><td>0.0500</td></tr>\n",
"<tr><th>1.0066e+00 &lt; x &lt;= 1.0154e+00</th><td>2.2358</td><td>0.0500</td></tr>\n",
"<tr><th>1.0154e+00 &lt; x &lt;= 1.0247e+00</th><td>2.1668</td><td>0.0500</td></tr>\n",
"<tr><th>1.0247e+00 &lt; x &lt;= 1.0331e+00</th><td>2.2102</td><td>0.0500</td></tr>\n",
"<tr><th>1.0331e+00 &lt; x &lt;= 1.0412e+00</th><td>2.1295</td><td>0.0500</td></tr>\n",
"<tr><th>1.0412e+00 &lt; x &lt;= 1.0495e+00</th><td>2.1548</td><td>0.0500</td></tr>\n",
"<tr><th>1.0495e+00 &lt; x &lt;= 1.0576e+00</th><td>2.1238</td><td>0.0500</td></tr>\n",
"<tr><th>1.0576e+00 &lt; x &lt;= 1.0665e+00</th><td>2.1025</td><td>0.0500</td></tr>\n",
"<tr><th>1.0665e+00 &lt; x &lt;= 1.0768e+00</th><td>2.0704</td><td>0.0500</td></tr>\n",
"<tr><th>1.0768e+00 &lt; x &lt;= 1.0878e+00</th><td>2.0664</td><td>0.0501</td></tr>\n",
"<tr><th>1.0878e+00 &lt; x &lt;= 1.1003e+00</th><td>2.1118</td><td>0.0499</td></tr>\n",
"<tr><th>1.1003e+00 &lt; x &lt;= 1.1161e+00</th><td>1.9937</td><td>0.0500</td></tr>\n",
"<tr><th>1.1161e+00 &lt; x &lt;= 1.1382e+00</th><td>1.9405</td><td>0.0500</td></tr>\n",
"<tr><th>1.1382e+00 &lt; x &lt;= 1.1738e+00</th><td>1.7990</td><td>0.0500</td></tr>\n",
"<tr><th>1.1738e+00 &lt; x &lt;= 1.2732e+00</th><td>1.9162</td><td>0.0500</td></tr>\n",
"<tr><th>1.2732e+00 &lt; x</th><td>1.6515</td><td>0.0500</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2.0416</td><td>0.0539</td></tr>\n",
"<tr><td>2.2043</td><td>0.0527</td></tr>\n",
"<tr><td>2.0997</td><td>0.0482</td></tr>\n",
"<tr><td>2.1835</td><td>0.0487</td></tr>\n",
"<tr><td>2.2628</td><td>0.0552</td></tr>\n",
"<tr><td>2.1619</td><td>0.0480</td></tr>\n",
"<tr><td>2.2295</td><td>0.0567</td></tr>\n",
"<tr><td>2.1690</td><td>0.0493</td></tr>\n",
"<tr><td>2.1581</td><td>0.0528</td></tr>\n",
"<tr><td>2.1202</td><td>0.0476</td></tr>\n",
"<tr><td>2.1039</td><td>0.0452</td></tr>\n",
"<tr><td>2.1595</td><td>0.0509</td></tr>\n",
"<tr><td>2.1037</td><td>0.0521</td></tr>\n",
"<tr><td>2.0662</td><td>0.0484</td></tr>\n",
"<tr><td>2.0487</td><td>0.0489</td></tr>\n",
"<tr><td>1.9543</td><td>0.0467</td></tr>\n",
"<tr><td>1.8871</td><td>0.0484</td></tr>\n",
"<tr><td>1.8680</td><td>0.0499</td></tr>\n",
"<tr><td>1.8371</td><td>0.0465</td></tr>\n",
"<tr><td>1.7182</td><td>0.0498</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 1158/1159 [00:00<00:00, 1278.14it/s]\n",
"Computing associations: 100%|██████████| 1159/1159 [00:04<00:00, 277.88it/s]\n",
"Testing robustness : 3%|▎ | 35/1159 [00:00<00:02, 512.40it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"<caption>X distribution</caption>\n",
"<thead>\n",
"<tr><th></th><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>x &lt;= 9.672e-01</th><td>2.0709</td><td>0.1000</td></tr>\n",
"<tr><th>9.672e-01 &lt; x &lt;= 1.058e+00</th><td>2.1710</td><td>0.4500</td></tr>\n",
"<tr><th>1.058e+00 &lt; x &lt;= 1.138e+00</th><td>2.0475</td><td>0.2999</td></tr>\n",
"<tr><th>1.138e+00 &lt; x</th><td>1.7888</td><td>0.1501</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<table>\n",
"<caption>X_dev distribution</caption>\n",
"<thead>\n",
"<tr><th>target_mean</th><th>frequency</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>2.1221</td><td>0.1066</td></tr>\n",
"<tr><td>2.1685</td><td>0.4517</td></tr>\n",
"<tr><td>2.0390</td><td>0.2955</td></tr>\n",
"<tr><td>1.8072</td><td>0.1462</td></tr>\n",
"</tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [ContinuousCarver] Fit Quantitative('Population') (5/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 3.53e+02 | \n",
" 1.9859 | \n",
" 0.0501 | \n",
"
\n",
" \n",
" | 3.53e+02 < x <= 5.14e+02 | \n",
" 2.1616 | \n",
" 0.0501 | \n",
"
\n",
" \n",
" | 5.14e+02 < x <= 6.27e+02 | \n",
" 2.1117 | \n",
" 0.0501 | \n",
"
\n",
" \n",
" | 6.27e+02 < x <= 7.15e+02 | \n",
" 2.2819 | \n",
" 0.0497 | \n",
"
\n",
" \n",
" | 7.15e+02 < x <= 7.93e+02 | \n",
" 2.0335 | \n",
" 0.0509 | \n",
"
\n",
" \n",
" | 7.93e+02 < x <= 8.64e+02 | \n",
" 2.2113 | \n",
" 0.0492 | \n",
"
\n",
" \n",
" | 8.64e+02 < x <= 9.38e+02 | \n",
" 2.0772 | \n",
" 0.0498 | \n",
"
\n",
" \n",
" | 9.38e+02 < x <= 1.02e+03 | \n",
" 2.1386 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 1.02e+03 < x <= 1.09e+03 | \n",
" 2.0430 | \n",
" 0.0503 | \n",
"
\n",
" \n",
" | 1.09e+03 < x <= 1.17e+03 | \n",
" 2.0506 | \n",
" 0.0496 | \n",
"
\n",
" \n",
" | 1.17e+03 < x <= 1.26e+03 | \n",
" 2.0870 | \n",
" 0.0505 | \n",
"
\n",
" \n",
" | 1.26e+03 < x <= 1.35e+03 | \n",
" 2.0195 | \n",
" 0.0497 | \n",
"
\n",
" \n",
" | 1.35e+03 < x <= 1.46e+03 | \n",
" 2.0004 | \n",
" 0.0502 | \n",
"
\n",
" \n",
" | 1.46e+03 < x <= 1.58e+03 | \n",
" 2.1102 | \n",
" 0.0498 | \n",
"
\n",
" \n",
" | 1.58e+03 < x <= 1.73e+03 | \n",
" 2.0346 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 1.73e+03 < x <= 1.91e+03 | \n",
" 1.9139 | \n",
" 0.0499 | \n",
"
\n",
" \n",
" | 1.91e+03 < x <= 2.15e+03 | \n",
" 2.0006 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.15e+03 < x <= 2.56e+03 | \n",
" 2.0707 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.56e+03 < x <= 3.30e+03 | \n",
" 1.9614 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 3.30e+03 < x | \n",
" 2.0428 | \n",
" 0.0500 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 1.9012 | \n",
" 0.0530 | \n",
"
\n",
" \n",
" | 2.1915 | \n",
" 0.0520 | \n",
"
\n",
" \n",
" | 2.1706 | \n",
" 0.0523 | \n",
"
\n",
" \n",
" | 2.1062 | \n",
" 0.0514 | \n",
"
\n",
" \n",
" | 2.2019 | \n",
" 0.0531 | \n",
"
\n",
" \n",
" | 2.1765 | \n",
" 0.0490 | \n",
"
\n",
" \n",
" | 2.2025 | \n",
" 0.0506 | \n",
"
\n",
" \n",
" | 2.1329 | \n",
" 0.0553 | \n",
"
\n",
" \n",
" | 2.1744 | \n",
" 0.0437 | \n",
"
\n",
" \n",
" | 2.1319 | \n",
" 0.0480 | \n",
"
\n",
" \n",
" | 1.9939 | \n",
" 0.0534 | \n",
"
\n",
" \n",
" | 2.0096 | \n",
" 0.0465 | \n",
"
\n",
" \n",
" | 1.9569 | \n",
" 0.0465 | \n",
"
\n",
" \n",
" | 1.9756 | \n",
" 0.0504 | \n",
"
\n",
" \n",
" | 2.0815 | \n",
" 0.0496 | \n",
"
\n",
" \n",
" | 2.0272 | \n",
" 0.0461 | \n",
"
\n",
" \n",
" | 1.9789 | \n",
" 0.0487 | \n",
"
\n",
" \n",
" | 1.9355 | \n",
" 0.0496 | \n",
"
\n",
" \n",
" | 2.0714 | \n",
" 0.0518 | \n",
"
\n",
" \n",
" | 2.0157 | \n",
" 0.0487 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 1158/1159 [00:00<00:00, 1364.07it/s]\n",
"Computing associations: 100%|██████████| 1159/1159 [00:03<00:00, 294.36it/s]\n",
"Testing robustness : 16%|█▋ | 191/1159 [00:00<00:01, 626.53it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 6.27e+02 | \n",
" 2.0864 | \n",
" 0.1503 | \n",
"
\n",
" \n",
" | 6.27e+02 < x <= 8.64e+02 | \n",
" 2.1743 | \n",
" 0.1498 | \n",
"
\n",
" \n",
" | 8.64e+02 < x <= 2.15e+03 | \n",
" 2.0433 | \n",
" 0.5498 | \n",
"
\n",
" \n",
" | 2.15e+03 < x | \n",
" 2.0250 | \n",
" 0.1501 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2.0867 | \n",
" 0.1572 | \n",
"
\n",
" \n",
" | 2.1618 | \n",
" 0.1536 | \n",
"
\n",
" \n",
" | 2.0607 | \n",
" 0.5390 | \n",
"
\n",
" \n",
" | 2.0084 | \n",
" 0.1502 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [ContinuousCarver] Fit Quantitative('AveOccup') (6/6)\n",
" [ContinuousCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 1.870e+00 | \n",
" 2.7122 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 1.870e+00 < x <= 2.067e+00 | \n",
" 2.6633 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.067e+00 < x <= 2.225e+00 | \n",
" 2.3373 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.225e+00 < x <= 2.338e+00 | \n",
" 2.3080 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.338e+00 < x <= 2.432e+00 | \n",
" 2.1976 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.432e+00 < x <= 2.513e+00 | \n",
" 2.2064 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.513e+00 < x <= 2.595e+00 | \n",
" 2.1736 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.595e+00 < x <= 2.668e+00 | \n",
" 2.1862 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.668e+00 < x <= 2.743e+00 | \n",
" 2.1378 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.743e+00 < x <= 2.820e+00 | \n",
" 2.1902 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.820e+00 < x <= 2.898e+00 | \n",
" 2.1824 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.898e+00 < x <= 2.984e+00 | \n",
" 2.0741 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 2.984e+00 < x <= 3.073e+00 | \n",
" 2.0255 | \n",
" 0.0501 | \n",
"
\n",
" \n",
" | 3.073e+00 < x <= 3.171e+00 | \n",
" 1.9914 | \n",
" 0.0498 | \n",
"
\n",
" \n",
" | 3.171e+00 < x <= 3.282e+00 | \n",
" 1.8992 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 3.282e+00 < x <= 3.425e+00 | \n",
" 1.8926 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 3.425e+00 < x <= 3.607e+00 | \n",
" 1.7085 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 3.607e+00 < x <= 3.877e+00 | \n",
" 1.5666 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 3.877e+00 < x <= 4.325e+00 | \n",
" 1.4505 | \n",
" 0.0500 | \n",
"
\n",
" \n",
" | 4.325e+00 < x | \n",
" 1.4294 | \n",
" 0.0500 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2.7684 | \n",
" 0.0484 | \n",
"
\n",
" \n",
" | 2.5334 | \n",
" 0.0435 | \n",
"
\n",
" \n",
" | 2.3989 | \n",
" 0.0542 | \n",
"
\n",
" \n",
" | 2.3641 | \n",
" 0.0533 | \n",
"
\n",
" \n",
" | 2.2272 | \n",
" 0.0546 | \n",
"
\n",
" \n",
" | 2.2969 | \n",
" 0.0489 | \n",
"
\n",
" \n",
" | 2.3179 | \n",
" 0.0508 | \n",
"
\n",
" \n",
" | 2.0793 | \n",
" 0.0467 | \n",
"
\n",
" \n",
" | 2.1847 | \n",
" 0.0521 | \n",
"
\n",
" \n",
" | 2.1752 | \n",
" 0.0504 | \n",
"
\n",
" \n",
" | 2.0762 | \n",
" 0.0533 | \n",
"
\n",
" \n",
" | 2.0535 | \n",
" 0.0501 | \n",
"
\n",
" \n",
" | 2.0535 | \n",
" 0.0528 | \n",
"
\n",
" \n",
" | 1.9477 | \n",
" 0.0458 | \n",
"
\n",
" \n",
" | 1.8397 | \n",
" 0.0449 | \n",
"
\n",
" \n",
" | 1.8861 | \n",
" 0.0514 | \n",
"
\n",
" \n",
" | 1.7301 | \n",
" 0.0448 | \n",
"
\n",
" \n",
" | 1.6200 | \n",
" 0.0499 | \n",
"
\n",
" \n",
" | 1.4423 | \n",
" 0.0527 | \n",
"
\n",
" \n",
" | 1.4596 | \n",
" 0.0515 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 1158/1159 [00:01<00:00, 1152.95it/s]\n",
"Computing associations: 100%|██████████| 1159/1159 [00:03<00:00, 297.60it/s]\n",
"Testing robustness : 0%| | 3/1159 [00:00<00:06, 184.97it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [ContinuousCarver] Carved distribution\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 2.22e+00 | \n",
" 2.5709 | \n",
" 0.1501 | \n",
"
\n",
" \n",
" | 2.22e+00 < x <= 3.07e+00 | \n",
" 2.1681 | \n",
" 0.5001 | \n",
"
\n",
" \n",
" | 3.07e+00 < x <= 3.61e+00 | \n",
" 1.8729 | \n",
" 0.1998 | \n",
"
\n",
" \n",
" | 3.61e+00 < x | \n",
" 1.4822 | \n",
" 0.1501 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 2.5615 | \n",
" 0.1461 | \n",
"
\n",
" \n",
" | 2.1836 | \n",
" 0.5129 | \n",
"
\n",
" \n",
" | 1.8527 | \n",
" 0.1869 | \n",
"
\n",
" \n",
" | 1.5056 | \n",
" 0.1541 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from AutoCarver import ContinuousCarver\n",
"\n",
"# initiating AutoCarver\n",
"auto_carver = ContinuousCarver(\n",
" features=features,\n",
" min_freq=min_freq,\n",
" max_n_mod=max_n_mod,\n",
" dropna=dropna,\n",
" ordinal_encoding=ordinal_encoding,\n",
" verbose=True, # showing statistics\n",
" copy=True, # whether or not to return a copy of the input dataset\n",
")\n",
"\n",
"# fitting on training sample, a dev sample can be specified to evaluate carving robustness\n",
"train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AutoCarver analysis\n",
"\n",
"### Carving Summary"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" content | \n",
" frequency | \n",
"
\n",
" \n",
" | feature | \n",
" target_mean | \n",
" kruskal | \n",
" n_mod | \n",
" label | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | Quantitative('MedInc') | \n",
" 1.231421 | \n",
" 6037.182135 | \n",
" 4 | \n",
" 0 | \n",
" x <= 2.57e+00 | \n",
" 0.250000 | \n",
"
\n",
" \n",
" | 1.801562 | \n",
" 6037.182135 | \n",
" 4 | \n",
" 1 | \n",
" 2.57e+00 < x <= 3.97e+00 | \n",
" 0.350014 | \n",
"
\n",
" \n",
" | 2.358660 | \n",
" 6037.182135 | \n",
" 4 | \n",
" 2 | \n",
" 3.97e+00 < x <= 5.54e+00 | \n",
" 0.249928 | \n",
"
\n",
" \n",
" | 3.590040 | \n",
" 6037.182135 | \n",
" 4 | \n",
" 3 | \n",
" 5.54e+00 < x | \n",
" 0.150058 | \n",
"
\n",
" \n",
" | Quantitative('HouseAge') | \n",
" 1.949361 | \n",
" 163.527841 | \n",
" 4 | \n",
" 0 | \n",
" x <= 2.20e+01 | \n",
" 0.348568 | \n",
"
\n",
" \n",
" | 2.103173 | \n",
" 163.527841 | \n",
" 4 | \n",
" 1 | \n",
" 2.20e+01 < x <= 2.60e+01 | \n",
" 0.100521 | \n",
"
\n",
" \n",
" | 2.050927 | \n",
" 163.527841 | \n",
" 4 | \n",
" 2 | \n",
" 2.60e+01 < x <= 4.50e+01 | \n",
" 0.443665 | \n",
"
\n",
" \n",
" | 2.478542 | \n",
" 163.527841 | \n",
" 4 | \n",
" 3 | \n",
" 4.50e+01 < x | \n",
" 0.107246 | \n",
"
\n",
" \n",
" | Quantitative('AveRooms') | \n",
" 1.805255 | \n",
" 1391.586489 | \n",
" 4 | \n",
" 0 | \n",
" x <= 5.23e+00 | \n",
" 0.500000 | \n",
"
\n",
" \n",
" | 1.906098 | \n",
" 1391.586489 | \n",
" 4 | \n",
" 1 | \n",
" 5.23e+00 < x <= 5.86e+00 | \n",
" 0.199957 | \n",
"
\n",
" \n",
" | 2.227531 | \n",
" 1391.586489 | \n",
" 4 | \n",
" 2 | \n",
" 5.86e+00 < x <= 6.54e+00 | \n",
" 0.149986 | \n",
"
\n",
" \n",
" | 2.990676 | \n",
" 1391.586489 | \n",
" 4 | \n",
" 3 | \n",
" 6.54e+00 < x | \n",
" 0.150058 | \n",
"
\n",
" \n",
" | Quantitative('AveBedrms') | \n",
" 2.070937 | \n",
" 315.794350 | \n",
" 4 | \n",
" 0 | \n",
" x <= 9.672e-01 | \n",
" 0.100014 | \n",
"
\n",
" \n",
" | 2.171033 | \n",
" 315.794350 | \n",
" 4 | \n",
" 1 | \n",
" 9.672e-01 < x <= 1.058e+00 | \n",
" 0.450029 | \n",
"
\n",
" \n",
" | 2.047547 | \n",
" 315.794350 | \n",
" 4 | \n",
" 2 | \n",
" 1.058e+00 < x <= 1.138e+00 | \n",
" 0.299899 | \n",
"
\n",
" \n",
" | 1.788831 | \n",
" 315.794350 | \n",
" 4 | \n",
" 3 | \n",
" 1.138e+00 < x | \n",
" 0.150058 | \n",
"
\n",
" \n",
" | Quantitative('Population') | \n",
" 2.086394 | \n",
" 16.109709 | \n",
" 4 | \n",
" 0 | \n",
" x <= 6.27e+02 | \n",
" 0.150347 | \n",
"
\n",
" \n",
" | 2.174297 | \n",
" 16.109709 | \n",
" 4 | \n",
" 1 | \n",
" 6.27e+02 < x <= 8.64e+02 | \n",
" 0.149841 | \n",
"
\n",
" \n",
" | 2.043255 | \n",
" 16.109709 | \n",
" 4 | \n",
" 2 | \n",
" 8.64e+02 < x <= 2.15e+03 | \n",
" 0.549754 | \n",
"
\n",
" \n",
" | 2.024995 | \n",
" 16.109709 | \n",
" 4 | \n",
" 3 | \n",
" 2.15e+03 < x | \n",
" 0.150058 | \n",
"
\n",
" \n",
" | Quantitative('AveOccup') | \n",
" 2.570888 | \n",
" 991.408301 | \n",
" 4 | \n",
" 0 | \n",
" x <= 2.22e+00 | \n",
" 0.150058 | \n",
"
\n",
" \n",
" | 2.168126 | \n",
" 991.408301 | \n",
" 4 | \n",
" 1 | \n",
" 2.22e+00 < x <= 3.07e+00 | \n",
" 0.500072 | \n",
"
\n",
" \n",
" | 1.872867 | \n",
" 991.408301 | \n",
" 4 | \n",
" 2 | \n",
" 3.07e+00 < x <= 3.61e+00 | \n",
" 0.199812 | \n",
"
\n",
" \n",
" | 1.482183 | \n",
" 991.408301 | \n",
" 4 | \n",
" 3 | \n",
" 3.61e+00 < x | \n",
" 0.150058 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" content \\\n",
"feature target_mean kruskal n_mod label \n",
"Quantitative('MedInc') 1.231421 6037.182135 4 0 x <= 2.57e+00 \n",
" 1.801562 6037.182135 4 1 2.57e+00 < x <= 3.97e+00 \n",
" 2.358660 6037.182135 4 2 3.97e+00 < x <= 5.54e+00 \n",
" 3.590040 6037.182135 4 3 5.54e+00 < x \n",
"Quantitative('HouseAge') 1.949361 163.527841 4 0 x <= 2.20e+01 \n",
" 2.103173 163.527841 4 1 2.20e+01 < x <= 2.60e+01 \n",
" 2.050927 163.527841 4 2 2.60e+01 < x <= 4.50e+01 \n",
" 2.478542 163.527841 4 3 4.50e+01 < x \n",
"Quantitative('AveRooms') 1.805255 1391.586489 4 0 x <= 5.23e+00 \n",
" 1.906098 1391.586489 4 1 5.23e+00 < x <= 5.86e+00 \n",
" 2.227531 1391.586489 4 2 5.86e+00 < x <= 6.54e+00 \n",
" 2.990676 1391.586489 4 3 6.54e+00 < x \n",
"Quantitative('AveBedrms') 2.070937 315.794350 4 0 x <= 9.672e-01 \n",
" 2.171033 315.794350 4 1 9.672e-01 < x <= 1.058e+00 \n",
" 2.047547 315.794350 4 2 1.058e+00 < x <= 1.138e+00 \n",
" 1.788831 315.794350 4 3 1.138e+00 < x \n",
"Quantitative('Population') 2.086394 16.109709 4 0 x <= 6.27e+02 \n",
" 2.174297 16.109709 4 1 6.27e+02 < x <= 8.64e+02 \n",
" 2.043255 16.109709 4 2 8.64e+02 < x <= 2.15e+03 \n",
" 2.024995 16.109709 4 3 2.15e+03 < x \n",
"Quantitative('AveOccup') 2.570888 991.408301 4 0 x <= 2.22e+00 \n",
" 2.168126 991.408301 4 1 2.22e+00 < x <= 3.07e+00 \n",
" 1.872867 991.408301 4 2 3.07e+00 < x <= 3.61e+00 \n",
" 1.482183 991.408301 4 3 3.61e+00 < x \n",
"\n",
" frequency \n",
"feature target_mean kruskal n_mod label \n",
"Quantitative('MedInc') 1.231421 6037.182135 4 0 0.250000 \n",
" 1.801562 6037.182135 4 1 0.350014 \n",
" 2.358660 6037.182135 4 2 0.249928 \n",
" 3.590040 6037.182135 4 3 0.150058 \n",
"Quantitative('HouseAge') 1.949361 163.527841 4 0 0.348568 \n",
" 2.103173 163.527841 4 1 0.100521 \n",
" 2.050927 163.527841 4 2 0.443665 \n",
" 2.478542 163.527841 4 3 0.107246 \n",
"Quantitative('AveRooms') 1.805255 1391.586489 4 0 0.500000 \n",
" 1.906098 1391.586489 4 1 0.199957 \n",
" 2.227531 1391.586489 4 2 0.149986 \n",
" 2.990676 1391.586489 4 3 0.150058 \n",
"Quantitative('AveBedrms') 2.070937 315.794350 4 0 0.100014 \n",
" 2.171033 315.794350 4 1 0.450029 \n",
" 2.047547 315.794350 4 2 0.299899 \n",
" 1.788831 315.794350 4 3 0.150058 \n",
"Quantitative('Population') 2.086394 16.109709 4 0 0.150347 \n",
" 2.174297 16.109709 4 1 0.149841 \n",
" 2.043255 16.109709 4 2 0.549754 \n",
" 2.024995 16.109709 4 3 0.150058 \n",
"Quantitative('AveOccup') 2.570888 991.408301 4 0 0.150058 \n",
" 2.168126 991.408301 4 1 0.500072 \n",
" 1.872867 991.408301 4 2 0.199812 \n",
" 1.482183 991.408301 4 3 0.150058 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_carver.summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* As requested with ``ordinal_encoding=True``, output labels are integer codes for the modalities\n",
"\n",
"* For quantitative feature ``Population``, the selected combination of modalities groups populations as follows:\n",
" * label ``0``: lower than or equal to 627 people (``content=\"x <= 6.27e+02\"``)\n",
" * label ``1``: greater than 627 people and lower than or equal to 864 people (``content=\"6.27e+02 < x <= 8.64e+02\"``)\n",
" * label ``2``: greater than 864 people and lower than or equal to roughly 2150 people (``content=\"8.64e+02 < x <= 2.15e+03\"``)\n",
" * label ``3``: greater than roughly 2150 people (``content=\"2.15e+03 < x\"``)\n",
"\n",
"### Detailed overview of tested combinations"
]
},
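{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of how these ordinal labels relate to the interval bounds, here is a minimal standalone sketch. It is not AutoCarver's implementation: ``label_for`` is a hypothetical helper, and the thresholds are the rounded values displayed in ``content`` (the exact fitted bounds may differ slightly).\n",
"\n",
"```python\n",
"from bisect import bisect_left\n",
"\n",
"# Rounded upper bounds read from the summary above (assumption: the exact\n",
"# fitted thresholds differ slightly from these displayed values).\n",
"POPULATION_BOUNDS = [627.0, 864.0, 2150.0]\n",
"\n",
"\n",
"def label_for(value, bounds=POPULATION_BOUNDS):\n",
"    # Count the bounds strictly below value, so that intervals are\n",
"    # right-closed: bounds[i-1] < value <= bounds[i] gives label i.\n",
"    return bisect_left(bounds, value)\n",
"\n",
"\n",
"labels = [label_for(v) for v in (500, 627, 700, 2150, 5000)]\n",
"print(labels)\n",
"```\n",
"\n",
"``transform`` remains the way to label real data; this sketch only shows the right-closed interval convention."
]
},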
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" info | \n",
" kruskal | \n",
" combination | \n",
" n_mod | \n",
" dropna | \n",
" train | \n",
" viable | \n",
" dev | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Raw distribution (n_mod=20>max_n_mod=4) | \n",
" 1062.072498 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 20 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" Not viable | \n",
" 994.514410 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" {'viable': True, 'info': ''} | \n",
" False | \n",
" {'viable': False, 'info': 'Non-representative ... | \n",
"
\n",
" \n",
" | 2 | \n",
" Not viable | \n",
" 994.504665 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" {'viable': True, 'info': ''} | \n",
" False | \n",
" {'viable': False, 'info': 'Non-representative ... | \n",
"
\n",
" \n",
" | 3 | \n",
" Not viable | \n",
" 991.504255 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" {'viable': True, 'info': ''} | \n",
" False | \n",
" {'viable': False, 'info': 'Non-representative ... | \n",
"
\n",
" \n",
" | 4 | \n",
" Best for kruskal and max_n_mod=4 | \n",
" 991.408301 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" {'viable': True, 'info': ''} | \n",
" True | \n",
" {'viable': True, 'info': ''} | \n",
"
\n",
" \n",
" | 5 | \n",
" Not checked | \n",
" 991.308986 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 6 | \n",
" Not checked | \n",
" 988.666983 | \n",
" {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... | \n",
" 4 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" info kruskal \\\n",
"0 Raw distribution (n_mod=20>max_n_mod=4) 1062.072498 \n",
"1 Not viable 994.514410 \n",
"2 Not viable 994.504665 \n",
"3 Not viable 991.504255 \n",
"4 Best for kruskal and max_n_mod=4 991.408301 \n",
"5 Not checked 991.308986 \n",
"6 Not checked 988.666983 \n",
"\n",
" combination n_mod dropna \\\n",
"0 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 20 False \n",
"1 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"2 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"3 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"4 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"5 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"6 {'x <= 1.870e+00': 'x <= 1.870e+00', '1.870e+0... 4 False \n",
"\n",
" train viable \\\n",
"0 NaN NaN \n",
"1 {'viable': True, 'info': ''} False \n",
"2 {'viable': True, 'info': ''} False \n",
"3 {'viable': True, 'info': ''} False \n",
"4 {'viable': True, 'info': ''} True \n",
"5 NaN NaN \n",
"6 NaN NaN \n",
"\n",
" dev \n",
"0 NaN \n",
"1 {'viable': False, 'info': 'Non-representative ... \n",
"2 {'viable': False, 'info': 'Non-representative ... \n",
"3 {'viable': False, 'info': 'Non-representative ... \n",
"4 {'viable': True, 'info': ''} \n",
"5 NaN \n",
"6 NaN "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features[\"AveOccup\"].history.head(7)"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'viable': False, 'info': 'Non-representative modality for min_freq=10.00%'}"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features[\"AveOccup\"].history.dev[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* The most associated combination of feature ``AveOccup`` (the first one tested, where ``info!=\"Raw distribution\"``) did not pass the viability tests. Looking at ``history.dev``:\n",
" * ``\"Non-representative modality for min_freq=10.00%\"`` tells us that a modality is not representative enough on ``dev_set`` with respect to ``min_freq``, i.e. the combination is unstable between ``train_set`` and ``dev_set``\n",
"\n",
"* For feature ``AveOccup``, the 4th combination is the first to pass the tests:\n",
" - ``viable=True``\n",
" - ``info=\"Best for kruskal and max_n_mod=4\"``\n",
" - Kruskal-Wallis' H with ``MedHouseVal`` is ``991.408301`` for this combination\n",
" - The following combinations (less associated with the target) were not tested: ``info=\"Not checked\"``\n",
"\n",
"* For all combinations, ``dropna=False`` means that ``nan``s are not grouped with other modalities in the combination (as requested with ``dropna=False``)"
]
},
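{
"cell_type": "markdown",
"metadata": {},
"source": [
"The viability check reported in ``history`` can be pictured with a short sketch. This is an assumption about the general idea, not AutoCarver's actual code: ``check_min_freq`` is a hypothetical helper that flags a combination as soon as one modality's frequency falls below a threshold, and the exact threshold AutoCarver applies may differ.\n",
"\n",
"```python\n",
"def check_min_freq(frequencies, min_freq):\n",
"    # A combination is kept only if every modality is frequent enough.\n",
"    smallest = min(frequencies)\n",
"    if smallest < min_freq:\n",
"        info = f'Non-representative modality for min_freq={min_freq:.2%}'\n",
"        return {'viable': False, 'info': info}\n",
"    return {'viable': True, 'info': ''}\n",
"\n",
"\n",
"# dev frequencies of the selected AveOccup combination (from the output above)\n",
"selected = check_min_freq([0.1461, 0.5129, 0.1869, 0.1541], min_freq=0.10)\n",
"# hypothetical frequencies with one under-represented modality\n",
"rejected = check_min_freq([0.0900, 0.5500, 0.2100, 0.1500], min_freq=0.10)\n",
"print(selected)\n",
"print(rejected)\n",
"```\n",
"\n",
"The returned dictionaries mimic the ``{'viable': ..., 'info': ...}`` shape displayed in the ``train`` and ``dev`` columns of ``history``."
]
},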
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving and Loading AutoCarver"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Saving\n",
"\n",
"All **Carvers** can safely be stored as a .json file."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"auto_carver.save(\"continuous_carver.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading\n",
"\n",
"**Carvers** can safely be loaded from a .json file."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"from AutoCarver import ContinuousCarver\n",
"\n",
"# loading json file\n",
"auto_carver = ContinuousCarver.load('continuous_carver.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying AutoCarver"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"dev_set_processed = auto_carver.transform(dev_set)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MedInc | \n",
" HouseAge | \n",
" AveRooms | \n",
" AveBedrms | \n",
" Population | \n",
" AveOccup | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0.0 | \n",
" 0.255432 | \n",
" 0.340282 | \n",
" 0.502789 | \n",
" 0.106577 | \n",
" 0.157223 | \n",
" 0.146066 | \n",
"
\n",
" \n",
" | 1.0 | \n",
" 0.350851 | \n",
" 0.105843 | \n",
" 0.203318 | \n",
" 0.451703 | \n",
" 0.153553 | \n",
" 0.512918 | \n",
"
\n",
" \n",
" | 2.0 | \n",
" 0.244568 | \n",
" 0.447005 | \n",
" 0.144011 | \n",
" 0.295508 | \n",
" 0.539049 | \n",
" 0.186876 | \n",
"
\n",
" \n",
" | 3.0 | \n",
" 0.149149 | \n",
" 0.106870 | \n",
" 0.149883 | \n",
" 0.146213 | \n",
" 0.150176 | \n",
" 0.154140 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" MedInc HouseAge AveRooms AveBedrms Population AveOccup\n",
"0.0 0.255432 0.340282 0.502789 0.106577 0.157223 0.146066\n",
"1.0 0.350851 0.105843 0.203318 0.451703 0.153553 0.512918\n",
"2.0 0.244568 0.447005 0.144011 0.295508 0.539049 0.186876\n",
"3.0 0.149149 0.106870 0.149883 0.146213 0.150176 0.154140"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Selection\n",
"## Selectors settings\n",
"### Features to select from\n",
"\n",
"Here all features have been carved using ``ContinuousCarver``, hence all features are qualitative.\n",
"\n",
"### Number of features to select\n",
"\n",
"The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"n_best_per_type = 6"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Selectors"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" [RegressionSelector] Selected Qualitative Features \n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | | \n",
" feature | \n",
" Nan | \n",
" Mode | \n",
" KruskalMeasure | \n",
" KruskalRank | \n",
" TschuprowtFilter | \n",
" TschuprowtWith | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Quantitative('MedInc') | \n",
" 0.0000 | \n",
" 0.3500 | \n",
" 6037.1821 | \n",
" 0.0000 | \n",
" 0.0000 | \n",
" itself | \n",
"
\n",
" \n",
" | 2 | \n",
" Quantitative('AveRooms') | \n",
" 0.0000 | \n",
" 0.5000 | \n",
" 1391.5865 | \n",
" 1.0000 | \n",
" 0.4015 | \n",
" MedInc | \n",
"
\n",
" \n",
" | 5 | \n",
" Quantitative('AveOccup') | \n",
" 0.0000 | \n",
" 0.5001 | \n",
" 991.4083 | \n",
" 2.0000 | \n",
" 0.1864 | \n",
" AveRooms | \n",
"
\n",
" \n",
" | 3 | \n",
" Quantitative('AveBedrms') | \n",
" 0.0000 | \n",
" 0.4500 | \n",
" 315.7944 | \n",
" 3.0000 | \n",
" 0.1392 | \n",
" MedInc | \n",
"
\n",
" \n",
" | 1 | \n",
" Quantitative('HouseAge') | \n",
" 0.0000 | \n",
" 0.4437 | \n",
" 163.5278 | \n",
" 4.0000 | \n",
" 0.1362 | \n",
" AveRooms | \n",
"
\n",
" \n",
" | 4 | \n",
" Quantitative('Population') | \n",
" 0.0000 | \n",
" 0.5498 | \n",
" 16.1097 | \n",
" 5.0000 | \n",
" 0.1517 | \n",
" AveBedrms | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Features(['MedInc', 'AveRooms', 'AveOccup', 'AveBedrms', 'HouseAge', 'Population'])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from AutoCarver.selectors import RegressionSelector\n",
"\n",
"# select the qualitative features most associated with the target\n",
"feature_selector = RegressionSelector(\n",
" features=features,\n",
" n_best_per_type=n_best_per_type,\n",
" verbose=True, # displays statistics\n",
")\n",
"best_features = feature_selector.select(train_set_processed, train_set_processed[target])\n",
"best_features"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" MedInc | \n",
" AveRooms | \n",
" AveOccup | \n",
" AveBedrms | \n",
" HouseAge | \n",
" Population | \n",
"
\n",
" \n",
" \n",
" \n",
" | 5088 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 17096 | \n",
" 2.0 | \n",
" 2.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 2.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" | 5617 | \n",
" 1.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 2.0 | \n",
" 2.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" | 20060 | \n",
" 0.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
" 1.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
"
\n",
" \n",
" | 895 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 1.0 | \n",
" 2.0 | \n",
" 0.0 | \n",
" 3.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" MedInc AveRooms AveOccup AveBedrms HouseAge Population\n",
"5088 0.0 0.0 1.0 2.0 0.0 1.0\n",
"17096 2.0 2.0 1.0 2.0 2.0 2.0\n",
"5617 1.0 0.0 3.0 2.0 2.0 2.0\n",
"20060 0.0 0.0 3.0 1.0 1.0 2.0\n",
"895 2.0 0.0 1.0 2.0 0.0 3.0"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_set_processed[best_features].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Feature ``MedInc`` is the most associated with the target ``MedHouseVal``:\n",
" - Kruskal-Wallis' H value is ``KruskalMeasure=6037.1821``\n",
" - It has 0 % of NaNs (``Nan=0.0000``)\n",
" - Its mode represents 35 % of observed data (``Mode=0.3500``)\n",
"\n",
"* Feature ``AveRooms`` is strongly associated with feature ``MedInc``:\n",
" - Tschuprow's T value is ``TschuprowtFilter=0.4015`` for ``TschuprowtWith=MedInc``\n",
"\n",
"* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)"
]
},
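{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Tschuprow's T between two discretized features can be derived from the chi-square statistic of their contingency table: ``T = sqrt(chi2 / (n * sqrt((r - 1) * (c - 1))))``, with ``n`` observations, ``r`` rows and ``c`` columns. The sketch below is a plain-Python illustration using a hypothetical ``tschuprow_t`` helper and toy data; it is not the code used by ``RegressionSelector``.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"from math import sqrt\n",
"\n",
"\n",
"def tschuprow_t(xs, ys):\n",
"    # Build the contingency table and its margins from paired labels.\n",
"    n = len(xs)\n",
"    joint = Counter(zip(xs, ys))\n",
"    rows = Counter(xs)\n",
"    cols = Counter(ys)\n",
"\n",
"    # Pearson chi-square: squared gaps between observed and expected counts.\n",
"    chi2 = 0.0\n",
"    for r, n_r in rows.items():\n",
"        for c, n_c in cols.items():\n",
"            expected = n_r * n_c / n\n",
"            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected\n",
"\n",
"    dof = (len(rows) - 1) * (len(cols) - 1)\n",
"    if dof == 0:\n",
"        return 0.0  # a constant feature carries no association\n",
"    return sqrt(chi2 / (n * sqrt(dof)))\n",
"\n",
"\n",
"# toy labels: identical features versus independent features\n",
"t_identical = tschuprow_t([0, 0, 1, 1], [0, 0, 1, 1])\n",
"t_independent = tschuprow_t([0, 0, 1, 1], [0, 1, 0, 1])\n",
"print(t_identical, t_independent)\n",
"```\n",
"\n",
"T ranges from 0 (independence) to 1 (perfect association between features with the same number of modalities), which is why the ``TschuprowtFilter`` values above can be compared across pairs of features."
]
},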
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modeling\n",
"Fitting a model on the training data"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'super' object has no attribute '__sklearn_tags__'",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\IPython\\core\\formatters.py:974\u001b[0m, in \u001b[0;36mMimeBundleFormatter.__call__\u001b[1;34m(self, obj, include, exclude)\u001b[0m\n\u001b[0;32m 971\u001b[0m method \u001b[38;5;241m=\u001b[39m get_real_method(obj, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mprint_method)\n\u001b[0;32m 973\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m--> 974\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmethod\u001b[49m\u001b[43m(\u001b[49m\u001b[43minclude\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43minclude\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mexclude\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mexclude\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 975\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m 976\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\base.py:469\u001b[0m, in \u001b[0;36mBaseEstimator._repr_mimebundle_\u001b[1;34m(self, **kwargs)\u001b[0m\n\u001b[0;32m 467\u001b[0m output \u001b[38;5;241m=\u001b[39m {\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext/plain\u001b[39m\u001b[38;5;124m\"\u001b[39m: \u001b[38;5;28mrepr\u001b[39m(\u001b[38;5;28mself\u001b[39m)}\n\u001b[0;32m 468\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m get_config()[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdisplay\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m==\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mdiagram\u001b[39m\u001b[38;5;124m\"\u001b[39m:\n\u001b[1;32m--> 469\u001b[0m output[\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mtext/html\u001b[39m\u001b[38;5;124m\"\u001b[39m] \u001b[38;5;241m=\u001b[39m \u001b[43mestimator_html_repr\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m)\u001b[49m\n\u001b[0;32m 470\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m output\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\_estimator_html_repr.py:387\u001b[0m, in \u001b[0;36mestimator_html_repr\u001b[1;34m(estimator)\u001b[0m\n\u001b[0;32m 385\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 386\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m--> 387\u001b[0m \u001b[43mcheck_is_fitted\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 388\u001b[0m status_label \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFitted\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 389\u001b[0m is_fitted_css_class \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfitted\u001b[39m\u001b[38;5;124m\"\u001b[39m\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\validation.py:1751\u001b[0m, in \u001b[0;36mcheck_is_fitted\u001b[1;34m(estimator, attributes, msg, all_or_any)\u001b[0m\n\u001b[0;32m 1748\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(estimator, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfit\u001b[39m\u001b[38;5;124m\"\u001b[39m):\n\u001b[0;32m 1749\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m is not an estimator instance.\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m (estimator))\n\u001b[1;32m-> 1751\u001b[0m tags \u001b[38;5;241m=\u001b[39m \u001b[43mget_tags\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1753\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m tags\u001b[38;5;241m.\u001b[39mrequires_fit \u001b[38;5;129;01mand\u001b[39;00m attributes \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m 1754\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\_tags.py:430\u001b[0m, in \u001b[0;36mget_tags\u001b[1;34m(estimator)\u001b[0m\n\u001b[0;32m 428\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m klass \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mreversed\u001b[39m(\u001b[38;5;28mtype\u001b[39m(estimator)\u001b[38;5;241m.\u001b[39mmro()):\n\u001b[0;32m 429\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m__sklearn_tags__\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mvars\u001b[39m(klass):\n\u001b[1;32m--> 430\u001b[0m sklearn_tags_provider[klass] \u001b[38;5;241m=\u001b[39m \u001b[43mklass\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__sklearn_tags__\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# type: ignore[attr-defined]\u001b[39;00m\n\u001b[0;32m 431\u001b[0m class_order\u001b[38;5;241m.\u001b[39mappend(klass)\n\u001b[0;32m 432\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m_more_tags\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mvars\u001b[39m(klass):\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\base.py:613\u001b[0m, in \u001b[0;36mRegressorMixin.__sklearn_tags__\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 612\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m__sklearn_tags__\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[1;32m--> 613\u001b[0m tags \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__sklearn_tags__\u001b[49m()\n\u001b[0;32m 614\u001b[0m tags\u001b[38;5;241m.\u001b[39mestimator_type \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mregressor\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 615\u001b[0m tags\u001b[38;5;241m.\u001b[39mregressor_tags \u001b[38;5;241m=\u001b[39m RegressorTags()\n",
"\u001b[1;31mAttributeError\u001b[0m: 'super' object has no attribute '__sklearn_tags__'"
]
},
{
"ename": "AttributeError",
"evalue": "'super' object has no attribute '__sklearn_tags__'",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\IPython\\core\\formatters.py:344\u001b[0m, in \u001b[0;36mBaseFormatter.__call__\u001b[1;34m(self, obj)\u001b[0m\n\u001b[0;32m 342\u001b[0m method \u001b[38;5;241m=\u001b[39m get_real_method(obj, \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mprint_method)\n\u001b[0;32m 343\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m method \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[1;32m--> 344\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mmethod\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 345\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m\n\u001b[0;32m 346\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\base.py:463\u001b[0m, in \u001b[0;36mBaseEstimator._repr_html_inner\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 458\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m_repr_html_inner\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[0;32m 459\u001b[0m \u001b[38;5;250m \u001b[39m\u001b[38;5;124;03m\"\"\"This function is returned by the @property `_repr_html_` to make\u001b[39;00m\n\u001b[0;32m 460\u001b[0m \u001b[38;5;124;03m `hasattr(estimator, \"_repr_html_\") return `True` or `False` depending\u001b[39;00m\n\u001b[0;32m 461\u001b[0m \u001b[38;5;124;03m on `get_config()[\"display\"]`.\u001b[39;00m\n\u001b[0;32m 462\u001b[0m \u001b[38;5;124;03m \"\"\"\u001b[39;00m\n\u001b[1;32m--> 463\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[43mestimator_html_repr\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m)\u001b[49m\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\_estimator_html_repr.py:387\u001b[0m, in \u001b[0;36mestimator_html_repr\u001b[1;34m(estimator)\u001b[0m\n\u001b[0;32m 385\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[0;32m 386\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m--> 387\u001b[0m \u001b[43mcheck_is_fitted\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 388\u001b[0m status_label \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mFitted\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 389\u001b[0m is_fitted_css_class \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfitted\u001b[39m\u001b[38;5;124m\"\u001b[39m\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\validation.py:1751\u001b[0m, in \u001b[0;36mcheck_is_fitted\u001b[1;34m(estimator, attributes, msg, all_or_any)\u001b[0m\n\u001b[0;32m 1748\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m \u001b[38;5;28mhasattr\u001b[39m(estimator, \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mfit\u001b[39m\u001b[38;5;124m\"\u001b[39m):\n\u001b[0;32m 1749\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mTypeError\u001b[39;00m(\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m%s\u001b[39;00m\u001b[38;5;124m is not an estimator instance.\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;241m%\u001b[39m (estimator))\n\u001b[1;32m-> 1751\u001b[0m tags \u001b[38;5;241m=\u001b[39m \u001b[43mget_tags\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m\n\u001b[0;32m 1753\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;129;01mnot\u001b[39;00m tags\u001b[38;5;241m.\u001b[39mrequires_fit \u001b[38;5;129;01mand\u001b[39;00m attributes \u001b[38;5;129;01mis\u001b[39;00m \u001b[38;5;28;01mNone\u001b[39;00m:\n\u001b[0;32m 1754\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\utils\\_tags.py:430\u001b[0m, in \u001b[0;36mget_tags\u001b[1;34m(estimator)\u001b[0m\n\u001b[0;32m 428\u001b[0m \u001b[38;5;28;01mfor\u001b[39;00m klass \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mreversed\u001b[39m(\u001b[38;5;28mtype\u001b[39m(estimator)\u001b[38;5;241m.\u001b[39mmro()):\n\u001b[0;32m 429\u001b[0m \u001b[38;5;28;01mif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m__sklearn_tags__\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mvars\u001b[39m(klass):\n\u001b[1;32m--> 430\u001b[0m sklearn_tags_provider[klass] \u001b[38;5;241m=\u001b[39m \u001b[43mklass\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__sklearn_tags__\u001b[49m\u001b[43m(\u001b[49m\u001b[43mestimator\u001b[49m\u001b[43m)\u001b[49m \u001b[38;5;66;03m# type: ignore[attr-defined]\u001b[39;00m\n\u001b[0;32m 431\u001b[0m class_order\u001b[38;5;241m.\u001b[39mappend(klass)\n\u001b[0;32m 432\u001b[0m \u001b[38;5;28;01melif\u001b[39;00m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124m_more_tags\u001b[39m\u001b[38;5;124m\"\u001b[39m \u001b[38;5;129;01min\u001b[39;00m \u001b[38;5;28mvars\u001b[39m(klass):\n",
"File \u001b[1;32mc:\\Users\\defra\\AppData\\Local\\pypoetry\\Cache\\virtualenvs\\autocarver-i96ERKJw-py3.9\\lib\\site-packages\\sklearn\\base.py:613\u001b[0m, in \u001b[0;36mRegressorMixin.__sklearn_tags__\u001b[1;34m(self)\u001b[0m\n\u001b[0;32m 612\u001b[0m \u001b[38;5;28;01mdef\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21m__sklearn_tags__\u001b[39m(\u001b[38;5;28mself\u001b[39m):\n\u001b[1;32m--> 613\u001b[0m tags \u001b[38;5;241m=\u001b[39m \u001b[38;5;28;43msuper\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43m__sklearn_tags__\u001b[49m()\n\u001b[0;32m 614\u001b[0m tags\u001b[38;5;241m.\u001b[39mestimator_type \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mregressor\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[0;32m 615\u001b[0m tags\u001b[38;5;241m.\u001b[39mregressor_tags \u001b[38;5;241m=\u001b[39m RegressorTags()\n",
"\u001b[1;31mAttributeError\u001b[0m: 'super' object has no attribute '__sklearn_tags__'"
]
},
{
"data": {
"text/plain": [
"XGBRegressor(base_score=None, booster=None, callbacks=None,\n",
" colsample_bylevel=None, colsample_bynode=None,\n",
" colsample_bytree=None, device=None, early_stopping_rounds=None,\n",
" enable_categorical=False, eval_metric=None, feature_types=None,\n",
" gamma=None, grow_policy=None, importance_type=None,\n",
" interaction_constraints=None, learning_rate=None, max_bin=None,\n",
" max_cat_threshold=None, max_cat_to_onehot=None,\n",
" max_delta_step=None, max_depth=None, max_leaves=None,\n",
" min_child_weight=None, missing=nan, monotone_constraints=None,\n",
" multi_strategy=None, n_estimators=None, n_jobs=None,\n",
" num_parallel_tree=None, random_state=None, ...)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from xgboost import XGBRegressor\n",
"\n",
"# train an XGBoost regressor with default hyperparameters on the processed training set\n",
"model = XGBRegressor()\n",
"model.fit(train_set_processed[best_features], train_set_processed[target])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Save the fitted model to disk in XGBoost's JSON format:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"model.save_model(\"regression_xgboost.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Predict on the dev dataset and measure performance with the root mean squared error (RMSE, lower is better):"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.7773564029114313"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
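"from sklearn.metrics import root_mean_squared_error\n",
"\n",
"# predict on the dev set, using the same `best_features` columns as for training\n",
"dev_pred = model.predict(dev_set_processed[best_features])\n",
"# RMSE between the dev targets and the predictions\n",
"root_mean_squared_error(dev_set_processed[target], dev_pred)"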
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"* Thanks to **Carvers**, all of your features are now discretized to maximize their association with your regression target!\n",
"* As a final step towards your model, **Selectors** are handy tools for target-driven feature pre-selection, so make sure to check out [Selectors Examples](https://autocarver.readthedocs.io/en/latest/selectors_examples.html)!\n",
"\n",
"## Well done!\n",
"\n",
"Your commitment to achieving optimal results in continuous regression tasks shines through in your meticulous use of **AutoCarver**'s ``ContinuousCarver`` for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.\n",
"\n",
"The ``ContinuousCarver`` has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.\n",
"\n",
"We extend our sincere appreciation for choosing **AutoCarver** as your companion in the data preprocessing journey. Your use of **AutoCarver** demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in continuous regression tasks.\n",
"\n",
"As you take your modeling further, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We're excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.\n",
"\n",
"Thank you for trusting **AutoCarver**, and we wish you continued success in your data-driven ventures."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "autocarver-i96ERKJw-py3.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}