{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Setting things up\n",
"\n",
"## About this notebook\n",
"\n",
"This notebook preprocesses the Titanic dataset with the ``BinaryCarver`` pipeline. ``BinaryCarver`` is a Python tool that discretizes features, whether quantitative or qualitative, so as to maximize their association with a binary target. Our focus is on preparing the dataset for a binary classification task: predicting passenger survival.\n",
"\n",
"The Titanic dataset, built from the passenger records of the 1912 voyage, provides features ranging from socio-economic status and age to the fare paid. With ``BinaryCarver``, we perform association-maximizing discretization on both the quantitative and the qualitative features.\n",
"\n",
"Throughout the notebook, we walk through ``BinaryCarver``'s discretization pipeline step by step: declaring the features, choosing the settings, and reading the raw and carved distributions printed for each feature.\n",
"\n",
"\n",
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# %pip install AutoCarver[jupyter]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Titanic Data\n",
"\n",
"In this example notebook, we will use the Titanic dataset.\n",
"\n",
"The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.\n",
"\n",
"The version used here includes passengers' names, ages, sexes, ticket classes, numbers of relatives aboard, fares, and whether they survived. The primary goal when working with the Titanic dataset is often to build predictive models that infer whether a passenger survived or perished based on their individual characteristics (binary classification)."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>3</td><td>Mr. Owen Harris Braund</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>1</td><td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td></tr>\n",
"    <tr><th>2</th><td>1</td><td>3</td><td>Miss. Laina Heikkinen</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td></tr>\n",
"    <tr><th>3</th><td>1</td><td>1</td><td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td></tr>\n",
"    <tr><th>4</th><td>0</td><td>3</td><td>Mr. William Henry Allen</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Name \\\n",
"0 0 3 Mr. Owen Harris Braund \n",
"1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... \n",
"2 1 3 Miss. Laina Heikkinen \n",
"3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle \n",
"4 0 3 Mr. William Henry Allen \n",
"\n",
" Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare \n",
"0 male 22.0 1 0 7.2500 \n",
"1 female 38.0 1 0 71.2833 \n",
"2 female 26.0 0 0 7.9250 \n",
"3 female 35.0 1 0 53.1000 \n",
"4 male 35.0 0 0 8.0500 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"# URL to a copy of the Titanic dataset hosted by Stanford's CS109 course\n",
"titanic_url = \"https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv\"\n",
"\n",
"# Use pandas to read the CSV file directly from the URL\n",
"titanic_data = pd.read_csv(titanic_url)\n",
"\n",
"# Display the first few rows of the dataset\n",
"titanic_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Target type and Carver selection"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survived\n",
"0 545\n",
"1 342\n",
"Name: count, dtype: int64"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target = \"Survived\"\n",
"\n",
"titanic_data[target].value_counts(dropna=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The target ``\"Survived\"`` is a binary target of type ``int64``, used in a classification task. Hence we will use ``AutoCarver.BinaryCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks."
]
},
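{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check before choosing a carver, the sketch below tests whether a target column is binary. It relies on pandas only; the helper name ``is_binary_target`` and the toy ``Series`` are illustrative assumptions and not part of AutoCarver.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def is_binary_target(y: pd.Series) -> bool:\n",
"    # hypothetical helper: True when the column holds exactly two distinct non-null values\n",
"    return y.dropna().nunique() == 2\n",
"\n",
"\n",
"# toy target mimicking the 0/1 encoding of 'Survived'\n",
"toy_target = pd.Series([0, 1, 1, 0, 1], name='Survived')\n",
"print(is_binary_target(toy_target))\n",
"```"
]
},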
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Sampling"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(np.float64(0.38552188552188554), np.float64(0.3856655290102389))"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"# stratified sampling by target\n",
"train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])\n",
"\n",
"# checking target rate per dataset\n",
"train_set[target].mean(), dev_set[target].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setting up Features to Carve"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>617</th><td>0</td><td>3</td><td>Mr. Antoni Yasbeck</td><td>male</td><td>27.0</td><td>1</td><td>0</td><td>14.4542</td></tr>\n",
"    <tr><th>489</th><td>0</td><td>1</td><td>Mr. Harry Markland Molson</td><td>male</td><td>55.0</td><td>0</td><td>0</td><td>30.5000</td></tr>\n",
"    <tr><th>871</th><td>1</td><td>3</td><td>Miss. Adele Kiamie Najib</td><td>female</td><td>15.0</td><td>0</td><td>0</td><td>7.2250</td></tr>\n",
"    <tr><th>654</th><td>0</td><td>3</td><td>Mrs. John (Catherine) Bourke</td><td>female</td><td>32.0</td><td>1</td><td>1</td><td>15.5000</td></tr>\n",
"    <tr><th>653</th><td>0</td><td>3</td><td>Mr. Alexander Radeff</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>7.8958</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Survived Pclass Name Sex Age \\\n",
"617 0 3 Mr. Antoni Yasbeck male 27.0 \n",
"489 0 1 Mr. Harry Markland Molson male 55.0 \n",
"871 1 3 Miss. Adele Kiamie Najib female 15.0 \n",
"654 0 3 Mrs. John (Catherine) Bourke female 32.0 \n",
"653 0 3 Mr. Alexander Radeff male 27.0 \n",
"\n",
" Siblings/Spouses Aboard Parents/Children Aboard Fare \n",
"617 1 0 14.4542 \n",
"489 0 0 30.5000 \n",
"871 0 0 7.2250 \n",
"654 1 1 15.5000 \n",
"653 0 0 7.8958 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_set.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Survived int64\n",
"Pclass int64\n",
"Name object\n",
"Sex object\n",
"Age float64\n",
"Siblings/Spouses Aboard int64\n",
"Parents/Children Aboard int64\n",
"Fare float64\n",
"dtype: object"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# column data types\n",
"train_set.dtypes"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Parents/Children Aboard\n",
"0 438\n",
"1 87\n",
"2 60\n",
"3 3\n",
"5 3\n",
"4 2\n",
"6 1\n",
"Name: count, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# values taken by Parents/Children Aboard\n",
"train_set[\"Parents/Children Aboard\"].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Pclass\n",
"3 326\n",
"1 142\n",
"2 126\n",
"Name: count, dtype: int64"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# values taken by Pclass\n",
"train_set[\"Pclass\"].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature ``\"Pclass\"`` is of type ``int64``, but it is better treated as a qualitative ordinal feature (socio-economic status) than as a quantitative discrete one. Thus we declare it in ``ordinals``, together with the ordering of its values (given as strings).\n",
"\n",
"``\"Sex\"`` is the only qualitative categorical feature; it is listed in ``categoricals``.\n",
"\n",
"``\"Fare\"`` is the only quantitative continuous feature, whilst ``\"Age\"``, ``\"Siblings/Spouses Aboard\"`` and ``\"Parents/Children Aboard\"`` can be considered quantitative discrete features. These four features are listed in ``quantitatives``."
]
},
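{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the ordinal reading of ``\"Pclass\"`` concrete, here is a minimal pandas-only sketch (it does not use AutoCarver). The toy values are made up for illustration; the point is that the classes carry a ranking, expressed as an ordered list of string labels, rather than a measurable quantity.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy ticket classes stored as integers, as in the raw dataset\n",
"pclass = pd.Series([3, 1, 3, 2, 1], name='Pclass')\n",
"\n",
"# cast to strings, then to an ordered categorical: '1' < '2' < '3' is a ranking\n",
"ordered = pd.Categorical(pclass.astype(str), categories=['1', '2', '3'], ordered=True)\n",
"\n",
"# codes are positions in the declared ordering, not the original integers\n",
"print(list(ordered.codes))\n",
"print(ordered.min(), ordered.max())\n",
"```"
]
},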
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from AutoCarver import Features\n",
"\n",
"# initiating Features to carve\n",
"features = Features(\n",
" categoricals=[\"Sex\"],\n",
" quantitatives=[\"Age\", \"Fare\", \"Siblings/Spouses Aboard\", \"Parents/Children Aboard\"],\n",
" ordinals={\"Pclass\": [\"1\", \"2\", \"3\"]}, # user-specified ordering for ordinal features\n",
")\n",
"features[\"Pclass\"], features[\"Sex\"], features[\"Age\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using AutoCarver"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AutoCarver settings\n",
"\n",
"### Representativeness of modalities\n",
"\n",
"The attribute ``min_freq`` sets the minimum frequency per basic modality. It is used:\n",
"\n",
"- For quantitative features, to define the number of quantiles to initially discretize the features with.\n",
"\n",
"- For qualitative features, to define the threshold under which a modality is grouped with either a default value or its closest modality."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"min_freq = 0.05"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** ``min_freq`` should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)"
]
},
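{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two roles of ``min_freq`` can be illustrated with plain pandas. This is a rough sketch, not AutoCarver's implementation: taking ``round(1 / min_freq)`` quantiles is an assumed rule of thumb, the data is synthetic, and the ``'__OTHER__'`` label simply mirrors the default value mentioned later in this notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"min_freq = 0.05\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# quantitative feature: about 1 / min_freq quantile buckets (assumed rule of thumb)\n",
"n_quantiles = round(1 / min_freq)\n",
"values = pd.Series(rng.normal(size=1000))\n",
"buckets = pd.qcut(values, q=n_quantiles, duplicates='drop')\n",
"print(buckets.nunique())\n",
"\n",
"# qualitative feature: modalities rarer than min_freq are merged into a default label\n",
"labels = pd.Series(['a'] * 60 + ['b'] * 37 + ['c'] * 3)\n",
"freq = labels.value_counts(normalize=True)\n",
"rare = freq[freq < min_freq].index\n",
"grouped = labels.where(~labels.isin(rare), '__OTHER__')\n",
"print(grouped.value_counts(normalize=True))\n",
"```"
]
},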
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Desired number of modalities\n",
"\n",
"The attribute ``max_n_mod`` sets the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities per consecutive combination of modalities."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"max_n_mod = 5"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Tip:** ``max_n_mod`` should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)"
]
},
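{
"cell_type": "markdown",
"metadata": {},
"source": [
"To get a feel for how ``max_n_mod`` affects the size of the search, one can count the ways of cutting ``n`` ordered modalities into between 2 and ``max_n_mod`` consecutive groups. The counting rule below is an assumption about how combinations are enumerated, and the helper name ``n_consecutive_groupings`` is hypothetical. With ``max_n_mod = 5`` it does reproduce the totals shown by the progress bars in the fitting output of this notebook: 1 for ``Sex`` (2 modalities), 3 for ``Pclass`` (3), 15 for ``Siblings/Spouses Aboard`` (5), 36456 for ``Age`` (32) and 41448 for ``Fare`` (33).\n",
"\n",
"```python\n",
"from math import comb\n",
"\n",
"\n",
"def n_consecutive_groupings(n_modalities: int, max_n_mod: int) -> int:\n",
"    # choosing g - 1 cut points among the n - 1 gaps yields g consecutive groups\n",
"    max_groups = min(max_n_mod, n_modalities)\n",
"    return sum(comb(n_modalities - 1, g - 1) for g in range(2, max_groups + 1))\n",
"\n",
"\n",
"for n in (2, 3, 5, 32, 33):\n",
"    print(n, n_consecutive_groupings(n, max_n_mod=5))\n",
"```\n",
"\n",
"The count grows quickly with both the number of raw modalities (driven by ``min_freq``) and ``max_n_mod``, which is why lower values of ``max_n_mod`` are faster."
]
},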
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Optional: Grouping NaNs\n",
"\n",
"The attribute ``dropna`` allows one to choose whether or not ``nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-``nan`` values, and then test out all possible combinations with ``nan``."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"dropna = False  # no effect here: this dataset contains no NaN values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Fitting AutoCarver"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* First, all qualitative features are discretized:\n",
" 1. Using ``StringDiscretizer`` to convert them to ``str`` if not already the case\n",
" 2. For qualitative ordinal features: using ``OrdinalDiscretizer`` for under-represented values (less frequent than ``min_freq``) to be grouped with their closest modality\n",
" 3. For qualitative categorical features: using ``CategoricalDiscretizer`` for under-represented values (less frequent than ``min_freq``) to be grouped with a default value (``features.default=\"__OTHER__\"``)\n",
"\n",
"* Second, all quantitative features are discretized:\n",
" 1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq``)\n",
" 2. Using ``OrdinalDiscretizer`` for any remaining under-represented values (less frequent than ``min_freq/2``) to be grouped with their closest modality\n",
"\n",
"* Third, all features are carved following this recipe, for all classes of ``train_set[target]`` (except one):\n",
" 1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step\n",
" 2. Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``\n",
" 3. Computing associations: the association metric (Tschuprow's T, by default) is computed with the provided ``train_set[target]``\n",
" 4. Combinations are sorted in descending order by association value\n",
" 5. Testing robustness: finds the first combination that checks the following:\n",
" - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq/2``)\n",
" - Distinct target rates per consecutive modalities on ``train_set`` and ``dev_set``\n",
" - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)\n",
" 6. (Optional) If requested via ``dropna=True``, and if any, all combinations of modalities with ``nan`` are applied to ``train_set`` and steps 3. and 4. are run\n",
" 7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step"
]
},
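{
"cell_type": "markdown",
"metadata": {},
"source": [
"Steps 2 to 4 of the carving recipe can be reproduced by hand on ``Pclass``. The sketch below computes Tschuprow's T with NumPy for the raw modalities and for the two possible two-group combinations. The counts are reconstructed from the ``Pclass`` value counts and the target rates displayed in this notebook (142, 126 and 326 passengers, of whom roughly 88, 59 and 82 survived), so they are rounded approximations. The helper ``tschuprow_t`` is illustrative, uses no continuity correction, and is not AutoCarver's own implementation.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def tschuprow_t(table) -> float:\n",
"    # T = sqrt(chi2 / (n * sqrt((rows - 1) * (cols - 1))))\n",
"    table = np.asarray(table, dtype=float)\n",
"    n = table.sum()\n",
"    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n\n",
"    chi2 = ((table - expected) ** 2 / expected).sum()\n",
"    n_rows, n_cols = table.shape\n",
"    return float(np.sqrt(chi2 / (n * np.sqrt((n_rows - 1) * (n_cols - 1)))))\n",
"\n",
"\n",
"# rows: Pclass 1, 2, 3; columns: died, survived (approximate counts on train_set)\n",
"raw = np.array([[54, 88], [67, 59], [244, 82]])\n",
"\n",
"# the three consecutive combinations available for three ordered modalities\n",
"candidates = {\n",
"    '1 | 2 | 3': raw,\n",
"    '1 to 2 | 3': np.array([raw[0] + raw[1], raw[2]]),\n",
"    '1 | 2 to 3': np.array([raw[0], raw[1] + raw[2]]),\n",
"}\n",
"scores = {name: tschuprow_t(table) for name, table in candidates.items()}\n",
"\n",
"# combinations sorted in descending order of association\n",
"for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):\n",
"    print(name, round(score, 4))\n",
"```\n",
"\n",
"On these counts, ``1 to 2 | 3`` ranks first, which is consistent with the carved distribution reported for ``Pclass`` in the fitting output."
]
},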
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"------\n",
"--- [QuantitativeDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])\n",
" - [ContinuousDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])\n",
" - [OrdinalDiscretizer] Fit Features(['Age', 'Fare', 'Parents/Children Aboard'])\n",
"------\n",
"\n",
"------\n",
"--- [QualitativeDiscretizer] Fit Features(['Sex', 'Pclass'])\n",
" - [StringDiscretizer] Fit Features(['Pclass'])\n",
" - [OrdinalDiscretizer] Fit Features(['Pclass'])\n",
" - [CategoricalDiscretizer] Fit Features(['Sex'])\n",
"------\n",
"\n",
"---------\n",
"------ [BinaryCarver] Fit Features(['Sex', 'Pclass', 'Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])\n",
"--- [BinaryCarver] Fit Categorical('Sex') (1/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>male</th><td>0.1878</td><td>0.6364</td></tr>\n",
"    <tr><th>female</th><td>0.7315</td><td>0.3636</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.1949</td><td>0.6655</td></tr>\n",
"    <tr><td>0.7653</td><td>0.3345</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 0%| | 0/1 [00:00, ?it/s]\n",
"Computing associations: 100%|██████████| 1/1 [00:00, ?it/s]\n",
"Testing robustness : 0%| | 0/1 [00:00, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>male</th><td>0.1878</td><td>0.6364</td></tr>\n",
"    <tr><th>female</th><td>0.7315</td><td>0.3636</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.1949</td><td>0.6655</td></tr>\n",
"    <tr><td>0.7653</td><td>0.3345</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [BinaryCarver] Fit Ordinal('Pclass') (2/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>0.6197</td><td>0.2391</td></tr>\n",
"    <tr><th>2</th><td>0.4683</td><td>0.2121</td></tr>\n",
"    <tr><th>3</th><td>0.2515</td><td>0.5488</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.6486</td><td>0.2526</td></tr>\n",
"    <tr><td>0.4828</td><td>0.1980</td></tr>\n",
"    <tr><td>0.2298</td><td>0.5495</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 67%|██████▋ | 2/3 [00:00<00:00, 1988.76it/s]\n",
"Computing associations: 100%|██████████| 3/3 [00:00<00:00, 2981.03it/s]\n",
"Testing robustness : 0%| | 0/3 [00:00, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>1 to 2</th><td>0.5485</td><td>0.4512</td></tr>\n",
"    <tr><th>3</th><td>0.2515</td><td>0.5488</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.5758</td><td>0.4505</td></tr>\n",
"    <tr><td>0.2298</td><td>0.5495</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [BinaryCarver] Fit Quantitative('Age') (3/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>x &lt;= 2.00e+00</th><td>0.7500</td><td>0.0269</td></tr>\n",
"    <tr><th>2.00e+00 &lt; x &lt;= 4.00e+00</th><td>0.7143</td><td>0.0236</td></tr>\n",
"    <tr><th>4.00e+00 &lt; x &lt;= 8.00e+00</th><td>0.4286</td><td>0.0236</td></tr>\n",
"    <tr><th>8.00e+00 &lt; x &lt;= 1.40e+01</th><td>0.2000</td><td>0.0253</td></tr>\n",
"    <tr><th>1.40e+01 &lt; x &lt;= 1.60e+01</th><td>0.5000</td><td>0.0303</td></tr>\n",
"    <tr><th>1.60e+01 &lt; x &lt;= 1.80e+01</th><td>0.3226</td><td>0.0522</td></tr>\n",
"    <tr><th>1.80e+01 &lt; x &lt;= 1.90e+01</th><td>0.3913</td><td>0.0387</td></tr>\n",
"    <tr><th>1.90e+01 &lt; x &lt;= 2.05e+01</th><td>0.1111</td><td>0.0303</td></tr>\n",
"    <tr><th>2.05e+01 &lt; x &lt;= 2.10e+01</th><td>0.1905</td><td>0.0354</td></tr>\n",
"    <tr><th>2.10e+01 &lt; x &lt;= 2.20e+01</th><td>0.4242</td><td>0.0556</td></tr>\n",
"    <tr><th>2.20e+01 &lt; x &lt;= 2.35e+01</th><td>0.4000</td><td>0.0168</td></tr>\n",
"    <tr><th>2.35e+01 &lt; x &lt;= 2.40e+01</th><td>0.5417</td><td>0.0404</td></tr>\n",
"    <tr><th>2.40e+01 &lt; x &lt;= 2.50e+01</th><td>0.1333</td><td>0.0253</td></tr>\n",
"    <tr><th>2.50e+01 &lt; x &lt;= 2.70e+01</th><td>0.4667</td><td>0.0505</td></tr>\n",
"    <tr><th>2.70e+01 &lt; x &lt;= 2.85e+01</th><td>0.2500</td><td>0.0337</td></tr>\n",
"    <tr><th>2.85e+01 &lt; x &lt;= 2.90e+01</th><td>0.4444</td><td>0.0303</td></tr>\n",
"    <tr><th>2.90e+01 &lt; x &lt;= 3.00e+01</th><td>0.2917</td><td>0.0404</td></tr>\n",
"    <tr><th>3.00e+01 &lt; x &lt;= 3.10e+01</th><td>0.3846</td><td>0.0219</td></tr>\n",
"    <tr><th>3.10e+01 &lt; x &lt;= 3.20e+01</th><td>0.5000</td><td>0.0269</td></tr>\n",
"    <tr><th>3.20e+01 &lt; x &lt;= 3.30e+01</th><td>0.3846</td><td>0.0219</td></tr>\n",
"    <tr><th>3.30e+01 &lt; x &lt;= 3.40e+01</th><td>0.3077</td><td>0.0219</td></tr>\n",
"    <tr><th>3.40e+01 &lt; x &lt;= 3.60e+01</th><td>0.4643</td><td>0.0471</td></tr>\n",
"    <tr><th>3.60e+01 &lt; x &lt;= 3.80e+01</th><td>0.4118</td><td>0.0286</td></tr>\n",
"    <tr><th>3.80e+01 &lt; x &lt;= 4.10e+01</th><td>0.3871</td><td>0.0522</td></tr>\n",
"    <tr><th>4.10e+01 &lt; x &lt;= 4.20e+01</th><td>0.4615</td><td>0.0219</td></tr>\n",
"    <tr><th>4.20e+01 &lt; x &lt;= 4.50e+01</th><td>0.3913</td><td>0.0387</td></tr>\n",
"    <tr><th>4.50e+01 &lt; x &lt;= 4.70e+01</th><td>0.2500</td><td>0.0202</td></tr>\n",
"    <tr><th>4.70e+01 &lt; x &lt;= 4.90e+01</th><td>0.7143</td><td>0.0236</td></tr>\n",
"    <tr><th>4.90e+01 &lt; x &lt;= 5.10e+01</th><td>0.2727</td><td>0.0185</td></tr>\n",
"    <tr><th>5.10e+01 &lt; x &lt;= 5.60e+01</th><td>0.3889</td><td>0.0303</td></tr>\n",
"    <tr><th>5.60e+01 &lt; x &lt;= 6.10e+01</th><td>0.1538</td><td>0.0219</td></tr>\n",
"    <tr><th>6.10e+01 &lt; x</th><td>0.2000</td><td>0.0253</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.4444</td><td>0.0307</td></tr>\n",
"    <tr><td>0.7500</td><td>0.0137</td></tr>\n",
"    <tr><td>0.6667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.6000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0273</td></tr>\n",
"    <tr><td>0.4286</td><td>0.0717</td></tr>\n",
"    <tr><td>0.2000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.3333</td><td>0.0205</td></tr>\n",
"    <tr><td>0.1538</td><td>0.0444</td></tr>\n",
"    <tr><td>0.1667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.1875</td><td>0.0546</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.3529</td><td>0.0580</td></tr>\n",
"    <tr><td>0.2632</td><td>0.0648</td></tr>\n",
"    <tr><td>0.4286</td><td>0.0239</td></tr>\n",
"    <tr><td>0.3333</td><td>0.0307</td></tr>\n",
"    <tr><td>0.6250</td><td>0.0273</td></tr>\n",
"    <tr><td>0.4000</td><td>0.0171</td></tr>\n",
"    <tr><td>0.6667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.7500</td><td>0.0137</td></tr>\n",
"    <tr><td>0.5882</td><td>0.0580</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0273</td></tr>\n",
"    <tr><td>0.1875</td><td>0.0546</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0137</td></tr>\n",
"    <tr><td>0.1429</td><td>0.0239</td></tr>\n",
"    <tr><td>0.1667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0205</td></tr>\n",
"    <tr><td>0.6667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0205</td></tr>\n",
"    <tr><td>0.6000</td><td>0.0171</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0273</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 36455/36456 [00:04<00:00, 7444.66it/s]\n",
"Computing associations: 100%|██████████| 36456/36456 [00:10<00:00, 3595.62it/s]\n",
"Testing robustness : 1%| | 302/36456 [00:00<01:49, 328.89it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>x &lt;= 8.0e+00</th><td>0.6364</td><td>0.0741</td></tr>\n",
"    <tr><th>8.0e+00 &lt; x</th><td>0.3655</td><td>0.9259</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.5789</td><td>0.0648</td></tr>\n",
"    <tr><td>0.3723</td><td>0.9352</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [BinaryCarver] Fit Quantitative('Fare') (4/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>x &lt;= 6.858e+00</th><td>0.0000</td><td>0.0269</td></tr>\n",
"    <tr><th>6.858e+00 &lt; x &lt;= 7.142e+00</th><td>0.1333</td><td>0.0253</td></tr>\n",
"    <tr><th>7.142e+00 &lt; x &lt;= 7.229e+00</th><td>0.2632</td><td>0.0320</td></tr>\n",
"    <tr><th>7.229e+00 &lt; x &lt;= 7.250e+00</th><td>0.0909</td><td>0.0185</td></tr>\n",
"    <tr><th>7.250e+00 &lt; x &lt;= 7.750e+00</th><td>0.3500</td><td>0.0673</td></tr>\n",
"    <tr><th>7.750e+00 &lt; x &lt;= 7.854e+00</th><td>0.3333</td><td>0.0404</td></tr>\n",
"    <tr><th>7.854e+00 &lt; x &lt;= 7.896e+00</th><td>0.1429</td><td>0.0471</td></tr>\n",
"    <tr><th>7.896e+00 &lt; x &lt;= 8.029e+00</th><td>0.5000</td><td>0.0269</td></tr>\n",
"    <tr><th>8.029e+00 &lt; x &lt;= 8.050e+00</th><td>0.0968</td><td>0.0522</td></tr>\n",
"    <tr><th>8.050e+00 &lt; x &lt;= 9.000e+00</th><td>0.1250</td><td>0.0269</td></tr>\n",
"    <tr><th>9.000e+00 &lt; x &lt;= 9.842e+00</th><td>0.3571</td><td>0.0236</td></tr>\n",
"    <tr><th>9.842e+00 &lt; x &lt;= 1.050e+01</th><td>0.3571</td><td>0.0236</td></tr>\n",
"    <tr><th>1.050e+01 &lt; x &lt;= 1.300e+01</th><td>0.5128</td><td>0.0657</td></tr>\n",
"    <tr><th>1.300e+01 &lt; x &lt;= 1.445e+01</th><td>0.3333</td><td>0.0253</td></tr>\n",
"    <tr><th>1.445e+01 &lt; x &lt;= 1.550e+01</th><td>0.2000</td><td>0.0253</td></tr>\n",
"    <tr><th>1.550e+01 &lt; x &lt;= 1.670e+01</th><td>0.5833</td><td>0.0202</td></tr>\n",
"    <tr><th>1.670e+01 &lt; x &lt;= 2.025e+01</th><td>0.5714</td><td>0.0236</td></tr>\n",
"    <tr><th>2.025e+01 &lt; x &lt;= 2.300e+01</th><td>0.4286</td><td>0.0236</td></tr>\n",
"    <tr><th>2.300e+01 &lt; x &lt;= 2.600e+01</th><td>0.3333</td><td>0.0606</td></tr>\n",
"    <tr><th>2.600e+01 &lt; x &lt;= 2.655e+01</th><td>0.5789</td><td>0.0320</td></tr>\n",
"    <tr><th>2.655e+01 &lt; x &lt;= 2.790e+01</th><td>0.2500</td><td>0.0202</td></tr>\n",
"    <tr><th>2.790e+01 &lt; x &lt;= 3.000e+01</th><td>0.4615</td><td>0.0219</td></tr>\n",
"    <tr><th>3.000e+01 &lt; x &lt;= 3.139e+01</th><td>0.3333</td><td>0.0253</td></tr>\n",
"    <tr><th>3.139e+01 &lt; x &lt;= 3.850e+01</th><td>0.2857</td><td>0.0236</td></tr>\n",
"    <tr><th>3.850e+01 &lt; x &lt;= 4.240e+01</th><td>0.4667</td><td>0.0253</td></tr>\n",
"    <tr><th>4.240e+01 &lt; x &lt;= 5.200e+01</th><td>0.2353</td><td>0.0286</td></tr>\n",
"    <tr><th>5.200e+01 &lt; x &lt;= 5.650e+01</th><td>0.7857</td><td>0.0236</td></tr>\n",
"    <tr><th>5.650e+01 &lt; x &lt;= 6.955e+01</th><td>0.5333</td><td>0.0253</td></tr>\n",
"    <tr><th>6.955e+01 &lt; x &lt;= 7.729e+01</th><td>0.4167</td><td>0.0202</td></tr>\n",
"    <tr><th>7.729e+01 &lt; x &lt;= 8.316e+01</th><td>0.8000</td><td>0.0253</td></tr>\n",
"    <tr><th>8.316e+01 &lt; x &lt;= 1.109e+02</th><td>0.7857</td><td>0.0236</td></tr>\n",
"    <tr><th>1.109e+02 &lt; x &lt;= 1.516e+02</th><td>0.8750</td><td>0.0269</td></tr>\n",
"    <tr><th>1.516e+02 &lt; x</th><td>0.7143</td><td>0.0236</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.1111</td><td>0.0307</td></tr>\n",
"    <tr><td>0.0000</td><td>0.0102</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0273</td></tr>\n",
"    <tr><td>0.0000</td><td>0.0068</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0546</td></tr>\n",
"    <tr><td>0.1333</td><td>0.0512</td></tr>\n",
"    <tr><td>0.0714</td><td>0.0478</td></tr>\n",
"    <tr><td>0.3333</td><td>0.0102</td></tr>\n",
"    <tr><td>0.1667</td><td>0.0410</td></tr>\n",
"    <tr><td>0.1667</td><td>0.0410</td></tr>\n",
"    <tr><td>0.0000</td><td>0.0273</td></tr>\n",
"    <tr><td>0.2857</td><td>0.0478</td></tr>\n",
"    <tr><td>0.3846</td><td>0.0887</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0137</td></tr>\n",
"    <tr><td>0.4545</td><td>0.0375</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.4444</td><td>0.0307</td></tr>\n",
"    <tr><td>0.6000</td><td>0.0341</td></tr>\n",
"    <tr><td>0.5294</td><td>0.0580</td></tr>\n",
"    <tr><td>0.8571</td><td>0.0239</td></tr>\n",
"    <tr><td>0.2000</td><td>0.0171</td></tr>\n",
"    <tr><td>0.4000</td><td>0.0171</td></tr>\n",
"    <tr><td>0.6250</td><td>0.0273</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0273</td></tr>\n",
"    <tr><td>0.0000</td><td>0.0102</td></tr>\n",
"    <tr><td>0.6000</td><td>0.0171</td></tr>\n",
"    <tr><td>0.6667</td><td>0.0205</td></tr>\n",
"    <tr><td>0.5556</td><td>0.0307</td></tr>\n",
"    <tr><td>0.6667</td><td>0.0102</td></tr>\n",
"    <tr><td>0.7143</td><td>0.0239</td></tr>\n",
"    <tr><td>0.7778</td><td>0.0307</td></tr>\n",
"    <tr><td>0.5000</td><td>0.0137</td></tr>\n",
"    <tr><td>0.7273</td><td>0.0375</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 100%|█████████▉| 41447/41448 [00:06<00:00, 6897.10it/s]\n",
"Computing associations: 100%|██████████| 41448/41448 [00:09<00:00, 4172.63it/s]\n",
"Testing robustness : 0%| | 0/41448 [00:00, ?it/s]\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>x &lt;= 5.2e+01</th><td>0.3198</td><td>0.8316</td></tr>\n",
"    <tr><th>5.2e+01 &lt; x</th><td>0.7100</td><td>0.1684</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.3279</td><td>0.8328</td></tr>\n",
"    <tr><td>0.6735</td><td>0.1672</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [BinaryCarver] Fit Quantitative('Siblings/Spouses Aboard') (5/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"<table>\n",
"  <caption>X distribution</caption>\n",
"  <thead><tr><th></th><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><th>x &lt;= 0.00e+00</th><td>0.3614</td><td>0.6801</td></tr>\n",
"    <tr><th>0.00e+00 &lt; x &lt;= 1.00e+00</th><td>0.5000</td><td>0.2323</td></tr>\n",
"    <tr><th>1.00e+00 &lt; x &lt;= 2.00e+00</th><td>0.5500</td><td>0.0337</td></tr>\n",
"    <tr><th>2.00e+00 &lt; x &lt;= 4.00e+00</th><td>0.1429</td><td>0.0354</td></tr>\n",
"    <tr><th>4.00e+00 &lt; x</th><td>0.0000</td><td>0.0185</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<table>\n",
"  <caption>X_dev distribution</caption>\n",
"  <thead><tr><th>target_mean</th><th>frequency</th></tr></thead>\n",
"  <tbody>\n",
"    <tr><td>0.3200</td><td>0.6826</td></tr>\n",
"    <tr><td>0.6056</td><td>0.2423</td></tr>\n",
"    <tr><td>0.2500</td><td>0.0273</td></tr>\n",
"    <tr><td>0.3077</td><td>0.0444</td></tr>\n",
"    <tr><td>0.0000</td><td>0.0034</td></tr>\n",
"  </tbody>\n",
"</table>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 93%|█████████▎| 14/15 [00:00<00:00, 4605.87it/s]\n",
"Computing associations: 100%|██████████| 15/15 [00:00<00:00, 1820.44it/s]\n",
"Testing robustness : 67%|██████▋ | 10/15 [00:00<00:00, 322.91it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 0.00e+00 | \n",
" 0.3614 | \n",
" 0.6801 | \n",
"
\n",
" \n",
" | 0.00e+00 < x <= 1.00e+00 | \n",
" 0.5000 | \n",
" 0.2323 | \n",
"
\n",
" \n",
" | 1.00e+00 < x | \n",
" 0.2692 | \n",
" 0.0875 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0.3200 | \n",
" 0.6826 | \n",
"
\n",
" \n",
" | 0.6056 | \n",
" 0.2423 | \n",
"
\n",
" \n",
" | 0.2727 | \n",
" 0.0751 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"--- [BinaryCarver] Fit Quantitative('Parents/Children Aboard') (6/6)\n",
" [BinaryCarver] Raw distribution\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 0.00e+00 | \n",
" 0.3447 | \n",
" 0.7374 | \n",
"
\n",
" \n",
" | 0.00e+00 < x <= 1.00e+00 | \n",
" 0.5057 | \n",
" 0.1465 | \n",
"
\n",
" \n",
" | 1.00e+00 < x <= 2.00e+00 | \n",
" 0.5167 | \n",
" 0.1010 | \n",
"
\n",
" \n",
" | 2.00e+00 < x | \n",
" 0.3333 | \n",
" 0.0152 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0.3475 | \n",
" 0.8055 | \n",
"
\n",
" \n",
" | 0.6774 | \n",
" 0.1058 | \n",
"
\n",
" \n",
" | 0.4500 | \n",
" 0.0683 | \n",
"
\n",
" \n",
" | 0.1667 | \n",
" 0.0205 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Grouping modalities : 86%|████████▌ | 6/7 [00:00<00:00, 604.67it/s]\n",
"Computing associations: 100%|██████████| 7/7 [00:00<00:00, 932.45it/s]\n",
"Testing robustness : 0%| | 0/7 [00:00, ?it/s]"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"\n",
" [BinaryCarver] Carved distribution\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" X distribution\n",
" \n",
" \n",
" | | \n",
" target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | x <= 0.0e+00 | \n",
" 0.3447 | \n",
" 0.7374 | \n",
"
\n",
" \n",
" | 0.0e+00 < x | \n",
" 0.5000 | \n",
" 0.2626 | \n",
"
\n",
" \n",
"
\n",
" \n",
"\n",
" X_dev distribution\n",
" \n",
" \n",
" | target_mean | \n",
" frequency | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0.3475 | \n",
" 0.8055 | \n",
"
\n",
" \n",
" | 0.5439 | \n",
" 0.1945 | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\defra\\Desktop\\git\\PROJECTS\\AutoCarver\\AutoCarver\\discretizers\\utils\\base_discretizer.py:433: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" sample.X.replace(\n"
]
}
],
"source": [
"from AutoCarver import BinaryCarver\n",
"\n",
"# initializing BinaryCarver\n",
"auto_carver = BinaryCarver(\n",
" features=features,\n",
" min_freq=min_freq,\n",
" dropna=dropna,\n",
" verbose=True, # showing statistics\n",
" copy=True, # whether or not to return a copy of the input dataset\n",
")\n",
"\n",
"# fitting on training sample, a dev sample can be specified to evaluate carving robustness\n",
"train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## AutoCarver analysis\n",
"\n",
"### Carving Summary"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" | \n",
" content | \n",
" frequency | \n",
"
\n",
" \n",
" | feature | \n",
" target_mean | \n",
" cramerv | \n",
" tschuprowt | \n",
" n_mod | \n",
" label | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" | Categorical('Sex') | \n",
" 0.187831 | \n",
" 0.533719 | \n",
" 0.533719 | \n",
" 2 | \n",
" 0 | \n",
" male | \n",
" 0.636364 | \n",
"
\n",
" \n",
" | 0.731481 | \n",
" 0.533719 | \n",
" 0.533719 | \n",
" 2 | \n",
" 1 | \n",
" female | \n",
" 0.363636 | \n",
"
\n",
" \n",
" | Ordinal('Pclass') | \n",
" 0.548507 | \n",
" 0.300144 | \n",
" 0.300144 | \n",
" 2 | \n",
" 0 | \n",
" [2, 1] | \n",
" 0.451178 | \n",
"
\n",
" \n",
" | 0.251534 | \n",
" 0.300144 | \n",
" 0.300144 | \n",
" 2 | \n",
" 1 | \n",
" 3 | \n",
" 0.548822 | \n",
"
\n",
" \n",
" | Quantitative('Age') | \n",
" 0.636364 | \n",
" 0.139166 | \n",
" 0.139166 | \n",
" 2 | \n",
" 0 | \n",
" x <= 8.0e+00 | \n",
" 0.074074 | \n",
"
\n",
" \n",
" | 0.365455 | \n",
" 0.139166 | \n",
" 0.139166 | \n",
" 2 | \n",
" 1 | \n",
" 8.0e+00 < x | \n",
" 0.925926 | \n",
"
\n",
" \n",
" | Quantitative('Fare') | \n",
" 0.319838 | \n",
" 0.295325 | \n",
" 0.295325 | \n",
" 2 | \n",
" 0 | \n",
" x <= 5.2e+01 | \n",
" 0.831650 | \n",
"
\n",
" \n",
" | 0.710000 | \n",
" 0.295325 | \n",
" 0.295325 | \n",
" 2 | \n",
" 1 | \n",
" 5.2e+01 < x | \n",
" 0.168350 | \n",
"
\n",
" \n",
" | Quantitative('Siblings/Spouses Aboard') | \n",
" 0.361386 | \n",
" 0.139722 | \n",
" 0.117492 | \n",
" 3 | \n",
" 0 | \n",
" x <= 0.00e+00 | \n",
" 0.680135 | \n",
"
\n",
" \n",
" | 0.500000 | \n",
" 0.139722 | \n",
" 0.117492 | \n",
" 3 | \n",
" 1 | \n",
" 0.00e+00 < x <= 1.00e+00 | \n",
" 0.232323 | \n",
"
\n",
" \n",
" | 0.269231 | \n",
" 0.139722 | \n",
" 0.117492 | \n",
" 3 | \n",
" 2 | \n",
" 1.00e+00 < x | \n",
" 0.087542 | \n",
"
\n",
" \n",
" | Quantitative('Parents/Children Aboard') | \n",
" 0.344749 | \n",
" 0.136439 | \n",
" 0.136439 | \n",
" 2 | \n",
" 0 | \n",
" x <= 0.0e+00 | \n",
" 0.737374 | \n",
"
\n",
" \n",
" | 0.500000 | \n",
" 0.136439 | \n",
" 0.136439 | \n",
" 2 | \n",
" 1 | \n",
" 0.0e+00 < x | \n",
" 0.262626 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" content \\\n",
"feature target_mean cramerv tschuprowt n_mod label \n",
"Categorical('Sex') 0.187831 0.533719 0.533719 2 0 male \n",
" 0.731481 0.533719 0.533719 2 1 female \n",
"Ordinal('Pclass') 0.548507 0.300144 0.300144 2 0 [2, 1] \n",
" 0.251534 0.300144 0.300144 2 1 3 \n",
"Quantitative('Age') 0.636364 0.139166 0.139166 2 0 x <= 8.0e+00 \n",
" 0.365455 0.139166 0.139166 2 1 8.0e+00 < x \n",
"Quantitative('Fare') 0.319838 0.295325 0.295325 2 0 x <= 5.2e+01 \n",
" 0.710000 0.295325 0.295325 2 1 5.2e+01 < x \n",
"Quantitative('Siblings/Spouses Aboard') 0.361386 0.139722 0.117492 3 0 x <= 0.00e+00 \n",
" 0.500000 0.139722 0.117492 3 1 0.00e+00 < x <= 1.00e+00 \n",
" 0.269231 0.139722 0.117492 3 2 1.00e+00 < x \n",
"Quantitative('Parents/Children Aboard') 0.344749 0.136439 0.136439 2 0 x <= 0.0e+00 \n",
" 0.500000 0.136439 0.136439 2 1 0.0e+00 < x \n",
"\n",
" frequency \n",
"feature target_mean cramerv tschuprowt n_mod label \n",
"Categorical('Sex') 0.187831 0.533719 0.533719 2 0 0.636364 \n",
" 0.731481 0.533719 0.533719 2 1 0.363636 \n",
"Ordinal('Pclass') 0.548507 0.300144 0.300144 2 0 0.451178 \n",
" 0.251534 0.300144 0.300144 2 1 0.548822 \n",
"Quantitative('Age') 0.636364 0.139166 0.139166 2 0 0.074074 \n",
" 0.365455 0.139166 0.139166 2 1 0.925926 \n",
"Quantitative('Fare') 0.319838 0.295325 0.295325 2 0 0.831650 \n",
" 0.710000 0.295325 0.295325 2 1 0.168350 \n",
"Quantitative('Siblings/Spouses Aboard') 0.361386 0.139722 0.117492 3 0 0.680135 \n",
" 0.500000 0.139722 0.117492 3 1 0.232323 \n",
" 0.269231 0.139722 0.117492 3 2 0.087542 \n",
"Quantitative('Parents/Children Aboard') 0.344749 0.136439 0.136439 2 0 0.737374 \n",
" 0.500000 0.136439 0.136439 2 1 0.262626 "
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"auto_carver.summary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* For quantitative feature ``Age``, the selected combination of modalities groups ages as follows:\n",
"    * modality ``0``: ages lower than or equal to 8 years old (``content=\"x <= 8.0e+00\"``)\n",
"    * modality ``1``: ages higher than 8 years old (``content=\"8.0e+00 < x\"``)\n",
"\n",
"* For qualitative categorical feature ``Sex``, the selected combination keeps ``content=\"male\"`` as modality ``0`` and ``content=\"female\"`` as modality ``1`` (with only two modalities, no grouping is possible)\n",
"\n",
"* For qualitative ordinal feature ``Pclass``, the selected combination of modalities groups socio-economic status as follows:\n",
"    * modality ``0``: upper and middle classes (``content=[2, 1]``)\n",
"    * modality ``1``: lower class (``content=3``)\n",
"    * The user-provided ordering of modalities has been preserved."
]
},
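{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration of how the labels in the summary relate to raw values, the sketch below rebuilds the ``Age`` and ``Pclass`` groupings by hand with pandas. The threshold ``8`` and the ``[2, 1]`` / ``3`` grouping are read off the summary table; the toy values and the variable names are made up for this example, and this is not how ``BinaryCarver`` is implemented internally.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# toy values (hypothetical, not taken from the Titanic dataset)\n",
"ages = pd.Series([2.0, 8.0, 8.5, 35.0])\n",
"pclass = pd.Series([1, 2, 3])\n",
"\n",
"# Age: label 0 for x <= 8, label 1 for 8 < x\n",
"age_label = (ages > 8.0).astype(int)\n",
"\n",
"# Pclass: classes 1 and 2 share label 0, class 3 gets label 1\n",
"pclass_label = pclass.map({1: 0, 2: 0, 3: 1})\n",
"\n",
"print(age_label.tolist())\n",
"print(pclass_label.tolist())\n",
"```\n",
"\n",
"In practice ``auto_carver.transform`` applies these groupings for you; the sketch only spells out what the ``content`` column means."
]
},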
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Detailed overview of tested combinations"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" info | \n",
" cramerv | \n",
" tschuprowt | \n",
" combination | \n",
" n_mod | \n",
" dropna | \n",
" train | \n",
" viable | \n",
" dev | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Raw distribution | \n",
" 0.321044 | \n",
" 0.269965 | \n",
" {'1': '1', '2': '2', '3': '3'} | \n",
" 3 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 1 | \n",
" Best for tschuprowt and max_n_mod=5 | \n",
" 0.300144 | \n",
" 0.300144 | \n",
" {'1': '1', '2': '1', '3': '3'} | \n",
" 2 | \n",
" False | \n",
" {'viable': True, 'info': ''} | \n",
" True | \n",
" {'viable': True, 'info': ''} | \n",
"
\n",
" \n",
" | 2 | \n",
" Not checked | \n",
" 0.321044 | \n",
" 0.269965 | \n",
" {'1': '1', '2': '2', '3': '3'} | \n",
" 3 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
" | 3 | \n",
" Not checked | \n",
" 0.265643 | \n",
" 0.265643 | \n",
" {'1': '1', '2': '2', '3': '2'} | \n",
" 2 | \n",
" False | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" info cramerv tschuprowt \\\n",
"0 Raw distribution 0.321044 0.269965 \n",
"1 Best for tschuprowt and max_n_mod=5 0.300144 0.300144 \n",
"2 Not checked 0.321044 0.269965 \n",
"3 Not checked 0.265643 0.265643 \n",
"\n",
" combination n_mod dropna \\\n",
"0 {'1': '1', '2': '2', '3': '3'} 3 False \n",
"1 {'1': '1', '2': '1', '3': '3'} 2 False \n",
"2 {'1': '1', '2': '2', '3': '3'} 3 False \n",
"3 {'1': '1', '2': '2', '3': '2'} 2 False \n",
"\n",
" train viable dev \n",
"0 NaN NaN NaN \n",
"1 {'viable': True, 'info': ''} True {'viable': True, 'info': ''} \n",
"2 NaN NaN NaN \n",
"3 NaN NaN NaN "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"features[\"Pclass\"].history"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* The most associated combination (the first one tested, i.e. the first row where ``info!=\"Raw distribution\"``) groups ``Pclass==1`` with ``Pclass==2`` and leaves ``Pclass==3`` as its own modality\n",
"\n",
"* For feature ``Pclass``, this first combination passes the tests:\n",
"    - ``viable=True``\n",
"    - ``info=\"Best for tschuprowt and max_n_mod=5\"``\n",
"    - Tschuprow's T with ``Survived`` is ``0.300144`` for this combination (by default, combinations are ranked according to this statistic)\n",
"    - The following combinations (less associated with the target) were not tested: ``info=\"Not checked\"``\n",
"\n",
"* For all combinations, ``dropna=False`` means that ``nan``s are not grouped with other modalities in that combination (as requested with ``dropna=False``)"
]
},
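{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ``cramerv`` and ``tschuprowt`` columns can be approximated from a contingency table of modalities against the target. The sketch below uses the textbook definitions with a plain Pearson chi-squared statistic (no continuity correction), which is an assumption about how ``AutoCarver`` computes them; the helper name ``association`` and the counts are made up. With a binary target, the two statistics coincide for 2 modalities and Tschuprow's T is smaller for 3 or more, which is consistent with the history table (``0.321044 / 2 ** 0.25`` is about ``0.269965``).\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def association(table):\n",
"    # returns (Cramer's V, Tschuprow's T) for a contingency table of counts\n",
"    table = np.asarray(table, dtype=float)\n",
"    n = table.sum()\n",
"    # expected counts under independence of rows and columns\n",
"    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n\n",
"    chi2 = ((table - expected) ** 2 / expected).sum()\n",
"    r, c = table.shape\n",
"    cramerv = np.sqrt(chi2 / (n * min(r - 1, c - 1)))\n",
"    tschuprowt = np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1))))\n",
"    return cramerv, tschuprowt\n",
"\n",
"\n",
"# hypothetical counts: rows are modalities, columns are target 0 / target 1\n",
"three_mod = [[30, 70], [50, 50], [80, 20]]\n",
"v3, t3 = association(three_mod)\n",
"\n",
"# same data after grouping the first two modalities together\n",
"two_mod = [[80, 120], [80, 20]]\n",
"v2, t2 = association(two_mod)\n",
"\n",
"print(v3, t3)\n",
"print(v2, t2)\n",
"```\n",
"\n",
"Because Tschuprow's T penalizes the number of modalities, ranking by it tends to favor coarser groupings than ranking by Cramér's V would."
]
},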
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Saving and Loading AutoCarver\n",
"### Saving\n",
"\n",
"All **Carvers** can safely be stored as a .json file."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"auto_carver.save(\"binary_carver.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading\n",
"\n",
"**Carvers** can safely be loaded from a .json file."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"auto_carver = BinaryCarver.load(\"binary_carver.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Applying AutoCarver"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\defra\\Desktop\\git\\PROJECTS\\AutoCarver\\AutoCarver\\discretizers\\utils\\base_discretizer.py:433: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`\n",
" sample.X.replace(\n"
]
}
],
"source": [
"dev_set_processed = auto_carver.transform(dev_set)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Pclass | \n",
" Age | \n",
" Fare | \n",
" Siblings/Spouses Aboard | \n",
" Parents/Children Aboard | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0.0 | \n",
" 0.665529 | \n",
" 0.450512 | \n",
" 0.064846 | \n",
" 0.832765 | \n",
" 0.682594 | \n",
" 0.805461 | \n",
"
\n",
" \n",
" | 1.0 | \n",
" 0.334471 | \n",
" 0.549488 | \n",
" 0.935154 | \n",
" 0.167235 | \n",
" 0.242321 | \n",
" 0.194539 | \n",
"
\n",
" \n",
" | 2.0 | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" NaN | \n",
" 0.075085 | \n",
" NaN | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Pclass Age Fare Siblings/Spouses Aboard \\\n",
"0.0 0.665529 0.450512 0.064846 0.832765 0.682594 \n",
"1.0 0.334471 0.549488 0.935154 0.167235 0.242321 \n",
"2.0 NaN NaN NaN NaN 0.075085 \n",
"\n",
" Parents/Children Aboard \n",
"0.0 0.805461 \n",
"1.0 0.194539 \n",
"2.0 NaN "
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Feature Selection\n",
"## Selectors settings\n",
"### Features to select from\n",
"\n",
"Here all features have been carved using ``BinaryCarver``, hence all features are qualitative.\n",
"\n",
"### Number of features to select\n",
"\n",
"The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative)."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"n_best_per_type = 4 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using Selectors"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" [ClassificationSelector] Selected Qualitative Features \n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
" \n",
" \n",
" | | \n",
" feature | \n",
" Nan | \n",
" Mode | \n",
" TschuprowtMeasure | \n",
" TschuprowtRank | \n",
" TschuprowtFilter | \n",
" TschuprowtWith | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" Categorical('Sex') | \n",
" 0.0000 | \n",
" 0.6364 | \n",
" 0.5337 | \n",
" 0.0000 | \n",
" 0.0000 | \n",
" itself | \n",
"
\n",
" \n",
" | 1 | \n",
" Ordinal('Pclass') | \n",
" 0.0000 | \n",
" 0.5488 | \n",
" 0.3001 | \n",
" 1.0000 | \n",
" 0.0988 | \n",
" Sex | \n",
"
\n",
" \n",
" | 3 | \n",
" Quantitative('Fare') | \n",
" 0.0000 | \n",
" 0.8316 | \n",
" 0.2953 | \n",
" 2.0000 | \n",
" 0.3922 | \n",
" Pclass | \n",
"
\n",
" \n",
" | 2 | \n",
" Quantitative('Age') | \n",
" 0.0000 | \n",
" 0.9259 | \n",
" 0.1392 | \n",
" 3.0000 | \n",
" 0.1002 | \n",
" Sex | \n",
"
\n",
" \n",
" | 5 | \n",
" Quantitative('Parents/Children Aboard') | \n",
" 0.0000 | \n",
" 0.7374 | \n",
" 0.1364 | \n",
" 4.0000 | \n",
" 0.4666 | \n",
" Age | \n",
"
\n",
" \n",
" | 4 | \n",
" Quantitative('Siblings/Spouses Aboard') | \n",
" 0.0000 | \n",
" 0.6801 | \n",
" 0.1175 | \n",
" 5.0000 | \n",
" 0.4060 | \n",
" Parents/Children Aboard | \n",
"
\n",
" \n",
"
\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"Features(['Sex', 'Pclass', 'Fare', 'Age'])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from AutoCarver import ClassificationSelector\n",
"\n",
"# select the features most associated with the target\n",
"feature_selector = ClassificationSelector(\n",
" features=features,\n",
" n_best_per_type=n_best_per_type,\n",
" verbose=True, # displays statistics\n",
")\n",
"best_features = feature_selector.select(train_set_processed, train_set_processed[target])\n",
"best_features"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Sex | \n",
" Pclass | \n",
" Fare | \n",
" Age | \n",
"
\n",
" \n",
" \n",
" \n",
" | 617 | \n",
" 0 | \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 489 | \n",
" 0 | \n",
" 0 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 871 | \n",
" 1 | \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 654 | \n",
" 1 | \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
" | 653 | \n",
" 0 | \n",
" 1 | \n",
" 0.0 | \n",
" 1.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sex Pclass Fare Age\n",
"617 0 1 0.0 1.0\n",
"489 0 0 0.0 1.0\n",
"871 1 1 0.0 1.0\n",
"654 1 1 0.0 1.0\n",
"653 0 1 0.0 1.0"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_set_processed[best_features].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Feature ``Sex`` is the most associated with the target ``Survived``:\n",
"    - Tschuprow's T value is ``TschuprowtMeasure=0.5337``\n",
"    - It has 0 % of NaNs (``Nan=0.0``)\n",
"    - Its mode represents 64 % of observed data (``Mode=0.6364``)\n",
"\n",
"* Feature ``Fare`` is strongly associated with feature ``Pclass``:\n",
"    - Tschuprow's T value is ``TschuprowtFilter=0.3922`` with ``TschuprowtWith=Pclass``\n",
"\n",
"* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Modeling\n",
"Fitting model on train data"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"XGBClassifier(base_score=None, booster=None, callbacks=None,\n",
" colsample_bylevel=None, colsample_bynode=None,\n",
" colsample_bytree=None, device=None, early_stopping_rounds=None,\n",
" enable_categorical=False, eval_metric=None, feature_types=None,\n",
" gamma=None, grow_policy=None, importance_type=None,\n",
" interaction_constraints=None, learning_rate=None, max_bin=None,\n",
" max_cat_threshold=None, max_cat_to_onehot=None,\n",
" max_delta_step=None, max_depth=None, max_leaves=None,\n",
" min_child_weight=None, missing=nan, monotone_constraints=None,\n",
" multi_strategy=None, n_estimators=None, n_jobs=None,\n",
" num_parallel_tree=None, random_state=None, ...)"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from xgboost import XGBClassifier\n",
"\n",
"model = XGBClassifier()\n",
"model.fit(train_set_processed[best_features], train_set_processed[target])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Saving model"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"model.save_model(\"binary_xgboost.json\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Prediction on dev dataset and performance"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(0.8548426745329402)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import roc_auc_score\n",
"\n",
"dev_pred = model.predict_proba(dev_set_processed[best_features])[:, 1]\n",
"roc_auc_score(dev_set_processed[target], dev_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's next?\n",
"\n",
"* Thanks to **Carvers** all of your features are now optimally processed for your classification task!\n",
"* As a final step towards your model, **Selectors** can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out [Selectors Examples](https://autocarver.readthedocs.io/en/latest/selectors_examples.html)!\n",
"\n",
"## Well done!\n",
"\n",
"Your commitment to achieving optimal results in binary classification tasks shines through in your meticulous use of **AutoCarver**'s ``BinaryCarver`` for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.\n",
"\n",
"The ``BinaryCarver`` has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.\n",
"\n",
"We extend our sincere appreciation for choosing **AutoCarver** as your companion in the data preprocessing journey. Your use of **AutoCarver** demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in binary classification tasks.\n",
"\n",
"As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We're excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.\n",
"\n",
"Thank you for trusting **AutoCarver**, and we wish you continued success in your data-driven ventures."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "autocarver-i96ERKJw-py3.9",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 2
}