Setting things up

About this notebook

In this notebook, we preprocess the Titanic dataset with the BinaryCarver pipeline to improve its predictive power. BinaryCarver is a Python tool that discretizes both quantitative and qualitative data so as to maximize the association between each feature and a binary target. Our focus is on preparing the dataset for a binary classification task: predicting survival outcomes.

The Titanic Dataset, derived from the iconic 1912 Titanic passenger information, provides a diverse set of features ranging from socio-economic status and age to cabin location. Leveraging BinaryCarver, we aim to perform association-maximizing discretization, refining both quantitative and qualitative features to create a finely tuned dataset for our binary classification endeavors.

Throughout this notebook, we’ll delve into the intricacies of BinaryCarver’s discretization pipeline, exploring its capabilities in handling a variety of data types. Whether it’s transforming passenger ages or classifying fares, BinaryCarver’s adaptability ensures that every feature is optimally represented for our classification tasks.

Join us in this exploration as we harness the power of BinaryCarver to preprocess the Titanic Dataset. Through effective feature engineering and discretization, we strive to create a dataset that not only captures the nuances of the Titanic passenger profiles but also sets the stage for the development of accurate and impactful binary classification models.

Let’s dive in and uncover the potential of BinaryCarver in transforming the Titanic Dataset for optimal predictive modeling.

Installation

[1]:
%pip install AutoCarver[jupyter]
Collecting AutoCarver[jupyter]
  Downloading AutoCarver-5.4.6-py3-none-any.whl (69 kB)
     --------------------------------------- 69.2/69.2 kB 37.0 kB/s eta 0:00:00
Requirement already satisfied: pandas>=2.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (2.1.3)
Requirement already satisfied: numpy in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (1.24.2)
Requirement already satisfied: scipy in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (1.10.1)
Requirement already satisfied: statsmodels in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (0.14.0)
Requirement already satisfied: scikit-learn in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (1.2.2)
Requirement already satisfied: tqdm in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (4.65.0)
Requirement already satisfied: ipython in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (8.11.0)
Requirement already satisfied: matplotlib in c:\users\defra\.conda\envs\py39\lib\site-packages (from AutoCarver[jupyter]) (3.7.1)
Requirement already satisfied: tzdata>=2022.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from pandas>=2.1->AutoCarver[jupyter]) (2023.3)
Requirement already satisfied: pytz>=2020.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from pandas>=2.1->AutoCarver[jupyter]) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\defra\.conda\envs\py39\lib\site-packages (from pandas>=2.1->AutoCarver[jupyter]) (2.8.2)
Requirement already satisfied: pygments>=2.4.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (2.14.0)
Requirement already satisfied: matplotlib-inline in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.1.6)
Requirement already satisfied: colorama in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.4.6)
Requirement already satisfied: pickleshare in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.7.5)
Requirement already satisfied: traitlets>=5 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (5.9.0)
Requirement already satisfied: backcall in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.2.0)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (3.0.38)
Requirement already satisfied: decorator in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (5.1.1)
Requirement already satisfied: stack-data in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.6.2)
Requirement already satisfied: jedi>=0.16 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->AutoCarver[jupyter]) (0.18.2)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (1.4.4)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (1.0.7)
Requirement already satisfied: importlib-resources>=3.2.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (5.12.0)
Requirement already satisfied: packaging>=20.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (23.0)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (4.39.1)
Requirement already satisfied: cycler>=0.10 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (0.11.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (3.0.9)
Requirement already satisfied: pillow>=6.2.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from matplotlib->AutoCarver[jupyter]) (9.4.0)
Requirement already satisfied: joblib>=1.1.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from scikit-learn->AutoCarver[jupyter]) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from scikit-learn->AutoCarver[jupyter]) (3.1.0)
Requirement already satisfied: patsy>=0.5.2 in c:\users\defra\.conda\envs\py39\lib\site-packages (from statsmodels->AutoCarver[jupyter]) (0.5.3)
Requirement already satisfied: zipp>=3.1.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from importlib-resources>=3.2.0->matplotlib->AutoCarver[jupyter]) (3.15.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from jedi>=0.16->ipython->AutoCarver[jupyter]) (0.8.3)
Requirement already satisfied: six in c:\users\defra\.conda\envs\py39\lib\site-packages (from patsy>=0.5.2->statsmodels->AutoCarver[jupyter]) (1.16.0)
Requirement already satisfied: wcwidth in c:\users\defra\.conda\envs\py39\lib\site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython->AutoCarver[jupyter]) (0.2.6)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->AutoCarver[jupyter]) (2.2.1)
Requirement already satisfied: pure-eval in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->AutoCarver[jupyter]) (0.2.2)
Requirement already satisfied: executing>=1.2.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->AutoCarver[jupyter]) (1.2.0)
Installing collected packages: AutoCarver
Successfully installed AutoCarver-5.4.6
Note: you may need to restart the kernel to use updated packages.

Titanic Data

In this example notebook, we will use the Titanic dataset.

The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.

The dataset includes various features such as passengers’ names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).

[3]:
import pandas as pd

# URL to the Titanic dataset (CSV hosted on the Stanford CS109 course site)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)

# Display the first few rows of the dataset
titanic_data.head()
[3]:
   Survived  Pclass  Name                                               Sex     Age   Siblings/Spouses Aboard  Parents/Children Aboard  Fare
0  0         3       Mr. Owen Harris Braund                             male    22.0  1                        0                        7.2500
1  1         1       Mrs. John Bradley (Florence Briggs Thayer) Cum...  female  38.0  1                        0                        71.2833
2  1         3       Miss. Laina Heikkinen                              female  26.0  0                        0                        7.9250
3  1         1       Mrs. Jacques Heath (Lily May Peel) Futrelle        female  35.0  1                        0                        53.1000
4  0         3       Mr. William Henry Allen                            male    35.0  0                        0                        8.0500

Target type and Carver selection

[4]:
target = "Survived"

titanic_data[target].value_counts(dropna=False)
[4]:
Survived
0    545
1    342
Name: count, dtype: int64

The target "Survived" is a binary target of type int64 used in a classification task. Hence we will use AutoCarver.BinaryCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.

Data Sampling

[5]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])
c:\Users\defra\.conda\envs\py39\lib\site-packages\sklearn\utils\validation.py:605: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
  if is_sparse(pd_dtype):
c:\Users\defra\.conda\envs\py39\lib\site-packages\sklearn\utils\validation.py:614: FutureWarning: is_sparse is deprecated and will be removed in a future version. Check `isinstance(dtype, pd.SparseDtype)` instead.
  if is_sparse(pd_dtype) or not is_extension_array_dtype(pd_dtype):
[6]:
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[6]:
(0.38552188552188554, 0.3856655290102389)

Picking up columns to Carve

[7]:
train_set.head()
[7]:
     Survived  Pclass  Name                          Sex     Age   Siblings/Spouses Aboard  Parents/Children Aboard  Fare
617  0         3       Mr. Antoni Yasbeck            male    27.0  1                        0                        14.4542
489  0         1       Mr. Harry Markland Molson     male    55.0  0                        0                        30.5000
871  1         3       Miss. Adele Kiamie Najib      female  15.0  0                        0                        7.2250
654  0         3       Mrs. John (Catherine) Bourke  female  32.0  1                        1                        15.5000
653  0         3       Mr. Alexander Radeff          male    27.0  0                        0                        7.8958
[8]:
# column data types
train_set.dtypes
[8]:
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
[9]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()
[9]:
Parents/Children Aboard
0    438
1     87
2     60
3      3
5      3
4      2
6      1
Name: count, dtype: int64
[10]:
# values taken by Pclass
train_set["Pclass"].value_counts()
[10]:
Pclass
3    326
1    142
2    126
Name: count, dtype: int64

The feature "Pclass" is of type "int64", but it represents socio-economic status and can be considered a qualitative ordinal feature rather than a quantitative discrete feature. Thus we will add it to the list of ordinal_features and set the ordering of its values in values_orders (as string values).

"Sex" is the only qualitative categorical feature, so it is added to the list of qualitative_features.

"Fare" is the only quantitative continuous feature, whilst "Age", "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. Those four features will be added to the list of quantitative_features.

[11]:
# lists of features per data type
quantitative_features = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
qualitative_features = ["Sex"]
ordinal_features = ["Pclass"]

# user-specified ordering for ordinal features
values_orders = {
    "Pclass": ["1", "2", "3"]
}

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq sets the minimum frequency per basic modality. It is used by Discretizers:

  • For quantitative features, it defines the number of quantiles used to initially discretize the features.

  • For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.

[12]:
min_freq = 0.05

Tip: min_freq should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
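The effect of min_freq on a qualitative feature can be pictured with a minimal sketch in plain Python. It is an illustration under stated assumptions, not AutoCarver's implementation: the helper group_rare_modalities and the toy data are hypothetical, the default label "__OTHER__" is the one this notebook mentions for categorical features, and reading min_freq=0.05 as roughly 1/0.05 = 20 initial quantiles is an assumption.

```python
from collections import Counter


def group_rare_modalities(values, min_freq, default="__OTHER__"):
    """Replace modalities rarer than min_freq with a default label (hypothetical helper)."""
    counts = Counter(values)
    n_obs = len(values)
    # modalities whose relative frequency falls under the threshold
    rare = {modality for modality, count in counts.items() if count / n_obs < min_freq}
    return [default if value in rare else value for value in values]


# toy qualitative feature: modality "C" covers only 3 % of observations
toy_feature = ["A"] * 70 + ["B"] * 27 + ["C"] * 3
grouped = group_rare_modalities(toy_feature, min_freq=0.05)
grouped_counts = Counter(grouped)

# assumption: min_freq=0.05 corresponds to about 1 / 0.05 = 20 initial quantiles
n_quantiles = round(1 / 0.05)
```

With this toy data, "C" falls under the 5 % threshold and is relabelled, while "A" and "B" are left untouched. Lowering min_freq keeps more of the rare modalities, at the cost of less robust groups.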

Desired number of modalities

The attribute max_n_mod sets the maximum number of modalities per carved feature. It is used by Carvers as the upper limit on the number of modalities per consecutive combination of modalities.

[13]:
max_n_mod = 5

Tip: max_n_mod should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
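To see why max_n_mod drives computation time, the following sketch enumerates every way of splitting an ordered list of modalities into 2 to max_n_mod consecutive groups. The helper names are hypothetical and this is not AutoCarver's code; it is a plain combinatorial illustration. Under the assumption that the ungrouped distribution also counts as a combination, the resulting counts coincide with the progress-bar totals of the fitting logs further down (3 for Pclass, 7 for Siblings/Spouses Aboard, 2516 for Fare with 17 raw modalities, 3213 for Age with 18).

```python
from itertools import combinations
from math import comb


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # choose n_groups - 1 cut positions among the n - 1 gaps between modalities
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


def n_groupings(n_modalities, max_n_mod):
    """Closed-form count of the groupings enumerated above."""
    upper = min(max_n_mod, n_modalities)
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, upper + 1))


# the three ordered modalities of Pclass, with max_n_mod=5 as in this notebook
pclass_groupings = list(consecutive_groupings(["1", "2", "3"], max_n_mod=5))
# pclass_groupings holds [['1'], ['2', '3']], [['1', '2'], ['3']] and [['1'], ['2'], ['3']]
```

The count grows quickly with the number of raw modalities, which is why a lower max_n_mod (or a higher min_freq) speeds up carving.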

Association metric

The attribute sort_by sets the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric, and the best-ranked viable combination is returned by Carvers.

[14]:
# For BinaryCarver, to be chosen amongst ["tschuprowt", "cramerv"]
sort_by = "tschuprowt"  # "cramerv"

Tip: use "tschuprowt" for more robust results and fewer output modalities; use "cramerv" for more output modalities.
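The difference between the two metrics can be sketched in plain Python from their textbook definitions: Cramér's V = sqrt(chi2 / (n * min(r-1, c-1))) and Tschuprow's T = sqrt(chi2 / (n * sqrt((r-1)(c-1)))), for a contingency table with r rows and c columns. The function names below are hypothetical and this is not AutoCarver's code. The Pclass counts are reconstructed from the class counts above and the training target rates printed in the fitting logs, so they are an assumption, as is the use of Yates' continuity correction on 2x2 tables (chosen because it reproduces the printed values).

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-square of a contingency table (rows: modalities, columns: target classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n_obs = sum(row_totals)
    # assumption: Yates' continuity correction on 2x2 tables only
    yates = 0.5 if len(row_totals) == 2 and len(col_totals) == 2 else 0.0
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n_obs
            chi2 += max(abs(observed - expected) - yates, 0.0) ** 2 / expected
    return chi2, n_obs


def tschuprowt(table):
    """Tschuprow's T: penalizes tables with many rows and columns."""
    chi2, n_obs = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n_obs * sqrt((n_rows - 1) * (n_cols - 1))))


def cramerv(table):
    """Cramér's V: normalizes by the smaller dimension only."""
    chi2, n_obs = chi2_statistic(table)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n_obs * min(n_rows - 1, n_cols - 1)))


# (survived, died) counts per Pclass modality on the training set (reconstructed)
raw_pclass = [[88, 54], [59, 67], [82, 244]]   # classes 1, 2, 3 kept apart
grouped_pclass = [[147, 121], [82, 244]]       # classes 1 and 2 grouped

t_raw, t_grouped = tschuprowt(raw_pclass), tschuprowt(grouped_pclass)
v_raw, v_grouped = cramerv(raw_pclass), cramerv(grouped_pclass)
```

On these counts, Tschuprow's T ranks the two-modality grouping above the three-modality one, whereas Cramér's V ranks them the other way round, which illustrates the tip above: T's normalization grows with the number of modalities, so it tends to favor fewer of them.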

Grouping NaNs

The attribute dropna allows one to choose whether or not numpy.nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with numpy.nan.

[15]:
dropna = False  # anyway, there are no numpy.nan in this dataset

Optional attributes

Minimal frequency per carved modality

The attribute min_freq_mod allows one to choose the minimum frequency per output modality. It is used by Carvers in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to min_freq/2.

[16]:
min_freq_mod = None  # None keeps the default (min_freq/2); setting 0.05 would require at least 5 % of observations per output modality in train and dev sets

Type of output carved features

The attribute output_dtype allows one to choose the output type:

  • Use "float" for numeric output, where labels are the ranks of modalities (default)

  • Use "str" for string output

[17]:
output_dtype = "float"  # "str"

Fitting AutoCarver

  • First, all qualitative features are discretized:

    1. Using StringDiscretizer to convert them to str if not already the case

    2. For qualitative ordinal features: using OrdinalDiscretizer to group under-represented values (less frequent than min_freq=0.05) with their closest modality

    3. For qualitative categorical features: using CategoricalDiscretizer for under-represented values (less frequent than min_freq=0.05) to be grouped with a default value (str_default="__OTHER__")

  • Second, all quantitative features are discretized:

    1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.05)

    2. Using OrdinalDiscretizer to group any remaining under-represented values (less frequent than min_freq/2=0.025) with their closest modality

  • Third, all features are carved following this recipe, for all classes of train_set[target] (except one):

    1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step

    2. Grouping modalities: all consecutive combinations of modalities are applied to train_set

    3. Computing associations: the association metric (here sort_by="tschuprowt") is computed with the provided target train_set[target]

    4. Combinations are sorted in descending order by association value

    5. Testing robustness: finds the first combination that checks the following:

      • Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq_mod)

      • Distinct target rates per consecutive modalities on train_set and dev_set

      • No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)

    6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with numpy.nan are applied to train_set and steps 3. and 4. are run

    7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
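The robustness tests of step 5 can be sketched as a small function. This is a simplified reading of the three bullets above, not AutoCarver's implementation: the helper is_viable is hypothetical, and "distinct target rates" is interpreted here as strict inequality between consecutive modalities. The figures are the (target_rate, frequency) pairs printed for Siblings/Spouses Aboard in the fitting logs below, with min_freq_mod at its default of min_freq/2.

```python
def is_viable(train, dev, min_freq_mod):
    """Simplified robustness check of one combination (hypothetical helper).

    train and dev are lists of (target_rate, frequency) tuples, one per modality,
    given in the same modality order.
    """
    # 1. representativeness: every modality is frequent enough in both samples
    if any(freq < min_freq_mod for _, freq in train + dev):
        return False
    # 2. distinct target rates between consecutive modalities in both samples
    for sample in (train, dev):
        rates = [rate for rate, _ in sample]
        if any(left == right for left, right in zip(rates, rates[1:])):
            return False

    # 3. no inversion: modalities rank identically by target rate in both samples
    def ranking(sample):
        return sorted(range(len(sample)), key=lambda i: sample[i][0])

    return ranking(train) == ranking(dev)


min_freq_mod = 0.05 / 2

# raw distribution of Siblings/Spouses Aboard (4 modalities)
raw_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.4138, 0.0488), (0.0870, 0.0387)]
raw_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.3333, 0.0512), (0.1429, 0.0239)]

# carved distribution of Siblings/Spouses Aboard (3 modalities)
carved_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.2692, 0.0875)]
carved_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.2727, 0.0751)]

raw_is_viable = is_viable(raw_train, raw_dev, min_freq_mod)
carved_is_viable = is_viable(carved_train, carved_dev, min_freq_mod)
```

In this sketch the raw distribution is rejected because its last modality covers only 2.39 % of dev_set, below the 2.5 % threshold, while the carved distribution passes all three checks.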

[18]:
from AutoCarver import BinaryCarver

# instantiating AutoCarver
auto_carver = BinaryCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
[Discretizer] Fit Qualitative Features
---
 - [StringDiscretizer] Fit ['Pclass']
 - [OrdinalDiscretizer] Fit ['Pclass']
 - [CategoricalDiscretizer] Fit ['Sex']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['Parents/Children Aboard', 'Fare', 'Siblings/Spouses Aboard', 'Age']
 - [OrdinalDiscretizer] Fit ['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
------


------
[AutoCarver] Fit Sex (1/6)
---

 - [AutoCarver] Raw distribution
c:\Users\defra\Desktop\git\PROJECTS\AutoCarver\docs\source\examples\BinaryClassification\../../../../../AutoCarver\AutoCarver\discretizers\discretizers.py:325: UserWarning:  - [QualitativeDiscretizer] Non-string features: ['Pclass']. Trying to convert them using type_discretizers.StringDiscretizer, otherwise convert them manually. Unexpected data types: [<class 'int'>].
  warn(
X distribution
  target_rate frequency
male 0.1878 0.6364
female 0.7315 0.3636
X_dev distribution
target_rate frequency
0.1949 0.6655
0.7653 0.3345
Grouping modalities   : 100%|██████████| 1/1 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 1/1 [00:00<00:00, 663.24it/s]
Testing robustness    :   0%|          | 0/1 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
male 0.1878 0.6364
female 0.7315 0.3636
X_dev distribution
target_rate frequency
0.1949 0.6655
0.7653 0.3345
------


------
[AutoCarver] Fit Siblings/Spouses Aboard (2/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 0.000e+00 0.3614 0.6801
0.000e+00 < x <= 1.000e+00 0.5000 0.2323
1.000e+00 < x <= 3.000e+00 0.4138 0.0488
3.000e+00 < x 0.0870 0.0387
X_dev distribution
target_rate frequency
0.3200 0.6826
0.6056 0.2423
0.3333 0.0512
0.1429 0.0239
Grouping modalities   : 100%|██████████| 7/7 [00:00<00:00, 7025.63it/s]
Computing associations: 100%|██████████| 7/7 [00:00<00:00, 3453.32it/s]
Testing robustness    :  29%|██▊       | 2/7 [00:00<00:00, 155.62it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
x <= 0.000e+00 0.3614 0.6801
0.000e+00 < x <= 1.000e+00 0.5000 0.2323
1.000e+00 < x 0.2692 0.0875
X_dev distribution
target_rate frequency
0.3200 0.6826
0.6056 0.2423
0.2727 0.0751
------


------
[AutoCarver] Fit Fare (3/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 7.125e+00 0.0333 0.0505
7.125e+00 < x <= 7.250e+00 0.2258 0.0522
7.250e+00 < x <= 7.796e+00 0.3462 0.0875
7.796e+00 < x <= 7.896e+00 0.2000 0.0673
7.896e+00 < x <= 8.050e+00 0.2340 0.0791
8.050e+00 < x <= 1.046e+01 0.2258 0.0522
1.046e+01 < x <= 1.400e+01 0.4833 0.1010
1.400e+01 < x <= 1.610e+01 0.2812 0.0539
1.610e+01 < x <= 2.300e+01 0.5333 0.0505
2.300e+01 < x <= 2.600e+01 0.3333 0.0606
2.600e+01 < x <= 2.772e+01 0.5417 0.0404
2.772e+01 < x <= 3.127e+01 0.3125 0.0539
3.127e+01 < x <= 4.012e+01 0.3929 0.0471
4.012e+01 < x <= 5.590e+01 0.4333 0.0505
5.590e+01 < x <= 7.673e+01 0.5667 0.0505
7.673e+01 < x <= 1.109e+02 0.7419 0.0522
1.109e+02 < x 0.8000 0.0505
X_dev distribution
target_rate frequency
0.0833 0.0410
0.2000 0.0341
0.2222 0.0922
0.0556 0.0614
0.2000 0.0512
0.0870 0.0785
0.3947 0.1297
0.4167 0.0819
0.5263 0.0648
0.5294 0.0580
0.6667 0.0307
0.4667 0.0512
0.4167 0.0410
0.6667 0.0307
0.5714 0.0478
0.7500 0.0546
0.6667 0.0512
Grouping modalities   : 100%|██████████| 2516/2516 [00:00<00:00, 7771.79it/s]
Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4375.16it/s]
Testing robustness    :   0%|          | 0/2516 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
x <= 1.046e+01 0.2251 0.3889
1.046e+01 < x <= 7.673e+01 0.4305 0.5084
7.673e+01 < x 0.7705 0.1027
X_dev distribution
target_rate frequency
0.1429 0.3584
0.4841 0.5358
0.7097 0.1058
------


------
[AutoCarver] Fit Age (4/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 4.000e+00 0.7333 0.0505
4.000e+00 < x <= 1.400e+01 0.3103 0.0488
1.400e+01 < x <= 1.700e+01 0.4286 0.0471
1.700e+01 < x <= 1.900e+01 0.3636 0.0741
1.900e+01 < x <= 2.000e+01 0.1176 0.0286
2.000e+01 < x <= 2.200e+01 0.3273 0.0926
2.200e+01 < x <= 2.400e+01 0.5000 0.0572
2.400e+01 < x <= 2.700e+01 0.3556 0.0758
2.700e+01 < x <= 2.800e+01 0.2632 0.0320
2.800e+01 < x <= 3.100e+01 0.3571 0.0943
3.100e+01 < x <= 3.300e+01 0.4483 0.0488
3.300e+01 < x <= 3.600e+01 0.4146 0.0690
3.600e+01 < x <= 3.800e+01 0.4118 0.0286
3.800e+01 < x <= 4.100e+01 0.3871 0.0522
4.100e+01 < x <= 4.500e+01 0.4167 0.0606
4.500e+01 < x <= 4.900e+01 0.5000 0.0438
4.900e+01 < x <= 5.600e+01 0.3448 0.0488
5.600e+01 < x 0.1786 0.0471
X_dev distribution
target_rate frequency
0.5385 0.0444
0.6250 0.0546
0.3571 0.0478
0.3200 0.0853
0.3333 0.0205
0.1579 0.0648
0.3077 0.0887
0.4074 0.0922
0.2778 0.0614
0.4400 0.0853
0.5455 0.0375
0.6190 0.0717
0.2500 0.0273
0.1875 0.0546
0.2727 0.0375
0.3333 0.0410
0.5833 0.0410
0.3846 0.0444
Grouping modalities   : 100%|██████████| 3213/3213 [00:00<00:00, 7916.00it/s]
Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 4466.30it/s]
Testing robustness    :   0%|          | 0/3213 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
x <= 4.000e+00 0.7333 0.0505
4.000e+00 < x 0.3670 0.9495
X_dev distribution
target_rate frequency
0.5385 0.0444
0.3786 0.9556
------


------
[AutoCarver] Fit Pclass (5/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
1, 1 0.6197 0.2391
2, 2 0.4683 0.2121
3, 3 0.2515 0.5488
X_dev distribution
target_rate frequency
0.6486 0.2526
0.4828 0.1980
0.2298 0.5495
Grouping modalities   : 100%|██████████| 3/3 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3000.22it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
1 to 2 0.5485 0.4512
3, 3 0.2515 0.5488
X_dev distribution
target_rate frequency
0.5758 0.4505
0.2298 0.5495
------


------
[AutoCarver] Fit Parents/Children Aboard (6/6)
---

 - [AutoCarver] Raw distribution
X distribution
  target_rate frequency
x <= 0.000e+00 0.3447 0.7374
0.000e+00 < x <= 1.000e+00 0.5057 0.1465
1.000e+00 < x 0.4928 0.1162
X_dev distribution
target_rate frequency
0.3475 0.8055
0.6774 0.1058
0.3846 0.0887
Grouping modalities   : 100%|██████████| 3/3 [00:00<00:00, 2901.96it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3006.67it/s]
Testing robustness    :   0%|          | 0/3 [00:00<?, ?it/s]

 - [AutoCarver] Carved distribution

X distribution
  target_rate frequency
x <= 0.000e+00 0.3447 0.7374
0.000e+00 < x 0.5000 0.2626
X_dev distribution
target_rate frequency
0.3475 0.8055
0.5439 0.1945
------

AutoCarver analysis

Carving Summary

[19]:
auto_carver.summary()
[19]:
feature                  dtype  label  content
Age                      float  0      [x <= 4.000e+00]
                         float  1      [4.000e+00 < x]
Fare                     float  0      [x <= 1.046e+01]
                         float  1      [1.046e+01 < x <= 7.673e+01]
                         float  2      [7.673e+01 < x]
Parents/Children Aboard  float  0      [x <= 0.000e+00]
                         float  1      [0.000e+00 < x]
Siblings/Spouses Aboard  float  0      [x <= 0.000e+00]
                         float  1      [0.000e+00 < x <= 1.000e+00]
                         float  2      [1.000e+00 < x]
Pclass                   str    0      [1, 2]
                         str    1      [3]
Sex                      str    0      [male]
                         str    1      [female]

  • As requested with output_dtype="float", output labels are the integer ranks of modalities

  • For quantitative feature Age, the selected combination of modalities groups ages as follows:

    • modality 0: lower or equal to 4 years old (content==["x <= 4.000e+00"])

    • modality 1: ages higher than 4 years old (content==["4.000e+00 < x"])

  • For qualitative categorical feature Sex, the selected combination of modalities has left modalities content=["male"] in modality 0 and content=["female"] in modality 1 (no combination possible)

  • For qualitative ordinal feature Pclass, the selected combination of modalities groups socio-economic status as follows:

    • modality 0: upper and middle classes (content==[1, 2])

    • modality 1: lower class (content==[3]).

    • The user-provided ordering of modalities has been preserved.
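For quantitative features, the summary can be read as a lookup from a raw value to the rank of the interval that contains it. The sketch below shows this for Fare with the standard-library bisect module. The helper carved_label is hypothetical, and the bounds are the rounded values printed in the summary (1.046e+01 and 7.673e+01), so results for fares very close to a bound may differ from what the fitted carver returns.

```python
from bisect import bisect_left

# upper bounds of the first two carved Fare modalities, as printed (rounded) in the summary
fare_upper_bounds = [10.46, 76.73]


def carved_label(value, upper_bounds):
    """Rank of the interval containing value, intervals being closed on the right (x <= bound)."""
    return bisect_left(upper_bounds, value)


# fares of the first five passengers shown by titanic_data.head()
fares = [7.25, 71.2833, 7.925, 53.1, 8.05]
labels = [carved_label(fare, fare_upper_bounds) for fare in fares]
```

bisect_left is used rather than bisect_right so that a value equal to a bound lands in the lower interval, matching the "x <= bound" convention of the summary.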

Detailed overview of tested combinations

[20]:
auto_carver.history(feature="Pclass")
[20]:
   combination               tschuprowt  viability  viability_message                         grouping_nan
0  [[1, 1], [2, 2], [3, 3]]  0.269965    None       [Raw X distribution]                      False
1  [[1, 1, 2, 2], [3, 3]]    0.300144    True       [Combination robust between X and X_dev]  False
2  [[1, 1], [2, 2], [3, 3]]  0.269965    None       [Not checked]                             False
3  [[1, 1], [2, 2, 3, 3]]    0.265643    None       [Not checked]                             False

  • The most associated combination (the first tested out, where viability_message!=["Raw X distribution"]) groups Pclass==1 with Pclass==2 and leaves Pclass==3 as its own modality

  • For feature Pclass, the first combination tested passes the tests:

    • viability_message==["Combination robust between X and X_dev"]

    • Tschuprow’s T with Survived is 0.300144 for this combination

    • The following combinations (less associated with the target) were not tested: viability_message==["Not checked"]

  • For all combinations, grouping_nan==False means that it is not a combination in which NaNs are grouped with other modalities (as requested with dropna=False)

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[21]:
import json

# storing as json file
with open('binary_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)

Loading

Carvers can safely be loaded from a .json file.

[22]:
import json

from AutoCarver import load_carver

# loading json file
with open('binary_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))

Applying AutoCarver

[23]:
dev_set_processed = auto_carver.transform(dev_set)
[24]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[24]:
     Sex       Siblings/Spouses Aboard  Fare      Age       Pclass    Parents/Children Aboard
0.0  0.665529  0.682594                 0.358362  0.044369  0.450512  0.805461
1.0  0.334471  0.242321                 0.535836  0.955631  0.549488  0.194539
2.0  NaN       0.075085                 0.105802  NaN       NaN       NaN

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using BinaryCarver, hence all features are qualitative.

[25]:
features = qualitative_features + quantitative_features + ordinal_features

Number of features to select

The attribute n_best allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[26]:
n_best = 6  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

Using Selectors

[27]:
from AutoCarver.selectors import ClassificationSelector

# select the qualitative features most associated with the target
feature_selector = ClassificationSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
------
[Selector] Selecting from qualitative features: ['Sex', 'Siblings/Spouses Aboard', 'Fare', 'Age', 'Pclass', 'Parents/Children Aboard']
---

 - [Selector] Association between X and y
                         dtype    pct_nan  pct_mode  mode    chi2_statistic  tschuprowt_measure
Sex                      int64    0.0000   0.6364    0       169.2047        0.5337
Pclass                   int64    0.0000   0.5488    1       53.5114         0.3001
Fare                     float64  0.0000   0.5084    1.0000  65.8288         0.2799
Age                      float64  0.0000   0.9495    1.0000  14.6254         0.1569
Parents/Children Aboard  int64    0.0000   0.7374    0       11.0576         0.1364
Siblings/Spouses Aboard  int64    0.0000   0.6801    0       11.5963         0.1175

 - [Selector] Association between X and y, filtered for inter-feature assocation
                         dtype    pct_nan  pct_mode  mode    chi2_statistic  tschuprowt_measure
Sex                      int64    0.0000   0.6364    0       169.2047        0.5337
Pclass                   int64    0.0000   0.5488    1       53.5114         0.3001
Fare                     float64  0.0000   0.5084    1.0000  65.8288         0.2799
Age                      float64  0.0000   0.9495    1.0000  14.6254         0.1569
Parents/Children Aboard  int64    0.0000   0.7374    0       11.0576         0.1364
Siblings/Spouses Aboard  int64    0.0000   0.6801    0       11.5963         0.1175

 - [Selector] Selected qualitative features: ['Sex', 'Pclass', 'Fare', 'Age', 'Parents/Children Aboard', 'Siblings/Spouses Aboard']
------

  • Feature Sex is the most associated with the target Survived:

    • Tschuprow’s T value is tschuprowt_measure=0.5337

    • It has 0 % of NaNs (pct_nan=0.0)

    • Its mode, 0, represents 64 % of observed data (pct_mode=0.6364)

  • Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
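The pct_nan and pct_mode columns are simple descriptive statistics, and a minimal sketch shows how such values can be obtained. The helper column_statistics is hypothetical and not part of AutoCarver. The carved Sex column is rebuilt from the printed training frequencies (378 passengers in modality 0 and 216 in modality 1, out of 594), which is a reconstruction rather than data read from the notebook.

```python
from collections import Counter
from math import isnan

# carved Sex column of the training set, rebuilt from the printed frequencies
carved_sex = [0] * 378 + [1] * 216


def column_statistics(values):
    """Share of missing values, most frequent value and its share (hypothetical helper)."""
    n_obs = len(values)
    # NaN is the only missing-value marker considered in this sketch
    observed = [v for v in values if not (isinstance(v, float) and isnan(v))]
    mode, mode_count = Counter(observed).most_common(1)[0]
    return {
        "pct_nan": (n_obs - len(observed)) / n_obs,
        "pct_mode": mode_count / n_obs,
        "mode": mode,
    }


sex_statistics = column_statistics(carved_sex)
```

Rounded to four decimals, the resulting pct_mode matches the 0.6364 reported for Sex in the selector output.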

What’s next?

  • Thanks to Carvers all of your features are now optimally processed for your classification task!

  • As a final step towards your model, Selectors can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out Selectors Examples!

Well done!

Your commitment to achieving optimal results in binary classification tasks shines through in your meticulous use of AutoCarver’s BinaryCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The BinaryCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in binary classification tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.