Setting things up

About this notebook

This notebook shows how to preprocess the Titanic dataset with AutoCarver's BinaryCarver pipeline. BinaryCarver is a Python tool that discretizes features, whether quantitative or qualitative, so as to maximize their association with a binary target. Here the goal is to prepare the dataset for a binary classification task: predicting passenger survival.

The Titanic Dataset, derived from the iconic 1912 Titanic passenger information, provides a diverse set of features ranging from socio-economic status and age to cabin location. Leveraging BinaryCarver, we aim to perform association-maximizing discretization, refining both quantitative and qualitative features to create a finely tuned dataset for our binary classification endeavors.

Throughout this notebook, we’ll delve into the intricacies of BinaryCarver’s discretization pipeline, exploring its capabilities in handling a variety of data types. Whether it’s transforming passenger ages or classifying fares, BinaryCarver’s adaptability ensures that every feature is optimally represented for our classification tasks.

Join us in this exploration as we harness the power of BinaryCarver to preprocess the Titanic Dataset. Through effective feature engineering and discretization, we strive to create a dataset that not only captures the nuances of the Titanic passenger profiles but also sets the stage for the development of accurate and impactful binary classification models.

Let’s dive in and uncover the potential of BinaryCarver in transforming the Titanic Dataset for optimal predictive modeling.

Installation

[1]:

# %pip install AutoCarver[jupyter]

Titanic Data

In this example notebook, we will use the Titanic dataset.

The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.

The dataset includes various features such as passengers’ names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).

[2]:

import pandas as pd

# URL to the Titanic dataset (hosted in the Stanford CS109 course archive)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)

# Display the first few rows of the dataset
titanic_data.head()

[2]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Fare
0	0	3	Mr. Owen Harris Braund	male	22.0	1	7.2500
1	1	1	Mrs. John Bradley (Florence Briggs Thayer) Cum...	female	38.0	1	71.2833
2	1	3	Miss. Laina Heikkinen	female	26.0	0	7.9250
3	1	1	Mrs. Jacques Heath (Lily May Peel) Futrelle	female	35.0	1	53.1000
4	0	3	Mr. William Henry Allen	male	35.0	0	8.0500

Target type and Carver selection

[3]:

target = "Survived"

titanic_data[target].value_counts(dropna=False)

[3]:

Survived
0    545
1    342
Name: count, dtype: int64

The target "Survived" is a binary target of type int64 used in a classification task. Hence we will use AutoCarver.BinaryCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.

Data Sampling

[4]:

from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])

# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()

[4]:

(np.float64(0.38552188552188554), np.float64(0.3856655290102389))
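
To make the effect of stratify concrete, here is a minimal pure-Python sketch of a stratified split. It is a simplified stand-in, not scikit-learn's implementation: the helper stratified_split and its per-class rounding rule are assumptions made for illustration. Applied to the class counts shown earlier (545 and 342), it gives both samples nearly the same target rate.

```python
import random


def stratified_split(labels, test_size, seed=42):
    """Split row indices so that each class keeps roughly the same share in both samples.

    Hypothetical helper: the per-class rounding is an assumption,
    scikit-learn allocates rows slightly differently.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, dev_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_dev = round(len(idxs) * test_size)  # same proportion taken from every class
        dev_idx.extend(idxs[:n_dev])
        train_idx.extend(idxs[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


# class counts taken from the value_counts output of the notebook
labels = [0] * 545 + [1] * 342
train_idx, dev_idx = stratified_split(labels, test_size=0.33)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
dev_rate = sum(labels[i] for i in dev_idx) / len(dev_idx)
print(len(train_idx), len(dev_idx), round(train_rate, 4), round(dev_rate, 4))
```

Without stratification, a random split of this size could leave the two samples with noticeably different survival rates, which would blur the robustness checks performed later on the dev sample.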

Setting up Features to Carve

[5]:

train_set.head()

[5]:

	Survived	Pclass	Name	Sex	Age	Siblings/Spouses Aboard	Parents/Children Aboard	Fare
617	0	3	Mr. Antoni Yasbeck	male	27.0	1	0	14.4542
489	0	1	Mr. Harry Markland Molson	male	55.0	0	0	30.5000
871	1	3	Miss. Adele Kiamie Najib	female	15.0	0	0	7.2250
654	0	3	Mrs. John (Catherine) Bourke	female	32.0	1	1	15.5000
653	0	3	Mr. Alexander Radeff	male	27.0	0	0	7.8958

[6]:

# column data types
train_set.dtypes

[6]:

Survived                     int64
Pclass                       int64
Name                           str
Sex                            str
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object

[7]:

# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()

[7]:

Parents/Children Aboard
0    438
1     87
2     60
3      3
5      3
4      2
6      1
Name: count, dtype: int64

[8]:

# values taken by Pclass
train_set["Pclass"].value_counts()

[8]:

Pclass
3    326
1    142
2    126
Name: count, dtype: int64

The feature "Pclass" is of type int64, but it is better treated as a qualitative ordinal feature (socio-economic status) than as a quantitative discrete feature. Thus we will add it to ordinals, together with the ordering of its values (given as strings).

"Sex" is the only qualitative categorical feature; it is added to categoricals.

"Fare" is the only quantitative continuous feature, whilst "Age", "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. Those four features will be added to numericals.

[38]:

from AutoCarver import Features

# initiating Features to carve
features = Features(
    categoricals=["Sex"],
    numericals=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},  # user-specified ordering for ordinal features
)
features["Pclass"], features["Sex"], features["Age"]

[38]:

(Ordinal('Pclass'), Categorical('Sex'), Numerical('Age'))

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq sets the minimum frequency of each basic modality. It is used:

For quantitative features, to define the number of quantiles used for the initial discretization.
For qualitative features, to define the threshold under which a modality is grouped with either a default value or its closest modality.

[39]:

min_freq = 0.025

Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
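
The two roles of min_freq can be illustrated with a small pure-Python sketch. This is not AutoCarver's code: the helpers quantile_bounds and group_rare are hypothetical, the rule "number of quantiles = round(1 / min_freq)" is an assumption, and the toy data is made up. The library's own procedure is more involved, since it also keeps track of over-represented values.

```python
from collections import Counter
from statistics import quantiles


def quantile_bounds(values, min_freq):
    """Cut points for a quantile discretization (assumed rule: 1 / min_freq quantiles)."""
    n_quantiles = round(1 / min_freq)
    # duplicates are dropped so that heavily repeated values do not create empty bins
    return sorted(set(quantiles(values, n=n_quantiles, method="inclusive")))


def group_rare(values, min_freq, default="__OTHER__"):
    """Replace categories observed less often than min_freq by a default label."""
    counts = Counter(values)
    n_obs = len(values)
    return [v if counts[v] / n_obs >= min_freq else default for v in values]


# toy quantitative feature: min_freq=0.25 gives 4 quantiles, hence 3 cut points
toy_ages = list(range(1, 81))
bounds = quantile_bounds(toy_ages, min_freq=0.25)
print(bounds)

# toy qualitative feature: "c" covers 2 % of rows, below min_freq=0.025
toy_labels = ["a"] * 70 + ["b"] * 28 + ["c"] * 2
grouped = Counter(group_rare(toy_labels, min_freq=0.025))
print(grouped)
```

A lower min_freq produces more and smaller initial bins, which gives the carving step more candidate groupings to test but makes each bin's target rate noisier.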

Optional: Desired number of modalities

The attribute max_n_mod sets the maximum number of modalities per carved feature. Carvers use it as the upper limit on the number of modalities in each consecutive combination of modalities.

[40]:

max_n_mod = 5

Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
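
To give a feel for what "consecutive combinations of modalities" means and why max_n_mod affects run time, the sketch below enumerates them for a short ordered list and counts them for a longer one. The helper names are hypothetical, and the choice to require at least two groups is an assumption, not something the library documents here.

```python
from itertools import combinations
from math import comb


def consecutive_groupings(modalities, max_n_mod):
    """Yield every way to cut an ordered list into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # choosing n_groups - 1 cut positions defines one grouping
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


def n_groupings(n_modalities, max_n_mod):
    """Closed-form count of the groupings enumerated above."""
    top = min(max_n_mod, n_modalities)
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, top + 1))


for grouping in consecutive_groupings(["1", "2", "3"], max_n_mod=5):
    print(grouping)

# the count grows quickly with the number of raw modalities
print(n_groupings(6, 5), n_groupings(47, 5))
```

With three ordered modalities there are only three groupings to test, whereas a quantitative feature discretized into a few dozen raw intervals leads to a far larger search space, which is why both min_freq and max_n_mod influence fitting time.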

Optional: Grouping NaNs

The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-nan values, and then test out all possible combinations with nan.

[41]:

dropna = False  # anyway, there are no nan in this dataset

Fitting AutoCarver

First, all qualitative features are discretized:
1. Using StringDiscretizer to convert them to str if not already the case
2. For qualitative ordinal features: using OrdinalDiscretizer for under-represented values (less frequent than min_freq) to be grouped with its closest modality
3. For qualitative categorical features: using CategoricalDiscretizer for under-represented values (less frequent than min_freq) to be grouped with a default value (features.default="__OTHER__")
Second, all quantitative features are discretized:
1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)
2. Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2) to be grouped with its closest modality
Third, all features are carved following this recipe, for all classes of train_set[target] (except one):
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (Tschuprow’s T, by default) is computed with the provided train_set[target]
4. Combinations are sorted in descending order by association value
5. Testing robustness: finds the first combination that checks the following:
  - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)
  - Distinct target rates per consecutive modalities on train_set and dev_set
  - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run
7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
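
The carving recipe (steps 2 to 5) can be sketched in plain Python on the Pclass feature. This is an illustrative reimplementation under stated assumptions, not AutoCarver's code: all function names are hypothetical, Tschuprow's T is computed from a plain chi-square statistic without continuity correction, and the (count, survivors) pairs are reconstructed from the counts and target means that the fit prints for Pclass.

```python
from itertools import combinations
from math import sqrt


def tschuprow_t(groups):
    """Tschuprow's T between modalities and a binary target; groups = [(count, positives)]."""
    n_obs = sum(count for count, _ in groups)
    rate = sum(pos for _, pos in groups) / n_obs
    chi2 = 0.0
    for count, pos in groups:
        for observed, expected in ((pos, count * rate), (count - pos, count * (1 - rate))):
            chi2 += (observed - expected) ** 2 / expected
    # for an r x 2 table, (r - 1) * (c - 1) reduces to r - 1
    return sqrt(chi2 / (n_obs * sqrt(len(groups) - 1)))


def consecutive_groupings(modalities, max_n_mod):
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


def merge(stats, grouping):
    """Aggregate (count, positives) over each group of modalities."""
    return [(sum(stats[m][0] for m in g), sum(stats[m][1] for m in g)) for g in grouping]


def is_robust(train, dev, min_freq):
    for sample in (train, dev):
        n_obs = sum(count for count, _ in sample)
        if any(count / n_obs < min_freq / 2 for count, _ in sample):
            return False  # under-represented modality
    train_rates = [pos / count for count, pos in train]
    dev_rates = [pos / count for count, pos in dev]
    for rates in (train_rates, dev_rates):
        if any(a == b for a, b in zip(rates, rates[1:])):
            return False  # consecutive modalities with identical target rates
    order = lambda rates: sorted(range(len(rates)), key=rates.__getitem__)
    return order(train_rates) == order(dev_rates)  # no inversion between samples


# (count, survivors) per Pclass value, reconstructed from the printed raw distributions
train = {"1": (142, 88), "2": (126, 59), "3": (326, 82)}
dev = {"1": (74, 48), "2": (58, 28), "3": (161, 37)}

ranked = sorted(
    consecutive_groupings(["1", "2", "3"], max_n_mod=5),
    key=lambda g: tschuprow_t(merge(train, g)),
    reverse=True,
)
best = next(g for g in ranked if is_robust(merge(train, g), merge(dev, g), min_freq=0.025))
print(best)
```

On these counts the sketch selects the grouping that merges classes 1 and 2 and keeps class 3 apart, which agrees with the carved distribution reported for Pclass.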

[42]:

from AutoCarver import BinaryCarver
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# initiating AutoCarver
auto_carver = BinaryCarver(
    features=features,
    min_freq=min_freq,
    max_n_mod=max_n_mod,
    config=ProcessingConfig(dropna=dropna, verbose=True, copy=True, ordinal_encoding=True),
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

------
--- [QuantitativeDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
 - [ContinuousDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
 - [OrdinalDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
------

------
--- [QualitativeDiscretizer] Fit Features(['Sex', 'Pclass'])
 - [StringDiscretizer] Fit Features(['Pclass'])
 - [OrdinalDiscretizer] Fit Features(['Pclass'])
 - [CategoricalDiscretizer] Fit Features(['Sex'])
------

---------
------ [BinaryCarver] Fit Features(['Sex', 'Pclass', 'Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
--- [BinaryCarver] Fit Categorical('Sex') (1/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
male	0.1878	0.6364	378
female	0.7315	0.3636	216

X_dev distribution
	target_mean	frequency	count
male	0.1949	0.6655	195
female	0.7653	0.3345	98

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
male	0.1878	0.6364	378
female	0.7315	0.3636	216

X_dev distribution
	target_mean	frequency	count
male	0.1949	0.6655	195
female	0.7653	0.3345	98

--- [BinaryCarver] Fit Ordinal('Pclass') (2/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
1	0.6197	0.2391	142
2	0.4683	0.2121	126
3	0.2515	0.5488	326

X_dev distribution
	target_mean	frequency	count
1	0.6486	0.2526	74
2	0.4828	0.1980	58
3	0.2298	0.5495	161

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
1 to 2	0.5485	0.4512	268
3	0.2515	0.5488	326

X_dev distribution
	target_mean	frequency	count
1 to 2	0.5758	0.4505	132
3	0.2298	0.5495	161

--- [BinaryCarver] Fit Numerical('Age') (3/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.00e+00	0.8333	0.0202	12
1.00e+00 < x <= 2.00e+00	0.5000	0.0067	4
2.00e+00 < x <= 4.00e+00	0.7143	0.0236	14
4.00e+00 < x <= 6.00e+00	0.5714	0.0118	7
6.00e+00 < x <= 8.00e+00	0.2857	0.0118	7
8.00e+00 < x <= 1.00e+01	0.0000	0.0101	6
1.00e+01 < x <= 1.40e+01	0.3333	0.0152	9
1.40e+01 < x <= 1.60e+01	0.5000	0.0303	18
1.60e+01 < x <= 1.70e+01	0.3000	0.0168	10
1.70e+01 < x <= 1.80e+01	0.3333	0.0354	21
1.80e+01 < x <= 1.90e+01	0.3913	0.0387	23
1.90e+01 < x <= 2.05e+01	0.1111	0.0303	18
2.05e+01 < x <= 2.10e+01	0.1905	0.0354	21
2.10e+01 < x <= 2.20e+01	0.4242	0.0556	33
2.20e+01 < x <= 2.35e+01	0.4000	0.0168	10
2.35e+01 < x <= 2.40e+01	0.5417	0.0404	24
2.40e+01 < x <= 2.50e+01	0.1333	0.0253	15
2.50e+01 < x <= 2.60e+01	0.4000	0.0168	10
2.60e+01 < x <= 2.70e+01	0.5000	0.0337	20
2.70e+01 < x <= 2.85e+01	0.2500	0.0337	20
2.85e+01 < x <= 2.90e+01	0.4444	0.0303	18
2.90e+01 < x <= 3.05e+01	0.2692	0.0438	26
3.05e+01 < x <= 3.10e+01	0.4545	0.0185	11
3.10e+01 < x <= 3.25e+01	0.5000	0.0303	18
3.25e+01 < x <= 3.30e+01	0.3636	0.0185	11
3.30e+01 < x <= 3.45e+01	0.2857	0.0236	14
3.45e+01 < x <= 3.50e+01	0.5455	0.0185	11
3.50e+01 < x <= 3.60e+01	0.4375	0.0269	16
3.60e+01 < x <= 3.70e+01	0.2222	0.0152	9
3.70e+01 < x <= 3.80e+01	0.6250	0.0135	8
3.80e+01 < x <= 3.90e+01	0.3333	0.0202	12
3.90e+01 < x <= 4.00e+01	0.4615	0.0219	13
4.00e+01 < x <= 4.10e+01	0.3333	0.0101	6
4.10e+01 < x <= 4.20e+01	0.4615	0.0219	13
4.20e+01 < x <= 4.40e+01	0.3333	0.0202	12
4.40e+01 < x <= 4.50e+01	0.4545	0.0185	11
4.50e+01 < x <= 4.60e+01	0.2500	0.0067	4
4.60e+01 < x <= 4.70e+01	0.2500	0.0135	8
4.70e+01 < x <= 4.80e+01	0.7778	0.0152	9
4.80e+01 < x <= 5.00e+01	0.4545	0.0185	11
5.00e+01 < x <= 5.10e+01	0.2000	0.0084	5
5.10e+01 < x <= 5.40e+01	0.4000	0.0168	10
5.40e+01 < x <= 5.60e+01	0.3750	0.0135	8
5.60e+01 < x <= 5.80e+01	0.3333	0.0101	6
5.80e+01 < x <= 6.10e+01	0.0000	0.0118	7
6.10e+01 < x <= 6.50e+01	0.2857	0.0118	7
6.50e+01 < x	0.1250	0.0135	8

X_dev distribution
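	target_mean	frequency	count
x <= 1.00e+00	1.0000	0.0068	2
1.00e+00 < x <= 2.00e+00	0.2857	0.0239	7
2.00e+00 < x <= 4.00e+00	0.7500	0.0137	4
4.00e+00 < x <= 6.00e+00	1.0000	0.0068	2
6.00e+00 < x <= 8.00e+00	0.5000	0.0137	4
8.00e+00 < x <= 1.00e+01	0.5000	0.0137	4
1.00e+01 < x <= 1.40e+01	0.6667	0.0205	6
1.40e+01 < x <= 1.60e+01	0.2500	0.0273	8
1.60e+01 < x <= 1.70e+01	0.5000	0.0205	6
1.70e+01 < x <= 1.80e+01	0.4000	0.0512	15
1.80e+01 < x <= 1.90e+01	0.2000	0.0341	10
1.90e+01 < x <= 2.05e+01	0.3333	0.0205	6
2.05e+01 < x <= 2.10e+01	0.1538	0.0444	13
2.10e+01 < x <= 2.20e+01	0.1667	0.0205	6
2.20e+01 < x <= 2.35e+01	0.1875	0.0546	16
2.35e+01 < x <= 2.40e+01	0.5000	0.0341	10
2.40e+01 < x <= 2.50e+01	0.5000	0.0341	10
2.50e+01 < x <= 2.60e+01	0.2727	0.0375	11
2.60e+01 < x <= 2.70e+01	0.5000	0.0205	6
2.70e+01 < x <= 2.85e+01	0.2632	0.0648	19
2.85e+01 < x <= 2.90e+01	0.4286	0.0239	7
2.90e+01 < x <= 3.05e+01	0.3333	0.0307	9
3.05e+01 < x <= 3.10e+01	0.6250	0.0273	8
3.10e+01 < x <= 3.25e+01	0.4000	0.0171	5
3.25e+01 < x <= 3.30e+01	0.6667	0.0205	6
3.30e+01 < x <= 3.45e+01	0.7500	0.0137	4
3.45e+01 < x <= 3.50e+01	0.6000	0.0341	10
3.50e+01 < x <= 3.60e+01	0.5714	0.0239	7
3.60e+01 < x <= 3.70e+01	0.2500	0.0137	4
3.70e+01 < x <= 3.80e+01	0.2500	0.0137	4
3.80e+01 < x <= 3.90e+01	0.1667	0.0205	6
3.90e+01 < x <= 4.00e+01	0.2000	0.0171	5
4.00e+01 < x <= 4.10e+01	0.2000	0.0171	5
4.10e+01 < x <= 4.20e+01	0.5000	0.0137	4
4.20e+01 < x <= 4.40e+01	0.0000	0.0137	4
4.40e+01 < x <= 4.50e+01	0.3333	0.0102	3
4.50e+01 < x <= 4.60e+01	0.2500	0.0137	4
4.60e+01 < x <= 4.70e+01	0.0000	0.0068	2
4.70e+01 < x <= 4.80e+01	0.6667	0.0102	3
4.80e+01 < x <= 5.00e+01	0.5714	0.0239	7
5.00e+01 < x <= 5.10e+01	0.5000	0.0068	2
5.10e+01 < x <= 5.40e+01	0.5000	0.0205	6
5.40e+01 < x <= 5.60e+01	nan	0.0000	0
5.60e+01 < x <= 5.80e+01	0.5000	0.0068	2
5.80e+01 < x <= 6.10e+01	0.6667	0.0102	3
6.10e+01 < x <= 6.50e+01	0.3333	0.0205	6
6.50e+01 < x	0.0000	0.0068	2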

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 6.00e+00	0.7027	0.0623	37
6.00e+00 < x <= 5.80e+01	0.3738	0.9007	535
5.80e+01 < x	0.1364	0.0370	22

X_dev distribution
	target_mean	frequency	count
x <= 6.00e+00	0.6000	0.0512	15
6.00e+00 < x <= 5.80e+01	0.3745	0.9113	267
5.80e+01 < x	0.3636	0.0375	11

--- [BinaryCarver] Fit Numerical('Fare') (4/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 0.0000e+00	0.0000	0.0135	8
0.0000e+00 < x <= 6.8583e+00	0.0000	0.0135	8
6.8583e+00 < x <= 7.0500e+00	0.1111	0.0152	9
7.0500e+00 < x <= 7.2250e+00	0.2143	0.0236	14
7.2250e+00 < x <= 7.2292e+00	0.2727	0.0185	11
7.2292e+00 < x <= 7.2500e+00	0.0909	0.0185	11
7.2500e+00 < x <= 7.6500e+00	0.2222	0.0152	9
7.6500e+00 < x <= 7.7500e+00	0.3871	0.0522	31
7.7500e+00 < x <= 7.7750e+00	0.3333	0.0152	9
7.7750e+00 < x <= 7.8292e+00	0.4000	0.0084	5
7.8292e+00 < x <= 7.8542e+00	0.3000	0.0168	10
7.8542e+00 < x <= 7.8875e+00	0.8000	0.0084	5
7.8875e+00 < x <= 7.8958e+00	0.0000	0.0387	23
7.8958e+00 < x <= 8.0292e+00	0.5000	0.0269	16
8.0292e+00 < x <= 8.0500e+00	0.0968	0.0522	31
8.0500e+00 < x <= 8.4583e+00	0.0000	0.0051	3
8.4583e+00 < x <= 8.6625e+00	0.1111	0.0152	9
8.6625e+00 < x <= 9.3500e+00	0.2857	0.0118	7
9.3500e+00 < x <= 9.5000e+00	0.3333	0.0101	6
9.5000e+00 < x <= 1.0500e+01	0.3684	0.0320	19
1.0500e+01 < x <= 1.2287e+01	0.6250	0.0135	8
1.2287e+01 < x <= 1.3000e+01	0.4839	0.0522	31
1.3000e+01 < x <= 1.4000e+01	0.5000	0.0135	8
1.4000e+01 < x <= 1.4458e+01	0.1111	0.0152	9
1.4458e+01 < x <= 1.5100e+01	0.3333	0.0101	6
1.5100e+01 < x <= 1.5550e+01	0.1250	0.0135	8
1.5550e+01 < x <= 1.6100e+01	0.5556	0.0152	9
1.6100e+01 < x <= 1.8750e+01	0.6250	0.0135	8
1.8750e+01 < x <= 1.9500e+01	1.0000	0.0084	5
1.9500e+01 < x <= 2.1000e+01	0.1111	0.0152	9
2.1000e+01 < x <= 2.3000e+01	0.6250	0.0135	8
2.3000e+01 < x <= 2.4150e+01	0.3750	0.0135	8
2.4150e+01 < x <= 2.6000e+01	0.3214	0.0471	28
2.6000e+01 < x <= 2.6387e+01	0.8000	0.0168	10
2.6387e+01 < x <= 2.6550e+01	0.3333	0.0152	9
2.6550e+01 < x <= 2.7900e+01	0.2500	0.0202	12
2.7900e+01 < x <= 2.9000e+01	0.6667	0.0051	3
2.9000e+01 < x <= 3.0000e+01	0.4000	0.0168	10
3.0000e+01 < x <= 3.0696e+01	0.3333	0.0101	6
3.0696e+01 < x <= 3.1387e+01	0.3333	0.0152	9
3.1387e+01 < x <= 3.3500e+01	0.4000	0.0084	5
3.3500e+01 < x <= 3.7004e+01	0.2500	0.0135	8
3.7004e+01 < x <= 3.9600e+01	0.6250	0.0135	8
3.9600e+01 < x <= 4.1579e+01	0.2857	0.0118	7
4.1579e+01 < x <= 4.6900e+01	0.0000	0.0118	7
4.6900e+01 < x <= 5.1862e+01	0.2857	0.0118	7
5.1862e+01 < x <= 5.2554e+01	0.7143	0.0118	7
5.2554e+01 < x <= 5.6496e+01	0.7273	0.0185	11
5.6496e+01 < x <= 5.7979e+01	1.0000	0.0067	4
5.7979e+01 < x <= 6.9550e+01	0.3636	0.0185	11
6.9550e+01 < x <= 7.3500e+01	0.1667	0.0101	6
7.3500e+01 < x <= 7.7287e+01	0.6667	0.0101	6
7.7287e+01 < x <= 7.9650e+01	0.7500	0.0135	8
7.9650e+01 < x <= 8.3158e+01	0.8571	0.0118	7
8.3158e+01 < x <= 9.0000e+01	0.8571	0.0118	7
9.0000e+01 < x <= 1.1088e+02	0.7143	0.0118	7
1.1088e+02 < x <= 1.3365e+02	0.8571	0.0118	7
1.3365e+02 < x <= 1.5155e+02	0.8889	0.0152	9
1.5155e+02 < x <= 2.1134e+02	0.8571	0.0118	7
2.1134e+02 < x	0.5714	0.0118	7

X_dev distribution
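	target_mean	frequency	count
x <= 0.0000e+00	0.1429	0.0239	7
0.0000e+00 < x <= 6.8583e+00	0.0000	0.0068	2
6.8583e+00 < x <= 7.0500e+00	0.0000	0.0068	2
7.0500e+00 < x <= 7.2250e+00	0.2000	0.0171	5
7.2250e+00 < x <= 7.2292e+00	0.2500	0.0137	4
7.2292e+00 < x <= 7.2500e+00	0.0000	0.0068	2
7.2500e+00 < x <= 7.6500e+00	0.2000	0.0171	5
7.6500e+00 < x <= 7.7500e+00	0.2727	0.0375	11
7.7500e+00 < x <= 7.7750e+00	0.0000	0.0239	7
7.7750e+00 < x <= 7.8292e+00	0.4000	0.0171	5
7.8292e+00 < x <= 7.8542e+00	0.0000	0.0102	3
7.8542e+00 < x <= 7.8875e+00	0.0000	0.0034	1
7.8875e+00 < x <= 7.8958e+00	0.0769	0.0444	13
7.8958e+00 < x <= 8.0292e+00	0.3333	0.0102	3
8.0292e+00 < x <= 8.0500e+00	0.1667	0.0410	12
8.0500e+00 < x <= 8.4583e+00	0.2000	0.0171	5
8.4583e+00 < x <= 8.6625e+00	0.1667	0.0205	6
8.6625e+00 < x <= 9.3500e+00	0.0000	0.0102	3
9.3500e+00 < x <= 9.5000e+00	0.0000	0.0171	5
9.5000e+00 < x <= 1.0500e+01	0.2667	0.0512	15
1.0500e+01 < x <= 1.2287e+01	0.4000	0.0171	5
1.2287e+01 < x <= 1.3000e+01	0.3810	0.0717	21
1.3000e+01 < x <= 1.4000e+01	1.0000	0.0034	1
1.4000e+01 < x <= 1.4458e+01	0.0000	0.0137	4
1.4458e+01 < x <= 1.5100e+01	0.0000	0.0171	5
1.5100e+01 < x <= 1.5550e+01	1.0000	0.0171	5
1.5550e+01 < x <= 1.6100e+01	0.5000	0.0341	10
1.6100e+01 < x <= 1.8750e+01	0.3333	0.0102	3
1.8750e+01 < x <= 1.9500e+01	0.6667	0.0102	3
1.9500e+01 < x <= 2.1000e+01	0.6250	0.0273	8
2.1000e+01 < x <= 2.3000e+01	0.4000	0.0171	5
2.3000e+01 < x <= 2.4150e+01	0.1667	0.0205	6
2.4150e+01 < x <= 2.6000e+01	0.7273	0.0375	11
2.6000e+01 < x <= 2.6387e+01	1.0000	0.0034	1
2.6387e+01 < x <= 2.6550e+01	0.8333	0.0205	6
2.6550e+01 < x <= 2.7900e+01	0.2000	0.0171	5
2.7900e+01 < x <= 2.9000e+01	0.0000	0.0034	1
2.9000e+01 < x <= 3.0000e+01	0.5000	0.0137	4
3.0000e+01 < x <= 3.0696e+01	1.0000	0.0102	3
3.0696e+01 < x <= 3.1387e+01	0.4000	0.0171	5
3.1387e+01 < x <= 3.3500e+01	1.0000	0.0034	1
3.3500e+01 < x <= 3.7004e+01	0.4286	0.0239	7
3.7004e+01 < x <= 3.9600e+01	nan	0.0000	0
3.9600e+01 < x <= 4.1579e+01	0.0000	0.0102	3
4.1579e+01 < x <= 4.6900e+01	nan	0.0000	0
4.6900e+01 < x <= 5.1862e+01	1.0000	0.0068	2
5.1862e+01 < x <= 5.2554e+01	0.3333	0.0102	3
5.2554e+01 < x <= 5.6496e+01	0.6667	0.0205	6
5.6496e+01 < x <= 5.7979e+01	1.0000	0.0068	2
5.7979e+01 < x <= 6.9550e+01	0.4286	0.0239	7
6.9550e+01 < x <= 7.3500e+01	0.5000	0.0068	2
7.3500e+01 < x <= 7.7287e+01	1.0000	0.0034	1
7.7287e+01 < x <= 7.9650e+01	0.6667	0.0205	6
7.9650e+01 < x <= 8.3158e+01	1.0000	0.0034	1
8.3158e+01 < x <= 9.0000e+01	0.7500	0.0137	4
9.0000e+01 < x <= 1.1088e+02	0.8000	0.0171	5
1.1088e+02 < x <= 1.3365e+02	1.0000	0.0068	2
1.3365e+02 < x <= 1.5155e+02	0.0000	0.0068	2
1.5155e+02 < x <= 2.1134e+02	1.0000	0.0034	1
2.1134e+02 < x	0.7000	0.0341	10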

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 5.2e+01	0.3184	0.8249	490
5.2e+01 < x	0.7019	0.1751	104

X_dev distribution
	target_mean	frequency	count
x <= 5.2e+01	0.3278	0.8225	241
5.2e+01 < x	0.6538	0.1775	52

--- [BinaryCarver] Fit Numerical('Siblings/Spouses Aboard') (5/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 0.00e+00	0.3614	0.6801	404
0.00e+00 < x <= 1.00e+00	0.5000	0.2323	138
1.00e+00 < x <= 2.00e+00	0.5500	0.0337	20
2.00e+00 < x <= 3.00e+00	0.1111	0.0152	9
3.00e+00 < x <= 4.00e+00	0.1667	0.0202	12
4.00e+00 < x	0.0000	0.0185	11

X_dev distribution
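	target_mean	frequency	count
x <= 0.00e+00	0.3200	0.6826	200
0.00e+00 < x <= 1.00e+00	0.6056	0.2423	71
1.00e+00 < x <= 2.00e+00	0.2500	0.0273	8
2.00e+00 < x <= 3.00e+00	0.4286	0.0239	7
3.00e+00 < x <= 4.00e+00	0.1667	0.0205	6
4.00e+00 < x	0.0000	0.0034	1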

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 0.00e+00	0.3614	0.6801	404
0.00e+00 < x <= 2.00e+00	0.5063	0.2660	158
2.00e+00 < x	0.0938	0.0539	32

X_dev distribution
	target_mean	frequency	count
x <= 0.00e+00	0.3200	0.6826	200
0.00e+00 < x <= 2.00e+00	0.5696	0.2696	79
2.00e+00 < x	0.2857	0.0478	14

--- [BinaryCarver] Fit Numerical('Parents/Children Aboard') (6/6)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 0.00e+00	0.3447	0.7374	438
0.00e+00 < x <= 1.00e+00	0.5057	0.1465	87
1.00e+00 < x <= 2.00e+00	0.5167	0.1010	60
2.00e+00 < x	0.3333	0.0152	9

X_dev distribution
	target_mean	frequency	count
x <= 0.00e+00	0.3475	0.8055	236
0.00e+00 < x <= 1.00e+00	0.6774	0.1058	31
1.00e+00 < x <= 2.00e+00	0.4500	0.0683	20
2.00e+00 < x	0.1667	0.0205	6

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 0.0e+00	0.3447	0.7374	438
0.0e+00 < x	0.5000	0.2626	156

X_dev distribution
	target_mean	frequency	count
x <= 0.0e+00	0.3475	0.8055	236
0.0e+00 < x	0.5439	0.1945	57

AutoCarver analysis

Carving Summary

[43]:

auto_carver.summary

[43]:

feature	count	tschuprowt	n_mod	label	content	target_mean	frequency	dropped	dropped_reason
Categorical('Sex')	378.0	0.533719	2	0	male	0.187831	0.636364	False	None
Categorical('Sex')	216.0	0.533719	2	1	female	0.731481	0.363636	False	None
Ordinal('Pclass')	268.0	0.300144	2	0	[2, 1]	0.548507	0.451178	False	None
Ordinal('Pclass')	326.0	0.300144	2	1	3	0.251534	0.548822	False	None
Numerical('Age')	37.0	0.161045	3	0	x <= 6.00e+00	0.702703	0.062290	False	None
Numerical('Age')	535.0	0.161045	3	1	6.00e+00 < x <= 5.80e+01	0.373832	0.900673	False	None
Numerical('Age')	22.0	0.161045	3	2	5.80e+01 < x	0.136364	0.037037	False	None
Numerical('Fare')	490.0	0.294937	2	0	x <= 5.2e+01	0.318367	0.824916	False	None
Numerical('Fare')	104.0	0.294937	2	1	5.2e+01 < x	0.701923	0.175084	False	None
Numerical('Siblings/Spouses Aboard')	404.0	0.162663	3	0	x <= 0.00e+00	0.361386	0.680135	False	None
Numerical('Siblings/Spouses Aboard')	158.0	0.162663	3	1	0.00e+00 < x <= 2.00e+00	0.506329	0.265993	False	None
Numerical('Siblings/Spouses Aboard')	32.0	0.162663	3	2	2.00e+00 < x	0.093750	0.053872	False	None
Numerical('Parents/Children Aboard')	438.0	0.136439	2	0	x <= 0.0e+00	0.344749	0.737374	False	None
Numerical('Parents/Children Aboard')	156.0	0.136439	2	1	0.0e+00 < x	0.500000	0.262626	False	None

For quantitative feature Age, the selected combination of modalities groups ages as follows:
- modality 0: 6 years old or younger (content="x <= 6.00e+00")
- modality 1: older than 6 and up to 58 years old (content="6.00e+00 < x <= 5.80e+01")
- modality 2: older than 58 years old (content="5.80e+01 < x")
For qualitative categorical feature Sex, the selected combination of modalities has left content="male" in modality 0 and content="female" in modality 1 (with only two modalities, no grouping is possible).
For qualitative ordinal feature Pclass, the selected combination of modalities groups socio-economic status as follows:
- modality 0: upper and middle classes (content=[2, 1])
- modality 1: lower class (content=3).
- The user-provided ordering of modalities has been preserved.

Detailed overview of tested combinations

[44]:

features["Pclass"].history

[44]:

	info	cramerv	tschuprowt	combination	n_mod	dropna	train	viable	dev
0	Raw distribution	0.321044	0.269965	{'1': '1', '2': '2', '3': '3'}	3	False	NaN	NaN	NaN
1	Best for tschuprowt and max_n_mod=5	0.300144	0.300144	{'1': '1', '2': '1', '3': '3'}	2	False	{'viable': True, 'info': ''}	True	{'viable': True, 'info': ''}

The most associated combination (the first tested out, where info!="Raw distribution") groups Pclass==1 with Pclass==2 and leaves Pclass==3 as its own modality
For feature Pclass, the 1st combination passes the tests:
- viable=True
- info="Best for tschuprowt and max_n_mod=5"
- Tschuprow’s T with Survived is 0.300144 for this combination (by default, combinations are ranked according to this statistic)
- Following combinations (less associated with the target) were not tested: info="Not checked"
For all combinations dropna=False means that it is not a combination in which nans are being grouped with other modalities (as requested with dropna=False)
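
The cramerv and tschuprowt columns of the history can be checked by hand. The sketch below uses the standard definitions of Cramér's V and Tschuprow's T; the contingency tables are reconstructed from the counts and target means printed for Pclass, and applying Yates' continuity correction to the 2x2 table is an assumption that happens to reproduce the reported value, not a documented behaviour of the library.

```python
from math import sqrt


def chi2_statistic(table, yates=False):
    """Pearson chi-square of a contingency table given as a list of rows."""
    n_obs = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n_obs
            diff = abs(observed - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            chi2 += diff**2 / expected
    return chi2, n_obs


def cramers_v(table, yates=False):
    chi2, n_obs = chi2_statistic(table, yates)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n_obs * min(n_rows - 1, n_cols - 1)))


def tschuprows_t(table, yates=False):
    chi2, n_obs = chi2_statistic(table, yates)
    n_rows, n_cols = len(table), len(table[0])
    return sqrt(chi2 / (n_obs * sqrt((n_rows - 1) * (n_cols - 1))))


# rows: Pclass modalities, columns: (survived, did not survive) on train_set
raw = [[88, 54], [59, 67], [82, 244]]
carved = [[147, 121], [82, 244]]  # classes 1 and 2 merged

print(round(cramers_v(raw), 4), round(tschuprows_t(raw), 4))
print(round(tschuprows_t(carved, yates=True), 4), round(tschuprows_t(carved), 4))
```

For a table with two columns, T equals V divided by the fourth root of the number of rows minus one, so the two statistics coincide for two modalities and T penalizes combinations with more modalities. That is why the carved two-modality grouping can score a higher T than the raw three-modality distribution even though its V is lower.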

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[45]:

auto_carver.save("binary_carver.json")

Loading

Carvers can safely be loaded from a .json file.

[46]:

auto_carver = BinaryCarver.load("binary_carver.json")

Applying AutoCarver

[47]:

dev_set_processed = auto_carver.transform(dev_set)

[48]:

dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

[48]:

	Sex	Pclass	Age	Fare	Siblings/Spouses Aboard	Parents/Children Aboard
0.0	0.665529	0.450512	0.051195	0.822526	0.682594	0.805461
1.0	0.334471	0.549488	0.911263	0.177474	0.269625	0.194539
2.0	NaN	NaN	0.037543	NaN	0.047782	NaN
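
With ordinal_encoding=True, each carved feature is returned as an integer code per modality. For a quantitative feature, the mapping from raw value to code can be sketched with a binary search over the carved upper bounds. The helper ordinal_code is hypothetical; the bounds 6 and 58 are the ones reported for Age in the carving summary.

```python
from bisect import bisect_left

# upper bounds of the carved Age modalities: x <= 6, 6 < x <= 58, 58 < x
AGE_BOUNDS = [6.0, 58.0]


def ordinal_code(value, upper_bounds):
    """Index of the right-closed interval that contains value."""
    # bisect_left returns the first position whose bound is >= value,
    # so a value equal to a bound falls in the lower interval
    return bisect_left(upper_bounds, value)


ages = [27.0, 55.0, 15.0, 32.0, 6.0, 58.0, 71.0]
codes = [ordinal_code(age, AGE_BOUNDS) for age in ages]
print(codes)
```

Because the intervals are closed on the right, a passenger aged exactly 6 lands in modality 0 and one aged exactly 58 in modality 1.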

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using BinaryCarver, hence all features are qualitative.

Number of features to select

The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[49]:

n_best_per_type = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

Using Selectors

[50]:

from AutoCarver import ClassificationSelector
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set_processed, train_set_processed[target]).selected_features
best_features

 [ClassificationSelector] Selected Qualitative Features

	feature	Mode	TschuprowtMeasure	TschuprowtRank	TschuprowtFilter	TschuprowtWith
0	Categorical('Sex')	0.6364	0.5373	0.0000	0.0000	itself
1	Ordinal('Pclass')	0.5488	0.3036	1.0000	0.0988	Sex
3	Numerical('Fare')	0.8249	0.2995	2.0000	0.4057	Pclass
4	Numerical('Siblings/Spouses Aboard')	0.6801	0.1627	3.0000	0.2383	Pclass
2	Numerical('Age')	0.9007	0.1610	4.0000	0.2576	Siblings/Spouses Aboard
5	Numerical('Parents/Children Aboard')	0.7374	0.1404	5.0000	0.4257	Siblings/Spouses Aboard

[50]:

Features(['Sex', 'Pclass', 'Fare', 'Siblings/Spouses Aboard'])

[51]:

train_set_processed[best_features].head()

[51]:

	Sex	Pclass	Siblings/Spouses Aboard
617	0	1	1
489	0	0	0
871	1	1	0
654	1	1	1
653	0	1	0

Feature Sex is the most associated with the target Survived:
- Tschuprow’s T value is TschuprowtMeasure=0.5373
- It has 0 % of NaNs (NaNMeasure=0.0)
- Its mode represents 64 % of observed data (Mode=0.6364)
Feature Fare is strongly associated with feature Pclass:
- Tschuprow’s T value is TschuprowtFilter=0.4057 with TschuprowtWith=Pclass
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set).

Modeling

Fitting model on train data

[52]:

train_set_processed[best_features].apply(lambda u: u.value_counts())

[52]:

	Sex	Pclass	Fare	Siblings/Spouses Aboard
0.0	378.0	268.0	490.0	404
1.0	216.0	326.0	104.0	158
2.0	NaN	NaN	NaN	32

[53]:

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(train_set_processed[best_features], train_set_processed[target])

[53]:

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, ...)


Saving model

[54]:

model.save_model("binary_xgboost.json")

Prediction on dev dataset and performance

[55]:

from sklearn.metrics import roc_auc_score

dev_pred = model.predict_proba(dev_set_processed[best_features])[:, 1]
roc_auc_score(dev_set_processed[target], dev_pred)

[55]:

0.8343903638151425
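
As a reminder of what this score measures, ROC AUC is the probability that a randomly chosen positive receives a higher predicted score than a randomly chosen negative, with ties counted as one half. The pairwise sketch below is for illustration only: the function roc_auc and the toy data are made up, and its quadratic loop is not how scikit-learn computes the metric.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count for half a win
    return wins / (len(positives) * len(negatives))


# toy example: three of the four (positive, negative) pairs are ordered correctly
toy_target = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_target, toy_scores))
```

Ties are worth keeping in mind here: a model fitted on a handful of carved features with two or three modalities each can only output a limited number of distinct scores, so many pairs are tied.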

What’s next?

Thanks to Carvers all of your features are now optimally processed for your classification task!
As a final step towards your model, Selectors are handy tools for target-driven feature pre-selection, so make sure to check out the Selectors examples!

Well done!

Your commitment to achieving optimal results in binary classification tasks shines through in your meticulous use of AutoCarver’s BinaryCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.

The BinaryCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.

We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in binary classification tasks.

As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.