Setting things up
About this notebook
This notebook shows how to preprocess the Titanic dataset with AutoCarver's BinaryCarver pipeline. BinaryCarver is a Python tool that discretizes any type of feature, quantitative or qualitative, so as to maximize its association with a binary target. Here the target is passenger survival.
The Titanic dataset describes the passengers of the 1912 voyage with features such as socio-economic status (ticket class), age and fare. We use BinaryCarver to perform association-maximizing discretization of both the quantitative and the qualitative features, in preparation for a binary classification model.
The sections below walk through each step: declaring the features, configuring and fitting the carver, inspecting the carved distributions, selecting features and fitting a classifier on the result.
Installation
[1]:
# %pip install AutoCarver[jupyter]
Titanic Data
In this example notebook, we will use the Titanic dataset.
The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.
The dataset includes various features such as passengers’ names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).
[2]:
import pandas as pd
# URL to a copy of the Titanic dataset hosted by Stanford (CS109 course archive)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)
# Display the first few rows of the dataset
titanic_data.head()
[2]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
Target type and Carver selection
[3]:
target = "Survived"
titanic_data[target].value_counts(dropna=False)
[3]:
Survived
0 545
1 342
Name: count, dtype: int64
The target "Survived" is a binary target of type int64, so this is a binary classification task. Hence we will use AutoCarver.BinaryCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.
Data Sampling
[4]:
from sklearn.model_selection import train_test_split
# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[4]:
(np.float64(0.38552188552188554), np.float64(0.3856655290102389))
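The split sizes and target rates above can be checked by hand. The sketch below assumes scikit-learn's rounding of the split (test size rounded up, the remainder going to the training set). The number of survivors in the training set (229) is not printed anywhere in this notebook; it is an assumption inferred from the printed training rate.

```python
import math

# Class counts copied from the value_counts output above
n_total, n_survived = 545 + 342, 342
test_size = 0.33

n_dev = math.ceil(test_size * n_total)  # test size is rounded up
n_train = n_total - n_dev               # the remainder goes to the training set

# Assumption: 229 survivors in the training set (inferred from the printed rate)
n_train_survived = 229
n_dev_survived = n_survived - n_train_survived

print(n_train, n_dev)
print(round(n_survived / n_total, 4))  # overall target rate
print(round(n_train_survived / n_train, 4), round(n_dev_survived / n_dev, 4))
```

Stratifying on the target keeps both rates within a fraction of a percentage point of the overall rate, which is what makes the later train/dev comparisons of target rates meaningful.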
Setting up Features to Carve
[5]:
train_set.head()
[5]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 617 | 0 | 3 | Mr. Antoni Yasbeck | male | 27.0 | 1 | 0 | 14.4542 |
| 489 | 0 | 1 | Mr. Harry Markland Molson | male | 55.0 | 0 | 0 | 30.5000 |
| 871 | 1 | 3 | Miss. Adele Kiamie Najib | female | 15.0 | 0 | 0 | 7.2250 |
| 654 | 0 | 3 | Mrs. John (Catherine) Bourke | female | 32.0 | 1 | 1 | 15.5000 |
| 653 | 0 | 3 | Mr. Alexander Radeff | male | 27.0 | 0 | 0 | 7.8958 |
[6]:
# column data types
train_set.dtypes
[6]:
Survived int64
Pclass int64
Name object
Sex object
Age float64
Siblings/Spouses Aboard int64
Parents/Children Aboard int64
Fare float64
dtype: object
[7]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()
[7]:
Parents/Children Aboard
0 438
1 87
2 60
3 3
5 3
4 2
6 1
Name: count, dtype: int64
[8]:
# values taken by Pclass
train_set["Pclass"].value_counts()
[8]:
Pclass
3 326
1 142
2 126
Name: count, dtype: int64
The feature "Pclass" is of type int64, but it encodes socio-economic status, so it is better treated as a qualitative ordinal feature than as a quantitative discrete one. It is therefore passed to ordinals, together with the ordering of its values (given as strings).
"Sex" is the only qualitative categorical feature; it is passed to categoricals.
"Fare" is the only quantitative continuous feature, whilst "Age", "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. These four features are passed to quantitatives.
[9]:
from AutoCarver import Features
# initiating Features to carve
features = Features(
categoricals=["Sex"],
quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
ordinals={"Pclass": ["1", "2", "3"]}, # user-specified ordering for ordinal features
)
features["Pclass"], features["Sex"], features["Age"]
[9]:
(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))
Using AutoCarver
AutoCarver settings
Representativeness of modalities
The attribute min_freq sets the minimum frequency of each basic modality. It is used:
For quantitative features, to define the number of quantiles with which to initially discretize the features.
For qualitative features, to define the threshold under which a modality is grouped with either a default value or its closest modality.
[10]:
min_freq = 0.05
Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
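To make the threshold concrete, here is a minimal sketch (plain Python, not AutoCarver's own code) that applies min_freq to the "Parents/Children Aboard" counts printed earlier. It only flags the values that fall under the threshold; how AutoCarver then groups them is described in the fitting section.

```python
from collections import Counter

# Counts of "Parents/Children Aboard" in train_set, copied from the output above
counts = Counter({0: 438, 1: 87, 2: 60, 3: 3, 5: 3, 4: 2, 6: 1})
min_freq = 0.05

n_obs = sum(counts.values())
frequencies = {value: count / n_obs for value, count in counts.items()}

# Values rarer than min_freq are candidates for grouping with a neighbour
under_represented = sorted(v for v, f in frequencies.items() if f < min_freq)
print(under_represented)

# Frequency of the single modality obtained by grouping all of them together
grouped_freq = sum(frequencies[v] for v in under_represented)
print(round(grouped_freq, 4))
```

The values 3 to 6 are all flagged, and together they weigh about 1.5 % of the training set, which is the frequency reported for the "2 < x" modality in the raw distribution of this feature further down.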
Optional: Desired number of modalities
The attribute max_n_mod sets the maximum number of modalities per carved feature. Carvers use it as the upper limit on the number of modalities in each consecutive combination of modalities.
[11]:
max_n_mod = 5
Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
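The cost of a larger max_n_mod can be estimated by counting consecutive combinations. The sketch below is not taken from AutoCarver's source; the counting rule (cut n ordered modalities into 2 to max_n_mod consecutive groups) is an assumption, chosen because it reproduces the totals shown by the progress bars in the fitting log of this notebook.

```python
from math import comb


def n_consecutive_combinations(n_modalities: int, max_n_mod: int) -> int:
    """Number of ways to cut n ordered modalities into 2..max_n_mod consecutive groups."""
    upper = min(max_n_mod, n_modalities)
    # choosing k - 1 cut points among the n - 1 gaps yields k consecutive groups
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, upper + 1))


# Number of raw modalities per feature, read from the raw distributions in the log
for n in (2, 3, 4, 5, 32, 33):
    print(n, n_consecutive_combinations(n, max_n_mod=5))
```

With max_n_mod=5 this gives 1, 3, 7, 15, 36456 and 41448 combinations, matching the totals logged for Sex, Pclass, Parents/Children Aboard, Siblings/Spouses Aboard, Age and Fare respectively. The count grows quickly with both the number of raw modalities (driven by min_freq) and max_n_mod, which is why the tips above trade speed against precision.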
Optional: Grouping NaNs
The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-nan values, and then test out all possible combinations with nan.
[12]:
dropna = False # anyway, there are no nan in this dataset
Fitting AutoCarver
First, all qualitative features are discretized:
  Using StringDiscretizer to convert them to str if not already the case
  For qualitative ordinal features: using OrdinalDiscretizer for under-represented values (less frequent than min_freq) to be grouped with their closest modality
  For qualitative categorical features: using CategoricalDiscretizer for under-represented values (less frequent than min_freq) to be grouped with a default value (features.default="__OTHER__")
Second, all quantitative features are discretized:
  Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)
  Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2) to be grouped with their closest modality
Third, all features are carved following this recipe, for all classes of train_set[target] (except one):
  1. The raw distribution is printed out on provided train_set and dev_set. It's the output of the discretization step
  2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
  3. Computing associations: the association metric (Tschuprow's T, by default) is computed with the provided train_set[target]
  4. Combinations are sorted in descending order by association value
  5. Testing robustness: finds the first combination that checks the following:
     Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)
     Distinct target rates per consecutive modalities on train_set and dev_set
     No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
  6. (Optional) If requested via dropna=True, and if any, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run
  7. The carved distribution is printed out on provided train_set and dev_set. It's the output of the carving step
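The three robustness checks listed above can be sketched in a few lines. This is an illustration under stated assumptions, not AutoCarver's implementation: the function name is_robust and its input format are hypothetical, and the numbers are the train and dev distributions of "Siblings/Spouses Aboard" copied from the fitting log below.

```python
def is_robust(train, dev, min_freq):
    """train/dev: lists of (target_rate, frequency), one pair per modality, same order."""
    # 1. every modality must be frequent enough on both samples
    if any(freq < min_freq / 2 for _, freq in train + dev):
        return False
    # 2. consecutive modalities must have distinct target rates
    for sample in (train, dev):
        rates = [rate for rate, _ in sample]
        if any(a == b for a, b in zip(rates, rates[1:])):
            return False

    # 3. modalities must rank identically by target rate on both samples
    def ranking(sample):
        return sorted(range(len(sample)), key=lambda i: sample[i][0])

    return ranking(train) == ranking(dev)


# Raw distribution of "Siblings/Spouses Aboard" (5 modalities)
raw_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.5500, 0.0337), (0.1429, 0.0354), (0.0000, 0.0185)]
raw_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.2500, 0.0273), (0.3077, 0.0444), (0.0000, 0.0034)]

# Carved distribution of the same feature (3 modalities)
carved_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.2692, 0.0875)]
carved_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.2727, 0.0751)]

print(is_robust(raw_train, raw_dev, min_freq=0.05))        # False
print(is_robust(carved_train, carved_dev, min_freq=0.05))  # True
```

The raw distribution fails on the first check (its rarest modality is below min_freq/2), whereas the carved distribution has frequent enough modalities that rank in the same order on both samples.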
[13]:
from AutoCarver import BinaryCarver
# initiating AutoCarver
auto_carver = BinaryCarver(
features=features,
min_freq=min_freq,
dropna=dropna,
verbose=True, # showing statistics
copy=True, # whether or not to return a copy of the input dataset
)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
--- [QuantitativeDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
- [ContinuousDiscretizer] Fit Features(['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
- [OrdinalDiscretizer] Fit Features(['Age', 'Fare', 'Parents/Children Aboard'])
------
------
--- [QualitativeDiscretizer] Fit Features(['Sex', 'Pclass'])
- [StringDiscretizer] Fit Features(['Pclass'])
- [OrdinalDiscretizer] Fit Features(['Pclass'])
- [CategoricalDiscretizer] Fit Features(['Sex'])
------
---------
------ [BinaryCarver] Fit Features(['Sex', 'Pclass', 'Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard'])
--- [BinaryCarver] Fit Categorical('Sex') (1/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| male | 0.1878 | 0.6364 |
| female | 0.7315 | 0.3636 |
| target_mean | frequency |
|---|---|
| 0.1949 | 0.6655 |
| 0.7653 | 0.3345 |
Grouping modalities : 0%| | 0/1 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 1/1 [00:00<?, ?it/s]
Testing robustness : 0%| | 0/1 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| male | 0.1878 | 0.6364 |
| female | 0.7315 | 0.3636 |
| target_mean | frequency |
|---|---|
| 0.1949 | 0.6655 |
| 0.7653 | 0.3345 |
--- [BinaryCarver] Fit Ordinal('Pclass') (2/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| 1 | 0.6197 | 0.2391 |
| 2 | 0.4683 | 0.2121 |
| 3 | 0.2515 | 0.5488 |
| target_mean | frequency |
|---|---|
| 0.6486 | 0.2526 |
| 0.4828 | 0.1980 |
| 0.2298 | 0.5495 |
Grouping modalities : 67%|██████▋ | 2/3 [00:00<00:00, 1988.76it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 2981.03it/s]
Testing robustness : 0%| | 0/3 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| 1 to 2 | 0.5485 | 0.4512 |
| 3 | 0.2515 | 0.5488 |
| target_mean | frequency |
|---|---|
| 0.5758 | 0.4505 |
| 0.2298 | 0.5495 |
--- [BinaryCarver] Fit Quantitative('Age') (3/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| x <= 2.00e+00 | 0.7500 | 0.0269 |
| 2.00e+00 < x <= 4.00e+00 | 0.7143 | 0.0236 |
| 4.00e+00 < x <= 8.00e+00 | 0.4286 | 0.0236 |
| 8.00e+00 < x <= 1.40e+01 | 0.2000 | 0.0253 |
| 1.40e+01 < x <= 1.60e+01 | 0.5000 | 0.0303 |
| 1.60e+01 < x <= 1.80e+01 | 0.3226 | 0.0522 |
| 1.80e+01 < x <= 1.90e+01 | 0.3913 | 0.0387 |
| 1.90e+01 < x <= 2.05e+01 | 0.1111 | 0.0303 |
| 2.05e+01 < x <= 2.10e+01 | 0.1905 | 0.0354 |
| 2.10e+01 < x <= 2.20e+01 | 0.4242 | 0.0556 |
| 2.20e+01 < x <= 2.35e+01 | 0.4000 | 0.0168 |
| 2.35e+01 < x <= 2.40e+01 | 0.5417 | 0.0404 |
| 2.40e+01 < x <= 2.50e+01 | 0.1333 | 0.0253 |
| 2.50e+01 < x <= 2.70e+01 | 0.4667 | 0.0505 |
| 2.70e+01 < x <= 2.85e+01 | 0.2500 | 0.0337 |
| 2.85e+01 < x <= 2.90e+01 | 0.4444 | 0.0303 |
| 2.90e+01 < x <= 3.00e+01 | 0.2917 | 0.0404 |
| 3.00e+01 < x <= 3.10e+01 | 0.3846 | 0.0219 |
| 3.10e+01 < x <= 3.20e+01 | 0.5000 | 0.0269 |
| 3.20e+01 < x <= 3.30e+01 | 0.3846 | 0.0219 |
| 3.30e+01 < x <= 3.40e+01 | 0.3077 | 0.0219 |
| 3.40e+01 < x <= 3.60e+01 | 0.4643 | 0.0471 |
| 3.60e+01 < x <= 3.80e+01 | 0.4118 | 0.0286 |
| 3.80e+01 < x <= 4.10e+01 | 0.3871 | 0.0522 |
| 4.10e+01 < x <= 4.20e+01 | 0.4615 | 0.0219 |
| 4.20e+01 < x <= 4.50e+01 | 0.3913 | 0.0387 |
| 4.50e+01 < x <= 4.70e+01 | 0.2500 | 0.0202 |
| 4.70e+01 < x <= 4.90e+01 | 0.7143 | 0.0236 |
| 4.90e+01 < x <= 5.10e+01 | 0.2727 | 0.0185 |
| 5.10e+01 < x <= 5.60e+01 | 0.3889 | 0.0303 |
| 5.60e+01 < x <= 6.10e+01 | 0.1538 | 0.0219 |
| 6.10e+01 < x | 0.2000 | 0.0253 |
| target_mean | frequency |
|---|---|
| 0.4444 | 0.0307 |
| 0.7500 | 0.0137 |
| 0.6667 | 0.0205 |
| 0.6000 | 0.0341 |
| 0.2500 | 0.0273 |
| 0.4286 | 0.0717 |
| 0.2000 | 0.0341 |
| 0.3333 | 0.0205 |
| 0.1538 | 0.0444 |
| 0.1667 | 0.0205 |
| 0.1875 | 0.0546 |
| 0.5000 | 0.0341 |
| 0.5000 | 0.0341 |
| 0.3529 | 0.0580 |
| 0.2632 | 0.0648 |
| 0.4286 | 0.0239 |
| 0.3333 | 0.0307 |
| 0.6250 | 0.0273 |
| 0.4000 | 0.0171 |
| 0.6667 | 0.0205 |
| 0.7500 | 0.0137 |
| 0.5882 | 0.0580 |
| 0.2500 | 0.0273 |
| 0.1875 | 0.0546 |
| 0.5000 | 0.0137 |
| 0.1429 | 0.0239 |
| 0.1667 | 0.0205 |
| 0.5000 | 0.0205 |
| 0.6667 | 0.0205 |
| 0.5000 | 0.0205 |
| 0.6000 | 0.0171 |
| 0.2500 | 0.0273 |
Grouping modalities : 100%|█████████▉| 36455/36456 [00:04<00:00, 7444.66it/s]
Computing associations: 100%|██████████| 36456/36456 [00:10<00:00, 3595.62it/s]
Testing robustness : 1%| | 302/36456 [00:00<01:49, 328.89it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| x <= 8.0e+00 | 0.6364 | 0.0741 |
| 8.0e+00 < x | 0.3655 | 0.9259 |
| target_mean | frequency |
|---|---|
| 0.5789 | 0.0648 |
| 0.3723 | 0.9352 |
--- [BinaryCarver] Fit Quantitative('Fare') (4/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| x <= 6.858e+00 | 0.0000 | 0.0269 |
| 6.858e+00 < x <= 7.142e+00 | 0.1333 | 0.0253 |
| 7.142e+00 < x <= 7.229e+00 | 0.2632 | 0.0320 |
| 7.229e+00 < x <= 7.250e+00 | 0.0909 | 0.0185 |
| 7.250e+00 < x <= 7.750e+00 | 0.3500 | 0.0673 |
| 7.750e+00 < x <= 7.854e+00 | 0.3333 | 0.0404 |
| 7.854e+00 < x <= 7.896e+00 | 0.1429 | 0.0471 |
| 7.896e+00 < x <= 8.029e+00 | 0.5000 | 0.0269 |
| 8.029e+00 < x <= 8.050e+00 | 0.0968 | 0.0522 |
| 8.050e+00 < x <= 9.000e+00 | 0.1250 | 0.0269 |
| 9.000e+00 < x <= 9.842e+00 | 0.3571 | 0.0236 |
| 9.842e+00 < x <= 1.050e+01 | 0.3571 | 0.0236 |
| 1.050e+01 < x <= 1.300e+01 | 0.5128 | 0.0657 |
| 1.300e+01 < x <= 1.445e+01 | 0.3333 | 0.0253 |
| 1.445e+01 < x <= 1.550e+01 | 0.2000 | 0.0253 |
| 1.550e+01 < x <= 1.670e+01 | 0.5833 | 0.0202 |
| 1.670e+01 < x <= 2.025e+01 | 0.5714 | 0.0236 |
| 2.025e+01 < x <= 2.300e+01 | 0.4286 | 0.0236 |
| 2.300e+01 < x <= 2.600e+01 | 0.3333 | 0.0606 |
| 2.600e+01 < x <= 2.655e+01 | 0.5789 | 0.0320 |
| 2.655e+01 < x <= 2.790e+01 | 0.2500 | 0.0202 |
| 2.790e+01 < x <= 3.000e+01 | 0.4615 | 0.0219 |
| 3.000e+01 < x <= 3.139e+01 | 0.3333 | 0.0253 |
| 3.139e+01 < x <= 3.850e+01 | 0.2857 | 0.0236 |
| 3.850e+01 < x <= 4.240e+01 | 0.4667 | 0.0253 |
| 4.240e+01 < x <= 5.200e+01 | 0.2353 | 0.0286 |
| 5.200e+01 < x <= 5.650e+01 | 0.7857 | 0.0236 |
| 5.650e+01 < x <= 6.955e+01 | 0.5333 | 0.0253 |
| 6.955e+01 < x <= 7.729e+01 | 0.4167 | 0.0202 |
| 7.729e+01 < x <= 8.316e+01 | 0.8000 | 0.0253 |
| 8.316e+01 < x <= 1.109e+02 | 0.7857 | 0.0236 |
| 1.109e+02 < x <= 1.516e+02 | 0.8750 | 0.0269 |
| 1.516e+02 < x | 0.7143 | 0.0236 |
| target_mean | frequency |
|---|---|
| 0.1111 | 0.0307 |
| 0.0000 | 0.0102 |
| 0.2500 | 0.0273 |
| 0.0000 | 0.0068 |
| 0.2500 | 0.0546 |
| 0.1333 | 0.0512 |
| 0.0714 | 0.0478 |
| 0.3333 | 0.0102 |
| 0.1667 | 0.0410 |
| 0.1667 | 0.0410 |
| 0.0000 | 0.0273 |
| 0.2857 | 0.0478 |
| 0.3846 | 0.0887 |
| 0.2500 | 0.0137 |
| 0.4545 | 0.0375 |
| 0.5000 | 0.0341 |
| 0.4444 | 0.0307 |
| 0.6000 | 0.0341 |
| 0.5294 | 0.0580 |
| 0.8571 | 0.0239 |
| 0.2000 | 0.0171 |
| 0.4000 | 0.0171 |
| 0.6250 | 0.0273 |
| 0.5000 | 0.0273 |
| 0.0000 | 0.0102 |
| 0.6000 | 0.0171 |
| 0.6667 | 0.0205 |
| 0.5556 | 0.0307 |
| 0.6667 | 0.0102 |
| 0.7143 | 0.0239 |
| 0.7778 | 0.0307 |
| 0.5000 | 0.0137 |
| 0.7273 | 0.0375 |
Grouping modalities : 100%|█████████▉| 41447/41448 [00:06<00:00, 6897.10it/s]
Computing associations: 100%|██████████| 41448/41448 [00:09<00:00, 4172.63it/s]
Testing robustness : 0%| | 0/41448 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| x <= 5.2e+01 | 0.3198 | 0.8316 |
| 5.2e+01 < x | 0.7100 | 0.1684 |
| target_mean | frequency |
|---|---|
| 0.3279 | 0.8328 |
| 0.6735 | 0.1672 |
--- [BinaryCarver] Fit Quantitative('Siblings/Spouses Aboard') (5/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| x <= 0.00e+00 | 0.3614 | 0.6801 |
| 0.00e+00 < x <= 1.00e+00 | 0.5000 | 0.2323 |
| 1.00e+00 < x <= 2.00e+00 | 0.5500 | 0.0337 |
| 2.00e+00 < x <= 4.00e+00 | 0.1429 | 0.0354 |
| 4.00e+00 < x | 0.0000 | 0.0185 |
| target_mean | frequency |
|---|---|
| 0.3200 | 0.6826 |
| 0.6056 | 0.2423 |
| 0.2500 | 0.0273 |
| 0.3077 | 0.0444 |
| 0.0000 | 0.0034 |
Grouping modalities : 93%|█████████▎| 14/15 [00:00<00:00, 4605.87it/s]
Computing associations: 100%|██████████| 15/15 [00:00<00:00, 1820.44it/s]
Testing robustness : 67%|██████▋ | 10/15 [00:00<00:00, 322.91it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| x <= 0.00e+00 | 0.3614 | 0.6801 |
| 0.00e+00 < x <= 1.00e+00 | 0.5000 | 0.2323 |
| 1.00e+00 < x | 0.2692 | 0.0875 |
| target_mean | frequency |
|---|---|
| 0.3200 | 0.6826 |
| 0.6056 | 0.2423 |
| 0.2727 | 0.0751 |
--- [BinaryCarver] Fit Quantitative('Parents/Children Aboard') (6/6)
[BinaryCarver] Raw distribution
| | target_mean | frequency |
|---|---|---|
| x <= 0.00e+00 | 0.3447 | 0.7374 |
| 0.00e+00 < x <= 1.00e+00 | 0.5057 | 0.1465 |
| 1.00e+00 < x <= 2.00e+00 | 0.5167 | 0.1010 |
| 2.00e+00 < x | 0.3333 | 0.0152 |
| target_mean | frequency |
|---|---|
| 0.3475 | 0.8055 |
| 0.6774 | 0.1058 |
| 0.4500 | 0.0683 |
| 0.1667 | 0.0205 |
Grouping modalities : 86%|████████▌ | 6/7 [00:00<00:00, 604.67it/s]
Computing associations: 100%|██████████| 7/7 [00:00<00:00, 932.45it/s]
Testing robustness : 0%| | 0/7 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_mean | frequency |
|---|---|---|
| x <= 0.0e+00 | 0.3447 | 0.7374 |
| 0.0e+00 < x | 0.5000 | 0.2626 |
| target_mean | frequency |
|---|---|
| 0.3475 | 0.8055 |
| 0.5439 | 0.1945 |
AutoCarver analysis
Carving Summary
[14]:
auto_carver.summary
[14]:
| feature | target_mean | cramerv | tschuprowt | n_mod | label | content | frequency |
|---|---|---|---|---|---|---|---|
| Categorical('Sex') | 0.187831 | 0.533719 | 0.533719 | 2 | 0 | male | 0.636364 |
| | 0.731481 | 0.533719 | 0.533719 | 2 | 1 | female | 0.363636 |
| Ordinal('Pclass') | 0.548507 | 0.300144 | 0.300144 | 2 | 0 | [2, 1] | 0.451178 |
| | 0.251534 | 0.300144 | 0.300144 | 2 | 1 | 3 | 0.548822 |
| Quantitative('Age') | 0.636364 | 0.139166 | 0.139166 | 2 | 0 | x <= 8.0e+00 | 0.074074 |
| | 0.365455 | 0.139166 | 0.139166 | 2 | 1 | 8.0e+00 < x | 0.925926 |
| Quantitative('Fare') | 0.319838 | 0.295325 | 0.295325 | 2 | 0 | x <= 5.2e+01 | 0.831650 |
| | 0.710000 | 0.295325 | 0.295325 | 2 | 1 | 5.2e+01 < x | 0.168350 |
| Quantitative('Siblings/Spouses Aboard') | 0.361386 | 0.139722 | 0.117492 | 3 | 0 | x <= 0.00e+00 | 0.680135 |
| | 0.500000 | 0.139722 | 0.117492 | 3 | 1 | 0.00e+00 < x <= 1.00e+00 | 0.232323 |
| | 0.269231 | 0.139722 | 0.117492 | 3 | 2 | 1.00e+00 < x | 0.087542 |
| Quantitative('Parents/Children Aboard') | 0.344749 | 0.136439 | 0.136439 | 2 | 0 | x <= 0.0e+00 | 0.737374 |
| | 0.500000 | 0.136439 | 0.136439 | 2 | 1 | 0.0e+00 < x | 0.262626 |
For quantitative feature Age, the selected combination of modalities groups ages as follows:
  modality 0: lower or equal to 8 years old (content="x <= 8.0e+00")
  modality 1: ages higher than 8 years old (content="8.0e+00 < x")
For qualitative categorical feature Sex, the selected combination of modalities has left content="male" in modality 0 and content="female" in modality 1 (no combination possible)
For qualitative ordinal feature Pclass, the selected combination of modalities groups socio-economic status as follows:
  modality 0: upper and middle classes (content=[2, 1])
  modality 1: lower class (content=3)
  The user-provided ordering of modalities has been preserved.
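In the summary, cramerv and tschuprowt are equal for every feature carved into 2 modalities and differ only for "Siblings/Spouses Aboard" (3 modalities). This follows from the textbook definitions of the two statistics, which differ only in their normalization. The helper below is a hypothetical illustration of that relation, not part of AutoCarver.

```python
def tschuprowt_from_cramerv(cramerv: float, n_mod: int, n_classes: int = 2) -> float:
    """Convert Cramér's V into Tschuprow's T for an n_mod x n_classes contingency table."""
    rows, cols = n_mod - 1, n_classes - 1
    # V^2 = chi2 / (n * min(rows, cols)) whereas T^2 = chi2 / (n * sqrt(rows * cols))
    return cramerv * (min(rows, cols) / (rows * cols) ** 0.5) ** 0.5


# Values of cramerv copied from the summary above
print(round(tschuprowt_from_cramerv(0.139722, n_mod=3), 6))  # Siblings/Spouses Aboard
print(round(tschuprowt_from_cramerv(0.533719, n_mod=2), 6))  # Sex
```

For a binary target, T equals V divided by the fourth root of (n_mod - 1), so Tschuprow's T penalizes combinations with more modalities, whereas Cramér's V does not.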
Detailed overview of tested combinations
[15]:
features["Pclass"].history
[15]:
| | info | cramerv | tschuprowt | combination | n_mod | dropna | train | viable | dev |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Raw distribution | 0.321044 | 0.269965 | {'1': '1', '2': '2', '3': '3'} | 3 | False | NaN | NaN | NaN |
| 1 | Best for tschuprowt and max_n_mod=5 | 0.300144 | 0.300144 | {'1': '1', '2': '1', '3': '3'} | 2 | False | {'viable': True, 'info': ''} | True | {'viable': True, 'info': ''} |
| 2 | Not checked | 0.321044 | 0.269965 | {'1': '1', '2': '2', '3': '3'} | 3 | False | NaN | NaN | NaN |
| 3 | Not checked | 0.265643 | 0.265643 | {'1': '1', '2': '2', '3': '2'} | 2 | False | NaN | NaN | NaN |
The most associated combination (the first tested out, where info != "Raw distribution") groups Pclass==1 with Pclass==2 and leaves Pclass==3 as its own modality
For feature Pclass, the 1st combination passes the tests: viable=True, info="Best for tschuprowt and max_n_mod=5"
Tschuprow's T with Survived is 0.300144 for this combination (by default, combinations are ranked according to this statistic)
The following combinations (less associated with the target) were not tested: info="Not checked"
For all combinations, dropna=False means that it is not a combination in which nans are being grouped with other modalities (as requested with dropna=False)
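The tschuprowt column of this history can be reproduced from the training counts. The sketch below is written from the textbook definition of Tschuprow's T, not from AutoCarver's source. Two points are assumptions: the survivor counts per class are reconstructed by rounding target_mean times the class counts printed earlier, and Yates' continuity correction is applied on 2x2 tables (the default behaviour of scipy.stats.chi2_contingency), because this is what matches the reported values.

```python
from math import sqrt


def tschuprowt(table):
    """Tschuprow's T of a contingency table given as rows of [survived, not survived]."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    dof = (len(table) - 1) * (len(col_totals) - 1)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            diff = abs(observed - expected)
            if dof == 1:  # assumption: Yates' continuity correction on 2x2 tables
                diff = max(diff - 0.5, 0.0)
            chi2 += diff**2 / expected
    return sqrt(chi2 / (n * sqrt(dof)))


# [survived, not survived] per Pclass on train_set (reconstructed, see lead-in)
pclass = {"1": [88, 54], "2": [59, 67], "3": [82, 244]}


def merge(*keys):
    """Sum the rows of several classes into a single grouped modality."""
    return [sum(values) for values in zip(*(pclass[k] for k in keys))]


raw = [pclass["1"], pclass["2"], pclass["3"]]
upper_middle_vs_lower = [merge("1", "2"), pclass["3"]]
upper_vs_middle_lower = [pclass["1"], merge("2", "3")]

for name, table in [("raw", raw), ("1+2 | 3", upper_middle_vs_lower), ("1 | 2+3", upper_vs_middle_lower)]:
    print(name, round(tschuprowt(table), 4))
```

The three values agree with rows 0, 1 and 3 of the history to about four decimals, and the ranking explains why grouping classes 1 and 2 was tested first.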
Saving and Loading AutoCarver
Saving
All Carvers can safely be stored as a .json file.
[16]:
auto_carver.save("binary_carver.json")
Loading
Carvers can safely be loaded from a .json file.
[17]:
auto_carver = BinaryCarver.load("binary_carver.json")
Applying AutoCarver
[18]:
dev_set_processed = auto_carver.transform(dev_set)
[19]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[19]:
| | Sex | Pclass | Age | Fare | Siblings/Spouses Aboard | Parents/Children Aboard |
|---|---|---|---|---|---|---|
| 0.0 | 0.665529 | 0.450512 | 0.064846 | 0.832765 | 0.682594 | 0.805461 |
| 1.0 | 0.334471 | 0.549488 | 0.935154 | 0.167235 | 0.242321 | 0.194539 |
| 2.0 | NaN | NaN | NaN | NaN | 0.075085 | NaN |
Feature Selection
Selectors settings
Features to select from
Here all features have been carved using BinaryCarver, hence all features are qualitative.
Number of features to select
The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).
[20]:
n_best_per_type = 4 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics
Using Selectors
[21]:
from AutoCarver import ClassificationSelector
# select the most target associated qualitative features
feature_selector = ClassificationSelector(
features=features,
n_best_per_type=n_best_per_type,
verbose=True, # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
best_features
[ClassificationSelector] Selected Qualitative Features
| | feature | Nan | Mode | TschuprowtMeasure | TschuprowtRank | TschuprowtFilter | TschuprowtWith |
|---|---|---|---|---|---|---|---|
| 0 | Categorical('Sex') | 0.0000 | 0.6364 | 0.5337 | 0.0000 | 0.0000 | itself |
| 1 | Ordinal('Pclass') | 0.0000 | 0.5488 | 0.3001 | 1.0000 | 0.0988 | Sex |
| 3 | Quantitative('Fare') | 0.0000 | 0.8316 | 0.2953 | 2.0000 | 0.3922 | Pclass |
| 2 | Quantitative('Age') | 0.0000 | 0.9259 | 0.1392 | 3.0000 | 0.1002 | Sex |
| 5 | Quantitative('Parents/Children Aboard') | 0.0000 | 0.7374 | 0.1364 | 4.0000 | 0.4666 | Age |
| 4 | Quantitative('Siblings/Spouses Aboard') | 0.0000 | 0.6801 | 0.1175 | 5.0000 | 0.4060 | Parents/Children Aboard |
[21]:
Features(['Sex', 'Pclass', 'Fare', 'Age'])
[22]:
train_set_processed[best_features].head()
[22]:
| | Sex | Pclass | Fare | Age |
|---|---|---|---|---|
| 617 | 0 | 1 | 0.0 | 1.0 |
| 489 | 0 | 0 | 0.0 | 1.0 |
| 871 | 1 | 1 | 0.0 | 1.0 |
| 654 | 1 | 1 | 0.0 | 1.0 |
| 653 | 0 | 1 | 0.0 | 1.0 |
Feature Sex is the most associated with the target Survived:
  Tschuprow's T value is TschuprowtMeasure=0.5337
  It has 0 % of NaNs (Nan=0.0)
  Its mode represents 64 % of observed data (Mode=0.6364)
Feature Fare is strongly associated with feature Pclass:
  Tschuprow's T value is TschuprowtFilter=0.3922 with TschuprowtWith=Pclass
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
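The ranking logic behind this table can be sketched as follows. This is a simplified illustration rather than ClassificationSelector's actual algorithm: the select function and its max_inter_association threshold are hypothetical, and a real filter would presumably recompute inter-feature associations after each removal instead of reusing the fixed TschuprowtFilter column.

```python
# (feature, TschuprowtMeasure, TschuprowtFilter), copied from the table above
measures = [
    ("Sex", 0.5337, 0.0000),
    ("Pclass", 0.3001, 0.0988),
    ("Fare", 0.2953, 0.3922),
    ("Age", 0.1392, 0.1002),
    ("Parents/Children Aboard", 0.1364, 0.4666),
    ("Siblings/Spouses Aboard", 0.1175, 0.4060),
]


def select(measures, n_best, max_inter_association=1.0):
    """Keep the n_best features most associated with the target, skipping those
    too associated with a better-ranked feature (hypothetical threshold)."""
    ranked = sorted(measures, key=lambda m: m[1], reverse=True)
    kept = [name for name, _, inter in ranked if inter <= max_inter_association]
    return kept[:n_best]


print(select(measures, n_best=4))                              # no filtering
print(select(measures, n_best=4, max_inter_association=0.35))  # hypothetical threshold
```

Without a threshold, the four best-ranked features are Sex, Pclass, Fare and Age, as returned by the selector above. With the hypothetical threshold of 0.35, Fare would be skipped because of its association with Pclass, and only three features would remain.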
Modeling
Fitting model on train data
[23]:
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(train_set_processed[best_features], train_set_processed[target])
[23]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, random_state=None, ...)
Saving model
[24]:
model.save_model("binary_xgboost.json")
Prediction on dev dataset and performance
[26]:
from sklearn.metrics import roc_auc_score
dev_pred = model.predict_proba(dev_set_processed[best_features])[:, 1]
roc_auc_score(dev_set_processed[target], dev_pred)
[26]:
np.float64(0.8548426745329402)
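As a reminder of what this score measures, the ROC AUC is the probability that a randomly chosen survivor receives a higher predicted score than a randomly chosen non-survivor. The sketch below computes it by brute force on a four-row toy sample; the labels and scores are made up for illustration and are unrelated to the Titanic data.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels and scores (hypothetical): 3 of the 4 positive/negative pairs are ordered correctly
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Read this way, the score of about 0.85 obtained above means that the model ranks a survivor above a non-survivor in roughly 85 % of such pairs on the dev set.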
What’s next?
Thanks to Carvers all of your features are now optimally processed for your classification task!
As a final step towards your model, Selectors can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out Selectors Examples!
Well done!
In this notebook, BinaryCarver discretized the Titanic features into a small number of robust modalities, ClassificationSelector ranked them by their association with the target, and a classifier fitted on the carved features reached a ROC AUC of about 0.85 on the dev set.
Thank you for using AutoCarver. We wish you continued success with your binary classification tasks.