Setting things up

About this notebook

In this notebook, we use ClassificationSelector to quickly identify and select the most relevant features of the Titanic dataset for a classification task. No preprocessing with BinaryCarver is involved: the goal is to streamline feature selection in order to improve the efficiency and accuracy of the classification models built afterwards.

The Titanic dataset, derived from the historic 1912 Titanic passenger records, contains a variety of features such as socio-economic status (ticket class), age, and fare. Using ClassificationSelector, we aim to identify the features most relevant to predicting survival, so that the dataset is well suited to binary classification.

Throughout this notebook, we will explore the capabilities of ClassificationSelector in evaluating and selecting features. By focusing on feature importance and relevance, we aim to build a robust dataset that enhances the performance of our classification models without the need for extensive preprocessing.

Let’s dive in and see how ClassificationSelector can refine the Titanic dataset for binary classification.

Installation

[1]:
# %pip install AutoCarver[jupyter]

Titanic Data

In this example notebook, we will use the Titanic dataset.

The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.

The version used here includes features such as passengers’ names, ages, sex, ticket class, number of relatives aboard, fare, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).

[2]:
import pandas as pd

# URL to the Titanic dataset (hosted by Stanford's CS109 course archive)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)

# Display the first few rows of the dataset
titanic_data.head()
[2]:
   Survived  Pclass  Name                                               Sex      Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare
0         0       3  Mr. Owen Harris Braund                             male    22.0                        1                        0   7.2500
1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...  female  38.0                        1                        0  71.2833
2         1       3  Miss. Laina Heikkinen                              female  26.0                        0                        0   7.9250
3         1       1  Mrs. Jacques Heath (Lily May Peel) Futrelle        female  35.0                        1                        0  53.1000
4         0       3  Mr. William Henry Allen                            male    35.0                        0                        0   8.0500

Target type and Selector selection

[3]:
target = "Survived"

titanic_data[target].value_counts(dropna=False)
[3]:
Survived
0    545
1    342
Name: count, dtype: int64

The target "Survived" is a binary target of type int64 used in a classification task. Hence we will use AutoCarver.selectors.ClassificationSelector in the following code blocks.
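As a quick sanity check on the counts above, the share of the positive class can be computed by hand. This is a minimal sketch that only reuses the two counts printed by the previous cell; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above
counts = {0: 545, 1: 342}

n_rows = sum(counts.values())       # total number of passengers
survival_rate = counts[1] / n_rows  # share of the positive class (Survived == 1)

print(n_rows)                   # 887
print(round(survival_rate, 4))  # 0.3856
```

A rate of roughly 39% means the classes are moderately imbalanced, which is why the sampling step below is stratified by the target.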

Data Sampling

[4]:
from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])

# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[4]:
(np.float64(0.38552188552188554), np.float64(0.3856655290102389))
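The two rates above are nearly identical because the split is stratified. To illustrate the idea, here is a standard-library sketch that splits each class separately and then merges the pieces. It is not scikit-learn's implementation; the helper name `stratified_split` and the synthetic labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx) with roughly the same class mix on both sides."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # share of this class sent to the test side
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels with the same class counts as the Titanic target
labels = [0] * 545 + [1] * 342
train_idx, dev_idx = stratified_split(labels, test_size=0.33)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
dev_rate = sum(labels[i] for i in dev_idx) / len(dev_idx)
print(len(train_idx), len(dev_idx))              # 594 293
print(round(train_rate, 3), round(dev_rate, 3))  # 0.386 0.386
```

Because each class is split in the same proportion, the target rate can only differ between the two sets by rounding effects.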

Setting up Features to select

[5]:
train_set.head()
[5]:
     Survived  Pclass  Name                          Sex      Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare
617         0       3  Mr. Antoni Yasbeck            male    27.0                        1                        0  14.4542
489         0       1  Mr. Harry Markland Molson     male    55.0                        0                        0  30.5000
871         1       3  Miss. Adele Kiamie Najib      female  15.0                        0                        0   7.2250
654         0       3  Mrs. John (Catherine) Bourke  female  32.0                        1                        1  15.5000
653         0       3  Mr. Alexander Radeff          male    27.0                        0                        0   7.8958
[6]:
# column data types
train_set.dtypes
[6]:
Survived                     int64
Pclass                       int64
Name                        object
Sex                         object
Age                        float64
Siblings/Spouses Aboard      int64
Parents/Children Aboard      int64
Fare                       float64
dtype: object
[7]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()
[7]:
Parents/Children Aboard
0    438
1     87
2     60
3      3
5      3
4      2
6      1
Name: count, dtype: int64
[8]:
# values taken by Pclass
train_set["Pclass"].value_counts()
[8]:
Pclass
3    326
1    142
2    126
Name: count, dtype: int64

The feature "Pclass" is of type "int64", but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature (socio-economic status). Thus we will add it to ordinals and specify the ordering of its values (as strings).

"Sex" is the only qualitative categorical feature; it is added to categoricals.

"Fare" is the only quantitative continuous feature, whilst "Age", "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. Those four features are added to quantitatives.

[9]:
from AutoCarver import Features

# initiating Features to carve
features = Features(
    categoricals=["Sex"],
    quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},  # user-specified ordering for ordinal features
)
features["Pclass"], features["Sex"], features["Age"]
[9]:
(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))
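To make the notion of a user-specified ordering concrete, here is a small sketch, independent of AutoCarver, that maps raw Pclass values to their rank in the declared order. The helper name `ordinal_rank` is hypothetical, and the explicit cast to string is an assumption: this notebook does not show how the library handles the int64 column internally.

```python
# Ordering declared for "Pclass" in the Features cell above (string labels)
pclass_order = ["1", "2", "3"]


def ordinal_rank(value, order):
    """Rank of a raw value within a user-specified ordering (0 = first)."""
    return order.index(str(value))  # cast so that the integer 3 matches the label "3"


raw_values = [3, 1, 3, 2]  # illustrative int64-style values, not actual rows
ranks = [ordinal_rank(v, pclass_order) for v in raw_values]
print(ranks)  # [2, 0, 2, 1]
```

The point of an ordinal feature is that only this rank carries meaning: first class precedes second class, but the numeric distance between classes is not interpreted.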

Feature Selection

Selectors settings

Number of features to select

The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[24]:
n_best_per_type = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

Optional: Setting association measure between X and y

Make sure to check out available association measures!

Let’s say one wants to:

  • Use Cramér’s V as the association measure between each QualitativeFeature and the binary target Survived (with at least 30% association)

  • Use the coefficient of determination as the association measure between each QuantitativeFeature and the binary target Survived (with at least 7% association)

  • Remove features that have more than 30% of missing values

  • Remove features that have more than 30% of outliers according to their Z-score

[25]:
from AutoCarver.selectors import CramervMeasure, RMeasure, ZscoreOutlierMeasure, NanMeasure

# adding Nan measure for all features with a threshold at 30% of missing values
measures = [NanMeasure(threshold=0.3)]

# adding Z-score outlier measure for quantitative features with a threshold at 30% of outliers
measures.append(ZscoreOutlierMeasure(threshold=0.3))

# adding Cramér's V measure for categorical features with a threshold at 30% association
measures.append(CramervMeasure(threshold=0.3))

# adding R measure for quantitative features with a threshold at 7% association
measures.append(RMeasure(threshold=0.07))
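To give an intuition for the two data-quality measures, the sketch below computes a missing-value rate and a Z-score outlier rate with the standard library only. It is not AutoCarver's implementation: the helper names, the toy column, and the cutoff of 3 standard deviations are assumptions. The notebook itself only fixes the 30% threshold applied to the resulting rates.

```python
import math
from statistics import mean, pstdev


def is_missing(value):
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nan_rate(values):
    """Share of missing values in the column."""
    return sum(1 for v in values if is_missing(v)) / len(values)


def zscore_outlier_rate(values, cutoff=3.0):
    """Share of observed values whose |z-score| exceeds `cutoff` (assumed cutoff)."""
    observed = [v for v in values if not is_missing(v)]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return 0.0  # constant column: no value deviates from the mean
    return sum(1 for v in observed if abs(v - mu) / sigma > cutoff) / len(observed)


# Toy fare-like column: 19 ordinary values, one extreme value, two missing values
toy = [10.0] * 10 + [12.0] * 9 + [500.0] + [None, float("nan")]

print(round(nan_rate(toy), 4))  # 0.0909 (2 missing out of 22)
print(zscore_outlier_rate(toy))  # 0.05 (1 outlier out of 20 observed)
```

With a threshold of 0.3 on both rates, this toy column would be kept, just as every Titanic feature is in the output further down (Nan and ZScore columns well below 0.3).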

Optional: Setting association filters between columns of X

Make sure to check out available association filters!

Let’s say one wants to:

  • Use Cramér’s V as the association measure between QualitativeFeatures (with at most 25% association)

  • Use Pearson’s r as the association measure between QuantitativeFeatures (with at most 25% association)

[38]:
from AutoCarver.selectors import CramervFilter, PearsonFilter

# adding Cramér's V filter for categorical features with a threshold at 25% association
filters = [CramervFilter(threshold=0.25)]

# adding Pearson filter for quantitative features with a threshold at 25% association
filters.append(PearsonFilter(threshold=0.25))
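For readers unfamiliar with the two statistics behind these filters, here is a standard-library sketch of Pearson's r and of Cramér's V (without bias correction). These are textbook formulas, not AutoCarver's code; the function names and toy data are assumptions for illustration.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson's linear correlation between two numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def cramers_v(a, b):
    """Cramér's V (no bias correction) between two categorical sequences."""
    n = len(a)
    joint = Counter(zip(a, b))  # observed counts per pair of categories
    row, col = Counter(a), Counter(b)
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # one variable is constant: no association can be measured
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n  # count expected under independence
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Toy data: one column fully determined by `sex`, one independent of it
sex = ["male", "male", "female", "female"] * 5
tied = ["3", "3", "1", "1"] * 5
independent = ["3", "1", "3", "1"] * 5

print(cramers_v(sex, tied))                   # 1.0
print(cramers_v(sex, independent))            # 0.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

With a filter threshold of 0.25, a feature whose association with an already selected feature exceeds 0.25 is considered redundant and dropped.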

Using Selectors

[39]:
from AutoCarver import ClassificationSelector

# select the features most associated with the target (qualitative and quantitative)
feature_selector = ClassificationSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    measures=measures,
    filters=filters,
    verbose=True,  # displays statistics
)
best_features = feature_selector.select(train_set, train_set[target])
best_features
 [ClassificationSelector] Selected Quantitative Features
   feature                                    Mode     Nan  ZScore  RMeasure   RRank  PearsonFilter  PearsonWith
1  Quantitative('Fare')                     0.0522  0.0000  0.0286    0.2782  0.0000         0.0000  itself
0  Quantitative('Age')                      0.0556  0.0000  0.0034    0.0765  1.0000         0.1356  Fare
2  Quantitative('Siblings/Spouses Aboard')  0.6801  0.0000  0.0185    0.0697     nan            nan  nan
3  Quantitative('Parents/Children Aboard')  0.7374  0.0000  0.0152    0.0955     nan         0.2611  Fare
 [ClassificationSelector] Selected Qualitative Features
   feature               Mode     Nan  CramervMeasure  CramervRank  CramervFilter  CramervWith
0  Categorical('Sex')  0.6364  0.0000          0.5337       0.0000         0.0000  itself
1  Ordinal('Pclass')   0.5488  0.0000          0.3210       1.0000         0.1060  Sex
[39]:
Features(['Sex', 'Pclass', 'Fare', 'Age'])
  • Amongst qualitatives, feature Sex is the most associated with the target Survived:

    • Cramér’s V value is CramervMeasure=0.5337, which is above the threshold of 0.3

    • It has 0% of NaNs (Nan=0.0000), which is below the threshold of 0.3

  • Amongst quantitatives, feature Siblings/Spouses Aboard is the least associated with the target Survived:

    • its coefficient of determination is RMeasure=0.0697, which is below the threshold of 0.07

    • the feature is discarded

  • Amongst quantitatives, feature Parents/Children Aboard is the second most associated with the target Survived:

    • its coefficient of determination is RMeasure=0.0955, which is above the threshold of 0.07

    • Pearson’s r with feature Fare is PearsonFilter=0.2611, which is above the threshold of 0.25

    • the feature is discarded
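The decisions described above can be replayed by hand from the printed statistics. The sketch below is a simplified reading of the verbose output, not the selector's internal logic: it applies the two thresholds configured earlier (0.07 for RMeasure, 0.25 for PearsonFilter) to the values copied from the quantitative table. The dictionary layout and variable names are assumptions.

```python
# Values copied from the verbose output above (quantitative features only).
# "pearson" is the printed PearsonFilter value; None stands for the printed "nan".
stats = {
    "Fare": {"r": 0.2782, "pearson": 0.0000},
    "Age": {"r": 0.0765, "pearson": 0.1356},
    "Siblings/Spouses Aboard": {"r": 0.0697, "pearson": None},
    "Parents/Children Aboard": {"r": 0.0955, "pearson": 0.2611},
}
R_THRESHOLD = 0.07        # minimum association with the target
PEARSON_THRESHOLD = 0.25  # maximum association with a better-ranked feature

kept, dropped = [], {}
for name, s in stats.items():
    if s["r"] < R_THRESHOLD:
        dropped[name] = "RMeasure below 0.07"
    elif s["pearson"] is not None and s["pearson"] > PEARSON_THRESHOLD:
        dropped[name] = "PearsonFilter above 0.25"
    else:
        kept.append(name)

print(kept)  # ['Fare', 'Age']
for name, reason in dropped.items():
    print(f"{name}: {reason}")
```

The two retained names match the quantitative part of the selector's result, Features(['Sex', 'Pclass', 'Fare', 'Age']).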

What’s next?

  • Thanks to Selectors, you’ve selected the best features for your classification task!

  • You can now proceed with your model, but first, make sure to check out Carvers Examples in order to maximize your features’ predictive power!