Setting things up
About this notebook
In this notebook, we use ClassificationSelector to quickly identify and select the most relevant features of the Titanic dataset for a classification task. No preprocessing with BinaryCarver is involved here: the goal is to show feature selection on its own.
The Titanic dataset, derived from the 1912 Titanic passenger records, contains features such as socio-economic status (ticket class), age, and fare. Using ClassificationSelector, we aim to identify the features that are most relevant for predicting survival, which is a binary classification task.
Throughout this notebook, we explore how ClassificationSelector evaluates and selects features: how each feature's association with the target is measured, and how features that are too strongly associated with one another are filtered out.
Installation
[1]:
# %pip install AutoCarver[jupyter]
Titanic Data
In this example notebook, we will use the Titanic dataset.
The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.
The dataset includes various features such as passengers’ names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).
[2]:
import pandas as pd
# URL to the Titanic dataset (CSV hosted on the Stanford CS109 course site)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)
# Display the first few rows of the dataset
titanic_data.head()
[2]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
Target type and Selector selection
[3]:
target = "Survived"
titanic_data[target].value_counts(dropna=False)
[3]:
Survived
0 545
1 342
Name: count, dtype: int64
The target "Survived" is a binary target of type int64, used in a classification task. Hence we will use AutoCarver.selectors.ClassificationSelector in the following code blocks.
Data Sampling
[4]:
from sklearn.model_selection import train_test_split
# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[4]:
(np.float64(0.38552188552188554), np.float64(0.3856655290102389))
Setting up Features to select
[5]:
train_set.head()
[5]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 617 | 0 | 3 | Mr. Antoni Yasbeck | male | 27.0 | 1 | 0 | 14.4542 |
| 489 | 0 | 1 | Mr. Harry Markland Molson | male | 55.0 | 0 | 0 | 30.5000 |
| 871 | 1 | 3 | Miss. Adele Kiamie Najib | female | 15.0 | 0 | 0 | 7.2250 |
| 654 | 0 | 3 | Mrs. John (Catherine) Bourke | female | 32.0 | 1 | 1 | 15.5000 |
| 653 | 0 | 3 | Mr. Alexander Radeff | male | 27.0 | 0 | 0 | 7.8958 |
[6]:
# column data types
train_set.dtypes
[6]:
Survived int64
Pclass int64
Name object
Sex object
Age float64
Siblings/Spouses Aboard int64
Parents/Children Aboard int64
Fare float64
dtype: object
[7]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()
[7]:
Parents/Children Aboard
0 438
1 87
2 60
3 3
5 3
4 2
6 1
Name: count, dtype: int64
[8]:
# values taken by Pclass
train_set["Pclass"].value_counts()
[8]:
Pclass
3 326
1 142
2 126
Name: count, dtype: int64
The feature "Pclass" is of type "int64", but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature (it reflects socio-economic status). Thus we will add it to ordinals, together with the ordering of its values (given as strings).
"Sex" is the only qualitative categorical feature; it is added to the list of categoricals.
"Fare" is the only quantitative continuous feature, whilst "Age", "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. Those four features will be added to the list of quantitatives.
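To illustrate what a string ordering means for an integer-typed column, here is a minimal plain-Python sketch. It is not AutoCarver code: the toy values and the names `raw_pclass`, `pclass_order` and `rank_of` are made up for illustration, and whether AutoCarver casts the column to strings internally is an assumption not confirmed by this notebook.

```python
# user-specified ordering, given as strings (same labels as in this notebook)
pclass_order = ["1", "2", "3"]

# toy integer values standing in for the int64 "Pclass" column (hypothetical)
raw_pclass = [3, 1, 2, 3, 1]

# cast to strings so that the values match the labels of the ordering
pclass_as_str = [str(value) for value in raw_pclass]

# position of each label in the ordering: "1" < "2" < "3"
rank_of = {label: position for position, label in enumerate(pclass_order)}
pclass_ranks = [rank_of[label] for label in pclass_as_str]

print(pclass_as_str)
print(pclass_ranks)
```

The point of the sketch is only that the labels in the ordering and the values of the column must be comparable; an integer 3 and a string "3" are not equal in Python.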
[9]:
from AutoCarver import Features
# initiating Features to carve
features = Features(
categoricals=["Sex"],
quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
ordinals={"Pclass": ["1", "2", "3"]}, # user-specified ordering for ordinal features
)
features["Pclass"], features["Sex"], features["Age"]
[9]:
(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))
Feature Selection
Selectors settings
Number of features to select
The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).
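As a rough picture of what "n best per type" means, the following sketch keeps the highest-scoring features within each data type. It is a simplified illustration under stated assumptions, not AutoCarver's implementation: the function `top_n_per_type`, the feature names and the scores are all hypothetical.

```python
from collections import defaultdict


def top_n_per_type(scores, n_best):
    """Keep the n_best highest-scoring features within each feature type.

    scores: list of (feature_name, feature_type, association) tuples.
    """
    by_type = defaultdict(list)
    for name, feature_type, association in scores:
        by_type[feature_type].append((association, name))

    selected = {}
    for feature_type, scored in by_type.items():
        ranked = sorted(scored, reverse=True)  # strongest association first
        selected[feature_type] = [name for _, name in ranked[:n_best]]
    return selected


# hypothetical association scores with the target
toy_scores = [
    ("a", "quantitative", 0.30),
    ("b", "quantitative", 0.10),
    ("c", "quantitative", 0.20),
    ("d", "qualitative", 0.50),
    ("e", "qualitative", 0.40),
]
result = top_n_per_type(toy_scores, n_best=2)
print(result)
```

With `n_best` at least as large as the number of features of a type, nothing is dropped by this step, which is why a value of 4 on this small dataset mostly serves to compute statistics.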
[24]:
n_best_per_type = 4 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics
Optional: Setting association measure between X and y
Make sure to check out available association measures!
Let's say one wants to:
- Use Cramér's V as the association measure between each QualitativeFeature and the binary target Survived (with at least 30% association)
- Use the coefficient of determination as the association measure between each QuantitativeFeature and the binary target Survived (with at least 7% association)
- Remove features that have more than 30% of missing values
- Remove features that have more than 30% of outliers according to Z-score
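For reference, Cramér's V between a qualitative feature and a binary target can be computed from the chi-squared statistic of their contingency table. The sketch below is a plain-Python version without bias correction, on made-up data; the function `cramers_v` is hypothetical, and AutoCarver's CramervMeasure may differ in details (for example in how missing values or corrections are handled).

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Cramér's V between two categorical sequences (no bias correction).

    Assumes both sequences take at least two distinct values.
    """
    n = len(x)
    x_counts, y_counts = Counter(x), Counter(y)
    pair_counts = Counter(zip(x, y))

    # chi-squared statistic of the contingency table
    chi2 = 0.0
    for x_value, x_count in x_counts.items():
        for y_value, y_count in y_counts.items():
            expected = x_count * y_count / n
            observed = pair_counts.get((x_value, y_value), 0)
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(x_counts), len(y_counts)) - 1
    return math.sqrt(chi2 / (n * min_dim))


# toy data, not taken from the Titanic dataset
sex = ["male", "male", "male", "female", "female", "female", "male", "female"]
survived = [0, 0, 1, 1, 1, 0, 0, 1]

v = cramers_v(sex, survived)
print(round(v, 4), v >= 0.3)  # compare against a 30% threshold
```

A value of 0 means no association and 1 means that one variable fully determines the other, so a threshold of 0.3 keeps only features with a fairly marked link to the target.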
[25]:
from AutoCarver.selectors import CramervMeasure, RMeasure, ZscoreOutlierMeasure, NanMeasure
# adding Nan measure for all features with a threshold at 30% of missing values
measures = [NanMeasure(threshold=0.3)]
# adding Z-score outlier measure for quantitative features with a threshold at 30% of outliers
measures.append(ZscoreOutlierMeasure(threshold=0.3))
# adding Cramér's V measure for categorical features with a threshold at 30% association
measures.append(CramervMeasure(threshold=0.3))
# adding R measure for quantitative features with a threshold at 7% association
measures.append(RMeasure(threshold=0.07))
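The two data-quality measures above can be pictured as simple rates. The sketch below is a plain-Python illustration on made-up values, not AutoCarver's code: the helpers `nan_rate` and `zscore_outlier_rate` are hypothetical, and the cut-off of 3 standard deviations for calling a value an outlier is an assumption (the notebook only sets the 30% threshold on the resulting rate).

```python
import math
from statistics import mean, pstdev


def is_missing(value):
    """True for None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nan_rate(values):
    """Share of missing values in the column."""
    return sum(1 for v in values if is_missing(v)) / len(values)


def zscore_outlier_rate(values, z_cutoff=3.0):
    """Share of observed values whose absolute z-score exceeds z_cutoff."""
    observed = [v for v in values if not is_missing(v)]
    mu, sigma = mean(observed), pstdev(observed)
    if sigma == 0:
        return 0.0  # constant column: no outliers by this definition
    return sum(1 for v in observed if abs(v - mu) / sigma > z_cutoff) / len(observed)


# toy column: twelve ordinary values, one extreme value, one missing value
toy_fares = [7.25, 8.05, 7.9, 8.0, 7.5, 7.8, 8.1, 7.7, 7.6, 7.95, 8.2, 7.4, 500.0, None]

print(nan_rate(toy_fares) <= 0.3)             # would pass a 30% missing-value threshold
print(zscore_outlier_rate(toy_fares) <= 0.3)  # would pass a 30% outlier threshold
```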
Optional: Setting association measure between columns of X
Make sure to check out available association filters!
Let's say one wants to:
- Use Cramér's V as the association measure between QualitativeFeatures (with at most 25% association)
- Use Pearson's r as the association measure between QuantitativeFeatures (with at most 25% association)
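One way to picture such a filter is a greedy pass over features ranked by their association with the target: a feature is kept only if its absolute Pearson's r with every already-kept feature stays at or below the threshold. The sketch below implements that idea in plain Python on toy columns. It is an assumption about the general approach, not AutoCarver's implementation; `pearson_r`, `filter_correlated` and the column names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def filter_correlated(ranked_columns, threshold):
    """Greedy filter over (name, values) pairs, best-ranked feature first."""
    kept = []
    for name, values in ranked_columns:
        if all(abs(pearson_r(values, kept_values)) <= threshold for _, kept_values in kept):
            kept.append((name, values))
    return [name for name, _ in kept]


# toy columns, ordered from most to least associated with the target
toy_columns = [
    ("best", [1.0, 2.0, 3.0, 4.0, 5.0]),
    ("near_copy", [1.1, 2.0, 2.9, 4.2, 5.0]),     # almost a copy of "best"
    ("unrelated", [2.0, -1.0, -2.0, -1.0, 2.0]),  # zero correlation with "best"
]
kept_names = filter_correlated(toy_columns, threshold=0.25)
print(kept_names)
```

Ranking first matters: of two strongly correlated features, the one that is more associated with the target is the one that survives.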
[38]:
from AutoCarver.selectors import CramervFilter, PearsonFilter
# adding Cramér's V filter for categorical features with a threshold at 25% association
filters = [CramervFilter(threshold=0.25)]
# adding Pearson filter for quantitative features with a threshold at 25% association
filters.append(PearsonFilter(threshold=0.25))
Using Selectors
[39]:
from AutoCarver import ClassificationSelector
# select the features most associated with the target (quantitative and qualitative)
feature_selector = ClassificationSelector(
features=features,
n_best_per_type=n_best_per_type,
measures=measures,
filters=filters,
verbose=True, # displays statistics
)
best_features = feature_selector.select(train_set, train_set[target])
best_features
[ClassificationSelector] Selected Quantitative Features
| | feature | Mode | Nan | ZScore | RMeasure | RRank | PearsonFilter | PearsonWith |
|---|---|---|---|---|---|---|---|---|
| 1 | Quantitative('Fare') | 0.0522 | 0.0000 | 0.0286 | 0.2782 | 0.0000 | 0.0000 | itself |
| 0 | Quantitative('Age') | 0.0556 | 0.0000 | 0.0034 | 0.0765 | 1.0000 | 0.1356 | Fare |
| 2 | Quantitative('Siblings/Spouses Aboard') | 0.6801 | 0.0000 | 0.0185 | 0.0697 | nan | nan | nan |
| 3 | Quantitative('Parents/Children Aboard') | 0.7374 | 0.0000 | 0.0152 | 0.0955 | nan | 0.2611 | Fare |
[ClassificationSelector] Selected Qualitative Features
| | feature | Mode | Nan | CramervMeasure | CramervRank | CramervFilter | CramervWith |
|---|---|---|---|---|---|---|---|
| 0 | Categorical('Sex') | 0.6364 | 0.0000 | 0.5337 | 0.0000 | 0.0000 | itself |
| 1 | Ordinal('Pclass') | 0.5488 | 0.0000 | 0.3210 | 1.0000 | 0.1060 | Sex |
[39]:
Features(['Sex', 'Pclass', 'Fare', 'Age'])
Amongst qualitatives, feature "Sex" is the most associated with the target "Survived":
- Cramér's V value is CramervMeasure=0.5337, which is above the threshold of 0.3
- It has 0% of NaNs (Nan=0.0000), which is below the threshold of 0.3
Amongst quantitatives, feature "Siblings/Spouses Aboard" is the least associated with the target "Survived":
- The coefficient of determination's value is RMeasure=0.0697, which is below the threshold of 0.07
- The feature is discarded
Feature "Parents/Children Aboard" is the second most associated with the target "Survived":
- The coefficient of determination's value is RMeasure=0.0955, which is above the threshold of 0.07
- Pearson's r with feature "Fare" is PearsonFilter=0.2611, which is above the threshold of 0.25
- The feature is discarded
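The reading of the quantitative table can be replayed by applying the two thresholds to the reported statistics. The values below are copied from the selector output above; the function `decide` and the exact comparison operators (strict or not) are assumptions made for this sketch, not AutoCarver's code.

```python
# (feature, RMeasure, PearsonFilter) as reported above; None where the output shows nan
reported = [
    ("Fare", 0.2782, 0.0),
    ("Age", 0.0765, 0.1356),
    ("Siblings/Spouses Aboard", 0.0697, None),
    ("Parents/Children Aboard", 0.0955, 0.2611),
]

R_THRESHOLD = 0.07        # minimum association with the target
PEARSON_THRESHOLD = 0.25  # maximum association with a better-ranked feature


def decide(r_measure, pearson):
    """Replay the two-step decision: target measure first, then pairwise filter."""
    if r_measure < R_THRESHOLD:
        return "discarded by measure"
    if pearson is not None and pearson > PEARSON_THRESHOLD:
        return "discarded by filter"
    return "kept"


decisions = {name: decide(r_measure, pearson) for name, r_measure, pearson in reported}
kept = [name for name, decision in decisions.items() if decision == "kept"]

for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

The kept quantitative features, "Fare" and "Age", match the quantitative part of the returned Features(['Sex', 'Pclass', 'Fare', 'Age']).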
What’s next?
Thanks to Selectors, you've selected the best features for your classification task!
You can now proceed with your model, but first, make sure to check out Carvers Examples in order to maximize your features' predictive power!