{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting things up\n", "\n", "## About this notebook\n", "\n", "In this notebook, we use ``ClassificationSelector`` to quickly identify and select the features of the Titanic dataset that are most useful for a classification task. No preprocessing with ``BinaryCarver`` is involved: the goal is only to streamline feature selection so that downstream classification models are more efficient and accurate.\n", "\n", "The Titanic dataset, derived from the passenger records of the 1912 voyage, contains a variety of features such as socio-economic status, age, and cabin location. Using ``ClassificationSelector``, we aim to identify the features most relevant to predicting survival, which is a binary classification task.\n", "\n", "Throughout this notebook, we will explore how ``ClassificationSelector`` evaluates and selects features. 
By focusing on feature importance and relevance, we aim to build a robust dataset that enhances the performance of our classification models without the need for extensive preprocessing.\n", "\n", "Join us as we utilize ``ClassificationSelector`` to efficiently refine the Titanic Dataset, paving the way for accurate and impactful binary classification models.\n", "\n", "Let’s dive in and uncover the potential of ``ClassificationSelector`` in optimizing the Titanic Dataset for predictive modeling.\n", "\n", "\n", "## Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %pip install AutoCarver[jupyter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Titanic Data\n", "\n", "In this example notebook, we will use the Titanic dataset.\n", "\n", "The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.\n", "\n", "The dataset includes various features such as passengers' names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"<tr><th>0</th><td>0</td><td>3</td><td>Mr. Owen Harris Braund</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td></tr>\n",
"<tr><th>2</th><td>1</td><td>3</td><td>Miss. Laina Heikkinen</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td></tr>\n",
"<tr><th>3</th><td>1</td><td>1</td><td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td></tr>\n",
"<tr><th>4</th><td>0</td><td>3</td><td>Mr. William Henry Allen</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td></tr>\n",
"</table>\n" ], "text/plain": [ " Survived Pclass Name \\\n", "0 0 3 Mr. Owen Harris Braund \n", "1 1 1 Mrs. John Bradley (Florence Briggs Thayer) Cum... \n", "2 1 3 Miss. Laina Heikkinen \n", "3 1 1 Mrs. Jacques Heath (Lily May Peel) Futrelle \n", "4 0 3 Mr. William Henry Allen \n", "\n", " Sex Age Siblings/Spouses Aboard Parents/Children Aboard Fare \n", "0 male 22.0 1 0 7.2500 \n", "1 female 38.0 1 0 71.2833 \n", "2 female 26.0 0 0 7.9250 \n", "3 female 35.0 1 0 53.1000 \n", "4 male 35.0 0 0 8.0500 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "# URL to the Titanic dataset (CSV hosted in Stanford's CS109 course archive)\n", "titanic_url = \"https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv\"\n", "\n", "# Use pandas to read the CSV file directly from the URL\n", "titanic_data = pd.read_csv(titanic_url)\n", "\n", "# Display the first few rows of the dataset\n", "titanic_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target type and Selector selection" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Survived\n", "0 545\n", "1 342\n", "Name: count, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target = \"Survived\"\n", "\n", "titanic_data[target].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target ``\"Survived\"`` is a binary target of type ``int64``, so this is a classification task. Hence we will use ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Sampling" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(np.float64(0.38552188552188554), np.float64(0.3856655290102389))" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# stratified sampling by target\n", "train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])\n", "\n", "# checking target rate per dataset\n", "train_set[target].mean(), dev_set[target].mean()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting up Features to select" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<tr><th></th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>Siblings/Spouses Aboard</th><th>Parents/Children Aboard</th><th>Fare</th></tr>\n",
"<tr><th>617</th><td>0</td><td>3</td><td>Mr. Antoni Yasbeck</td><td>male</td><td>27.0</td><td>1</td><td>0</td><td>14.4542</td></tr>\n",
"<tr><th>489</th><td>0</td><td>1</td><td>Mr. Harry Markland Molson</td><td>male</td><td>55.0</td><td>0</td><td>0</td><td>30.5000</td></tr>\n",
"<tr><th>871</th><td>1</td><td>3</td><td>Miss. Adele Kiamie Najib</td><td>female</td><td>15.0</td><td>0</td><td>0</td><td>7.2250</td></tr>\n",
"<tr><th>654</th><td>0</td><td>3</td><td>Mrs. John (Catherine) Bourke</td><td>female</td><td>32.0</td><td>1</td><td>1</td><td>15.5000</td></tr>\n",
"<tr><th>653</th><td>0</td><td>3</td><td>Mr. Alexander Radeff</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>7.8958</td></tr>\n",
"</table>
" ], "text/plain": [ " Survived Pclass Name Sex Age \\\n", "617 0 3 Mr. Antoni Yasbeck male 27.0 \n", "489 0 1 Mr. Harry Markland Molson male 55.0 \n", "871 1 3 Miss. Adele Kiamie Najib female 15.0 \n", "654 0 3 Mrs. John (Catherine) Bourke female 32.0 \n", "653 0 3 Mr. Alexander Radeff male 27.0 \n", "\n", " Siblings/Spouses Aboard Parents/Children Aboard Fare \n", "617 1 0 14.4542 \n", "489 0 0 30.5000 \n", "871 0 0 7.2250 \n", "654 1 1 15.5000 \n", "653 0 0 7.8958 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_set.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Survived int64\n", "Pclass int64\n", "Name object\n", "Sex object\n", "Age float64\n", "Siblings/Spouses Aboard int64\n", "Parents/Children Aboard int64\n", "Fare float64\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# column data types\n", "train_set.dtypes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Parents/Children Aboard\n", "0 438\n", "1 87\n", "2 60\n", "3 3\n", "5 3\n", "4 2\n", "6 1\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# values taken by Parents/Children Aboard\n", "train_set[\"Parents/Children Aboard\"].value_counts()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass\n", "3 326\n", "1 142\n", "2 126\n", "Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# values taken by Pclass\n", "train_set[\"Pclass\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The feature ``\"Pclass\"`` is of type ``\"int64\"``, but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature 
(socio-economic status). Thus we will add it to ``ordinals`` and set the ordering of its values there (as string values).\n", "\n", "``\"Sex\"`` is the only qualitative categorical feature; it is added to the list of ``categoricals``.\n", "\n", "``\"Fare\"`` is the only quantitative continuous feature, whilst ``\"Age\"``, ``\"Siblings/Spouses Aboard\"`` and ``\"Parents/Children Aboard\"`` can be considered quantitative discrete features. Those four features are added to the list of ``quantitatives``." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AutoCarver import Features\n", "\n", "# initiating Features to select from\n", "features = Features(\n", "    categoricals=[\"Sex\"],\n", "    quantitatives=[\"Age\", \"Fare\", \"Siblings/Spouses Aboard\", \"Parents/Children Aboard\"],\n", "    ordinals={\"Pclass\": [\"1\", \"2\", \"3\"]},  # user-specified ordering for ordinal features\n", ")\n", "features[\"Pclass\"], features[\"Sex\"], features[\"Age\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "## Selectors settings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of features to select\n", "\n", "The parameter ``n_best_per_type`` sets the number of features to select per data type (quantitative and qualitative)." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "n_best_per_type = 4  # the number of features is low here, so this cap does not discard anything by itself" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *Optional:* Setting association measure between X and y\n", "Make sure to check out available [association measures](https://autocarver.readthedocs.io/en/latest/selectors.html#association-measures-x-by-y)!\n", "\n", "Let's say one wants to:\n", "\n", "* Use Cramér's V as the association measure between each `QualitativeFeature` and the binary target ``Survived`` (with at least 30% association)\n", "* Use the coefficient of determination as the association measure between each `QuantitativeFeature` and the binary target ``Survived`` (with at least 7% association)\n", "* Remove features that have more than 30% of missing values\n", "* Remove features that have more than 30% of outliers according to their z-score" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "from AutoCarver.selectors import CramervMeasure, RMeasure, ZscoreOutlierMeasure, NanMeasure\n", "\n", "# adding NaN measure for all features with a threshold at 30% of missing values\n", "measures = [NanMeasure(threshold=0.3)]\n", "\n", "# adding z-score outlier measure for quantitative features with a threshold at 30% of outliers\n", "measures.append(ZscoreOutlierMeasure(threshold=0.3))\n", "\n", "# adding Cramér's V measure for qualitative features with a threshold at 30% association\n", "measures.append(CramervMeasure(threshold=0.3))\n", "\n", "# adding R measure for quantitative features with a threshold at 7% association\n", "measures.append(RMeasure(threshold=0.07))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### *Optional:* Setting association filters between columns of X\n", "Make sure to check out available [association 
filters](https://autocarver.readthedocs.io/en/latest/selectors.html#association-filters-x-by-x)!\n", "\n", "Let's say one wants to:\n", "\n", "* Use Cramér's V as the association measure between `QualitativeFeature`s (with at most 25% association)\n", "* Use Pearson's r as the association measure between `QuantitativeFeature`s (with at most 25% association)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "from AutoCarver.selectors import CramervFilter, PearsonFilter\n", "\n", "# adding Cramér's V filter for qualitative features with a threshold at 25% association\n", "filters = [CramervFilter(threshold=0.25)]\n", "\n", "# adding Pearson filter for quantitative features with a threshold at 25% association\n", "filters.append(PearsonFilter(threshold=0.25))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Selectors" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [ClassificationSelector] Selected Quantitative Features \n" ] }, { "data": { "text/html": [ "<table>\n",
"<tr><th></th><th>feature</th><th>Mode</th><th>Nan</th><th>ZScore</th><th>RMeasure</th><th>RRank</th><th>PearsonFilter</th><th>PearsonWith</th></tr>\n",
"<tr><th>1</th><td>Quantitative('Fare')</td><td>0.0522</td><td>0.0000</td><td>0.0286</td><td>0.2782</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>0</th><td>Quantitative('Age')</td><td>0.0556</td><td>0.0000</td><td>0.0034</td><td>0.0765</td><td>1.0000</td><td>0.1356</td><td>Fare</td></tr>\n",
"<tr><th>2</th><td>Quantitative('Siblings/Spouses Aboard')</td><td>0.6801</td><td>0.0000</td><td>0.0185</td><td>0.0697</td><td>nan</td><td>nan</td><td>nan</td></tr>\n",
"<tr><th>3</th><td>Quantitative('Parents/Children Aboard')</td><td>0.7374</td><td>0.0000</td><td>0.0152</td><td>0.0955</td><td>nan</td><td>0.2611</td><td>Fare</td></tr>\n",
"</table>
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ " [ClassificationSelector] Selected Qualitative Features \n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
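"<table>\n",
"<tr><th></th><th>feature</th><th>Mode</th><th>Nan</th><th>CramervMeasure</th><th>CramervRank</th><th>CramervFilter</th><th>CramervWith</th></tr>\n",
"<tr><th>0</th><td>Categorical('Sex')</td><td>0.6364</td><td>0.0000</td><td>0.5337</td><td>0.0000</td><td>0.0000</td><td>itself</td></tr>\n",
"<tr><th>1</th><td>Ordinal('Pclass')</td><td>0.5488</td><td>0.0000</td><td>0.3210</td><td>1.0000</td><td>0.1060</td><td>Sex</td></tr>\n",
"</table>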
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Features(['Sex', 'Pclass', 'Fare', 'Age'])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AutoCarver import ClassificationSelector\n", "\n", "# select the features most associated with the target\n", "feature_selector = ClassificationSelector(\n", "    features=features,\n", "    n_best_per_type=n_best_per_type,\n", "    measures=measures,\n", "    filters=filters,\n", "    verbose=True,  # displays statistics\n", ")\n", "best_features = feature_selector.select(train_set, train_set[target])\n", "best_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Amongst qualitative features, ``Sex`` is the most associated with the target ``Survived``:\n", "  - its Cramér's V is ``CramervMeasure=0.5337``, which is above the threshold of ``0.3``\n", "  - it has 0% of NaNs (``Nan=0.0000``), which is below the threshold of ``0.3``\n", "\n", "* Amongst quantitative features, ``Siblings/Spouses Aboard`` is the least associated with the target ``Survived``:\n", "  - its coefficient of determination is ``RMeasure=0.0697``, which is below the threshold of ``0.07``\n", "  - the feature is discarded\n", "\n", "* Amongst quantitative features, ``Parents/Children Aboard`` is the second most associated with the target ``Survived``:\n", "  - its coefficient of determination is ``RMeasure=0.0955``, which is above the threshold of ``0.07``\n", "  - its Pearson's r with feature ``Fare`` is ``PearsonFilter=0.2611``, which is above the threshold of ``0.25``\n", "  - the feature is discarded" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "\n", "* Thanks to **Selectors**, you've selected the best features for your classification task! \n", "* You can now proceed with your model, but first, make sure to check out [Carvers Examples](https://autocarver.readthedocs.io/en/latest/carvers_examples.html) in order to maximize your features' predictive power!" 
] } ], "metadata": { "kernelspec": { "display_name": "autocarver-i96ERKJw-py3.9", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }