{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Setting things up\n",
    "\n",
    "## About this notebook\n",
    "\n",
    "In this notebook, we focus on enhancing the predictive performance of the Titanic Dataset by leveraging ``ClassificationSelector``, a powerful tool designed to quickly identify and select the best features for classification tasks. Unlike traditional preprocessing methods, this notebook does not involve any preprocessing with BinaryCarver. Instead, our goal is to streamline the feature selection process to improve the efficiency and accuracy of our classification models.\n",
    "\n",
    "The Titanic Dataset, derived from the historic 1912 Titanic passenger records, contains a variety of features such as socio-economic status, age, and cabin location. Using ``ClassificationSelector``, we aim to identify the most relevant features that contribute to predicting survival outcomes, ensuring that our dataset is optimized for binary classification tasks.\n",
    "\n",
    "Throughout this notebook, we will explore the capabilities of ``ClassificationSelector`` in evaluating and selecting features. By focusing on feature importance and relevance, we aim to build a robust dataset that enhances the performance of our classification models without the need for extensive preprocessing.\n",
    "\n",
    "Join us as we utilize ``ClassificationSelector`` to efficiently refine the Titanic Dataset, paving the way for accurate and impactful binary classification models.\n",
    "\n",
    "Let’s dive in and uncover the potential of ``ClassificationSelector`` in optimizing the Titanic Dataset for predictive modeling.\n",
    "\n",
    "\n",
    "## Installation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# %pip install AutoCarver[jupyter]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Titanic Data\n",
    "\n",
    "In this example notebook, we will use the Titanic dataset.\n",
    "\n",
    "The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.\n",
    "\n",
    "The dataset includes various features such as passengers' names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>Siblings/Spouses Aboard</th>\n",
       "      <th>Parents/Children Aboard</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Mr. Owen Harris Braund</td>\n",
       "      <td>male</td>\n",
       "      <td>22.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mrs. John Bradley (Florence Briggs Thayer) Cum...</td>\n",
       "      <td>female</td>\n",
       "      <td>38.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>71.2833</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Miss. Laina Heikkinen</td>\n",
       "      <td>female</td>\n",
       "      <td>26.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.9250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Mrs. Jacques Heath (Lily May Peel) Futrelle</td>\n",
       "      <td>female</td>\n",
       "      <td>35.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>53.1000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Mr. William Henry Allen</td>\n",
       "      <td>male</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>8.0500</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Survived  Pclass                                               Name  \\\n",
       "0         0       3                             Mr. Owen Harris Braund   \n",
       "1         1       1  Mrs. John Bradley (Florence Briggs Thayer) Cum...   \n",
       "2         1       3                              Miss. Laina Heikkinen   \n",
       "3         1       1        Mrs. Jacques Heath (Lily May Peel) Futrelle   \n",
       "4         0       3                            Mr. William Henry Allen   \n",
       "\n",
       "      Sex   Age  Siblings/Spouses Aboard  Parents/Children Aboard     Fare  \n",
       "0    male  22.0                        1                        0   7.2500  \n",
       "1  female  38.0                        1                        0  71.2833  \n",
       "2  female  26.0                        0                        0   7.9250  \n",
       "3  female  35.0                        1                        0  53.1000  \n",
       "4    male  35.0                        0                        0   8.0500  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# URL to the Titanic dataset on Kaggle\n",
    "titanic_url = \"https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv\"\n",
    "\n",
    "# Use pandas to read the CSV file directly from the URL\n",
    "titanic_data = pd.read_csv(titanic_url)\n",
    "\n",
    "# Display the first few rows of the dataset\n",
    "titanic_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Target type and Selector selection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived\n",
       "0    545\n",
       "1    342\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target = \"Survived\"\n",
    "\n",
    "titanic_data[target].value_counts(dropna=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The target ``\"Survived\"`` is a binary target of type ``int64`` used in a classification task. Hence we will use ``AutoCarver.selectors.ClassificationSelector`` in following code blocks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Sampling"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(np.float64(0.38552188552188554), np.float64(0.3856655290102389))"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# stratified sampling by target\n",
    "train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])\n",
    "\n",
    "# checking target rate per dataset\n",
    "train_set[target].mean(), dev_set[target].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up Features to select"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Survived</th>\n",
       "      <th>Pclass</th>\n",
       "      <th>Name</th>\n",
       "      <th>Sex</th>\n",
       "      <th>Age</th>\n",
       "      <th>Siblings/Spouses Aboard</th>\n",
       "      <th>Parents/Children Aboard</th>\n",
       "      <th>Fare</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>617</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Mr. Antoni Yasbeck</td>\n",
       "      <td>male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>14.4542</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>489</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>Mr. Harry Markland Molson</td>\n",
       "      <td>male</td>\n",
       "      <td>55.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>30.5000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>871</th>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>Miss. Adele Kiamie Najib</td>\n",
       "      <td>female</td>\n",
       "      <td>15.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.2250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>654</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Mrs. John (Catherine) Bourke</td>\n",
       "      <td>female</td>\n",
       "      <td>32.0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>15.5000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>653</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>Mr. Alexander Radeff</td>\n",
       "      <td>male</td>\n",
       "      <td>27.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>7.8958</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     Survived  Pclass                          Name     Sex   Age  \\\n",
       "617         0       3            Mr. Antoni Yasbeck    male  27.0   \n",
       "489         0       1     Mr. Harry Markland Molson    male  55.0   \n",
       "871         1       3      Miss. Adele Kiamie Najib  female  15.0   \n",
       "654         0       3  Mrs. John (Catherine) Bourke  female  32.0   \n",
       "653         0       3          Mr. Alexander Radeff    male  27.0   \n",
       "\n",
       "     Siblings/Spouses Aboard  Parents/Children Aboard     Fare  \n",
       "617                        1                        0  14.4542  \n",
       "489                        0                        0  30.5000  \n",
       "871                        0                        0   7.2250  \n",
       "654                        1                        1  15.5000  \n",
       "653                        0                        0   7.8958  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_set.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Survived                     int64\n",
       "Pclass                       int64\n",
       "Name                        object\n",
       "Sex                         object\n",
       "Age                        float64\n",
       "Siblings/Spouses Aboard      int64\n",
       "Parents/Children Aboard      int64\n",
       "Fare                       float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# column data types\n",
    "train_set.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Parents/Children Aboard\n",
       "0    438\n",
       "1     87\n",
       "2     60\n",
       "3      3\n",
       "5      3\n",
       "4      2\n",
       "6      1\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# values taken by Parents/Children Aboard\n",
    "train_set[\"Parents/Children Aboard\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pclass\n",
       "3    326\n",
       "1    142\n",
       "2    126\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# values taken by Pclass\n",
    "train_set[\"Pclass\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The feature ``\"Pclass\"`` is of type ``\"int64\"``, but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature (socio-economic status). Thus we will add it to the list of ``ordinal_features`` and set the ordering of its values in ``values_orders`` (string values). \n",
    "\n",
    "``\"Sex\"`` is the only quantitative categorical feature, it's added to the list of ``qualitative_features``.\n",
    "\n",
    "``\"Fare\"`` is the only quantitative continuous features, whilst ``\"Age\"``, ``\"Siblings/Spouses Aboard\"`` and ``\"Parents/Children Aboard\"`` can be considered as quantitative discrete features. Those four features will be added to the list of ``quantitative_features``."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(Ordinal('Pclass'), Categorical('Sex'), Quantitative('Age'))"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from AutoCarver import Features\n",
    "\n",
    "# initiating Features to carve\n",
    "features = Features(\n",
    "    categoricals=[\"Sex\"],\n",
    "    quantitatives=[\"Age\", \"Fare\", \"Siblings/Spouses Aboard\", \"Parents/Children Aboard\"],\n",
    "    ordinals={\"Pclass\": [\"1\", \"2\", \"3\"]},  # user-specified ordering for ordinal features\n",
    ")\n",
    "features[\"Pclass\"], features[\"Sex\"], features[\"Age\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Selection\n",
    "## Selectors settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Number of features to select\n",
    "\n",
    "The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_best_per_type = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### *Optional:* Setting association measure between X and y\n",
    "Make sure to check out available [association measures](https://autocarver.readthedocs.io/en/latest/selectors.html#association-measures-x-by-y)!\n",
    "\n",
    "Lets say one wants to:\n",
    "\n",
    "* Use Cramér's V as the association measure between each `QualitativeFeature` and the binary target ``Survived`` (with at least 30% association)\n",
    "* Use the coefficient of determination as the association measure between each `QuantitativeFeature` and the binary target ``Survived`` (with at least 7% association)\n",
    "\n",
    "* Remove features that have more than 30% of missing values\n",
    "* Remove features that have more than 30% of outliers according to Zscore"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "from AutoCarver.selectors import CramervMeasure, RMeasure, ZscoreOutlierMeasure, NanMeasure\n",
    "\n",
    "# adding Nan measure for all features with a threshold at 30% of missing values\n",
    "measures = [NanMeasure(threshold=0.3)]\n",
    "\n",
    "# adding Z-score outlier measure for quantitative features with a threshold at 30% of outliers\n",
    "measures.append(ZscoreOutlierMeasure(threshold=0.3))\n",
    "\n",
    "# adding Cramerv's V measure for categorical features with a threshold at 30% association\n",
    "measures.append(CramervMeasure(threshold=0.3))\n",
    "\n",
    "# adding R measure for quantitative features with a threshold at 7% association\n",
    "measures.append(RMeasure(threshold=0.07))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### *Optional:* Setting association measure columns of X\n",
    "Make sure to check out available [association filters](https://autocarver.readthedocs.io/en/latest/selectors.html#association-filters-x-by-x)!\n",
    "\n",
    "Lets say one wants to:\n",
    "\n",
    "* Use Cramér's V as the association measure between `QualitativeFeature`s (with at most 30% association)\n",
    "* Use Pearson's r as the association measure between `QuantitativeFeature`s (with at most 30% association)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "from AutoCarver.selectors import CramervFilter, PearsonFilter\n",
    "\n",
    "# adding Cramerv's V filter for categorical features with a threshold at 25% association\n",
    "filters = [CramervFilter(threshold=0.25)]\n",
    "\n",
    "# adding Pearson filter for quantitative features with a threshold at 25% association\n",
    "filters.append(PearsonFilter(threshold=0.25))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using Selectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " [ClassificationSelector] Selected Quantitative Features \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_19c04_row0_col4, #T_19c04_row3_col6 {\n",
       "  background-color: #b40426;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "#T_19c04_row0_col6, #T_19c04_row2_col4 {\n",
       "  background-color: #3b4cc0;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "#T_19c04_row1_col4 {\n",
       "  background-color: #445acc;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "#T_19c04_row1_col6 {\n",
       "  background-color: #e2dad5;\n",
       "  color: #000000;\n",
       "}\n",
       "#T_19c04_row2_col6 {\n",
       "  background-color: #000000;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "#T_19c04_row3_col4 {\n",
       "  background-color: #6180e9;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_19c04\" style='display:inline'>\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_19c04_level0_col0\" class=\"col_heading level0 col0\" >feature</th>\n",
       "      <th id=\"T_19c04_level0_col1\" class=\"col_heading level0 col1\" >Mode</th>\n",
       "      <th id=\"T_19c04_level0_col2\" class=\"col_heading level0 col2\" >Nan</th>\n",
       "      <th id=\"T_19c04_level0_col3\" class=\"col_heading level0 col3\" >ZScore</th>\n",
       "      <th id=\"T_19c04_level0_col4\" class=\"col_heading level0 col4\" >RMeasure</th>\n",
       "      <th id=\"T_19c04_level0_col5\" class=\"col_heading level0 col5\" >RRank</th>\n",
       "      <th id=\"T_19c04_level0_col6\" class=\"col_heading level0 col6\" >PearsonFilter</th>\n",
       "      <th id=\"T_19c04_level0_col7\" class=\"col_heading level0 col7\" >PearsonWith</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_19c04_level0_row0\" class=\"row_heading level0 row0\" >1</th>\n",
       "      <td id=\"T_19c04_row0_col0\" class=\"data row0 col0\" >Quantitative('Fare')</td>\n",
       "      <td id=\"T_19c04_row0_col1\" class=\"data row0 col1\" >0.0522</td>\n",
       "      <td id=\"T_19c04_row0_col2\" class=\"data row0 col2\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row0_col3\" class=\"data row0 col3\" >0.0286</td>\n",
       "      <td id=\"T_19c04_row0_col4\" class=\"data row0 col4\" >0.2782</td>\n",
       "      <td id=\"T_19c04_row0_col5\" class=\"data row0 col5\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row0_col6\" class=\"data row0 col6\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row0_col7\" class=\"data row0 col7\" >itself</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_19c04_level0_row1\" class=\"row_heading level0 row1\" >0</th>\n",
       "      <td id=\"T_19c04_row1_col0\" class=\"data row1 col0\" >Quantitative('Age')</td>\n",
       "      <td id=\"T_19c04_row1_col1\" class=\"data row1 col1\" >0.0556</td>\n",
       "      <td id=\"T_19c04_row1_col2\" class=\"data row1 col2\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row1_col3\" class=\"data row1 col3\" >0.0034</td>\n",
       "      <td id=\"T_19c04_row1_col4\" class=\"data row1 col4\" >0.0765</td>\n",
       "      <td id=\"T_19c04_row1_col5\" class=\"data row1 col5\" >1.0000</td>\n",
       "      <td id=\"T_19c04_row1_col6\" class=\"data row1 col6\" >0.1356</td>\n",
       "      <td id=\"T_19c04_row1_col7\" class=\"data row1 col7\" >Fare</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_19c04_level0_row2\" class=\"row_heading level0 row2\" >2</th>\n",
       "      <td id=\"T_19c04_row2_col0\" class=\"data row2 col0\" >Quantitative('Siblings/Spouses Aboard')</td>\n",
       "      <td id=\"T_19c04_row2_col1\" class=\"data row2 col1\" >0.6801</td>\n",
       "      <td id=\"T_19c04_row2_col2\" class=\"data row2 col2\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row2_col3\" class=\"data row2 col3\" >0.0185</td>\n",
       "      <td id=\"T_19c04_row2_col4\" class=\"data row2 col4\" >0.0697</td>\n",
       "      <td id=\"T_19c04_row2_col5\" class=\"data row2 col5\" >nan</td>\n",
       "      <td id=\"T_19c04_row2_col6\" class=\"data row2 col6\" >nan</td>\n",
       "      <td id=\"T_19c04_row2_col7\" class=\"data row2 col7\" >nan</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_19c04_level0_row3\" class=\"row_heading level0 row3\" >3</th>\n",
       "      <td id=\"T_19c04_row3_col0\" class=\"data row3 col0\" >Quantitative('Parents/Children Aboard')</td>\n",
       "      <td id=\"T_19c04_row3_col1\" class=\"data row3 col1\" >0.7374</td>\n",
       "      <td id=\"T_19c04_row3_col2\" class=\"data row3 col2\" >0.0000</td>\n",
       "      <td id=\"T_19c04_row3_col3\" class=\"data row3 col3\" >0.0152</td>\n",
       "      <td id=\"T_19c04_row3_col4\" class=\"data row3 col4\" >0.0955</td>\n",
       "      <td id=\"T_19c04_row3_col5\" class=\"data row3 col5\" >nan</td>\n",
       "      <td id=\"T_19c04_row3_col6\" class=\"data row3 col6\" >0.2611</td>\n",
       "      <td id=\"T_19c04_row3_col7\" class=\"data row3 col7\" >Fare</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " [ClassificationSelector] Selected Qualitative Features \n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style type=\"text/css\">\n",
       "#T_3a655_row0_col3, #T_3a655_row1_col5 {\n",
       "  background-color: #b40426;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "#T_3a655_row0_col5, #T_3a655_row1_col3 {\n",
       "  background-color: #3b4cc0;\n",
       "  color: #f1f1f1;\n",
       "}\n",
       "</style>\n",
       "<table id=\"T_3a655\" style='display:inline'>\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th class=\"blank level0\" >&nbsp;</th>\n",
       "      <th id=\"T_3a655_level0_col0\" class=\"col_heading level0 col0\" >feature</th>\n",
       "      <th id=\"T_3a655_level0_col1\" class=\"col_heading level0 col1\" >Mode</th>\n",
       "      <th id=\"T_3a655_level0_col2\" class=\"col_heading level0 col2\" >Nan</th>\n",
       "      <th id=\"T_3a655_level0_col3\" class=\"col_heading level0 col3\" >CramervMeasure</th>\n",
       "      <th id=\"T_3a655_level0_col4\" class=\"col_heading level0 col4\" >CramervRank</th>\n",
       "      <th id=\"T_3a655_level0_col5\" class=\"col_heading level0 col5\" >CramervFilter</th>\n",
       "      <th id=\"T_3a655_level0_col6\" class=\"col_heading level0 col6\" >CramervWith</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th id=\"T_3a655_level0_row0\" class=\"row_heading level0 row0\" >0</th>\n",
       "      <td id=\"T_3a655_row0_col0\" class=\"data row0 col0\" >Categorical('Sex')</td>\n",
       "      <td id=\"T_3a655_row0_col1\" class=\"data row0 col1\" >0.6364</td>\n",
       "      <td id=\"T_3a655_row0_col2\" class=\"data row0 col2\" >0.0000</td>\n",
       "      <td id=\"T_3a655_row0_col3\" class=\"data row0 col3\" >0.5337</td>\n",
       "      <td id=\"T_3a655_row0_col4\" class=\"data row0 col4\" >0.0000</td>\n",
       "      <td id=\"T_3a655_row0_col5\" class=\"data row0 col5\" >0.0000</td>\n",
       "      <td id=\"T_3a655_row0_col6\" class=\"data row0 col6\" >itself</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th id=\"T_3a655_level0_row1\" class=\"row_heading level0 row1\" >1</th>\n",
       "      <td id=\"T_3a655_row1_col0\" class=\"data row1 col0\" >Ordinal('Pclass')</td>\n",
       "      <td id=\"T_3a655_row1_col1\" class=\"data row1 col1\" >0.5488</td>\n",
       "      <td id=\"T_3a655_row1_col2\" class=\"data row1 col2\" >0.0000</td>\n",
       "      <td id=\"T_3a655_row1_col3\" class=\"data row1 col3\" >0.3210</td>\n",
       "      <td id=\"T_3a655_row1_col4\" class=\"data row1 col4\" >1.0000</td>\n",
       "      <td id=\"T_3a655_row1_col5\" class=\"data row1 col5\" >0.1060</td>\n",
       "      <td id=\"T_3a655_row1_col6\" class=\"data row1 col6\" >Sex</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "Features(['Sex', 'Pclass', 'Fare', 'Age'])"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from AutoCarver import ClassificationSelector\n",
    "\n",
    "# select the most target associated qualitative features\n",
    "feature_selector = ClassificationSelector(\n",
    "    features=features,\n",
    "    n_best_per_type=n_best_per_type,\n",
    "    measures=measures,\n",
    "    filters=filters,\n",
    "    verbose=True,  # displays statistics\n",
    ")\n",
    "best_features = feature_selector.select(train_set, train_set[target])\n",
    "best_features"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Amongst qualitatives, feature ``Sex`` is the most associated with the target ``Survived``:\n",
    "    - Cramér's V value is ``CramervMeasure=0.5337``, which is above threshold of ``0.3``\n",
    "    - It has 0 % of NaNs (``Nan=0.0000``), which is below threshold of ``0.3``\n",
    "\n",
    "* For feature ``Siblings/Spouses Aboard`` is the least associated with the target ``Survived``:\n",
    "    - coefficient of determination R's value is ``RMeasure=0.0697``, which is below threshold of ``0.07``\n",
    "    - the feature is discarded\n",
    "\n",
    "* For feature ``Parents/Children Aboard`` is the second most associated with the target ``Survived``:\n",
    "    - coefficient of determination R's value is ``RMeasure=0.0955``, which is above threshold of ``0.07``\n",
    "    - Pearson's r with Feature ``Fare`` is ``PearsonFilter=0.2611``, which is above threshold of ``0.25``\n",
    "    - the feature is discarded"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What's next?\n",
    "\n",
    "* Thanks to **Selectors**, you've selected the best features for your classification task! \n",
    "* You can now proceed with your model, but first, make sure to ckeck out [Carvers Examples](https://autocarver.readthedocs.io/en/latest/carvers_examples.html) in order to maximize your feature's predictive power!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "autocarver-i96ERKJw-py3.9",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}