{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Setting things up\n", "\n", "## About this notebook\n", "\n", "This notebook prepares the Iris dataset for multiclass classification with the ``MulticlassCarver`` pipeline. ``MulticlassCarver`` is a Python tool for association-maximizing discretization that handles both quantitative and qualitative features.\n", "\n", "The Iris dataset, a classic in machine learning, records sepal and petal dimensions for three Iris species. All four of its features are quantitative, so this notebook only exercises the quantitative side of the pipeline: each feature is discretized into a small number of modalities chosen for their association with the species.\n", "\n", "The sections below load and sample the data, pick the features to carve, configure ``MulticlassCarver``, and inspect the raw and carved distributions it prints while fitting.\n", "\n", "The result is a discretized dataset whose features separate the Iris species and can be fed to a multiclass classification model.\n", "\n", "\n", "## Installation" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %pip install AutoCarver[jupyter]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iris Data\n", "\n", "In this example notebook, we will use the Iris dataset.\n", "\n", "The Iris dataset is a classic and widely used dataset in the field of machine learning and pattern recognition. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 and has since become a benchmark dataset for various classification and clustering tasks.\n", "\n", "The dataset consists of measurements from 150 iris flowers, belonging to three different species: setosa, versicolor, and virginica. Four features are included for each flower: sepal length, sepal width, petal length, and petal width, all measured in centimeters.\n", "\n", "The Iris dataset is typically used to classify iris flowers into one of the three species from these four features (multiclass classification)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)iris_type
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "0 5.1 3.5 1.4 0.2 \n", "1 4.9 3.0 1.4 0.2 \n", "2 4.7 3.2 1.3 0.2 \n", "3 4.6 3.1 1.5 0.2 \n", "4 5.0 3.6 1.4 0.2 \n", "\n", " iris_type \n", "0 setosa \n", "1 setosa \n", "2 setosa \n", "3 setosa \n", "4 setosa " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import datasets\n", "\n", "# Load dataset directly from sklearn\n", "iris = datasets.load_iris(as_frame=True)\n", "\n", "# features are already a pandas DataFrame (as_frame=True); add the species name as target\n", "iris_data = iris[\"data\"]\n", "iris_data[\"iris_type\"] = list(map(lambda u: iris[\"target_names\"][u], iris[\"target\"]))\n", "\n", "# Display the first few rows of the dataset\n", "iris_data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Target type and Carver selection" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "iris_type\n", "setosa 50\n", "versicolor 50\n", "virginica 50\n", "Name: count, dtype: int64" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "target = \"iris_type\"\n", "\n", "iris_data[target].value_counts(dropna=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The target ``\"iris_type\"`` is a multiclass target of type ``str`` used in a classification task. Hence, we will use ``AutoCarver.MulticlassCarver`` and ``AutoCarver.selectors.ClassificationSelector`` in the following code blocks." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Sampling" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(iris_type\n", " setosa 0.34\n", " virginica 0.33\n", " versicolor 0.33\n", " Name: proportion, dtype: float64,\n", " iris_type\n", " virginica 0.34\n", " versicolor 0.34\n", " setosa 0.32\n", " Name: proportion, dtype: float64)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# stratified sampling by target\n", "train_set, dev_set = train_test_split(iris_data, test_size=0.33, random_state=42, stratify=iris_data[target])\n", "\n", "# checking target rate per dataset\n", "train_set[target].value_counts(dropna=False, normalize=True), dev_set[target].value_counts(dropna=False, normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Picking up columns to Carve" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal length (cm)sepal width (cm)petal length (cm)petal width (cm)iris_type
1366.33.45.62.4virginica
175.13.51.40.3setosa
1425.82.75.11.9virginica
595.22.73.91.4versicolor
64.63.41.40.3setosa
\n", "
" ], "text/plain": [ " sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) \\\n", "136 6.3 3.4 5.6 2.4 \n", "17 5.1 3.5 1.4 0.3 \n", "142 5.8 2.7 5.1 1.9 \n", "59 5.2 2.7 3.9 1.4 \n", "6 4.6 3.4 1.4 0.3 \n", "\n", " iris_type \n", "136 virginica \n", "17 setosa \n", "142 virginica \n", "59 versicolor \n", "6 setosa " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_set.head()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sepal length (cm) float64\n", "sepal width (cm) float64\n", "petal length (cm) float64\n", "petal width (cm) float64\n", "iris_type object\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# column data types\n", "train_set.dtypes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']\n" ] } ], "source": [ "print(iris[\"feature_names\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All four features are continuous quantitative features. They are passed to ``Features`` through its ``quantitatives`` argument." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\defra\\Desktop\\git\\PROJECTS\\AutoCarver\\AutoCarver\\combinations\\utils\\combination_evaluator.py:10: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from tqdm.autonotebook import tqdm\n" ] } ], "source": [ "from AutoCarver import Features\n", "\n", "# lists of features per data type\n", "features = Features(quantitatives=[\"sepal length (cm)\", \"sepal width (cm)\", \"petal length (cm)\", \"petal width (cm)\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Using AutoCarver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoCarver settings\n", "\n", "### Representativeness of modalities\n", "\n", "The attribute ``min_freq`` sets the minimum frequency of each basic modality. It is used by **Discretizers**:\n", "\n", "- For quantitative features, it defines the number of quantiles used to initially discretize the features (a lower ``min_freq`` means more quantiles).\n", "\n", "- For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "min_freq = 0.05" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip:** ``min_freq`` should be set between ``0.01`` (slower, more precise, less robust) and ``0.2`` (faster, more robust)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Desired number of modalities\n", "\n", "The attribute ``max_n_mod`` sets the maximum number of modalities per carved feature. It is used by **Carvers** as the upper limit on the number of modalities per consecutive combination of modalities." 
] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "max_n_mod = 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Tip:** ``max_n_mod`` should be set between ``3`` (faster, more robust) and ``7`` (slower, more precise, less robust)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping NaNs\n", "\n", "The attribute ``dropna`` controls whether ``nan`` should be grouped with another modality. If set to ``True``, **Carvers** will first find the most suitable combination of non-NaN values, and then test out all possible combinations with ``nan``." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "dropna = False # there are no NaNs in this dataset anyway" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Type of output carved features\n", "\n", "The attribute ``ordinal_encoding`` allows one to choose the output type:\n", "\n", "* Use ``True`` for integer output of ranked modalities (default)\n", "* Use ``False`` for string output of modalities" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "ordinal_encoding = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fitting AutoCarver\n", "\n", "* First, all quantitative features are discretized:\n", " 1. Using ``ContinuousDiscretizer`` for quantile discretization that keeps track of over-represented values (more frequent than ``min_freq``)\n", " 2. Using ``OrdinalDiscretizer`` to group any remaining under-represented values (less frequent than ``min_freq/2``) with their closest modality\n", "\n", "* Second, all features are carved following this recipe, for all classes of ``train_set[target]`` (except one):\n", " 1. The raw distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the discretization step\n", " 2. 
Grouping modalities: all consecutive combinations of modalities are applied to ``train_set``\n", " 3. Computing associations: the association metric (Tschuprow's T, by default) is computed with the provided target ``train_set[target]``\n", " 4. Combinations are sorted in descending order by association value\n", " 5. Testing robustness: the first combination that satisfies all of the following is selected:\n", " - Representativeness of modalities on ``train_set`` and ``dev_set`` (all should be more frequent than ``min_freq/2``)\n", " - Distinct target rates between consecutive modalities on ``train_set`` and ``dev_set``\n", " - No inversion of target rates between ``train_set`` and ``dev_set`` (same ordering of modalities by target rate)\n", " 6. (Optional) If requested via ``dropna=True`` and if the feature contains any ``nan``, all combinations of modalities with ``nan`` are applied to ``train_set`` and steps 3. and 4. are run\n", " 7. The carved distribution is printed out on provided ``train_set`` and ``dev_set``. It's the output of the carving step" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING: can't set copy=True for MulticlassCarver (no inplace DataFrame.assign).\n", "\n", "---------\n", "[MulticlassCarver] Fit y=versicolor (1/2)\n", "------\n", "------\n", "--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])\n", " - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])\n", " - [OrdinalDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])\n", "------\n", "\n", "---------\n", "------ [BinaryCarver] Fit Features(['sepal length (cm)__y=versicolor', 
'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])\n", "--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=versicolor') (1/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 4.40e+000.00000.0200
4.40e+00 < x <= 4.70e+000.00000.0400
4.70e+00 < x <= 4.80e+000.00000.0500
4.80e+00 < x <= 4.90e+000.00000.0300
4.90e+00 < x <= 5.00e+000.14290.0700
5.00e+00 < x <= 5.10e+000.16670.0600
5.10e+00 < x <= 5.20e+000.33330.0300
5.20e+00 < x <= 5.40e+000.00000.0400
5.40e+00 < x <= 5.50e+000.66670.0600
5.50e+00 < x <= 5.60e+000.66670.0300
5.60e+00 < x <= 5.70e+000.50000.0400
5.70e+00 < x <= 5.80e+000.40000.0500
5.80e+00 < x <= 5.90e+000.66670.0300
5.90e+00 < x <= 6.00e+000.66670.0300
6.00e+00 < x <= 6.10e+001.00000.0300
6.10e+00 < x <= 6.20e+000.66670.0300
6.20e+00 < x <= 6.30e+000.28570.0700
6.30e+00 < x <= 6.40e+000.25000.0400
6.40e+00 < x <= 6.50e+000.50000.0200
6.50e+00 < x <= 6.70e+000.66670.0600
6.70e+00 < x <= 6.80e+000.33330.0300
6.80e+00 < x <= 6.90e+000.33330.0300
6.90e+00 < x <= 7.10e+000.50000.0200
7.10e+00 < x <= 7.20e+000.00000.0300
7.20e+00 < x <= 7.60e+000.00000.0200
7.60e+00 < x0.00000.0400
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.00000.0400
0.00000.0600
nan0.0000
0.33330.0600
0.33330.0600
0.00000.0600
0.00000.0200
0.33330.0600
1.00000.0200
1.00000.0600
0.75000.0800
0.50000.0400
nan0.0000
0.66670.0600
0.33330.0600
0.00000.0200
0.50000.0400
0.33330.0600
0.00000.0600
0.25000.0800
nan0.0000
0.00000.0200
nan0.0000
nan0.0000
0.00000.0200
0.00000.0200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 15274/15275 [00:02<00:00, 7187.52it/s]\n", "Computing associations: 100%|██████████| 15275/15275 [00:03<00:00, 4329.98it/s]\n", "Testing robustness : 0%| | 3/15275 [00:00<04:02, 63.07it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " [BinaryCarver] Carved distribution\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 5.4e+000.08820.3400
5.4e+00 < x <= 6.2e+000.63330.3000
6.2e+00 < x0.30560.3600
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.16670.3600
0.64710.3400
0.20000.3000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=versicolor') (2/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 2.20e+000.75000.0400
2.20e+00 < x <= 2.40e+001.00000.0300
2.40e+00 < x <= 2.50e+000.60000.0500
2.50e+00 < x <= 2.60e+000.75000.0400
2.60e+00 < x <= 2.70e+000.57140.0700
2.70e+00 < x <= 2.80e+000.44440.0900
2.80e+00 < x <= 2.90e+000.66670.0600
2.90e+00 < x <= 3.00e+000.28570.1400
3.00e+00 < x <= 3.10e+000.33330.0900
3.10e+00 < x <= 3.20e+000.22220.0900
3.20e+00 < x <= 3.30e+000.00000.0400
3.30e+00 < x <= 3.40e+000.00000.0600
3.40e+00 < x <= 3.50e+000.00000.0600
3.50e+00 < x <= 3.70e+000.00000.0400
3.70e+00 < x <= 3.80e+000.00000.0500
3.80e+00 < x <= 4.10e+000.00000.0300
4.10e+00 < x0.00000.0200
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
nan0.0000
0.75000.0800
0.33330.0600
0.00000.0200
0.50000.0400
0.40000.1000
0.75000.0800
0.33330.2400
0.00000.0400
0.25000.0800
0.50000.0400
0.16670.1200
nan0.0000
0.00000.0600
0.00000.0200
0.00000.0200
nan0.0000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 2515/2516 [00:00<00:00, 7890.53it/s]\n", "Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4457.58it/s]\n", "Testing robustness : 0%| | 0/2516 [00:00\n", "#T_18ec8_row0_col0, #T_18ec8_row1_col1 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "#T_18ec8_row0_col1, #T_18ec8_row1_col0 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 2.9e+000.63160.3800
2.9e+00 < x0.14520.6200
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.52630.3800
0.22580.6200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=versicolor') (3/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 1.30e+000.00000.0400
1.30e+00 < x <= 1.40e+000.00000.1100
1.40e+00 < x <= 1.50e+000.00000.0900
1.50e+00 < x <= 1.60e+000.00000.0700
1.60e+00 < x <= 1.90e+000.00000.0300
1.90e+00 < x <= 3.50e+001.00000.0300
3.50e+00 < x <= 3.70e+001.00000.0200
3.70e+00 < x <= 4.00e+001.00000.0700
4.00e+00 < x <= 4.20e+001.00000.0300
4.20e+00 < x <= 4.30e+001.00000.0200
4.30e+00 < x <= 4.50e+001.00000.0500
4.50e+00 < x <= 4.60e+001.00000.0300
4.60e+00 < x <= 4.70e+001.00000.0300
4.70e+00 < x <= 4.80e+000.66670.0300
4.80e+00 < x <= 4.90e+000.50000.0400
4.90e+00 < x <= 5.00e+000.00000.0300
5.00e+00 < x <= 5.10e+000.16670.0600
5.10e+00 < x <= 5.40e+000.00000.0200
5.40e+00 < x <= 5.60e+000.00000.0500
5.60e+00 < x <= 5.70e+000.00000.0300
5.70e+00 < x <= 5.90e+000.00000.0300
5.90e+00 < x <= 6.10e+000.00000.0500
6.10e+00 < x <= 6.60e+000.00000.0200
6.60e+00 < x0.00000.0200
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.00000.1400
0.00000.0400
0.00000.0800
nan0.0000
0.00000.0600
1.00000.0400
nan0.0000
1.00000.0400
1.00000.0800
nan0.0000
0.85710.1400
nan0.0000
1.00000.0400
0.00000.0200
0.00000.0200
1.00000.0200
0.00000.0400
0.00000.0800
0.00000.0800
nan0.0000
0.00000.0400
nan0.0000
0.00000.0200
0.00000.0200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 10901/10902 [00:01<00:00, 7971.25it/s]\n", "Computing associations: 100%|██████████| 10902/10902 [00:02<00:00, 4194.36it/s]\n", "Testing robustness : 0%| | 0/10902 [00:00\n", "#T_774bf_row0_col0, #T_774bf_row1_col1 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "#T_774bf_row0_col1 {\n", " background-color: #f4987a;\n", " color: #000000;\n", "}\n", "#T_774bf_row1_col0, #T_774bf_row2_col1 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "#T_774bf_row2_col0 {\n", " background-color: #5572df;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 1.9e+000.00000.3400
1.9e+00 < x <= 4.8e+000.96770.3100
4.8e+00 < x0.08570.3500
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.00000.3200
0.88890.3600
0.06250.3200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=versicolor') (4/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 1.00e-010.00000.0500
1.00e-01 < x <= 2.00e-010.00000.1700
2.00e-01 < x <= 3.00e-010.00000.0500
3.00e-01 < x <= 6.00e-010.00000.0700
6.00e-01 < x <= 1.00e+001.00000.0400
1.00e+00 < x <= 1.10e+001.00000.0200
1.10e+00 < x <= 1.20e+001.00000.0500
1.20e+00 < x <= 1.30e+001.00000.0800
1.30e+00 < x <= 1.40e+001.00000.0600
1.40e+00 < x <= 1.50e+000.85710.0700
1.50e+00 < x <= 1.60e+000.50000.0200
1.60e+00 < x <= 1.80e+000.14290.0700
1.80e+00 < x <= 1.90e+000.00000.0400
1.90e+00 < x <= 2.00e+000.00000.0400
2.00e+00 < x <= 2.20e+000.00000.0700
2.20e+00 < x <= 2.30e+000.00000.0500
2.30e+00 < x <= 2.40e+000.00000.0200
2.40e+00 < x0.00000.0300
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
nan0.0000
0.00000.2400
0.00000.0400
0.00000.0400
1.00000.0600
1.00000.0200
nan0.0000
1.00000.1000
0.50000.0400
0.80000.1000
1.00000.0400
0.14290.1400
0.00000.0200
0.00000.0400
0.00000.0400
0.00000.0600
0.00000.0200
nan0.0000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 3212/3213 [00:00<00:00, 8682.61it/s]\n", "Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 4119.92it/s]\n", "Testing robustness : 0%| | 0/3213 [00:00\n", "#T_2b119_row0_col0, #T_2b119_row1_col1 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "#T_2b119_row0_col1, #T_2b119_row1_col0, #T_2b119_row2_col1 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "#T_2b119_row2_col0 {\n", " background-color: #4c66d6;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
 target_ratefrequency
x <= 6.0e-010.00000.3400
6.0e-01 < x <= 1.5e+000.96880.3200
1.5e+00 < x0.05880.3400
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_ratefrequency
0.00000.3200
0.87500.3200
0.16670.3600
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "---------\n", "\n", "\n", "---------\n", "[MulticlassCarver] Fit y=virginica (2/2)\n", "------\n", "------\n", "--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])\n", " - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])\n", " - [OrdinalDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])\n", "------\n", "\n", "---------\n", "------ [BinaryCarver] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])\n", "--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=virginica') (1/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                          target_rate  frequency
x <= 4.40e+00             0.0000       0.0200
4.40e+00 < x <= 4.70e+00  0.0000       0.0400
4.70e+00 < x <= 4.80e+00  0.0000       0.0500
4.80e+00 < x <= 4.90e+00  0.0000       0.0300
4.90e+00 < x <= 5.00e+00  0.0000       0.0700
5.00e+00 < x <= 5.10e+00  0.0000       0.0600
5.10e+00 < x <= 5.30e+00  0.0000       0.0400
5.30e+00 < x <= 5.40e+00  0.0000       0.0300
5.40e+00 < x <= 5.50e+00  0.0000       0.0600
5.50e+00 < x <= 5.60e+00  0.3333       0.0300
5.60e+00 < x <= 5.70e+00  0.2500       0.0400
5.70e+00 < x <= 5.80e+00  0.6000       0.0500
5.80e+00 < x <= 5.90e+00  0.3333       0.0300
5.90e+00 < x <= 6.00e+00  0.3333       0.0300
6.00e+00 < x <= 6.10e+00  0.0000       0.0300
6.10e+00 < x <= 6.20e+00  0.3333       0.0300
6.20e+00 < x <= 6.30e+00  0.7143       0.0700
6.30e+00 < x <= 6.40e+00  0.7500       0.0400
6.40e+00 < x <= 6.50e+00  0.5000       0.0200
6.50e+00 < x <= 6.70e+00  0.3333       0.0600
6.70e+00 < x <= 6.80e+00  0.6667       0.0300
6.80e+00 < x <= 6.90e+00  0.6667       0.0300
6.90e+00 < x <= 7.10e+00  0.5000       0.0200
7.10e+00 < x <= 7.20e+00  1.0000       0.0300
7.20e+00 < x <= 7.60e+00  1.0000       0.0200
7.60e+00 < x              1.0000       0.0400
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.0000       0.0400
0.0000       0.0600
nan          0.0000
0.3333       0.0600
0.0000       0.0600
0.0000       0.0600
0.0000       0.0200
0.0000       0.0600
0.0000       0.0200
0.0000       0.0600
0.0000       0.0800
0.0000       0.0400
nan          0.0000
0.3333       0.0600
0.6667       0.0600
1.0000       0.0200
0.5000       0.0400
0.6667       0.0600
1.0000       0.0600
0.7500       0.0800
nan          0.0000
1.0000       0.0200
nan          0.0000
nan          0.0000
1.0000       0.0200
1.0000       0.0200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 15274/15275 [00:02<00:00, 7499.41it/s]\n", "Computing associations: 100%|██████████| 15275/15275 [00:03<00:00, 4444.34it/s]\n", "Testing robustness : 0%| | 0/15275 [00:00\n", "#T_16a1b_row0_col0, #T_16a1b_row1_col1 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "#T_16a1b_row0_col1, #T_16a1b_row1_col0 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                        target_rate  frequency
x <= 6.2e+00            0.1250       0.6400
6.2e+00 < x             0.6944       0.3600
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.1429       0.7000
0.8000       0.3000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=virginica') (2/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                          target_rate  frequency
x <= 2.20e+00             0.2500       0.0400
2.20e+00 < x <= 2.40e+00  0.0000       0.0300
2.40e+00 < x <= 2.50e+00  0.4000       0.0500
2.50e+00 < x <= 2.60e+00  0.2500       0.0400
2.60e+00 < x <= 2.70e+00  0.4286       0.0700
2.70e+00 < x <= 2.80e+00  0.5556       0.0900
2.80e+00 < x <= 2.90e+00  0.1667       0.0600
2.90e+00 < x <= 3.00e+00  0.4286       0.1400
3.00e+00 < x <= 3.10e+00  0.2222       0.0900
3.10e+00 < x <= 3.20e+00  0.5556       0.0900
3.20e+00 < x <= 3.30e+00  0.7500       0.0400
3.30e+00 < x <= 3.40e+00  0.1667       0.0600
3.40e+00 < x <= 3.50e+00  0.0000       0.0600
3.50e+00 < x <= 3.70e+00  0.2500       0.0400
3.70e+00 < x <= 3.80e+00  0.4000       0.0500
3.80e+00 < x <= 4.10e+00  0.0000       0.0300
4.10e+00 < x              0.0000       0.0200
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
nan          0.0000
0.0000       0.0800
0.6667       0.0600
1.0000       0.0200
0.5000       0.0400
0.6000       0.1000
0.2500       0.0800
0.5000       0.2400
1.0000       0.0400
0.0000       0.0800
0.0000       0.0400
0.1667       0.1200
nan          0.0000
0.0000       0.0600
0.0000       0.0200
0.0000       0.0200
nan          0.0000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 2515/2516 [00:00<00:00, 7545.34it/s]\n", "Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4394.20it/s]\n", "Testing robustness : 1%| | 29/2516 [00:00<00:08, 302.81it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n", " [BinaryCarver] Carved distribution\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                        target_rate  frequency
x <= 2.4e+00            0.1429       0.0700
2.4e+00 < x <= 3.3e+00  0.4179       0.6700
3.3e+00 < x             0.1538       0.2600
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.0000       0.0800
0.4571       0.7000
0.0909       0.2200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=virginica') (3/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                          target_rate  frequency
x <= 1.30e+00             0.0000       0.0400
1.30e+00 < x <= 1.40e+00  0.0000       0.1100
1.40e+00 < x <= 1.50e+00  0.0000       0.0900
1.50e+00 < x <= 1.60e+00  0.0000       0.0700
1.60e+00 < x <= 1.90e+00  0.0000       0.0300
1.90e+00 < x <= 3.50e+00  0.0000       0.0300
3.50e+00 < x <= 3.70e+00  0.0000       0.0200
3.70e+00 < x <= 4.00e+00  0.0000       0.0700
4.00e+00 < x <= 4.20e+00  0.0000       0.0300
4.20e+00 < x <= 4.30e+00  0.0000       0.0200
4.30e+00 < x <= 4.50e+00  0.0000       0.0500
4.50e+00 < x <= 4.60e+00  0.0000       0.0300
4.60e+00 < x <= 4.70e+00  0.0000       0.0300
4.70e+00 < x <= 4.80e+00  0.3333       0.0300
4.80e+00 < x <= 4.90e+00  0.5000       0.0400
4.90e+00 < x <= 5.00e+00  1.0000       0.0300
5.00e+00 < x <= 5.10e+00  0.8333       0.0600
5.10e+00 < x <= 5.40e+00  1.0000       0.0200
5.40e+00 < x <= 5.60e+00  1.0000       0.0500
5.60e+00 < x <= 5.70e+00  1.0000       0.0300
5.70e+00 < x <= 5.90e+00  1.0000       0.0300
5.90e+00 < x <= 6.10e+00  1.0000       0.0500
6.10e+00 < x <= 6.60e+00  1.0000       0.0200
6.60e+00 < x              1.0000       0.0200
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.0000       0.1400
0.0000       0.0400
0.0000       0.0800
nan          0.0000
0.0000       0.0600
0.0000       0.0400
nan          0.0000
0.0000       0.0400
0.0000       0.0800
nan          0.0000
0.1429       0.1400
nan          0.0000
0.0000       0.0400
1.0000       0.0200
1.0000       0.0200
0.0000       0.0200
1.0000       0.0400
1.0000       0.0800
1.0000       0.0800
nan          0.0000
1.0000       0.0400
nan          0.0000
1.0000       0.0200
1.0000       0.0200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 10901/10902 [00:01<00:00, 7786.74it/s]\n", "Computing associations: 100%|██████████| 10902/10902 [00:02<00:00, 4258.68it/s]\n", "Testing robustness : 0%| | 0/10902 [00:00\n", "#T_688f4_row0_col0, #T_688f4_row1_col1 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "#T_688f4_row0_col1, #T_688f4_row1_col0 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                        target_rate  frequency
x <= 4.8e+00            0.0154       0.6500
4.8e+00 < x             0.9143       0.3500
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.0588       0.6800
0.9375       0.3200
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=virginica') (4/4)\n", " [BinaryCarver] Raw distribution\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                          target_rate  frequency
x <= 1.00e-01             0.0000       0.0500
1.00e-01 < x <= 2.00e-01  0.0000       0.1700
2.00e-01 < x <= 3.00e-01  0.0000       0.0500
3.00e-01 < x <= 6.00e-01  0.0000       0.0700
6.00e-01 < x <= 1.00e+00  0.0000       0.0400
1.00e+00 < x <= 1.10e+00  0.0000       0.0200
1.10e+00 < x <= 1.20e+00  0.0000       0.0500
1.20e+00 < x <= 1.30e+00  0.0000       0.0800
1.30e+00 < x <= 1.40e+00  0.0000       0.0600
1.40e+00 < x <= 1.50e+00  0.1429       0.0700
1.50e+00 < x <= 1.60e+00  0.5000       0.0200
1.60e+00 < x <= 1.80e+00  0.8571       0.0700
1.80e+00 < x <= 1.90e+00  1.0000       0.0400
1.90e+00 < x <= 2.00e+00  1.0000       0.0400
2.00e+00 < x <= 2.20e+00  1.0000       0.0700
2.20e+00 < x <= 2.30e+00  1.0000       0.0500
2.30e+00 < x <= 2.40e+00  1.0000       0.0200
2.40e+00 < x              1.0000       0.0300
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
nan          0.0000
0.0000       0.2400
0.0000       0.0400
0.0000       0.0400
0.0000       0.0600
0.0000       0.0200
nan          0.0000
0.0000       0.1000
0.5000       0.0400
0.2000       0.1000
0.0000       0.0400
0.8571       0.1400
1.0000       0.0200
1.0000       0.0400
1.0000       0.0400
1.0000       0.0600
1.0000       0.0200
nan          0.0000
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "Grouping modalities : 100%|█████████▉| 3212/3213 [00:00<00:00, 7088.72it/s]\n", "Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 3949.35it/s]\n", "Testing robustness : 0%| | 0/3213 [00:00\n", "#T_89d06_row0_col0, #T_89d06_row1_col1 {\n", " background-color: #3b4cc0;\n", " color: #f1f1f1;\n", "}\n", "#T_89d06_row0_col1, #T_89d06_row1_col0 {\n", " background-color: #b40426;\n", " color: #f1f1f1;\n", "}\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X distribution
                        target_rate  frequency
x <= 1.5e+00            0.0152       0.6600
1.5e+00 < x             0.9412       0.3400
\n", " \n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
X_dev distribution
target_rate  frequency
0.0625       0.6400
0.8333       0.3600
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "---------\n", "\n" ] } ], "source": [ "from AutoCarver import MulticlassCarver\n", "\n", "# initiating AutoCarver\n", "auto_carver = MulticlassCarver(\n", " features=features,\n", " ordinal_encoding=ordinal_encoding,\n", " max_n_mod=max_n_mod,\n", " min_freq=min_freq,\n", " dropna=dropna,\n", " verbose=True, # showing statistics\n", " copy=True, # whether or not to return a copy of the input dataset\n", ")\n", "\n", "# fitting on training sample, a dev sample can be specified to evaluate carving robustness\n", "train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## AutoCarver analysis\n", "\n", "### Carving Summary" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature                                          cramerv   tschuprowt  n_mod  label  content                 target_rate  frequency
Quantitative('sepal length (cm)__y=versicolor')  0.464436  0.390543    3      0      x <= 5.4e+00            0.088235     0.34
                                                                              1      5.4e+00 < x <= 6.2e+00  0.633333     0.30
                                                                              2      6.2e+00 < x             0.305556     0.36
Quantitative('sepal width (cm)__y=versicolor')   0.480207  0.480207    2      0      x <= 2.9e+00            0.631579     0.38
                                                                              1      2.9e+00 < x             0.145161     0.62
Quantitative('petal length (cm)__y=versicolor')  0.912237  0.767096    3      0      x <= 1.9e+00            0.000000     0.34
                                                                              1      1.9e+00 < x <= 4.8e+00  0.967742     0.31
                                                                              2      4.8e+00 < x             0.085714     0.35
Quantitative('petal width (cm)__y=versicolor')   0.933300  0.784809    3      0      x <= 6.0e-01            0.000000     0.34
                                                                              1      6.0e-01 < x <= 1.5e+00  0.968750     0.32
                                                                              2      1.5e+00 < x             0.058824     0.34
Quantitative('sepal length (cm)__y=virginica')   0.559144  0.559144    2      0      x <= 6.2e+00            0.125000     0.64
                                                                              1      6.2e+00 < x             0.694444     0.36
Quantitative('sepal width (cm)__y=virginica')    0.266452  0.224058    3      0      x <= 2.4e+00            0.142857     0.07
                                                                              1      2.4e+00 < x <= 3.3e+00  0.417910     0.67
                                                                              2      3.3e+00 < x             0.153846     0.26
Quantitative('petal length (cm)__y=virginica')   0.889524  0.889524    2      0      x <= 4.8e+00            0.015385     0.65
                                                                              1      4.8e+00 < x             0.914286     0.35
Quantitative('petal width (cm)__y=virginica')    0.910463  0.910463    2      0      x <= 1.5e+00            0.015152     0.66
                                                                              1      1.5e+00 < x             0.941176     0.34
\n", "
" ], "text/plain": [ " content \\\n", "feature cramerv tschuprowt n_mod label \n", "Quantitative('sepal length (cm)__y=versicolor') 0.464436 0.390543 3 0 x <= 5.4e+00 \n", " 1 5.4e+00 < x <= 6.2e+00 \n", " 2 6.2e+00 < x \n", "Quantitative('sepal width (cm)__y=versicolor') 0.480207 0.480207 2 0 x <= 2.9e+00 \n", " 1 2.9e+00 < x \n", "Quantitative('petal length (cm)__y=versicolor') 0.912237 0.767096 3 0 x <= 1.9e+00 \n", " 1 1.9e+00 < x <= 4.8e+00 \n", " 2 4.8e+00 < x \n", "Quantitative('petal width (cm)__y=versicolor') 0.933300 0.784809 3 0 x <= 6.0e-01 \n", " 1 6.0e-01 < x <= 1.5e+00 \n", " 2 1.5e+00 < x \n", "Quantitative('sepal length (cm)__y=virginica') 0.559144 0.559144 2 0 x <= 6.2e+00 \n", " 1 6.2e+00 < x \n", "Quantitative('sepal width (cm)__y=virginica') 0.266452 0.224058 3 0 x <= 2.4e+00 \n", " 1 2.4e+00 < x <= 3.3e+00 \n", " 2 3.3e+00 < x \n", "Quantitative('petal length (cm)__y=virginica') 0.889524 0.889524 2 0 x <= 4.8e+00 \n", " 1 4.8e+00 < x \n", "Quantitative('petal width (cm)__y=virginica') 0.910463 0.910463 2 0 x <= 1.5e+00 \n", " 1 1.5e+00 < x \n", "\n", " target_rate \\\n", "feature cramerv tschuprowt n_mod label \n", "Quantitative('sepal length (cm)__y=versicolor') 0.464436 0.390543 3 0 0.088235 \n", " 1 0.633333 \n", " 2 0.305556 \n", "Quantitative('sepal width (cm)__y=versicolor') 0.480207 0.480207 2 0 0.631579 \n", " 1 0.145161 \n", "Quantitative('petal length (cm)__y=versicolor') 0.912237 0.767096 3 0 0.000000 \n", " 1 0.967742 \n", " 2 0.085714 \n", "Quantitative('petal width (cm)__y=versicolor') 0.933300 0.784809 3 0 0.000000 \n", " 1 0.968750 \n", " 2 0.058824 \n", "Quantitative('sepal length (cm)__y=virginica') 0.559144 0.559144 2 0 0.125000 \n", " 1 0.694444 \n", "Quantitative('sepal width (cm)__y=virginica') 0.266452 0.224058 3 0 0.142857 \n", " 1 0.417910 \n", " 2 0.153846 \n", "Quantitative('petal length (cm)__y=virginica') 0.889524 0.889524 2 0 0.015385 \n", " 1 0.914286 \n", "Quantitative('petal width (cm)__y=virginica') 0.910463 
0.910463 2 0 0.015152 \n", " 1 0.941176 \n", "\n", " frequency \n", "feature cramerv tschuprowt n_mod label \n", "Quantitative('sepal length (cm)__y=versicolor') 0.464436 0.390543 3 0 0.34 \n", " 1 0.30 \n", " 2 0.36 \n", "Quantitative('sepal width (cm)__y=versicolor') 0.480207 0.480207 2 0 0.38 \n", " 1 0.62 \n", "Quantitative('petal length (cm)__y=versicolor') 0.912237 0.767096 3 0 0.34 \n", " 1 0.31 \n", " 2 0.35 \n", "Quantitative('petal width (cm)__y=versicolor') 0.933300 0.784809 3 0 0.34 \n", " 1 0.32 \n", " 2 0.34 \n", "Quantitative('sepal length (cm)__y=virginica') 0.559144 0.559144 2 0 0.64 \n", " 1 0.36 \n", "Quantitative('sepal width (cm)__y=virginica') 0.266452 0.224058 3 0 0.07 \n", " 1 0.67 \n", " 2 0.26 \n", "Quantitative('petal length (cm)__y=virginica') 0.889524 0.889524 2 0 0.65 \n", " 1 0.35 \n", "Quantitative('petal width (cm)__y=virginica') 0.910463 0.910463 2 0 0.66 \n", " 1 0.34 " ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "auto_carver.summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* As requested with ``ordinal_encoding=True``, output labels are integers of modalities\n", "\n", "* Features have been carved for two distinct binary targets:\n", " * ``y=versicolor``: dummy of target ``iris_type`` taking value ``\"versicolor\"``\n", " * ``y=virginica``: dummy of target ``iris_type`` taking value ``\"virginica\"``\n", "\n", "* For quantitative feature ``petal width (cm)``, for ``y=virginica``, the selected combination of modalities groups petal widths as follows:\n", " * label ``0``: lower or equal to 1.5cm (``content=\"x <= 1.5e+00\"``)\n", " * label ``1``: higher than 1.5cm (``content=\"1.5e+00 < x\"``)\n", "\n", "### Detailed overview of tested combinations" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   info  cramerv  tschuprowt  combination  n_mod  dropna  train  viable  dev
0  Raw distribution (n_mod=26>max_n_mod=5)  0.590426  0.264047  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  26  False  NaN  NaN  NaN
1  Not viable  0.483288  0.406395  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  3  False  {'viable': True, 'info': ''}  False  {'viable': False, 'info': 'Non-representative ...
2  Not viable  0.516114  0.392162  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  4  False  {'viable': True, 'info': ''}  False  {'viable': False, 'info': 'Non-representative ...
3  Not viable  0.465045  0.391055  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  3  False  {'viable': True, 'info': ''}  False  {'viable': False, 'info': 'Inversion of target...
4  Best for tschuprowt and max_n_mod=5  0.464436  0.390543  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  3  False  {'viable': True, 'info': ''}  True  {'viable': True, 'info': ''}
5  Not checked  0.463821  0.390025  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  3  False  NaN  NaN  NaN
6  Not checked  0.462885  0.389238  {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <...  3  False  NaN  NaN  NaN
\n", "
" ], "text/plain": [ " info cramerv tschuprowt \\\n", "0 Raw distribution (n_mod=26>max_n_mod=5) 0.590426 0.264047 \n", "1 Not viable 0.483288 0.406395 \n", "2 Not viable 0.516114 0.392162 \n", "3 Not viable 0.465045 0.391055 \n", "4 Best for tschuprowt and max_n_mod=5 0.464436 0.390543 \n", "5 Not checked 0.463821 0.390025 \n", "6 Not checked 0.462885 0.389238 \n", "\n", " combination n_mod dropna \\\n", "0 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 26 False \n", "1 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 3 False \n", "2 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 4 False \n", "3 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 3 False \n", "4 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 3 False \n", "5 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 3 False \n", "6 {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... 3 False \n", "\n", " train viable \\\n", "0 NaN NaN \n", "1 {'viable': True, 'info': ''} False \n", "2 {'viable': True, 'info': ''} False \n", "3 {'viable': True, 'info': ''} False \n", "4 {'viable': True, 'info': ''} True \n", "5 NaN NaN \n", "6 NaN NaN \n", "\n", " dev \n", "0 NaN \n", "1 {'viable': False, 'info': 'Non-representative ... \n", "2 {'viable': False, 'info': 'Non-representative ... \n", "3 {'viable': False, 'info': 'Inversion of target... 
\n", "4 {'viable': True, 'info': ''} \n", "5 NaN \n", "6 NaN " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features['sepal length (cm)__y=versicolor'].history.head(7)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'viable': False, 'info': 'Non-representative modality for min_freq=5.00%'}" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "features['sepal length (cm)__y=versicolor'].history.dev[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The most associated combination of feature ``sepal length (cm)__y=versicolor`` (the first one tested, where ``info!=\"Raw distribution\"``) did not pass the viability tests. When looking in ``history.dev``:\n", " * ``\"Non-representative modality for min_freq=5.00%\"``: tells us that a modality falls below the minimum frequency ``min_freq`` in ``dev_set``, i.e. it is not stable between ``train_set`` and ``dev_set``\n", "\n", "* For feature ``sepal length (cm)__y=versicolor``, the 4th combination is the first to pass the tests:\n", " - ``info=\"Best for tschuprowt and max_n_mod=5\"``\n", " - Tschuprow's T with ``iris_type`` is ``0.390543`` for this combination (by default, combinations are ranked according to this statistic)\n", " - The following combinations (less associated with the target) were not tested: ``info=\"Not checked\"``\n", "\n", "* For all combinations, ``dropna=False`` means that it is not a combination in which ``nan``s are grouped with other modalities (as requested with ``dropna=False``)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Saving and Loading AutoCarver" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Saving\n", "\n", "All **Carvers** can safely be stored as a .json file."
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "auto_carver.save(\"multiclass_carver.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Loading\n", "\n", "**Carvers** can safely be loaded from a .json file." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from AutoCarver import MulticlassCarver\n", "\n", "auto_carver = MulticlassCarver.load(\"multiclass_carver.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Applying AutoCarver" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "dev_set_processed = auto_carver.transform(dev_set)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
     sepal length (cm)__y=versicolor  sepal width (cm)__y=versicolor  petal length (cm)__y=versicolor  petal width (cm)__y=versicolor  sepal length (cm)__y=virginica  sepal width (cm)__y=virginica  petal length (cm)__y=virginica  petal width (cm)__y=virginica
0.0  0.36  0.38  0.32  0.32  0.7  0.08  0.68  0.64
1.0  0.34  0.62  0.36  0.32  0.3  0.70  0.32  0.36
2.0  0.30  NaN  0.32  0.36  NaN  0.22  NaN  NaN
\n", "
" ], "text/plain": [ " sepal length (cm)__y=versicolor sepal width (cm)__y=versicolor \\\n", "0.0 0.36 0.38 \n", "1.0 0.34 0.62 \n", "2.0 0.30 NaN \n", "\n", " petal length (cm)__y=versicolor petal width (cm)__y=versicolor \\\n", "0.0 0.32 0.32 \n", "1.0 0.36 0.32 \n", "2.0 0.32 0.36 \n", "\n", " sepal length (cm)__y=virginica sepal width (cm)__y=virginica \\\n", "0.0 0.7 0.08 \n", "1.0 0.3 0.70 \n", "2.0 NaN 0.22 \n", "\n", " petal length (cm)__y=virginica petal width (cm)__y=virginica \n", "0.0 0.68 0.64 \n", "1.0 0.32 0.36 \n", "2.0 NaN NaN " ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Feature Selection\n", "## Selectors settings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Features to select from\n", "\n", "Here all features have been carved using ``MulticlassCarver``, hence all features are qualitative." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Number of features to select\n", "\n", "The attribute ``n_best_per_type`` allows one to choose the number of features to be selected per data type (quantitative and qualitative)."
] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "n_best_per_type = 4 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Selectors" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " [ClassificationSelector] Selected Features\n" ] }, { "data": { "text/html": [ "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   feature                                          NanMeasure  ModeMeasure  TschuprowtMeasure  TschuprowtRank  TschuprowtFilter  TschuprowtWith
3  Quantitative('petal width (cm)__y=versicolor')   0.0000      0.3400       0.9558             0               0.0000            itself
2  Quantitative('petal length (cm)__y=versicolor')  0.0000      0.3500       0.9421             1               0.9274            petal width (cm)__y=versicolor
7  Quantitative('petal width (cm)__y=virginica')    0.0000      0.6600       0.7857             2               0.8409            petal width (cm)__y=versicolor
6  Quantitative('petal length (cm)__y=virginica')   0.0000      0.6500       0.7695             3               0.8675            petal width (cm)__y=virginica
0  Quantitative('sepal length (cm)__y=versicolor')  0.0000      0.3600       0.6728             4               0.6888            petal length (cm)__y=versicolor
4  Quantitative('sepal length (cm)__y=virginica')   0.0000      0.6400       0.5441             5               0.8409            sepal length (cm)__y=versicolor
1  Quantitative('sepal width (cm)__y=versicolor')   0.0000      0.6200       0.4950             6               0.5069            petal width (cm)__y=versicolor
5  Quantitative('sepal width (cm)__y=virginica')    0.0000      0.6700       0.4868             7               0.5168            petal width (cm)__y=versicolor
\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "Features(['petal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=virginica', 'petal length (cm)__y=virginica'])" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from AutoCarver.selectors import ClassificationSelector\n", "\n", "# select the most target associated qualitative features\n", "feature_selector = ClassificationSelector(\n", " features=features,\n", " n_best_per_type=n_best_per_type,\n", " verbose=True, # displays statistics\n", ")\n", "best_features = feature_selector.select(train_set_processed, train_set_processed[target])\n", "best_features" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal width (cm)__y=versicolorpetal length (cm)__y=versicolorpetal width (cm)__y=virginicapetal length (cm)__y=virginica
1362.02.01.01.0
170.00.00.00.0
1422.02.01.01.0
591.01.00.00.0
60.00.00.00.0
\n", "
" ], "text/plain": [ " petal width (cm)__y=versicolor petal length (cm)__y=versicolor \\\n", "136 2.0 2.0 \n", "17 0.0 0.0 \n", "142 2.0 2.0 \n", "59 1.0 1.0 \n", "6 0.0 0.0 \n", "\n", " petal width (cm)__y=virginica petal length (cm)__y=virginica \n", "136 1.0 1.0 \n", "17 0.0 0.0 \n", "142 1.0 1.0 \n", "59 0.0 0.0 \n", "6 0.0 0.0 " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_set_processed[best_features].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Feature ``petal width (cm)__y=versicolor`` is the most associated with the target ``iris_type``:\n", " - Its Tschuprow's T value is ``TschuprowtMeasure=0.9558``\n", " - It has 0 % of NaNs (``NanMeasure=0.0``)\n", " - Its mode, ``0``, represents 34 % of observed data (``ModeMeasure=0.3400``)\n", "\n", "* Feature ``petal length (cm)__y=versicolor`` is strongly associated with feature ``petal width (cm)__y=versicolor``:\n", " - Tschuprow's T value is ``TschuprowtFilter=0.9274`` for ``TschuprowtWith=petal width (cm)__y=versicolor``\n", "\n", "* Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Modeling\n", "Fitting model on train data" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [
{ "data": { "text/plain": [ "XGBClassifier(base_score=None, booster=None, callbacks=None,\n", " colsample_bylevel=None, colsample_bynode=None,\n", " colsample_bytree=None, device=None, early_stopping_rounds=None,\n", " enable_categorical=False, eval_metric=None, feature_types=None,\n", " gamma=None, grow_policy=None, importance_type=None,\n", " interaction_constraints=None, learning_rate=None, max_bin=None,\n", " max_cat_threshold=None, max_cat_to_onehot=None,\n", " max_delta_step=None, max_depth=None, max_leaves=None,\n", " min_child_weight=None, missing=nan, monotone_constraints=None,\n", " multi_strategy=None, n_estimators=None, n_jobs=None,\n", " num_parallel_tree=None, objective='multi:softmax', ...)" ] }, "execution_count": 24, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "from xgboost import XGBClassifier\n", "from sklearn.preprocessing import LabelEncoder\n", "\n", "# Encode string labels to integers\n", "label_encoder = LabelEncoder()\n", "train_set_processed[target] = label_encoder.fit_transform(train_set[target])\n", "\n", "model = XGBClassifier(objective='multi:softmax')\n", "model.fit(train_set_processed[best_features], train_set_processed[target])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Saving the model" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "model.save_model(\"multiclass_xgboost.json\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predicting on the dev set and measuring accuracy" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import accuracy_score\n", "\n", "dev_set_processed[target] = label_encoder.transform(dev_set[target])\n", "dev_pred = model.predict(dev_set_processed[best_features])\n", "accuracy_score(dev_set_processed[target], dev_pred)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's next?\n", "\n", "* Thanks to **Carvers**, all of your features are now optimally processed for your classification task!\n", "* **Selectors** are handy tools for target-optimal data pre-selection, so make sure to check out [Selectors Examples](https://autocarver.readthedocs.io/en/latest/selectors_examples.html)!\n", "\n", "## Well done!\n", "\n", "Your commitment to achieving optimal results in multiclass classification tasks shines through in your meticulous use of **AutoCarver**'s ``MulticlassCarver`` for data preprocessing. 
By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.\n", "\n", "The ``MulticlassCarver`` has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.\n", "\n", "We extend our sincere appreciation for choosing **AutoCarver** as your companion in the data preprocessing journey. Your use of **AutoCarver** demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in multiclass classification tasks.\n", "\n", "As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We're excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.\n", "\n", "Thank you for trusting **AutoCarver**, and we wish you continued success in your data-driven ventures." ] } ], "metadata": { "kernelspec": { "display_name": "autocarver-i96ERKJw-py3.9", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 2 }