Setting things up

About this notebook

This notebook prepares the Iris dataset for multiclass classification using AutoCarver’s OneVsRestCarver pipeline. OneVsRestCarver is a Python tool for association-maximizing discretization that handles both quantitative and qualitative features.

The Iris dataset, a classic in machine learning, describes flowers from three Iris species through their sepal and petal dimensions. Taking one species at a time as a binary target, OneVsRestCarver groups the values of each feature into a small number of modalities that are as associated with that target as possible.

The notebook walks through the discretization pipeline step by step: declaring the features, fitting the carver, reading its output, selecting features and fitting a classifier on the carved data.

Installation

[1]:

# %pip install AutoCarver[jupyter]

Iris Data

In this example notebook, we will use the Iris dataset.

The Iris dataset is a classic and widely used dataset in the field of machine learning and pattern recognition. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 and has since become a benchmark dataset for various classification and clustering tasks.

The dataset consists of measurements from 150 iris flowers, belonging to three different species: setosa, versicolor, and virginica. Four features are included for each flower: sepal length, sepal width, petal length, and petal width, all measured in centimeters.

The primary objective of the Iris dataset is typically to classify iris flowers into one of the three species based on these four features (multiclass classification).

[1]:

from sklearn import datasets

# Load dataset directly from sklearn
iris = datasets.load_iris(as_frame=True)

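# features as a pandas DataFrame (copied so that iris["data"] is left untouched)
iris_data = iris["data"].copy()
# add the species name of each flower as the target column
iris_data["iris_type"] = list(map(lambda u: iris["target_names"][u], iris["target"]))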

# Display the first few rows of the dataset
iris_data.head()

[1]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	iris_type
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
3	4.6	3.1	1.5	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa

Target type and Carver selection

[2]:

target = "iris_type"

iris_data[target].value_counts(dropna=False)

[2]:

iris_type
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64

The target "iris_type" is a multiclass target of type str used in a classification task. Hence we will use AutoCarver.OneVsRestCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.

Data Sampling

[3]:

from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(iris_data, test_size=0.33, random_state=42, stratify=iris_data[target])

# checking target rate per dataset
train_set[target].value_counts(dropna=False, normalize=True), dev_set[target].value_counts(dropna=False, normalize=True)

[3]:

(iris_type
 setosa        0.34
 virginica     0.33
 versicolor    0.33
 Name: proportion, dtype: float64,
 iris_type
 virginica     0.34
 versicolor    0.34
 setosa        0.32
 Name: proportion, dtype: float64)

Picking up columns to Carve

[4]:

train_set.head()

[4]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	iris_type
136	6.3	3.4	5.6	2.4	virginica
17	5.1	3.5	1.4	0.3	setosa
142	5.8	2.7	5.1	1.9	virginica
59	5.2	2.7	3.9	1.4	versicolor
6	4.6	3.4	1.4	0.3	setosa

[5]:

# column data types
train_set.dtypes

[5]:

sepal length (cm)    float64
sepal width (cm)     float64
petal length (cm)    float64
petal width (cm)     float64
iris_type                str
dtype: object

[6]:

print(iris["feature_names"])

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

All features are continuous quantitative features. They are passed to Features through its numericals argument.

[7]:

from AutoCarver import Features

# lists of features per data type
features = Features(numericals=["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"])

Using AutoCarver

AutoCarver settings

Representativeness of modalities

The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used by Discretizers:

For quantitative features, it defines the number of quantiles to initially discretize the features with.
For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.

[8]:

min_freq = 0.05

Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
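
To make the effect of min_freq on quantitative features more concrete, here is a minimal sketch of quantile discretization that uses only the Python standard library. It assumes that the number of quantiles is round(1 / min_freq); this is an assumption made for illustration, not a description of AutoCarver's internals. The helper quantile_cuts and the sample values are hypothetical.

```python
from statistics import quantiles


def quantile_cuts(values, min_freq):
    """Return sorted, de-duplicated cut points for round(1 / min_freq) quantiles."""
    n_quantiles = round(1 / min_freq)  # assumption: min_freq=0.05 gives 20 quantiles
    cuts = quantiles(values, n=n_quantiles, method="inclusive")
    return sorted(set(cuts))  # identical cut points collapse into a single one


# hypothetical petal widths (cm), for illustration only
petal_widths = [0.2, 0.2, 0.3, 0.4, 1.0, 1.3, 1.4, 1.5, 1.8, 2.1, 2.3, 2.5]
cuts = quantile_cuts(petal_widths, min_freq=0.25)
print(cuts)  # three cut points, i.e. four bins of roughly equal frequency
```

A smaller min_freq produces more, narrower initial bins, which is why the tip above associates low values with slower and less robust carving.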

Desired number of modalities

The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit of the number of modalities per consecutive combination of modalities.

[9]:

max_n_mod = 5

Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
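
The notion of "consecutive combinations of modalities" can be illustrated with a short enumeration. The sketch below lists every way of splitting an ordered list of bins into 2 to max_n_mod consecutive groups. Whether AutoCarver enumerates exactly this set is an assumption; the function consecutive_groupings and the bin names are hypothetical.

```python
from itertools import combinations


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # a split into n_groups groups is a choice of n_groups - 1 cuts among the n - 1 gaps
        for cut_points in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cut_points, n)
            yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


bins = ["a", "b", "c", "d"]  # hypothetical ordered modalities
groupings = list(consecutive_groupings(bins, max_n_mod=3))
for grouping in groupings:
    print(grouping)
```

Under these assumptions, 30 ordered bins and max_n_mod=5 already give 27,840 candidate groupings, which illustrates why larger values of max_n_mod make carving slower.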

Grouping NaNs

The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with nan.

[10]:

dropna = False  # anyway, there are no nan in this dataset

Type of output carved features

The attribute ordinal_encoding allows one to choose the output type:

Use True for integer output of ranked modalities (default)
Use False for string output of modalities

[11]:

ordinal_encoding = True
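
The difference between the two output types can be sketched with the bins that this notebook's carving output reports for petal width (cm)__y=versicolor. The function encode below is a hypothetical illustration of ranked versus string output, not AutoCarver's implementation.

```python
from bisect import bisect_left

# cut points and labels of 'petal width (cm)__y=versicolor', as reported in the carving output
cuts = [0.6, 1.6]
labels = ["x <= 6.00e-01", "6.00e-01 < x <= 1.60e+00", "1.60e+00 < x"]


def encode(value, ordinal_encoding=True):
    """Map a raw value to the rank (int) or the label (str) of its bin."""
    rank = bisect_left(cuts, value)  # bins are right-closed: x == cut falls in the lower bin
    return rank if ordinal_encoding else labels[rank]


print(encode(1.4))                          # rank of the middle bin
print(encode(1.4, ordinal_encoding=False))  # label of the middle bin
```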

Fitting AutoCarver

First, all quantitative features are discretized:
1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)
2. Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2) to be grouped with their closest modality
Second, all features are carved following this recipe, for all classes of train_set[target] (except one):
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (Tschuprow’s T, by default) is computed with the provided target train_set[target]
4. Combinations are sorted in descending order by association value
5. Testing robustness: finds the first combination that checks the following:
  - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)
  - Distinct target rates per consecutive modalities on train_set and dev_set
  - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
6. (Optional) If requested via dropna=True, and if the feature has any nan, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run again
7. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
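
The association metrics of step 3 can be reproduced by hand. The sketch below computes Cramér's V and Tschuprow's T from a contingency table with the standard formulas, using only the standard library and no continuity correction. The function association is hypothetical. The counts come from the carved distribution of sepal length (cm)__y=versicolor in this notebook's fitting log (for example, 34 observations with a target mean of 0.0882 means 3 versicolor flowers).

```python
from math import sqrt


def association(table):
    """Return (Cramér's V, Tschuprow's T) for a contingency table given as rows of counts."""
    n = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-squared statistic, without continuity correction
    chi2 = sum(
        (observed - row_total * col_total / n) ** 2 / (row_total * col_total / n)
        for row, row_total in zip(table, row_totals)
        for observed, col_total in zip(row, col_totals)
    )
    n_rows, n_cols = len(table), len(table[0])
    cramerv = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    tschuprowt = sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))
    return cramerv, tschuprowt


# one row per carved bin: [not versicolor, versicolor] counts on train_set
table = [[31, 3], [27, 30], [9, 0]]
cramerv, tschuprowt = association(table)
print(round(cramerv, 4), round(tschuprowt, 4))
```

Tschuprow's T penalizes tables with more rows than Cramér's V does, so ranking by T favors groupings with fewer modalities. Two-bin features may differ slightly from the values reported by the library, since this sketch applies no continuity correction.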

[12]:

from AutoCarver import OneVsRestCarver
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# initiating AutoCarver
auto_carver = OneVsRestCarver(
    features=features,
    max_n_mod=max_n_mod,
    min_freq=min_freq,
    config=ProcessingConfig(ordinal_encoding=ordinal_encoding, dropna=dropna, verbose=True, copy=True),
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

WARNING: can't set copy=True for OneVsRestCarver (no inplace DataFrame.assign).

---------
[OneVsRestCarver] Fit y=versicolor (1/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
 - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
------

---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=versicolor') (1/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 4.400e+00	0.0000	0.0200	2
4.400e+00 < x <= 4.600e+00	0.0000	0.0300	3
4.600e+00 < x <= 4.700e+00	0.0000	0.0100	1
4.700e+00 < x <= 4.800e+00	0.0000	0.0500	5
4.800e+00 < x <= 4.900e+00	0.0000	0.0300	3
4.900e+00 < x <= 5.000e+00	0.1429	0.0700	7
5.000e+00 < x <= 5.100e+00	0.1667	0.0600	6
5.100e+00 < x <= 5.200e+00	0.3333	0.0300	3
5.200e+00 < x <= 5.300e+00	0.0000	0.0100	1
5.300e+00 < x <= 5.400e+00	0.0000	0.0300	3
5.400e+00 < x <= 5.500e+00	0.6667	0.0600	6
5.500e+00 < x <= 5.600e+00	0.6667	0.0300	3
5.600e+00 < x <= 5.700e+00	0.5000	0.0400	4
5.700e+00 < x <= 5.800e+00	0.4000	0.0500	5
5.800e+00 < x <= 5.900e+00	0.6667	0.0300	3
5.900e+00 < x <= 6.000e+00	0.6667	0.0300	3
6.000e+00 < x <= 6.100e+00	1.0000	0.0300	3
6.100e+00 < x <= 6.200e+00	0.6667	0.0300	3
6.200e+00 < x <= 6.300e+00	0.2857	0.0700	7
6.300e+00 < x <= 6.400e+00	0.2500	0.0400	4
6.400e+00 < x <= 6.500e+00	0.5000	0.0200	2
6.500e+00 < x <= 6.700e+00	0.6667	0.0600	6
6.700e+00 < x <= 6.800e+00	0.3333	0.0300	3
6.800e+00 < x <= 6.900e+00	0.3333	0.0300	3
6.900e+00 < x <= 7.100e+00	0.5000	0.0200	2
7.100e+00 < x <= 7.200e+00	0.0000	0.0300	3
7.200e+00 < x <= 7.600e+00	0.0000	0.0200	2
7.600e+00 < x <= 7.700e+00	0.0000	0.0300	3
7.700e+00 < x <= 7.900e+00	0.0000	0.0100	1
7.900e+00 < x	nan	0.0000	0

X_dev distribution
target_mean	frequency	count
0.0000	0.0400	2
0.0000	0.0400	2
0.0000	0.0200	1
nan	0.0000	0
0.3333	0.0600	3
0.3333	0.0600	3
0.0000	0.0600	3
0.0000	0.0200	1
nan	0.0000	0
0.3333	0.0600	3
1.0000	0.0200	1
1.0000	0.0600	3
0.7500	0.0800	4
0.5000	0.0400	2
nan	0.0000	0
0.6667	0.0600	3
0.3333	0.0600	3
0.0000	0.0200	1
0.5000	0.0400	2
0.3333	0.0600	3
0.0000	0.0600	3
0.2500	0.0800	4
nan	0.0000	0
0.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0
0.0000	0.0200	1
0.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 5.40e+00	0.0882	0.3400	34
5.40e+00 < x <= 7.10e+00	0.5263	0.5700	57
7.10e+00 < x	0.0000	0.0900	9

X_dev distribution
target_mean	frequency	count
0.1667	0.3600	18
0.4667	0.6000	30
0.0000	0.0400	2

--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=versicolor') (2/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 2.000e+00	1.0000	0.0100	1
2.000e+00 < x <= 2.200e+00	0.6667	0.0300	3
2.200e+00 < x <= 2.400e+00	1.0000	0.0300	3
2.400e+00 < x <= 2.500e+00	0.6000	0.0500	5
2.500e+00 < x <= 2.600e+00	0.7500	0.0400	4
2.600e+00 < x <= 2.700e+00	0.5714	0.0700	7
2.700e+00 < x <= 2.800e+00	0.4444	0.0900	9
2.800e+00 < x <= 2.900e+00	0.6667	0.0600	6
2.900e+00 < x <= 3.000e+00	0.2857	0.1400	14
3.000e+00 < x <= 3.100e+00	0.3333	0.0900	9
3.100e+00 < x <= 3.200e+00	0.2222	0.0900	9
3.200e+00 < x <= 3.300e+00	0.0000	0.0400	4
3.300e+00 < x <= 3.400e+00	0.0000	0.0600	6
3.400e+00 < x <= 3.500e+00	0.0000	0.0600	6
3.500e+00 < x <= 3.600e+00	0.0000	0.0300	3
3.600e+00 < x <= 3.700e+00	0.0000	0.0100	1
3.700e+00 < x <= 3.800e+00	0.0000	0.0500	5
3.800e+00 < x <= 4.100e+00	0.0000	0.0300	3
4.100e+00 < x	0.0000	0.0200	2

X_dev distribution
target_mean	frequency	count
nan	0.0000	0
nan	0.0000	0
0.7500	0.0800	4
0.3333	0.0600	3
0.0000	0.0200	1
0.5000	0.0400	2
0.4000	0.1000	5
0.7500	0.0800	4
0.3333	0.2400	12
0.0000	0.0400	2
0.2500	0.0800	4
0.5000	0.0400	2
0.1667	0.1200	6
nan	0.0000	0
0.0000	0.0200	1
0.0000	0.0400	2
0.0000	0.0200	1
0.0000	0.0200	1
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 2.9e+00	0.6316	0.3800	38
2.9e+00 < x	0.1452	0.6200	62

X_dev distribution
target_mean	frequency	count
0.5263	0.3800	19
0.2258	0.6200	31

--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=versicolor') (3/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.100e+00	0.0000	0.0100	1
1.100e+00 < x <= 1.300e+00	0.0000	0.0300	3
1.300e+00 < x <= 1.400e+00	0.0000	0.1100	11
1.400e+00 < x <= 1.500e+00	0.0000	0.0900	9
1.500e+00 < x <= 1.600e+00	0.0000	0.0700	7
1.600e+00 < x <= 1.900e+00	0.0000	0.0300	3
1.900e+00 < x <= 3.500e+00	1.0000	0.0300	3
3.500e+00 < x <= 3.700e+00	1.0000	0.0200	2
3.700e+00 < x <= 4.000e+00	1.0000	0.0700	7
4.000e+00 < x <= 4.200e+00	1.0000	0.0300	3
4.200e+00 < x <= 4.300e+00	1.0000	0.0200	2
4.300e+00 < x <= 4.400e+00	1.0000	0.0400	4
4.400e+00 < x <= 4.500e+00	1.0000	0.0100	1
4.500e+00 < x <= 4.600e+00	1.0000	0.0300	3
4.600e+00 < x <= 4.700e+00	1.0000	0.0300	3
4.700e+00 < x <= 4.800e+00	0.6667	0.0300	3
4.800e+00 < x <= 4.900e+00	0.5000	0.0400	4
4.900e+00 < x <= 5.000e+00	0.0000	0.0300	3
5.000e+00 < x <= 5.100e+00	0.1667	0.0600	6
5.100e+00 < x <= 5.400e+00	0.0000	0.0200	2
5.400e+00 < x <= 5.600e+00	0.0000	0.0500	5
5.600e+00 < x <= 5.700e+00	0.0000	0.0300	3
5.700e+00 < x <= 5.900e+00	0.0000	0.0300	3
5.900e+00 < x <= 6.100e+00	0.0000	0.0500	5
6.100e+00 < x <= 6.600e+00	0.0000	0.0200	2
6.600e+00 < x	0.0000	0.0200	2

X_dev distribution
target_mean	frequency	count
0.0000	0.0200	1
0.0000	0.1200	6
0.0000	0.0400	2
0.0000	0.0800	4
nan	0.0000	0
0.0000	0.0600	3
1.0000	0.0400	2
nan	0.0000	0
1.0000	0.0400	2
1.0000	0.0800	4
nan	0.0000	0
nan	0.0000	0
0.8571	0.1400	7
nan	0.0000	0
1.0000	0.0400	2
0.0000	0.0200	1
0.0000	0.0200	1
1.0000	0.0200	1
0.0000	0.0400	2
0.0000	0.0800	4
0.0000	0.0800	4
nan	0.0000	0
0.0000	0.0400	2
nan	0.0000	0
0.0000	0.0200	1
0.0000	0.0200	1

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 1.90e+00	0.0000	0.3400	34
1.90e+00 < x <= 4.80e+00	0.9677	0.3100	31
4.80e+00 < x	0.0857	0.3500	35

X_dev distribution
target_mean	frequency	count
0.0000	0.3200	16
0.8889	0.3600	18
0.0625	0.3200	16

--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=versicolor') (4/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.000e-01	0.0000	0.0500	5
1.000e-01 < x <= 2.000e-01	0.0000	0.1700	17
2.000e-01 < x <= 3.000e-01	0.0000	0.0500	5
3.000e-01 < x <= 4.000e-01	0.0000	0.0600	6
4.000e-01 < x <= 6.000e-01	0.0000	0.0100	1
6.000e-01 < x <= 1.000e+00	1.0000	0.0400	4
1.000e+00 < x <= 1.100e+00	1.0000	0.0200	2
1.100e+00 < x <= 1.200e+00	1.0000	0.0500	5
1.200e+00 < x <= 1.300e+00	1.0000	0.0800	8
1.300e+00 < x <= 1.400e+00	1.0000	0.0600	6
1.400e+00 < x <= 1.500e+00	0.8571	0.0700	7
1.500e+00 < x <= 1.600e+00	0.5000	0.0200	2
1.600e+00 < x <= 1.800e+00	0.1429	0.0700	7
1.800e+00 < x <= 1.900e+00	0.0000	0.0400	4
1.900e+00 < x <= 2.000e+00	0.0000	0.0400	4
2.000e+00 < x <= 2.100e+00	0.0000	0.0600	6
2.100e+00 < x <= 2.200e+00	0.0000	0.0100	1
2.200e+00 < x <= 2.300e+00	0.0000	0.0500	5
2.300e+00 < x <= 2.400e+00	0.0000	0.0200	2
2.400e+00 < x <= 2.500e+00	0.0000	0.0300	3
2.500e+00 < x	nan	0.0000	0

X_dev distribution
target_mean	frequency	count
nan	0.0000	0
0.0000	0.2400	12
0.0000	0.0400	2
0.0000	0.0200	1
0.0000	0.0200	1
1.0000	0.0600	3
1.0000	0.0200	1
nan	0.0000	0
1.0000	0.1000	5
0.5000	0.0400	2
0.8000	0.1000	5
1.0000	0.0400	2
0.1429	0.1400	7
0.0000	0.0200	1
0.0000	0.0400	2
nan	0.0000	0
0.0000	0.0400	2
0.0000	0.0600	3
0.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 6.00e-01	0.0000	0.3400	34
6.00e-01 < x <= 1.60e+00	0.9412	0.3400	34
1.60e+00 < x	0.0312	0.3200	32

X_dev distribution
target_mean	frequency	count
0.0000	0.3200	16
0.8889	0.3600	18
0.0625	0.3200	16

---------


---------
[OneVsRestCarver] Fit y=virginica (2/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
 - [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
------

---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=virginica') (1/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 4.400e+00	0.0000	0.0200	2
4.400e+00 < x <= 4.600e+00	0.0000	0.0300	3
4.600e+00 < x <= 4.700e+00	0.0000	0.0100	1
4.700e+00 < x <= 4.800e+00	0.0000	0.0500	5
4.800e+00 < x <= 4.900e+00	0.0000	0.0300	3
4.900e+00 < x <= 5.000e+00	0.0000	0.0700	7
5.000e+00 < x <= 5.100e+00	0.0000	0.0600	6
5.100e+00 < x <= 5.200e+00	0.0000	0.0300	3
5.200e+00 < x <= 5.300e+00	0.0000	0.0100	1
5.300e+00 < x <= 5.400e+00	0.0000	0.0300	3
5.400e+00 < x <= 5.500e+00	0.0000	0.0600	6
5.500e+00 < x <= 5.600e+00	0.3333	0.0300	3
5.600e+00 < x <= 5.700e+00	0.2500	0.0400	4
5.700e+00 < x <= 5.800e+00	0.6000	0.0500	5
5.800e+00 < x <= 5.900e+00	0.3333	0.0300	3
5.900e+00 < x <= 6.000e+00	0.3333	0.0300	3
6.000e+00 < x <= 6.100e+00	0.0000	0.0300	3
6.100e+00 < x <= 6.200e+00	0.3333	0.0300	3
6.200e+00 < x <= 6.300e+00	0.7143	0.0700	7
6.300e+00 < x <= 6.400e+00	0.7500	0.0400	4
6.400e+00 < x <= 6.500e+00	0.5000	0.0200	2
6.500e+00 < x <= 6.700e+00	0.3333	0.0600	6
6.700e+00 < x <= 6.800e+00	0.6667	0.0300	3
6.800e+00 < x <= 6.900e+00	0.6667	0.0300	3
6.900e+00 < x <= 7.100e+00	0.5000	0.0200	2
7.100e+00 < x <= 7.200e+00	1.0000	0.0300	3
7.200e+00 < x <= 7.600e+00	1.0000	0.0200	2
7.600e+00 < x <= 7.700e+00	1.0000	0.0300	3
7.700e+00 < x <= 7.900e+00	1.0000	0.0100	1
7.900e+00 < x	nan	0.0000	0

X_dev distribution
target_mean	frequency	count
0.0000	0.0400	2
0.0000	0.0400	2
0.0000	0.0200	1
nan	0.0000	0
0.3333	0.0600	3
0.0000	0.0600	3
0.0000	0.0600	3
0.0000	0.0200	1
nan	0.0000	0
0.0000	0.0600	3
0.0000	0.0200	1
0.0000	0.0600	3
0.0000	0.0800	4
0.0000	0.0400	2
nan	0.0000	0
0.3333	0.0600	3
0.6667	0.0600	3
1.0000	0.0200	1
0.5000	0.0400	2
0.6667	0.0600	3
1.0000	0.0600	3
0.7500	0.0800	4
nan	0.0000	0
1.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0
1.0000	0.0200	1
1.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 6.2e+00	0.1250	0.6400	64
6.2e+00 < x	0.6944	0.3600	36

X_dev distribution
target_mean	frequency	count
0.1429	0.7000	35
0.8000	0.3000	15

--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=virginica') (2/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 2.000e+00	0.0000	0.0100	1
2.000e+00 < x <= 2.200e+00	0.3333	0.0300	3
2.200e+00 < x <= 2.400e+00	0.0000	0.0300	3
2.400e+00 < x <= 2.500e+00	0.4000	0.0500	5
2.500e+00 < x <= 2.600e+00	0.2500	0.0400	4
2.600e+00 < x <= 2.700e+00	0.4286	0.0700	7
2.700e+00 < x <= 2.800e+00	0.5556	0.0900	9
2.800e+00 < x <= 2.900e+00	0.1667	0.0600	6
2.900e+00 < x <= 3.000e+00	0.4286	0.1400	14
3.000e+00 < x <= 3.100e+00	0.2222	0.0900	9
3.100e+00 < x <= 3.200e+00	0.5556	0.0900	9
3.200e+00 < x <= 3.300e+00	0.7500	0.0400	4
3.300e+00 < x <= 3.400e+00	0.1667	0.0600	6
3.400e+00 < x <= 3.500e+00	0.0000	0.0600	6
3.500e+00 < x <= 3.600e+00	0.3333	0.0300	3
3.600e+00 < x <= 3.700e+00	0.0000	0.0100	1
3.700e+00 < x <= 3.800e+00	0.4000	0.0500	5
3.800e+00 < x <= 4.100e+00	0.0000	0.0300	3
4.100e+00 < x	0.0000	0.0200	2

X_dev distribution
target_mean	frequency	count
nan	0.0000	0
nan	0.0000	0
0.0000	0.0800	4
0.6667	0.0600	3
1.0000	0.0200	1
0.5000	0.0400	2
0.6000	0.1000	5
0.2500	0.0800	4
0.5000	0.2400	12
1.0000	0.0400	2
0.0000	0.0800	4
0.0000	0.0400	2
0.1667	0.1200	6
nan	0.0000	0
0.0000	0.0200	1
0.0000	0.0400	2
0.0000	0.0200	1
0.0000	0.0200	1
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 2.40e+00	0.1429	0.0700	7
2.40e+00 < x <= 3.30e+00	0.4179	0.6700	67
3.30e+00 < x	0.1538	0.2600	26

X_dev distribution
target_mean	frequency	count
0.0000	0.0800	4
0.4571	0.7000	35
0.0909	0.2200	11

--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=virginica') (3/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.100e+00	0.0000	0.0100	1
1.100e+00 < x <= 1.300e+00	0.0000	0.0300	3
1.300e+00 < x <= 1.400e+00	0.0000	0.1100	11
1.400e+00 < x <= 1.500e+00	0.0000	0.0900	9
1.500e+00 < x <= 1.600e+00	0.0000	0.0700	7
1.600e+00 < x <= 1.900e+00	0.0000	0.0300	3
1.900e+00 < x <= 3.500e+00	0.0000	0.0300	3
3.500e+00 < x <= 3.700e+00	0.0000	0.0200	2
3.700e+00 < x <= 4.000e+00	0.0000	0.0700	7
4.000e+00 < x <= 4.200e+00	0.0000	0.0300	3
4.200e+00 < x <= 4.300e+00	0.0000	0.0200	2
4.300e+00 < x <= 4.400e+00	0.0000	0.0400	4
4.400e+00 < x <= 4.500e+00	0.0000	0.0100	1
4.500e+00 < x <= 4.600e+00	0.0000	0.0300	3
4.600e+00 < x <= 4.700e+00	0.0000	0.0300	3
4.700e+00 < x <= 4.800e+00	0.3333	0.0300	3
4.800e+00 < x <= 4.900e+00	0.5000	0.0400	4
4.900e+00 < x <= 5.000e+00	1.0000	0.0300	3
5.000e+00 < x <= 5.100e+00	0.8333	0.0600	6
5.100e+00 < x <= 5.400e+00	1.0000	0.0200	2
5.400e+00 < x <= 5.600e+00	1.0000	0.0500	5
5.600e+00 < x <= 5.700e+00	1.0000	0.0300	3
5.700e+00 < x <= 5.900e+00	1.0000	0.0300	3
5.900e+00 < x <= 6.100e+00	1.0000	0.0500	5
6.100e+00 < x <= 6.600e+00	1.0000	0.0200	2
6.600e+00 < x	1.0000	0.0200	2

X_dev distribution
target_mean	frequency	count
0.0000	0.0200	1
0.0000	0.1200	6
0.0000	0.0400	2
0.0000	0.0800	4
nan	0.0000	0
0.0000	0.0600	3
0.0000	0.0400	2
nan	0.0000	0
0.0000	0.0400	2
0.0000	0.0800	4
nan	0.0000	0
nan	0.0000	0
0.1429	0.1400	7
nan	0.0000	0
0.0000	0.0400	2
1.0000	0.0200	1
1.0000	0.0200	1
0.0000	0.0200	1
1.0000	0.0400	2
1.0000	0.0800	4
1.0000	0.0800	4
nan	0.0000	0
1.0000	0.0400	2
nan	0.0000	0
1.0000	0.0200	1
1.0000	0.0200	1

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 4.8e+00	0.0154	0.6500	65
4.8e+00 < x	0.9143	0.3500	35

X_dev distribution
target_mean	frequency	count
0.0588	0.6800	34
0.9375	0.3200	16

--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=virginica') (4/4)
 [BinaryCarver] Raw distribution

X distribution
	target_mean	frequency	count
x <= 1.000e-01	0.0000	0.0500	5
1.000e-01 < x <= 2.000e-01	0.0000	0.1700	17
2.000e-01 < x <= 3.000e-01	0.0000	0.0500	5
3.000e-01 < x <= 4.000e-01	0.0000	0.0600	6
4.000e-01 < x <= 6.000e-01	0.0000	0.0100	1
6.000e-01 < x <= 1.000e+00	0.0000	0.0400	4
1.000e+00 < x <= 1.100e+00	0.0000	0.0200	2
1.100e+00 < x <= 1.200e+00	0.0000	0.0500	5
1.200e+00 < x <= 1.300e+00	0.0000	0.0800	8
1.300e+00 < x <= 1.400e+00	0.0000	0.0600	6
1.400e+00 < x <= 1.500e+00	0.1429	0.0700	7
1.500e+00 < x <= 1.600e+00	0.5000	0.0200	2
1.600e+00 < x <= 1.800e+00	0.8571	0.0700	7
1.800e+00 < x <= 1.900e+00	1.0000	0.0400	4
1.900e+00 < x <= 2.000e+00	1.0000	0.0400	4
2.000e+00 < x <= 2.100e+00	1.0000	0.0600	6
2.100e+00 < x <= 2.200e+00	1.0000	0.0100	1
2.200e+00 < x <= 2.300e+00	1.0000	0.0500	5
2.300e+00 < x <= 2.400e+00	1.0000	0.0200	2
2.400e+00 < x <= 2.500e+00	1.0000	0.0300	3
2.500e+00 < x	nan	0.0000	0

X_dev distribution
target_mean	frequency	count
nan	0.0000	0
0.0000	0.2400	12
0.0000	0.0400	2
0.0000	0.0200	1
0.0000	0.0200	1
0.0000	0.0600	3
0.0000	0.0200	1
nan	0.0000	0
0.0000	0.1000	5
0.5000	0.0400	2
0.2000	0.1000	5
0.0000	0.0400	2
0.8571	0.1400	7
1.0000	0.0200	1
1.0000	0.0400	2
nan	0.0000	0
1.0000	0.0400	2
1.0000	0.0600	3
1.0000	0.0200	1
nan	0.0000	0
nan	0.0000	0

 [BinaryCarver] Carved distribution

X distribution
	target_mean	frequency	count
x <= 1.5e+00	0.0152	0.6600	66
1.5e+00 < x	0.9412	0.3400	34

X_dev distribution
target_mean	frequency	count
0.0625	0.6400	32
0.8333	0.3600	18

---------

AutoCarver analysis

Carving Summary

[13]:

auto_carver.summary

[13]:

feature	count	cramerv	tschuprowt	n_mod	label	content	target_mean	frequency	dropped	dropped_reason
Quantitative('sepal length (cm)__y=versicolor')	34.0	0.483288	0.406395	3	0	x <= 5.40e+00	0.088235	0.34	False	None
	57.0	0.483288	0.406395	3	1	5.40e+00 < x <= 7.10e+00	0.526316	0.57	False	None
	9.0	0.483288	0.406395	3	2	7.10e+00 < x	0.000000	0.09	False	None
Quantitative('sepal width (cm)__y=versicolor')	38.0	0.480207	0.480207	2	0	x <= 2.9e+00	0.631579	0.38	False	None
Quantitative('sepal width (cm)__y=versicolor')	62.0	0.480207	0.480207	2	1	2.9e+00 < x	0.145161	0.62	False	None
Quantitative('petal length (cm)__y=versicolor')	34.0	0.912237	0.767096	3	0	x <= 1.90e+00	0.000000	0.34	False	None
	31.0	0.912237	0.767096	3	1	1.90e+00 < x <= 4.80e+00	0.967742	0.31	False	None
	35.0	0.912237	0.767096	3	2	4.80e+00 < x	0.085714	0.35	False	None
Quantitative('petal width (cm)__y=versicolor')	34.0	0.933300	0.784809	3	0	x <= 6.00e-01	0.000000	0.34	False	None
	34.0	0.933300	0.784809	3	1	6.00e-01 < x <= 1.60e+00	0.941176	0.34	False	None
	32.0	0.933300	0.784809	3	2	1.60e+00 < x	0.031250	0.32	False	None
Quantitative('sepal length (cm)__y=virginica')	64.0	0.559144	0.559144	2	0	x <= 6.2e+00	0.125000	0.64	False	None
Quantitative('sepal length (cm)__y=virginica')	36.0	0.559144	0.559144	2	1	6.2e+00 < x	0.694444	0.36	False	None
Quantitative('sepal width (cm)__y=virginica')	7.0	0.266452	0.224058	3	0	x <= 2.40e+00	0.142857	0.07	False	None
	67.0	0.266452	0.224058	3	1	2.40e+00 < x <= 3.30e+00	0.417910	0.67	False	None
	26.0	0.266452	0.224058	3	2	3.30e+00 < x	0.153846	0.26	False	None
Quantitative('petal length (cm)__y=virginica')	65.0	0.889524	0.889524	2	0	x <= 4.8e+00	0.015385	0.65	False	None
Quantitative('petal length (cm)__y=virginica')	35.0	0.889524	0.889524	2	1	4.8e+00 < x	0.914286	0.35	False	None
Quantitative('petal width (cm)__y=virginica')	66.0	0.910463	0.910463	2	0	x <= 1.5e+00	0.015152	0.66	False	None
Quantitative('petal width (cm)__y=virginica')	34.0	0.910463	0.910463	2	1	1.5e+00 < x	0.941176	0.34	False	None

As requested with ordinal_encoding=True, output labels are integers of modalities
Features have been carved for two distinct binary targets:
- y=versicolor: dummy of target iris_type taking value "versicolor"
- y=virginica: dummy of target iris_type taking value "virginica"
For quantitative feature petal width (cm), for y=virginica, the selected combination of modalities groups petal widths as follows:
- label 0: lower or equal to 1.5cm (content="x <= 1.5e+00")
- label 1: higher than 1.5cm (content="1.5e+00 < x")
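
The binary targets named y=versicolor and y=virginica are dummies of the multiclass target. A minimal sketch of how such dummies can be built follows. It assumes that the first class in sorted order is the one left out, as the fitting log suggests for setosa; the function one_vs_rest_targets is hypothetical.

```python
def one_vs_rest_targets(labels):
    """Build one binary target per class, skipping the first class in sorted order."""
    classes = sorted(set(labels))
    # assumption: the first class is left out, as 'setosa' is in the fitting log
    return {
        f"y={cls}": [int(label == cls) for label in labels]
        for cls in classes[1:]
    }


# iris_type of the first five rows of train_set
iris_type = ["virginica", "setosa", "virginica", "versicolor", "setosa"]
dummies = one_vs_rest_targets(iris_type)
for name, values in dummies.items():
    print(name, values)
```

Each original feature is then carved once per dummy, which is why the summary lists every feature twice, with the suffixes __y=versicolor and __y=virginica.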

Detailed overview of tested combinations

[14]:

features['sepal length (cm)__y=versicolor'].history.head(7)

[14]:

	info	cramerv	tschuprowt	combination	n_mod	dropna	train	viable	dev
0	Raw distribution (n_mod=30>max_n_mod=5)	0.590426	0.254429	{'x <= 4.400e+00': 'x <= 4.400e+00', '4.400e+0...	30	False	NaN	NaN	NaN
1	Best for tschuprowt and max_n_mod=5	0.483288	0.406395	{'x <= 4.400e+00': 'x <= 4.400e+00', '4.400e+0...	3	False	{'viable': True, 'info': ''}	True	{'viable': True, 'info': ''}

[15]:

features['sepal length (cm)__y=versicolor'].history.dev[1]

[15]:

{'viable': True, 'info': ''}

The most associated combination of feature sepal length (cm)__y=versicolor (row 1, the first one tested, where info!="Raw distribution") passed the viability tests on both train_set and dev_set:
- info="Best for tschuprowt and max_n_mod=5"
- Tschuprow’s T with the target is 0.406395 for this combination (by default, combinations are ranked according to this statistic)
- history.dev[1] is {'viable': True, 'info': ''}; when a combination fails, info holds the reason, for example "Non-representative modality for min_freq=5.00%", which tells us that a modality is unstable between train_set and dev_set
- Combinations that are less associated with the target are not tested once a viable one is found: info="Not checked"
For all combinations, dropna=False means that nans are not being grouped with other modalities (as requested with dropna=False)
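
The robustness checks listed in the fitting recipe can be sketched as a single function. This is a simplified illustration under the rules stated in this notebook (minimum frequency of min_freq/2, distinct consecutive target rates, same ordering on both samples). The function is_viable is hypothetical, and the values are copied from the carved distribution of sepal length (cm)__y=versicolor in the fitting log.

```python
def is_viable(train, dev, min_freq):
    """Check a grouping given per-modality (target_mean, frequency) pairs on train and dev."""
    for sample in (train, dev):
        rates = [rate for rate, _ in sample]
        # every modality must be frequent enough
        if any(freq < min_freq / 2 for _, freq in sample):
            return False
        # consecutive modalities must have distinct target rates
        if any(a == b for a, b in zip(rates, rates[1:])):
            return False

    # modalities must be ordered the same way by target rate on both samples
    def ranking(sample):
        return sorted(range(len(sample)), key=lambda i: sample[i][0])

    return ranking(train) == ranking(dev)


train = [(0.0882, 0.34), (0.5263, 0.57), (0.0000, 0.09)]
dev = [(0.1667, 0.36), (0.4667, 0.60), (0.0000, 0.04)]
print(is_viable(train, dev, min_freq=0.05))
```

With min_freq=0.05 the least frequent modality on dev_set (4 %) stays above the 2.5 % threshold, so the grouping is viable. Raising min_freq to 0.1 would make the same grouping fail in this sketch.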

Saving and Loading AutoCarver

Saving

All Carvers can safely be stored as a .json file.

[16]:

auto_carver.save("multiclass_carver.json")

Loading

Carvers can safely be loaded from a .json file.

[17]:

from AutoCarver import OneVsRestCarver

auto_carver = OneVsRestCarver.load("multiclass_carver.json")

WARNING: can't set copy=True for OneVsRestCarver (no inplace DataFrame.assign).

Applying AutoCarver

[18]:

dev_set_processed = auto_carver.transform(dev_set)

[19]:

dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))

[19]:

	sepal length (cm)__y=versicolor	sepal width (cm)__y=versicolor	petal length (cm)__y=versicolor	petal width (cm)__y=versicolor	sepal length (cm)__y=virginica	sepal width (cm)__y=virginica	petal length (cm)__y=virginica	petal width (cm)__y=virginica
0.0	0.36	0.38	0.32	0.32	0.7	0.08	0.68	0.64
1.0	0.60	0.62	0.36	0.36	0.3	0.70	0.32	0.36
2.0	0.04	NaN	0.32	0.32	NaN	0.22	NaN	NaN

Feature Selection

Selectors settings

Features to select from

Here all features have been carved using OneVsRestCarver, hence all features are qualitative.

Number of features to select

The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).

[20]:

n_best_per_type = 4  # here the number of features is low, ClassificationSelector will only be used to compute useful statistics

Using Selectors

[21]:

from AutoCarver.selectors import ClassificationSelector
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig

# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    features=features,
    n_best_per_type=n_best_per_type,
    config=ProcessingConfig(verbose=True),  # displays statistics
)
best_features = feature_selector.fit(train_set_processed, train_set_processed[target]).selected_features
best_features

 [ClassificationSelector] Selected Qualitative Features

	feature	Mode	TschuprowtMeasure	TschuprowtRank	TschuprowtFilter	TschuprowtWith
3	Quantitative('petal width (cm)__y=versicolor')	0.3400	0.9558	0.0000	0.0000	itself
2	Quantitative('petal length (cm)__y=versicolor')	0.3500	0.9421	1.0000	0.9018	petal width (cm)__y=versicolor
7	Quantitative('petal width (cm)__y=virginica')	0.6600	0.7857	2.0000	0.8049	petal width (cm)__y=versicolor
6	Quantitative('petal length (cm)__y=virginica')	0.6500	0.7695	3.0000	0.8675	petal width (cm)__y=virginica
0	Quantitative('sepal length (cm)__y=versicolor')	0.5700	0.6713	4.0000	0.6649	petal length (cm)__y=versicolor
4	Quantitative('sepal length (cm)__y=virginica')	0.6400	0.5441	5.0000	0.6071	petal length (cm)__y=virginica
1	Quantitative('sepal width (cm)__y=versicolor')	0.6200	0.4950	6.0000	0.5042	petal width (cm)__y=versicolor
5	Quantitative('sepal width (cm)__y=virginica')	0.6700	0.4868	7.0000	0.5107	petal width (cm)__y=versicolor

[21]:

Features(['petal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=virginica', 'petal length (cm)__y=virginica'])

[22]:

train_set_processed[best_features].head()

[22]:

	petal width (cm)__y=versicolor	petal length (cm)__y=versicolor	petal width (cm)__y=virginica	petal length (cm)__y=virginica
136	2.0	2.0	1.0	1.0
17	0.0	0.0	0.0	0.0
142	2.0	2.0	1.0	1.0
59	1.0	1.0	0.0	0.0
6	0.0	0.0	0.0	0.0

Feature petal width (cm)__y=versicolor is the most associated with the target iris_type:
- Tschuprow’s T value is TschuprowtMeasure=0.9558
- It has no NaNs (this dataset contains none)
- Its mode represents 34 % of observed data (Mode=0.3400)
Feature petal length (cm)__y=versicolor is strongly associated with feature petal width (cm)__y=versicolor:
- Tschuprow’s T value is TschuprowtFilter=0.9018 for TschuprowtWith=petal width (cm)__y=versicolor
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
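
Since no filtering thresholds were set, the selection amounts to ranking features by their association with the target and keeping the best ones. The sketch below reproduces that ranking from the TschuprowtMeasure values printed by the selector. The function select_n_best is hypothetical and ignores the inter-feature filters that a Selector can also apply.

```python
# TschuprowtMeasure per carved feature, copied from the selector output
measures = {
    "petal width (cm)__y=versicolor": 0.9558,
    "petal length (cm)__y=versicolor": 0.9421,
    "petal width (cm)__y=virginica": 0.7857,
    "petal length (cm)__y=virginica": 0.7695,
    "sepal length (cm)__y=versicolor": 0.6713,
    "sepal length (cm)__y=virginica": 0.5441,
    "sepal width (cm)__y=versicolor": 0.4950,
    "sepal width (cm)__y=virginica": 0.4868,
}


def select_n_best(measures, n_best):
    """Return the n_best feature names, most associated with the target first."""
    ranked = sorted(measures, key=measures.get, reverse=True)
    return ranked[:n_best]


selected = select_n_best(measures, n_best=4)
print(selected)
```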

Modeling

Fitting model on train data

[23]:

from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# Encode string labels to integers
label_encoder = LabelEncoder()
train_set_processed[target] = label_encoder.fit_transform(train_set[target])

model = XGBClassifier(objective='multi:softmax')
model.fit(train_set_processed[best_features], train_set_processed[target])

[23]:

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              feature_weights=None, gamma=None, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, ...)


Saving model

[24]:

model.save_model("multiclass_xgboost.json")

Prediction on dev dataset and performance

[25]:

from sklearn.metrics import accuracy_score

dev_set_processed[target] = label_encoder.transform(dev_set[target])
dev_pred = model.predict(dev_set_processed[best_features])
accuracy_score(dev_set_processed[target], dev_pred)

[25]:

0.94

What’s next?

Thanks to Carvers all of your features are now optimally processed for your classification task!
As a final step towards your model, Selectors can prove to be handy tools to operate target optimal Data Pre-Selection, so make sure to check out Selectors Examples!

Well done!

You have used AutoCarver’s OneVsRestCarver to discretize the Iris features for a multiclass target, checked the robustness of the carved modalities on a dev sample, selected the most associated features and fitted a classifier on them.

Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.