Setting things up
About this notebook
This notebook prepares the Iris dataset for multiclass classification using the MulticlassCarver pipeline. MulticlassCarver is a Python tool that discretizes features so as to maximize their association with the target, and it handles both quantitative and qualitative data.
The Iris dataset, a classic in machine learning, provides sepal and petal dimensions for three Iris species. Its four features are all quantitative, so this notebook exercises the quantitative side of the discretization pipeline.
The following sections walk through each stage: sampling the data, configuring and fitting MulticlassCarver, reading its summary and history, saving and loading the fitted carver, and selecting features with ClassificationSelector.
Installation
[1]:
# %pip install AutoCarver[jupyter]
Iris Data
In this example notebook, we will use the Iris dataset.
The Iris dataset is a classic and widely used dataset in the field of machine learning and pattern recognition. It was introduced by the British biologist and statistician Sir Ronald A. Fisher in 1936 and has since become a benchmark dataset for various classification and clustering tasks.
The dataset consists of measurements from 150 iris flowers, belonging to three different species: setosa, versicolor, and virginica. Four features are included for each flower: sepal length, sepal width, petal length, and petal width, all measured in centimeters.
The primary objective of the Iris dataset is typically to classify iris flowers into one of the three species based on these four features (multiclass classification).
[2]:
from sklearn import datasets
# Load dataset directly from sklearn
iris = datasets.load_iris(as_frame=True)
# conversion to pandas
iris_data = iris["data"]
iris_data["iris_type"] = list(map(lambda u: iris["target_names"][u], iris["target"]))
# Display the first few rows of the dataset
iris_data.head()
[2]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | iris_type |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Target type and Carver selection
[3]:
target = "iris_type"
iris_data[target].value_counts(dropna=False)
[3]:
iris_type
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
The target "iris_type" is a multiclass target of type str used in a classification task. Hence we will use AutoCarver.MulticlassCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.
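The fitting log later in this notebook shows MulticlassCarver fitting one BinaryCarver per class, for two of the three classes (y=versicolor and y=virginica). A minimal sketch of such a one-vs-rest encoding in plain Python could look as follows. The helper name one_vs_rest_targets and the choice of skipping the first class in sorted order are assumptions made for illustration, not AutoCarver's documented behaviour.

```python
def one_vs_rest_targets(labels):
    """Build one binary (0/1) target per class, skipping the first sorted class.

    Hypothetical helper: the skipped class is implied by the others,
    since it is the case where every dummy equals 0.
    """
    classes = sorted(set(labels))
    return {
        f"y={cls}": [1 if label == cls else 0 for label in labels]
        for cls in classes[1:]
    }


labels = ["setosa", "versicolor", "virginica", "versicolor", "setosa"]
dummies = one_vs_rest_targets(labels)
print(dummies)
```

Each carved feature is then suffixed with the dummy it was carved against, which is how names such as sepal length (cm)__y=versicolor appear in the log.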
Data Sampling
[4]:
from sklearn.model_selection import train_test_split
# stratified sampling by target
train_set, dev_set = train_test_split(iris_data, test_size=0.33, random_state=42, stratify=iris_data[target])
# checking target rate per dataset
train_set[target].value_counts(dropna=False, normalize=True), dev_set[target].value_counts(dropna=False, normalize=True)
[4]:
(iris_type
setosa 0.34
virginica 0.33
versicolor 0.33
Name: proportion, dtype: float64,
iris_type
virginica 0.34
versicolor 0.34
setosa 0.32
Name: proportion, dtype: float64)
Picking up columns to Carve
[5]:
train_set.head()
[5]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | iris_type |
|---|---|---|---|---|---|
| 136 | 6.3 | 3.4 | 5.6 | 2.4 | virginica |
| 17 | 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 142 | 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 59 | 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
[6]:
# column data types
train_set.dtypes
[6]:
sepal length (cm) float64
sepal width (cm) float64
petal length (cm) float64
petal width (cm) float64
iris_type object
dtype: object
[7]:
print(iris["feature_names"])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
All features are quantitative continuous features. They are passed to Features through its quantitatives argument.
[8]:
from AutoCarver import Features
# lists of features per data type
features = Features(quantitatives=["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"])
Using AutoCarver
AutoCarver settings
Representativeness of modalities
The attribute min_freq sets the minimum frequency per basic modality. It is used by Discretizers:
For quantitative features, it defines the number of quantiles used to initially discretize the features.
For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.
[9]:
min_freq = 0.05
Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
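To make the link between min_freq and the number of quantiles concrete, here is a small standard-library sketch. It assumes that the number of quantiles is the inverse of min_freq (20 quantiles for min_freq = 0.05). The function quantile_cut_points is hypothetical and is not AutoCarver's ContinuousDiscretizer.

```python
import statistics


def quantile_cut_points(values, min_freq):
    """Return the distinct quantile boundaries for round(1 / min_freq) quantiles."""
    n_quantiles = round(1 / min_freq)  # assumption: 0.05 -> 20 quantiles
    cuts = statistics.quantiles(values, n=n_quantiles, method="inclusive")
    # repeated values can produce identical boundaries, keep each one once
    return sorted(set(cuts))


values = [x / 10 for x in range(1, 101)]  # toy data: 0.1, 0.2, ..., 10.0
cuts = quantile_cut_points(values, min_freq=0.05)
print(len(cuts), cuts[:3])
```

A lower min_freq yields more, narrower starting intervals, which is why the tip above associates it with slower but more precise carving.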
Desired number of modalities
The attribute max_n_mod sets the maximum number of modalities per carved feature. It is used by Carvers as the upper limit on the number of modalities per consecutive combination of modalities.
[10]:
max_n_mod = 5
Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
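The cost of raising max_n_mod can be estimated by counting consecutive groupings. Splitting n ordered modalities into k consecutive groups amounts to choosing k - 1 cut positions among n - 1, and summing over k from 2 to max_n_mod gives the number of candidates. This count matches the totals of the progress bars in the fitting log of this notebook (for instance 15275 combinations for the 26 raw modalities of sepal length). The two helpers below are illustrative names, not part of AutoCarver.

```python
from itertools import combinations
from math import comb


def n_consecutive_combinations(n_modalities, max_n_mod):
    """Count the ways to merge ordered modalities into 2..max_n_mod consecutive groups."""
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, max_n_mod + 1))


def consecutive_groupings(modalities, max_n_mod):
    """Yield every grouping of ordered modalities into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for k in range(2, max_n_mod + 1):
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


print(n_consecutive_combinations(26, 5))  # sepal length: 26 raw modalities
print(list(consecutive_groupings(["a", "b", "c"], 3)))
```

The count grows quickly with both the number of raw modalities (driven by min_freq) and max_n_mod, which explains the speed trade-off mentioned in the tips.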
Grouping NaNs
The attribute dropna allows one to choose whether or not nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with nan.
[11]:
dropna = False # anyway, there are no nan in this dataset
Type of output carved features
The attribute ordinal_encoding allows one to choose the output type:
Use True for integer output of ranked modalities (default)
Use False for string output of modalities
[12]:
ordinal_encoding = True
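As an illustration of the two output types, the sketch below maps petal widths to either an integer rank or a string label. The cut points 0.6 and 1.5 are the ones reported for petal width (cm)__y=versicolor in the carving summary of this notebook. The carve function is hypothetical, and using the summary's content strings as the string output is an assumption.

```python
from bisect import bisect_left


def carve(value, cut_points, labels=None):
    """Return the rank of the interval holding value, or its label when labels are given."""
    # intervals are closed on the right ("x <= cut"), hence bisect_left
    rank = bisect_left(cut_points, value)
    return rank if labels is None else labels[rank]


cut_points = [0.6, 1.5]
labels = ["x <= 6.0e-01", "6.0e-01 < x <= 1.5e+00", "1.5e+00 < x"]
widths = [0.2, 0.6, 1.4, 1.5, 2.3]

ordinal = [carve(w, cut_points) for w in widths]  # integer ranks
as_strings = [carve(w, cut_points, labels) for w in widths]  # string labels
print(ordinal)
print(as_strings)
```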
Fitting AutoCarver
First, all quantitative features are discretized:
Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq)
Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2) to be grouped with its closest modality
Second, all features are carved following this recipe, for all classes of train_set[target] (except one):
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (Tschuprow’s T, by default) is computed with the provided target train_set[target]. Combinations are sorted in descending order by association value
4. Testing robustness: finds the first combination that checks the following:
   - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq/2)
   - Distinct target rates per consecutive modalities on train_set and dev_set
   - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
5. (Optional) If requested via dropna=True, and if any, all combinations of modalities with nan are applied to train_set and steps 3. and 4. are run
6. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
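The three robustness checks listed above can be sketched in a few lines of plain Python. The function is_viable is a hypothetical helper that follows the textual description only and makes no claim about AutoCarver's internals. The sample values are the carved distribution of sepal length (cm)__y=versicolor printed in the fitting log (train first, then dev).

```python
def is_viable(train, dev, min_freq):
    """train and dev: lists of (target_rate, frequency), one pair per modality, same order."""
    for sample in (train, dev):
        # 1. every modality must be frequent enough
        if any(freq < min_freq / 2 for _, freq in sample):
            return False
        # 2. consecutive modalities must have distinct target rates
        rates = [rate for rate, _ in sample]
        if any(a == b for a, b in zip(rates, rates[1:])):
            return False

    # 3. modalities must rank identically by target rate on both samples
    def ranking(sample):
        return sorted(range(len(sample)), key=lambda i: sample[i][0])

    return ranking(train) == ranking(dev)


train = [(0.0882, 0.34), (0.6333, 0.30), (0.3056, 0.36)]
dev = [(0.1667, 0.36), (0.6471, 0.34), (0.2000, 0.30)]
print(is_viable(train, dev, min_freq=0.05))
```

Swapping the target rates of two dev modalities, or shrinking one dev frequency below min_freq/2, makes the same combination fail.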
[13]:
from AutoCarver import MulticlassCarver
# initiating AutoCarver
auto_carver = MulticlassCarver(
features=features,
ordinal_encoding=ordinal_encoding,
max_n_mod=max_n_mod,
min_freq=min_freq,
dropna=dropna,
verbose=True, # showing statistics
copy=True, # whether or not to return a copy of the input dataset
)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
WARNING: can't set copy=True for MulticlassCarver (no inplace DataFrame.assign).
---------
[MulticlassCarver] Fit y=versicolor (1/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
- [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
- [OrdinalDiscretizer] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
------
---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=versicolor', 'sepal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=versicolor'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=versicolor') (1/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 4.40e+00 | 0.0000 | 0.0200 |
| 4.40e+00 < x <= 4.70e+00 | 0.0000 | 0.0400 |
| 4.70e+00 < x <= 4.80e+00 | 0.0000 | 0.0500 |
| 4.80e+00 < x <= 4.90e+00 | 0.0000 | 0.0300 |
| 4.90e+00 < x <= 5.00e+00 | 0.1429 | 0.0700 |
| 5.00e+00 < x <= 5.10e+00 | 0.1667 | 0.0600 |
| 5.10e+00 < x <= 5.20e+00 | 0.3333 | 0.0300 |
| 5.20e+00 < x <= 5.40e+00 | 0.0000 | 0.0400 |
| 5.40e+00 < x <= 5.50e+00 | 0.6667 | 0.0600 |
| 5.50e+00 < x <= 5.60e+00 | 0.6667 | 0.0300 |
| 5.60e+00 < x <= 5.70e+00 | 0.5000 | 0.0400 |
| 5.70e+00 < x <= 5.80e+00 | 0.4000 | 0.0500 |
| 5.80e+00 < x <= 5.90e+00 | 0.6667 | 0.0300 |
| 5.90e+00 < x <= 6.00e+00 | 0.6667 | 0.0300 |
| 6.00e+00 < x <= 6.10e+00 | 1.0000 | 0.0300 |
| 6.10e+00 < x <= 6.20e+00 | 0.6667 | 0.0300 |
| 6.20e+00 < x <= 6.30e+00 | 0.2857 | 0.0700 |
| 6.30e+00 < x <= 6.40e+00 | 0.2500 | 0.0400 |
| 6.40e+00 < x <= 6.50e+00 | 0.5000 | 0.0200 |
| 6.50e+00 < x <= 6.70e+00 | 0.6667 | 0.0600 |
| 6.70e+00 < x <= 6.80e+00 | 0.3333 | 0.0300 |
| 6.80e+00 < x <= 6.90e+00 | 0.3333 | 0.0300 |
| 6.90e+00 < x <= 7.10e+00 | 0.5000 | 0.0200 |
| 7.10e+00 < x <= 7.20e+00 | 0.0000 | 0.0300 |
| 7.20e+00 < x <= 7.60e+00 | 0.0000 | 0.0200 |
| 7.60e+00 < x | 0.0000 | 0.0400 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.0400 |
| 0.0000 | 0.0600 |
| nan | 0.0000 |
| 0.3333 | 0.0600 |
| 0.3333 | 0.0600 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.3333 | 0.0600 |
| 1.0000 | 0.0200 |
| 1.0000 | 0.0600 |
| 0.7500 | 0.0800 |
| 0.5000 | 0.0400 |
| nan | 0.0000 |
| 0.6667 | 0.0600 |
| 0.3333 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.5000 | 0.0400 |
| 0.3333 | 0.0600 |
| 0.0000 | 0.0600 |
| 0.2500 | 0.0800 |
| nan | 0.0000 |
| 0.0000 | 0.0200 |
| nan | 0.0000 |
| nan | 0.0000 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0200 |
Grouping modalities : 100%|█████████▉| 15274/15275 [00:02<00:00, 7187.52it/s]
Computing associations: 100%|██████████| 15275/15275 [00:03<00:00, 4329.98it/s]
Testing robustness : 0%| | 3/15275 [00:00<04:02, 63.07it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 5.4e+00 | 0.0882 | 0.3400 |
| 5.4e+00 < x <= 6.2e+00 | 0.6333 | 0.3000 |
| 6.2e+00 < x | 0.3056 | 0.3600 |
| target_rate | frequency |
|---|---|
| 0.1667 | 0.3600 |
| 0.6471 | 0.3400 |
| 0.2000 | 0.3000 |
--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=versicolor') (2/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.20e+00 | 0.7500 | 0.0400 |
| 2.20e+00 < x <= 2.40e+00 | 1.0000 | 0.0300 |
| 2.40e+00 < x <= 2.50e+00 | 0.6000 | 0.0500 |
| 2.50e+00 < x <= 2.60e+00 | 0.7500 | 0.0400 |
| 2.60e+00 < x <= 2.70e+00 | 0.5714 | 0.0700 |
| 2.70e+00 < x <= 2.80e+00 | 0.4444 | 0.0900 |
| 2.80e+00 < x <= 2.90e+00 | 0.6667 | 0.0600 |
| 2.90e+00 < x <= 3.00e+00 | 0.2857 | 0.1400 |
| 3.00e+00 < x <= 3.10e+00 | 0.3333 | 0.0900 |
| 3.10e+00 < x <= 3.20e+00 | 0.2222 | 0.0900 |
| 3.20e+00 < x <= 3.30e+00 | 0.0000 | 0.0400 |
| 3.30e+00 < x <= 3.40e+00 | 0.0000 | 0.0600 |
| 3.40e+00 < x <= 3.50e+00 | 0.0000 | 0.0600 |
| 3.50e+00 < x <= 3.70e+00 | 0.0000 | 0.0400 |
| 3.70e+00 < x <= 3.80e+00 | 0.0000 | 0.0500 |
| 3.80e+00 < x <= 4.10e+00 | 0.0000 | 0.0300 |
| 4.10e+00 < x | 0.0000 | 0.0200 |
| target_rate | frequency |
|---|---|
| nan | 0.0000 |
| 0.7500 | 0.0800 |
| 0.3333 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.5000 | 0.0400 |
| 0.4000 | 0.1000 |
| 0.7500 | 0.0800 |
| 0.3333 | 0.2400 |
| 0.0000 | 0.0400 |
| 0.2500 | 0.0800 |
| 0.5000 | 0.0400 |
| 0.1667 | 0.1200 |
| nan | 0.0000 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0200 |
| nan | 0.0000 |
Grouping modalities : 100%|█████████▉| 2515/2516 [00:00<00:00, 7890.53it/s]
Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4457.58it/s]
Testing robustness : 0%| | 0/2516 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.9e+00 | 0.6316 | 0.3800 |
| 2.9e+00 < x | 0.1452 | 0.6200 |
| target_rate | frequency |
|---|---|
| 0.5263 | 0.3800 |
| 0.2258 | 0.6200 |
--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=versicolor') (3/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.30e+00 | 0.0000 | 0.0400 |
| 1.30e+00 < x <= 1.40e+00 | 0.0000 | 0.1100 |
| 1.40e+00 < x <= 1.50e+00 | 0.0000 | 0.0900 |
| 1.50e+00 < x <= 1.60e+00 | 0.0000 | 0.0700 |
| 1.60e+00 < x <= 1.90e+00 | 0.0000 | 0.0300 |
| 1.90e+00 < x <= 3.50e+00 | 1.0000 | 0.0300 |
| 3.50e+00 < x <= 3.70e+00 | 1.0000 | 0.0200 |
| 3.70e+00 < x <= 4.00e+00 | 1.0000 | 0.0700 |
| 4.00e+00 < x <= 4.20e+00 | 1.0000 | 0.0300 |
| 4.20e+00 < x <= 4.30e+00 | 1.0000 | 0.0200 |
| 4.30e+00 < x <= 4.50e+00 | 1.0000 | 0.0500 |
| 4.50e+00 < x <= 4.60e+00 | 1.0000 | 0.0300 |
| 4.60e+00 < x <= 4.70e+00 | 1.0000 | 0.0300 |
| 4.70e+00 < x <= 4.80e+00 | 0.6667 | 0.0300 |
| 4.80e+00 < x <= 4.90e+00 | 0.5000 | 0.0400 |
| 4.90e+00 < x <= 5.00e+00 | 0.0000 | 0.0300 |
| 5.00e+00 < x <= 5.10e+00 | 0.1667 | 0.0600 |
| 5.10e+00 < x <= 5.40e+00 | 0.0000 | 0.0200 |
| 5.40e+00 < x <= 5.60e+00 | 0.0000 | 0.0500 |
| 5.60e+00 < x <= 5.70e+00 | 0.0000 | 0.0300 |
| 5.70e+00 < x <= 5.90e+00 | 0.0000 | 0.0300 |
| 5.90e+00 < x <= 6.10e+00 | 0.0000 | 0.0500 |
| 6.10e+00 < x <= 6.60e+00 | 0.0000 | 0.0200 |
| 6.60e+00 < x | 0.0000 | 0.0200 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.1400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0800 |
| nan | 0.0000 |
| 0.0000 | 0.0600 |
| 1.0000 | 0.0400 |
| nan | 0.0000 |
| 1.0000 | 0.0400 |
| 1.0000 | 0.0800 |
| nan | 0.0000 |
| 0.8571 | 0.1400 |
| nan | 0.0000 |
| 1.0000 | 0.0400 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0200 |
| 1.0000 | 0.0200 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0800 |
| 0.0000 | 0.0800 |
| nan | 0.0000 |
| 0.0000 | 0.0400 |
| nan | 0.0000 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0200 |
Grouping modalities : 100%|█████████▉| 10901/10902 [00:01<00:00, 7971.25it/s]
Computing associations: 100%|██████████| 10902/10902 [00:02<00:00, 4194.36it/s]
Testing robustness : 0%| | 0/10902 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.9e+00 | 0.0000 | 0.3400 |
| 1.9e+00 < x <= 4.8e+00 | 0.9677 | 0.3100 |
| 4.8e+00 < x | 0.0857 | 0.3500 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.3200 |
| 0.8889 | 0.3600 |
| 0.0625 | 0.3200 |
--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=versicolor') (4/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.00e-01 | 0.0000 | 0.0500 |
| 1.00e-01 < x <= 2.00e-01 | 0.0000 | 0.1700 |
| 2.00e-01 < x <= 3.00e-01 | 0.0000 | 0.0500 |
| 3.00e-01 < x <= 6.00e-01 | 0.0000 | 0.0700 |
| 6.00e-01 < x <= 1.00e+00 | 1.0000 | 0.0400 |
| 1.00e+00 < x <= 1.10e+00 | 1.0000 | 0.0200 |
| 1.10e+00 < x <= 1.20e+00 | 1.0000 | 0.0500 |
| 1.20e+00 < x <= 1.30e+00 | 1.0000 | 0.0800 |
| 1.30e+00 < x <= 1.40e+00 | 1.0000 | 0.0600 |
| 1.40e+00 < x <= 1.50e+00 | 0.8571 | 0.0700 |
| 1.50e+00 < x <= 1.60e+00 | 0.5000 | 0.0200 |
| 1.60e+00 < x <= 1.80e+00 | 0.1429 | 0.0700 |
| 1.80e+00 < x <= 1.90e+00 | 0.0000 | 0.0400 |
| 1.90e+00 < x <= 2.00e+00 | 0.0000 | 0.0400 |
| 2.00e+00 < x <= 2.20e+00 | 0.0000 | 0.0700 |
| 2.20e+00 < x <= 2.30e+00 | 0.0000 | 0.0500 |
| 2.30e+00 < x <= 2.40e+00 | 0.0000 | 0.0200 |
| 2.40e+00 < x | 0.0000 | 0.0300 |
| target_rate | frequency |
|---|---|
| nan | 0.0000 |
| 0.0000 | 0.2400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0400 |
| 1.0000 | 0.0600 |
| 1.0000 | 0.0200 |
| nan | 0.0000 |
| 1.0000 | 0.1000 |
| 0.5000 | 0.0400 |
| 0.8000 | 0.1000 |
| 1.0000 | 0.0400 |
| 0.1429 | 0.1400 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| nan | 0.0000 |
Grouping modalities : 100%|█████████▉| 3212/3213 [00:00<00:00, 8682.61it/s]
Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 4119.92it/s]
Testing robustness : 0%| | 0/3213 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 6.0e-01 | 0.0000 | 0.3400 |
| 6.0e-01 < x <= 1.5e+00 | 0.9688 | 0.3200 |
| 1.5e+00 < x | 0.0588 | 0.3400 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.3200 |
| 0.8750 | 0.3200 |
| 0.1667 | 0.3600 |
---------
---------
[MulticlassCarver] Fit y=virginica (2/2)
------
------
--- [QuantitativeDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
- [ContinuousDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
- [OrdinalDiscretizer] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
------
---------
------ [BinaryCarver] Fit Features(['sepal length (cm)__y=virginica', 'sepal width (cm)__y=virginica', 'petal length (cm)__y=virginica', 'petal width (cm)__y=virginica'])
--- [BinaryCarver] Fit Quantitative('sepal length (cm)__y=virginica') (1/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 4.40e+00 | 0.0000 | 0.0200 |
| 4.40e+00 < x <= 4.70e+00 | 0.0000 | 0.0400 |
| 4.70e+00 < x <= 4.80e+00 | 0.0000 | 0.0500 |
| 4.80e+00 < x <= 4.90e+00 | 0.0000 | 0.0300 |
| 4.90e+00 < x <= 5.00e+00 | 0.0000 | 0.0700 |
| 5.00e+00 < x <= 5.10e+00 | 0.0000 | 0.0600 |
| 5.10e+00 < x <= 5.30e+00 | 0.0000 | 0.0400 |
| 5.30e+00 < x <= 5.40e+00 | 0.0000 | 0.0300 |
| 5.40e+00 < x <= 5.50e+00 | 0.0000 | 0.0600 |
| 5.50e+00 < x <= 5.60e+00 | 0.3333 | 0.0300 |
| 5.60e+00 < x <= 5.70e+00 | 0.2500 | 0.0400 |
| 5.70e+00 < x <= 5.80e+00 | 0.6000 | 0.0500 |
| 5.80e+00 < x <= 5.90e+00 | 0.3333 | 0.0300 |
| 5.90e+00 < x <= 6.00e+00 | 0.3333 | 0.0300 |
| 6.00e+00 < x <= 6.10e+00 | 0.0000 | 0.0300 |
| 6.10e+00 < x <= 6.20e+00 | 0.3333 | 0.0300 |
| 6.20e+00 < x <= 6.30e+00 | 0.7143 | 0.0700 |
| 6.30e+00 < x <= 6.40e+00 | 0.7500 | 0.0400 |
| 6.40e+00 < x <= 6.50e+00 | 0.5000 | 0.0200 |
| 6.50e+00 < x <= 6.70e+00 | 0.3333 | 0.0600 |
| 6.70e+00 < x <= 6.80e+00 | 0.6667 | 0.0300 |
| 6.80e+00 < x <= 6.90e+00 | 0.6667 | 0.0300 |
| 6.90e+00 < x <= 7.10e+00 | 0.5000 | 0.0200 |
| 7.10e+00 < x <= 7.20e+00 | 1.0000 | 0.0300 |
| 7.20e+00 < x <= 7.60e+00 | 1.0000 | 0.0200 |
| 7.60e+00 < x | 1.0000 | 0.0400 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.0400 |
| 0.0000 | 0.0600 |
| nan | 0.0000 |
| 0.3333 | 0.0600 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0800 |
| 0.0000 | 0.0400 |
| nan | 0.0000 |
| 0.3333 | 0.0600 |
| 0.6667 | 0.0600 |
| 1.0000 | 0.0200 |
| 0.5000 | 0.0400 |
| 0.6667 | 0.0600 |
| 1.0000 | 0.0600 |
| 0.7500 | 0.0800 |
| nan | 0.0000 |
| 1.0000 | 0.0200 |
| nan | 0.0000 |
| nan | 0.0000 |
| 1.0000 | 0.0200 |
| 1.0000 | 0.0200 |
Grouping modalities : 100%|█████████▉| 15274/15275 [00:02<00:00, 7499.41it/s]
Computing associations: 100%|██████████| 15275/15275 [00:03<00:00, 4444.34it/s]
Testing robustness : 0%| | 0/15275 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 6.2e+00 | 0.1250 | 0.6400 |
| 6.2e+00 < x | 0.6944 | 0.3600 |
| target_rate | frequency |
|---|---|
| 0.1429 | 0.7000 |
| 0.8000 | 0.3000 |
--- [BinaryCarver] Fit Quantitative('sepal width (cm)__y=virginica') (2/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.20e+00 | 0.2500 | 0.0400 |
| 2.20e+00 < x <= 2.40e+00 | 0.0000 | 0.0300 |
| 2.40e+00 < x <= 2.50e+00 | 0.4000 | 0.0500 |
| 2.50e+00 < x <= 2.60e+00 | 0.2500 | 0.0400 |
| 2.60e+00 < x <= 2.70e+00 | 0.4286 | 0.0700 |
| 2.70e+00 < x <= 2.80e+00 | 0.5556 | 0.0900 |
| 2.80e+00 < x <= 2.90e+00 | 0.1667 | 0.0600 |
| 2.90e+00 < x <= 3.00e+00 | 0.4286 | 0.1400 |
| 3.00e+00 < x <= 3.10e+00 | 0.2222 | 0.0900 |
| 3.10e+00 < x <= 3.20e+00 | 0.5556 | 0.0900 |
| 3.20e+00 < x <= 3.30e+00 | 0.7500 | 0.0400 |
| 3.30e+00 < x <= 3.40e+00 | 0.1667 | 0.0600 |
| 3.40e+00 < x <= 3.50e+00 | 0.0000 | 0.0600 |
| 3.50e+00 < x <= 3.70e+00 | 0.2500 | 0.0400 |
| 3.70e+00 < x <= 3.80e+00 | 0.4000 | 0.0500 |
| 3.80e+00 < x <= 4.10e+00 | 0.0000 | 0.0300 |
| 4.10e+00 < x | 0.0000 | 0.0200 |
| target_rate | frequency |
|---|---|
| nan | 0.0000 |
| 0.0000 | 0.0800 |
| 0.6667 | 0.0600 |
| 1.0000 | 0.0200 |
| 0.5000 | 0.0400 |
| 0.6000 | 0.1000 |
| 0.2500 | 0.0800 |
| 0.5000 | 0.2400 |
| 1.0000 | 0.0400 |
| 0.0000 | 0.0800 |
| 0.0000 | 0.0400 |
| 0.1667 | 0.1200 |
| nan | 0.0000 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| 0.0000 | 0.0200 |
| nan | 0.0000 |
Grouping modalities : 100%|█████████▉| 2515/2516 [00:00<00:00, 7545.34it/s]
Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4394.20it/s]
Testing robustness : 1%| | 29/2516 [00:00<00:08, 302.81it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.4e+00 | 0.1429 | 0.0700 |
| 2.4e+00 < x <= 3.3e+00 | 0.4179 | 0.6700 |
| 3.3e+00 < x | 0.1538 | 0.2600 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.0800 |
| 0.4571 | 0.7000 |
| 0.0909 | 0.2200 |
--- [BinaryCarver] Fit Quantitative('petal length (cm)__y=virginica') (3/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.30e+00 | 0.0000 | 0.0400 |
| 1.30e+00 < x <= 1.40e+00 | 0.0000 | 0.1100 |
| 1.40e+00 < x <= 1.50e+00 | 0.0000 | 0.0900 |
| 1.50e+00 < x <= 1.60e+00 | 0.0000 | 0.0700 |
| 1.60e+00 < x <= 1.90e+00 | 0.0000 | 0.0300 |
| 1.90e+00 < x <= 3.50e+00 | 0.0000 | 0.0300 |
| 3.50e+00 < x <= 3.70e+00 | 0.0000 | 0.0200 |
| 3.70e+00 < x <= 4.00e+00 | 0.0000 | 0.0700 |
| 4.00e+00 < x <= 4.20e+00 | 0.0000 | 0.0300 |
| 4.20e+00 < x <= 4.30e+00 | 0.0000 | 0.0200 |
| 4.30e+00 < x <= 4.50e+00 | 0.0000 | 0.0500 |
| 4.50e+00 < x <= 4.60e+00 | 0.0000 | 0.0300 |
| 4.60e+00 < x <= 4.70e+00 | 0.0000 | 0.0300 |
| 4.70e+00 < x <= 4.80e+00 | 0.3333 | 0.0300 |
| 4.80e+00 < x <= 4.90e+00 | 0.5000 | 0.0400 |
| 4.90e+00 < x <= 5.00e+00 | 1.0000 | 0.0300 |
| 5.00e+00 < x <= 5.10e+00 | 0.8333 | 0.0600 |
| 5.10e+00 < x <= 5.40e+00 | 1.0000 | 0.0200 |
| 5.40e+00 < x <= 5.60e+00 | 1.0000 | 0.0500 |
| 5.60e+00 < x <= 5.70e+00 | 1.0000 | 0.0300 |
| 5.70e+00 < x <= 5.90e+00 | 1.0000 | 0.0300 |
| 5.90e+00 < x <= 6.10e+00 | 1.0000 | 0.0500 |
| 6.10e+00 < x <= 6.60e+00 | 1.0000 | 0.0200 |
| 6.60e+00 < x | 1.0000 | 0.0200 |
| target_rate | frequency |
|---|---|
| 0.0000 | 0.1400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0800 |
| nan | 0.0000 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0400 |
| nan | 0.0000 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0800 |
| nan | 0.0000 |
| 0.1429 | 0.1400 |
| nan | 0.0000 |
| 0.0000 | 0.0400 |
| 1.0000 | 0.0200 |
| 1.0000 | 0.0200 |
| 0.0000 | 0.0200 |
| 1.0000 | 0.0400 |
| 1.0000 | 0.0800 |
| 1.0000 | 0.0800 |
| nan | 0.0000 |
| 1.0000 | 0.0400 |
| nan | 0.0000 |
| 1.0000 | 0.0200 |
| 1.0000 | 0.0200 |
Grouping modalities : 100%|█████████▉| 10901/10902 [00:01<00:00, 7786.74it/s]
Computing associations: 100%|██████████| 10902/10902 [00:02<00:00, 4258.68it/s]
Testing robustness : 0%| | 0/10902 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 4.8e+00 | 0.0154 | 0.6500 |
| 4.8e+00 < x | 0.9143 | 0.3500 |
| target_rate | frequency |
|---|---|
| 0.0588 | 0.6800 |
| 0.9375 | 0.3200 |
--- [BinaryCarver] Fit Quantitative('petal width (cm)__y=virginica') (4/4)
[BinaryCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.00e-01 | 0.0000 | 0.0500 |
| 1.00e-01 < x <= 2.00e-01 | 0.0000 | 0.1700 |
| 2.00e-01 < x <= 3.00e-01 | 0.0000 | 0.0500 |
| 3.00e-01 < x <= 6.00e-01 | 0.0000 | 0.0700 |
| 6.00e-01 < x <= 1.00e+00 | 0.0000 | 0.0400 |
| 1.00e+00 < x <= 1.10e+00 | 0.0000 | 0.0200 |
| 1.10e+00 < x <= 1.20e+00 | 0.0000 | 0.0500 |
| 1.20e+00 < x <= 1.30e+00 | 0.0000 | 0.0800 |
| 1.30e+00 < x <= 1.40e+00 | 0.0000 | 0.0600 |
| 1.40e+00 < x <= 1.50e+00 | 0.1429 | 0.0700 |
| 1.50e+00 < x <= 1.60e+00 | 0.5000 | 0.0200 |
| 1.60e+00 < x <= 1.80e+00 | 0.8571 | 0.0700 |
| 1.80e+00 < x <= 1.90e+00 | 1.0000 | 0.0400 |
| 1.90e+00 < x <= 2.00e+00 | 1.0000 | 0.0400 |
| 2.00e+00 < x <= 2.20e+00 | 1.0000 | 0.0700 |
| 2.20e+00 < x <= 2.30e+00 | 1.0000 | 0.0500 |
| 2.30e+00 < x <= 2.40e+00 | 1.0000 | 0.0200 |
| 2.40e+00 < x | 1.0000 | 0.0300 |
| target_rate | frequency |
|---|---|
| nan | 0.0000 |
| 0.0000 | 0.2400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0400 |
| 0.0000 | 0.0600 |
| 0.0000 | 0.0200 |
| nan | 0.0000 |
| 0.0000 | 0.1000 |
| 0.5000 | 0.0400 |
| 0.2000 | 0.1000 |
| 0.0000 | 0.0400 |
| 0.8571 | 0.1400 |
| 1.0000 | 0.0200 |
| 1.0000 | 0.0400 |
| 1.0000 | 0.0400 |
| 1.0000 | 0.0600 |
| 1.0000 | 0.0200 |
| nan | 0.0000 |
Grouping modalities : 100%|█████████▉| 3212/3213 [00:00<00:00, 7088.72it/s]
Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 3949.35it/s]
Testing robustness : 0%| | 0/3213 [00:00<?, ?it/s]
[BinaryCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.5e+00 | 0.0152 | 0.6600 |
| 1.5e+00 < x | 0.9412 | 0.3400 |
| target_rate | frequency |
|---|---|
| 0.0625 | 0.6400 |
| 0.8333 | 0.3600 |
---------
AutoCarver analysis
Carving Summary
[14]:
auto_carver.summary
[14]:
| feature | cramerv | tschuprowt | n_mod | label | content | target_rate | frequency |
|---|---|---|---|---|---|---|---|
| Quantitative('sepal length (cm)__y=versicolor') | 0.464436 | 0.390543 | 3 | 0 | x <= 5.4e+00 | 0.088235 | 0.34 |
| | | | | 1 | 5.4e+00 < x <= 6.2e+00 | 0.633333 | 0.30 |
| | | | | 2 | 6.2e+00 < x | 0.305556 | 0.36 |
| Quantitative('sepal width (cm)__y=versicolor') | 0.480207 | 0.480207 | 2 | 0 | x <= 2.9e+00 | 0.631579 | 0.38 |
| | | | | 1 | 2.9e+00 < x | 0.145161 | 0.62 |
| Quantitative('petal length (cm)__y=versicolor') | 0.912237 | 0.767096 | 3 | 0 | x <= 1.9e+00 | 0.000000 | 0.34 |
| | | | | 1 | 1.9e+00 < x <= 4.8e+00 | 0.967742 | 0.31 |
| | | | | 2 | 4.8e+00 < x | 0.085714 | 0.35 |
| Quantitative('petal width (cm)__y=versicolor') | 0.933300 | 0.784809 | 3 | 0 | x <= 6.0e-01 | 0.000000 | 0.34 |
| | | | | 1 | 6.0e-01 < x <= 1.5e+00 | 0.968750 | 0.32 |
| | | | | 2 | 1.5e+00 < x | 0.058824 | 0.34 |
| Quantitative('sepal length (cm)__y=virginica') | 0.559144 | 0.559144 | 2 | 0 | x <= 6.2e+00 | 0.125000 | 0.64 |
| | | | | 1 | 6.2e+00 < x | 0.694444 | 0.36 |
| Quantitative('sepal width (cm)__y=virginica') | 0.266452 | 0.224058 | 3 | 0 | x <= 2.4e+00 | 0.142857 | 0.07 |
| | | | | 1 | 2.4e+00 < x <= 3.3e+00 | 0.417910 | 0.67 |
| | | | | 2 | 3.3e+00 < x | 0.153846 | 0.26 |
| Quantitative('petal length (cm)__y=virginica') | 0.889524 | 0.889524 | 2 | 0 | x <= 4.8e+00 | 0.015385 | 0.65 |
| | | | | 1 | 4.8e+00 < x | 0.914286 | 0.35 |
| Quantitative('petal width (cm)__y=virginica') | 0.910463 | 0.910463 | 2 | 0 | x <= 1.5e+00 | 0.015152 | 0.66 |
| | | | | 1 | 1.5e+00 < x | 0.941176 | 0.34 |
As requested with ordinal_encoding=True, output labels are integers of modalities
Features have been carved for two distinct binary targets:
y=versicolor: dummy of target iris_type taking value "versicolor"
y=virginica: dummy of target iris_type taking value "virginica"
For quantitative feature petal width (cm), for y=virginica, the selected combination of modalities groups petal widths as follows:
label 0: lower or equal to 1.5cm (content="x <= 1.5e+00")
label 1: higher than 1.5cm (content="1.5e+00 < x")
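The cramerv and tschuprowt columns of the summary can be checked by hand from the carved distributions. The sketch below computes both statistics from a contingency table of counts, using only the standard library. The counts are derived from the frequencies and target rates printed above for a 100-row train_set. Applying a continuity (Yates) correction on 2x2 tables only is an assumption, chosen because it reproduces the reported values. The association function is illustrative, not AutoCarver's code.

```python
from math import sqrt


def association(table):
    """Return (Cramér's V, Tschuprow's T) for a contingency table given as a list of rows."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # assumption: continuity correction only when there is one degree of freedom
    correction = 0.5 if (n_rows - 1) * (n_cols - 1) == 1 else 0.0
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / n
            diff = max(abs(table[i][j] - expected) - correction, 0.0)
            chi2 += diff**2 / expected
    cramerv = sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    tschuprowt = sqrt(chi2 / (n * sqrt((n_rows - 1) * (n_cols - 1))))
    return cramerv, tschuprowt


# rows: carved modalities, columns: (other species, versicolor)
sepal_width = [[14, 24], [53, 9]]  # 38 rows with 24 versicolor, 62 rows with 9
sepal_length = [[31, 3], [11, 19], [25, 11]]  # 34, 30 and 36 rows
v2, t2 = association(sepal_width)
v3, t3 = association(sepal_length)
print(round(v2, 6), round(t2, 6))
print(round(v3, 6), round(t3, 6))
```

With a binary target, both statistics coincide for features carved into two modalities, while Tschuprow's T is lower than Cramér's V for three modalities or more. Ranking by Tschuprow's T therefore penalizes combinations with more modalities.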
Detailed overview of tested combinations
[15]:
features['sepal length (cm)__y=versicolor'].history.head(7)
[15]:
| | info | cramerv | tschuprowt | combination | n_mod | dropna | train | viable | dev |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Raw distribution (n_mod=26>max_n_mod=5) | 0.590426 | 0.264047 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 26 | False | NaN | NaN | NaN |
| 1 | Not viable | 0.483288 | 0.406395 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 3 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Non-representative ... |
| 2 | Not viable | 0.516114 | 0.392162 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 4 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Non-representative ... |
| 3 | Not viable | 0.465045 | 0.391055 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 3 | False | {'viable': True, 'info': ''} | False | {'viable': False, 'info': 'Inversion of target... |
| 4 | Best for tschuprowt and max_n_mod=5 | 0.464436 | 0.390543 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 3 | False | {'viable': True, 'info': ''} | True | {'viable': True, 'info': ''} |
| 5 | Not checked | 0.463821 | 0.390025 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 3 | False | NaN | NaN | NaN |
| 6 | Not checked | 0.462885 | 0.389238 | {'x <= 4.40e+00': 'x <= 4.40e+00', '4.40e+00 <... | 3 | False | NaN | NaN | NaN |
[16]:
features['sepal length (cm)__y=versicolor'].history.dev[1]
[16]:
{'viable': False, 'info': 'Non-representative modality for min_freq=5.00%'}
The most associated combination of feature sepal length (cm)__y=versicolor (the first tested out, where info!="Raw distribution") did not pass the viability tests. When looking in history.dev:
"Non-representative modality for min_freq=5.00%": tells us that a modality is under-represented on dev_set, although the combination is viable on train_set
For feature sepal length (cm)__y=versicolor, the 4th combination is the first to pass the tests: info="Best for tschuprowt and max_n_mod=5"
Tschuprow’s T with iris_type is 0.390543 for this combination (by default, combinations are ranked according to this statistic)
The following combinations (less associated with the target) were not tested: info="Not checked"
For all combinations, dropna=False means that it is not a combination in which nans are being grouped with other modalities (as requested with dropna=False)
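The way the history reads (a few "Not viable" rows, one "Best" row, then "Not checked" rows) is what a first-viable search over ranked combinations produces. A minimal sketch of that search follows. The function first_viable and the dictionary layout are hypothetical. The tschuprowt values and viability flags are copied from the history table above, with None standing for combinations whose viability was never evaluated.

```python
def first_viable(candidates, is_viable):
    """Walk candidates by decreasing association and stop testing at the first viable one."""
    history, best = [], None
    for candidate in sorted(candidates, key=lambda c: c["tschuprowt"], reverse=True):
        if best is not None:
            info = "Not checked"  # less associated than the retained combination
        elif is_viable(candidate):
            info, best = "Best", candidate
        else:
            info = "Not viable"
        history.append((candidate["tschuprowt"], info))
    return best, history


candidates = [
    {"tschuprowt": 0.406395, "viable": False},
    {"tschuprowt": 0.392162, "viable": False},
    {"tschuprowt": 0.391055, "viable": False},
    {"tschuprowt": 0.390543, "viable": True},
    {"tschuprowt": 0.390025, "viable": None},  # never tested in the notebook
    {"tschuprowt": 0.389238, "viable": None},  # never tested in the notebook
]
best, history = first_viable(candidates, is_viable=lambda c: c["viable"])
for row in history:
    print(row)
```

Stopping at the first viable combination avoids running the robustness checks on the thousands of remaining candidates, which is consistent with the "Testing robustness" progress bars in the log stopping after a handful of iterations.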
Saving and Loading AutoCarver
Saving
All Carvers can safely be stored as a .json file.
[17]:
auto_carver.save("multiclass_carver.json")
Loading
Carvers can safely be loaded from a .json file.
[18]:
from AutoCarver import MulticlassCarver
auto_carver = MulticlassCarver.load("multiclass_carver.json")
Applying AutoCarver
[19]:
dev_set_processed = auto_carver.transform(dev_set)
[20]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[20]:
| | sepal length (cm)__y=versicolor | sepal width (cm)__y=versicolor | petal length (cm)__y=versicolor | petal width (cm)__y=versicolor | sepal length (cm)__y=virginica | sepal width (cm)__y=virginica | petal length (cm)__y=virginica | petal width (cm)__y=virginica |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.36 | 0.38 | 0.32 | 0.32 | 0.7 | 0.08 | 0.68 | 0.64 |
| 1.0 | 0.34 | 0.62 | 0.36 | 0.32 | 0.3 | 0.70 | 0.32 | 0.36 |
| 2.0 | 0.30 | NaN | 0.32 | 0.36 | NaN | 0.22 | NaN | NaN |
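As a rough illustration of what a min_freq check looks at, the frequencies in this table can be compared with the 5 % threshold quoted in the viability message earlier. This is a minimal sketch under assumptions: the helper `rare_modalities` is hypothetical, and AutoCarver's own viability tests may apply further criteria.

```python
MIN_FREQ = 0.05  # 5 %, the threshold quoted in the viability message

# Normalized dev-set frequencies of two carved features, copied from the table above
dev_frequencies = {
    "sepal width (cm)__y=virginica": {0.0: 0.08, 1.0: 0.70, 2.0: 0.22},
    "petal width (cm)__y=virginica": {0.0: 0.64, 1.0: 0.36},
}


def rare_modalities(frequencies, min_freq):
    """Return the modalities whose share of rows is strictly below min_freq."""
    return [modality for modality, share in frequencies.items() if share < min_freq]


for name, frequencies in dev_frequencies.items():
    # An empty list means every modality is represented at least min_freq of the time
    print(name, rare_modalities(frequencies, MIN_FREQ))
```

The smallest modality here covers 8 % of the dev set, so neither feature has a modality below the threshold.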
Feature Selection
Selectors settings
Features to select from
Here all features have been carved using MulticlassCarver, hence all features are qualitative.
Number of features to select
The attribute n_best_per_type allows one to choose the number of features to be selected per data type (quantitative and qualitative).
[21]:
n_best_per_type = 4 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics
Using Selectors
[22]:
from AutoCarver.selectors import ClassificationSelector
# select the most target associated qualitative features
feature_selector = ClassificationSelector(
features=features,
n_best_per_type=n_best_per_type,
verbose=True, # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
best_features
[ClassificationSelector] Selected Features
| | feature | NanMeasure | ModeMeasure | TschuprowtMeasure | TschuprowtRank | TschuprowtFilter | TschuprowtWith |
|---|---|---|---|---|---|---|---|
| 3 | Quantitative('petal width (cm)__y=versicolor') | 0.0000 | 0.3400 | 0.9558 | 0 | 0.0000 | itself |
| 2 | Quantitative('petal length (cm)__y=versicolor') | 0.0000 | 0.3500 | 0.9421 | 1 | 0.9274 | petal width (cm)__y=versicolor |
| 7 | Quantitative('petal width (cm)__y=virginica') | 0.0000 | 0.6600 | 0.7857 | 2 | 0.8409 | petal width (cm)__y=versicolor |
| 6 | Quantitative('petal length (cm)__y=virginica') | 0.0000 | 0.6500 | 0.7695 | 3 | 0.8675 | petal width (cm)__y=virginica |
| 0 | Quantitative('sepal length (cm)__y=versicolor') | 0.0000 | 0.3600 | 0.6728 | 4 | 0.6888 | petal length (cm)__y=versicolor |
| 4 | Quantitative('sepal length (cm)__y=virginica') | 0.0000 | 0.6400 | 0.5441 | 5 | 0.8409 | sepal length (cm)__y=versicolor |
| 1 | Quantitative('sepal width (cm)__y=versicolor') | 0.0000 | 0.6200 | 0.4950 | 6 | 0.5069 | petal width (cm)__y=versicolor |
| 5 | Quantitative('sepal width (cm)__y=virginica') | 0.0000 | 0.6700 | 0.4868 | 7 | 0.5168 | petal width (cm)__y=versicolor |
[22]:
Features(['petal width (cm)__y=versicolor', 'petal length (cm)__y=versicolor', 'petal width (cm)__y=virginica', 'petal length (cm)__y=virginica'])
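With no filtering thresholds set, this selection is consistent with simply ranking features by TschuprowtMeasure and keeping the first n_best_per_type. The sketch below reproduces that ranking from the values in the table above; `top_n` is a hypothetical helper, and ClassificationSelector itself may do more than this.

```python
# TschuprowtMeasure per feature, copied from the selector's output above
tschuprowt_measure = {
    "petal width (cm)__y=versicolor": 0.9558,
    "petal length (cm)__y=versicolor": 0.9421,
    "petal width (cm)__y=virginica": 0.7857,
    "petal length (cm)__y=virginica": 0.7695,
    "sepal length (cm)__y=versicolor": 0.6728,
    "sepal length (cm)__y=virginica": 0.5441,
    "sepal width (cm)__y=versicolor": 0.4950,
    "sepal width (cm)__y=virginica": 0.4868,
}


def top_n(measures, n):
    """Return the n feature names with the highest measure, best first."""
    return sorted(measures, key=measures.get, reverse=True)[:n]


for name in top_n(tschuprowt_measure, 4):
    print(name)
```

The four names printed are the four petal features, in the same order as the returned Features object.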
[23]:
train_set_processed[best_features].head()
[23]:
| | petal width (cm)__y=versicolor | petal length (cm)__y=versicolor | petal width (cm)__y=virginica | petal length (cm)__y=virginica |
|---|---|---|---|---|
| 136 | 2.0 | 2.0 | 1.0 | 1.0 |
| 17 | 0.0 | 0.0 | 0.0 | 0.0 |
| 142 | 2.0 | 2.0 | 1.0 | 1.0 |
| 59 | 1.0 | 1.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 |
Feature petal width (cm)__y=versicolor is the most associated with the target iris_type:
Tschuprow’s T value is TschuprowtMeasure=0.9558
It has 0 % of NaNs (NanMeasure=0.0)
Its mode, 0, represents 34 % of observed data (ModeMeasure=0.3400)
Feature petal length (cm)__y=versicolor is strongly associated with feature petal width (cm)__y=versicolor:
Tschuprow’s T value is TschuprowtFilter=0.9274 for TschuprowtWith=petal width (cm)__y=versicolor
Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set).
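For readers unfamiliar with the statistic, Tschuprow's T for an r x c contingency table with n observations is sqrt(chi2 / (n * sqrt((r - 1) * (c - 1)))), where chi2 is Pearson's chi-squared statistic. The sketch below is the textbook formula in plain Python, without any correction; it assumes no row or column of the table is empty, and AutoCarver's implementation may differ in its details. The two tables are made-up examples.

```python
import math


def tschuprow_t(table):
    """Tschuprow's T for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    # Pearson's chi-squared: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    n_rows, n_cols = len(table), len(table[0])
    return math.sqrt(chi2 / (n * math.sqrt((n_rows - 1) * (n_cols - 1))))


# Made-up example: each modality maps to exactly one class (perfect association)
perfect = [[10, 0, 0], [0, 10, 0], [0, 0, 10]]
# Made-up example: modalities carry no information about the class
independent = [[5, 5], [5, 5]]

print(round(tschuprow_t(perfect), 6))      # 1.0
print(round(tschuprow_t(independent), 6))  # 0.0
```

A value close to 1, such as TschuprowtMeasure=0.9558, therefore indicates that the carved feature almost determines the target, while values near 0 indicate little association.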
Modeling
Fitting model on train data
[24]:
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
# Encode string labels to integers
label_encoder = LabelEncoder()
train_set_processed[target] = label_encoder.fit_transform(train_set[target])
model = XGBClassifier(objective='multi:softmax')
model.fit(train_set_processed[best_features], train_set_processed[target])
Note: in this environment, displaying the fitted estimator raised AttributeError: 'super' object has no attribute '__sklearn_tags__' twice (once per rich-display formatter), from sklearn's estimator_html_repr while it called check_is_fitted and get_tags. The error comes from rendering the HTML representation, not from model.fit; the plain-text representation of the fitted model follows.
[24]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
colsample_bylevel=None, colsample_bynode=None,
colsample_bytree=None, device=None, early_stopping_rounds=None,
enable_categorical=False, eval_metric=None, feature_types=None,
gamma=None, grow_policy=None, importance_type=None,
interaction_constraints=None, learning_rate=None, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=None, max_leaves=None,
min_child_weight=None, missing=nan, monotone_constraints=None,
multi_strategy=None, n_estimators=None, n_jobs=None,
num_parallel_tree=None, objective='multi:softmax', ...)
Saving model
[25]:
model.save_model("multiclass_xgboost.json")
Prediction on dev dataset and performance
[26]:
from sklearn.metrics import accuracy_score
dev_set_processed[target] = label_encoder.transform(dev_set[target])
dev_pred = model.predict(dev_set_processed[best_features])
accuracy_score(dev_set_processed[target], dev_pred)
[26]:
0.9
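The accuracy reported here is the share of rows whose predicted class equals the true class. As a minimal sketch of that computation, with a per-pair error count added, the code below uses made-up encoded labels; they are not the notebook's actual predictions, and the helper names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Made-up encoded labels (0, 1, 2), for illustration only
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 2, 2]

print(accuracy(y_true, y_pred))  # 0.9
print(confusion_counts(y_true, y_pred)[(1, 2)])  # 1
```

Looking at the (true, predicted) pairs alongside the overall score shows which classes are being confused, which a single accuracy figure hides.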
What’s next?
Thanks to Carvers, all of your features are now optimally processed for your classification task!
As a final step towards your model, Selectors can be handy tools for target-optimal data pre-selection, so make sure to check out the Selectors Examples!
Well done!
Your commitment to achieving optimal results in multiclass classification tasks shines through in your meticulous use of AutoCarver’s MulticlassCarver for data preprocessing. By fine-tuning and optimizing your dataset, you have set the stage for robust and accurate machine learning models.
The MulticlassCarver has proven to be a valuable ally in your pursuit of excellence, carving out a path toward enhanced feature representation and model interpretability. Your dedication to refining the data preprocessing steps reflects a commitment to extracting the maximum value from your datasets.
We extend our sincere appreciation for choosing AutoCarver as your companion in the data preprocessing journey. Your use of AutoCarver demonstrates a dedication to leveraging cutting-edge tools for achieving excellence in multiclass classification tasks.
As you transition to the modeling phase, may the carefully crafted features and preprocessing steps contribute to the success of your predictive models. We’re excited to see the impact of your work and are grateful for the opportunity to be part of your data science endeavors.
Thank you for trusting AutoCarver, and we wish you continued success in your data-driven ventures.