Carvers
The core of AutoCarver resides in its Carvers, which provide the following Data Optimization steps:
Identifying the most associated combination from all ordered combinations of modalities
Testing all combinations of ``nan``s grouped to one of those modalities
- Target-specific tools allow for association optimization per desired task:
Classification tasks
Binary Classification
Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).
At the basis of BinaryCarver’s built-in association measures lies pandas.crosstab.
It is computed only once per feature \(x\) against the binary target \(y\).
The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add. powered, implementation of pandas.groupby.
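As a rough illustration of the idea above, the sketch below computes the crosstab once, enumerates every split of the ordered modalities into consecutive groups, and sums crosstab rows per group. The helper names `ordered_groupings` and `grouped_crosstab` are hypothetical, and the use of `numpy.add.reduceat` is an assumption: the text only names `numpy.add`, so this is not necessarily how AutoCarver does it.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def ordered_groupings(n_modalities, max_n_mod):
    """Yield the start indices of every split of ordered modalities
    into at most max_n_mod consecutive groups (hypothetical helper)."""
    for n_groups in range(1, min(max_n_mod, n_modalities) + 1):
        # choose n_groups - 1 cut points among the n_modalities - 1 gaps
        for cuts in combinations(range(1, n_modalities), n_groups - 1):
            yield (0, *cuts)


def grouped_crosstab(xtab_values, starts):
    """Sum consecutive crosstab rows so that each group becomes one row."""
    return np.add.reduceat(xtab_values, starts, axis=0)


# toy data: an ordered feature x with 4 modalities and a binary target y
x = pd.Series(["a", "a", "b", "b", "c", "c", "d", "d"], name="x")
y = pd.Series([0, 0, 0, 1, 1, 0, 1, 1], name="y")

xtab = pd.crosstab(x, y)  # computed only once per feature
groupings = list(ordered_groupings(len(xtab), max_n_mod=3))

# crosstab of the grouping {a, b} | {c, d}, derived without touching x again
merged = grouped_crosstab(xtab.to_numpy(), (0, 2))
```

Because every candidate grouping is derived from the same small crosstab, the cost of evaluating a combination does not depend on the number of rows in the dataset.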
BinaryCarver takes advantage of scipy.stats.chi2_contingency to perform association measuring.
It gives Pearson’s \(\chi^2\) statistics computed from crosstabs.
Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified, taking the binary target \(y\) into account, to improve performance.
Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.
For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
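The two formulas above can be sketched directly on top of `scipy.stats.chi2_contingency`. This is a minimal sketch, not the library's implementation: the function name `cramerv_tschuprowt` is hypothetical, and disabling Yates' correction (`correction=False`) is an assumption the text does not confirm.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramerv_tschuprowt(xtab):
    """Return (V, T) for a crosstab with one row per modality of x
    and two columns for the binary target (hypothetical helper)."""
    xtab = np.asarray(xtab)
    n_obs = xtab.sum()       # n: number of observations
    n_mod_x = xtab.shape[0]  # n_x: number of modalities in this combination

    # Pearson's chi2 statistic; correction=False is an assumption
    chi2 = chi2_contingency(xtab, correction=False)[0]

    cramerv = np.sqrt(chi2 / n_obs)
    # undefined for a single modality (n_x - 1 = 0)
    tschuprowt = cramerv / np.sqrt(np.sqrt(n_mod_x - 1))
    return cramerv, tschuprowt


# two candidate combinations of the same feature
v2, t2 = cramerv_tschuprowt([[3, 1], [1, 3]])
v3, t3 = cramerv_tschuprowt([[30, 10], [20, 20], [10, 30]])
```

With two modalities \(T\) equals \(V\); with more modalities \(T\) is smaller than \(V\), which is why ranking by \(T\) tends to favor combinations with fewer modalities.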
- class AutoCarver.BinaryCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.
Examples
- Parameters:
dropna (bool, optional) – True, try to group nan with other modalities. False, nan are ignored (not grouped), by default True
max_n_mod (int, optional) – Maximum number of modalities per feature, by default 5
The combination with the best association will be selected. All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 3 (faster, more robust) and 7 (slower, less robust)
- Keyword Arguments:
discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2
ordinal_encoding (bool, optional) – Whether or not to ordinal encode Features, by default True
combinations (BinaryCombinationEvaluator, optional) – Metric to perform association measure between Features and target.
Tip
Use Tschuprow’s T Combinations for fewer, more robust modalities
Use Cramér’s V Combinations for more, less robust modalities
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame = None, y_dev: Series = None) None
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
- Parameters:
X (DataFrame) – Dataset to determine Features’ optimal carving.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
History of discretization process for all features
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.
- Parameters:
file_name (str) – String of saved Carver’s .json file name.
- Returns:
A fitted Carver.
- Return type:
BaseDiscretizer
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Summary of discretization process for all features
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Multiclass Classification
Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).
For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).
MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).
For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
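The indicator construction described above can be sketched with pandas. This is an illustrative sketch on toy data: the column naming scheme `Y_<class>` is hypothetical, and it builds one indicator per observed class without settling which classes the library itself iterates over.

```python
import pandas as pd

# toy multiclass target
y = pd.Series(["low", "mid", "high", "mid", "low", "high"], name="y")

# one binary indicator per class: Y_k = 1 when y == y_k, else 0
classes = sorted(y.unique())
indicators = pd.DataFrame(
    {f"Y_{cls}": (y == cls).astype(int) for cls in classes}
)

# each indicator column is a valid binary target (only 0 and 1),
# so a binary association measure can be applied to each in turn
```

Each column of `indicators` satisfies the binary target requirement stated for BinaryCarver, which is what makes repeated application possible.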
- class AutoCarver.MulticlassCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.
Examples
Multiclass Classification Example
- Parameters:
dropna (bool, optional) – True, try to group nan with other modalities. False, nan are ignored (not grouped), by default True
max_n_mod (int, optional) – Maximum number of modalities per feature, by default 5
The combination with the best association will be selected. All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 3 (faster, more robust) and 7 (slower, less robust)
- Keyword Arguments:
discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2
ordinal_encoding (bool, optional) – Whether or not to ordinal encode Features, by default True
combinations (BinaryCombinationEvaluator, optional) – Metric to perform association measure between Features and target.
Tip
Use Tschuprow’s T Combinations for fewer, more robust modalities
Use Cramér’s V Combinations for more, less robust modalities
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame = None, y_dev: Series = None) None
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
- Parameters:
X (DataFrame) – Dataset to determine Features’ optimal carving.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
History of discretization process for all features
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.
- Parameters:
file_name (str) – String of saved Carver’s .json file name.
- Returns:
A fitted Carver.
- Return type:
BaseDiscretizer
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Summary of discretization process for all features
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Regression tasks
Continuous Regression
Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).
The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).
For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
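A minimal sketch of this measure, assuming the target is simply split by modality before being passed to `scipy.stats.kruskal`. The helper name `kruskal_h` and the toy data are hypothetical.

```python
import pandas as pd
from scipy.stats import kruskal


def kruskal_h(x, y):
    """Kruskal-Wallis H of y across the modalities of x (hypothetical helper)."""
    samples = [group.to_numpy() for _, group in y.groupby(x)]
    return kruskal(*samples).statistic


# toy data: a feature with 3 ordered modalities and a continuous target
x = pd.Series(list("aaabbbccc"), name="x")
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0], name="y")

# combination 1: modalities kept separate, a | b | c
h_separate = kruskal_h(x, y)

# combination 2: a and b grouped together, {a, b} | c
x_grouped = x.map({"a": "a-b", "b": "a-b", "c": "c"})
h_grouped = kruskal_h(x_grouped, y)
```

On this toy data the three-modality combination yields the higher \(H\), so it would be ranked above the grouped one.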
- class AutoCarver.ContinuousCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.
For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.
Examples
- Parameters:
dropna (bool, optional) – True, try to group nan with other modalities. False, nan are ignored (not grouped), by default True
max_n_mod (int, optional) – Maximum number of modalities per feature, by default 5
The combination with the best association will be selected. All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 3 (faster, more robust) and 7 (slower, less robust)
- Keyword Arguments:
discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2
combinations (ContinuousCombinationEvaluator, optional) – Metric to perform association measure between Features and target.
Currently, only Kruskal’s H Combinations are implemented.
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame = None, y_dev: Series = None) None
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
- Parameters:
X (DataFrame) – Dataset to determine Features’ optimal carving.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
History of discretization process for all features
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.
- Parameters:
file_name (str) – String of saved Carver’s .json file name.
- Returns:
A fitted Carver.
- Return type:
BaseDiscretizer
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Summary of discretization process for all features
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame