Carvers

The core of AutoCarver resides in its Carvers, which provide the following data optimization steps:

  1. Identifying the most associated combination from all ordered combinations of modalities

  2. Testing all combinations of ``nan``s grouped to one of those modalities
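The two steps above can be sketched in plain Python. This is only an illustration, assuming that "ordered combinations" means groupings of consecutive modalities. The helper names `consecutive_groupings` and `nan_placements` are hypothetical and are not part of AutoCarver.

```python
from itertools import combinations


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into at most max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # choose n_groups - 1 cut points among the n - 1 gaps between modalities
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


def nan_placements(grouping, nan_label="nan"):
    """Yield copies of a grouping with the nan label attached to each group in turn."""
    for i in range(len(grouping)):
        yield [
            group + [nan_label] if j == i else list(group)
            for j, group in enumerate(grouping)
        ]


groupings = list(consecutive_groupings(["a", "b", "c", "d"], max_n_mod=3))
with_nan = list(nan_placements([["a"], ["b", "c"]]))
```

Each candidate produced this way would then be scored with an association measure, and the best-scoring one kept.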

Target-specific tools allow for association optimization per desired task:

Classification tasks

Binary Classification

Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).

BinaryCarver’s built-in association measures are based on pandas.crosstab. It is computed only once per feature \(x\) against the binary target \(y\). The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
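As a rough illustration of this idea (not the library's actual code, which is not shown here), the crosstab can be computed once and the rows of a grouping summed with `numpy.add.reduceat`. The toy data and the choice of `reduceat` are assumptions made for this sketch.

```python
import numpy as np
import pandas as pd

# toy feature and binary target (made-up values)
x = pd.Series(["a", "a", "b", "b", "c", "c", "d", "d"])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])

# computed once per feature: rows are modalities of x, columns are values of y
xtab = pd.crosstab(x, y)

# grouping {a, b} and {c, d}: the groups start at row positions 0 and 2
starts = [0, 2]
grouped = np.add.reduceat(xtab.to_numpy(), starts, axis=0)
```

Any other grouping of consecutive modalities only needs a different `starts` list, so the original data never has to be scanned again.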

BinaryCarver takes advantage of scipy.stats.chi2_contingency to measure association. It gives Pearson’s \(\chi^2\) statistic computed from crosstabs.

Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This formula is a simplification that relies on the target \(y\) being binary, which improves performance.

Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
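The formulas above can be checked on a small made-up crosstab. This is a minimal sketch that applies the formulas as written in this section; the counts are invented and the variable names are illustrative only.

```python
from math import sqrt

from scipy.stats import chi2_contingency

# rows: 3 grouped modalities of x, columns: y = 0 and y = 1 (made-up counts)
xtab = [[30, 10], [20, 20], [10, 30]]

chi2 = chi2_contingency(xtab)[0]   # Pearson's chi-squared statistic
n = sum(sum(row) for row in xtab)  # number of observations
n_x = len(xtab)                    # number of modalities in this combination

cramers_v = sqrt(chi2 / n)
tschuprows_t = cramers_v / sqrt(sqrt(n_x - 1))
```

With a fixed \(V\), \(T\) decreases as \(n_x\) grows, so it penalizes combinations that use more modalities.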

class AutoCarver.BinaryCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.

Examples

Binary Classification Example

Parameters:
  • dropna (bool, optional) –

    • True, try to group nan with other modalities.

    • False, nan are ignored (not grouped), by default True

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

Keyword Arguments:
  • discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2

  • ordinal_encoding (bool, optional) – Whether or not to ordinal encode Features, by default True

  • combinations (BinaryCombinationEvaluator, optional) –

    Metric to perform association measure between Features and target.

    Tip

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None
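What "robustness" means is not defined in this section. One plausible check, given here purely as an assumption, is that the groups of a carved feature rank in the same order by target rate on both samples. The function `same_target_rate_order` is hypothetical and is not part of AutoCarver.

```python
import pandas as pd


def same_target_rate_order(x_train, y_train, x_dev, y_dev):
    """Return True when groups rank identically by mean target on both samples."""
    train_rates = y_train.groupby(x_train).mean().sort_values()
    dev_rates = y_dev.groupby(x_dev).mean().sort_values()
    return list(train_rates.index) == list(dev_rates.index)


# made-up carved feature and binary targets
x_train = pd.Series(["low", "low", "mid", "mid", "high", "high"])
y_train = pd.Series([0, 0, 0, 1, 1, 1])
x_dev = pd.Series(["low", "low", "mid", "mid", "high", "high"])
y_dev = pd.Series([0, 1, 0, 0, 1, 1])

stable = same_target_rate_order(x_train, y_train, x_train, y_train)
unstable = same_target_rate_order(x_train, y_train, x_dev, y_dev)
```

A small X_dev makes per-group target rates noisy, which is one reason the dev set should be large enough to follow the same distribution as X.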

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

History of discretization process for all features

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.

Parameters:

file_name (str) – String of saved Carver’s .json file name.

Returns:

A fitted Carver.

Return type:

BaseDiscretizer

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Summary of discretization process for all features

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Multiclass Classification

Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).

For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).

MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).

For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
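The indicator features described above can be built with pandas as follows. This is a minimal sketch on made-up labels. It builds one indicator per distinct value, which may not match the exact subset of values the library uses.

```python
import pandas as pd

# made-up multiclass target
y = pd.Series(["cat", "dog", "bird", "dog", "cat"])

# one 0/1 indicator per distinct value of y (one-vs-rest)
indicators = {value: (y == value).astype(int) for value in sorted(y.unique())}
```

Each indicator is a valid binary target, so it can be passed to a binary carving step on its own.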

class AutoCarver.MulticlassCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.

Examples

Multiclass Classification Example

Parameters:
  • dropna (bool, optional) –

    • True, try to group nan with other modalities.

    • False, nan are ignored (not grouped), by default True

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

Keyword Arguments:
  • discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2

  • ordinal_encoding (bool, optional) – Whether or not to ordinal encode Features, by default True

  • combinations (BinaryCombinationEvaluator, optional) –

    Metric to perform association measure between Features and target.

    Tip

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

History of discretization process for all features

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.

Parameters:

file_name (str) – String of saved Carver’s .json file name.

Returns:

A fitted Carver.

Return type:

BaseDiscretizer

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Summary of discretization process for all features

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Regression tasks

Continuous Regression

Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).

The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).

For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
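To illustrate how \(H\) can rank two candidate groupings, here is a small sketch using scipy.stats.kruskal on made-up data. The helper `h_statistic` and the data layout are assumptions for this example, not AutoCarver internals.

```python
from scipy.stats import kruskal

# made-up continuous target values observed for three modalities of x
y_by_modality = {"a": [1, 2, 3], "b": [10, 11, 12], "c": [11, 12, 13]}


def h_statistic(grouping):
    """Kruskal-Wallis H for a grouping given as a list of lists of modalities."""
    samples = [[v for m in group for v in y_by_modality[m]] for group in grouping]
    return kruskal(*samples).statistic


h_a_vs_bc = h_statistic([["a"], ["b", "c"]])
h_ab_vs_c = h_statistic([["a", "b"], ["c"]])
```

Here the grouping that keeps "a" apart separates the target values more cleanly, so it gets the higher \(H\) and would be preferred.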

class AutoCarver.ContinuousCarver(features: Features, min_freq: float, dropna: bool = True, max_n_mod: int = 5, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.

For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.

Examples

Continuous Regression Example

Parameters:
  • dropna (bool, optional) –

    • True, try to group nan with other modalities.

    • False, nan are ignored (not grouped), by default True

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

Keyword Arguments:
  • discretizer_min_freq (float, optional) – Specific min_freq used by discretizers, by default None for min_freq/2

  • combinations (ContinuousCombinationEvaluator, optional) –

    Metric to perform association measure between Features and target.

    Currently, only Kruskal’s H Combinations are implemented.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

History of discretization process for all features

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

The Carver has to be saved with Carver.save(), otherwise there can be no guarantee for it to be restored.

Parameters:

file_name (str) – String of saved Carver’s .json file name.

Returns:

A fitted Carver.

Return type:

BaseDiscretizer

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Summary of discretization process for all features

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame