Carvers

The core of AutoCarver resides in its Carvers, which provide the following data optimization steps:

  1. Identifying the most associated combination from all ordered combinations of modalities

  2. Testing all combinations of ``nan``s grouped to one of those modalities
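The two steps above can be illustrated with a minimal sketch. It assumes that an "ordered combination" means a split of the ordered modalities into consecutive groups, and that ``nan`` is then tried in each group in turn. The helper names `ordered_groupings` and `nan_groupings` are hypothetical and are not part of AutoCarver.

```python
from itertools import combinations


def ordered_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 1..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # choose n_groups - 1 cut points among the n - 1 gaps between modalities
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:stop] for start, stop in zip(bounds, bounds[1:])]


def nan_groupings(grouping, nan_label="nan"):
    """Yield copies of a grouping with the nan label appended to each group in turn."""
    for i in range(len(grouping)):
        yield [
            group + [nan_label] if j == i else list(group)
            for j, group in enumerate(grouping)
        ]


# hypothetical feature with four ordered modalities
groupings = list(ordered_groupings(["a", "b", "c", "d"], max_n_mod=3))
nan_candidates = list(nan_groupings([["a"], ["b", "c"]]))
```

Each candidate grouping would then be scored with the association measure of the task at hand, and the best-scoring one kept.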

Target-specific tools allow for association optimization per desired task.

All carvers share the same constructor signature:

  • features (Features) — features to carve.

  • min_freq (float) — minimum frequency per modality.

  • max_n_mod (int, mandatory) — maximum number of modalities per carved feature; forwarded to the configured CombinationEvaluator.

  • combination_evaluator (CombinationEvaluator, keyword-only, optional) — association metric. Defaults to a task-appropriate evaluator (see each subclass).

  • config (DiscretizerConfig, keyword-only, optional) — behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs). Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True).

Classification tasks

Binary Classification

Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).

BinaryCarver’s built-in association measures are based on pandas.crosstab. It is computed only once per feature \(x\) against the binary target \(y\). The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
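The idea of computing the crosstab once and deriving every grouped crosstab from it can be sketched in plain Python. This is only an illustration of the principle, not AutoCarver's vectorized implementation; the helpers `crosstab` and `grouped_crosstab` and the sample data are hypothetical.

```python
from collections import Counter


def crosstab(x, y):
    """Count (y == 0, y == 1) occurrences for each modality of x, in one pass."""
    counts = Counter(zip(x, y))
    return {m: [counts[(m, 0)], counts[(m, 1)]] for m in sorted(set(x))}


def grouped_crosstab(xtab, grouping):
    """Sum the per-modality rows of a precomputed crosstab for each group."""
    return [
        [sum(xtab[m][col] for m in group) for col in (0, 1)]
        for group in grouping
    ]


x = ["a", "a", "b", "b", "b", "c", "c", "d"]
y = [0, 1, 0, 0, 1, 1, 1, 1]

xtab = crosstab(x, y)  # computed once per feature

# each candidate grouping only needs row sums, never a new pass over x and y
two_groups = grouped_crosstab(xtab, [["a", "b"], ["c", "d"]])
other_split = grouped_crosstab(xtab, [["a"], ["b", "c", "d"]])
```

Because each grouped crosstab is just a sum of rows, the cost of evaluating a combination depends on the number of modalities rather than on the number of observations.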

BinaryCarver uses scipy.stats.chi2_contingency to measure association. It gives Pearson’s \(\chi^2\) statistic computed from crosstabs.

Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified taking into account the binary target \(y\) to improve performance.

Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.
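For reference, the simplified expressions follow from the textbook definitions of both coefficients for an \(n_x \times n_y\) crosstab, where \(n_y\) (a symbol introduced here for illustration) is the number of target values:

\[
V=\sqrt{\frac{\chi^2}{n\,\min(n_x-1,\;n_y-1)}}, \qquad
T=\sqrt{\frac{\chi^2}{n\,\sqrt{(n_x-1)(n_y-1)}}}
\]

With a binary target, \(n_y=2\), so \(\min(n_x-1,\,1)=1\) for any \(n_x\geq 2\) and:

\[
V=\sqrt{\frac{\chi^2}{n}}, \qquad
T=\sqrt{\frac{\chi^2}{n\,\sqrt{n_x-1}}}=\frac{V}{\sqrt{\sqrt{n_x-1}}}
\]

Unlike \(V\), \(T\) therefore penalizes combinations with more modalities.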

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
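As a worked illustration of the formulas above, the following sketch computes \(\chi^2\), \(V\) and \(T\) from a crosstab in pure Python. It is not AutoCarver's implementation: the function names are hypothetical, the crosstabs are made-up counts, and no continuity correction is applied (scipy.stats.chi2_contingency applies Yates' correction by default on 2×2 tables, so its output can differ there).

```python
from math import sqrt


def chi2_statistic(xtab):
    """Pearson's chi-squared of a crosstab given as a list of rows.

    Assumes every row and column total is non-zero; no continuity correction.
    """
    row_totals = [sum(row) for row in xtab]
    col_totals = [sum(col) for col in zip(*xtab)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(xtab):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


def cramerv_tschuprowt(xtab):
    """Cramér's V and Tschuprow's T for a crosstab against a binary target."""
    n = sum(map(sum, xtab))
    n_x = len(xtab)  # number of modalities in this combination (must be >= 2)
    v = sqrt(chi2_statistic(xtab) / n)
    t = v / sqrt(sqrt(n_x - 1))
    return v, t


# rows are groups of modalities, columns are counts of y == 0 and y == 1
v2, t2 = cramerv_tschuprowt([[30, 10], [10, 30]])
v3, t3 = cramerv_tschuprowt([[30, 10], [10, 10], [0, 20]])
```

With two modalities \(T\) equals \(V\); with three, \(T\) is smaller than \(V\) because of the \(\sqrt{\sqrt{n_x-1}}\) denominator.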

class AutoCarver.BinaryCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.

Examples

Binary Classification Example

Parameters:
  • max_n_mod (int) –

    Maximum number of modalities per feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

  • combination_evaluator (CombinationEvaluator, optional) – Pre-built CombinationEvaluator instance used to measure association. Subclasses default this to a task-appropriate instance (e.g. TschuprowtCombinations for binary). The carver forwards verbose onto the instance and passes max_n_mod / min_freq / dropna directly to each get_best_combination() call.

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) which are the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).
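To give a feel for the speed trade-off behind the max_n_mod tip above, the sketch below counts candidate groupings. It assumes that combinations are splits of ordered modalities into consecutive groups and ignores the extra ``nan`` placements; the function name and the 20-modality feature are hypothetical.

```python
from math import comb


def n_ordered_groupings(n_modalities, max_n_mod):
    """Number of splits of n ordered modalities into 1..max_n_mod consecutive groups."""
    largest = min(max_n_mod, n_modalities)
    # k groups require choosing k - 1 cut points among n_modalities - 1 gaps
    return sum(comb(n_modalities - 1, k - 1) for k in range(1, largest + 1))


# hypothetical feature with 20 ordered modalities before carving
count_3 = n_ordered_groupings(20, max_n_mod=3)
count_7 = n_ordered_groupings(20, max_n_mod=7)
print(count_3, count_7)
```

Under these assumptions, going from max_n_mod=3 to max_n_mod=7 multiplies the number of candidates by more than 200 for this feature.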

Keyword Arguments:

combination_evaluator (CombinationEvaluator, optional) –

Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None
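The text above does not define how robustness is evaluated on X_dev. As one plausible illustration only, and not necessarily the test AutoCarver runs, the sketch below checks that groups are ranked in the same order by target rate on the train and dev samples. All names and data are hypothetical.

```python
def target_rates(groups, y):
    """Mean of y for each group label, returned as {group: rate}."""
    totals, counts = {}, {}
    for group, target in zip(groups, y):
        totals[group] = totals.get(group, 0) + target
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


def same_ranking(train_rates, dev_rates):
    """True when both samples hold the same groups, sorted identically by rate."""
    if set(train_rates) != set(dev_rates):
        return False

    def ranking(rates):
        # ties are broken by group label to keep the order deterministic
        return sorted(rates, key=lambda group: (rates[group], group))

    return ranking(train_rates) == ranking(dev_rates)


train = target_rates([0, 0, 0, 1, 1, 1, 2, 2, 2], [0, 0, 1, 0, 1, 1, 1, 1, 1])
dev_ok = target_rates([0, 0, 1, 1, 2, 2], [0, 0, 0, 1, 1, 1])
dev_ko = target_rates([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 0])

robust = same_ranking(train, dev_ok)
not_robust = same_ranking(train, dev_ko)
```

A small dev sample makes such a check noisy, which is consistent with the advice that X_dev be large enough to have the same distribution as X.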

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Multiclass Classification

Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).

For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).

MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).

For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
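The indicator construction described above can be sketched as follows. The helper `indicator_targets` is hypothetical; it builds one binary target per class value it is given, and each of these could then be handled as a binary carving problem.

```python
def indicator_targets(y, classes):
    """Build one 0/1 indicator target per requested class value of y."""
    return {c: [int(value == c) for value in y] for c in classes}


# hypothetical multiclass target
y = ["cat", "dog", "bird", "dog"]
indicators = indicator_targets(y, classes=["bird", "cat"])
```

Each indicator only contains \(0\) and \(1\), which matches the binary target format expected by BinaryCarver.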

class AutoCarver.MulticlassCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.

Examples

Multiclass Classification Example

Parameters:
  • max_n_mod (int) –

    Maximum number of modalities per feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

  • combination_evaluator (CombinationEvaluator, optional) – Pre-built CombinationEvaluator instance used to measure association. Subclasses default this to a task-appropriate instance (e.g. TschuprowtCombinations for binary). The carver forwards verbose onto the instance and passes max_n_mod / min_freq / dropna directly to each get_best_combination() call.

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) which are the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).

Keyword Arguments:

combination_evaluator (CombinationEvaluator, optional) –

Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Regression tasks

Continuous Regression

Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).

The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).

For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
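To make the statistic concrete, here is a pure-Python sketch of Kruskal-Wallis' \(H\) using average ranks and the usual tie correction. It only illustrates the formula; AutoCarver relies on scipy.stats.kruskal, and the function name and sample values below are hypothetical.

```python
from collections import Counter


def kruskal_h(*samples):
    """Kruskal-Wallis H statistic with average ranks and tie correction.

    Assumes at least two distinct values overall (otherwise H is undefined).
    """
    n = sum(len(sample) for sample in samples)
    value_counts = Counter(value for sample in samples for value in sample)

    # average rank of each distinct value (ranks start at 1)
    rank_of, ties, position = {}, 0, 0
    for value, count in sorted(value_counts.items()):
        rank_of[value] = position + (count + 1) / 2
        ties += count**3 - count
        position += count

    # sum over samples of (rank sum squared) / sample size
    rank_term = sum(
        sum(rank_of[value] for value in sample) ** 2 / len(sample)
        for sample in samples
    )
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    return h / (1 - ties / (n**3 - n))


# y values observed for each modality of a hypothetical carved feature
h_three = kruskal_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
# same data after merging the first two modalities
h_two = kruskal_h([1, 2, 3, 4, 5, 6], [7, 8, 9])
```

On this toy data the three-modality combination scores higher than the merged one, so it would be ranked first when sorting combinations by \(H\).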

class AutoCarver.ContinuousCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.

For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.

Examples

Continuous Regression Example

Parameters:
  • max_n_mod (int) –

    Maximum number of modalities per feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 3 (faster, more robust) and 7 (slower, less robust)

  • combination_evaluator (CombinationEvaluator, optional) – Pre-built CombinationEvaluator instance used to measure association. Subclasses default this to a task-appropriate instance (e.g. TschuprowtCombinations for binary). The carver forwards verbose onto the instance and passes max_n_mod / min_freq / dropna directly to each get_best_combination() call.

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) which are the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).

Keyword Arguments:

combination_evaluator (CombinationEvaluator, optional) –

Pre-built evaluator instance measuring association between Features and a continuous target. Defaults to KruskalCombinations.

Currently, only Kruskal’s H Combinations are implemented.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame