Carvers

The core of AutoCarver resides in its Carvers, which provide the following Data Optimization steps:

  1. Identifying the most associated combination from all ordered combinations of modalities

  2. Testing all combinations of ``nan``s grouped to one of those modalities
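The two steps above can be sketched in plain Python. This is only an illustration of the search space, assuming that an "ordered combination" merges adjacent modalities only; the helper names `ordered_combinations`, `n_ordered_combinations` and `with_nan` are hypothetical and are not part of AutoCarver.

```python
from itertools import combinations
from math import comb


def ordered_combinations(modalities, max_n_mod):
    """Yield every grouping of `modalities` into 1..max_n_mod contiguous groups."""
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # choosing n_groups - 1 cut points among the n - 1 gaps preserves the order
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[a:b] for a, b in zip(bounds, bounds[1:])]


def n_ordered_combinations(n_modalities, max_n_mod):
    """Number of groupings yielded by ordered_combinations."""
    largest = min(max_n_mod, n_modalities)
    return sum(comb(n_modalities - 1, g - 1) for g in range(1, largest + 1))


def with_nan(grouping, nan_label="nan"):
    """Yield one variant of `grouping` per group that nan could be attached to."""
    for target in range(len(grouping)):
        yield [
            group + [nan_label] if index == target else group
            for index, group in enumerate(grouping)
        ]


groupings = list(ordered_combinations(["a", "b", "c", "d"], max_n_mod=3))
```

With 10 modalities, this candidate space holds 256 groupings for max_n_mod=5 and 466 for max_n_mod=7, which gives a feel for how the search grows with max_n_mod.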

Target-specific carvers optimize association for the task at hand: BinaryCarver and MulticlassCarver for classification, ContinuousCarver for regression.

All carvers share the same constructor signature:

  • features (Features) — features to carve.

  • min_freq (float) — minimum frequency per modality. Tested via the Wilson score interval at significance min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).

  • max_n_mod (int) — maximum number of modalities per carved feature; forwarded to the configured CombinationEvaluator.

  • combination_evaluator (CombinationEvaluator, optional) — association metric. Defaults to a task-appropriate evaluator (see each subclass). The search uses progressive top-K interval dynamic programming (DP) for both Kruskal-H (continuous) and Pearson \(\chi^2\) (binary); statistically equivalent to the legacy enumerate-and-score path.

  • config (DiscretizerConfig, optional) — behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True).
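The Wilson-based min_freq check mentioned above can be illustrated as follows. This is a minimal sketch, not AutoCarver's code: the function names are hypothetical, the bound is taken as one-sided, and the decision rule (a modality is rejected only when even its upper bound stays below min_freq) is an assumption made for the example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound(count, n_obs, alpha=0.05):
    """One-sided Wilson score upper bound for the proportion count / n_obs."""
    z = NormalDist().inv_cdf(1 - alpha)
    p_hat = count / n_obs
    denom = 1 + z**2 / n_obs
    center = (p_hat + z**2 / (2 * n_obs)) / denom
    margin = z * sqrt(p_hat * (1 - p_hat) / n_obs + z**2 / (4 * n_obs**2)) / denom
    return center + margin


def is_representative(count, n_obs, min_freq, alpha=0.05):
    """Assumed rule: keep a modality unless its upper bound is below min_freq."""
    return wilson_upper_bound(count, n_obs, alpha) >= min_freq
```

Under these assumptions, a modality observed 10 times out of 1000 has an upper bound of roughly 1.7%, so it would pass min_freq=0.01 and fail min_freq=0.02.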

Per-feature parallelism (n_jobs)

With DiscretizerConfig(n_jobs=k) and k > 1, BaseCarver dispatches one task per feature through multiprocessing.Pool.imap_unordered. Each worker receives a pickled deep copy of the CombinationEvaluator and a single (feature, xagg, xagg_dev) payload; mutations stay local to the worker process and the parent reattaches the (mutated) feature on completion. Verbose per-feature logging is silenced — a single dispatch banner is printed when verbose=True.

Tip

Worth it only on a few hundred features or more. Below that, pool startup and pickle overhead dominate and the single-process path is faster. The DP top-K search already removes the per-feature compute bottleneck, so most users will not need n_jobs > 1.
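The dispatch pattern described in this section can be sketched with the standard library alone. The worker `carve_one` and its (name, scores) payload are hypothetical stand-ins for the real per-feature search and its (feature, xagg, xagg_dev) payload; only multiprocessing.Pool.imap_unordered is taken from the text above.

```python
from multiprocessing import Pool


def carve_one(payload):
    """Stand-in for the per-feature search: returns (feature name, best score)."""
    feature_name, scores = payload
    return feature_name, max(scores)


def carve_all(payloads, n_jobs=1):
    """Run carve_one once per feature, in parallel when n_jobs > 1."""
    if n_jobs <= 1 or len(payloads) <= 1:
        # single-process path: no pool startup, no pickling
        return dict(map(carve_one, payloads))
    with Pool(processes=n_jobs) as pool:
        # results arrive in completion order, so they are keyed by feature name
        return dict(pool.imap_unordered(carve_one, payloads))
```

On platforms that spawn rather than fork worker processes, a call such as carve_all(payloads, n_jobs=4) should be made under an `if __name__ == "__main__":` guard.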

Dropped features (no robust combination)

A feature for which no candidate combination passed the viability filter (Wilson min_freq on train and dev, distinct target rates, train/dev rank preservation) is removed from carver.features and retained on carver.dropped_features so the user can inspect why it was dropped without re-fitting.
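Two of the viability checks named above, distinct target rates and train/dev rank preservation, can be sketched like this. The helper names are hypothetical, and reading "distinct target rates" as "no two modalities share a rate" is an assumption; the library may apply a tolerance or a different comparison.

```python
def has_distinct_rates(rates):
    """True when no two modalities share the same target rate."""
    return len(set(rates)) == len(rates)


def ranks(rates):
    """Position of each modality once modalities are sorted by target rate."""
    order = sorted(range(len(rates)), key=rates.__getitem__)
    result = [0] * len(rates)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def is_robust(train_rates, dev_rates):
    """Viable when rates are distinct and ordered identically on train and dev."""
    return (
        has_distinct_rates(train_rates)
        and has_distinct_rates(dev_rates)
        and ranks(train_rates) == ranks(dev_rates)
    )
```

A combination with train rates [0.1, 0.2, 0.4] and dev rates [0.12, 0.35, 0.3] fails this check because the last two modalities swap order on dev.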

The summary and history properties append rows from dropped features with two marker columns:

  • dropped (bool) — True for rows from a dropped feature, False otherwise.

  • dropped_reason (str | None) — synthesized from the dominant failing-test message across the feature’s historized combinations (e.g. “Inversion of target rates per modality”, “Non-representative modality for min_freq=2.00%”).
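As an illustration of how a "dominant failing-test message" could be derived, the sketch below picks the most frequent message among a feature's attempted combinations. The function name and the flat list of messages are assumptions; the actual history layout is not described here.

```python
from collections import Counter


def synthesize_dropped_reason(messages):
    """Return the most frequent failing-test message, or None when there is none."""
    failures = [message for message in messages if message]
    if not failures:
        return None
    return Counter(failures).most_common(1)[0][0]
```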

A dropped feature most commonly signals that X_dev is too small or not representative of X for that feature: every candidate combination viable on train flipped its target-rate ordering on dev. Increasing the dev sample size, relaxing max_n_mod, or dropping the feature entirely are the three available levers.

Classification tasks

Binary Classification

Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).

BinaryCarver’s built-in association measures are based on pandas.crosstab, which is computed only once per feature \(x\) against the binary target \(y\). The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
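The idea of computing the crosstab once and deriving every combination's crosstab by summing rows can be shown in pure Python. This is a simplified sketch with hypothetical helper names; the library relies on pandas and vectorized numpy operations rather than on Python loops.

```python
def build_crosstab(x, y):
    """Count, once per feature, the rows with y == 0 and y == 1 per modality of x."""
    crosstab = {}
    for modality, target in zip(x, y):
        counts = crosstab.setdefault(modality, [0, 0])
        counts[target] += 1
    return crosstab


def group_crosstab(crosstab, grouping):
    """Derive the crosstab of a combination by summing rows of the base crosstab."""
    return [
        [sum(crosstab[modality][column] for modality in group) for column in (0, 1)]
        for group in grouping
    ]


base = build_crosstab(["a", "a", "b", "c", "c", "c"], [0, 1, 0, 1, 1, 0])
merged = group_crosstab(base, [["a", "b"], ["c"]])
```

Here `base` is built from the raw data a single time, and `merged` is obtained without touching x or y again.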

BinaryCarver takes advantage of scipy.stats.chi2_contingency to perform association measuring. It gives Pearson’s \(\chi^2\) statistics computed from crosstabs.

Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation is simplified to take advantage of the binary target \(y\) (the usual \(\min(n_x-1, n_y-1)\) denominator equals 1), which improves performance.

Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
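The three formulas above can be checked numerically with a short pure-Python sketch. The function names are hypothetical, and no continuity correction is applied, whereas scipy.stats.chi2_contingency applies Yates' correction by default on 2x2 tables. The sketch also assumes at least two modalities and no empty row or column.

```python
from math import sqrt


def chi2_statistic(crosstab):
    """Pearson's chi-squared of a crosstab given as rows of [n_y0, n_y1] counts."""
    row_totals = [sum(row) for row in crosstab]
    col_totals = [sum(col) for col in zip(*crosstab)]
    n_obs = sum(row_totals)
    return sum(
        (observed - row_total * col_total / n_obs) ** 2
        / (row_total * col_total / n_obs)
        for row, row_total in zip(crosstab, row_totals)
        for observed, col_total in zip(row, col_totals)
    )


def cramerv(crosstab):
    """Cramér's V for a binary target: sqrt(chi2 / n)."""
    n_obs = sum(map(sum, crosstab))
    return sqrt(chi2_statistic(crosstab) / n_obs)


def tschuprowt(crosstab):
    """Tschuprow's T for a binary target: V / sqrt(sqrt(n_x - 1))."""
    n_x = len(crosstab)
    return cramerv(crosstab) / sqrt(sqrt(n_x - 1))


crosstab = [[30, 10], [20, 20], [10, 30]]  # 3 modalities, 120 observations
```

For this crosstab every expected count is 20, so \(\chi^2 = 20\), \(V = \sqrt{20/120}\) and \(T = V / \sqrt{\sqrt{2}}\). \(T\) penalizes combinations with more modalities, while \(V\) does not.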

class AutoCarver.BinaryCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.

Examples

Binary Classification Example

Parameters:
  • features (Features) – A set of Features to be processed.

  • min_freq (float) –

    Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).

    • Features need at least one modality with frequency significantly above min_freq.

    • Drives the viability filter on both train and dev crosstabs during the combination search.

    • The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.

    Tip

    Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

  • max_n_mod (int) –

    Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 5 (faster, more robust) and 7 (slower, less robust).

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) — the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).

  • combination_evaluator (CombinationEvaluator, optional) –

    Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, the X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) → BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Multiclass Classification

Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).

For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).

MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).

For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
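The indicator construction described above can be sketched as follows. The function name is hypothetical, and leaving out the last class in sorted order is an assumption made to match the \(Y_0\) to \(Y_{n_y-1}\) indexing; the library may choose a different class.

```python
def indicator_targets(y):
    """Build one 0/1 indicator per class of y, leaving out the last class."""
    classes = sorted(set(y))
    return {
        y_class: [int(value == y_class) for value in y]
        for y_class in classes[:-1]
    }


indicators = indicator_targets(["a", "b", "c", "a"])
```

Each indicator is a valid binary target, so a binary carving can be run once per key of `indicators`.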

class AutoCarver.MulticlassCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.

Examples

Multiclass Classification Example

Parameters:
  • features (Features) – A set of Features to be processed.

  • min_freq (float) –

    Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).

    • Features need at least one modality with frequency significantly above min_freq.

    • Drives the viability filter on both train and dev crosstabs during the combination search.

    • The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.

    Tip

    Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

  • max_n_mod (int) –

    Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 5 (faster, more robust) and 7 (slower, less robust).

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) — the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).

  • combination_evaluator (CombinationEvaluator, optional) –

    Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, the X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) → BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Regression tasks

Continuous Regression

Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).

The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, \dots, x=n_x-1\), where \(n_x\) is the number of modalities taken by \(x\).

For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
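To make the statistic concrete, here is a pure-Python sketch of Kruskal-Wallis' \(H\) with the usual tie correction. The function names are hypothetical and the code is for illustration only; the library relies on scipy.stats.kruskal. The sketch assumes non-empty samples whose pooled values are not all identical.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def kruskal_h(samples):
    """Kruskal-Wallis H of several samples, corrected for ties."""
    pooled = [value for sample in samples for value in sample]
    n_obs = len(pooled)
    ranks = average_ranks(pooled)
    weighted_sum, offset = 0.0, 0
    for sample in samples:
        rank_sum = sum(ranks[offset:offset + len(sample)])
        weighted_sum += rank_sum**2 / len(sample)
        offset += len(sample)
    h_stat = 12 / (n_obs * (n_obs + 1)) * weighted_sum - 3 * (n_obs + 1)
    ties = Counter(pooled).values()
    correction = 1 - sum(t**3 - t for t in ties) / (n_obs**3 - n_obs)
    return h_stat / correction
```

Each sample holds the values of \(y\) observed for one group of modalities of \(x\), so two candidate combinations can be compared by calling `kruskal_h` on their respective groupings of \(y\).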

class AutoCarver.ContinuousCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.

For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.

Examples

Continuous Regression Example

Parameters:
  • features (Features) – A set of Features to be processed.

  • min_freq (float) –

    Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).

    • Features need at least one modality with frequency significantly above min_freq.

    • Drives the viability filter on both train and dev crosstabs during the combination search.

    • The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.

    Tip

    Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

  • max_n_mod (int) –

    Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.

    • The combination with the best association will be selected.

    • All combinations of sizes from 1 to max_n_mod are tested out.

    Tip

    Set between 5 (faster, more robust) and 7 (slower, less robust).

  • config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True) — the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).

  • combination_evaluator (CombinationEvaluator, optional) –

    Pre-built evaluator instance measuring association between Features and a continuous target. Defaults to KruskalCombinations.

    Currently, only Kruskal’s H Combinations is implemented for continuous targets.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, the X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:
  • X (pd.DataFrame) – Dataset to determine Features’ optimal carving.

  • y (pd.Series) – Target with which the association is maximized.

  • X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None

  • y_dev (pd.Series, optional) – Target associated to X_dev, by default None

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: str) → BaseCarver

Allows one to load a Carver saved as a .json file.

save(file_name: str, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:
  • file_name (str) – String of .json file name.

  • light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

  • dropped (bool): True for dropped features, False otherwise.

  • dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame