Carvers

The core of AutoCarver resides in its Carvers, which provide the following data optimization steps:

Identifying the most associated combination from all ordered combinations of modalities

Testing all combinations of ``nan``s grouped to one of those modalities
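To make the search space concrete, the sketch below enumerates every grouping of consecutive modalities into at most max_n_mod groups. It only illustrates what "ordered combinations" means; the function name ordered_combinations is hypothetical and this is not AutoCarver's search routine, which the constructor notes describe as a dynamic-programming search rather than plain enumeration.

```python
from itertools import combinations


def ordered_combinations(modalities, max_n_mod):
    """Yield every grouping of consecutive modalities into 1..max_n_mod groups.

    Hypothetical helper, written for illustration only.
    """
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # Pick n_groups - 1 cut points among the n - 1 gaps between modalities.
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [list(modalities[a:b]) for a, b in zip(bounds, bounds[1:])]


groupings = list(ordered_combinations(["a", "b", "c", "d"], max_n_mod=3))
for grouping in groupings:
    print(grouping)
```

With n ordered modalities there are C(n-1, k-1) groupings into exactly k groups, so the number of candidates grows quickly with both n and max_n_mod.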

Target-specific tools allow for association optimization per desired task:

Binary Classification
Multiclass Classification
One-vs-Rest Classification
Continuous Regression
Ordinal Classification

All carvers share the same constructor signature:

features (Features) — features to carve.
min_freq (float) — minimum frequency per modality. Tested via the Wilson score interval at significance min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
max_n_mod (int) — maximum number of modalities per carved feature; forwarded to the configured CombinationEvaluator.
combination_evaluator (CombinationEvaluator, optional) — association metric. Defaults to a task-appropriate evaluator (see each subclass). The search uses progressive top-K interval dynamic programming (DP) for both Kruskal-H (continuous) and Pearson \(\chi^2\) (binary); statistically equivalent to the legacy enumerate-and-score path.
config (ProcessingConfig, optional) — behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to ProcessingConfig(dropna=True, ordinal_encoding=True).
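As a rough illustration of the min_freq test, the sketch below computes a two-sided Wilson score interval for an observed modality frequency. The names wilson_interval and is_representative are hypothetical, as is the decision rule (keep a modality when the upper bound of its interval reaches min_freq) and the choice of a two-sided interval; the exact rule AutoCarver applies is described in the referenced Wilson score interval section.

```python
from math import sqrt

from scipy.stats import norm


def wilson_interval(count, n, alpha=0.05):
    """Two-sided Wilson score interval for the proportion count / n."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = count / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


def is_representative(count, n, min_freq, alpha=0.05):
    # ASSUMPTION: a modality passes when its Wilson upper bound reaches min_freq.
    _, upper = wilson_interval(count, n, alpha)
    return upper >= min_freq


print(wilson_interval(20, 1000))
print(is_representative(15, 1000, min_freq=0.02))
print(is_representative(2, 1000, min_freq=0.02))
```

Under this assumed rule, a modality observed slightly below min_freq is not rejected outright, because sampling noise alone could explain the shortfall.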

Per-feature parallelism (`n_jobs`)

With ProcessingConfig(n_jobs=k) and k > 1, BaseCarver dispatches one task per feature through multiprocessing.Pool.imap_unordered. Each worker receives a pickled deep copy of the CombinationEvaluator and a single (feature, xagg, xagg_dev) payload; mutations stay local to the worker process and the parent reattaches the (mutated) feature on completion. Verbose per-feature logging is silenced — a single dispatch banner is printed when verbose=True.

Tip

Worth it only on a few hundred features or more. Below that, pool startup and pickle overhead dominate and the single-process path is faster. The DP top-K search already removes the per-feature compute bottleneck, so most users will not need n_jobs > 1.

Dropped features (no robust combination)

A feature for which no candidate combination passed the viability filter (Wilson min_freq on train and dev, distinct target rates, train/dev rank preservation) is removed from carver.features and retained on carver.dropped_features so the user can inspect why it was dropped without re-fitting.

The summary and history properties append rows from dropped features with two marker columns:

dropped (bool) — True for rows from a dropped feature, False otherwise.
dropped_reason (str | None) — synthesized from the dominant failing-test message across the feature’s historized combinations (e.g. “Inversion of target rates per modality”, “Non-representative modality for min_freq=2.00%”).

A dropped feature most commonly signals that X_dev is too small or not representative of X for that feature: every candidate combination viable on train flipped its target-rate ordering on dev. Increasing the dev sample size, relaxing max_n_mod, or dropping the feature entirely are the three available levers.
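The rank-preservation idea can be sketched as follows: compute the target rate of each group on train and on dev, then compare the resulting orderings. The helper same_rate_ordering is hypothetical and deliberately simplified (it ignores ties and the other viability tests); it is not the check AutoCarver runs internally.

```python
import pandas as pd


def same_rate_ordering(x_train, y_train, x_dev, y_dev):
    """True when groups rank identically by target rate on train and dev."""
    train_rates = y_train.groupby(x_train).mean().sort_values()
    dev_rates = y_dev.groupby(x_dev).mean().sort_values()
    return list(train_rates.index) == list(dev_rates.index)


x = pd.Series(["a", "a", "b", "b", "c", "c"])
y_train = pd.Series([0, 0, 0, 1, 1, 1])     # rates: a=0.0, b=0.5, c=1.0
y_dev_stable = pd.Series([0, 0, 1, 0, 1, 1])  # same ordering as train
y_dev_flipped = pd.Series([0, 1, 0, 0, 1, 1])  # a and b swap places

print(same_rate_ordering(x, y_train, x, y_dev_stable))
print(same_rate_ordering(x, y_train, x, y_dev_flipped))
```

A small dev set makes such flips likely by chance alone, which is why a larger dev sample is the first lever to try.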

Classification tasks

Binary Classification

Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).

BinaryCarver’s built-in association measures are based on pandas.crosstab, which is computed only once per feature \(x\) against the binary target \(y\). The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
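A minimal sketch of this idea, assuming the groups are contiguous runs of crosstab rows and using numpy.add.reduceat (the text only names numpy.add, so the exact ufunc method is an assumption; grouped_crosstab is a hypothetical helper):

```python
import numpy as np
import pandas as pd

x = pd.Series(["a", "a", "b", "b", "c", "c", "d", "d"])
y = pd.Series([0, 0, 0, 1, 1, 0, 1, 1])

# Computed once per feature: rows are modalities a..d, columns are y=0 and y=1.
xtab = pd.crosstab(x, y)


def grouped_crosstab(xtab_values, group_starts):
    """Sum consecutive crosstab rows; group_starts holds each group's first row."""
    return np.add.reduceat(xtab_values, group_starts, axis=0)


# Combination {a, b} | {c, d}: groups start at rows 0 and 2.
grouped = grouped_crosstab(xtab.to_numpy(), [0, 2])
print(grouped)
```

Because the per-modality counts are already aggregated, scoring a new combination only requires summing a few rows rather than regrouping the raw data.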

BinaryCarver takes advantage of scipy.stats.chi2_contingency to perform association measuring. It gives Pearson’s \(\chi^2\) statistics computed from crosstabs.

Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified, taking into account the binary target \(y\), to improve performance.

Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
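The formulas above can be checked on a small crosstab. The sketch below calls scipy.stats.chi2_contingency directly and applies the two formulas as written; binary_association is a hypothetical helper and not the optimized routine used by BinaryCarver.

```python
import numpy as np
from scipy.stats import chi2_contingency


def binary_association(xtab):
    """Cramér's V and Tschuprow's T for a (n_x, 2) crosstab against a binary target."""
    observed = np.asarray(xtab)
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    n_x = observed.shape[0]  # number of modalities in the combination
    cramer_v = np.sqrt(chi2 / n)
    tschuprow_t = cramer_v / np.sqrt(np.sqrt(n_x - 1))
    return cramer_v, tschuprow_t


three_groups = np.array([[30, 10], [20, 20], [10, 30]])
two_groups = np.array([[50, 30], [10, 30]])

print(binary_association(three_groups))
print(binary_association(two_groups))
```

Since \(T\) divides \(V\) by \((n_x-1)^{1/4}\), it penalizes combinations with more modalities, and the two measures coincide when \(n_x=2\).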

class AutoCarver.BinaryCarver(features: Features, min_freq: float = 0.02, max_n_mod: int = 5, *, combination_evaluator: CombinationEvaluator | None = None, config: ProcessingConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.

Examples

Binary Classification Example

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- Drives the viability filter on both train and dev crosstabs during the combination search.
- The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

Defaults to 0.02 — the recommended starting point; see the recipes table in the Quick Start.
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
- The combination with the best association will be selected.
- All combinations of sizes from 1 to max_n_mod are tested out.
Tip

Set between 5 (faster, more robust) and 7 (slower, less robust).

Defaults to 5 — the recommended starting point; see the recipes table in the Quick Start.
config (ProcessingConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Its dropna and ordinal_encoding toggles default to None and are resolved to the carver-friendly True here (group nan, ordinal-encode features for downstream sklearn estimators). Passing a partial config (e.g. ProcessingConfig(verbose=True)) therefore keeps those carver defaults; set them explicitly to override.
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.
Tip
- Use Tschuprow’s T Combinations for fewer, more robust modalities.
- Use Cramér’s V Combinations for more, less robust modalities.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None, cv: int | BaseCrossValidator | Iterable = 0) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With ProcessingConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:

X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
cv (int, cross-validation generator or iterable, optional) –
Additional robustness views, resolved via sklearn.model_selection.check_cv() exactly like sklearn’s own CV-consuming estimators — AutoCarver never builds folds itself. Ranks are still determined on the full train X; each fold is a disjoint held-out partition and a carved combination must stay viable on X_dev (if given) and every fold.
- 0 (default): disabled, a no-op.
- int >= 2: k-fold split, automatically stratified for classification targets (same rule check_cv itself applies).
- a scikit-learn cross-validator instance or splitter generator (e.g. StratifiedKFold(5, shuffle=True, random_state=0)): used as-is.
- an iterable of (train_idx, test_idx) index pairs: wrapped as-is.
Combines with X_dev (both must pass), so expect more features dropped than with a single dev set — each fold is a fraction of train, so keep the fold count small (3-5).
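For reference, this is how scikit-learn itself resolves an integer cv for a classification target. The sketch only exercises sklearn.model_selection.check_cv; passing classifier=True is an assumption about how a carver would call it, and the toy data is made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, check_cv

y = np.array([0, 1] * 30)   # toy binary target
X = np.zeros((len(y), 1))   # placeholder features

# An int is turned into a (Stratified)KFold splitter by check_cv.
splitter = check_cv(cv=3, y=y, classifier=True)
print(type(splitter).__name__)

# Each fold's held-out indices form a disjoint partition of the train rows.
held_out = [test_idx for _, test_idx in splitter.split(X, y)]
print([len(idx) for idx in held_out])
```

With k folds each held-out view contains roughly 1/k of the train rows, which is why large fold counts make the per-modality frequency and rank tests harder to pass.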

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: Path) → BaseCarver

Loads a Carver saved as a .json file.

save(file_name: Path, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:

file_name (Path) – pathlib.Path of the .json file to write.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies the fitted carving to X, feature-chunked across processes when n_jobs > 1.

The final output transform is otherwise a single serial pass (the qualitative map holds the GIL, so threads don’t help). Each column is pickled once per chunk, so coarse feature-chunk process parallelism trims it. Multiclass/nested/datetime feature sets (which need cross-column context) fall back to the serial BaseDiscretizer.transform().

Multiclass Classification

Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y-1}\) where \(n_y>2\) is the number of modalities taken by \(y\). Two carvers handle this shape, with a real trade-off between them:

MulticlassCarver (this class, default choice) carves each feature once, against the full \(n_y\)-class crosstab: one bucket set per feature, copy=False supported, and roughly \((n_y - 1)\times\) faster than the one-vs-rest alternative. Use it when a single multiclass model consumes the carved features.
One-vs-Rest Classification fits a separate BinaryCarver per class (\(n_y - 1\) of them, one class held out as reference), producing \(n_y - 1\) versions of every feature. Use it when the carved features feed \(n_y - 1\) independent one-vs-rest scorecards — OVR buckets may score higher per class by construction, since each one is optimized against only that class.

Migration note (breaking change)

Prior to this release, MulticlassCarver was the one-vs-rest carver. That behavior is unchanged but renamed: import OneVsRestCarver instead. MulticlassCarver now refers to the joint carver described below.

MulticlassCarver is a sibling of Ordinal Classification (both sit directly on BaseCarver and aggregate a feature-groups × target-levels crosstab): the \(n_y\) classes are unordered here, so qualitative modalities are ordered by their correspondence-analysis first-axis score (the chi²-optimal 1-D embedding — see Correspondence-analysis ordering (multiclass targets)) instead of a target-rate mean (Target-mean ordering (binary, continuous and ordinal targets)), and the association measure generalizes the binary \(\chi^2\) from a 2-column to a \((B, n_y)\)-column table:

Cramér’s \(V=\sqrt{\frac{\chi^2}{n\,(\min(B, n_y)-1)}}\) and Tschuprow’s \(T=\sqrt{\frac{\chi^2}{n\,\sqrt{(B-1)(n_y-1)}}}\), where \(B\) is the number of groups in the candidate combination. At \(n_y=2\) both reduce exactly to BinaryCarver’s own formulas.
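The generalized formulas can be written down directly. The sketch below applies them to a \((B, n_y)\) table with scipy.stats.chi2_contingency; multiclass_association is a hypothetical helper, not the evaluator shipped with AutoCarver.

```python
import numpy as np
from scipy.stats import chi2_contingency


def multiclass_association(xtab):
    """Cramér's V and Tschuprow's T for a (B, n_y) crosstab."""
    observed = np.asarray(xtab)
    chi2 = chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    n_groups, n_classes = observed.shape
    cramer_v = np.sqrt(chi2 / (n * (min(n_groups, n_classes) - 1)))
    tschuprow_t = np.sqrt(chi2 / (n * np.sqrt((n_groups - 1) * (n_classes - 1))))
    return cramer_v, tschuprow_t


three_by_three = np.array([[30, 5, 5], [5, 30, 5], [5, 5, 30]])
three_by_two = np.array([[30, 10], [20, 20], [10, 30]])

print(multiclass_association(three_by_three))
print(multiclass_association(three_by_two))
```

On the two-column table the function returns \(V=\sqrt{\chi^2/n}\) and \(T=V/\sqrt{\sqrt{B-1}}\), which is the binary case; on a square table the two denominators are equal, so \(T=V\).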

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the multiclass target \(y\).
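The correspondence-analysis ordering mentioned above can be sketched with a plain singular value decomposition. This is the textbook construction of first-axis row scores, written under the assumption that every modality and every class has a non-zero count; ca_first_axis_scores is a hypothetical name and not the library's correspondence_analysis utility.

```python
import numpy as np


def ca_first_axis_scores(xtab):
    """First-axis principal coordinates of the rows of a contingency table."""
    counts = np.asarray(xtab, dtype=float)
    P = counts / counts.sum()   # correspondence matrix
    r = P.sum(axis=1)           # row masses (modalities)
    c = P.sum(axis=0)           # column masses (target classes)
    expected = np.outer(r, c)
    residuals = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, singular_values, _ = np.linalg.svd(residuals, full_matrices=False)
    return U[:, 0] * singular_values[0] / np.sqrt(r)


# Rows: four qualitative modalities; columns: three unordered target classes.
table = np.array([[40, 5, 5], [30, 15, 5], [5, 10, 35], [10, 30, 10]])
scores = ca_first_axis_scores(table)
order = np.argsort(scores)  # modality order along the first axis
print(scores, order)
```

The sign of a singular vector is arbitrary, so the ordering is only defined up to reversal; that does not matter when the goal is to decide which modalities are neighbours.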

class AutoCarver.MulticlassCarver(features: Features, min_freq: float = 0.02, max_n_mod: int = 5, *, combination_evaluator: CombinationEvaluator | None = None, config: ProcessingConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass (unordered, \(K > 2\) classes) target — one carving per feature, against the full n x K crosstab.

Sibling of OrdinalCarver (both sit directly on BaseCarver and aggregate a feature-groups x target-levels crosstab): the K target classes are unordered here, so qualitative modalities are ordered by their correspondence-analysis first-axis score (see AutoCarver.discretizers.utils.correspondence_analysis) instead of a numeric target-rate mean, and the association measure is a chi²-family statistic (Tschuprow’s T or Cramér’s V) generalised to a (B, K) table instead of Kendall’s tau-c.

A feature is carved once: unlike OneVsRestCarver (which fits K - 1 separate BinaryCarver instances — one per class, producing K - 1 versions of every feature), this carver produces a single bucket set per feature and supports copy=False.

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- Drives the viability filter on both train and dev crosstabs during the combination search.
- The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

Defaults to 0.02 — the recommended starting point; see the recipes table in the Quick Start.
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
- The combination with the best association will be selected.
- All combinations of sizes from 1 to max_n_mod are tested out.
Tip

Set between 5 (faster, more robust) and 7 (slower, less robust).

Defaults to 5 — the recommended starting point; see the recipes table in the Quick Start.
config (ProcessingConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Its dropna and ordinal_encoding toggles default to None and are resolved to the carver-friendly True here (group nan, ordinal-encode features for downstream sklearn estimators). Passing a partial config (e.g. ProcessingConfig(verbose=True)) therefore keeps those carver defaults; set them explicitly to override.
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a multiclass target. Defaults to TschuprowtMulticlassCombinations.

Choose from: TschuprowtMulticlassCombinations (default), CramervMulticlassCombinations.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None, cv: int | BaseCrossValidator | Iterable = 0) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With ProcessingConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:

X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
cv (int, cross-validation generator or iterable, optional) –
Additional robustness views, resolved via sklearn.model_selection.check_cv() exactly like sklearn’s own CV-consuming estimators — AutoCarver never builds folds itself. Ranks are still determined on the full train X; each fold is a disjoint held-out partition and a carved combination must stay viable on X_dev (if given) and every fold.
- 0 (default): disabled, a no-op.
- int >= 2: k-fold split, automatically stratified for classification targets (same rule check_cv itself applies).
- a scikit-learn cross-validator instance or splitter generator (e.g. StratifiedKFold(5, shuffle=True, random_state=0)): used as-is.
- an iterable of (train_idx, test_idx) index pairs: wrapped as-is.
Combines with X_dev (both must pass), so expect more features dropped than with a single dev set — each fold is a fraction of train, so keep the fold count small (3-5).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: Path) → BaseCarver

Loads a Carver saved as a .json file.

save(file_name: Path, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:

file_name (Path) – pathlib.Path of the .json file to write.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies the fitted carving to X, feature-chunked across processes when n_jobs > 1.

The final output transform is otherwise a single serial pass (the qualitative map holds the GIL, so threads don’t help). Each column is pickled once per chunk, so coarse feature-chunk process parallelism trims it. Multiclass/nested/datetime feature sets (which need cross-column context) fall back to the serial BaseDiscretizer.transform().

One-vs-Rest Classification

For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).

OneVsRestCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\) (\(n_y - 1\) fits — one class is held out as the implicit reference). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).

For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
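The indicator targets can be built in a couple of lines. In the sketch below, one_vs_rest_targets is a hypothetical helper, and treating the last class in sorted order as the held-out reference is an assumption: the text only states that one class is held out, not which one.

```python
import pandas as pd


def one_vs_rest_targets(y):
    """Build one 0/1 indicator target per class, leaving one class as reference."""
    classes = sorted(y.unique())
    # ASSUMPTION: the last class in sorted order is the implicit reference.
    return {cls: (y == cls).astype(int) for cls in classes[:-1]}


y = pd.Series(["low", "mid", "high", "low", "mid"])
targets = one_vs_rest_targets(y)
for cls, indicator in targets.items():
    print(cls, indicator.tolist())
```

Each indicator is a valid binary target, so every feature is carved once per indicator and ends up with \(n_y - 1\) versions.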

class AutoCarver.OneVsRestCarver(features: Features, min_freq: float = 0.02, max_n_mod: int = 5, *, combination_evaluator: CombinationEvaluator | None = None, config: ProcessingConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.

Examples

One-vs-Rest Classification Example

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- Drives the viability filter on both train and dev crosstabs during the combination search.
- The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

Defaults to 0.02 — the recommended starting point; see the recipes table in the Quick Start.
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
- The combination with the best association will be selected.
- All combinations of sizes from 1 to max_n_mod are tested out.
Tip

Set between 5 (faster, more robust) and 7 (slower, less robust).

Defaults to 5 — the recommended starting point; see the recipes table in the Quick Start.
config (ProcessingConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Its dropna and ordinal_encoding toggles default to None and are resolved to the carver-friendly True here (group nan, ordinal-encode features for downstream sklearn estimators). Passing a partial config (e.g. ProcessingConfig(verbose=True)) therefore keeps those carver defaults; set them explicitly to override.
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.
Tip
- Use Tschuprow’s T Combinations for fewer, more robust modalities.
- Use Cramér’s V Combinations for more, less robust modalities.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None, cv: int | BaseCrossValidator | Iterable = 0) → Self

Finds the combination of modalities of X that provides the best association with y.

If provided, X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With ProcessingConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:

X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
cv (int, cross-validation generator or iterable, optional) –
Additional robustness views, resolved via sklearn.model_selection.check_cv() exactly like sklearn’s own CV-consuming estimators; AutoCarver never builds folds itself. Ranks are still determined on the full train X; each fold is a disjoint held-out partition and a carved combination must stay viable on X_dev (if given) and every fold.
- 0 (default): disabled, a no-op.
- int >= 2: k-fold split, automatically stratified for classification targets (same rule check_cv itself applies).
- a scikit-learn cross-validator instance or splitter generator (e.g. StratifiedKFold(5, shuffle=True, random_state=0)): used as-is.
- an iterable of (train_idx, test_idx) index pairs: wrapped as-is.
Combines with X_dev (both must pass), so expect more features dropped than with a single dev set. Each fold is a fraction of train, so keep the fold count small (3-5).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: Path) → BaseCarver

Loads a Carver saved as a .json file.

save(file_name: Path, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:

file_name (Path) – pathlib.Path of the .json file to write.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies the fitted carving to X, feature-chunked across processes when n_jobs > 1.

The final output transform is otherwise a single serial pass (the qualitative map holds the GIL, so threads don’t help). Each column is pickled once per chunk, so coarse feature-chunk process parallelism trims it. Multiclass/nested/datetime feature sets (which need cross-column context) fall back to the serial BaseDiscretizer.transform().

Regression tasks

Continuous Regression

Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).

The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, tests whether two or more samples originate from the same distribution. It is used to determine whether \(y\) is distributed the same across \(x=0, \ldots, x=n_x-1\), where \(n_x\) is the number of modalities taken by \(x\).

For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
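A small sketch of how two candidate combinations can be compared with scipy.stats.kruskal. The data is synthetic and the helper kruskal_h is hypothetical; it only shows the statistic being computed per grouping, not the search used by ContinuousCarver.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
x = rng.integers(0, 4, size=400)            # feature with modalities 0..3
y = x + rng.normal(scale=1.0, size=400)     # synthetic target increasing with x


def kruskal_h(y, groups):
    """Kruskal-Wallis H of y split by the group label of each observation."""
    samples = [y[groups == g] for g in np.unique(groups)]
    return kruskal(*samples).statistic


# Two ordered combinations of the four modalities.
h_balanced = kruskal_h(y, np.where(x <= 1, 0, 1))    # {0, 1} | {2, 3}
h_unbalanced = kruskal_h(y, np.where(x <= 0, 0, 1))  # {0} | {1, 2, 3}
print(h_balanced, h_unbalanced)
```

The combination with the larger \(H\) separates the distribution of \(y\) more sharply and would be preferred, provided it also passes the viability tests.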

class AutoCarver.ContinuousCarver(features: Features, min_freq: float = 0.02, max_n_mod: int = 5, *, combination_evaluator: CombinationEvaluator | None = None, config: ProcessingConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.

For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.

Examples

Continuous Regression Example

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- Drives the viability filter on both train and dev crosstabs during the combination search.
- The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

Defaults to 0.02 — the recommended starting point; see the recipes table in the Quick Start.
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
- The combination with the best association will be selected.
- All combinations of sizes from 1 to max_n_mod are tested out.
Tip

Set between 5 (faster, more robust) and 7 (slower, less robust).

Defaults to 5 — the recommended starting point; see the recipes table in the Quick Start.
config (ProcessingConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Its dropna and ordinal_encoding toggles default to None and are resolved to the carver-friendly True here (group nan, ordinal-encode features for downstream sklearn estimators). Passing a partial config (e.g. ProcessingConfig(verbose=True)) therefore keeps those carver defaults; set them explicitly to override.
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a continuous target. Defaults to KruskalCombinations.

Currently, only Kruskal’s H Combinations is implemented for continuous targets.

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None, cv: int | BaseCrossValidator | Iterable = 0) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With ProcessingConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.

Parameters:

X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
cv (int, cross-validation generator or iterable, optional) –
Additional robustness views, resolved via sklearn.model_selection.check_cv() exactly like sklearn’s own CV-consuming estimators — AutoCarver never builds folds itself. Ranks are still determined on the full train X; each fold is a disjoint held-out partition and a carved combination must stay viable on X_dev (if given) and every fold.
- 0 (default): disabled, a no-op.
- int >= 2: k-fold split, automatically stratified for classification targets (same rule check_cv itself applies).
- a scikit-learn cross-validator instance or splitter generator (e.g. StratifiedKFold(5, shuffle=True, random_state=0)): used as-is.
- an iterable of (train_idx, test_idx) index pairs: wrapped as-is.
Combines with X_dev (both must pass), so expect more features dropped than with a single dev set — each fold is a fraction of train, so keep the fold count small (3-5).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: Path) → BaseCarver

Loads a Carver saved as a .json file.

save(file_name: Path, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:

file_name (Path) – pathlib.Path of the .json file to write.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies the fitted carving to X, feature-chunked across processes when n_jobs > 1.

Without n_jobs > 1, the final output transform is a single serial pass (the qualitative map holds the GIL, so threads don’t help). Process parallelism over coarse feature chunks shortens that pass, and each column is pickled only once per chunk. Multiclass/nested/datetime feature sets (which need cross-column context) fall back to the serial BaseDiscretizer.transform().

Ordinal tasks

Ordinal Classification

Within OrdinalCarver, an ordinal target is a column \(y\) whose values are integer-encoded ordered levels (e.g. \(1..K\) with \(K > 2\)); the level order is read from the ascending integer values. A two-level target should use Binary Classification and a free, unordered target Multiclass Classification — OrdinalCarver rejects both at fit time, as it does a non-integer (continuous) or string target.
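The target requirements above can be checked before fitting. The helper below is a hypothetical pre-check written for illustration (the name check_ordinal_target and its error messages are not part of AutoCarver); it only covers what can be decided from the values themselves, since an unordered multiclass target cannot be told apart from an ordinal one by its integer encoding alone.

```python
from numbers import Integral


def check_ordinal_target(values):
    """Hypothetical pre-fit check, not AutoCarver's own code.

    Returns the sorted levels when `values` looks like an integer-encoded
    ordinal target with more than two levels, and raises ValueError otherwise.
    """
    levels = set(values)
    # Strings, floats and booleans are not an integer encoding of ordered levels.
    if not all(isinstance(v, Integral) and not isinstance(v, bool) for v in levels):
        raise ValueError("ordinal target must be integer-encoded")
    if len(levels) <= 2:
        raise ValueError("two-level target: use Binary Classification instead")
    # The level order is read from the ascending integer values.
    return sorted(levels)


levels = check_ordinal_target([3, 1, 2, 2, 1, 3])
```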

The association with a feature \(x\) is measured with a rank-correlation statistic computed on the ordered contingency table (feature groups × ordinal target levels). Unlike the binary \(\chi^2\), a rank statistic rewards a grouping whose order matches the target’s order — exactly what an ordinal target calls for. The default is Kendall/Stuart’s tau-c; Kendall’s tau-b and the original Somers’ D are also available via combination_evaluator. The symmetric Kendall taus self-balance to a robust, parsimonious number of modalities (only adding a split when it is genuinely discriminative), whereas Somers’ D leans toward the coarsest split. See Ordinal tasks for the metric definitions and the search.

For two combinations of modalities of \(x\), a higher tau / Somers’ D value indicates a grouping whose ordering agrees more strongly with the ordinal target’s order.
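To make the comparison concrete, the sketch below computes Stuart's tau-c and Kendall's tau-b from an ordered contingency table using their textbook definitions. It is a plain-Python illustration on toy counts, not AutoCarver's evaluator code; Somers' D is left out because the direction used by the library is not stated here.

```python
from math import sqrt


def concordant_discordant(table):
    """Counts concordant and discordant pairs of an ordered contingency table.

    Rows are feature groups, columns are ordinal target levels, both ascending.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def tau_c(table):
    """Stuart's tau-c: 2 m (P - Q) / (n^2 (m - 1)), with m = min(rows, columns)."""
    n = sum(map(sum, table))
    m = min(len(table), len(table[0]))
    p, q = concordant_discordant(table)
    return 2 * m * (p - q) / (n**2 * (m - 1))


def tau_b(table):
    """Kendall's tau-b: (P - Q) / sqrt((n0 - row ties) (n0 - column ties))."""
    n = sum(map(sum, table))
    p, q = concordant_discordant(table)
    n0 = n * (n - 1) / 2
    row_ties = sum(r * (r - 1) / 2 for r in map(sum, table))
    col_ties = sum(c * (c - 1) / 2 for c in map(sum, zip(*table)))
    return (p - q) / sqrt((n0 - row_ties) * (n0 - col_ties))


# Toy counts: a grouping whose order follows the target, and the same rows reordered.
aligned = [[8, 2, 0], [2, 6, 2], [0, 2, 8]]
misordered = [[2, 6, 2], [8, 2, 0], [0, 2, 8]]
aligned_better = tau_c(aligned) > tau_c(misordered)
```

On these toy tables the aligned grouping scores higher than the reordered one, which is the behaviour the ranking of combinations relies on.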

class AutoCarver.OrdinalCarver(features: Features, min_freq: float = 0.02, max_n_mod: int = 5, *, combination_evaluator: CombinationEvaluator | None = None, target_scale: Literal['ridit', 'level'] | dict = 'ridit', config: ProcessingConfig | None = None)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with an ordinal target.

The target must be integer-encoded with ordered levels (e.g. 1..K, K > 2); the level order is taken from the ascending integer values.

For ordinal targets, Kendall/Stuart’s \(\tau_c\) is the default association measure used to sort combinations: it rewards groupings whose order matches the target’s while favouring robust, parsimonious cardinality. Kendall’s \(\tau_b\) and the original Somers’ D are also available via combination_evaluator.

target_scale declares how the integer encoding of the levels should be read (it drives the modality pre-sort and the viability rate; the rank-based tau statistics are encoding-invariant either way):

"ridit" (default) — order-only levels (Poor / Fair / Good): levels are scored by their train ridits, invariant under any strictly increasing re-encoding.
"level" — count targets (e.g. 0–5 claims), where the encoding is the scale and the mean level (expected count) is the right summary.
{level: value} — known representative values per level (e.g. a calibrated default probability per rating grade), strictly increasing.
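The invariance claimed for "ridit" can be illustrated with the standard (Bross) ridit definition: the share of observations strictly below a level plus half of the level's own share. That AutoCarver's train ridits follow exactly this definition is an assumption of the sketch, and the data is a toy example.

```python
from collections import Counter


def ridit_scores(y_train):
    """Standard ridit per level: share strictly below it, plus half of its own share."""
    counts = Counter(y_train)
    n = len(y_train)
    scores, below = {}, 0
    for level in sorted(counts):
        scores[level] = (below + counts[level] / 2) / n
        below += counts[level]
    return scores


# Toy target with three ordered levels.
y = [1, 1, 2, 3, 3, 3, 3, 2]
scores = ridit_scores(y)

# A strictly increasing re-encoding of the levels leaves the scores unchanged.
recode = {1: 10, 2: 50, 3: 700}
recoded_scores = ridit_scores([recode[v] for v in y])
same_scores = list(scores.values()) == list(recoded_scores.values())
```

By contrast, the mean level used under "level" does change with the encoding, which is why that option suits count targets where the integers are the scale.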

When individual continuous target values are available, use ContinuousCarver instead.

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- Drives the viability filter on both train and dev crosstabs during the combination search.
- The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).

Defaults to 0.02 — the recommended starting point; see the recipes table in the Quick Start.
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
- The combination with the best association will be selected.
- All combinations of sizes from 1 to max_n_mod are tested out.
Tip

Set between 5 (faster, more robust) and 7 (slower, less robust).

Defaults to 5 — the recommended starting point; see the recipes table in the Quick Start.
config (ProcessingConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Its dropna and ordinal_encoding toggles default to None and are resolved to the carver-friendly True here (group nan, ordinal-encode features for downstream sklearn estimators). Passing a partial config (e.g. ProcessingConfig(verbose=True)) therefore keeps those carver defaults; set them explicitly to override.
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and an ordinal target. Defaults to KendallTauCCombinations.

Choose from: KendallTauCCombinations (default), KendallTauBCombinations, SomersDCombinations.
target_scale ("ridit", "level" or dict, optional) – How the integer encoding of the target levels is read, by default "ridit". A dict maps each level to its (strictly increasing) representative value. Conflicts with a combination_evaluator carrying an explicit non-ridit target_rate.
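The min_freq test described in the parameters above can be sketched with the textbook Wilson score upper bound. The decision rule (a modality is kept unless its frequency is significantly below min_freq), the one-sided quantile, and alpha=0.05 are assumptions made for this illustration; the library's actual rule and default min_freq_alpha may differ.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound(count, n, alpha):
    """One-sided Wilson score upper bound for the proportion count / n."""
    z = NormalDist().inv_cdf(1 - alpha)
    p_hat = count / n
    centre = p_hat + z**2 / (2 * n)
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre + spread) / (1 + z**2 / n)


def passes_min_freq(count, n, min_freq=0.02, alpha=0.05):
    """Assumed rule: fail only when even the upper bound stays below min_freq."""
    return wilson_upper_bound(count, n, alpha) >= min_freq


# 15 rows out of 1000 (1.5%) are not significantly below 2%; 5 out of 1000 are.
borderline_ok = passes_min_freq(15, 1000)
too_rare = not passes_min_freq(5, 1000)
```

Under this reading, a modality slightly under min_freq on a small sample is not rejected outright, while a clearly rare one is.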

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None, cv: int | BaseCrossValidator | Iterable = 0) → Self

Finds the combination of modalities of X that provides the best association with y. If provided, X_dev should be large enough to follow the same distribution as X.

Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With ProcessingConfig(n_jobs=k) and k > 1 and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.
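Two of the viability conditions named above, distinct target rates and train/dev rank preservation, can be sketched as follows. The function names and the exact meaning of "distinct" (no two groups sharing the same rate) are assumptions for illustration, not AutoCarver's implementation.

```python
def ranks(rates):
    """Rank of each group's target rate (0 = lowest)."""
    order = sorted(range(len(rates)), key=rates.__getitem__)
    out = [0] * len(rates)
    for rank, idx in enumerate(order):
        out[idx] = rank
    return out


def is_viable(train_rates, dev_rates):
    """Hypothetical check: rates differ between groups and keep the same order on dev."""
    distinct = (
        len(set(train_rates)) == len(train_rates)
        and len(set(dev_rates)) == len(dev_rates)
    )
    return distinct and ranks(train_rates) == ranks(dev_rates)


# Per-group target rates of one candidate combination (toy values).
stable = is_viable([0.10, 0.25, 0.40], [0.12, 0.22, 0.45])
flipped = is_viable([0.10, 0.25, 0.40], [0.12, 0.30, 0.28])
```

A feature is dropped only when every candidate combination fails such checks, which is when it moves from self.features to self.dropped_features.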

Parameters:

X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
cv (int, cross-validation generator or iterable, optional) –
Additional robustness views, resolved via sklearn.model_selection.check_cv() exactly like sklearn’s own CV-consuming estimators — AutoCarver never builds folds itself. Ranks are still determined on the full train X; each fold is a disjoint held-out partition and a carved combination must stay viable on X_dev (if given) and every fold.
- 0 (default): disabled, a no-op.
- int >= 2: k-fold split, automatically stratified for classification targets (same rule check_cv itself applies).
- a scikit-learn cross-validator instance or splitter generator (e.g. StratifiedKFold(5, shuffle=True, random_state=0)): used as-is.
- an iterable of (train_idx, test_idx) index pairs: wrapped as-is.
Combines with X_dev (both must pass), so expect more features dropped than with a single dev set — each fold is a fraction of train, so keep the fold count small (3-5).

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property history: DataFrame

Combined combination-history of carved + dropped features.

Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.

classmethod load(file_name: Path) → BaseCarver

Loads a Carver saved as a .json file.

save(file_name: Path, light_mode: bool = False) → None

Saves pipeline to .json file.

Parameters:

file_name (Path) – pathlib.Path of the .json file to write.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

property summary: DataFrame

Per-feature carving summary, extended with one block per dropped feature.

Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:

dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history — the dominant failing test message across attempted combinations.

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies the fitted carving to X, feature-chunked across processes when n_jobs > 1.

Without n_jobs > 1, the final output transform is a single serial pass (the qualitative map holds the GIL, so threads don’t help). Process parallelism over coarse feature chunks shortens that pass, and each column is pickled only once per chunk. Multiclass/nested/datetime feature sets (which need cross-column context) fall back to the serial BaseDiscretizer.transform().
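The idea of coarse feature chunks can be illustrated with a small helper that splits the feature names into at most n_jobs contiguous groups, so each worker receives a whole block of columns rather than one column at a time. The helper chunk_features is hypothetical and only shows the partitioning; how AutoCarver actually builds its chunks is not documented here.

```python
def chunk_features(names, n_jobs):
    """Splits feature names into at most n_jobs contiguous chunks of near-equal size."""
    n_chunks = max(1, min(n_jobs, len(names)))
    base, extra = divmod(len(names), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional feature.
        size = base + (1 if i < extra else 0)
        chunks.append(names[start:start + size])
        start += size
    return chunks


# Seven hypothetical feature names spread over three workers.
chunks = chunk_features(list("abcdefg"), 3)
```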
