Carvers
The core of AutoCarver resides in its Carvers, which provide the following data optimization steps:
Identifying the most associated combination from all ordered combinations of modalities
Testing all combinations of ``nan``s grouped to one of those modalities
- Target-specific tools allow for association optimization per desired task: binary classification, multiclass classification and continuous regression.
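As a rough illustration of the two search steps listed above, the sketch below enumerates every grouping of consecutive modalities into at most `max_n_mod` groups, then attaches `nan` to each group in turn. The helper names `ordered_combinations` and `with_nan` are hypothetical and not part of AutoCarver; this is a brute-force enumeration meant for reading, not the library's own search.

```python
from itertools import combinations


def ordered_combinations(modalities, max_n_mod):
    """Yield every grouping of *consecutive* modalities into 1..max_n_mod groups."""
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # Choosing n_groups - 1 cut points among the n - 1 gaps preserves the order.
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [list(modalities[a:b]) for a, b in zip(bounds, bounds[1:])]


def with_nan(grouping, nan_label="nan"):
    """Yield the grouping once per group, with nan_label appended to that group."""
    for i in range(len(grouping)):
        yield [list(g) + [nan_label] if j == i else list(g) for j, g in enumerate(grouping)]


for grouping in ordered_combinations(["a", "b", "c", "d"], max_n_mod=3):
    print(grouping)
```

Each yielded grouping would then be scored by an association measure, and the best-scoring one kept.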
All carvers share the same constructor signature:
features (Features) – features to carve.
min_freq (float) – minimum frequency per modality. Tested via the Wilson score interval at significance min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
max_n_mod (int) – maximum number of modalities per carved feature; forwarded to the configured CombinationEvaluator.
combination_evaluator (CombinationEvaluator, optional) – association metric. Defaults to a task-appropriate evaluator (see each subclass). The search uses progressive top-K interval dynamic programming (DP) for both Kruskal-H (continuous) and Pearson \(\chi^2\) (binary); it is statistically equivalent to the legacy enumerate-and-score path.
config (DiscretizerConfig, optional) – behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True).
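To give a feel for the `min_freq` test, here is a minimal sketch of a Wilson score interval for a modality's observed frequency. The function names, the two-sided interval, the `alpha=0.05` default and the decision rule (keep a modality unless even its upper bound falls below `min_freq`) are all assumptions made for illustration; the library's exact rule is described in the section referenced above.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, n, alpha=0.05):
    """Two-sided Wilson score interval for the proportion count / n."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


def is_viable(count, n, min_freq, alpha=0.05):
    """Assumed rule: reject only when the upper bound is below min_freq."""
    _, upper = wilson_interval(count, n, alpha)
    return upper >= min_freq


print(wilson_interval(50, 100))
print(is_viable(15, 1000, min_freq=0.02), is_viable(5, 1000, min_freq=0.02))
```

Under this reading, a modality observed 15 times out of 1000 is still compatible with `min_freq=0.02`, whereas one observed 5 times is not.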
Per-feature parallelism (n_jobs)
With DiscretizerConfig(n_jobs=k) and k > 1, BaseCarver dispatches
one task per feature through multiprocessing.Pool.imap_unordered. Each worker
receives a pickled deep copy of the CombinationEvaluator and a single
(feature, xagg, xagg_dev) payload; mutations stay local to the worker process
and the parent reattaches the (mutated) feature on completion. Verbose per-feature
logging is silenced — a single dispatch banner is printed when verbose=True.
Tip
Worth it only on a few hundred features or more. Below that, pool startup
and pickle overhead dominate and the single-process path is faster. The
DP top-K search already removes the per-feature compute
bottleneck, so most users will not need n_jobs > 1.
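The dispatch pattern described above can be sketched with the standard library alone. The worker `carve_feature`, the dictionary standing in for the evaluator, and the range computation standing in for the combination search are all hypothetical; only `multiprocessing.Pool.imap_unordered` and `copy.deepcopy` are real calls.

```python
import copy
from multiprocessing import Pool


def carve_feature(payload):
    """Worker: receives one feature and a private copy of the evaluator."""
    name, values, evaluator = payload
    evaluator["calls"] += 1  # this mutation never reaches the parent's evaluator
    return name, max(values) - min(values)  # stand-in for the combination search


def carve_all(features, evaluator, n_jobs=1):
    """Run carve_feature on every feature, in parallel when n_jobs > 1."""
    payloads = [(name, values, copy.deepcopy(evaluator)) for name, values in features.items()]
    if n_jobs <= 1 or len(payloads) <= 1:
        return dict(map(carve_feature, payloads))
    with Pool(processes=n_jobs) as pool:
        # Results arrive in completion order, so each one carries its feature name.
        return dict(pool.imap_unordered(carve_feature, payloads))
```

On platforms that spawn rather than fork worker processes, a call such as `carve_all(features, evaluator, n_jobs=2)` has to sit under an `if __name__ == "__main__":` guard.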
Dropped features (no robust combination)
A feature for which no candidate combination passed the
viability filter (Wilson min_freq on train and dev,
distinct target rates, train/dev rank preservation) is removed from
carver.features and retained on carver.dropped_features so the user
can inspect why it was dropped without re-fitting.
The summary and history properties append rows from dropped
features with two marker columns:
dropped (bool) – True for rows from a dropped feature, False otherwise.
dropped_reason (str | None) – synthesized from the dominant failing-test message across the feature’s historized combinations (e.g. “Inversion of target rates per modality”, “Non-representative modality for min_freq=2.00%”).
A dropped feature most commonly signals that X_dev is too small or not
representative of X for that feature: every candidate combination viable on
train flipped its target-rate ordering on dev. Increasing the dev sample size,
relaxing max_n_mod, or dropping the feature entirely are the three available
levers.
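Two of the viability checks named above, distinct target rates and train/dev rank preservation, can be illustrated in a few lines. The function names and the dictionary layout (one target rate per group of modalities) are assumptions for this sketch, not the library's internals.

```python
def has_distinct_rates(rates):
    """True when no two groups share exactly the same target rate."""
    return len(set(rates.values())) == len(rates)


def preserves_rank(train_rates, dev_rates):
    """True when groups sorted by target rate come out in the same order on train and dev."""
    return sorted(train_rates, key=train_rates.get) == sorted(dev_rates, key=dev_rates.get)


train = {"low": 0.05, "mid": 0.12, "high": 0.30}
dev_flipped = {"low": 0.06, "mid": 0.20, "high": 0.15}  # "mid" overtakes "high" on dev

print(has_distinct_rates(train), preserves_rank(train, dev_flipped))
```

A combination like `dev_flipped` is the kind of case reported as an inversion of target rates per modality.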
Classification tasks
Binary Classification
Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).
BinaryCarver’s built-in association measures are built on pandas.crosstab.
It is computed only once per feature \(x\) against the binary target \(y\).
The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add. powered, implementation of pandas.groupby.
BinaryCarver takes advantage of scipy.stats.chi2_contingency to perform association measuring.
It gives Pearson’s \(\chi^2\) statistics computed from crosstabs.
Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified, taking into account the binary target \(y\), to improve performance.
Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.
For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
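The formulas above can be checked by hand on a small crosstab. The sketch below computes Pearson’s \(\chi^2\) directly, without any continuity correction, rather than calling scipy.stats.chi2_contingency, so that it stays dependency-free; the function name `association` and the example counts are made up for illustration.

```python
from math import sqrt


def association(crosstab):
    """crosstab holds one [count_y0, count_y1] row per modality of x (at least two rows)."""
    n = sum(map(sum, crosstab))
    row_totals = [sum(row) for row in crosstab]
    col_totals = [sum(col) for col in zip(*crosstab)]
    # Pearson chi2: squared gap between observed and expected counts, scaled by expected.
    chi2 = sum(
        (obs - rt * ct / n) ** 2 / (rt * ct / n)
        for row, rt in zip(crosstab, row_totals)
        for obs, ct in zip(row, col_totals)
    )
    v = sqrt(chi2 / n)  # Cramér's V, simplified for a binary target
    t = v / sqrt(sqrt(len(crosstab) - 1))  # Tschuprow's T with n_x = number of rows
    return chi2, v, t


chi2, v, t = association([[30, 10], [20, 20], [10, 30]])
print(chi2, v, t)
```

Because \(T\) divides \(V\) by a term that grows with \(n_x\), it penalizes combinations with many modalities, which is consistent with the tip that Tschuprow’s \(T\) favors fewer modalities.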
- class AutoCarver.BinaryCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.
Examples
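A minimal usage sketch, relying only on the constructor, `fit`, `transform` and `summary` members documented in this section. The helper `fit_and_carve` is hypothetical, and the commented call site assumes that `features`, `X`, `y`, `X_dev` and `y_dev` already exist; how to build a `Features` object is not covered here.

```python
def fit_and_carve(carver, X, y, X_dev=None, y_dev=None):
    """Fit a carver, then return the carved training data and the carver's summary."""
    carver.fit(X, y, X_dev=X_dev, y_dev=y_dev)
    return carver.transform(X), carver.summary


# Hypothetical call site: `features`, `X`, `y`, `X_dev` and `y_dev` are assumed to exist.
# from AutoCarver import BinaryCarver
# carver = BinaryCarver(features, min_freq=0.02, max_n_mod=5)
# X_carved, summary = fit_and_carve(carver, X, y, X_dev, y_dev)
# print(summary[summary["dropped"]])  # rows of features with no robust combination
```

The chosen `min_freq=0.02` and `max_n_mod=5` are arbitrary values within the ranges suggested in the tips of this section.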
- Parameters:
features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
Drives the viability filter on both train and dev crosstabs during the combination search.
The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
The combination with the best association will be selected.
All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 5 (faster, more robust) and 7 (slower, less robust).
config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True), the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.
Tip
Use Tschuprow’s T Combinations for fewer, more robust, modalities.
Use Cramér’s V Combinations for more, less robust, modalities.
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k), k > 1, and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.
- Parameters:
X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
Combined combination-history of carved + dropped features.
Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Per-feature carving summary, extended with one block per dropped feature.
Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:
dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history, as the dominant failing test message across attempted combinations.
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Multiclass Classification
Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).
For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).
MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).
For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
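The indicator construction above amounts to a one-vs-rest encoding of the target. In the sketch below, the indexing \(Y_0\) to \(Y_{n_y-1}\) is read as leaving out the last class value; that reading, the sorted class order and the function name `indicator_targets` are assumptions made for illustration.

```python
def indicator_targets(y):
    """Build one 0/1 indicator list per class value of y, leaving out the last class."""
    classes = sorted(set(y))
    return {cls: [int(value == cls) for value in y] for cls in classes[:-1]}


targets = indicator_targets(["a", "b", "c", "a"])
print(targets)
```

Each indicator can then be treated as a binary target in its own right, which is how the binary association measures carry over to the multiclass case.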
- class AutoCarver.MulticlassCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.
Examples
Multiclass Classification Example
- Parameters:
features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
Drives the viability filter on both train and dev crosstabs during the combination search.
The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
The combination with the best association will be selected.
All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 5 (faster, more robust) and 7 (slower, less robust).
config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True), the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a binary target. Defaults to TschuprowtCombinations.
Tip
Use Tschuprow’s T Combinations for fewer, more robust, modalities.
Use Cramér’s V Combinations for more, less robust, modalities.
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k), k > 1, and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.
- Parameters:
X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
Combined combination-history of carved + dropped features.
Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Per-feature carving summary, extended with one block per dropped feature.
Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:
dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history, as the dominant failing test message across attempted combinations.
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Regression tasks
Continuous Regression
Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).
The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).
For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
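For intuition, the \(H\) statistic can be computed by hand from pooled ranks. The sketch below is a dependency-free version with midranks and the usual tie correction, written for illustration rather than taken from the library, which relies on scipy.stats.kruskal; the function name `kruskal_h` and the toy samples are made up.

```python
from collections import Counter


def kruskal_h(groups):
    """Kruskal-Wallis H for a list of samples (one list of y values per modality of x)."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)
    counts = Counter(pooled)
    first_pos = {}
    for pos, value in enumerate(pooled, start=1):
        first_pos.setdefault(value, pos)
    # Midrank: average of the 1-based positions occupied by tied values.
    rank = {value: first_pos[value] + (counts[value] - 1) / 2 for value in counts}
    rank_term = sum(sum(rank[v] for v in group) ** 2 / len(group) for group in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)
    ties = sum(t**3 - t for t in counts.values())
    return h / (1 - ties / (n_total**3 - n_total))  # undefined if all values are equal


three_groups = kruskal_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
two_groups = kruskal_h([[1, 2, 3, 4, 5, 6], [7, 8, 9]])
print(three_groups, two_groups)
```

On this toy data, keeping three modalities separates the target better than merging the first two, and the higher \(H\) reflects that.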
- class AutoCarver.ContinuousCarver(features: Features, min_freq: float, max_n_mod: int, *, combination_evaluator: CombinationEvaluator | None = None, config: DiscretizerConfig | None = None)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.
For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.
Examples
- Parameters:
features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
Drives the viability filter on both train and dev crosstabs during the combination search.
The pre-search discretization runs at the halved threshold half_min_freq (= min_freq / 2) so the combination evaluator has a finer granularity to recombine.
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
max_n_mod (int) –
Maximum number of modalities per carved feature. Forwarded to the configured CombinationEvaluator.
The combination with the best association will be selected.
All combinations of sizes from 1 to max_n_mod are tested out.
Tip
Set between 5 (faster, more robust) and 7 (slower, less robust).
config (DiscretizerConfig, optional) – Behavioral toggles inherited from BaseDiscretizer. Defaults to DiscretizerConfig(dropna=True, ordinal_encoding=True), the carver-friendly defaults (group nan, ordinal-encode features for downstream sklearn estimators).
combination_evaluator (CombinationEvaluator, optional) –
Pre-built evaluator instance measuring association between Features and a continuous target. Defaults to KruskalCombinations.
Currently, only Kruskal’s H Combinations is implemented for continuous targets.
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) Self
Finds the combination of modalities of X that provides the best association with y. If provided, X_dev set should be large enough to have the same distribution as X.
Features for which no candidate combination survives the viability filter (Wilson min_freq on train + dev, distinct target rates, train/dev rank preservation) are dropped from self.features and retained on self.dropped_features. With DiscretizerConfig(n_jobs=k), k > 1, and more than one feature, the per-feature combination search runs in parallel through multiprocessing.Pool.imap_unordered.
- Parameters:
X (pd.DataFrame) – Dataset to determine Features’ optimal carving.
y (pd.Series) – Target with which the association is maximized.
X_dev (pd.DataFrame, optional) – Dataset to evaluate robustness of Features, by default None
y_dev (pd.Series, optional) – Target associated to X_dev, by default None
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property history: DataFrame
Combined combination-history of carved + dropped features.
Dropped features’ rows are appended with dropped=True; carved features’ rows get dropped=False.
- classmethod load(file_name: str) BaseCarver
Allows one to load a Carver saved as a .json file.
- save(file_name: str, light_mode: bool = False) None
Saves pipeline to .json file.
- Parameters:
file_name (str) – String of .json file name.
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- property summary: DataFrame
Per-feature carving summary, extended with one block per dropped feature.
Rows from features that the carver dropped (no robust combination on train and/or dev) are appended at the end with two marker columns:
dropped (bool): True for dropped features, False otherwise.
dropped_reason (str | None): synthesized from the feature’s history, as the dominant failing test message across attempted combinations.
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame