Discretizers

AutoCarver implements Discretizers. It provides the following Data Preparation tools:

Discretizer (data type) and its data preparation:

Continuous Discretizer (continuous data, discrete data):
- Over-represented values are set as their own modality
- Automatic quantile bucketization of under-represented values
- Modalities are ordered by default real number ordering

Ordinal Discretizer (ordinal data):
- Under-represented modalities are grouped with the closest modality
- Modalities are ordered according to provided modality ranking

Categorical Discretizer (categorical data):
- Under-represented modalities are grouped into a default value
- Modalities are ordered by target rate (or, for a multiclass target, by correspondence-analysis first-axis score; see Modality ordering and target rates)

Note

The representativity threshold of modalities is user-selected (min_freq)
At this step, NaNs, if any, are set as their own modality (with no given order)
Helps improve modality relevancy and reduces the set of possible combinations to test
Included in all carving pipelines: BinaryCarver, MulticlassCarver, ContinuousCarver

ProcessingConfig

Behavioral toggles shared by every discretizer and carver. All flags are optional and propagate unchanged to sub-discretizers; domain parameters such as min_freq remain explicit constructor arguments.

class AutoCarver.discretizers.ProcessingConfig(copy: bool = True, ordinal_encoding: bool | None = None, dropna: bool | None = None, verbose: bool = False, n_jobs: int = 1, min_freq_alpha: float = 0.05, y_level_scores: dict | None = None, rescue_rare: bool | None = None)

Behavioral configuration applied to a BaseDiscretizer.

Carries cross-cutting toggles that propagate unchanged to sub-discretizers. Pure domain values (min_freq, combinations …) remain explicit constructor arguments; min_freq_alpha lives here because it tunes how min_freq is tested, not the target itself.

copy=True is the default so that BaseDiscretizer doesn’t mutate caller DataFrames in place — set to False when nested inside a pipeline that already owns the dataframe.

min_freq_alpha is the two-sided significance level of the Wilson interval used to decide whether a modality’s observed frequency is significantly below min_freq. Smaller values are more lenient (wider CI → fewer merges); 0.05 matches a 95% interval.

n_jobs controls per-feature parallelism inside BaseCarver: with n_jobs > 1 and more than one feature, the per-feature combination search runs through multiprocessing.Pool.imap_unordered. Worth it only on hundreds-to-thousands of features (pool startup + pickle overhead dominate below that).

ordinal_encoding and dropna default to None, meaning "use the context default": discretizers resolve them to False, carvers resolve them to True (group nan, ordinal-encode for downstream sklearn estimators). They are resolved to a concrete bool in BaseDiscretizer.__init__(). Leaving them None is what lets a partial config (e.g. ProcessingConfig(verbose=True)) toggle one field without silently flipping the carver-friendly defaults; set them explicitly to override.
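The None-means-context-default rule can be illustrated with a small sketch. ConfigSketch and resolve are hypothetical stand-ins written for this example; they are not the library's ProcessingConfig or its resolution code, and only three fields are modelled.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ConfigSketch:
    """Hypothetical stand-in for ProcessingConfig (three fields only)."""

    ordinal_encoding: Optional[bool] = None
    dropna: Optional[bool] = None
    verbose: bool = False


def resolve(config: ConfigSketch, *, is_carver: bool) -> ConfigSketch:
    """Replace each None with the context default: True in carvers, False otherwise."""

    def pick(value: Optional[bool]) -> bool:
        # An explicit True/False always wins over the context default.
        return is_carver if value is None else value

    return replace(
        config,
        ordinal_encoding=pick(config.ordinal_encoding),
        dropna=pick(config.dropna),
    )


# A partial config only sets verbose; the two None fields follow the context.
partial = ConfigSketch(verbose=True)
in_carver = resolve(partial, is_carver=True)
in_discretizer = resolve(partial, is_carver=False)

# An explicit value overrides the carver-friendly default.
explicit = resolve(ConfigSketch(dropna=False), is_carver=True)
```

Because unset fields stay None until resolution, setting verbose alone does not change how nan grouping or ordinal encoding behave in either context.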

y_level_scores is a {target level: score} scale the qualitative pre-sort maps y through before computing per-modality target means (e.g. train ridits resolved by OrdinalCarver from its target_scale). None (default) sorts by the raw target mean.

rescue_rare keeps features that fail the qualitative frequency check (a too-frequent mode — including NaN — or no frequent-enough modality) instead of raising, and lets carvers retry features whose combination search found nothing viable with the min_freq veto waived. Rescued combinations must still show distinct target rates and preserve train/dev rank ordering on X_dev and every CV fold — with no validation view the rescue is skipped. Off by default.

copy (bool, default True) – copy input X rather than mutating it.
ordinal_encoding (bool | None, default None) – emit ordinal codes instead of string labels. None resolves to False in discretizers and to True in carvers.
dropna (bool | None, default None) – group nan into another modality. None resolves to False in discretizers and to True in carvers.
verbose (bool, default False) – print progress and statistics.
n_jobs (int, default 1) – number of workers for parallel fits. Inside BaseCarver, n_jobs > 1 dispatches one task per feature through multiprocessing.Pool; see Carvers for sizing guidance.
min_freq_alpha (float, default 0.05) – two-sided significance level of the Wilson score interval used to gate min_freq. A modality is declared under-represented only when the Wilson upper bound of its observed proportion falls below min_freq (see Minimum-frequency test (Wilson score interval) for the formula and decision rule). Smaller \(\alpha\) → wider CI → fewer rejections → less merging; larger \(\alpha\) → tighter CI → more merging. \(\alpha = 1\) recovers the legacy strict-threshold behaviour.
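As a rough illustration of the min_freq gate, the sketch below uses the textbook Wilson score upper bound. The function names wilson_upper_bound and is_under_represented are hypothetical, and the library's exact formula and decision rule are the ones given in Minimum-frequency test (Wilson score interval); this is only an approximation of the idea.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound(count: int, n: int, alpha: float = 0.05) -> float:
    """Upper bound of the two-sided Wilson score interval for count / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value; alpha = 1 gives z = 0 (interval collapses to p).
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = count / n
    centre = p + z * z / (2 * n)
    spread = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)


def is_under_represented(count: int, n: int, min_freq: float, alpha: float = 0.05) -> bool:
    """Assumed decision rule: rare only if even the upper bound is below min_freq."""
    return wilson_upper_bound(count, n, alpha) < min_freq


# 3% observed with n = 1000: the upper bound stays below min_freq = 5%.
clearly_rare = is_under_represented(30, 1000, min_freq=0.05)
# 4% observed: below min_freq, but not significantly so at alpha = 0.05.
borderline = is_under_represented(40, 1000, min_freq=0.05)
# alpha = 1 falls back to comparing the raw proportion with min_freq.
strict = is_under_represented(40, 1000, min_freq=0.05, alpha=1.0)
```

The borderline case shows why a smaller alpha leads to less merging: the wider interval keeps a modality whose observed frequency is only slightly under the threshold.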

Discretizer, a complete discretization pipeline

class AutoCarver.discretizers.Discretizer(features: Features, min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization pipeline of continuous, discrete, categorical and ordinal features.

Pipeline steps: Complete pipeline for continuous and discrete features, Complete pipeline for categorical and ordinal features.

Modalities/values of features are grouped according to their respective orders:

[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.
[Continuous/Discrete features] real order of the values.

Parameters:

features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Quantitative Data

Complete pipeline for continuous and discrete features

The animation below walks through the five stages of QuantitativeDiscretizer on a synthetic Fare-like distribution with multiple class-fare spikes (generated with a fixed seed; the discretization itself is the real output of QuantitativeDiscretizer.fit):

Raw feature — Gaussian-KDE density estimate; the lognormal body sits under several discrete “class fare” peaks (0, 7.25, 13, 26.55), with the NaN proportion held aside.
Over-represented values detected — values occurring more often than \(1/q\) get their own singleton bin; here all four class-fare spikes qualify (marked in orange).
After ContinuousDiscretizer — four thin spike singletons plus the quantile bins that fill the gaps between them. Bars whose Wilson upper bound falls below min_freq are outlined in orange (the sparse segment (13, 25.9] and the tail (26.55, 33.9]) — these are the bins the OrdinalDiscretizer pass will merge.
Merge direction chosen — OrdinalDiscretizer merges each rare bin into the dominant neighbour with the closest target rate; dashed arrows point from the sparse bin to the bin that absorbs it.
After QuantitativeDiscretizer — the six surviving bins. Each merged bar spans the union of its absorbed Stage-2 slots and keeps the dominant (anchor) bin’s colour, so the eye can track which bin “swallowed” its sparse neighbour.

QuantitativeDiscretizer pipeline animation — raw density curve, over-rep detection, quantile bins with rare outlined, merge-direction arrows, merged bins

class AutoCarver.discretizers.QuantitativeDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization pipeline of continuous and discrete features.

Pipeline steps: Continuous Discretizer, Ordinal Discretizer

Values of features are grouped according to their real-number order.

Parameters:

quantitatives (list[QuantitativeFeature]) – Quantitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Continuous Discretizer

The animation below walks through the three stages of ContinuousDiscretizer on a synthetic Fare-like distribution (generated with a fixed seed; the discretization itself is the real output of ContinuousDiscretizer.fit_transform):

Raw feature — Gaussian-KDE density estimate of the continuous values, with the NaN proportion held aside.
Over-represented value detected — values occurring more often than \(1/q\) get their own modality (here Fare = 0, marked in orange).
After ContinuousDiscretizer — the over-rep modality plus quantile bins; bar widths reflect each modality’s real value range and bar heights its real frequency, with a horizontal reference at min_freq.

ContinuousDiscretizer pipeline animation — raw density curve, over-rep detection, quantile bins

class AutoCarver.discretizers.ContinuousDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization of continuous and discrete features, building simple groups of quantiles of values.

Quantile discretization creates a lot of modalities (for example: up to 100 modalities for min_freq=0.01). Set min_freq with caution.

The number of quantiles depends on over-represented modalities and NaNs:

Values more frequent than min_freq are set as their own modalities.
Other values are cut in quantiles using numpy.quantile.
The number of quantiles is set as (1-freq_frequent_modals)/(min_freq).
NaNs are considered as a modality (and are taken into account in freq_frequent_modals).
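The arithmetic above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the library's code: quantile_plan is a hypothetical helper, statistics.quantiles stands in for numpy.quantile, a plain frequency comparison replaces the Wilson test, and rounding the quantile count to the nearest integer is a guess.

```python
from collections import Counter
from math import isnan, nan
from statistics import quantiles


def quantile_plan(values: list, min_freq: float):
    """Return (over-represented values, number of quantiles, cut points)."""
    n = len(values)
    nan_count = sum(1 for v in values if isnan(v))
    counts = Counter(v for v in values if not isnan(v))

    # Values more frequent than min_freq become their own modality.
    frequent = sorted(v for v, c in counts.items() if c / n > min_freq)
    frequent_set = set(frequent)

    # NaNs count as a modality, so they are part of freq_frequent_modals.
    freq_frequent_modals = (nan_count + sum(counts[v] for v in frequent)) / n

    # (1 - freq_frequent_modals) / min_freq, rounded (rounding is an assumption).
    n_quantiles = max(int(round((1 - freq_frequent_modals) / min_freq)), 1)

    # Remaining values are cut into n_quantiles groups.
    rest = [v for v in values if not isnan(v) and v not in frequent_set]
    cuts = quantiles(rest, n=n_quantiles) if n_quantiles > 1 and len(rest) > 1 else []
    return frequent, n_quantiles, cuts


# 30% zeros, 10% NaN, and 60 distinct values covering the remaining 60%.
data = [0.0] * 30 + [float(i) for i in range(1, 61)] + [nan] * 10
frequent, n_quantiles, cuts = quantile_plan(data, min_freq=0.1)
```

With min_freq=0.1 the zero spike and the NaNs take up 40% of the rows, so the remaining 60% is split into six quantiles rather than ten.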

Parameters:

quantitatives (list[QuantitativeFeature]) – Quantitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Qualitative Data

Complete pipeline for categorical and ordinal features

The animation below shows how QualitativeDiscretizer processes two features in parallel — a categorical feature (Port, top strip) and an ordinal feature (AgeGroup, bottom strip):

Raw features — both strips shown: Port bars in frequency-descending order, AgeGroup bars in declared ordinal order; rare modalities outlined orange on each.
Rare modalities grouped — rare Port modalities (Belfast, Boston) collapse into __OTHER__; AgeGroup is unchanged (dimmed at 60 % opacity — not yet processed).
After CategoricalDiscretizer — Port bars reordered by ascending P(y=1); dot trace is now monotonic. AgeGroup still unchanged (dimmed).
OrdinalDiscretizer — merge direction — Port at full opacity (done); curved arrows show which rare AgeGroup modality merges into which neighbour.
After QualitativeDiscretizer — both strips at full opacity; AgeGroup bars span the slots of their absorbed modalities, ordinal order preserved.

class AutoCarver.discretizers.QualitativeDiscretizer(qualitatives: list[QualitativeFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization pipeline of categorical and ordinal features.

Pipeline steps: Categorical Discretizer, String Discretizer, Ordinal Discretizer.

Modalities/values of features are grouped according to their respective orders:

[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.

Parameters:

qualitatives (list[QualitativeFeature]) – Qualitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Categorical Discretizer

The animation below walks through the three stages of CategoricalDiscretizer on a synthetic Titanic-flavored Port feature (generated with a fixed seed; the discretization itself is the real output of CategoricalDiscretizer.fit_transform):

Raw feature — bars in frequency-descending order, with a small target-rate (\(P(y=1)\)) dot above each bar. Bars whose Wilson upper bound falls below min_freq are outlined in orange (here Belfast and Boston).
Rare modalities grouped — under-represented modalities collapse into the default __OTHER__ bin; the dot above __OTHER__ is the frequency-weighted target rate of the absorbed modalities.
After CategoricalDiscretizer — bars reordered by ascending target rate. The dot trace is now monotonic; each modality keeps its colour across the reorder so the eye can track its movement.

CategoricalDiscretizer pipeline animation — raw bars, rare-modality grouping, target-rate sort

The final reordering step is what defines the carvers’ search space — only consecutive modalities can later be merged. How the order is built per target type (target mean for numeric targets, correspondence-analysis first-axis score for multiclass ones) is detailed in Modality ordering and target rates.
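The two steps described above (grouping rare modalities, then sorting by target rate) can be sketched for a binary target as follows. group_and_order is a hypothetical helper; it uses a plain frequency threshold instead of the Wilson test, and the port names other than Belfast and Boston, like all counts, are made up for the example.

```python
from collections import Counter, defaultdict


def group_and_order(x: list, y: list, min_freq: float, default: str = "__OTHER__"):
    """Group rare modalities into `default`, then order modalities by target rate."""
    n = len(x)
    counts = Counter(x)
    # Simplification: strict threshold instead of the Wilson upper bound.
    grouped = [v if counts[v] / n >= min_freq else default for v in x]

    sizes = Counter(grouped)
    positives = defaultdict(float)
    for modality, target in zip(grouped, y):
        positives[modality] += target

    # Pooled target rate per modality (frequency-weighted for the default group).
    rates = {modality: positives[modality] / sizes[modality] for modality in sizes}
    order = sorted(rates, key=rates.get)
    return order, rates


x = (["Southampton"] * 50 + ["Cherbourg"] * 30 + ["Queenstown"] * 16
     + ["Belfast"] * 2 + ["Boston"] * 2)
y = ([1] * 15 + [0] * 35      # Southampton: 0.3
     + [1] * 18 + [0] * 12    # Cherbourg: 0.6
     + [1] * 6 + [0] * 10     # Queenstown: 0.375
     + [1, 1]                 # Belfast: rare
     + [0, 0])                # Boston: rare
order, rates = group_and_order(x, y, min_freq=0.05)
```

Only neighbours in order can later be merged by a carver, which is why the position of the default group in the sorted sequence matters.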

class AutoCarver.discretizers.CategoricalDiscretizer(categoricals: list[CategoricalFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization of categorical features, building simple groups frequent enough.

Groups a qualitative feature’s values less frequent than min_freq into a str_default string.

NaNs are left untouched.

Only use for qualitative non-ordinal features.

Parameters:

categoricals (list[CategoricalFeature]) – Categorical features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Ordinal Discretizer

The animation below walks through the three stages of OrdinalDiscretizer on a synthetic Titanic-flavored AgeGroup ordinal feature (generated with a fixed seed; the discretization itself is the real output of OrdinalDiscretizer.fit_transform):

Raw feature — bars in the user-declared ordinal order (child → elderly), with a small target-rate (\(P(y=1)\)) dot above each bar. Bars whose Wilson upper bound falls below min_freq are outlined in orange (here teen and elderly). The dot trace is not monotonic — ordinals are ranked by domain meaning, not by target rate.
Merge direction chosen — each rare modality merges with the adjacent neighbour whose target rate is closest (or its only neighbour at the edges). Dashed arrows show the chosen direction.
After OrdinalDiscretizer — merged bars span the slots of their absorbed modalities, ordinal order preserved.

OrdinalDiscretizer pipeline animation — raw bars, merge-direction arrows, merged groups
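The merge-direction rule from the second stage can be sketched as a small function. merge_direction is a hypothetical name, the target rates are invented, only teen and elderly come from the example above, and resolving ties towards the left neighbour is an assumption rather than documented behaviour.

```python
from typing import Optional


def merge_direction(rates: list, rare_index: int) -> Optional[int]:
    """Index of the adjacent modality that absorbs the rare one (None if alone)."""
    left = rare_index - 1 if rare_index > 0 else None
    right = rare_index + 1 if rare_index < len(rates) - 1 else None
    if left is None or right is None:
        # At an edge, the only neighbour (if any) absorbs the rare modality.
        return right if left is None else left
    gap_left = abs(rates[left] - rates[rare_index])
    gap_right = abs(rates[right] - rates[rare_index])
    # Assumption: ties go to the left neighbour.
    return left if gap_left <= gap_right else right


labels = ["child", "teen", "adult", "senior", "elderly"]
rates = [0.55, 0.40, 0.38, 0.30, 0.10]  # made-up P(y=1) per modality

teen_target = labels[merge_direction(rates, labels.index("teen"))]
elderly_target = labels[merge_direction(rates, labels.index("elderly"))]
```

Because only adjacent modalities are candidates, the declared ordinal order is preserved whatever the target rates look like.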

class AutoCarver.discretizers.OrdinalDiscretizer(ordinals: list[OrdinalFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization of ordinal features, grouping less frequent modalities with the closest modality in target rate or by frequency.

NaNs are left untouched.

Only use for qualitative ordinal features.

First fits String Discretizer if necessary.

Parameters:

ordinals (list[OrdinalFeature]) – Ordinal features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Nested Discretizer

NestedDiscretizer collapses several nested columns of increasing granularity (col_a ⊃ col_b ⊃ col_c) into a single robust output column. For each NestedFeature, modalities of the finest column that are too rare are rolled up to the coarser modality they are nested within — derived from the data — level by level until every surviving modality is frequent enough. It integrates into the carving pipeline automatically: declare nested features via Features(nested={"col_c": ["col_b", "col_a"]}).
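A level-by-level roll-up of this kind might look like the sketch below. roll_up is a hypothetical helper working on plain tuples ordered from coarsest to finest; it uses a strict frequency threshold instead of the Wilson test, assumes labels are unique across levels, and all place names and counts are invented.

```python
from collections import Counter


def roll_up(rows: list, min_freq: float) -> list:
    """Collapse (col_a, col_b, col_c) rows into one label per row."""
    n = len(rows)
    depth = len(rows[0])
    labels = [row[-1] for row in rows]  # start from the finest column
    levels = [depth - 1] * n            # level each label currently comes from

    for level in range(depth - 1, 0, -1):
        counts = Counter(labels)
        changed = False
        for i, row in enumerate(rows):
            # Simplification: strict threshold instead of the Wilson upper bound.
            if levels[i] == level and counts[labels[i]] / n < min_freq:
                labels[i] = row[level - 1]  # parent modality, read from the data
                levels[i] = level - 1
                changed = True
        if not changed:
            break  # every surviving modality is frequent enough
    return labels


rows = (
    [("EU", "FR", "Paris")] * 40 + [("EU", "FR", "Lyon")] * 3
    + [("EU", "FR", "Nice")] * 3 + [("EU", "DE", "Berlin")] * 30
    + [("US", "NY", "NYC")] * 20 + [("US", "CA", "LA")] * 2
    + [("US", "CA", "SF")] * 2
)
result = Counter(roll_up(rows, min_freq=0.05))
```

In this example Lyon and Nice pool into FR, which is frequent enough to stay, while LA and SF pool into CA, which is still too rare and rolls up once more to US, the coarsest level.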

class AutoCarver.discretizers.NestedDiscretizer(nesteds: list[NestedFeature], min_freq: float, *, config: ProcessingConfig | None = None)

Automatic discretization of nested qualitative features.

For each NestedFeature, modalities of the finest (output) column whose frequency is significantly below min_freq (Wilson upper bound at config.min_freq_alpha) are rolled up to the coarser modality they are nested within — derived from the data. This repeats level by level until all surviving modalities are frequent enough or the coarsest level is reached, collapsing every nested column into the single output column.

Parameters:

nesteds (list[NestedFeature]) – Nested features to process.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance ProcessingConfig.min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
- Features need at least one modality with frequency significantly above min_freq.
- For continuous features, drives the number of quantiles (roughly 1 / min_freq).
- Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip

Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

String Discretizer

StringDiscretizer is used as a data preparation tool to convert qualitative data to str type.

class AutoCarver.discretizers.StringDiscretizer(features: Features, *, config: ProcessingConfig | None = None)

Converts specified columns of a DataFrame into strings. First step of a Qualitative discretization pipeline.

Keeps NaN in place
Converts integer-valued floats to int before casting to str
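The two rules can be sketched on single values as follows. to_str is a hypothetical helper written for this illustration; the library works on DataFrame columns, and treating None like NaN is an assumption.

```python
from math import isnan, nan


def to_str(value):
    """Cast a value to str, keeping missing values and de-floating integers."""
    # Missing values are left untouched.
    if value is None or (isinstance(value, float) and isnan(value)):
        return value
    # Integer-valued floats are converted to int first, so 1.0 does not become "1.0".
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


converted = [to_str(v) for v in [1.0, 2.5, 3, "a", nan]]
```

Converting integer-valued floats first keeps a modality read as 1.0 in one dataset and 1 in another from splitting into two distinct strings.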

Parameters:

features (Features) – A set of Features to be processed.
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig — see ProcessingConfig for each field.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:

X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Timedelta Discretizer

TimedeltaDiscretizer is the quantitative counterpart of String Discretizer: a data preparation tool that converts DatetimeFeature columns to floats, namely the number of seconds elapsed since each feature’s reference_date. It runs before Continuous Discretizer so that datetime features can be bucketized as ordinary quantitative features.
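The conversion itself is simple arithmetic, sketched here with the standard library only. seconds_since is a hypothetical helper; None stands in for a missing datetime (pandas would use NaT), and the dates are arbitrary.

```python
from datetime import datetime
from math import isnan, nan
from typing import Optional


def seconds_since(value: Optional[datetime], reference_date: datetime) -> float:
    """Number of seconds elapsed between reference_date and value."""
    if value is None:
        return nan  # missing values stay missing
    return (value - reference_date).total_seconds()


reference = datetime(2024, 1, 1)
one_day_later = seconds_since(datetime(2024, 1, 2), reference)
one_minute_before = seconds_since(datetime(2023, 12, 31, 23, 59), reference)
missing = seconds_since(None, reference)
```

Dates before the reference come out negative, which is harmless for quantile bucketization since only the ordering of values matters.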

class AutoCarver.discretizers.TimedeltaDiscretizer(features: list[DatetimeFeature], *, config: ProcessingConfig | None = None)

Converts datetime features into floats: the number of seconds elapsed since each feature’s reference_date.

Quantitative counterpart of StringDiscretizer: a type-conversion step run before ContinuousDiscretizer so that datetime columns can be discretized as ordinary quantitative features.

Keeps NaN in place

Parameters:

features (list[DatetimeFeature]) – Datetime features to be converted to second-based timedeltas.
config (ProcessingConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized ProcessingConfig; see ProcessingConfig for each field.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters: light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
Returns: JSON serialized object
Return type: dict

transform(X: DataFrame, y: Series | None = None) → DataFrame: Converts each datetime feature’s column to seconds since its reference_date.