Discretizers

AutoCarver implements Discretizers. It provides the following Data Preparation tools:

Continuous Discretizer (Continuous Data, Discrete Data):

  • Over-represented values are set as their own modality

  • Automatic quantile bucketization of under-represented values

  • Modalities are ordered by default real number ordering

Ordinal Discretizer (Ordinal Data):

  • Under-represented modalities are grouped with the closest modality

  • Modalities are ordered according to the provided modality ranking

Categorical Discretizer (Categorical Data):

  • Under-represented modalities are grouped into a default value

  • Modalities are ordered by target rate

Note

  • Representativity threshold of modalities is user selected (min_freq)

  • At this step, nan values, if any, are set as their own modality (with no given order)

  • Helps improve modality relevancy and reduces the set of possible combinations to test

  • Included in all carving pipelines: BinaryCarver, MulticlassCarver, ContinuousCarver

DiscretizerConfig

Behavioral toggles shared by every discretizer and carver. All flags are optional and propagate unchanged to sub-discretizers; domain parameters such as min_freq remain explicit constructor arguments.

class AutoCarver.discretizers.DiscretizerConfig(copy: bool = True, ordinal_encoding: bool = False, dropna: bool = False, verbose: bool = False, n_jobs: int = 1)

Behavioral configuration applied to a BaseDiscretizer.

Carries only cross-cutting toggles that propagate unchanged to sub-discretizers. Domain parameters (min_freq, combinations …) are explicit constructor arguments, not config.

copy=True is the default so that BaseDiscretizer doesn’t mutate caller DataFrames in place — set to False when nested inside a pipeline that already owns the dataframe.

  • copy (bool, default True) — copy input X rather than mutating it.

  • ordinal_encoding (bool, default False) — emit ordinal codes instead of string labels (carvers default this to True).

  • dropna (bool, default False) — group nan into another modality (carvers default this to True).

  • verbose (bool, default False) — print progress and statistics.

  • n_jobs (int, default 1) — number of workers for parallel fits.

Discretizer, a complete discretization pipeline

class AutoCarver.discretizers.Discretizer(features: Features, min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretization pipeline of continuous, discrete, categorical and ordinal features.

Pipeline steps: Complete pipeline for continuous and discrete features, Complete pipeline for categorical and ordinal features.

Modalities/values of features are grouped according to their respective orders:

  • [Categorical features] order based on modality target rate.

  • [Ordinal features] user-specified order.

  • [Continuous/Discrete features] real order of the values.

Parameters:
  • features (Features) – A set of Features to be processed

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str
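As a rough illustration of the json.dump usage mentioned above, the following sketch writes a dictionary to a file-like object and reads it back. The payload is a made-up placeholder, not the actual structure produced by to_json, and io.StringIO stands in for an open file.

```python
import io
import json

# Placeholder standing in for the object returned by to_json();
# the real keys and values are not shown in this documentation.
payload = {"features": ["age", "income"], "min_freq": 0.05}

# io.StringIO stands in for a file opened in text mode.
buffer = io.StringIO()
json.dump(payload, buffer)

# Read the serialized content back to check the round trip.
buffer.seek(0)
restored = json.load(buffer)
```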

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Quantitative Data

Complete pipeline for continuous and discrete features

class AutoCarver.discretizers.QuantitativeDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretization pipeline of continuous and discrete features.

Pipeline steps: Continuous Discretizer, Ordinal Discretizer

Modalities/values of features are grouped according to their respective orders:

  • [Continuous/Discrete features] real order of the values.

Parameters:
  • quantitatives (list[QuantitativeFeature]) – Quantitative features to process

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Continuous Discretizer

class AutoCarver.discretizers.ContinuousDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretizing of continuous and discrete features, building simple groups of quantiles of values.

Quantile discretization creates a lot of modalities (for example: up to 100 modalities for min_freq=0.01). Set min_freq with caution.

The number of quantiles depends on overrepresented modalities and nans:

  • Values more frequent than min_freq are set as their own modalities.

  • Other values are cut in quantiles using numpy.quantile.

  • The number of quantiles is set as (1-freq_frequent_modals)/(min_freq).

  • Nans are considered as a modality (and are taken into account in freq_frequent_modals).
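The rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not AutoCarver's implementation: quantile_plan is a hypothetical helper, the quantile count is rounded to the nearest integer, nans are ignored, and statistics.quantiles from the standard library is swapped in for numpy.quantile to keep the example dependency-free.

```python
from collections import Counter
from statistics import quantiles


def quantile_plan(values, min_freq):
    """Hypothetical helper: frequent modalities, quantile count, cut points."""
    total = len(values)
    counts = Counter(values)

    # Values more frequent than min_freq become their own modalities.
    frequent = sorted(v for v, n in counts.items() if n / total > min_freq)
    freq_frequent_modals = sum(counts[v] for v in frequent) / total

    # Number of quantiles: (1 - freq_frequent_modals) / min_freq.
    # Rounding to the nearest integer is an assumption of this sketch.
    n_quantiles = round((1 - freq_frequent_modals) / min_freq)

    # The remaining values are cut into n_quantiles buckets
    # (statistics.quantiles returns n_quantiles - 1 cut points).
    rest = [v for v in values if v not in frequent]
    cuts = []
    if n_quantiles >= 2 and len(rest) >= 2:
        cuts = quantiles(rest, n=n_quantiles)
    return frequent, n_quantiles, cuts


# 40 rows equal to 0.0 and 60 distinct values from 1.0 to 60.0.
values = [0.0] * 40 + [float(i) for i in range(1, 61)]
frequent, n_quantiles, cuts = quantile_plan(values, min_freq=0.1)
print(frequent, n_quantiles, len(cuts))
```

With min_freq=0.1, the value 0.0 (40% of rows) is kept as its own modality and the remaining 60% of rows are split into 6 quantiles.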

Parameters:
  • quantitatives (list[QuantitativeFeature]) – Quantitative features to process

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series, optional) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Qualitative Data

Complete pipeline for categorical and ordinal features

class AutoCarver.discretizers.QualitativeDiscretizer(qualitatives: list[QualitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretization pipeline of categorical and ordinal features.

Pipeline steps: Categorical Discretizer, String Discretizer, Ordinal Discretizer.

Modalities/values of features are grouped according to their respective orders:

  • [Categorical features] order based on modality target rate.

  • [Ordinal features] user-specified order.
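To make the target-rate ordering of categorical features concrete, here is a small standalone sketch. It is not AutoCarver code: order_by_target_rate is a hypothetical helper, the target is assumed to be binary (0/1), and sorting in ascending order is an assumption.

```python
from collections import defaultdict


def order_by_target_rate(modalities, targets):
    """Hypothetical helper: rank modalities by their mean target value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for modality, target in zip(modalities, targets):
        sums[modality] += target
        counts[modality] += 1

    # Target rate of a modality = mean of the target over its rows.
    rates = {m: sums[m] / counts[m] for m in counts}

    # Ascending order is an assumption of this sketch.
    return sorted(rates, key=rates.get), rates


modalities = ["a", "b", "a", "c", "b", "c", "c", "a"]
targets = [1, 0, 1, 0, 1, 1, 0, 1]
order, rates = order_by_target_rate(modalities, targets)
print(order)
```

Here "a" has a target rate of 1.0, "b" of 0.5 and "c" of one third, so the resulting order is c, b, a.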

Parameters:
  • qualitatives (list[QualitativeFeature]) – Qualitative features to process

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Categorical Discretizer

class AutoCarver.discretizers.CategoricalDiscretizer(categoricals: list[CategoricalFeature], min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretization of categorical features, building simple groups frequent enough.

Groups a qualitative feature’s values less frequent than min_freq into a str_default string.

NaNs are left untouched.

Only use for qualitative non-ordinal features.
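The grouping of rare modalities can be sketched as follows. This is a simplified illustration, not the library's implementation: group_rare_modalities and the "__OTHER__" label are hypothetical, None stands in for NaN, and frequencies are assumed to be computed over all rows, missing ones included.

```python
from collections import Counter

# Hypothetical placeholder for the library's default string.
DEFAULT = "__OTHER__"


def group_rare_modalities(values, min_freq, default=DEFAULT):
    """Hypothetical helper: replace rare modalities with a default label."""
    total = len(values)

    # Missing values (None here) are not counted as a modality.
    counts = Counter(v for v in values if v is not None)

    # Modalities less frequent than min_freq are considered rare.
    rare = {v for v, n in counts.items() if n / total < min_freq}

    # Rare modalities get the default label; None is left untouched.
    return [default if v in rare else v for v in values]


values = ["red"] * 5 + ["blue"] * 3 + ["teal", None]
grouped = group_rare_modalities(values, min_freq=0.2)
print(grouped)
```

Only "teal" (10% of rows) falls below min_freq=0.2, so it is the only value replaced by the default label.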

Parameters:
  • categoricals (list[CategoricalFeature]) – Categorical features to process

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Ordinal Discretizer

class AutoCarver.discretizers.OrdinalDiscretizer(ordinals: list[OrdinalFeature], min_freq: float, *, config: DiscretizerConfig | None = None)

Automatic discretization of ordinal features, grouping less frequent modalities with the closest modality in target rate or by frequency.

NaNs are left untouched.

Only use for qualitative ordinal features.

First fits String Discretizer if necessary.
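One possible reading of the ordinal grouping is sketched below. It is an assumption-laden illustration rather than the library's algorithm: group_ordinal is a hypothetical helper, the least frequent group is merged first, only neighbours in the ranking are candidates, and the neighbour with the closest target rate is chosen.

```python
def group_ordinal(order, counts, target_sums, min_freq):
    """Hypothetical helper: merge rare ordinal modalities with a neighbour."""
    total = sum(counts.values())

    # Each group holds [members, row count, sum of target values].
    groups = [[[m], counts[m], target_sums[m]] for m in order]

    def rate(group):
        return group[2] / group[1]

    while len(groups) > 1:
        # Pick the least frequent group; stop once it is frequent enough.
        i = min(range(len(groups)), key=lambda k: groups[k][1])
        if groups[i][1] / total >= min_freq:
            break

        # Only adjacent groups in the ranking can absorb it.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(groups)]

        # Choose the neighbour whose target rate is closest.
        j = min(neighbours, key=lambda k: abs(rate(groups[k]) - rate(groups[i])))

        lo, hi = sorted((i, j))
        merged = [
            groups[lo][0] + groups[hi][0],
            groups[lo][1] + groups[hi][1],
            groups[lo][2] + groups[hi][2],
        ]
        groups[lo:hi + 1] = [merged]

    return [g[0] for g in groups]


order = ["low", "medium", "high", "very high"]
counts = {"low": 50, "medium": 40, "high": 5, "very high": 5}
target_sums = {"low": 5, "medium": 12, "high": 3, "very high": 4}
grouped = group_ordinal(order, counts, target_sums, min_freq=0.1)
print(grouped)
```

In this example "high" (5% of rows, target rate 0.6) is closer to "very high" (0.8) than to "medium" (0.3), so the two are merged and the merged group reaches min_freq.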

Parameters:
  • ordinals (list[OrdinalFeature]) – Ordinal features to process

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Chained Discretizer

ChainedDiscretizer can be used prior to any carving pipeline or any other discretizer to group categorical modalities more intelligently. By providing a set of modality groups, the user can introduce use-case-specific knowledge into the discretization process. The fitted Features can then be passed as a parameter for further discretization.

class AutoCarver.discretizers.ChainedDiscretizer(min_freq: float, features: list[BaseFeature], chained_orders: list[GroupedList], *, config: DiscretizerConfig | None = None)

Automatic discretization of categorical features, joining rare modalities into higher level groups.

For each provided GroupedList from the chained_orders attribute, values less frequent than min_freq are grouped into their respective group, as defined by the GroupedList.

Parameters:
  • features (Features) – A set of Features to be processed

  • min_freq (float) –

    Minimum frequency per modality per feature (required: the signature sets no default)

    • Features need at least one modality more frequent than min_freq

    • Defines number of quantiles of continuous features

    • Minimum frequency of modality of quantitative features

    Tip

    Set between 0.01 (slower, less robust) and 0.2 (faster, more robust)

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

  • chained_orders (list[GroupedList]) – A list of interlocked higher-level groups for each modality of each ordinal feature. Values of chained_orders[0] have to be grouped in chained_orders[1], etc.
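The chained grouping described above can be sketched with plain dictionaries. This is an illustration only: chain_group is a hypothetical helper, each level is a dict mapping a value to its higher-level group (a stand-in for GroupedList, whose interface is not shown here), and frequencies are recomputed after each level.

```python
from collections import Counter


def chain_group(values, chained_orders, min_freq):
    """Hypothetical helper: lift rare values into higher-level groups."""
    total = len(values)
    current = list(values)

    for level in chained_orders:
        # Frequencies are recomputed on the output of the previous level.
        counts = Counter(current)

        # Values less frequent than min_freq move up to their parent group.
        current = [
            level.get(v, v) if counts[v] / total < min_freq else v
            for v in current
        ]
    return current


# Level 0 maps cities to countries, level 1 maps countries to a continent.
city_to_country = {
    "Paris": "France", "Lyon": "France", "Nice": "France",
    "Berlin": "Germany", "Munich": "Germany", "Madrid": "Spain",
}
country_to_continent = {"France": "Europe", "Germany": "Europe", "Spain": "Europe"}

values = ["Paris"] * 5 + ["Lyon", "Nice", "Berlin", "Munich", "Madrid"]
grouped = chain_group(values, [city_to_country, country_to_continent], min_freq=0.2)
print(grouped)
```

Rare cities are first lifted to their country; "Spain" is still below min_freq after that step, so it is lifted once more to "Europe", while "Paris", "France" and "Germany" are frequent enough to stay as they are.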

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series, optional) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

String Discretizer

StringDiscretizer is used as a data preparation tool to convert qualitative data to str type.

class AutoCarver.discretizers.StringDiscretizer(features: Features, *, config: DiscretizerConfig | None = None)

Converts specified columns of a DataFrame into strings. First step of a Qualitative discretization pipeline.

  • Keeps NaN in place

  • Converts integer-valued floats to int
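The two conversion rules can be illustrated on single values. This is a minimal sketch, not the library's code: to_string is a hypothetical helper that works on scalars, and both None and float NaN are treated as missing.

```python
import math


def to_string(value):
    """Hypothetical helper: convert one value to str, keeping missing values."""
    # Missing values (None or NaN) are returned unchanged.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value

    # Floats holding an integer (3.0) are converted to int first ("3").
    if isinstance(value, float) and value.is_integer():
        return str(int(value))

    return str(value)


converted = [to_string(v) for v in [3.0, 2.5, 7, "a", None]]
print(converted)
```

Converting 3.0 to "3" rather than "3.0" keeps integer codes stored as floats consistent with the same codes stored as integers.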

Parameters:
  • features (Features) – A set of Features to be processed

  • config (DiscretizerConfig, optional) – Behavioral toggles (copy/ordinal_encoding/dropna/verbose/n_jobs), by default a default-initialized DiscretizerConfig.

fit(X: DataFrame, y: Series | None = None) → Self

Learns simple discretization of values of X according to values of y.

Parameters:
  • X (pd.DataFrame) – Training dataset, used to determine features’ optimal carving. Needs to have the columns specified in the features attribute.

  • y (pd.Series, optional) – Target with which the association is maximized.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

property summary: DataFrame

Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Converts to JSON format.

To be used with json.dump.

Parameters:

light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series | None = None) → DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.

  • y (pd.Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame