Discretizers
AutoCarver implements Discretizers. It provides the following Data Preparation tools:
| Discretizer / Data Type | Data Preparation |
|---|---|
| Continuous Data, Discrete Data | Over-represented values are set as their own modality. Automatic quantile bucketization of under-represented values. Modalities are ordered by default real number ordering. |
| Ordinal Data | Under-represented modalities are grouped with the closest modality. Modalities are ordered according to the provided modality ranking. |
| Categorical Data | Under-represented modalities are grouped into a default value. Modalities are ordered by target rate. |
Note
The representativity threshold of modalities is user-selected (min_freq).
At this step, nan values, if any, are set as their own modality (no given order).
Helps improve modality relevancy and reduces the set of possible combinations to test.
Included in all carving pipelines: BinaryCarver, MulticlassCarver, ContinuousCarver.
DiscretizerConfig
Behavioral toggles shared by every discretizer and carver. All flags are optional
and propagate unchanged to sub-discretizers; domain parameters such as min_freq
remain explicit constructor arguments.
- class AutoCarver.discretizers.DiscretizerConfig(copy: bool = True, ordinal_encoding: bool = False, dropna: bool = False, verbose: bool = False, n_jobs: int = 1, min_freq_alpha: float = 0.05)
Behavioral configuration applied to a BaseDiscretizer.
Carries cross-cutting toggles that propagate unchanged to sub-discretizers. Pure domain values (min_freq, combinations…) remain explicit constructor arguments; min_freq_alpha lives here because it tunes how min_freq is tested, not the target itself.
copy=True is the default so that BaseDiscretizer doesn’t mutate caller DataFrames in place; set it to False when nested inside a pipeline that already owns the DataFrame.
min_freq_alpha is the two-sided significance level of the Wilson interval used to decide whether a modality’s observed frequency is significantly below min_freq. Smaller values are more lenient (wider CI → fewer merges); 0.05 matches a 95% interval.
n_jobs controls per-feature parallelism inside BaseCarver: with n_jobs > 1 and more than one feature, the per-feature combination search runs through multiprocessing.Pool.imap_unordered. This pays off only with hundreds to thousands of features; below that, pool startup and pickling overhead dominate.
copy (bool, default True) – copy input X rather than mutating it.
ordinal_encoding (bool, default False) – emit ordinal codes instead of string labels (carvers default this to True).
dropna (bool, default False) – group nan into another modality (carvers default this to True).
verbose (bool, default False) – print progress and statistics.
n_jobs (int, default 1) – number of workers for parallel fits. Inside BaseCarver, n_jobs > 1 dispatches one task per feature through multiprocessing.Pool; see Carvers for sizing guidance.
min_freq_alpha (float, default 0.05) – two-sided significance level of the Wilson score interval used to gate min_freq. A modality is declared under-represented only when the Wilson upper bound of its observed proportion falls below min_freq (see Minimum-frequency viability test (Wilson score interval) for the formula and decision rule). Smaller \(\alpha\) → wider CI → fewer rejections → less merging; larger \(\alpha\) → tighter CI → more merging. \(\alpha = 1\) recovers the legacy strict-threshold behaviour.
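The effect of min_freq_alpha can be illustrated with a minimal sketch of the standard Wilson score upper bound. The function names wilson_upper_bound and is_under_represented are hypothetical, and the decision rule (upper bound below min_freq) is taken from the description above; AutoCarver's own implementation may differ in detail.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound(count: int, n: int, alpha: float = 0.05) -> float:
    """Upper bound of the two-sided Wilson score interval for count / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value; alpha = 1 gives z = 0, i.e. no interval at all.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = count / n
    centre = p_hat + z**2 / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return (centre + margin) / (1 + z**2 / n)


def is_under_represented(count: int, n: int, min_freq: float, alpha: float = 0.05) -> bool:
    """Hypothetical decision rule: merge only if even the upper bound is below min_freq."""
    return wilson_upper_bound(count, n, alpha) < min_freq


# 45 rows out of 1000 (observed frequency 0.045) against min_freq = 0.05
print(is_under_represented(45, 1000, 0.05, alpha=0.05))  # False: kept
print(is_under_represented(45, 1000, 0.05, alpha=1.0))   # True: strict threshold
```

With alpha = 1 the bound collapses to the observed proportion, which is why the strict-threshold behaviour is recovered; with alpha = 0.05 the same modality is given the benefit of the doubt and kept.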
Discretizer, a complete discretization pipeline
- class AutoCarver.discretizers.Discretizer(features: Features, min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretization pipeline of continuous, discrete, categorical and ordinal features.
Pipeline steps: Complete pipeline for continuous and discrete features, Complete pipeline for categorical and ordinal features.
Modalities/values of features are grouped according to their respective orders:
[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.
[Continuous/Discrete features] real order of the values.
- Parameters:
features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Quantitative Data
Complete pipeline for continuous and discrete features
The animation below walks through the five stages of
QuantitativeDiscretizer on a synthetic Fare-like distribution with
multiple class-fare spikes (generated with a fixed seed; the discretization
itself is the real output of QuantitativeDiscretizer.fit):
Raw feature: Gaussian-KDE density estimate; the lognormal body sits under several discrete “class fare” peaks (0, 7.25, 13, 26.55), with the NaN proportion held aside.
Over-represented values detected: values occurring more often than \(1/q\) get their own singleton bin; here all four class-fare spikes qualify (marked in orange).
After ContinuousDiscretizer: four thin spike singletons plus the quantile bins that fill the gaps between them. Bars whose Wilson upper bound falls below min_freq are outlined in orange (the sparse segment (13, 25.9] and the tail (26.55, 33.9]); these are the bins the OrdinalDiscretizer pass will merge.
Merge direction chosen: OrdinalDiscretizer merges each rare bin into the dominant neighbour with the closest target rate; dashed arrows point from the sparse bin to the bin that absorbs it.
After QuantitativeDiscretizer: the six surviving bins. Each merged bar spans the union of its absorbed Stage-2 slots and keeps the dominant (anchor) bin’s colour, so the eye can track which bin “swallowed” its sparse neighbour.
- class AutoCarver.discretizers.QuantitativeDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretization pipeline of continuous and discrete features.
Pipeline steps: Continuous Discretizer, Ordinal Discretizer
Modalities/values of features are grouped according to their respective orders:
[Continuous/Discrete features] real order of the values.
- Parameters:
quantitatives (list[QuantitativeFeature]) – Quantitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Continuous Discretizer
The animation below walks through the three stages of
ContinuousDiscretizer on a synthetic Fare-like distribution
(generated with a fixed seed; the discretization itself is the real output of
ContinuousDiscretizer.fit_transform):
Raw feature: Gaussian-KDE density estimate of the continuous values, with the NaN proportion held aside.
Over-represented value detected: values occurring more often than \(1/q\) get their own modality (here Fare = 0, marked in orange).
After ContinuousDiscretizer: the over-represented modality plus quantile bins; bar widths reflect each modality’s real value range and bar heights its real frequency, with a horizontal reference at min_freq.
- class AutoCarver.discretizers.ContinuousDiscretizer(quantitatives: list[QuantitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretizing of continuous and discrete features, building simple groups of quantiles of values.
Quantile discretization creates a lot of modalities (for example: up to 100 modalities for min_freq=0.01). Set min_freq with caution.
The number of quantiles depends on over-represented modalities and NaNs:
Values more frequent than min_freq are set as their own modalities.
Other values are cut in quantiles using numpy.quantile.
The number of quantiles is set as (1-freq_frequent_modals)/(min_freq).
NaNs are considered as a modality (and are taken into account in freq_frequent_modals).
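The quantile-count rule above can be sketched in a few lines. This is a dependency-free illustration, not the library's code: quantile_plan and is_nan are hypothetical names, statistics.quantiles stands in for numpy.quantile (so cut points may differ slightly), a plain frequency comparison replaces the Wilson test, and rounding the quantile count down is an assumption.

```python
import math
from collections import Counter
from statistics import quantiles


def is_nan(value) -> bool:
    return isinstance(value, float) and math.isnan(value)


def quantile_plan(values, min_freq):
    """Return (over-represented values, number of quantiles, cut points)."""
    n = len(values)
    nan_count = sum(1 for v in values if is_nan(v))
    counts = Counter(v for v in values if not is_nan(v))

    # Values more frequent than min_freq become their own modality.
    frequent = sorted(v for v, c in counts.items() if c / n > min_freq)

    # NaNs count as a modality and are included in freq_frequent_modals.
    freq_frequent = (sum(counts[v] for v in frequent) + nan_count) / n

    # (1 - freq_frequent_modals) / min_freq, rounded down (assumption); the
    # epsilon only guards against float results such as 5.999999999999999.
    n_quantiles = max(math.floor((1 - freq_frequent) / min_freq + 1e-9), 1)

    frequent_set = set(frequent)
    rest = [v for v in values if not is_nan(v) and v not in frequent_set]
    cuts = quantiles(rest, n=n_quantiles) if n_quantiles > 1 and len(rest) > 1 else []
    return frequent, n_quantiles, cuts


# Toy data: a spike at 0.0 (30 %), 60 distinct values, 10 % missing.
values = [0.0] * 30 + [float(i) for i in range(1, 61)] + [float("nan")] * 10
frequent, n_quantiles, cuts = quantile_plan(values, min_freq=0.1)
print(frequent, n_quantiles, len(cuts))
```

On this toy data the spike and the NaNs take up 40 % of the rows, so only six quantiles (five cut points) are needed instead of the ten that 1 / min_freq alone would suggest.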
- Parameters:
quantitatives (list[QuantitativeFeature]) – Quantitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series | None = None) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Qualitative Data
Complete pipeline for categorical and ordinal features
The animation below shows how QualitativeDiscretizer processes two
features in parallel — a categorical feature (Port, top strip) and an
ordinal feature (AgeGroup, bottom strip):
Raw features: both strips shown, with Port bars in frequency-descending order and AgeGroup bars in declared ordinal order; rare modalities are outlined in orange on each.
Rare modalities grouped: rare Port modalities (Belfast, Boston) collapse into __OTHER__; AgeGroup is unchanged (dimmed at 60 % opacity, not yet processed).
After CategoricalDiscretizer: Port bars reordered by ascending P(y=1); the dot trace is now monotonic. AgeGroup is still unchanged (dimmed).
OrdinalDiscretizer, merge direction: Port at full opacity (done); curved arrows show which rare AgeGroup modality merges into which neighbour.
After QualitativeDiscretizer: both strips at full opacity; AgeGroup bars span the slots of their absorbed modalities, ordinal order preserved.
- class AutoCarver.discretizers.QualitativeDiscretizer(qualitatives: list[QualitativeFeature], min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretization pipeline of categorical and ordinal features.
Pipeline steps: Categorical Discretizer, String Discretizer, Ordinal Discretizer.
Modalities/values of features are grouped according to their respective orders:
[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.
- Parameters:
qualitatives (list[QualitativeFeature]) – Qualitative features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Categorical Discretizer
The animation below walks through the three stages of
CategoricalDiscretizer on a synthetic Titanic-flavored Port
feature (generated with a fixed seed; the discretization itself is the real
output of CategoricalDiscretizer.fit_transform):
Raw feature: bars in frequency-descending order, with a small target-rate (\(P(y=1)\)) dot above each bar. Bars whose Wilson upper bound falls below min_freq are outlined in orange (here Belfast and Boston).
Rare modalities grouped: under-represented modalities collapse into the default __OTHER__ bin; the dot above __OTHER__ is the frequency-weighted target rate of the absorbed modalities.
After CategoricalDiscretizer: bars reordered by ascending target rate. The dot trace is now monotonic; each modality keeps its colour across the reorder so the eye can track its movement.
- class AutoCarver.discretizers.CategoricalDiscretizer(categoricals: list[CategoricalFeature], min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretization of categorical features, building simple groups that are frequent enough.
Groups a qualitative feature’s values less frequent than min_freq into a str_default string.
NaNs are left untouched.
Only use for qualitative non-ordinal features.
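The grouping and reordering described above can be sketched as follows. This is a simplified illustration under stated assumptions: group_rare_and_order is a hypothetical helper, None stands in for NaN, a plain frequency comparison replaces the Wilson test, and the counts are invented toy data.

```python
from collections import Counter, defaultdict

DEFAULT = "__OTHER__"  # default label used in the walkthrough above


def group_rare_and_order(values, target, min_freq, default=DEFAULT):
    """Group rare modalities into `default`, then order modalities by target rate."""
    n = len(values)
    counts = Counter(v for v in values if v is not None)
    rare = {v for v, c in counts.items() if c / n < min_freq}

    # Missing values (None here) are left untouched.
    grouped = [default if v in rare else v for v in values]

    sums, sizes = defaultdict(float), Counter()
    for v, y in zip(grouped, target):
        if v is not None:
            sums[v] += y
            sizes[v] += 1
    order = sorted(sizes, key=lambda v: sums[v] / sizes[v])  # ascending target rate
    return grouped, order


# Toy data: 100 rows, two rare ports, five missing values.
values = ["Southampton"] * 60 + ["Cherbourg"] * 30 + ["Belfast"] * 3 + ["Boston"] * 2 + [None] * 5
target = [1] * 18 + [0] * 42 + [1] * 18 + [0] * 12 + [0] * 3 + [1, 0] + [0] * 5
grouped, order = group_rare_and_order(values, target, min_freq=0.05)
print(order)  # ['__OTHER__', 'Southampton', 'Cherbourg']
```

Belfast and Boston fall below the threshold and are pooled, so the pooled group's target rate (1 positive out of 5 rows) decides where it lands in the final ordering.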
- Parameters:
categoricals (list[CategoricalFeature]) – Categorical features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Ordinal Discretizer
The animation below walks through the three stages of
OrdinalDiscretizer on a synthetic Titanic-flavored AgeGroup
ordinal feature (generated with a fixed seed; the discretization itself is
the real output of OrdinalDiscretizer.fit_transform):
Raw feature: bars in the user-declared ordinal order (child → elderly), with a small target-rate (\(P(y=1)\)) dot above each bar. Bars whose Wilson upper bound falls below min_freq are outlined in orange (here teen and elderly). The dot trace is not monotonic: ordinals are ranked by domain meaning, not by target rate.
Merge direction chosen: each rare modality merges with the adjacent neighbour whose target rate is closest (or its only neighbour at the edges). Dashed arrows show the chosen direction.
After OrdinalDiscretizer: merged bars span the slots of their absorbed modalities, ordinal order preserved.
- class AutoCarver.discretizers.OrdinalDiscretizer(ordinals: list[OrdinalFeature], min_freq: float, *, config: DiscretizerConfig | None = None)
Automatic discretization of ordinal features, grouping less frequent modalities with the closest modality in target rate or by frequency.
NaNs are left untouched.
Only use for qualitative ordinal features.
First fits String Discretizer if necessary.
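The neighbour-merging rule from the walkthrough above can be sketched as a small loop. This is an illustration only: merge_rare_ordinal is a hypothetical helper, a plain frequency comparison replaces the Wilson test, processing the rarest group first is an assumption, and the labels and counts are invented toy data.

```python
def merge_rare_ordinal(order, counts, positives, min_freq):
    """Merge rare modalities into the adjacent group with the closest target rate.

    order: labels in ordinal order; counts / positives: per-label row and
    positive-target counts (every label is assumed to have at least one row).
    """
    total = sum(counts.values())
    # Each group holds [member labels, row count, positive count].
    groups = [[[label], counts[label], positives[label]] for label in order]

    def rate(group):
        return group[2] / group[1]

    while len(groups) > 1:
        rare = [i for i, g in enumerate(groups) if g[1] / total < min_freq]
        if not rare:
            break
        i = min(rare, key=lambda k: groups[k][1])  # assumption: rarest first
        # Only adjacent groups are candidates, so ordinal order is preserved.
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(groups)]
        j = min(neighbours, key=lambda k: abs(rate(groups[k]) - rate(groups[i])))
        lo, hi = sorted((i, j))
        merged = [
            groups[lo][0] + groups[hi][0],
            groups[lo][1] + groups[hi][1],
            groups[lo][2] + groups[hi][2],
        ]
        groups[lo:hi + 1] = [merged]
    return [g[0] for g in groups]


order = ["child", "teen", "adult", "senior", "elderly"]
counts = {"child": 150, "teen": 30, "adult": 600, "senior": 200, "elderly": 20}
positives = {"child": 90, "teen": 17, "adult": 240, "senior": 60, "elderly": 4}
result = merge_rare_ordinal(order, counts, positives, min_freq=0.05)
print(result)  # [['child', 'teen'], ['adult'], ['senior', 'elderly']]
```

Here elderly sits at the edge and can only join senior, while teen has two neighbours and joins child because its target rate is closer to child's than to adult's.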
- Parameters:
ordinals (list[OrdinalFeature]) – Ordinal features to process
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Chained Discretizer
ChainedDiscretizer can be used prior to using any carving pipeline or any other discretizer to group categorical modalities more intelligently.
By providing a set of modality groups, the user can introduce use case specific knowledge into the discretization process.
The fitted Features can then be used as a parameter for further discretization.
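The idea of falling back to user-provided higher-level groups can be sketched with plain dictionaries standing in for GroupedList. Everything here is hypothetical (the chain_group helper, the dictionary representation, the city and country toy data, and the plain frequency comparison); it only illustrates how rare values climb one level of the hierarchy at a time.

```python
from collections import Counter


def chain_group(values, chained_levels, min_freq):
    """Replace rare values by their parent group, one hierarchy level at a time.

    chained_levels: list of {value: parent_group} dicts, finest level first
    (a stand-in for the library's GroupedList objects).
    """
    n = len(values)
    current = list(values)
    for level in chained_levels:
        counts = Counter(current)
        # Only values that are still too rare move up to their parent group.
        current = [
            level.get(v, v) if counts[v] / n < min_freq else v
            for v in current
        ]
    return current


city_to_country = {
    "Paris": "France", "Lyon": "France", "Nice": "France",
    "Berlin": "Germany", "Lima": "Peru", "Quito": "Ecuador",
}
country_to_continent = {
    "France": "Europe", "Germany": "Europe",
    "Peru": "South America", "Ecuador": "South America",
}

values = ["Paris"] * 50 + ["Lyon"] * 3 + ["Nice"] * 2 + ["Berlin"] * 40 + ["Lima"] * 3 + ["Quito"] * 2
result = chain_group(values, [city_to_country, country_to_continent], min_freq=0.05)
print(Counter(result))
```

In this toy run Lyon and Nice are pooled into France, which is then frequent enough to stay, whereas Peru and Ecuador remain rare after the first pass and are pooled again into South America.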
- class AutoCarver.discretizers.ChainedDiscretizer(min_freq: float, features: list[BaseFeature], chained_orders: list[GroupedList], *, config: DiscretizerConfig | None = None)
Automatic discretization of categorical features, joining rare modalities into higher level groups.
For each provided GroupedList from the chained_orders attribute, values less frequent than min_freq are grouped in their respective group, as defined by GroupedList.
- Parameters:
features (Features) – A set of Features to be processed.
min_freq (float) –
Minimum frequency per modality. Tested via a Wilson upper bound at significance DiscretizerConfig.min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).
Features need at least one modality with frequency significantly above min_freq.
For continuous features, drives the number of quantiles (roughly 1 / min_freq).
Modalities significantly below min_freq are merged with the closest one (ordinal) or with a default group (categorical).
Tip
Set between 0.01 (slower, less robust) and 0.05 (faster, more robust).
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
chained_orders (list[GroupedList]) – A list of interlocked higher-level groups for each modality of each ordinal feature. Values of chained_orders[0] have to be grouped in chained_orders[1], etc.
- fit(X: DataFrame, y: Series | None = None) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
String Discretizer
StringDiscretizer is used as a data preparation tool to convert qualitative data to str type.
- class AutoCarver.discretizers.StringDiscretizer(features: Features, *, config: DiscretizerConfig | None = None)
Converts specified columns of a DataFrame into strings. First step of a Qualitative discretization pipeline.
Keeps NaN in place.
Converts integer-valued floats to int.
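The two conversion rules above can be sketched with a small helper. The name to_string_modality is hypothetical and the function works on single Python values rather than DataFrame columns; it is meant only to show why integer-valued floats are normalised before casting.

```python
import math


def to_string_modality(value):
    """Convert one value to its string modality, keeping missing values as-is."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return value  # missing values are kept in place
    if isinstance(value, float) and value.is_integer():
        return str(int(value))  # integer-valued float: 3.0 and 3 share one label
    return str(value)


print([to_string_modality(v) for v in [1.0, 2.5, 3, "a"]])  # ['1', '2.5', '3', 'a']
```

Without the integer normalisation, a column read once as int and once as float would yield two distinct labels ("3" and "3.0") for the same modality.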
- Parameters:
features (Features) – A set of Features to be processed.
config (DiscretizerConfig, optional) – Behavioral toggles (copy / ordinal_encoding / dropna / verbose / n_jobs / min_freq_alpha). Defaults to a default-initialized DiscretizerConfig; see DiscretizerConfig for each field.
- fit(X: DataFrame, y: Series | None = None) Self
Learns simple discretization of values of X according to values of y.
- X : pd.DataFrame
Training dataset, used to determine the features’ optimal carving. Needs to have the columns specified in the features attribute.
- y : pd.Series
Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters. Pass only if the estimator accepts additional params in its fit method.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- property summary: DataFrame
Summary of discretization process for all features
- to_json(light_mode: bool = False) dict
Converts to JSON format.
To be used with json.dump.
- Parameters:
light_mode (bool, optional) – Whether or not to save features’ history and statistics, by default False
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series | None = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (pd.DataFrame) – Dataset to be carved. Needs to have columns from provided Features.
y (pd.Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame