Discretizers
AutoCarver implements Discretizers. It provides the following Data Preparation tools:
| Discretizer / Data Type | Data Preparation |
|---|---|
| Continuous Data, Discrete Data | Over-represented values are set as their own modality. Automatic quantile bucketization of under-represented values. Modalities are ordered by default real number ordering. |
| Ordinal Data | Under-represented modalities are grouped with the closest modality. Modalities are ordered according to the provided modality ranking. |
| Categorical Data | Under-represented modalities are grouped into a default value. Modalities are ordered by target rate. |
Note
Representativity threshold of modalities is user selected (min_freq attribute).
At this step, if any, numpy.nan are set as their own modality (no given order).
Helps improve modality relevancy and reduces the set of possible combinations to test.
Included in all carving pipelines: BinaryCarver, MulticlassCarver, ContinuousCarver.
Discretizer, a complete discretization pipeline
- class AutoCarver.discretizers.Discretizer(quantitative_features: list[str], qualitative_features: list[str], min_freq: float, *, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretization pipeline of continuous, discrete, categorical and ordinal features.
Pipeline steps: Complete pipeline for continuous and discrete features, Complete pipeline for categorical and ordinal features.
Modalities/values of features are grouped according to their respective orders:
[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.
[Continuous/Discrete features] real order of the values.
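The target-rate ordering used for categorical features can be illustrated with a minimal, standard-library sketch. It is not AutoCarver's implementation: the helper name order_by_target_rate and the choice of increasing order are assumptions made for illustration.

```python
from collections import defaultdict


def order_by_target_rate(values, targets):
    """Return the modalities sorted by increasing mean target value."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, target in zip(values, targets):
        sums[value] += target
        counts[value] += 1
    # target rate of a modality = mean of the target over its rows
    rates = {value: sums[value] / counts[value] for value in counts}
    return sorted(rates, key=rates.get)


colors = ["red", "red", "blue", "blue", "green", "green"]
target = [1, 1, 0, 1, 0, 0]
ordered = order_by_target_rate(colors, target)
```

Ordinal features skip this step because their ranking is supplied by the user, and quantitative features rely on the natural order of their values.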
- Parameters:
quantitative_features (list[str]) – List of column names of quantitative features (continuous and discrete) to be discretized
qualitative_features (list[str]) – List of column names of qualitative features (non-ordinal) to be discretized
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
ordinal_features (list[str], optional) – List of column names of ordinal features to be discretized. For those features a list of values has to be provided in the values_orders dict, by default None
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Quantitative Data
Complete pipeline for continuous and discrete features
- class AutoCarver.discretizers.QuantitativeDiscretizer(quantitative_features: list[str], min_freq: float, *, values_orders: dict[str, GroupedList] | None = None, input_dtypes: str | dict[str, str] = 'float', verbose: bool = False, copy: bool = False, **kwargs: dict)
Automatic discretization pipeline of continuous and discrete features.
Pipeline steps: Continuous Discretizer, Ordinal Discretizer
Modalities/values of features are grouped according to their respective orders:
[Continuous/Discrete features] real order of the values.
- Parameters:
quantitative_features (list[str]) – List of column names of quantitative features (continuous and discrete) to be discretized
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
input_dtypes (Union[str, dict[str, str]], optional) –
Input data type, converted to a dict of the provided type for each feature, by default "float"
If "str", features are considered as qualitative.
If "float", features are considered as quantitative.
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Continuous Discretizer
- class AutoCarver.discretizers.ContinuousDiscretizer(quantitative_features: list[str], min_freq: float, *, values_orders: dict[str, Any] | None = None, copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretizing of continuous and discrete features, building simple groups of quantiles of values.
Quantile discretization creates a lot of modalities (for example: up to 100 modalities for min_freq=0.01). Set min_freq with caution.
The number of quantiles depends on over-represented modalities and NaNs:
Values more frequent than min_freq are set as their own modalities.
Other values are cut in quantiles using numpy.quantile.
The number of quantiles is set as (1-freq_of_frequent_modalities)/(min_freq).
NaNs are considered as a modality (and are taken into account in freq_of_frequent_modalities).
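The quantile count described above can be worked through on a small sample. This is a rough sketch under stated assumptions, not the library's code: plan_quantiles is a hypothetical helper, None stands in for numpy.nan, and statistics.quantiles replaces numpy.quantile so the example only needs the standard library.

```python
from collections import Counter
from statistics import quantiles


def plan_quantiles(values, min_freq):
    """Return frequent modalities, the quantile count and cut points for the rest."""
    n = len(values)
    n_missing = sum(v is None for v in values)  # None plays the role of NaN
    counts = Counter(v for v in values if v is not None)
    # values more frequent than min_freq become their own modality
    frequent = sorted(v for v, c in counts.items() if c / n > min_freq)
    # missing values count towards freq_of_frequent_modalities
    n_frequent = n_missing + sum(counts[v] for v in frequent)
    n_quantiles = int((1 - n_frequent / n) / min_freq)
    rest = [v for v in values if v is not None and v not in frequent]
    cuts = quantiles(rest, n=n_quantiles) if n_quantiles > 1 and len(rest) > 1 else []
    return frequent, n_quantiles, cuts


# 20 rows: the value 0 covers 40%, missing values 10%, ten other values 5% each
values = [0] * 8 + [None] * 2 + list(range(1, 11))
frequent, n_quantiles, cuts = plan_quantiles(values, min_freq=0.25)
```

Here the frequent modality and the missing values together account for half of the rows, so the remaining half is split into (1 - 0.5) / 0.25 = 2 quantiles.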
- Parameters:
quantitative_features (list[str]) – List of column names of quantitative features (continuous and discrete) to be discretized
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series | None = None) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Qualitative Data
Complete pipeline for categorical and ordinal features
- class AutoCarver.discretizers.QualitativeDiscretizer(qualitative_features: list[str], min_freq: float, *, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, input_dtypes: str | dict[str, str] = 'str', copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretization pipeline of categorical and ordinal features.
Pipeline steps: Categorical Discretizer, String Discretizer, Ordinal Discretizer.
Modalities/values of features are grouped according to their respective orders:
[Categorical features] order based on modality target rate.
[Ordinal features] user-specified order.
- Parameters:
qualitative_features (list[str]) – List of column names of qualitative features (non-ordinal) to be discretized
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
ordinal_features (list[str], optional) – List of column names of ordinal features to be discretized. For those features a list of values has to be provided in the values_orders dict, by default None
input_dtypes (Union[str, dict[str, str]], optional) –
Input data type, converted to a dict of the provided type for each feature, by default "str"
If "str", features are considered as qualitative.
If "float", features are considered as quantitative.
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Categorical Discretizer
- class AutoCarver.discretizers.CategoricalDiscretizer(qualitative_features: list[str], min_freq: float, *, values_orders: dict[str, GroupedList] | None = None, copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretization of categorical features, building simple groups that are frequent enough.
Groups a qualitative feature’s values less frequent than min_freq into a str_default string.
NaNs are left untouched.
Only use for qualitative non-ordinal features.
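The grouping rule can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the library's code: group_rare_modalities is a hypothetical helper, None stands in for NaN, and frequencies are computed over all rows (missing ones included).

```python
from collections import Counter


def group_rare_modalities(values, min_freq, str_default="__OTHER__"):
    """Replace modalities rarer than min_freq with str_default, leaving None untouched."""
    n = len(values)
    counts = Counter(v for v in values if v is not None)
    rare = {v for v, c in counts.items() if c / n < min_freq}
    return [str_default if v in rare else v for v in values]


values = ["a"] * 5 + ["b"] * 3 + ["c"] + [None]
grouped = group_rare_modalities(values, min_freq=0.2)
```

Only "c" (10% of the rows) falls below the 20% threshold, so it is the only value replaced by "__OTHER__"; the missing value stays as it is.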
- Parameters:
qualitative_features (list[str]) – List of column names of qualitative features (non-ordinal) to be discretized
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Ordinal Discretizer
- class AutoCarver.discretizers.OrdinalDiscretizer(ordinal_features: list[str], min_freq: float, values_orders: dict[str, GroupedList], *, input_dtypes: str | dict[str, str] = 'str', copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretization of ordinal features, grouping less frequent modalities with the closest modality in target rate or by frequency.
NaNs are left untouched.
Only use for qualitative ordinal features.
First fits String Discretizer if necessary.
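One way to picture the grouping of under-represented ordinal modalities is the sketch below. It is a simplified illustration and makes several assumptions that the text above does not confirm: the helper group_ordinal is hypothetical, only adjacent modalities in the ranking can be merged, and the neighbour with the closest target rate is chosen.

```python
def group_ordinal(order, counts, events, min_freq):
    """Merge under-represented modalities into the adjacent group closest in target rate.

    order: modalities in their user-specified ranking
    counts / events: rows and positive targets per modality (counts must be > 0)
    """
    total = sum(counts[m] for m in order)
    # each group is [label, members, row count, positive targets]
    groups = [[m, [m], counts[m], events[m]] for m in order]
    while len(groups) > 1:
        idx = min(range(len(groups)), key=lambda i: groups[i][2])
        if groups[idx][2] / total >= min_freq:
            break  # every remaining group is frequent enough
        rate = groups[idx][3] / groups[idx][2]
        neighbours = [i for i in (idx - 1, idx + 1) if 0 <= i < len(groups)]
        closest = min(neighbours, key=lambda i: abs(groups[i][3] / groups[i][2] - rate))
        # keep members in ranking order inside the merged group
        groups[closest][1] = sorted(groups[closest][1] + groups[idx][1], key=order.index)
        groups[closest][2] += groups[idx][2]
        groups[closest][3] += groups[idx][3]
        del groups[idx]
    return {label: members for label, members, _, _ in groups}


order = ["low", "medium", "high", "very_high"]
counts = {"low": 40, "medium": 5, "high": 35, "very_high": 20}
events = {"low": 4, "medium": 1, "high": 14, "very_high": 12}
grouped = group_ordinal(order, counts, events, min_freq=0.1)
```

In this sample "medium" holds 5% of the rows with a target rate of 0.2, which is closer to "low" (0.1) than to "high" (0.4), so it is merged into "low".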
- Parameters:
ordinal_features (list[str]) – List of column names of ordinal features to be discretized. For those features a list of values has to be provided in the values_orders dict.
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
input_dtypes (Union[str, dict[str, str]], optional) –
Input data type, converted to a dict of the provided type for each feature, by default "str"
If "str", features are considered as qualitative.
If "float", features are considered as quantitative.
values_orders (dict[str, GroupedList]) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Chained Discretizer
ChainedDiscretizer can be used prior to using any carving pipeline or any other discretizer to group categorical modalities more intelligently.
By providing a set of modality groups, the user can introduce use case specific knowledge into the discretization process.
The fitted ordering can then be passed as values_orders parameter for further discretization.
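The idea of rolling rare modalities up through user-provided groups can be sketched as follows. This is not the ChainedDiscretizer itself: chain_group is a hypothetical helper, plain dicts stand in for GroupedList objects, and the city/country/continent hierarchy is invented sample data.

```python
from collections import Counter


def chain_group(values, chained_levels, min_freq):
    """Replace rare modalities with their parent group, one level at a time.

    chained_levels: list of dicts mapping a group label to the values it contains,
    ordered from the finest level to the coarsest.
    """
    current = list(values)
    n = len(current)
    for level in chained_levels:
        parent_of = {child: parent for parent, children in level.items() for child in children}
        counts = Counter(current)
        # frequent values are kept, rare ones move up to their parent group
        current = [parent_of.get(v, v) if counts[v] / n < min_freq else v for v in current]
    return current


cities = ["Paris"] * 6 + ["Lyon", "Nice", "Berlin", "Bonn"]
levels = [
    {"France": ["Paris", "Lyon", "Nice"], "Germany": ["Berlin", "Bonn"]},
    {"Europe": ["France", "Germany"]},
]
grouped = chain_group(cities, levels, min_freq=0.3)
```

"Paris" is frequent enough to survive both levels, while the four rare cities are first grouped by country and then, since each country is still below the threshold, into "Europe".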
- class AutoCarver.discretizers.ChainedDiscretizer(qualitative_features: list[str], min_freq: float, chained_orders: list[GroupedList], *, values_orders: dict[str, GroupedList] | None = None, unknown_handling: str = 'raise', copy: bool = False, verbose: bool = False, **kwargs: dict)
Automatic discretization of categorical features, joining rare modalities into higher level groups.
For each provided GroupedList from the chained_orders attribute, values less frequent than min_freq are grouped in their respective group, as defined by the GroupedList.
- Parameters:
qualitative_features (list[str]) – List of column names of qualitative features (non-ordinal) to be discretized
chained_orders (list[GroupedList]) – A list of interlocked higher level groups for each modality of each ordinal feature. Values of chained_orders[0] have to be grouped in chained_orders[1] etc.
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be discretized.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
unknown_handling (str, optional) –
Whether or not to remove unknown values, by default 'raise'.
'raise', unknown values raise an AssertionError.
'drop', unknown values are grouped with str_nan.
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series | None = None) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
String Discretizer
StringDiscretizer is used as a data preparation tool to convert qualitative data to str type.
- class AutoCarver.discretizers.StringDiscretizer(qualitative_features: list[str], *, values_orders: dict[str, GroupedList] | None = None, copy: bool = False, verbose: bool = False, **kwargs: dict)
Converts specified columns of a DataFrame into strings. First step of a Qualitative discretization pipeline.
Keeps NaN in place
Converts integer-valued floats to int
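The two conversion rules can be illustrated with a short standard-library sketch. It is an approximation for illustration only: to_string_modalities is a hypothetical helper that works on a plain list rather than on DataFrame columns.

```python
import math


def to_string_modalities(values):
    """Convert values to str, keeping NaN in place and writing integer-valued floats as ints."""
    converted = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            converted.append(value)  # NaN is left untouched
        elif isinstance(value, float) and value.is_integer():
            converted.append(str(int(value)))  # 1.0 becomes "1"
        else:
            converted.append(str(value))
    return converted


result = to_string_modalities([1.0, 2.5, 3, "a", float("nan")])
```

Writing 1.0 as "1" keeps a column that mixes ints and floats from producing two distinct string modalities for the same number.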
- Parameters:
qualitative_features (list[str]) – List of column names of qualitative features (non-ordinal) to be discretized
values_orders (dict[str, GroupedList], optional) – Dict of column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
copy (bool, optional) – If True, applies transform to a copy of the provided DataFrame, by default False
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, by default False
**kwargs (dict) – Pass values for str_default and str_nan (default string values)
Examples
- fit(X: DataFrame, y: Series | None = None) None
Learns simple discretization of values of X according to values of y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is maximized.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary(feature: str | None = None) DataFrame
Summarizes the data discretization process.
By default:
Modality str_default="__OTHER__" is generated for features that contain non-representative modalities.
Modality str_nan="__NAN__" is generated for features that contain numpy.nan.
- Parameters:
feature (str, optional) – Specify for which feature to return the summary, by default None
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in the features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
GroupedList
Note
AutoCarver would not exist without GroupedList. It allows for a complete historization of the data processing steps, thanks to its content dictionary attribute.
All modalities are stored inside the GroupedList and can safely be linked to their respective group label.
- class AutoCarver.discretizers.GroupedList(iterable: ndarray | dict | list | tuple = ())
An ordered list that’s extended with a per-value content dict.
- Parameters:
iterable (Union[ndarray, dict, list, tuple], optional) – List-like or GroupedList, by default ()
- append(new_value: Any) None
Appends a new_value to the GroupedList
- Parameters:
new_value (Any) – New key to be added.
- contains(value: Any) bool
Checks if a value is contained in any group, also matches NaNs.
- Parameters:
value (Any) – Value to search for
- Returns:
Whether the value is in the GroupedList
- Return type:
bool
- get(key: Any, default: Any | None = None) list[Any]
List of values contained in key
- Parameters:
key (Any) – Group.
default (Any, optional) – Value to return if key was not found, by default None
- Returns:
Values contained in key
- Return type:
list[Any]
- get_group(value: Any) Any
Returns the key (group) containing the specified value
- Parameters:
value (Any) – Value for which to find the group.
- Returns:
Corresponding key (group)
- Return type:
Any
- get_repr(char_limit: int = 6) list[str]
Returns a representative list of strings of values of groups.
- Parameters:
char_limit (int, optional) – Maximum number of characters per string, by default 6
- Returns:
List of short str representation of the keys’ values
- Return type:
list[str]
- group(discarded: Any, kept: Any) None
Groups the discarded value with the kept value
- Parameters:
discarded (Any) – Value to be grouped into the key kept.
kept (Any) – Key value in which to group discarded.
- group_list(to_discard: list[Any], to_keep: Any) None
Groups elements to_discard into values to_keep
- Parameters:
to_discard (list[Any]) – Values to be grouped into the key to_keep.
to_keep (Any) – Key value in which to group to_discard values.
- pop(idx: int) None
Pop a value from the GroupedList by index
- Parameters:
idx (int) – Index of the value to be popped out
- remove(value: Any) None
Removes a value from the GroupedList
- Parameters:
value (Any) – value to be removed
- sort()
Sorts the values of the list and dict (if any, NaNs are last).
- Returns:
Sorted GroupedList
- Return type:
GroupedList
- sort_by(ordering: list[Any]) None
Sorts the values of the list and dict according to ordering (if any, NaNs are last).
- Parameters:
ordering (list[Any]) – Order used for ordering of the list of keys.
- Returns:
Sorted GroupedList
- Return type:
GroupedList
- update(new_value: dict[Any, list[Any]]) None
Updates the GroupedList via a dict
- Parameters:
new_value (dict[Any, list[Any]]) – Dict of keys and values used to update the content dict
- values() list[Any]
All values contained in all groups
- Returns:
List of all values in the GroupedList
- Return type:
list[Any]
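To make the "ordered list plus per-value content dict" idea concrete, here is a deliberately reduced sketch. MiniGroupedList is a hypothetical stand-in written for this illustration; it mirrors only a few of the documented method names and does not reproduce the real class's NaN handling, dict input or sorting.

```python
class MiniGroupedList(list):
    """Ordered list of group labels with a dict of the values held by each group."""

    def __init__(self, iterable=()):
        super().__init__(iterable)
        # every label starts as a group that contains only itself
        self.content = {label: [label] for label in self}

    def group(self, discarded, kept):
        """Move every value of group `discarded` into group `kept`."""
        self.content[kept] += self.content.pop(discarded)
        self.remove(discarded)

    def get(self, key, default=None):
        """Values contained in the group `key`."""
        return self.content.get(key, default)

    def get_group(self, value):
        """Label of the group that contains `value`, or None if absent."""
        for label, members in self.content.items():
            if value in members:
                return label
        return None

    def values(self):
        """All values contained in all groups."""
        return [v for members in self.content.values() for v in members]


order = MiniGroupedList(["low", "medium", "high"])
order.group("medium", "low")
```

After the call to group, the list itself only holds the surviving labels, while the content dict still records that "medium" now belongs to "low". That record is what lets a grouped value be traced back to its label.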
Saving and Loading
- AutoCarver.discretizers.BaseDiscretizer.to_json(self) str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- AutoCarver.discretizers.load_discretizer(discretizer_json: dict) BaseDiscretizer
Allows one to load a Discretizer saved as a .json file.
The Discretizer has to be saved with json.dump(Discretizer.to_json(), f), otherwise there can be no guarantee for it to be restored.
- Parameters:
discretizer_json (dict) – Loaded .json file using json.load(f).
- Returns:
A fitted Discretizer.
- Return type:
BaseDiscretizer
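The save and load round trip can be sketched with the standard json module alone. The payload below is a hypothetical stand-in for whatever a fitted discretizer's to_json() returns; its keys are invented for illustration and do not describe the real serialized format.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the object returned by a fitted discretizer's to_json()
payload = {"min_freq": 0.05, "values_orders": {"color": ["blue", "red", "__OTHER__"]}}

path = os.path.join(tempfile.mkdtemp(), "discretizer.json")

# json.dump takes the object first and the file handle second
with open(path, "w") as f:
    json.dump(payload, f)

# json.load returns the content that would then be handed to load_discretizer
with open(path) as f:
    restored = json.load(f)
```

With AutoCarver installed, payload would be replaced by the result of to_json() on a fitted discretizer, and restored would be passed to load_discretizer.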