Carvers
The core of AutoCarver resides in the following Data Optimization steps:
Identifying the most associated combination from all ordered combinations of modalities.
Testing all combinations of NaNs grouped to one of those modalities.
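The two steps above can be pictured with a minimal sketch. It is not AutoCarver's internal code: the helper names `ordered_groupings` and `nan_placements`, the `"NaN"` label and the example modalities are hypothetical, and the sketch assumes that a "combination" means a split of the ordered modalities into consecutive groups.

```python
from itertools import combinations


def ordered_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 1..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(1, min(max_n_mod, n) + 1):
        # Choose the n_groups - 1 cut points between adjacent modalities.
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:stop] for start, stop in zip(bounds, bounds[1:])]


def nan_placements(grouping, nan_label="NaN"):
    """Yield copies of a grouping in which the NaN modality joins each group in turn."""
    for i in range(len(grouping)):
        yield [
            list(group) + [nan_label] if j == i else list(group)
            for j, group in enumerate(grouping)
        ]


modalities = ["low", "medium", "high", "very high"]  # hypothetical ordered modalities
groupings = list(ordered_groupings(modalities, max_n_mod=3))
with_nan = list(nan_placements(groupings[1]))
```

Each candidate produced this way would then be scored with the association measure of the task at hand, and the best-scoring one kept.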
Target-specific tools allow for association optimization per desired task:
Classification tasks
Binary Classification
Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).
BinaryCarver’s built-in association measures are based on pandas.crosstab.
It is computed only once per feature \(x\) against the binary target \(y\).
The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
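The idea of computing the crosstab once and deriving every grouped crosstab from it can be sketched as follows. This is an illustration only: the use of `numpy.add.reduceat` is an assumption (the text only names numpy.add), and the data and the `group_starts` variable are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical feature x (ordered modalities a < b < c < d) and binary target y.
x = pd.Series(["a", "a", "b", "b", "c", "c", "d", "d", "d", "a"])
y = pd.Series([0, 1, 0, 0, 1, 1, 1, 1, 0, 0])

# Computed once: one row per modality of x, one column per value of y.
xtab = pd.crosstab(x, y)

# Candidate combination {a, b} | {c} | {d}, described by the row index
# at which each group starts. Summing the rows of each group gives the
# grouped crosstab without going back to the raw data.
group_starts = [0, 2, 3]
grouped = np.add.reduceat(xtab.to_numpy(), group_starts, axis=0)
```

Only the small crosstab is re-aggregated for each candidate combination, which is why the per-combination cost does not depend on the number of rows in the dataset.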
BinaryCarver takes advantage of scipy.stats.chi2_contingency to perform association measuring.
It gives Pearson’s \(\chi^2\) statistic computed from crosstabs.
Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified to improve performance: for a binary target \(y\), the usual extra factor in the denominator (the smaller of the crosstab's number of rows and number of columns, minus one) equals 1 and drops out.
Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.
For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
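The two formulas above can be checked on small crosstabs. The sketch below is not BinaryCarver's implementation: the helper name `association` and the example tables are hypothetical, and `correction=False` is passed so that scipy returns the plain Pearson \(\chi^2\) (by default scipy applies Yates' correction to 2x2 tables).

```python
import numpy as np
from scipy.stats import chi2_contingency


def association(xtab):
    """Return (Cramér's V, Tschuprow's T) for a crosstab with a binary target."""
    xtab = np.asarray(xtab)
    n = xtab.sum()        # number of observations
    n_x = xtab.shape[0]   # number of (grouped) modalities of x
    chi2 = chi2_contingency(xtab, correction=False)[0]
    cramerv = np.sqrt(chi2 / n)
    tschuprowt = cramerv / np.sqrt(np.sqrt(n_x - 1))
    return cramerv, tschuprowt


# Rows are groups of modalities of x, columns are y=0 and y=1.
v_strong, t_strong = association([[40, 10], [10, 40]])
v_weak, t_weak = association([[26, 24], [24, 26]])
v_three, t_three = association([[40, 10], [20, 20], [10, 40]])
```

With two groups, \(n_x-1=1\) and both measures coincide. With three groups \(T\) is smaller than \(V\), which matches the tip that "tschuprowt" tends to favor fewer output modalities.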
- class AutoCarver.BinaryCarver(sort_by: str, min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.
- Parameters:
sort_by (str) –
Metric to be used to perform association measure between features and target.
"tschuprowt", for Tschuprow’s T.
"cramerv", for Cramér’s V.
Tip: use "tschuprowt" for more robust, or fewer, output modalities; use "cramerv" for more output modalities.
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be carved.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None
qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None
ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None
values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
max_n_mod (int, optional) –
Maximum number of modalities per feature, by default 5
All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.
Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)
min_freq_mod (float, optional) – Minimum frequency per final modality, by default None (min_freq is used)
output_dtype (str, optional) –
To be chosen amongst ["float", "str"], by default "float"
"float", grouped modalities will be converted to their corresponding floating rank.
"str", a per-group modality will be set for all the modalities of a group.
dropna (bool, optional) –
True, try to group numpy.nan with other modalities.
False, all non-numpy.nan will be grouped, by default True
copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False
verbose (bool, optional) –
True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False
True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.
Examples
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None
Finds the combination of modalities of X that provides the best association with y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None
Should have the same distribution as X.
y_dev (Series, optional) – Target of the development dataset, by default None
Should have the same distribution as y.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary() DataFrame
Summarizes the data discretization process.
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Multiclass Classification
Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of values taken by \(y\).
For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).
MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).
For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
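The indicator construction can be sketched as below. This only illustrates the one-versus-rest idea and is not MulticlassCarver's code: the example target and variable names are hypothetical, and the sketch builds one indicator per class without addressing whether one class is left out as redundant.

```python
import numpy as np

# Hypothetical multiclass target.
y = np.array(["red", "green", "blue", "red", "blue", "green", "red"])

# Sorted distinct values taken by y.
classes = np.unique(y).tolist()

# One binary indicator per class: 1 where y equals that class, 0 elsewhere.
indicators = {cls: (y == cls).astype(int) for cls in classes}
```

Each indicator is a valid binary target (only 0 and 1), so it can be carved against exactly as in the binary case.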
- class AutoCarver.MulticlassCarver(sort_by: str, min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.
- Parameters:
sort_by (str) –
Metric to be used to perform association measure between features and target.
"tschuprowt", for Tschuprow’s T.
"cramerv", for Cramér’s V.
Tip: use "tschuprowt" for more robust, or fewer, output modalities; use "cramerv" for more output modalities.
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be carved.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None
qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None
ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None
values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
max_n_mod (int, optional) –
Maximum number of modalities per feature, by default 5
All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.
Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)
min_freq_mod (float, optional) – Minimum frequency per final modality, by default None (min_freq is used)
output_dtype (str, optional) –
To be chosen amongst ["float", "str"], by default "float"
"float", grouped modalities will be converted to their corresponding floating rank.
"str", a per-group modality will be set for all the modalities of a group.
dropna (bool, optional) –
True, try to group numpy.nan with other modalities.
False, all non-numpy.nan will be grouped, by default True
copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False
verbose (bool, optional) –
True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False
True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.
Examples
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None
Finds the combination of modalities of X that provides the best association with y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None
Should have the same distribution as X.
y_dev (Series, optional) – Target of the development dataset, by default None
Should have the same distribution as y.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary() DataFrame
Summarizes the data discretization process.
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Regression tasks
Continuous Regression
Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).
The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).
For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
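How \(H\) separates a good grouping from a poor one can be sketched with scipy.stats.kruskal. The helper name `kruskal_h` and the toy data are hypothetical; this is an illustration of the measure, not ContinuousCarver's internal code.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_h(x, y):
    """Kruskal-Wallis H of y across the modalities of x."""
    x, y = np.asarray(x), np.asarray(y)
    # One sample of y per modality of x.
    samples = [y[x == modality] for modality in np.unique(x)]
    statistic, _ = kruskal(*samples)
    return statistic


y = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Grouping that follows the order of y: the three samples barely overlap.
h_separated = kruskal_h([0, 0, 0, 1, 1, 1, 2, 2, 2], y)

# Grouping that interleaves y: the three samples look alike.
h_mixed = kruskal_h([0, 1, 2, 0, 1, 2, 0, 1, 2], y)
```

The first grouping yields the larger \(H\), so it would be preferred when sorting combinations.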
- class AutoCarver.ContinuousCarver(min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)
Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.
For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.
- Parameters:
min_freq (float) –
Minimum frequency per grouped modalities.
Features whose most frequent modality is less frequent than min_freq will not be carved.
Sets the number of quantiles in which to discretize the continuous features.
Sets the minimum frequency of a quantitative feature’s modality.
Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)
quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None
qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None
ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None
values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initiated, by default None
max_n_mod (int, optional) –
Maximum number of modalities per feature, by default 5
All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.
Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)
min_freq_mod (float, optional) – Minimum frequency per final modality, by default None (min_freq is used)
output_dtype (str, optional) –
To be chosen amongst ["float", "str"], by default "float"
"float", grouped modalities will be converted to their corresponding floating rank.
"str", a per-group modality will be set for all the modalities of a group.
dropna (bool, optional) –
True, try to group numpy.nan with other modalities.
False, all non-numpy.nan will be grouped, by default True
copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False
verbose (bool, optional) –
True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False
True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.
Examples
- fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None
Finds the combination of modalities of X that provides the best association with y.
- Parameters:
X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.
y (Series) – Target with which the association is maximized.
X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None
Should have the same distribution as X.
y_dev (Series, optional) – Target of the development dataset, by default None
Should have the same distribution as y.
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns:
X_new – Transformed array.
- Return type:
ndarray array of shape (n_samples, n_features_new)
- summary() DataFrame
Summarizes the data discretization process.
- Returns:
A summary of features’ values per modalities.
- Return type:
DataFrame
- to_json() str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- transform(X: DataFrame, y: Series = None) DataFrame
Applies discretization to a DataFrame’s columns.
- Parameters:
X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.
y (Series, optional) – Target, by default None
- Returns:
Discretized X.
- Return type:
DataFrame
Saving and loading
- AutoCarver.BaseCarver.to_json(self) str
Converts to .json format.
To be used with json.dump.
- Returns:
JSON serialized object
- Return type:
str
- AutoCarver.load_carver(auto_carver_json: dict) BaseDiscretizer
Allows one to load an AutoCarver saved as a .json file.
The AutoCarver has to be saved with json.dump(AutoCarver.to_json(), f), otherwise there can be no guarantee for it to be restored.
- Parameters:
auto_carver_json (str) – Loaded .json file using json.load(f).
- Returns:
A fitted AutoCarver.
- Return type:
BaseDiscretizer
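The save and load round trip relies on the standard json module, where json.dump takes the object first and the file handle second. The sketch below only exercises that part: `carver_json` is a hypothetical stand-in for what a fitted carver's to_json() returns, and its keys are invented for the example.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the output of a fitted carver's to_json().
carver_json = {"features": ["age", "income"], "min_freq": 0.05}

path = os.path.join(tempfile.mkdtemp(), "carver.json")

# Saving: serialized object first, file handle second.
with open(path, "w") as f:
    json.dump(carver_json, f)

# Loading: the result of json.load(f) is what would be passed to load_carver.
with open(path) as f:
    loaded = json.load(f)
```

With AutoCarver installed, `loaded` would then be handed to AutoCarver.load_carver to obtain the fitted carver back.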