Carvers

The core of AutoCarver resides in its Carvers, which provide the following data optimization steps:

  1. Identifying the most associated combination from all ordered combinations of modalities.

  2. Testing all combinations of NaNs grouped to one of those modalities.
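
The two steps above can be pictured with a small enumeration sketch. This is only an illustration, not AutoCarver's implementation: the helper names `consecutive_groupings` and `nan_placements` and the `"__NAN__"` label are hypothetical, and starting at two groups is an assumption made here because a single group carries no association.

```python
from itertools import combinations


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 2..max_n_mod consecutive groups."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # Choose n_groups - 1 cut points among the n - 1 gaps between modalities.
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


def nan_placements(grouping, nan_label="__NAN__"):
    """Yield copies of a grouping with the NaN label attached to each group in turn."""
    for i in range(len(grouping)):
        yield [
            group + [nan_label] if j == i else group
            for j, group in enumerate(grouping)
        ]


# Step 1: all ordered combinations of four modalities in at most three groups.
groupings = list(consecutive_groupings(["a", "b", "c", "d"], max_n_mod=3))

# Step 2: all ways of attaching NaN to one group of a chosen grouping.
with_nan = list(nan_placements([["a"], ["b", "c", "d"]]))
```

Each candidate would then be scored with the association measure of the relevant Carver, and the best-scoring one kept.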

Target-specific tools allow for association optimization per desired task:

Classification tasks

Binary Classification

Within BinaryCarver, a binary target consists of a column \(y\) that only contains \(0\) and \(1\) (no str).

At the basis of BinaryCarver’s built-in association measures lies pandas.crosstab. It is computed only once per feature \(x\) against the binary target \(y\). The crosstab between \(y\) and each possible combination of modalities of \(x\) is then obtained via a vectorized, numpy.add-powered implementation of pandas.groupby.
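
As a minimal sketch of this idea, the crosstab can be computed once and its rows summed per group of consecutive modalities. The choice of `numpy.add.reduceat` is an assumption made for illustration (the text only names `numpy.add`), and the toy data is invented.

```python
import numpy as np
import pandas as pd

# Toy feature x (four ordered modalities) and binary target y.
x = pd.Series(["a", "a", "b", "b", "c", "c", "c", "d"])
y = pd.Series([0, 1, 0, 0, 1, 1, 0, 1])

# Computed only once: rows are modalities of x, columns are y == 0 and y == 1.
xtab = pd.crosstab(x, y)
counts = xtab.to_numpy()

# Group {a, b} and {c, d}: each group starts at row index 0 and 2 respectively.
# reduceat sums the rows between consecutive start indices, with no groupby call.
grouped = np.add.reduceat(counts, [0, 2], axis=0)
```

Trying another combination only requires a different list of start indices, so the original data never has to be scanned again.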

BinaryCarver uses scipy.stats.chi2_contingency to measure association. It returns Pearson’s \(\chi^2\) statistic computed from the crosstabs.

Cramér’s \(V\) is then computed using \(V=\sqrt{\frac{\chi^2}{n}}\) where \(n\) is the number of observations. This implementation has been simplified, taking into account the binary target \(y\), to improve performance: with two target modalities, the usual divisor \(\min(n_x-1,\,2-1)\) equals \(1\).

Finally, Tschuprow’s \(T\) is computed using \(T=\frac{V}{\sqrt{\sqrt{n_x-1}}}\) where \(n_x\) is the per-combination number of modalities.

For two combinations of modalities of \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(y\).
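
The formulas above can be checked on a small crosstab. This is a sketch, not BinaryCarver's code: the `association` helper is hypothetical, and passing `correction=False` (no Yates correction on 2x2 tables) is an assumption made so that the value matches the plain Pearson statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency


def association(xtab):
    """Return Cramér's V and Tschuprow's T for a (modalities x binary target) crosstab."""
    xtab = np.asarray(xtab)
    n_obs = xtab.sum()       # n: number of observations
    n_mod = xtab.shape[0]    # n_x: number of (grouped) modalities
    chi2 = chi2_contingency(xtab, correction=False)[0]
    cramerv = np.sqrt(chi2 / n_obs)
    tschuprowt = cramerv / np.sqrt(np.sqrt(n_mod - 1))
    return {"cramerv": cramerv, "tschuprowt": tschuprowt}


# Two grouped modalities: T equals V because sqrt(sqrt(2 - 1)) is 1.
two_groups = association([[30, 10], [10, 30]])

# Three grouped modalities: T is penalized relative to V.
three_groups = association([[30, 10], [20, 20], [10, 30]])
```

Because \(T\) shrinks as \(n_x\) grows, sorting by it favors combinations with fewer modalities, which is consistent with the tip given for sort_by.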

class AutoCarver.BinaryCarver(sort_by: str, min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a binary target.

Parameters:
  • sort_by (str) –

    Metric to be used to perform association measure between features and target.

    Tip: use "tschuprowt" for more robust results or fewer output modalities; use "cramerv" for more output modalities.

  • min_freq (float) –

    Minimum frequency per grouped modalities.

    • Features whose most frequent modality is less frequent than min_freq will not be carved.

    • Sets the number of quantiles in which to discretize the continuous features.

    • Sets the minimum frequency of a quantitative feature’s modality.

    Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)

  • quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None

  • qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None

  • ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None

  • values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initialized, by default None

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.

    Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)

  • min_freq_mod (float, optional) – Minimum frequency per final modality, by default None for min_freq

  • output_dtype (str, optional) –

    To be chosen among ["float", "str"], by default "float"

    • "float", grouped modalities will be converted to their corresponding floating rank.

    • "str", a per-group modality will be set for all the modalities of a group.

  • dropna (bool, optional) –

    • True, try to group numpy.nan with other modalities.

    • False, all non-numpy.nan will be grouped, by default True

  • copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False

  • verbose (bool, optional) –

    • True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False

    • True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.

    Tip: IPython displaying can be turned off by setting pretty_print=False.

  • **kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.

Examples

See AutoCarver examples

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y.

Parameters:
  • X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.

  • y (Series) – Target with which the association is maximized.

  • X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None. Should have the same distribution as X.

  • y_dev (Series, optional) – Target of the development dataset, by default None. Should have the same distribution as y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

summary() DataFrame

Summarizes the data discretization process.

Returns:

A summary of features’ values per modalities.

Return type:

DataFrame

to_json() str

Converts to .json format.

To be used with json.dump.

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.

  • y (Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Multiclass Classification

Within MulticlassCarver, a multiclass target consists of a column \(y\) that contains several values \(y_0\) to \(y_{n_y}\) where \(n_y>2\) is the number of modalities taken by \(y\).

For values \(y_0\) to \(y_{n_y-1}\) of \(y\), an indicator feature is built: \(Y_0 = \mathbb{1}_{y=y_0}\) to \(Y_{n_y-1} = \mathbb{1}_{y=y_{n_y-1}}\).

MulticlassCarver repeatedly applies BinaryCarver for features \(Y_0\) to \(Y_{n_y-1}\). Thus, the same association measures are implemented: Tschuprow’s \(T\) and Cramér’s \(V\).

For two combinations of modalities of a feature \(x\), a higher \(T\) or \(V\) value indicates a stronger relationship with the binary target \(Y\).
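
The indicator construction described above can be sketched with pandas. This is an illustration only: the toy target is invented, and building one indicator for every class (rather than for a subset of them) is an assumption of this sketch.

```python
import pandas as pd

# Toy multiclass target with three modalities.
y = pd.Series(["low", "mid", "high", "mid", "low"])

# One binary indicator Y_k = 1 if y == y_k, else 0, per class.
classes = sorted(y.unique())
indicators = {cls: (y == cls).astype(int) for cls in classes}

# Each indicator is a valid binary target (only 0 and 1), so a binary
# carving step could be applied to it in turn.
row_totals = sum(indicators.values())
```

Every row belongs to exactly one class, so the indicators sum to 1 on each row.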

class AutoCarver.MulticlassCarver(sort_by: str, min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a multiclass target.

Parameters:
  • sort_by (str) –

    Metric to be used to perform association measure between features and target.

    Tip: use "tschuprowt" for more robust results or fewer output modalities; use "cramerv" for more output modalities.

  • min_freq (float) –

    Minimum frequency per grouped modalities.

    • Features whose most frequent modality is less frequent than min_freq will not be carved.

    • Sets the number of quantiles in which to discretize the continuous features.

    • Sets the minimum frequency of a quantitative feature’s modality.

    Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)

  • quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None

  • qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None

  • ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None

  • values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initialized, by default None

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.

    Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)

  • min_freq_mod (float, optional) – Minimum frequency per final modality, by default None for min_freq

  • output_dtype (str, optional) –

    To be chosen among ["float", "str"], by default "float"

    • "float", grouped modalities will be converted to their corresponding floating rank.

    • "str", a per-group modality will be set for all the modalities of a group.

  • dropna (bool, optional) –

    • True, try to group numpy.nan with other modalities.

    • False, all non-numpy.nan will be grouped, by default True

  • copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False

  • verbose (bool, optional) –

    • True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False

    • True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.

    Tip: IPython displaying can be turned off by setting pretty_print=False.

  • **kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.

Examples

See AutoCarver examples

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y.

Parameters:
  • X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.

  • y (Series) – Target with which the association is maximized.

  • X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None. Should have the same distribution as X.

  • y_dev (Series, optional) – Target of the development dataset, by default None. Should have the same distribution as y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

summary() DataFrame

Summarizes the data discretization process.

Returns:

A summary of features’ values per modalities.

Return type:

DataFrame

to_json() str

Converts to .json format.

To be used with json.dump.

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.

  • y (Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Regression tasks

Continuous Regression

Within ContinuousCarver, a continuous target consists of a column \(y\) that contains values from \(-\infty\) to \(+\infty\) (no str).

The association with a categorical/ordinal feature \(x\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(y\) is distributed the same when \(x=0, ..., x=n_x\), where \(n_x\) is the number of modalities taken by \(x\).

For two combinations of modalities of \(x\), a higher \(H\) value indicates that there is a greater difference between the medians of the samples.
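
A short sketch shows how \(H\) separates a good grouping from a poor one. The `kruskal_h` helper and the toy data are hypothetical; only `scipy.stats.kruskal` comes from the text above.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_h(x, y):
    """Kruskal-Wallis H of y across the modalities of x."""
    samples = [y[x == modality] for modality in np.unique(x)]
    return kruskal(*samples).statistic


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])

# Modalities that split y into low, middle and high values.
x_separated = np.array(["a", "a", "a", "b", "b", "b", "c", "c", "c"])

# Modalities that each mix low, middle and high values.
x_mixed = np.array(["a", "b", "c", "a", "b", "c", "a", "b", "c"])

h_separated = kruskal_h(x_separated, y)
h_mixed = kruskal_h(x_mixed, y)
```

Here `h_separated` is much larger than `h_mixed`, so the first grouping would be ranked ahead of the second.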

class AutoCarver.ContinuousCarver(min_freq: float, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, ordinal_features: list[str] | None = None, values_orders: dict[str, GroupedList] | None = None, max_n_mod: int = 5, min_freq_mod: float | None = None, output_dtype: str = 'float', dropna: bool = True, copy: bool = False, verbose: bool = False, **kwargs)

Automatic carving of continuous, discrete, categorical and ordinal features that maximizes association with a continuous target.

For continuous targets, Kruskal-Wallis’ H test statistic is used as association measure to sort combinations.

Parameters:
  • min_freq (float) –

    Minimum frequency per grouped modalities.

    • Features whose most frequent modality is less frequent than min_freq will not be carved.

    • Sets the number of quantiles in which to discretize the continuous features.

    • Sets the minimum frequency of a quantitative feature’s modality.

    Tip: should be set between 0.02 (slower, more precise, less robust) and 0.05 (faster, more robust)

  • quantitative_features (list[str], optional) – List of column names of quantitative features (continuous and discrete) to be carved, by default None

  • qualitative_features (list[str], optional) – List of column names of qualitative features (non-ordinal) to be carved, by default None

  • ordinal_features (list[str], optional) – List of column names of ordinal features to be carved. For those features a list of values has to be provided in the values_orders dict, by default None

  • values_orders (dict[str, GroupedList], optional) – Dict of features’ column names and their associated ordering. If lists are passed, a GroupedList will automatically be initialized, by default None

  • max_n_mod (int, optional) –

    Maximum number of modalities per feature, by default 5

    All combinations of modalities for groups of modalities of sizes from 1 to max_n_mod will be tested. The combination with the best association will be selected.

    Tip: should be set between 4 (faster, more robust) and 7 (slower, more precise, less robust)

  • min_freq_mod (float, optional) – Minimum frequency per final modality, by default None for min_freq

  • output_dtype (str, optional) –

    To be chosen among ["float", "str"], by default "float"

    • "float", grouped modalities will be converted to their corresponding floating rank.

    • "str", a per-group modality will be set for all the modalities of a group.

  • dropna (bool, optional) –

    • True, try to group numpy.nan with other modalities.

    • False, all non-numpy.nan will be grouped, by default True

  • copy (bool, optional) – If True, feature processing at transform is applied to a copy of the provided DataFrame, by default False

  • verbose (bool, optional) –

    • True, without IPython installed: prints raw Discretizers and AutoCarver Fit steps for X, by default False

    • True, with IPython installed: adds HTML tables of target rates and frequencies for X and X_dev.

    Tip: IPython displaying can be turned off by setting pretty_print=False.

  • **kwargs – Pass values for str_default and str_nan (default string values), as well as pretty_print to turn off IPython.

Examples

See AutoCarver examples

fit(X: DataFrame, y: Series, *, X_dev: DataFrame | None = None, y_dev: Series | None = None) None

Finds the combination of modalities of X that provides the best association with y.

Parameters:
  • X (DataFrame) – Training dataset, to determine features’ optimal carving. Needs to have columns as specified in features attribute.

  • y (Series) – Target with which the association is maximized.

  • X_dev (DataFrame, optional) – Development dataset, to evaluate robustness of carved features, by default None. Should have the same distribution as X.

  • y_dev (Series, optional) – Target of the development dataset, by default None. Should have the same distribution as y.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns:

X_new – Transformed array.

Return type:

ndarray array of shape (n_samples, n_features_new)

summary() DataFrame

Summarizes the data discretization process.

Returns:

A summary of features’ values per modalities.

Return type:

DataFrame

to_json() str

Converts to .json format.

To be used with json.dump.

Returns:

JSON serialized object

Return type:

str

transform(X: DataFrame, y: Series = None) DataFrame

Applies discretization to a DataFrame’s columns.

Parameters:
  • X (DataFrame) – Dataset to be carved. Needs to have columns as specified in features attribute.

  • y (Series, optional) – Target, by default None

Returns:

Discretized X.

Return type:

DataFrame

Saving and loading

AutoCarver.BaseCarver.to_json(self) str

Converts to .json format.

To be used with json.dump.

Returns:

JSON serialized object

Return type:

str

AutoCarver.load_carver(auto_carver_json: dict) BaseDiscretizer

Allows one to load an AutoCarver saved as a .json file.

The AutoCarver has to be saved with json.dump(AutoCarver.to_json(), f), otherwise there is no guarantee that it can be restored.

Parameters:

auto_carver_json (dict) – Loaded .json file using json.load(f).

Returns:

A fitted AutoCarver.

Return type:

BaseDiscretizer
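
The save and load round trip can be sketched with the standard json module. The helpers `save_carver` and `read_carver_json` and the `_StandInCarver` class are hypothetical; the stand-in only mimics an object exposing to_json() so the sketch runs without a fitted Carver, and its payload is invented.

```python
import json
import os
import tempfile


def save_carver(carver, path):
    """Write the output of carver.to_json() to path (object first, file second)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(carver.to_json(), f)


def read_carver_json(path):
    """Read back the saved content with json.load."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


class _StandInCarver:
    """Hypothetical stand-in exposing to_json(), used only for this sketch."""

    def to_json(self):
        return {"features": ["age"], "max_n_mod": 5}


path = os.path.join(tempfile.mkdtemp(), "carver.json")
save_carver(_StandInCarver(), path)
restored = read_carver_json(path)
```

With a real fitted Carver, the object returned by `read_carver_json` would be the argument handed to AutoCarver.load_carver.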