Selectors

AutoCarver implements Selectors, which provide the following association-centric data selection steps:

Measuring association with a target and ranking features accordingly.

Filtering out features too associated with a better-ranked feature.

They allow one to select features:

Whatever their type: quantitative or qualitative

Whatever the optimization task: Classification tasks or Regression tasks

By default, quantitative features are:

Ranked according to Kruskal-Wallis’ H test statistic

Filtered according to Spearman’s rho correlation coefficient

By default, qualitative features are:

Ranked according to Tschuprow’s T

Filtered according to Tschuprow’s T

In general, associations are computed according to the provided data types of \(x\) and \(y\):

\(x\) \ \(y\)	Qualitative	Quantitative
Qualitative	Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\)	Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative	Kruskal-Wallis’ \(H\), \(R\) coefficient	Pearson’s \(r\), Spearman’s \(\rho\)

See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.
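The two steps listed at the top of this page (rank features by association with the target, then drop any feature too associated with a better-ranked one) can be sketched in plain Python. This is only an illustration under stated assumptions, not AutoCarver's implementation: `rank_and_filter` and `abs_pearson` are hypothetical helpers, the same absolute Pearson correlation is used for both ranking and filtering, and truncation to `n_best` is assumed to happen after filtering.

```python
import math
from typing import Callable, Sequence


def abs_pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Absolute Pearson correlation (hypothetical stand-in for a measure or filter)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return abs(cov) / math.sqrt(var_a * var_b)


def rank_and_filter(
    features: dict[str, Sequence[float]],
    target: Sequence[float],
    measure: Callable[[Sequence[float], Sequence[float]], float],
    pairwise: Callable[[Sequence[float], Sequence[float]], float],
    thresh_corr: float,
    n_best: int,
) -> list[str]:
    # Step 1: rank features by decreasing association with the target
    ranked = sorted(features, key=lambda name: measure(features[name], target), reverse=True)
    # Step 2: keep a feature only if no better-ranked kept feature is too associated with it
    kept: list[str] = []
    for name in ranked:
        if all(pairwise(features[name], features[better]) <= thresh_corr for better in kept):
            kept.append(name)
    return kept[:n_best]


y = [1, 2, 3, 4, 5, 6]
features = {
    "x1": [1, 2, 3, 4, 5, 6],  # strongly associated with y
    "x2": [1, 2, 3, 4, 5, 7],  # nearly redundant with x1
    "x3": [2, 1, 2, 1, 2, 1],  # weakly associated with y, not redundant
}
selected = rank_and_filter(features, y, abs_pearson, abs_pearson, thresh_corr=0.9, n_best=2)
print(selected)
```

In this toy run, `x2` ranks second but is dropped because it is too correlated with the better-ranked `x1`, which leaves room for `x3`.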

Classification tasks

class AutoCarver.selectors.ClassificationSelector(n_best: int, qualitative_features: list[str] | None = None, quantitative_features: list[str] | None = None, *, quantitative_measures: list[Callable] | None = None, qualitative_measures: list[Callable] | None = None, quantitative_filters: list[Callable] | None = None, qualitative_filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.

Get your best features with ClassificationSelector.select()!

Parameters:

n_best (int) –
Number of best features to select. Best features are:
- The first n_best of each provided data types as set in:
  - quantitative_features
  - qualitative_features
- The first n_best for each provided measures as set in:
  - quantitative_measures
  - qualitative_measures
quantitative_features (list[str]) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str]) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
quantitative_measures (list[Callable], optional) –
List of association measures to be used for quantitative_features. Implemented measures are:
- For association evaluation:
  - Kruskal-Wallis’ H (default)
  - Coefficient of determination R
- For outlier detection:
  - Standard score
  - Interquartile range
qualitative_measures (list[Callable], optional) –
List of association measures to be used for qualitative_features. Implemented measures are:
- For association evaluation:
  - Pearson’s chi²
  - Cramér’s V
  - Tschuprow’s T (default)
quantitative_filters (list[Callable], optional) –
List of filters to be used for quantitative_features. Implemented filters are:
- For linear correlation:
  - Pearson’s r
  - Spearman’s rho (default)
qualitative_filters (list[Callable], optional) –
List of filters to be used for qualitative_features. Implemented filters are:
- For correlation:
  - Cramér’s V
  - Tschuprow’s T (default)
colsample (float, optional) –
Fraction of the list of features sampled at a time to speed up computation, between 0 and 1, by default 1.0 (all features are used).

For colsample=0.5, Selector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:].

Tip: for better performance, colsample should be set such that len(features)//2 < 200.
verbose (bool, optional) –
- True, without IPython: prints raw selection steps for X, by default False
- True, with IPython: adds HTML tables to the output.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Allows one to set, as keyword arguments, thresholds for the provided quantitative_measures/qualitative_measures and quantitative_filters/qualitative_filters (see Association measures, X by y and Association filters, X by X).
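The colsample=0.5 behaviour described above (searching the first half of the features, then the second half) can be illustrated with a small chunking helper. This is a sketch only: `feature_chunks` is a hypothetical function, and the way it generalises to other values of colsample (consecutive slices of about `colsample * len(features)` items) is an assumption, not the library's documented behaviour.

```python
def feature_chunks(features: list[str], colsample: float) -> list[list[str]]:
    """Split features into consecutive chunks of about colsample * len(features) items."""
    if not 0 < colsample <= 1:
        raise ValueError("colsample must be in (0, 1]")
    # Assumed chunk size; at least one feature per chunk
    size = max(1, int(len(features) * colsample))
    return [features[start:start + size] for start in range(0, len(features), size)]


features = ["f1", "f2", "f3", "f4", "f5", "f6"]
chunks = feature_chunks(features, colsample=0.5)
print(chunks)  # two chunks: features[:3] and features[3:]
```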

Examples

See Selectors examples

select(X: DataFrame, y: Series) → list[str]

Selects the n_best features of the DataFrame, by association with the target

Parameters:

X (DataFrame) – Dataset used to determine optimal features. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is evaluated.

Returns:

List of selected features

Return type:

list[str]

Regression tasks

class AutoCarver.selectors.RegressionSelector(n_best: int, qualitative_features: list[str] | None = None, quantitative_features: list[str] | None = None, *, quantitative_measures: list[Callable] | None = None, qualitative_measures: list[Callable] | None = None, quantitative_filters: list[Callable] | None = None, qualitative_filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.

Get your best features with RegressionSelector.select()!

Parameters:

n_best (int) –
Number of best features to select. Best features are:
- The first n_best of each provided data types as set in:
  - quantitative_features
  - qualitative_features
- The first n_best for each provided measures as set in:
  - quantitative_measures
  - qualitative_measures
quantitative_features (list[str]) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str]) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
quantitative_measures (list[Callable], optional) –
List of association measures to be used for quantitative_features. Implemented measures are:
- For association evaluation:
  - Kruskal-Wallis’ H (default)
  - Coefficient of determination R
- For outlier detection:
  - Standard score
  - Interquartile range
qualitative_measures (list[Callable], optional) –
List of association measures to be used for qualitative_features. Implemented measures are:
- For association evaluation:
  - Pearson’s chi²
  - Cramér’s V
  - Tschuprow’s T (default)
quantitative_filters (list[Callable], optional) –
List of filters to be used for quantitative_features. Implemented filters are:
- For linear correlation:
  - Pearson’s r
  - Spearman’s rho (default)
qualitative_filters (list[Callable], optional) –
List of filters to be used for qualitative_features. Implemented filters are:
- For correlation:
  - Cramér’s V
  - Tschuprow’s T (default)
colsample (float, optional) –
Fraction of the list of features sampled at a time to speed up computation, between 0 and 1, by default 1.0 (all features are used).

For colsample=0.5, Selector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:].

Tip: for better performance, colsample should be set such that len(features)//2 < 200.
verbose (bool, optional) –
- True, without IPython: prints raw selection steps for X, by default False
- True, with IPython: adds HTML tables to the output.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Allows one to set, as keyword arguments, thresholds for the provided quantitative_measures/qualitative_measures and quantitative_filters/qualitative_filters (see Association measures, X by y and Association filters, X by X).

Examples

See Selectors examples

select(X: DataFrame, y: Series) → list[str]

Selects the n_best features of the DataFrame, by association with the target

Parameters:

X (DataFrame) – Dataset used to determine optimal features. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is evaluated.

Returns:

List of selected features

Return type:

list[str]

Association measures, X by y

Quantitative measures

Distance Correlation

For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:

\[1 - \frac{ (x - \bar{x}) \cdot (y - \bar{y}) } { ||x - \bar{x}||_2 ||y - \bar{y}||_2 }\]

where:

\(n_x\) is the number of observations of \(x\)

\(n_y\) is the number of observations of \(y\)

\(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)

\(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)

\(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the euclidean norm of the centered \(x\)

\(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the euclidean norm of the centered \(y\)

The Distance Correlation is computed using scipy.spatial.distance.correlation.
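As a sanity check on the formula above, here is a minimal pure-Python sketch of the same quantity. `correlation_distance` is a hypothetical helper written for illustration; it is not the scipy function itself, and it assumes \(x\) and \(y\) have the same number of observations.

```python
import math


def correlation_distance(x: list[float], y: list[float]) -> float:
    """1 - (centered x . centered y) / (||centered x|| * ||centered y||)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dot = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1 - dot / (norm_x * norm_y)


x = [1.0, 2.0, 3.0, 4.0]
d_same = correlation_distance(x, [2.0, 4.0, 6.0, 8.0])      # perfectly increasing relation
d_opposite = correlation_distance(x, [8.0, 6.0, 4.0, 2.0])  # perfectly decreasing relation
print(d_same, d_opposite)
```

The value is close to 0 for a perfectly increasing linear relation and close to 2 for a perfectly decreasing one, since the fraction in the formula is Pearson's \(r\).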

AutoCarver.selectors.measures.distance_measure(x: Series, y: Series, thresh_distance: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Distance correlation between x and y.

Parameters:

x (Series) – Quantitative feature
y (Series) – Quantitative target feature
thresh_distance (float, optional) – Minimum distance association, by default 0

Returns:

Whether x is sufficiently associated to y and Distance Correlation

Return type:

tuple[bool, dict[str, Any]]

Note

distance_measure is the default measure for quantitative features in regression tasks (i.e. when RegressionSelector.quantitative_measures=None and RegressionSelector.quantitative_features is provided).
If the association does not reach thresh_distance, the feature is automatically dropped.

Kruskal-Wallis’ \(H\) test statistic

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether \(x\) is distributed the same across the modalities \(y_0\) to \(y_{n_y-1}\) of \(y\), where \(n_y\) is the number of modalities taken by \(y\). It is computed using the following formula:

\[H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }\]

where:

\(n\) is the number of observations

\(n_y\) is the number of modalities of \(y\)

\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality

\(x_r\) is the ranked version of \(x\)

\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{x_r^{ij}}}\) is the sample mean of \(x_r\)
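The formula above can be followed step by step in a short pure-Python sketch. `midranks` and `kruskal_h` are hypothetical helpers written for illustration (the library relies on scipy.stats.kruskal); ties are assumed to share the average of their ranks.

```python
from collections import defaultdict


def midranks(values: list[float]) -> list[float]:
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def kruskal_h(x: list[float], y: list) -> float:
    """H = (n - 1) * between-group sum of squares / total sum of squares, on ranks."""
    n = len(x)
    ranks = midranks(x)
    mean_rank = sum(ranks) / n
    groups = defaultdict(list)
    for rank, modality in zip(ranks, y):
        groups[modality].append(rank)
    between = sum(len(g) * (sum(g) / len(g) - mean_rank) ** 2 for g in groups.values())
    total = sum((rank - mean_rank) ** 2 for rank in ranks)
    return (n - 1) * between / total


x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
h = kruskal_h(x, y)
print(h)  # 5 * 13.5 / 17.5
```

Here the ranks are 1 to 6 with mean 3.5, the group means are 2 and 5, so the numerator is 13.5 and the denominator is 17.5.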

AutoCarver.selectors.measures.kruskal_measure(x: Series, y: Series, thresh_kruskal: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Kruskal-Wallis’ test statistic between x for each value taken by y.

Parameters:

x (Series) – Quantitative feature
y (Series) – Qualitative target feature
thresh_kruskal (float, optional) – Minimum Kruskal-Wallis association, by default 0

Returns:

Whether x is sufficiently associated to y and Kruskal-Wallis’ H test statistic

Return type:

tuple[bool, dict[str, Any]]

Note

kruskal_measure is the default measure for quantitative features in classification tasks (i.e. when ClassificationSelector.quantitative_measures=None and ClassificationSelector.quantitative_features is provided).
kruskal_measure is the default measure for qualitative features in regression tasks (i.e. when RegressionSelector.qualitative_measures=None and RegressionSelector.qualitative_features is provided).
If the association does not reach thresh_kruskal, the feature is automatically dropped.

Coefficient of determination \(R\)

For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:

\[x = \alpha + \beta y + \epsilon\]

where:

\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model

The determination coefficient, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). Its square root, \(R\), is computed with the following formula:

\[R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }\]

where:

\(n\) is the number of observations

\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares

\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares

\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
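The regression and the formula above can be reproduced with the closed-form least-squares solution for a single regressor. The sketch below is an illustration only: `r_coefficient` is a hypothetical helper and does not call statsmodels.

```python
import math


def r_coefficient(x: list[float], y: list[int]) -> float:
    """Square root of the determination coefficient of the regression of x on y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Closed-form ordinary least squares for x = alpha + beta * y + epsilon
    beta = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum((b - mean_y) ** 2 for b in y)
    alpha = mean_x - beta * mean_y
    ss_res = sum((a - alpha - beta * b) ** 2 for a, b in zip(x, y))
    ss_tot = sum((a - mean_x) ** 2 for a in x)
    # max() guards against a tiny negative value caused by rounding
    return math.sqrt(max(0.0, 1 - ss_res / ss_tot))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
r = r_coefficient(x, y)
print(r)
```

On this toy sample the fit gives \(\alpha = 2\) and \(\beta = 3\), so \(SS_{res} = 4\) and \(SS_{tot} = 17.5\); the result matches the absolute value of Pearson's \(r\) between \(x\) and \(y\), as stated above.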

AutoCarver.selectors.measures.R_measure(x: Series, y: Series, thresh_R: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Square root of the coefficient of determination of linear regression model of x by y.

Parameters:

x (Series) – Quantitative feature
y (Series) – Binary target feature
thresh_R (float, optional) – Minimum R association, by default 0

Returns:

Whether x is sufficiently associated to y and the square root of the determination coefficient

Return type:

tuple[bool, dict[str, Any]]

Note

If the association does not reach thresh_R, the feature is automatically dropped.

Outlier Detection: Standard score

Standard score can be applied as a measure of deviation to determine outliers for quantitative features. For a feature \(x\), it is computed for any observation \(x_i\) as follows:

\[z_i = \frac{x_i - \bar{x}}{S}\]

where:

\(n\) is the number of observations

\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)

\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
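To make the definition concrete, the sketch below computes standard scores and the share of observations flagged as outliers. `zscores` and `outlier_share` are hypothetical helpers, and the cutoff of 3 standard deviations is a common convention assumed here; the cutoff actually used by the library is not stated in this section.

```python
import math


def zscores(x: list[float]) -> list[float]:
    """Standard score of each observation, using the sample standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((value - mean) ** 2 for value in x) / (n - 1))
    return [(value - mean) / std for value in x]


def outlier_share(x: list[float], cutoff: float = 3.0) -> float:
    """Fraction of observations whose absolute z-score exceeds the (assumed) cutoff."""
    return sum(abs(z) > cutoff for z in zscores(x)) / len(x)


x = [10.0] * 19 + [100.0]
share = outlier_share(x)
print(share)  # one flagged observation out of twenty
```

A share like this one is the kind of quantity that can then be compared with a maximum such as thresh_zscore.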

AutoCarver.selectors.measures.zscore_measure(x: Series, y: Series | None = None, thresh_zscore: float = 1.0, **kwargs) → tuple[bool, dict[str, Any]]

Computes outliers percentage based on the z-score

Parameters:

x (Series) – Quantitative feature
y (Series, optional) – Any target feature, by default None
thresh_zscore (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0

Returns:

Whether or not there are too many outliers and the outlier measurement

Return type:

tuple[bool, dict[str, Any]]

Note

If thresh_zscore is reached, feature will automatically be dropped.

Outlier Detection: Interquartile range

Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:

\[IQR = Q_3 - Q_1\]

where:

\(Q_1\) is the 25th percentile of \(x\)

\(Q_3\) is the 75th percentile of \(x\)

Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:

\[Q_1 - 1.5 IQR \leq x_i \leq Q_3 + 1.5 IQR\]
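The rule above can be sketched with the standard library. `iqr_outlier_share` is a hypothetical helper; the percentile convention (linear interpolation via `statistics.quantiles` with `method="inclusive"`) is an assumption and may differ from the one used by the library.

```python
import statistics


def iqr_outlier_share(x: list[float]) -> float:
    """Fraction of observations outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(not (low <= value <= high) for value in x) / len(x)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
share = iqr_outlier_share(x)
print(share)  # only 100 falls outside the fences
```

With this convention, \(Q_1 = 3.25\) and \(Q_3 = 7.75\), so the fences are \(-3.5\) and \(14.5\) and a single observation out of ten is flagged.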

AutoCarver.selectors.measures.iqr_measure(x: Series, y: Series | None = None, thresh_iqr: float = 1.0, **kwargs) → tuple[bool, dict[str, Any]]

Computes outliers percentage based on the interquartile range

Parameters:

x (Series) – Quantitative feature
y (Series, optional) – Any target feature, by default None
thresh_iqr (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0

Returns:

Whether or not there are too many outliers and the outlier measurement

Return type:

tuple[bool, dict[str, Any]]

Note

If thresh_iqr is reached, the feature will automatically be dropped.

Qualitative measures

Pearson’s \(\chi^2\) test statistic

For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed based on pandas.crosstab.

Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)

\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)

\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
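The statistic can be computed directly from counts, following the formula term by term. `chi2_statistic` below is a hypothetical pure-Python helper for illustration. It applies no continuity correction, whereas scipy.stats.chi2_contingency applies Yates' correction by default on 2x2 tables, so the two can differ on such tables; whether the library disables that correction is not stated here.

```python
from collections import Counter


def chi2_statistic(x: list, y: list) -> float:
    """Pearson's chi-squared statistic from the observed and expected cell counts."""
    n = len(x)
    observed = Counter(zip(x, y))  # n_ij
    row_totals = Counter(x)        # n_i.
    col_totals = Counter(y)        # n_.j
    chi2 = 0.0
    for i, n_i in row_totals.items():
        for j, n_j in col_totals.items():
            expected = n_i * n_j / n
            chi2 += (observed[(i, j)] - expected) ** 2 / expected
    return chi2


x = ["a"] * 10 + ["b"] * 10
y = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
chi2 = chi2_statistic(x, y)
print(chi2)  # four cells, each contributing (8 - 5)^2 / 5 or (2 - 5)^2 / 5
```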

AutoCarver.selectors.measures.chi2_measure(x: Series, y: Series, thresh_chi2: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.

Parameters:

x (Series) – Qualitative feature
y (Series) – Qualitative target feature
thresh_chi2 (float, optional) – Minimum Chi2 association, by default 0

Returns:

Whether x is sufficiently associated to y and Pearson’s chi2 between x and y

Return type:

tuple[bool, dict[str, Any]]

Note

If the association does not reach thresh_chi2, the feature is automatically dropped.

Cramér’s \(V\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
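As a worked example of the formula, the sketch below computes \(V\) from an already known \(\chi^2\). `cramers_v` is a hypothetical helper, and the input values are made up for illustration.

```python
import math


def cramers_v(chi2: float, n: int, n_x: int, n_y: int) -> float:
    """Cramér's V from Pearson's chi-squared statistic and the table dimensions."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# Illustrative values: chi2 = 7.2 on a 2x2 table built from 20 observations
v = cramers_v(chi2=7.2, n=20, n_x=2, n_y=2)
print(v)  # sqrt(7.2 / 20) = sqrt(0.36)
```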

AutoCarver.selectors.measures.cramerv_measure(x: Series, y: Series, thresh_cramerv: float = 0, chi2_statistic: float | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Computes Cramér’s V between x and y from chi2_measure.

Parameters:

x (Series) – Qualitative feature
y (Series) – Qualitative target feature
thresh_cramerv (float, optional) – Minimum Cramér’s V association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None

Returns:

Whether x is sufficiently associated to y and Cramér’s V between x and y.

Return type:

tuple[bool, dict[str, Any]]

Note

If the association does not reach thresh_cramerv, the feature is automatically dropped.

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
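The only difference with Cramér's \(V\) is the normalisation: a geometric mean of \(n_x-1\) and \(n_y-1\) instead of their minimum. The sketch below, with hypothetical helpers and made-up input values, shows that the two coincide on square tables and that \(T\) is smaller otherwise.

```python
import math


def tschuprows_t(chi2: float, n: int, n_x: int, n_y: int) -> float:
    """Tschuprow's T from Pearson's chi-squared statistic and the table dimensions."""
    return math.sqrt(chi2 / (n * math.sqrt((n_x - 1) * (n_y - 1))))


def cramers_v(chi2: float, n: int, n_x: int, n_y: int) -> float:
    """Cramér's V, included only for comparison."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# Illustrative values: chi2 = 12 over 100 observations
t_rect, v_rect = tschuprows_t(12.0, 100, 3, 2), cramers_v(12.0, 100, 3, 2)      # 3x2 table
t_square, v_square = tschuprows_t(12.0, 100, 3, 3), cramers_v(12.0, 100, 3, 3)  # 3x3 table
print(t_rect, v_rect)
print(t_square, v_square)
```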

AutoCarver.selectors.measures.tschuprowt_measure(x: Series, y: Series, thresh_tschuprowt: float = 0, chi2_statistic: float | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Computes Tschuprow’s T between x and y from chi2_measure.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_tschuprowt (float, optional) – Minimum Tschuprow’s T association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None

Returns:

Whether x is sufficiently associated to y and Tschuprow’s T between x and y.

Return type:

tuple[bool, dict[str, Any]]

Note

tschuprowt_measure is the default measure for qualitative features in classification tasks (i.e. when ClassificationSelector.qualitative_measures=None and ClassificationSelector.qualitative_features is provided).
If the association does not reach thresh_tschuprowt, the feature is automatically dropped.

Base data information

Note

These measures are performed by default and don’t need to be added to the Selectors’ attributes.

Missing values

AutoCarver.selectors.measures.nans_measure(x: Series, y: Series | None = None, thresh_nan: float = 0.999, **kwargs) → tuple[bool, dict[str, Any]]

Measure of the percentage of NaNs

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_nan (float, optional) – Maximum percentage of NaNs in a feature, by default 0.999

Returns:

Whether or not there are too many NaNs and the percentage of NaNs

Return type:

tuple[bool, dict[str, Any]]

Note

nans_measure is evaluated by default in all Selectors.
If thresh_nan is reached, feature will automatically be dropped.

Data types

AutoCarver.selectors.measures.dtype_measure(x: Series, y: Series | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Feature’s dtype

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None

Returns:

True and the feature’s dtype

Return type:

tuple[bool, dict[str, Any]]

Note

dtype_measure is evaluated by default in all Selectors.

Mode

AutoCarver.selectors.measures.mode_measure(x: Series, y: Series | None = None, thresh_mode: float = 0.999, **kwargs) → tuple[bool, dict[str, Any]]

Measure of the percentage of the Mode

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_mode (float, optional) – Maximum percentage of a feature’s mode, by default 0.999

Returns:

Whether or not the mode is overrepresented and the percentage of mode

Return type:

tuple[bool, dict[str, Any]]

Note

mode_measure is evaluated by default in all Selectors.
If thresh_mode is reached, feature will automatically be dropped.
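The missing-value and mode checks of this section both reduce to a share compared with a maximum. The sketch below illustrates them with hypothetical helpers (`is_missing`, `nan_share`, `mode_share`); computing the mode share over non-missing values only and using a strict comparison against the threshold are assumptions, not documented behaviour.

```python
import math
from collections import Counter


def is_missing(value) -> bool:
    """True for None and for float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def nan_share(x: list) -> float:
    """Fraction of missing observations."""
    return sum(is_missing(value) for value in x) / len(x)


def mode_share(x: list) -> float:
    """Fraction of non-missing observations equal to the most frequent value."""
    observed = [value for value in x if not is_missing(value)]
    most_common_count = Counter(observed).most_common(1)[0][1]
    return most_common_count / len(observed)


x = ["a", "a", "a", None, "b"]
too_many_nans = nan_share(x) > 0.999       # default thresh_nan
mode_too_frequent = mode_share(x) > 0.999  # default thresh_mode
print(nan_share(x), mode_share(x), too_many_nans, mode_too_frequent)
```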

Association filters, X by X

Quantitative filters

Pearson’s \(r\)

For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}\]

where:

\(n\) is the number of observations

\(x_1^i\) is the \(i\) th observation of \(x_1\)

\(x_2^i\) is the \(i\) th observation of \(x_2\)

\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)

\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)

AutoCarver.selectors.filters.pearson_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) → dict[str, Any]

Computes maximum Pearson’s r between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Pearson’s r between features, by default 1

Returns:

Maximum Pearson’s r with a better feature

Return type:

dict[str, Any]

Note

If thresh_corr is reached, feature will automatically be dropped.

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{1_{r}}x_{2_{r}}}\]

where:

\(x_{1_{r}}\) is the ranked version of \(x_1\)

\(x_{2_{r}}\) is the ranked version of \(x_2\)

\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)
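The definition above (Pearson's \(r\) applied to ranks) can be sketched in a few lines. `ranks`, `pearson_r` and `spearman_rho` are hypothetical helpers written for illustration; this simple ranking assumes there are no tied values.

```python
import math


def ranks(values: list[float]) -> list[int]:
    """Rank of each value in the sorted sample, starting at 1 (no tie handling)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson_r(a: list[float], b: list[float]) -> float:
    """Pearson's linear correlation coefficient."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    return cov / math.sqrt(sum((u - mean_a) ** 2 for u in a) * sum((v - mean_b) ** 2 for v in b))


def spearman_rho(a: list[float], b: list[float]) -> float:
    """Spearman's rho: Pearson's r between the ranked versions of a and b."""
    return pearson_r(ranks(a), ranks(b))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 4.0, 9.0, 16.0, 25.0]  # monotonic but non-linear in x1
r, rho = pearson_r(x1, x2), spearman_rho(x1, x2)
print(r, rho)
```

On this sample \(\rho\) equals 1 while \(r\) stays below 1, which shows that \(\rho\) captures monotonic relations that are not linear.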

AutoCarver.selectors.filters.spearman_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) → dict[str, Any]

Computes maximum Spearman’s rho between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Spearman’s rho between features, by default 1

Returns:

Maximum Spearman’s rho with a better feature

Return type:

dict[str, Any]

Note

spearman_filter is the default filter for quantitative features (i.e. when quantitative_filters=None and quantitative_features is provided).
If thresh_corr is reached, feature will automatically be dropped.

Qualitative filters

Cramér’s \(V\)

For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on pandas.crosstab.

Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)

\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)

\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

AutoCarver.selectors.filters.cramerv_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) → dict[str, Any]

Computes maximum Cramér’s V between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Cramér’s V between features, by default 1

Returns:

Maximum Cramér’s V with a better feature

Return type:

dict[str, Any]

Note

If thresh_corr is reached, feature will automatically be dropped.

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

AutoCarver.selectors.filters.tschuprowt_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) → dict[str, Any]

Computes maximum Tschuprow’s T between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Tschuprow’s T between features, by default 1

Returns:

Maximum Tschuprow’s T with a better feature

Return type:

dict[str, Any]

Note

tschuprowt_filter is the default filter for qualitative features (i.e. when qualitative_filters=None and qualitative_features is provided).
If thresh_corr is reached, feature will automatically be dropped.