FeatureSelector

AutoCarver implements FeatureSelector, an association-centric feature selection tool. It consists of the following Data Selection steps:

Measuring association with a binary target and ranking features accordingly.

Filtering out features too associated with a better-ranked feature.
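The two steps above can be sketched in plain Python. This is only an illustration, not AutoCarver's implementation: the helper name rank_then_filter, the association scores and the pairwise correlation table are all hypothetical, and comparing a candidate only with the better-ranked features that were kept is an assumption of this sketch.

```python
def rank_then_filter(association, correlation, n_best, thresh_corr):
    """Hypothetical helper: rank features by association with the target,
    then drop any feature too correlated with a better-ranked, kept feature."""
    # Step 1: rank features from most to least associated with the target
    ranked = sorted(association, key=association.get, reverse=True)

    kept = []
    for feature in ranked:
        # Step 2: compare the candidate with every better-ranked feature already kept
        too_correlated = any(
            abs(correlation[frozenset((feature, better))]) > thresh_corr
            for better in kept
        )
        if not too_correlated:
            kept.append(feature)
    return kept[:n_best]


# Made-up association with the target (for instance Kruskal-Wallis' H)
association = {"age": 12.4, "income": 9.1, "salary": 8.7, "height": 0.3}
# Made-up pairwise correlations (for instance Spearman's rho), keyed by unordered pair
correlation = {
    frozenset(("age", "income")): 0.35,
    frozenset(("age", "salary")): 0.30,
    frozenset(("age", "height")): 0.05,
    frozenset(("income", "salary")): 0.95,
    frozenset(("income", "height")): 0.02,
    frozenset(("salary", "height")): 0.01,
}

selected = rank_then_filter(association, correlation, n_best=3, thresh_corr=0.8)
print(selected)
```

Here salary is dropped because it is too correlated with the better-ranked income, which leaves room for height.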

FeatureSelector allows one to select features based on their type: quantitative or qualitative.

By default, quantitative features are:

Ranked according to Kruskal-Wallis’ \(H\) test statistic

Filtered according to Spearman’s \(\rho\) correlation coefficient

By default, qualitative features are:

Ranked according to Tschuprow’s \(T\)

Filtered according to Tschuprow’s \(T\)

In general, associations are computed according to the provided data types of \(x\) and \(y\):

\(x\) \ \(y\)	Qualitative	Quantitative
Qualitative	Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\)	Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative	Kruskal-Wallis’ \(H\), \(R\) coefficient	Pearson’s \(r\), Spearman’s \(\rho\)

See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.

Note

Additional measure- or filter-specific parameters can be passed as keyword arguments.

FeatureSelector, an association-centric tool for feature pre-selection

class AutoCarver.feature_selection.FeatureSelector(n_best: int, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, measures: list[Callable] | None = None, filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, pretty_print: bool = False, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a binary target.

Best features are the n_best of each measure
Get your best features with FeatureSelector.select()!

Initiates a FeatureSelector.

Parameters:

n_best (int) – Number of features to select.
quantitative_features (list[str], optional) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str], optional) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
measures (list[Callable], optional) –
List of association measures to be used, by default None. Ranks features based on last provided measure of the list. See Association measures, X by y. Implemented measures are:
- [Quantitative Features] For association evaluation: kruskal_measure (default), R_measure
- [Quantitative Features] For outlier detection: zscore_measure, iqr_measure
- [Qualitative Features] For association evaluation: chi2_measure, cramerv_measure, tschuprowt_measure (default)
filters (list[Callable], optional) –
List of filters to be used, by default None. See Association filters, X by X. Implemented filters are:
- [Quantitative Features] For linear correlation: spearman_filter (default), pearson_filter
- [Qualitative Features] For correlation: cramerv_filter, tschuprowt_filter (default)
colsample (float, optional) –
Size of the sampled list of features, used to speed up computation, between 0 and 1, by default 1.0 (all features are used).

For colsample=0.5, FeatureSelector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:].

Tip: for better performance, colsample should be set such that len(features)//2 < 200.
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, as well as information on AutoCarver’s processing and tables of target rates and frequencies for X, by default False
pretty_print (bool, optional) – If True, adds to the verbose some HTML tables of target rates and frequencies for X and, if provided, X_dev. Overrides the value of verbose, by default False
**kwargs – Sets thresholds for measures and filters, passed as keyword arguments.
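The colsample behaviour described above can be pictured as splitting the feature list into consecutive chunks that are searched one after the other. The sketch below generalises the colsample=0.5 case to any value, which is an assumption; chunk_features is a hypothetical helper and not part of AutoCarver.

```python
def chunk_features(features, colsample):
    """Hypothetical helper: split features into consecutive chunks
    holding roughly colsample * len(features) names each."""
    if not 0 < colsample <= 1:
        raise ValueError("colsample must be in (0, 1]")
    # Chunk size, with at least one feature per chunk
    size = max(1, int(len(features) * colsample))
    return [features[i:i + size] for i in range(0, len(features), size)]


features = ["f1", "f2", "f3", "f4", "f5", "f6"]
print(chunk_features(features, 1.0))  # a single chunk holding every feature
print(chunk_features(features, 0.5))  # features[:3], then features[3:]
```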

Examples

See FeatureSelector examples

select(X: DataFrame, y: Series) → list[str]

Selects the n_best features of the DataFrame, by association with the binary target

Parameters:

X (DataFrame) – Dataset used to measure association between features and target. Needs to have columns as specified in FeatureSelector.features.
y (Series) – Binary target feature with which the association is maximized.

Returns:

List of selected features

Return type:

list[str]

Association measures, X by y

Quantitative measures

Kruskal-Wallis’ \(H\) test statistic

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

The association with a binary target \(y\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two samples originate from the same distribution. It is used to determine whether or not \(x\) is distributed the same when \(y=1\) compared to when \(y=0\). It is computed using the following formula:

\[H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }\]

where:

\(n\) is the number of observations

\(n_y\) is the number of modalities of \(y\)

\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality

\(x_r\) is the ranked version of \(x\)

\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}{x_r^{ij}}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{x_r^{ij}}}\) is the sample mean of \(x_r\)
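As a worked illustration of the formula above, the following dependency-free sketch computes \(H\) directly from ranks. It is not the library's implementation (which, as stated above, relies on scipy.stats.kruskal); the helper names average_ranks and kruskal_h are hypothetical, and giving tied values the average of their ranks is an assumption of this sketch.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of ranks start+1 .. end+1
        for k in order[start:end + 1]:
            ranks[k] = shared
        start = end + 1
    return ranks


def kruskal_h(x, y):
    """Kruskal-Wallis' H of x across the modalities of y, per the formula above."""
    n = len(x)
    x_r = average_ranks(x)
    mean_rank = sum(x_r) / n
    between = 0.0
    for modality in set(y):
        group = [r for r, target in zip(x_r, y) if target == modality]
        group_mean = sum(group) / len(group)
        between += len(group) * (group_mean - mean_rank) ** 2
    total = sum((r - mean_rank) ** 2 for r in x_r)
    return (n - 1) * between / total


x = [1.2, 2.5, 3.1, 4.8, 5.0, 6.3]
y = [0, 0, 0, 1, 1, 1]
h = kruskal_h(x, y)
print(round(h, 4))
```

With these values the ranks are 1 to 6, the group means are 2 and 5, and \(H = 5 \times 13.5 / 17.5 \approx 3.857\).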

AutoCarver.feature_selection.measures.kruskal_measure(x: Series, y: Series, thresh_kruskal: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Kruskal-Wallis’ test statistic between x when y==1 and x when y==0.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_kruskal (float, optional) – Minimum Kruskal-Wallis association, by default 0

Returns:

Whether x is sufficiently associated to y and Kruskal-Wallis’ test statistic

Return type:

tuple[bool, dict[str, Any]]

Note

kruskal_measure is the default measure for quantitative features (i.e. when FeatureSelector.measures=[] and FeatureSelector.quantitative_features is provided).

Coefficient of determination \(R\)

For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:

\[x = \alpha + \beta y + \epsilon\]

where:

\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model

The coefficient of determination, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). Its square root \(R\) is computed with the following formula:

\[R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }\]

where:

\(n\) is the number of observations

\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares

\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares

\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
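The formula above can be checked by hand with a small dependency-free sketch. It fits the single-regressor least squares model in closed form instead of calling statsmodels, so it only illustrates the definition; r_coefficient is a hypothetical helper name.

```python
import math


def r_coefficient(x, y):
    """Square root of the coefficient of determination of x = alpha + beta * y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form ordinary least squares estimates for a single regressor
    beta = sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y)) / sum(
        (yi - mean_y) ** 2 for yi in y
    )
    alpha = mean_x - beta * mean_y
    ss_res = sum((xi - alpha - beta * yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((xi - mean_x) ** 2 for xi in x)
    return math.sqrt(1 - ss_res / ss_tot)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
r_value = r_coefficient(x, y)
print(round(r_value, 4))
```

With a binary \(y\), the fitted values are simply the mean of \(x\) in each group (2 and 5 here), so \(SS_{res} = 4\), \(SS_{tot} = 17.5\) and \(R = \sqrt{1 - 4/17.5} \approx 0.878\).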

AutoCarver.feature_selection.measures.R_measure(x: Series, y: Series, thresh_R: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Square root of the coefficient of determination of linear regression model of x by y.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_R (float, optional) – Minimum R association, by default 0

Returns:

Whether x is sufficiently associated to y and the square root of the determination coefficient

Return type:

tuple[bool, dict[str, Any]]

Quantitative Outlier Detection

Standard score

The standard score can be used as a measure of deviation to detect outliers. For a feature \(x\) it is computed for any observation \(x_i\) as follows:

\[z_i = \frac{x_i - \bar{x}}{S}\]

where:

\(n\) is the number of observations

\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)

\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
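A minimal sketch of an outlier share based on this standard score follows. The cutoff of 3 on \(|z_i|\) is an assumption of this example (the text above does not state which cutoff is used), and zscore_outlier_share is a hypothetical helper name.

```python
import statistics


def zscore_outlier_share(x, z_cutoff=3.0):
    """Share of observations whose absolute z-score exceeds z_cutoff.

    z_cutoff=3.0 is an assumption of this sketch, not a documented default.
    """
    mean_x = statistics.mean(x)
    s = statistics.stdev(x)  # sample standard deviation (n - 1 denominator)
    z = [(xi - mean_x) / s for xi in x]
    return sum(abs(zi) > z_cutoff for zi in z) / len(x)


x = list(range(1, 20)) + [1000]  # nineteen ordinary values and one extreme value
share = zscore_outlier_share(x)
print(share)
```

Only the value 1000 has a z-score above the cutoff, so one observation out of twenty is flagged.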

AutoCarver.feature_selection.measures.zscore_measure(x: Series, y: Series | None = None, thresh_outlier: float = 1.0, **kwargs) → tuple[bool, dict[str, Any]]

Computes outliers percentage based on the z-score

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_outlier (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0

Returns:

Whether or not there are too many outliers and the outlier measurement

Return type:

tuple[bool, dict[str, Any]]

Interquartile range

Interquartile range is widely used as an outlier detection metric. For a feature \(x\) it is computed as follows:

\[IQR = Q_3 - Q_1\]

where:

\(Q_1\) is the 25th percentile of \(x\)

\(Q_3\) is the 75th percentile of \(x\)

Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not satisfy:

\[Q_1 - 1.5 IQR \leq x_i \leq Q_3 + 1.5 IQR\]
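The rule above can be sketched with the standard library. Quartile estimates depend on the interpolation method, so the "inclusive" method used here is a choice of this example rather than a statement about the library; iqr_outlier_share is a hypothetical helper name.

```python
import statistics


def iqr_outlier_share(x):
    """Share of observations outside [Q1 - 1.5 IQR, Q3 + 1.5 IQR]."""
    # Quartiles with linear interpolation between observed values
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(not (lower <= xi <= upper) for xi in x) / len(x)


x = list(range(1, 20)) + [1000]  # nineteen ordinary values and one extreme value
share = iqr_outlier_share(x)
print(share)
```

Here \(Q_1 = 5.75\) and \(Q_3 = 15.25\), so the accepted interval is \([-8.5, 29.5]\) and only 1000 falls outside it.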

AutoCarver.feature_selection.measures.iqr_measure(x: Series, y: Series | None = None, thresh_outlier: float = 1.0, **kwargs) → tuple[bool, dict[str, Any]]

Computes outliers percentage based on the interquartile range

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_outlier (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0

Returns:

Whether or not there are too many outliers and the outlier measurement

Return type:

tuple[bool, dict[str, Any]]

Qualitative measures

Pearson’s \(\chi^2\) test statistic

For a qualitative feature \(x\), the association with a qualitative binary target \(y\) is computed based on the pandas.crosstab.

Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)

\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)

\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
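To make the notation concrete, here is a dependency-free sketch that builds the contingency counts and applies the formula above term by term. It does not call scipy and applies no continuity correction; chi2_statistic is a hypothetical helper name.

```python
from collections import Counter


def chi2_statistic(x, y):
    """Pearson's chi2 between two qualitative sequences, without correction."""
    n = len(x)
    observed = Counter(zip(x, y))  # n_ij
    row_totals = Counter(x)        # n_i.
    col_totals = Counter(y)        # n_.j
    chi2 = 0.0
    for xi in row_totals:
        for yj in col_totals:
            expected = row_totals[xi] * col_totals[yj] / n
            chi2 += (observed[(xi, yj)] - expected) ** 2 / expected
    return chi2


x = ["a"] * 10 + ["b"] * 10
y = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
chi2 = chi2_statistic(x, y)
print(round(chi2, 4))
```

Every expected count is 5 and every observed count differs from it by 3, so \(\chi^2 = 4 \times 9 / 5 = 7.2\). Note that scipy.stats.chi2_contingency applies Yates' continuity correction by default on 2 by 2 tables, so its statistic differs from this uncorrected value unless correction=False is passed.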

AutoCarver.feature_selection.measures.chi2_measure(x: Series, y: Series, thresh_chi2: float = 0, **kwargs) → tuple[bool, dict[str, Any]]

Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_chi2 (float, optional) – Minimum Chi2 association, by default 0

Returns:

Whether x is sufficiently associated to y and Pearson’s chi2 between x and y.

Return type:

tuple[bool, dict[str, Any]]

Cramér’s \(V\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
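As a small worked example of this formula, the sketch below takes an already computed \(\chi^2\) and the table dimensions. The helper name cramers_v and the numbers are illustrative only.

```python
import math


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V from Pearson's chi2, the sample size and the table dimensions."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# A 2 by 2 table with 20 observations and chi2 = 7.2
v = cramers_v(chi2=7.2, n=20, n_x=2, n_y=2)
print(round(v, 4))
```

For this table \(V = \sqrt{7.2 / 20} = 0.6\).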

AutoCarver.feature_selection.measures.cramerv_measure(x: Series, y: Series, thresh_cramerv: float = 0, chi2_statistic: float | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Computes Cramér’s V between x and y from chi2_measure.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_cramerv (float, optional) – Minimum Cramér’s V association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None

Returns:

Whether x is sufficiently associated to y and Cramér’s V between x and y.

Return type:

tuple[bool, dict[str, Any]]

Note

cramerv_measure is the default measure for qualitative features (i.e. when FeatureSelector.measures=[] and FeatureSelector.qualitative_features is provided).

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
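The sketch below applies this formula next to Cramér's \(V\) to show how the two normalisations differ on a non-square table. Both helper names and the input numbers are illustrative assumptions.

```python
import math


def tschuprows_t(chi2, n, n_x, n_y):
    """Tschuprow's T from Pearson's chi2, the sample size and the table dimensions."""
    return math.sqrt(chi2 / (n * math.sqrt((n_x - 1) * (n_y - 1))))


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V, shown only for comparison."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# A 3 by 2 table with 50 observations and chi2 = 9.0 (made-up values)
t = tschuprows_t(chi2=9.0, n=50, n_x=3, n_y=2)
v = cramers_v(chi2=9.0, n=50, n_x=3, n_y=2)
print(round(t, 4), round(v, 4))
```

Because \(\sqrt{(n_x-1)(n_y-1)} \geq \min(n_x-1, n_y-1)\), \(T\) never exceeds \(V\), and the two coincide when \(n_x = n_y\).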

AutoCarver.feature_selection.measures.tschuprowt_measure(x: Series, y: Series, thresh_tschuprowt: float = 0, chi2_statistic: float | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Computes Tschuprow’s T between x and y from chi2_measure.

Parameters:

x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_tschuprowt (float, optional) – Minimum Tschuprow’s T association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None

Returns:

Whether x is sufficiently associated to y and Tschuprow’s T between x and y.

Return type:

tuple[bool, dict[str, Any]]

Base data information

AutoCarver.feature_selection.measures.nans_measure(x: Series, y: Series | None = None, thresh_nan: float = 0.999, **kwargs) → tuple[bool, dict[str, Any]]

Measure of the percentage of NaNs

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_nan (float, optional) – Maximum percentage of NaNs in a feature, by default 0.999

Returns:

Whether or not there are too many NaNs and the percentage of NaNs

Return type:

tuple[bool, dict[str, Any]]

Note

nans_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.
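The check described above amounts to comparing a share of missing values with a threshold. The following sketch is a hypothetical illustration on plain Python floats; the helper names and the strict comparison against thresh_nan are assumptions, not the library's code.

```python
import math


def nan_share(x):
    """Share of NaN values in a sequence of numbers."""
    return sum(isinstance(v, float) and math.isnan(v) for v in x) / len(x)


def passes_nan_check(x, thresh_nan=0.999):
    """Hypothetical check: True when the share of NaNs stays below thresh_nan."""
    return nan_share(x) < thresh_nan


x = [1.0, float("nan"), 3.0, float("nan")]
print(nan_share(x), passes_nan_check(x))
print(passes_nan_check([float("nan")] * 5))  # entirely missing feature
```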

AutoCarver.feature_selection.measures.dtype_measure(x: Series, y: Series | None = None, **kwargs) → tuple[bool, dict[str, Any]]

Feature’s dtype

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None

Returns:

True and the feature’s dtype

Return type:

tuple[bool, dict[str, Any]]

Note

dtype_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.

AutoCarver.feature_selection.measures.mode_measure(x: Series, y: Series | None = None, thresh_mode: float = 0.999, **kwargs) → tuple[bool, dict[str, Any]]

Measure of the percentage of the Mode

Parameters:

x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_mode (float, optional) – Maximum percentage of a feature’s mode, by default 0.999

Returns:

Whether or not the mode is overrepresented and the percentage of mode

Return type:

tuple[bool, dict[str, Any]]

Note

mode_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.
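Similarly, the mode check compares the frequency of the most common value with a threshold. This is a hypothetical sketch: the helper names and the strict comparison against thresh_mode are assumptions.

```python
from collections import Counter


def mode_share(x):
    """Share of observations taking the most frequent value."""
    _, count = Counter(x).most_common(1)[0]
    return count / len(x)


def passes_mode_check(x, thresh_mode=0.999):
    """Hypothetical check: True when the mode is not overrepresented."""
    return mode_share(x) < thresh_mode


x = ["a", "a", "a", "b"]
print(mode_share(x), passes_mode_check(x))
print(passes_mode_check(["a"] * 10))  # constant feature
```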

Association filters, X by X

Quantitative filters

Pearson’s \(r\)

For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}\]

where:

\(n\) is the number of observations

\(x_1^i\) is the \(i\) th observation of \(x_1\)

\(x_2^i\) is the \(i\) th observation of \(x_2\)

\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)

\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)
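The formula above translates directly into a few lines of plain Python. This sketch only illustrates the definition (the filter itself relies on pandas.DataFrame.corr, as stated above); pearson_r is a hypothetical helper name.

```python
import math


def pearson_r(x1, x2):
    """Pearson's r between two equally long numeric sequences."""
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    spread_1 = math.sqrt(sum((a - mean_1) ** 2 for a in x1))
    spread_2 = math.sqrt(sum((b - mean_2) ** 2 for b in x2))
    return covariation / (spread_1 * spread_2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
print(pearson_r(x1, [2 * a + 1 for a in x1]))         # increasing linear relation
print(pearson_r(x1, [-a for a in x1]))                # decreasing linear relation
print(round(pearson_r(x1, [a ** 3 for a in x1]), 4))  # monotone but not linear
```

The first two calls give values of 1 and -1 up to floating point rounding, while the cubic relation gives a value strictly below 1 because it is not linear.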

AutoCarver.feature_selection.filters.pearson_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) → dict[str, Any]

Computes maximum Pearson’s r between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Pearson’s r between features, by default 1

Returns:

Maximum Pearson’s r with a better feature

Return type:

dict[str, Any]

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{1_{r}}x_{2_{r}}}\]

where:

\(x_{1_{r}}\) is the ranked version of \(x_1\)

\(x_{2_{r}}\) is the ranked version of \(x_2\)

\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)
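The definition above can be illustrated by ranking both features and reusing Pearson's formula. The sketch is dependency-free and, for brevity, assumes there are no tied values; the helper names ranks, pearson_r and spearman_rho are hypothetical.

```python
import math


def ranks(values):
    """Rank of each value (1 = smallest); assumes no ties for brevity."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def pearson_r(x1, x2):
    """Pearson's r between two equally long numeric sequences."""
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    spread_1 = math.sqrt(sum((a - mean_1) ** 2 for a in x1))
    spread_2 = math.sqrt(sum((b - mean_2) ** 2 for b in x2))
    return covariation / (spread_1 * spread_2)


def spearman_rho(x1, x2):
    """Spearman's rho: Pearson's r computed on the ranked features."""
    return pearson_r(ranks(x1), ranks(x2))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [a ** 3 for a in x1]  # monotone but not linear in x1
print(round(pearson_r(x1, x2), 4), spearman_rho(x1, x2))
```

Because the relation is monotone, both features have identical ranks and \(\rho\) reaches 1 even though Pearson's \(r\) stays below 1.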

AutoCarver.feature_selection.filters.spearman_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) → dict[str, Any]

Computes maximum Spearman’s rho between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Spearman’s rho between features, by default 1

Returns:

Maximum Spearman’s rho with a better feature

Return type:

dict[str, Any]

Note

spearman_filter is the default filter for quantitative features (i.e. when FeatureSelector.filters=[] and FeatureSelector.quantitative_features is provided).

Qualitative filters

Cramér’s \(V\)

For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on the pandas.crosstab.

Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)

\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)

\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

AutoCarver.feature_selection.filters.cramerv_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) → dict[str, Any]

Computes maximum Cramér’s V between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Cramér’s V between features, by default 1

Returns:

Maximum Cramér’s V with a better feature

Return type:

dict[str, Any]

Note

cramerv_filter is the default filter for qualitative features (i.e. when FeatureSelector.filters=[] and FeatureSelector.qualitative_features is provided).

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

AutoCarver.feature_selection.filters.tschuprowt_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) → dict[str, Any]

Computes maximum Tschuprow’s T between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).

Parameters:

X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Tschuprow’s T between features, by default 1

Returns:

Maximum Tschuprow’s T with a better feature

Return type:

dict[str, Any]