FeatureSelector
AutoCarver implements FeatureSelector, an association-centric feature selection tool.
It consists of the following Data Selection steps:
Measuring association with a binary target and ranking features accordingly.
Filtering out features too associated with a better-ranked feature.
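The two steps above can be sketched as a greedy rank-then-filter loop. This is only an illustration under stated assumptions, not FeatureSelector's actual code: the function `select_features`, its arguments, and the toy scores are hypothetical, and it assumes a feature is compared only against better-ranked features that were themselves kept.

```python
def select_features(scores, pairwise, n_best, max_assoc):
    """Greedy rank-then-filter selection (illustrative, hypothetical helper).

    scores:   dict feature -> association with the target (higher is better)
    pairwise: dict frozenset({a, b}) -> association between features a and b
    """
    # Step 1: rank features by decreasing association with the target
    ranked = sorted(scores, key=scores.get, reverse=True)

    kept = []
    for feature in ranked:
        # Step 2: drop a feature too associated with a better-ranked, kept feature
        too_close = any(
            pairwise.get(frozenset((feature, better)), 0.0) > max_assoc
            for better in kept
        )
        if not too_close:
            kept.append(feature)
    return kept[:n_best]


# Toy association values, made up for the example
scores = {"age": 0.40, "income": 0.35, "salary": 0.30, "height": 0.05}
pairwise = {frozenset(("income", "salary")): 0.95}

selected = select_features(scores, pairwise, n_best=3, max_assoc=0.8)
print(selected)  # ['age', 'income', 'height']
```

Here `salary` is dropped because it is too associated with the better-ranked `income`, which frees a slot for `height`.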
FeatureSelector allows one to select features based on their type: quantitative or qualitative.
By default, quantitative features are:
Ranked according to Kruskal-Wallis’ \(H\) test statistic
Filtered according to Spearman’s \(\rho\) correlation coefficient
By default, qualitative features are:
Ranked according to Tschuprow’s \(T\)
Filtered according to Tschuprow’s \(T\)
In general, associations are computed according to the provided data types of \(x\) and \(y\):
\(x\) \ \(y\) | Qualitative | Quantitative |
Qualitative | Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\) | Kruskal-Wallis’ \(H\), \(R\) coefficient |
Quantitative | Kruskal-Wallis’ \(H\), \(R\) coefficient | Pearson’s \(r\), Spearman’s \(\rho\) |
See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.
Note
Additional measure- or filter-specific parameters can be passed as keyword arguments.
FeatureSelector, an association centric tool for feature pre-selection
- class AutoCarver.feature_selection.FeatureSelector(n_best: int, *, quantitative_features: list[str] | None = None, qualitative_features: list[str] | None = None, measures: list[Callable] | None = None, filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, pretty_print: bool = False, **kwargs)
A pipeline of measures to perform a feature pre-selection that maximizes association with a binary target.
Best features are the n_best of each measure.
Get your best features with FeatureSelector.select()!
Initiates a FeatureSelector.
- Parameters:
n_best (int) – Number of features to select.
quantitative_features (list[str], optional) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str], optional) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
measures (list[Callable], optional) – List of association measures to be used, by default None. Ranks features based on the last provided measure of the list. See Association measures, X by y. Implemented measures are:
[Quantitative Features] For association evaluation: kruskal_measure (default), R_measure
[Quantitative Features] For outlier detection: zscore_measure, iqr_measure
[Qualitative Features] For association evaluation: chi2_measure, cramerv_measure, tschuprowt_measure (default)
filters (list[Callable], optional) – List of filters to be used, by default None. See Association filters, X by X. Implemented filters are:
[Quantitative Features] For linear correlation: spearman_filter (default), pearson_filter
[Qualitative Features] For correlation: cramerv_filter, tschuprowt_filter (default)
colsample (float, optional) – Size of the sampled list of features for sped-up computation, between 0 and 1, by default 1.0. By default, all features are used. For colsample=0.5, FeatureSelector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:]. Tip: for better performance, it should be set such that len(features)//2 < 200.
verbose (bool, optional) – If True, prints raw Discretizers Fit and Transform steps, as well as information on AutoCarver’s processing and tables of target rates and frequencies for X, by default False
pretty_print (bool, optional) – If True, adds to the verbose some HTML tables of target rates and frequencies for X and, if provided, X_dev. Overrides the value of verbose, by default False
**kwargs – Sets thresholds for measures and filters, passed as keyword arguments.
Examples
- select(X: DataFrame, y: Series) list[str]
Selects the n_best features of the DataFrame, by association with the binary target.
- Parameters:
X (DataFrame) – Dataset used to measure association between features and target. Needs to have columns as specified in FeatureSelector.features.
y (Series) – Binary target feature with which the association is maximized.
- Returns:
List of selected features
- Return type:
list[str]
Association measures, X by y
Quantitative measures
Kruskal-Wallis’ \(H\) test statistic
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
The association with a binary target \(y\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two samples originate from the same distribution. It is used to determine whether or not \(x\) is distributed the same when \(y=1\) compared to when \(y=0\). It is computed using the following formula:
\[H = (n-1)\frac{\sum_{i=1}^{n_y}{n_{y=i}\left(\bar{x_r^{i.}} - \bar{x_r}\right)^2}}{\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{\left(x_r^{ij} - \bar{x_r}\right)^2}}}\]
where:
\(n\) is the number of observations
\(n_y\) is the number of modalities of \(y\)
\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality
\(x_r\) is the ranked version of \(x\)
\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}{x_r^{ij}}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{x_r^{ij}}}\) is the sample mean of \(x_r\)
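As a rough illustration of the quantities defined above, the statistic can be computed by hand as a ratio of between-group to total rank variation. This is a standard-library sketch, not the library's code (which relies on scipy.stats.kruskal). The helpers `average_ranks` and `kruskal_h` are hypothetical names, and giving tied values the average of their ranks is an assumption of this sketch.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def kruskal_h(x, y):
    """H as (n - 1) * between-group / total sum of squares of the ranks."""
    n = len(x)
    x_r = average_ranks(x)
    mean_rank = sum(x_r) / n
    between = 0.0
    for modality in set(y):
        group = [r for r, label in zip(x_r, y) if label == modality]
        group_mean = sum(group) / len(group)
        between += len(group) * (group_mean - mean_rank) ** 2
    total = sum((r - mean_rank) ** 2 for r in x_r)
    return (n - 1) * between / total


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
h = kruskal_h(x, y)
print(round(h, 4))  # 3.8571
```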
- AutoCarver.feature_selection.measures.kruskal_measure(x: Series, y: Series, thresh_kruskal: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Kruskal-Wallis’ test statistic between x when y==1 and x when y==0.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_kruskal (float, optional) – Minimum Kruskal-Wallis association, by default 0
- Returns:
Whether x is sufficiently associated to y and Kruskal-Wallis’ test statistic
- Return type:
tuple[bool, dict[str, Any]]
Note
kruskal_measure is the default measure for quantitative features (i.e. when FeatureSelector.measures=[] and FeatureSelector.quantitative_features is provided).
Coefficient of determination \(R\)
For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:
\[x = \alpha + \beta y + \epsilon\]
where:
\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model
The determination coefficient, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). It is computed with the following formula:
\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]
where:
\(n\) is the number of observations
\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares
\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares
\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
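The definitions above can be checked numerically with a small standard-library sketch. It fits the one-regressor least-squares model in closed form rather than calling statsmodels, and the function name `r_coefficient` is hypothetical. It is meant to illustrate the formula, not to reproduce the library's implementation.

```python
import math


def r_coefficient(x, y):
    """Square root of R^2 for the regression x = alpha + beta * y + epsilon."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form ordinary least squares with a single regressor
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = s_xy / s_yy
    alpha = mean_x - beta * mean_y
    ss_res = sum((xi - alpha - beta * yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((xi - mean_x) ** 2 for xi in x)
    r_squared = 1 - ss_res / ss_tot
    return math.sqrt(max(0.0, r_squared))  # guard against tiny negative rounding


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]
r = r_coefficient(x, y)
print(round(r, 4))  # 0.8783
```

On this toy sample the result matches the absolute value of Pearson's \(r\) between \(x\) and \(y\), as the text states.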
- AutoCarver.feature_selection.measures.R_measure(x: Series, y: Series, thresh_R: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Square root of the coefficient of determination of the linear regression model of x by y.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_R (float, optional) – Minimum R association, by default 0
- Returns:
Whether x is sufficiently associated to y and the square root of the determination coefficient
- Return type:
tuple[bool, dict[str, Any]]
Quantitative Outlier Detection
Standard score
Standard score can be applied as a measure of deviation to detect outliers. For a feature \(x\) it is computed for any observation \(x_i\) as follows:
\[z_i = \frac{x_i - \bar{x}}{S}\]
where:
\(n\) is the number of observations
\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)
\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
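A minimal sketch of how a share of outliers could be derived from the standard score. The cutoff of 3 standard deviations (`z_max`) is an assumption of this example, since the text above does not state which cutoff the library uses, and `zscore_outlier_share` is a hypothetical name.

```python
import statistics


def zscore_outlier_share(x, z_max=3.0):
    """Share of observations whose absolute standard score exceeds z_max."""
    mean = statistics.fmean(x)
    std = statistics.stdev(x)  # sample standard deviation (n - 1 denominator)
    z = [(xi - mean) / std for xi in x]
    return sum(abs(zi) > z_max for zi in z) / len(x)


# 19 identical values and a single extreme one
x = [10.0] * 19 + [100.0]
share = zscore_outlier_share(x)
print(share)  # 0.05
```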
- AutoCarver.feature_selection.measures.zscore_measure(x: Series, y: Series | None = None, thresh_outlier: float = 1.0, **kwargs) tuple[bool, dict[str, Any]]
Computes outliers percentage based on the z-score
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_outlier (float, optional) – Maximum percentage of outliers in a feature, by default 1.0
- Returns:
Whether or not there are too many outliers and the outlier measurement
- Return type:
tuple[bool, dict[str, Any]]
Interquartile range
Interquartile range is widely used as an outlier detection metric. For a feature \(x\) it is computed as follows:
\[IQR = Q_3 - Q_1\]
where:
\(Q_1\) is the 25th percentile of \(x\)
\(Q_3\) is the 75th percentile of \(x\)
Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:
\[Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR\]
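The interquartile rule can be sketched with the standard library as follows. The 1.5 multiplier is the usual Tukey fence and is assumed here rather than taken from the library's source. The helper `iqr_outlier_share` is a hypothetical name, and quartiles use linear interpolation between data points.

```python
import statistics


def iqr_outlier_share(x):
    """Share of observations outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return sum(not (lower <= xi <= upper) for xi in x) / len(x)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0]
share = iqr_outlier_share(x)
print(round(share, 4))  # 0.1111
```

Here \(Q_1 = 3\) and \(Q_3 = 7\), so the bounds are \(-3\) and \(13\), and only the value 100 falls outside them.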
- AutoCarver.feature_selection.measures.iqr_measure(x: Series, y: Series | None = None, thresh_outlier: float = 1.0, **kwargs) tuple[bool, dict[str, Any]]
Computes outliers percentage based on the interquartile range
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_outlier (float, optional) – Maximum percentage of outliers in a feature, by default 1.0
- Returns:
Whether or not there are too many outliers and the outlier measurement
- Return type:
tuple[bool, dict[str, Any]]
Qualitative measures
Pearson’s \(\chi^2\) test statistic
For a qualitative feature \(x\), the association with a qualitative binary target \(y\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[\chi^2 = \sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)
\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)
\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
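To make the counts \(n_{ij}\), \(n_{i.}\) and \(n_{.j}\) concrete, here is a standard-library sketch that builds the contingency counts and sums the squared deviations from the expected counts. It stands in for pandas.crosstab and scipy.stats.chi2_contingency for illustration only. The function name `chi2_statistic` and the toy data are hypothetical.

```python
from collections import Counter


def chi2_statistic(x, y):
    """Pearson's chi-squared statistic of the x-by-y contingency table."""
    n = len(x)
    n_ij = Counter(zip(x, y))  # joint counts
    n_i = Counter(x)           # row totals, one per modality of x
    n_j = Counter(y)           # column totals, one per modality of y
    chi2 = 0.0
    for i in n_i:
        for j in n_j:
            expected = n_i[i] * n_j[j] / n
            chi2 += (n_ij[(i, j)] - expected) ** 2 / expected
    return chi2


# Three modalities of x, ten observations each, with target rates 0.8, 0.5 and 0.2
x = ["a"] * 10 + ["b"] * 10 + ["c"] * 10
y = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5 + [1] * 2 + [0] * 8
chi2 = chi2_statistic(x, y)
print(round(chi2, 4))  # 7.2
```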
- AutoCarver.feature_selection.measures.chi2_measure(x: Series, y: Series, thresh_chi2: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_chi2 (float, optional) – Minimum Chi2 association, by default 0
- Returns:
Whether x is sufficiently associated to y and Pearson’s chi2 between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Cramér’s \(V\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[V = \sqrt{\frac{\chi^2}{n \min(n_x - 1, n_y - 1)}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
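As a quick numerical illustration, Cramér's \(V\) can be derived from an already computed \(\chi^2\), assuming the usual definition \(V = \sqrt{\chi^2 / (n \min(n_x - 1, n_y - 1))}\). The function `cramers_v` and the input values (a hypothetical 3-by-2 table with \(\chi^2 = 7.2\) over 30 observations) are made up for the example.

```python
import math


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V from Pearson's chi-squared, sample size and table shape."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


v = cramers_v(chi2=7.2, n=30, n_x=3, n_y=2)
print(round(v, 4))  # 0.4899
```

With a binary target, \(\min(n_x - 1, n_y - 1) = 1\), so \(V\) reduces to \(\sqrt{\chi^2 / n}\).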
- AutoCarver.feature_selection.measures.cramerv_measure(x: Series, y: Series, thresh_cramerv: float = 0, chi2_statistic: float | None = None, **kwargs) tuple[bool, dict[str, Any]]
Computes Cramér’s V between x and y from chi2_measure.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_cramerv (float, optional) – Minimum Cramér’s V association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None
- Returns:
Whether x is sufficiently associated to y and Cramér’s V between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Note
tschuprowt_measure, rather than cramerv_measure, is the default measure for qualitative features (i.e. when FeatureSelector.measures=[] and FeatureSelector.qualitative_features is provided).
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[T = \sqrt{\frac{\chi^2}{n \sqrt{(n_x - 1)(n_y - 1)}}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
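For a binary target, \(n_y = 2\), and the relation between the two coefficients can be worked out explicitly. The short derivation below assumes the usual definitions \(V = \sqrt{\chi^2 / (n \min(n_x - 1, n_y - 1))}\) and \(T = \sqrt{\chi^2 / (n \sqrt{(n_x - 1)(n_y - 1)})}\), with \(n_x \geq 2\):

\[V = \sqrt{\frac{\chi^2}{n}}, \qquad T = \sqrt{\frac{\chi^2}{n \sqrt{n_x - 1}}} = \frac{V}{(n_x - 1)^{1/4}}\]

So \(T \leq V\), with equality when \(x\) is itself binary: for the same \(\chi^2\), Tschuprow’s \(T\) gives a lower score to features with many modalities. For example, with \(\chi^2 = 7.2\), \(n = 30\) and \(n_x = 3\), \(V = \sqrt{0.24} \approx 0.490\) and \(T = V / 2^{1/4} \approx 0.412\).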
- AutoCarver.feature_selection.measures.tschuprowt_measure(x: Series, y: Series, thresh_tschuprowt: float = 0, chi2_statistic: float | None = None, **kwargs) tuple[bool, dict[str, Any]]
Computes Tschuprow’s T between x and y from chi2_measure.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_tschuprowt (float, optional) – Minimum Tschuprow’s T association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None
- Returns:
Whether x is sufficiently associated to y and Tschuprow’s T between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Base data information
- AutoCarver.feature_selection.measures.nans_measure(x: Series, y: Series | None = None, thresh_nan: float = 0.999, **kwargs) tuple[bool, dict[str, Any]]
Measure of the percentage of NaNs
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_nan (float, optional) – Maximum percentage of NaNs in a feature, by default 0.999
- Returns:
Whether or not there are too many NaNs and the percentage of NaNs
- Return type:
tuple[bool, dict[str, Any]]
Note
nans_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.
- AutoCarver.feature_selection.measures.dtype_measure(x: Series, y: Series | None = None, **kwargs) tuple[bool, dict[str, Any]]
Feature’s dtype
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
- Returns:
True and the feature’s dtype
- Return type:
tuple[bool, dict[str, Any]]
Note
dtype_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.
- AutoCarver.feature_selection.measures.mode_measure(x: Series, y: Series | None = None, thresh_mode: float = 0.999, **kwargs) tuple[bool, dict[str, Any]]
Measure of the percentage of the mode
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_mode (float, optional) – Maximum percentage of a feature’s mode, by default 0.999
- Returns:
Whether or not the mode is overrepresented and the percentage of mode
- Return type:
tuple[bool, dict[str, Any]]
Note
mode_measure is evaluated by default by FeatureSelector. If threshold is reached, feature will automatically be dropped.
Association filters, X by X
Quantitative filters
Pearson’s \(r\)
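For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.
Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:
\[r = \frac{\sum_{i=1}^n{(x_1^i - \bar{x_1})(x_2^i - \bar{x_2})}}{\sqrt{\sum_{i=1}^n{(x_1^i - \bar{x_1})^2}}\sqrt{\sum_{i=1}^n{(x_2^i - \bar{x_2})^2}}}\]
where: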
\(n\) is the number of observations
\(x_1^i\) is the \(i\) th observation of \(x_1\)
\(x_2^i\) is the \(i\) th observation of \(x_2\)
\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)
\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)
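The coefficient can be computed directly from the definitions above. The following is a plain-Python sketch for illustration (the library itself relies on pandas.DataFrame.corr), with a hypothetical function name `pearson_r` and toy data.

```python
import math


def pearson_r(x1, x2):
    """Pearson's linear correlation coefficient between two samples."""
    n = len(x1)
    mean_1 = sum(x1) / n
    mean_2 = sum(x2) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    ss_1 = sum((a - mean_1) ** 2 for a in x1)
    ss_2 = sum((b - mean_2) ** 2 for b in x2)
    return cross / math.sqrt(ss_1 * ss_2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(x1, x2)
print(round(r, 4))  # 0.8
```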
- AutoCarver.feature_selection.filters.pearson_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) dict[str, Any]
Computes maximum Pearson’s r between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Pearson’s r between features, by default 1
- Returns:
Maximum Pearson’s r with a better feature
- Return type:
dict[str, Any]
Spearman’s \(\rho\)
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:
\[\rho = r_{x_{1_{r}}x_{2_{r}}}\]
where:
\(x_{1_{r}}\) is the ranked version of \(x_1\)
\(x_{2_{r}}\) is the ranked version of \(x_2\)
\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)
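The "Pearson on ranks" definition can be illustrated with a short standard-library sketch. The helper names (`ranks`, `pearson_r`, `spearman_rho`) are hypothetical, and the ranking helper assumes there are no tied values, which keeps the example short.

```python
import math


def ranks(values):
    """1-based rank of each value in the sorted sample (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def pearson_r(x1, x2):
    """Pearson's linear correlation coefficient between two samples."""
    n = len(x1)
    mean_1, mean_2 = sum(x1) / n, sum(x2) / n
    cross = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    ss_1 = sum((a - mean_1) ** 2 for a in x1)
    ss_2 = sum((b - mean_2) ** 2 for b in x2)
    return cross / math.sqrt(ss_1 * ss_2)


def spearman_rho(x1, x2):
    """Spearman's rho: Pearson's r computed on the ranked features."""
    return pearson_r(ranks(x1), ranks(x2))


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.0, 8.0, 27.0, 64.0, 125.0]  # monotonic but non-linear in x1
rho = spearman_rho(x1, x2)
raw_r = pearson_r(x1, x2)
print(rho, round(raw_r, 3))  # 1.0 0.943
```

Because \(x_2\) increases with \(x_1\) without being linear in it, \(\rho\) equals 1 while Pearson's \(r\) on the raw values stays below 1.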
- AutoCarver.feature_selection.filters.spearman_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) dict[str, Any]
Computes maximum Spearman’s rho between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Spearman’s rho between features, by default 1
- Returns:
Maximum Spearman’s rho with a better feature
- Return type:
dict[str, Any]
Note
spearman_filter is the default filter for quantitative features (i.e. when FeatureSelector.filters=[] and FeatureSelector.quantitative_features is provided).
Qualitative filters
Cramér’s \(V\)
For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[\chi^2 = \sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)
\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)
\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[V = \sqrt{\frac{\chi^2}{n \min(n_{x_1} - 1, n_{x_2} - 1)}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- AutoCarver.feature_selection.filters.cramerv_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) dict[str, Any]
Computes maximum Cramér’s V between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Cramér’s V between features, by default 1
- Returns:
Maximum Cramér’s V with a better feature
- Return type:
dict[str, Any]
Note
tschuprowt_filter, rather than cramerv_filter, is the default filter for qualitative features (i.e. when FeatureSelector.filters=[] and FeatureSelector.qualitative_features is provided).
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[T = \sqrt{\frac{\chi^2}{n \sqrt{(n_{x_1} - 1)(n_{x_2} - 1)}}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- AutoCarver.feature_selection.filters.tschuprowt_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) dict[str, Any]
Computes maximum Tschuprow’s T between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Tschuprow’s T between features, by default 1
- Returns:
Maximum Tschuprow’s T with a better feature
- Return type:
dict[str, Any]