Selectors
AutoCarver implements Selectors, which provide the following association-centric data selection steps:
Measuring association with a target and ranking features accordingly.
Filtering out features too associated with a better-ranked feature.
They allow one to select features:
Whatever their type: quantitative or qualitative
Whatever the optimization task: Classification tasks or Regression tasks
In general, associations are computed according to the provided data types of \(x\) and \(y\):
\(x\) \ \(y\) | Qualitative | Quantitative
Qualitative | Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\) | Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative | Kruskal-Wallis’ \(H\), \(R\) coefficient | Pearson’s \(r\), Spearman’s \(\rho\)
See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.
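The ranking and filtering steps described in this section can be sketched with plain pandas for quantitative features. This is a minimal illustration and not AutoCarver's implementation: the helper name rank_then_filter, the max_association threshold of 0.8 and the toy data are assumptions made for the example, and Spearman’s \(\rho\) is obtained as Pearson’s \(r\) on ranks.

```python
import pandas as pd


def rank_then_filter(X: pd.DataFrame, y: pd.Series, max_association: float = 0.8) -> list:
    """Hypothetical helper: ranks columns of X by association with y, then
    drops columns too associated with a better-ranked, already-kept column."""
    # Spearman's rho is Pearson's r computed on ranks
    X_ranked, y_ranked = X.rank(), y.rank()

    # step 1: measure association with the target and rank features accordingly
    target_association = X_ranked.corrwith(y_ranked).abs()
    ranked = target_association.sort_values(ascending=False).index.tolist()

    # step 2: filter out features too associated with a better-ranked feature
    feature_association = X_ranked.corr().abs()
    kept = []
    for feature in ranked:
        if all(feature_association.loc[feature, other] < max_association for other in kept):
            kept.append(feature)
    return kept


# toy data (assumed for illustration)
X = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "a_noisy": [1.1, 2.1, 3.1, 4.1, 6.1, 5.1],  # nearly the same ordering as "a"
        "b": [3.0, 1.0, 2.0, 6.0, 4.0, 5.0],
    }
)
y = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 7.0])

selected = rank_then_filter(X, y)
print(selected)
```

With this toy data, a_noisy is discarded because it is almost perfectly rank-correlated with the better-ranked a, while b is kept.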
Classification tasks
- class AutoCarver.selectors.ClassificationSelector(features: Features, n_best_per_type: int, **kwargs)
A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.
Get your best features with ClassificationSelector.select()!
- Parameters:
features (Features) – A set of Features to select from
n_best_per_type (int) – Number of quantitative and/or qualitative Features to select
- Keyword Arguments:
measures (list[BaseMeasure], optional) –
List of association measures to be used, by default None.
Selects n_best_per_type features for each measure provided. Implemented measures are:
QuantitativeFeature: see available Quantitative measures
QualitativeFeature: see available Qualitative measures
filters (list[BaseFilter], optional) –
List of filters to be used, by default None.
Filters out features that do not pass the threshold of each filter. Implemented filters are:
QuantitativeFeature: see available Quantitative filters
QualitativeFeature: see available Qualitative filters
max_num_features_per_chunk (int, optional) –
Maximum number of features per chunk, by default 100.
Chunking is used to speed up the selection process for large numbers of Features.
Features are split in n_chunks of max_num_features_per_chunk
n_best_per_type // n_chunks of each chunk are selected
best features are selected from the remaining features
verbose (bool, optional) –
True, without IPython: prints raw statistics
True, with IPython: prints HTML statistics, by default False
Regression tasks
- class AutoCarver.selectors.RegressionSelector(features: Features, n_best_per_type: int, **kwargs)
A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.
Get your best features with RegressionSelector.select()!
- Parameters:
features (Features) – A set of Features to select from
n_best_per_type (int) – Number of quantitative and/or qualitative Features to select
- Keyword Arguments:
measures (list[BaseMeasure], optional) –
List of association measures to be used, by default None.
Selects n_best_per_type features for each measure provided. Implemented measures are:
QuantitativeFeature: see available Quantitative measures
QualitativeFeature: see available Qualitative measures
filters (list[BaseFilter], optional) –
List of filters to be used, by default None.
Filters out features that do not pass the threshold of each filter. Implemented filters are:
QuantitativeFeature: see available Quantitative filters
QualitativeFeature: see available Qualitative filters
max_num_features_per_chunk (int, optional) –
Maximum number of features per chunk, by default 100.
Chunking is used to speed up the selection process for large numbers of Features.
Features are split in n_chunks of max_num_features_per_chunk
n_best_per_type // n_chunks of each chunk are selected
best features are selected from the remaining features
verbose (bool, optional) –
True, without IPython: prints raw statistics
True, with IPython: prints HTML statistics, by default False
Association measures, X by y
Quantitative measures
Pearson’s \(r\)
For a quantitative feature \(x\), the association with a quantitative target \(y\) is computed using pandas.DataFrame.corr.
Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:
\[r_{xy} = \frac{\sum_{i=1}^n{(x^i - \bar{x})(y^i - \bar{y})}}{\sqrt{\sum_{i=1}^n{(x^i - \bar{x})^2}}\sqrt{\sum_{i=1}^n{(y^i - \bar{y})^2}}}\]
where:
\(n\) is the number of observations
\(x^i\) is the \(i\) th observation of \(x\)
\(y^i\) is the \(i\) th observation of \(y\)
\(\bar{x}=\frac{1}{n}\sum_{i=1}^n{x^i}\) is the sample mean of \(x\)
\(\bar{y}=\frac{1}{n}\sum_{i=1}^n{y^i}\) is the sample mean of \(y\)
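As a quick check of the definition of Pearson’s \(r\) given here, the sketch below computes it term by term and compares the result with pandas.DataFrame.corr. The helper name pearson_r and the sample values are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Hypothetical helper: Pearson's r from centered observations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered**2).sum()) * np.sqrt((y_centered**2).sum())
    return float(numerator / denominator)


# sample values (assumed for illustration)
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 6.0])

r_manual = pearson_r(x, y)
r_pandas = pd.DataFrame({"x": x, "y": y}).corr(method="pearson").loc["x", "y"]
print(r_manual, r_pandas)
```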
- class AutoCarver.selectors.measures.PearsonMeasure(threshold: float = 0.0)
Pearson’s linear correlation coefficient between a Quantitative feature and target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_absolute = True
whether the measure needs absolute value for comparison or not
- is_x_quantitative = True
whether x is quantitative or not
- is_y_quantitative = True
whether y is quantitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Spearman’s \(\rho\)
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:
\[\rho_{xy} = r_{x_{r}y_{r}}\]
where:
\(x_{r}\) is the ranked version of \(x\)
\(y_{r}\) is the ranked version of \(y\)
\(r_{x_{r}y_{r}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{r}\) and \(y_{r}\)
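The relation between Spearman’s \(\rho\) and Pearson’s \(r\) can be checked with pandas: ranking both series and taking Pearson’s \(r\) of the ranks gives the same value as asking pandas for Spearman’s \(\rho\) directly. The sample values are assumptions made for the example; note that pandas ranks start at 1 and average tied values.

```python
import pandas as pd

# sample values (assumed for illustration); the extreme value 1000.0 only
# affects Pearson's r on raw values, not the ranks
x = pd.Series([10.0, 20.0, 30.0, 40.0, 1000.0])
y = pd.Series([1.0, 2.0, 3.0, 5.0, 4.0])

# ranked versions of x and y
x_r, y_r = x.rank(), y.rank()

# Pearson's r on ranks
rho_from_ranks = x_r.corr(y_r, method="pearson")

# Spearman's rho computed directly by pandas
rho_pandas = pd.DataFrame({"x": x, "y": y}).corr(method="spearman").loc["x", "y"]
print(rho_from_ranks, rho_pandas)
```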
- class AutoCarver.selectors.measures.SpearmanMeasure(threshold: float = 0.0)
Spearman’s rank correlation coefficient between a Quantitative feature and target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_absolute = True
whether the measure needs absolute value for comparison or not
- is_x_quantitative = True
whether x is quantitative or not
- is_y_quantitative = True
whether y is quantitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Note
SpearmanMeasure is the default measure for each QuantitativeFeature when using RegressionSelector.
Distance Correlation
For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:
\[d_{xy} = 1 - \frac{(x - \bar{x}) \cdot (y - \bar{y})}{||x - \bar{x}||_2 \, ||y - \bar{y}||_2}\]
where:
\(n_x\) is the number of observations of \(x\)
\(n_y\) is the number of observations of \(y\)
\(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)
\(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)
\(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the Euclidean norm of the centered \(x\)
\(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the Euclidean norm of the centered \(y\)
The Distance Correlation is computed using scipy.spatial.distance.correlation.
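The quantity returned by scipy.spatial.distance.correlation can be reproduced from the centered vectors and their Euclidean norms, as in the sketch below. The sample values are assumptions made for the example, and the manual expression follows scipy's documented definition of this function rather than AutoCarver's source code.

```python
import numpy as np
from scipy.spatial.distance import correlation

# sample values (assumed for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

# centered vectors
x_centered = x - x.mean()
y_centered = y - y.mean()

# one minus the normalized dot product of the centered vectors
manual = 1.0 - (x_centered @ y_centered) / (
    np.linalg.norm(x_centered) * np.linalg.norm(y_centered)
)

scipy_value = correlation(x, y)
print(manual, scipy_value)
```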
- class AutoCarver.selectors.measures.DistanceMeasure(threshold: float = 0.0)
Distance correlation between a Quantitative feature and target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_absolute = True
whether the measure needs absolute value for comparison or not
- is_x_quantitative = True
whether x is quantitative or not
- is_y_quantitative = True
whether y is quantitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Kruskal-Wallis’ \(H\) test statistic
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(x\) is distributed the same across the \(n_y\) modalities taken by \(y\). It is computed using the following formula:
\[H = (n-1)\frac{\sum_{i=1}^{n_y}{n_{y=i}(\bar{x_r^{i.}} - \bar{x_r})^2}}{\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{(x_r^{ij} - \bar{x_r})^2}}}\]
where:
\(n\) is the number of observations
\(n_y\) is the number of modalities of \(y\)
\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality
\(x_r\) is the ranked version of \(x\)
\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}\) is the sample mean of \(x_r\)
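To make the rank-based definition of \(H\) concrete, the sketch below computes it as \((n-1)\) times the ratio of the between-modality sum of squares of ranks to the total sum of squares of ranks, and compares the result with scipy.stats.kruskal. The sample values and variable names are assumptions made for the example.

```python
import pandas as pd
from scipy.stats import kruskal

# sample values (assumed for illustration): a quantitative x and a qualitative y
x = pd.Series([1.2, 3.4, 2.2, 5.1, 4.8, 6.0, 0.7, 2.9])
y = pd.Series(["a", "b", "a", "b", "b", "b", "a", "a"])

# ranked version of x and number of observations
x_r = x.rank()
n = len(x_r)

# per-modality sizes and mean ranks
group_sizes = x_r.groupby(y).size()
group_means = x_r.groupby(y).mean()

# between-modality and total sums of squares of the ranks
between = (group_sizes * (group_means - x_r.mean()) ** 2).sum()
total = ((x_r - x_r.mean()) ** 2).sum()
h_manual = (n - 1) * between / total

# same statistic from scipy, one sample per modality of y
h_scipy = kruskal(*[x[y == modality] for modality in y.unique()]).statistic
print(h_manual, h_scipy)
```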
- class AutoCarver.selectors.measures.KruskalMeasure(threshold: float = 0.0)
Kruskal-Wallis’ test statistic between a Quantitative feature and a Qualitative target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_reversible = True
whether the measure’s inputs can be reversed depending on their type or not
- is_x_quantitative = True
whether x is quantitative or not
- is_y_qualitative = True
whether y is qualitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Note
KruskalMeasure is the default measure for each QualitativeFeature when using RegressionSelector.
KruskalMeasure is the default measure for each QuantitativeFeature when using ClassificationSelector.
Coefficient of determination \(R\)
For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:
\[x = \alpha + \beta y + \epsilon\]
where:
\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model
The determination coefficient, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). It is computed with the following formula:
\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\]
where:
\(n\) is the number of observations
\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares
\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares
\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
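The equality between \(R^2\) and the square of Pearson’s \(r\) can be verified numerically. The sketch below is an illustration only: it uses numpy.polyfit as a stand-in for statsmodels.formula.api.ols to fit the regression of \(x\) on \(y\), and the sample values are assumptions made for the example.

```python
import numpy as np

# sample values (assumed for illustration)
x = np.array([1.0, 2.5, 2.0, 4.0, 3.5, 5.0])  # quantitative feature
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # binary target

# least-squares fit of x = alpha + beta * y + epsilon
# (np.polyfit returns the highest-degree coefficient first)
beta, alpha = np.polyfit(y, x, deg=1)

# residual and total sums of squares
residuals = x - alpha - beta * y
ss_res = (residuals**2).sum()
ss_tot = ((x - x.mean()) ** 2).sum()

r_squared = 1.0 - ss_res / ss_tot
pearson_r = np.corrcoef(x, y)[0, 1]
print(r_squared, pearson_r**2)
```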
- class AutoCarver.selectors.measures.RMeasure(threshold: float = 0.0)
Square root of the coefficient of determination of the linear regression model of a Quantitative feature by a Binary target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_x_quantitative = True
whether x is quantitative or not
- is_y_binary = True
whether y is binary or not
- is_y_qualitative = True
whether y is qualitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Outlier Detection Measures
Standard Score
Standard score can be applied as a measure of deviation to identify outliers in quantitative features. For a feature \(x\) it is computed for any observation \(x_i\) as follows:
\[z_i = \frac{x_i - \bar{x}}{S}\]
where:
\(n\) is the number of observations
\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)
\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
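A minimal sketch of the standard score, using the sample standard deviation with \(n-1\) in the denominator as defined here. The cut-off of 2.0 and the sample values are assumptions made for the example; how ZscoreOutlierMeasure aggregates the scores and applies its threshold is not shown.

```python
import pandas as pd

# sample values (assumed for illustration), with one extreme observation
x = pd.Series([9.0, 10.0, 11.0, 10.0, 9.5, 10.5, 30.0])

# standard score of each observation, with the sample standard deviation (ddof=1)
z = (x - x.mean()) / x.std(ddof=1)

# observations whose absolute standard score exceeds a hypothetical cut-off
cutoff = 2.0
outliers = x[z.abs() > cutoff]
print(outliers.tolist())
```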
- class AutoCarver.selectors.measures.ZscoreOutlierMeasure(threshold: float = 1.0)
Z-Score based outlier measure
- Parameters:
threshold (float, optional) – Maximum threshold to reach, by default 1.0
- compute_association(x: Series, y: Series = None) → float
Computes outlier measure on x
- higher_is_better = False
whether higher values are better or not
- is_x_qualitative = False
whether x is qualitative or not
- is_x_quantitative = True
whether x is quantitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Interquartile range
Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:
\[IQR = Q_3 - Q_1\]
where:
\(Q_1\) is the 25th percentile of \(x\)
\(Q_3\) is the 75th percentile of \(x\)
Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:
\[Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR\]
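The interquartile range rule can be sketched with numpy as follows. The 1.5 multiplier is the conventional choice and is an assumption here, as are the sample values; how IqrOutlierMeasure turns the flagged observations into a single measure is not shown.

```python
import numpy as np

# sample values (assumed for illustration), with one extreme observation
x = np.array([9.0, 10.0, 11.0, 10.0, 9.5, 10.5, 30.0])

# 25th and 75th percentiles, and interquartile range
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# conventional fences (the 1.5 multiplier is an assumption)
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# observations falling outside the fences
outliers = x[(x < lower) | (x > upper)]
print(iqr, outliers.tolist())
```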
- class AutoCarver.selectors.measures.IqrOutlierMeasure(threshold: float = 1.0)
Interquartile range based outlier measure
- Parameters:
threshold (float, optional) – Maximum threshold to reach, by default 1.0
- compute_association(x: Series, y: Series = None) → float
Computes outlier measure on x
- higher_is_better = False
whether higher values are better or not
- is_x_qualitative = False
whether x is qualitative or not
- is_x_quantitative = True
whether x is quantitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Qualitative measures
Pearson’s \(\chi^2\) test statistic
For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[\chi^2 = \sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)
\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)
\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
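The \(\chi^2\) statistic can be reproduced from the contingency table built with pandas.crosstab and compared with scipy.stats.chi2_contingency, as sketched below. The sample values are assumptions made for the example; correction=False is passed so that no continuity correction is applied.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# sample values (assumed for illustration)
x = pd.Series(list("aaaabbbbcccc"))
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1])

# observed counts n_ij, x by y
observed = pd.crosstab(x, y)
counts = observed.to_numpy()
n = counts.sum()

# expected counts under independence: n_i. * n_.j / n
expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / n

chi2_manual = ((counts - expected) ** 2 / expected).sum()
chi2_scipy = chi2_contingency(observed, correction=False)[0]
print(chi2_manual, chi2_scipy)
```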
- class AutoCarver.selectors.measures.Chi2Measure(threshold: float = 0.0)
Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_x_qualitative = True
whether x is qualitative or not
- is_y_qualitative = True
whether y is qualitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Cramér’s \(V\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[V = \sqrt{\frac{\chi^2}{n \, \min(n_x - 1, n_y - 1)}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
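A minimal sketch of Cramér’s \(V\), derived from the \(\chi^2\) statistic of the x by y contingency table. The sample values and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# sample values (assumed for illustration)
x = pd.Series(list("aaaabbbbcccc"))
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1])

# contingency table and chi2 statistic, without continuity correction
observed = pd.crosstab(x, y)
chi2 = chi2_contingency(observed, correction=False)[0]

# number of observations and of modalities of x and y
n = observed.to_numpy().sum()
n_x, n_y = observed.shape

cramers_v = np.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))
print(cramers_v)
```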
- class AutoCarver.selectors.measures.CramervMeasure(threshold: float = 0.0)
Computes Cramér’s V between a Qualitative feature and a binary target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_x_qualitative = True
whether x is qualitative or not
- is_y_qualitative = True
whether y is qualitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[T = \sqrt{\frac{\chi^2}{n \sqrt{(n_x - 1)(n_y - 1)}}}\]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
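A minimal sketch of Tschuprow’s \(T\), computed alongside Cramér’s \(V\) from the same \(\chi^2\) statistic for comparison. The sample values and variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# sample values (assumed for illustration)
x = pd.Series(list("aaaabbbbcccc"))
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1])

# contingency table and chi2 statistic, without continuity correction
observed = pd.crosstab(x, y)
chi2 = chi2_contingency(observed, correction=False)[0]

# number of observations and of modalities of x and y
n = observed.to_numpy().sum()
n_x, n_y = observed.shape

tschuprows_t = np.sqrt(chi2 / (n * np.sqrt((n_x - 1) * (n_y - 1))))
cramers_v = np.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))
print(tschuprows_t, cramers_v)
```

Since \(\sqrt{(n_x - 1)(n_y - 1)} \geq \min(n_x - 1, n_y - 1)\), \(T\) never exceeds \(V\), and the two coincide when \(x\) and \(y\) have the same number of modalities.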
- class AutoCarver.selectors.measures.TschuprowtMeasure(threshold: float = 0.0)
Computes Tschuprow’s T between a Qualitative feature and a binary target.
- Parameters:
threshold (float, optional) – Minimum threshold to reach, by default 0.0
- compute_association(x: Series, y: Series) → float
Computes association measure between x and y
- higher_is_better = True
whether higher values are better or not
- is_x_qualitative = True
whether x is qualitative or not
- is_y_qualitative = True
whether y is qualitative or not
- validate() → bool
Checks if threshold is reached
- Returns:
Whether the test is passed or not
- Return type:
bool
Note
TschuprowtMeasure is the default measure for each QualitativeFeature when using ClassificationSelector.
Association filters, X by X
Quantitative filters
Pearson’s \(r\)
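For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.
Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:
\[r_{x_1x_2} = \frac{\sum_{i=1}^n{(x_1^i - \bar{x_1})(x_2^i - \bar{x_2})}}{\sqrt{\sum_{i=1}^n{(x_1^i - \bar{x_1})^2}}\sqrt{\sum_{i=1}^n{(x_2^i - \bar{x_2})^2}}}\]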
where:
\(n\) is the number of observations
\(x_1^i\) is the \(i\) th observation of \(x_1\)
\(x_2^i\) is the \(i\) th observation of \(x_2\)
\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)
\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)
- class AutoCarver.selectors.filters.PearsonFilter(threshold: float = 1.0)
Computes maximum Pearson’s r between quantitative features of X
- filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]
Filters out ranked features that reach the association threshold
- higher_is_better = False
whether higher values are better or not
- is_absolute = True
whether the measure needs absolute value for comparison or not
- is_x_quantitative = True
whether x is quantitative or not
Spearman’s \(\rho\)
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:
\[\rho_{x_1x_2} = r_{x_{1_{r}}x_{2_{r}}}\]
where:
\(x_{1_{r}}\) is the ranked version of \(x_1\)
\(x_{2_{r}}\) is the ranked version of \(x_2\)
\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)
- class AutoCarver.selectors.filters.SpearmanFilter(threshold: float = 1.0)
Computes maximum Spearman’s rho between quantitative features of X
- filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]
Filters out ranked features that reach the association threshold
- higher_is_better = False
whether higher values are better or not
- is_absolute = True
whether the measure needs absolute value for comparison or not
- is_x_quantitative = True
whether x is quantitative or not
Note
SpearmanFilter is the default filter used to measure association between pairs of QuantitativeFeature.
Qualitative filters
Cramér’s \(V\)
For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[\chi^2 = \sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{\left(n_{ij} - \frac{n_{i.}n_{.j}}{n}\right)^2}{\frac{n_{i.}n_{.j}}{n}}}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)
\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)
\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[V = \sqrt{\frac{\chi^2}{n \, \min(n_{x_1} - 1, n_{x_2} - 1)}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- class AutoCarver.selectors.filters.CramervFilter(threshold: float = 1.0)
Computes maximum Cramér’s V between qualitative features of X. Filters out ranked features that reach the association threshold.
- filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]
Filters out ranked features that reach the association threshold
- higher_is_better = False
whether higher values are better or not
- is_x_qualitative = True
whether x is qualitative or not
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[T = \sqrt{\frac{\chi^2}{n \sqrt{(n_{x_1} - 1)(n_{x_2} - 1)}}}\]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- class AutoCarver.selectors.filters.TschuprowtFilter(threshold: float = 1.0)
Computes maximum Tschuprow’s T between qualitative features of X. Filters out ranked features that reach the association threshold.
- filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]
Filters out ranked features that reach the association threshold
- higher_is_better = False
whether higher values are better or not
- is_x_qualitative = True
whether x is qualitative or not
Note
TschuprowtFilter is the default filter used to measure association between pairs of QualitativeFeature.