Selectors

AutoCarver implements Selectors, which provide the following association-centric data selection steps:

Measuring association with a target and ranking features accordingly.

Filtering out features too associated with a better-ranked feature.

They allow one to select features:

Whatever their type: quantitative or qualitative

Whatever the optimization task: Classification tasks or Regression tasks

In general, associations are computed according to the provided data types of \(x\) and \(y\):

\(x\) \ \(y\)	Qualitative	Quantitative
Qualitative	Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\)	Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative	Kruskal-Wallis’ \(H\), \(R\) coefficient	Pearson’s \(r\), Spearman’s \(\rho\)

See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.

Classification tasks

class AutoCarver.selectors.ClassificationSelector(features: Features, n_best_per_type: int, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.

Get your best features with ClassificationSelector.select()!

Parameters:

features (Features) – A set of Features to select from
n_best_per_type (int) – Number of quantitative and/or qualitative Features to select

Keyword Arguments:

measures (list[BaseMeasure], optional) –
List of association measures to be used, by default None.

Selects n_best_per_type features for each measure provided. Implemented measures are:
- QuantitativeFeature: see available Quantitative measures
- QualitativeFeature: see available Qualitative measures
filters (list[BaseFilter], optional) –
List of filters to be used, by default None.

Filters out features that do not pass the threshold of each filter. Implemented filters are:
- QuantitativeFeature: see available Quantitative filters
- QualitativeFeature: see available Qualitative filters
max_num_features_per_chunk (int, optional) –
Maximum number of features per chunk, by default 100.

Chunking is used to speed up the selection process for large numbers of Features.
1. Features are split in n_chunks of max_num_features_per_chunk
2. n_best_per_type//n_chunks of each chunk are selected
3. best features are selected from the remaining features
verbose (bool, optional) –
- True, without IPython: prints raw statistics
- True, with IPython: prints HTML statistics, by default False
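The chunking steps listed for max_num_features_per_chunk can be sketched as follows. This is only a literal reading of that description, not AutoCarver's implementation: the helper `chunked_top_k`, its arguments, and the use of one precomputed association score per feature are assumptions made for the example.

```python
def chunked_top_k(scores, n_best, max_per_chunk):
    """Hypothetical helper: keep the n_best highest-scoring feature names.

    scores maps a feature name to its association with the target.
    """
    names = list(scores)
    if not names:
        return []
    # 1. split the features into chunks of at most max_per_chunk names
    chunks = [names[i:i + max_per_chunk] for i in range(0, len(names), max_per_chunk)]
    n_chunks = len(chunks)
    # 2. keep the n_best // n_chunks best features of each chunk
    per_chunk = n_best // n_chunks
    survivors = []
    for chunk in chunks:
        ranked = sorted(chunk, key=scores.get, reverse=True)
        survivors.extend(ranked[:per_chunk])
    # 3. rank the surviving features together and keep the overall best
    return sorted(survivors, key=scores.get, reverse=True)[:n_best]


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3, "f": 0.8}
selected = chunked_top_k(scores, n_best=2, max_per_chunk=3)
```

Here the six features form two chunks of three, one feature survives per chunk, and the survivors are ranked together.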

select(X: DataFrame, y: Series) → Features

Selects the n_best_per_type Features of X

Parameters:

X (DataFrame) – Dataset to determine optimal features.
y (Series) – Target with which the association is evaluated.

Returns:

Selected Features

Return type:

Features

Regression tasks

class AutoCarver.selectors.RegressionSelector(features: Features, n_best_per_type: int, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.

Get your best features with RegressionSelector.select()!

Parameters:

features (Features) – A set of Features to select from
n_best_per_type (int) – Number of quantitative and/or qualitative Features to select

Keyword Arguments:

measures (list[BaseMeasure], optional) –
List of association measures to be used, by default None.

Selects n_best_per_type features for each measure provided. Implemented measures are:
- QuantitativeFeature: see available Quantitative measures
- QualitativeFeature: see available Qualitative measures
filters (list[BaseFilter], optional) –
List of filters to be used, by default None.

Filters out features that do not pass the threshold of each filter. Implemented filters are:
- QuantitativeFeature: see available Quantitative filters
- QualitativeFeature: see available Qualitative filters
max_num_features_per_chunk (int, optional) –
Maximum number of features per chunk, by default 100.

Chunking is used to speed up the selection process for large numbers of Features.
1. Features are split in n_chunks of max_num_features_per_chunk
2. n_best_per_type//n_chunks of each chunk are selected
3. best features are selected from the remaining features
verbose (bool, optional) –
- True, without IPython: prints raw statistics
- True, with IPython: prints HTML statistics, by default False

select(X: DataFrame, y: Series) → Features

Selects the n_best_per_type Features of X

Parameters:

X (DataFrame) – Dataset to determine optimal features.
y (Series) – Target with which the association is evaluated.

Returns:

Selected Features

Return type:

Features

Association measures, X by y

Quantitative measures

Pearson’s \(r\)

For a quantitative feature \(x\), the association with a quantitative target \(y\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{xy}= \frac{\sum_{i=1}^{n}{(x^i-\bar{x})(y^i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x^i-\bar{x})^2}} \sqrt{\sum_{i=1}^{n}{(y^i-\bar{y})^2}}}\]

where:

\(n\) is the number of observations

\(x^i\) is the \(i\) th observation of \(x\)

\(y^i\) is the \(i\) th observation of \(y\)

\(\bar{x}=\frac{1}{n}\sum_{i=1}^n{x^i}\) is the sample mean of \(x\)

\(\bar{y}=\frac{1}{n}\sum_{i=1}^n{y^i}\) is the sample mean of \(y\)
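As a minimal sketch, the formula above can be evaluated term by term with the standard library only. The function name `pearson_r` and the sample values are illustrative; the library itself relies on pandas.DataFrame.corr.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed directly from the formula above."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # numerator: sum of the products of the centered observations
    covariation = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # denominator: product of the square roots of the summed squared deviations
    x_spread = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
    y_spread = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return covariation / (x_spread * y_spread)


r_increasing = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])  # exact linear increase
r_decreasing = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # exact linear decrease
```

The two calls give values close to 1 and -1 respectively. Both are equally strong associations, which is why the measure is compared in absolute value.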

class AutoCarver.selectors.measures.PearsonMeasure(threshold: float = 0.0)

Pearson’s linear correlation coefficient between a Quantitative feature and target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_absolute = True: whether the measure needs absolute value for comparison or not

is_x_quantitative = True: whether x is quantitative or not

is_y_quantitative = True: whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\), such that every \(i\) in \((1, n-1)\) satisfies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank feature \(x_r\) gives the position of each value of \(x\) in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{r}y_{r}}\]

where:

\(x_{r}\) is the ranked version of \(x\)

\(y_{r}\) is the ranked version of \(y\)

\(r_{x_{r}y_{r}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{r}\) and \(y_{r}\)
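The definition above translates into two steps: rank both features, then apply Pearson's \(r\) to the ranks. The sketch below uses only the standard library. The helper names are illustrative, and giving tied values the mean of their positions is an assumption, since the text does not say how ties are ranked.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the block while the next sorted value ties with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y))
    return num / den


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r between the ranked versions of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x ** 3: monotonic but not linear
rho = spearman_rho(x, y)
r = pearson_r(x, y)
```

Because the relation is monotonic, `rho` is close to 1 while `r` stays below 1. This illustrates why a rank-based measure can be preferable when the relation with the target is not linear.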

class AutoCarver.selectors.measures.SpearmanMeasure(threshold: float = 0.0)

Spearman’s rank correlation coefficient between a Quantitative feature and target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_absolute = True: whether the measure needs absolute value for comparison or not

is_x_quantitative = True: whether x is quantitative or not

is_y_quantitative = True: whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Note

SpearmanMeasure is the default measure for each QuantitativeFeature when using RegressionSelector.

Distance Correlation

For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:

\[1 - \frac{ (x - \bar{x}) \cdot (y - \bar{y}) } { ||x - \bar{x}||_2 \, ||y - \bar{y}||_2 }\]

where:

\(n_x\) is the number of observations of \(x\)

\(n_y\) is the number of observations of \(y\)

\(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)

\(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)

\(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the Euclidean norm of the centered \(x\)

\(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the Euclidean norm of the centered \(y\)

The Distance Correlation is computed using scipy.spatial.distance.correlation, which SciPy documents as the correlation distance. It is equal to one minus Pearson’s \(r\) between \(x\) and \(y\).
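The formula above amounts to one minus the cosine of the angle between the centered features. A minimal standard-library sketch follows; the function name `correlation_distance` and the sample values are illustrative and do not come from the library.

```python
import math


def correlation_distance(x, y):
    """1 - (centered x . centered y) / (norm of centered x * norm of centered y)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    dot = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x))
    norm_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return 1 - dot / (norm_x * norm_y)


d_aligned = correlation_distance([1, 2, 3, 4], [2, 4, 6, 8])  # same direction
d_opposed = correlation_distance([1, 2, 3, 4], [8, 6, 4, 2])  # opposite direction
```

The result ranges from 0 for a perfect increasing linear relation to 2 for a perfect decreasing one, with 1 meaning no linear relation.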

class AutoCarver.selectors.measures.DistanceMeasure(threshold: float = 0.0)

Distance correlation between a Quantitative feature and target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_absolute = True: whether the measure needs absolute value for comparison or not

is_x_quantitative = True: whether x is quantitative or not

is_y_quantitative = True: whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Kruskal-Wallis’ \(H\) test statistic

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\), such that every \(i\) in \((1, n-1)\) satisfies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank feature \(x_r\) gives the position of each value of \(x\) in \(x_o\).

The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether \(x\) is distributed the same across the \(n_y\) modalities taken by \(y\). It is computed using the following formula:

\[H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }\]

where:

\(n\) is the number of observations

\(n_y\) is the number of modalities of \(y\)

\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality

\(x_r\) is the ranked version of \(x\)

\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality

\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}\) is the sample mean of \(x_r\)
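The \(H\) formula above can be followed step by step: rank \(x\), group the ranks by modality of \(y\), then compare the between-group spread of the mean ranks to the total spread of the ranks. The sketch below is a standard-library illustration with hypothetical helper names. It assumes tied values receive the mean of their positions, and the library itself calls scipy.stats.kruskal.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kruskal_h(x, y):
    """H = (n - 1) * between-group sum of squares / total sum of squares, on ranks."""
    n = len(x)
    ranks = average_ranks(x)
    overall_mean = sum(ranks) / n
    # collect the ranks of x observed for each modality of y
    groups = {}
    for rank, label in zip(ranks, y):
        groups.setdefault(label, []).append(rank)
    between = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values())
    total = sum((rank - overall_mean) ** 2 for rank in ranks)
    return (n - 1) * between / total


x = [1.2, 3.4, 2.2, 5.1, 6.3, 4.8]
y = ["a", "a", "a", "b", "b", "b"]
h = kruskal_h(x, y)
```

In this sample the three lowest values all belong to modality "a", so the mean ranks are 2 and 5 around an overall mean of 3.5, which gives \(H = 5 \times 13.5 / 17.5 = 27/7\).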

class AutoCarver.selectors.measures.KruskalMeasure(threshold: float = 0.0)

Kruskal-Wallis’ test statistic between a Quantitative feature and a Qualitative target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_reversible = True: whether the measure’s input can be reversed depending on their type or not

is_x_quantitative = True: whether x is quantitative or not

is_y_qualitative = True: whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Note

KruskalMeasure is the default measure for each QualitativeFeature when using RegressionSelector.
KruskalMeasure is the default measure for each QuantitativeFeature when using ClassificationSelector.

Coefficient of determination \(R\)

For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:

\[x = \alpha + \beta y + \epsilon\]

where:

\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model

The coefficient of determination, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). The measure used here is its square root \(R\), computed with the following formula:

\[R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }\]

where:

\(n\) is the number of observations

\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares

\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares

\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
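A minimal sketch of the computation above, using the closed-form least-squares estimates for a single regressor rather than statsmodels.formula.api.ols. The function name `r_coefficient` and the sample values are illustrative.

```python
import math


def r_coefficient(x, y):
    """R of the least-squares fit x = alpha + beta * y + epsilon."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # closed-form least-squares estimates for one regressor
    beta = sum((yi - y_bar) * (xi - x_bar) for xi, yi in zip(x, y)) / sum(
        (yi - y_bar) ** 2 for yi in y
    )
    alpha = x_bar - beta * y_bar
    ss_res = sum((xi - alpha - beta * yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((xi - x_bar) ** 2 for xi in x)
    return math.sqrt(1 - ss_res / ss_tot)


x = [1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1]  # binary target
r_value = r_coefficient(x, y)
```

With this sample the fitted values are the two group means (1.5 and 3.5), so \(SS_{res} = 1\), \(SS_{tot} = 5\) and \(R = \sqrt{0.8}\), which is also the absolute value of Pearson's \(r\) between \(x\) and \(y\).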

class AutoCarver.selectors.measures.RMeasure(threshold: float = 0.0)

Square root of the coefficient of determination of linear regression model of a Quantitative feature by a Binary target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_x_quantitative = True: whether x is quantitative or not

is_y_binary = True: whether y is binary or not

is_y_qualitative = True: whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Outlier Detection Measures

Standard Score

The standard score can be used as a measure of deviation to detect outliers in quantitative features. For a feature \(x\), it is computed for any observation \(x_i\) as follows:

\[z_i = \frac{x_i - \bar{x}}{S}\]

where:

\(n\) is the number of observations

\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)

\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
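The standard score above can be computed per observation as in this standard-library sketch. The function name `z_scores`, the sample values and the cut-off of 2.5 are assumptions made for illustration. How ZscoreOutlierMeasure aggregates the scores into a single value is not described here and is not reproduced.

```python
import math


def z_scores(x):
    """Standard score of every observation, using the sample standard deviation."""
    n = len(x)
    x_bar = sum(x) / n
    s = math.sqrt(sum((xj - x_bar) ** 2 for xj in x) / (n - 1))
    return [(xi - x_bar) / s for xi in x]


x = [10, 12, 11, 13, 12, 11, 10, 13, 12, 50]
z = z_scores(x)

CUTOFF = 2.5  # hypothetical cut-off, chosen for this example only
flagged = [xi for xi, zi in zip(x, z) if abs(zi) > CUTOFF]
```

Only the value 50 exceeds the cut-off here. Note that a single extreme value also inflates \(S\), which limits how large its own score can become in small samples.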

class AutoCarver.selectors.measures.ZscoreOutlierMeasure(threshold: float = 1.0)

Z-Score based outlier measure

Parameters: threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series = None) → float: Computes outlier measure on x

higher_is_better = False: whether higher values are better or not

is_x_qualitative = False: whether x is qualitative or not

is_x_quantitative = True: whether x is quantitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Interquartile range

Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:

\[IQR = Q_3 - Q_1\]

where:

\(Q_1\) is the 25th percentile of \(x\)

\(Q_3\) is the 75th percentile of \(x\)

Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not satisfy:

\[Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR\]
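The interval above can be checked with the standard library as sketched below. The function name `iqr_outliers` is illustrative, and the "inclusive" quantile method is an assumption: percentile conventions differ between tools, and the one used by IqrOutlierMeasure is not stated here.

```python
import statistics


def iqr_outliers(x):
    """Observations lying outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    # quartiles with linear interpolation between the closest observations
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [xi for xi in x if not lower <= xi <= upper]


outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
```

For this sample \(Q_1 = 3\) and \(Q_3 = 7\), so the accepted interval is \([-3, 13]\) and only 100 falls outside it.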

class AutoCarver.selectors.measures.IqrOutlierMeasure(threshold: float = 1.0)

Interquartile range based outlier measure

Parameters: threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series = None) → float: Computes outlier measure on x

higher_is_better = False: whether higher values are better or not

is_x_qualitative = False: whether x is qualitative or not

is_x_quantitative = True: whether x is quantitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Qualitative measures

Pearson’s \(\chi^2\) test statistic

For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed based on the pandas.crosstab of \(x\) by \(y\).

Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to measure the association. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)

\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)

\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
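The \(\chi^2\) formula above compares each observed cell count with the count expected under independence. A standard-library sketch, with an illustrative function name and sample data:

```python
from collections import Counter


def chi2_statistic(x, y):
    """Pearson's chi-squared statistic of the x by y contingency table."""
    n = len(x)
    cell_counts = Counter(zip(x, y))  # n_ij
    x_counts = Counter(x)             # row totals, one per modality of x
    y_counts = Counter(y)             # column totals, one per modality of y
    chi2 = 0.0
    for xi, n_i in x_counts.items():
        for yj, n_j in y_counts.items():
            expected = n_i * n_j / n
            observed = cell_counts[(xi, yj)]  # a Counter returns 0 for unseen pairs
            chi2 += (observed - expected) ** 2 / expected
    return chi2


x = ["a", "a", "a", "b", "b", "b"]
y = ["no", "no", "yes", "yes", "yes", "no"]
chi2 = chi2_statistic(x, y)
```

Every cell of this 2 by 2 table has an expected count of 1.5 and an observed count of 1 or 2, so \(\chi^2 = 4 \times 0.25 / 1.5 = 2/3\). Note that scipy.stats.chi2_contingency applies Yates' continuity correction by default on 2 by 2 tables, so its output can differ from this uncorrected value.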

class AutoCarver.selectors.measures.Chi2Measure(threshold: float = 0.0)

Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_x_qualitative = True: whether x is qualitative or not

is_y_qualitative = True: whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Cramér’s \(V\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
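As a small worked example of the formula above, Cramér's \(V\) rescales \(\chi^2\) to the \([0, 1]\) range. The function name `cramers_v` and the input values are illustrative.

```python
import math


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V from a chi-squared statistic and the table dimensions."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# chi2 = 2/3 is the uncorrected statistic of the 2 by 2 table [[2, 1], [1, 2]]
v_weak = cramers_v(chi2=2 / 3, n=6, n_x=2, n_y=2)
# chi2 = 6 is the uncorrected statistic of the 2 by 2 table [[3, 0], [0, 3]]
v_perfect = cramers_v(chi2=6.0, n=6, n_x=2, n_y=2)
```

The first table gives \(V = 1/3\). The second, where each modality of \(x\) maps to a single modality of \(y\), gives \(V = 1\).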

class AutoCarver.selectors.measures.CramervMeasure(threshold: float = 0.0)

Computes Cramér’s V between a Qualitative feature and a binary target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_x_qualitative = True: whether x is qualitative or not

is_y_qualitative = True: whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}\]

where:

\(n\) is the number of observations

\(n_x\) is the number of modalities of \(x\)

\(n_y\) is the number of modalities of \(y\)
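The sketch below evaluates the formula above and compares it with Cramér's \(V\) on the same inputs. Function names and input values are illustrative.

```python
import math


def tschuprows_t(chi2, n, n_x, n_y):
    """Tschuprow's T from a chi-squared statistic and the table dimensions."""
    return math.sqrt(chi2 / (n * math.sqrt((n_x - 1) * (n_y - 1))))


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V, shown for comparison."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# same chi2 and n, first for a 2 by 2 table and then for a 2 by 3 table
t_square = tschuprows_t(6.0, 6, 2, 2)
t_rect = tschuprows_t(6.0, 6, 2, 3)
v_rect = cramers_v(6.0, 6, 2, 3)
```

On a square table both coefficients coincide. On the 2 by 3 table \(T = 2^{-1/4} \approx 0.84\) while \(V = 1\): since \(\sqrt{(n_x-1)(n_y-1)} \geq \min(n_x-1, n_y-1)\), \(T\) is never larger than \(V\) and penalizes tables whose dimensions differ.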

class AutoCarver.selectors.measures.TschuprowtMeasure(threshold: float = 0.0)

Computes Tschuprow’s T between a Qualitative feature and a binary target.

Parameters: threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float: Computes association measure between x and y

higher_is_better = True: whether higher values are better or not

is_x_qualitative = True: whether x is qualitative or not

is_y_qualitative = True: whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns: Whether the test is passed or not
Return type: bool

Note

TschuprowtMeasure is the default measure for each QualitativeFeature when using ClassificationSelector.

Association filters, X by X

Quantitative filters

Pearson’s \(r\)

For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}\]

where:

\(n\) is the number of observations

\(x_1^i\) is the \(i\) th observation of \(x_1\)

\(x_2^i\) is the \(i\) th observation of \(x_2\)

\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)

\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)

class AutoCarver.selectors.filters.PearsonFilter(threshold: float = 1.0)

Computes maximum Pearson’s r between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]: Filters out ranked features that reach the association threshold

higher_is_better = False: whether higher values are better or not

is_absolute = True: whether the measure needs absolute value for comparison or not

is_x_quantitative = True: whether x is quantitative or not
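The filtering step, dropping a feature that is too associated with a better-ranked one, can be sketched as a greedy pass over the ranked features. This is an illustration only, not the library's filter(): the function `filter_correlated`, the dictionary of columns, and the reading of "reach the threshold" as "greater than or equal to" are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson's linear correlation coefficient."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) * sum((yi - y_bar) ** 2 for yi in y))
    return num / den


def filter_correlated(columns, ranks, threshold):
    """Hypothetical greedy filter over feature names sorted from best to worst."""
    kept = []
    for name in ranks:
        # strongest absolute correlation with the better-ranked features kept so far
        strongest = max(
            (abs(pearson_r(columns[name], columns[other])) for other in kept),
            default=0.0,
        )
        if strongest < threshold:
            kept.append(name)
    return kept


columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],  # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 3],
}
kept = filter_correlated(columns, ranks=["a", "b", "c"], threshold=0.9)
```

Feature "b" is dropped because its correlation with the better-ranked "a" is about 0.999, while "c" is kept with an absolute correlation of 0.3.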

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\), such that every \(i\) in \((1, n-1)\) satisfies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank feature \(x_r\) gives the position of each value of \(x\) in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{1_{r}}x_{2_{r}}}\]

where:

\(x_{1_{r}}\) is the ranked version of \(x_1\)

\(x_{2_{r}}\) is the ranked version of \(x_2\)

\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)

class AutoCarver.selectors.filters.SpearmanFilter(threshold: float = 1.0)

Computes maximum Spearman’s rho between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]: Filters out ranked features that reach the association threshold

higher_is_better = False: whether higher values are better or not

is_absolute = True: whether the measure needs absolute value for comparison or not

is_x_quantitative = True: whether x is quantitative or not

Note

SpearmanFilter is the default filter for measuring association between QuantitativeFeatures.

Qualitative filters

Cramér’s \(V\)

For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on the pandas.crosstab of \(x_1\) by \(x_2\).

Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to measure the association. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)

\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)

\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.CramervFilter(threshold: float = 1.0)

Computes maximum Cramér’s V between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]: Filters out ranked features that reach the association threshold

higher_is_better = False: whether higher values are better or not

is_x_qualitative = True: whether x is qualitative or not

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}\]

where:

\(n\) is the number of observations

\(n_{x_1}\) is the number of modalities of \(x_1\)

\(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.TschuprowtFilter(threshold: float = 1.0)

Computes maximum Tschuprow’s T between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]: Filters out ranked features that reach the association threshold

higher_is_better = False: whether higher values are better or not

is_x_qualitative = True: whether x is qualitative or not

Note

TschuprowtFilter is the default filter for measuring association between QualitativeFeatures.