Selectors

AutoCarver implements Selectors, which provide the following association-centric data selection steps:

  1. Measuring association with a target and ranking features accordingly.

  2. Filtering out features too associated with a better-ranked feature.

It allows one to select features:

In general, associations are computed according to the provided data types of \(x\) and \(y\):

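\(x\) \ \(y\)   | Qualitative                                             | Quantitative
--------------- | ------------------------------------------------------- | ----------------------------------------
Qualitative     | Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\) | Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative    | Kruskal-Wallis’ \(H\), \(R\) coefficient                | Pearson’s \(r\), Spearman’s \(\rho\)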

See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.

Classification tasks

class AutoCarver.selectors.ClassificationSelector(features: Features, n_best_per_type: int, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.

Get your best features with ClassificationSelector.select()!

Parameters:
  • features (Features) – A set of Features to select from

  • n_best_per_type (int) – Number of quantitative and/or qualitative Features to select

Keyword Arguments:
  • measures (list[BaseMeasure], optional) –

    List of association measures to be used, by default None.

    Selects n_best_per_type features for each measure provided. Implemented measures are:

  • filters (list[BaseFilter], optional) –

    List of filters to be used, by default None.

    Filters out features that do not pass the threshold of each filter. Implemented filters are:

  • max_num_features_per_chunk (int, optional) –

    Maximum number of features per chunk, by default 100.

    Chunking is used to speed up the selection process for large numbers of Features.

    1. Features are split in n_chunks of max_num_features_per_chunk

    2. n_best_per_type//n_chunks of each chunk are selected

    3. best features are selected from the remaining features

  • verbose (bool, optional) –

    Whether to print selection statistics, by default False.

    • True, without IPython: prints raw statistics

    • True, with IPython: prints HTML statistics

select(X: DataFrame, y: Series) → Features

Selects the n_best_per_type Features of X

Parameters:
  • X (DataFrame) – Dataset to determine optimal features.

  • y (Series) – Target with which the association is evaluated.

Returns:

Selected Features

Return type:

Features
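The chunking described for max_num_features_per_chunk can be pictured with the minimal sketch below. It is not AutoCarver's implementation: the helper chunked_top_k, the precomputed scores dictionary, and the rule of keeping at least one feature per chunk are assumptions made for illustration only.

```python
from math import ceil


def chunked_top_k(scores, n_best, max_per_chunk):
    """Hypothetical helper: pick n_best names from a {name: score} mapping, chunk by chunk."""
    names = list(scores)
    if not names:
        return []
    # 1. split the features into chunks of at most max_per_chunk names
    n_chunks = ceil(len(names) / max_per_chunk)
    # 2. keep n_best // n_chunks names per chunk (assumption: at least one)
    per_chunk = max(1, n_best // n_chunks)
    survivors = []
    for start in range(0, len(names), max_per_chunk):
        chunk = sorted(names[start:start + max_per_chunk], key=scores.get, reverse=True)
        survivors.extend(chunk[:per_chunk])
    # 3. select the best names among the survivors of all chunks
    survivors.sort(key=scores.get, reverse=True)
    return survivors[:n_best]


scores = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3}
best = chunked_top_k(scores, n_best=2, max_per_chunk=2)
print(best)
```

In the selectors the ranking comes from the configured measures and filters rather than from a fixed score dictionary, so this only mirrors the order of the three steps.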

Regression tasks

class AutoCarver.selectors.RegressionSelector(features: Features, n_best_per_type: int, **kwargs)

A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.

Get your best features with RegressionSelector.select()!

Parameters:
  • features (Features) – A set of Features to select from

  • n_best_per_type (int) – Number of quantitative and/or qualitative Features to select

Keyword Arguments:
  • measures (list[BaseMeasure], optional) –

    List of association measures to be used, by default None.

    Selects n_best_per_type features for each measure provided. Implemented measures are:

  • filters (list[BaseFilter], optional) –

    List of filters to be used, by default None.

    Filters out features that do not pass the threshold of each filter. Implemented filters are:

  • max_num_features_per_chunk (int, optional) –

    Maximum number of features per chunk, by default 100.

    Chunking is used to speed up the selection process for large numbers of Features.

    1. Features are split in n_chunks of max_num_features_per_chunk

    2. n_best_per_type//n_chunks of each chunk are selected

    3. best features are selected from the remaining features

  • verbose (bool, optional) –

    Whether to print selection statistics, by default False.

    • True, without IPython: prints raw statistics

    • True, with IPython: prints HTML statistics

select(X: DataFrame, y: Series) → Features

Selects the n_best_per_type Features of X

Parameters:
  • X (DataFrame) – Dataset to determine optimal features.

  • y (Series) – Target with which the association is evaluated.

Returns:

Selected Features

Return type:

Features

Association measures, X by y

Quantitative measures

Pearson’s \(r\)

For a quantitative feature \(x\), the association with a quantitative target \(y\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{xy}= \frac{\sum_{i=1}^{n}{(x^i-\bar{x})(y^i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x^i-\bar{x})^2}} \sqrt{\sum_{i=1}^{n}{(y^i-\bar{y})^2}}}\]

where:

  • \(n\) is the number of observations

  • \(x^i\) is the \(i\) th observation of \(x\)

  • \(y^i\) is the \(i\) th observation of \(y\)

  • \(\bar{x}=\frac{1}{n}\sum_{i=1}^n{x^i}\) is the sample mean of \(x\)

  • \(\bar{y}=\frac{1}{n}\sum_{i=1}^n{y^i}\) is the sample mean of \(y\)
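As a rough check of the formula above, the sketch below evaluates it in plain Python. The library itself relies on pandas.DataFrame.corr; the helper name pearson_r is an assumption used only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r for two equal-length sequences of numbers (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # numerator: sum of the products of centered observations
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: product of the square roots of the sums of squares
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (sqrt(ss_x) * sqrt(ss_y))


r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # perfectly increasing linear relation
r_down = pearson_r([1, 2, 3], [3, 2, 1])      # perfectly decreasing linear relation
print(r_up, r_down)
```

Since is_absolute is True for this measure, both relations would count as equally strong associations.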

class AutoCarver.selectors.measures.PearsonMeasure(threshold: float = 0.0)

Pearson’s linear correlation coefficient between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_absolute = True

Whether the measure needs absolute value for comparison or not

is_x_quantitative = True

Whether x is quantitative or not

is_y_quantitative = True

Whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. Thus, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{r}y_{r}}\]

where:

  • \(x_{r}\) is the ranked version of \(x\)

  • \(y_{r}\) is the ranked version of \(y\)

  • \(r_{x_{r}y_{r}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{r}\) and \(y_{r}\)
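The definition above can be sketched in plain Python by ranking both samples and applying Pearson's formula to the ranks. The helper names are hypothetical, and giving tied values the mean of their positions is an assumption of this sketch.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # extend the run while the next sorted value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (sqrt(ss_x) * sqrt(ss_y))


def spearman_rho(x, y):
    """Pearson's r computed on the ranked versions of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


tied = average_ranks([10, 20, 20, 30])
rho = spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25])  # monotonic but not linear
print(tied, rho)
```

The second call shows why a rank correlation can be preferable: the relation is not linear, yet it is perfectly monotonic.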

class AutoCarver.selectors.measures.SpearmanMeasure(threshold: float = 0.0)

Spearman’s rank correlation coefficient between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_absolute = True

Whether the measure needs absolute value for comparison or not

is_x_quantitative = True

Whether x is quantitative or not

is_y_quantitative = True

Whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • SpearmanMeasure is the default measure for each QuantitativeFeature when using RegressionSelector.

Distance Correlation

For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:

\[1 - \frac{ (x - \bar{x}) \cdot (y - \bar{y}) } { ||x - \bar{x}||_2 \, ||y - \bar{y}||_2 }\]

where:

  • \(n_x\) is the number of observations of \(x\)

  • \(n_y\) is the number of observations of \(y\)

  • \(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)

  • \(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)

  • \(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the Euclidean norm of the centered \(x\)

  • \(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the Euclidean norm of the centered \(y\)

The Distance Correlation is computed using scipy.spatial.distance.correlation.
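Read term by term, the formula is one minus the normalized dot product of the centered samples, which is one minus Pearson's \(r\). The plain-Python sketch below illustrates this; the helper name correlation_distance is an assumption, and the library itself calls scipy.spatial.distance.correlation.

```python
from math import sqrt


def correlation_distance(x, y):
    """1 - (centered x . centered y) / (norm of centered x * norm of centered y)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [xi - mean_x for xi in x]
    centered_y = [yi - mean_y for yi in y]
    dot = sum(a * b for a, b in zip(centered_x, centered_y))
    norm_x = sqrt(sum(a * a for a in centered_x))
    norm_y = sqrt(sum(b * b for b in centered_y))
    return 1 - dot / (norm_x * norm_y)


same = correlation_distance([1, 2, 3], [1, 2, 3])      # identical samples
opposite = correlation_distance([1, 2, 3], [3, 2, 1])  # reversed samples
print(same, opposite)
```

The value ranges from 0 for perfectly positively correlated samples to 2 for perfectly negatively correlated ones.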

class AutoCarver.selectors.measures.DistanceMeasure(threshold: float = 0.0)

Distance correlation between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_absolute = True

Whether the measure needs absolute value for comparison or not

is_x_quantitative = True

Whether x is quantitative or not

is_y_quantitative = True

Whether y is quantitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Kruskal-Wallis’ \(H\) test statistic

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(x\) is distributed the same across the modalities of \(y\), from \(y=y_0\) to \(y=y_{n_y-1}\), where \(n_y\) is the number of modalities taken by \(y\). It is computed using the following formula:

\[H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }\]

where:

  • \(n\) is the number of observations

  • \(n_y\) is the number of modalities of \(y\)

  • \(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality

  • \(x_r\) is the ranked version of \(x\)

  • \(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality

  • \(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality

  • \(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}\) is the sample mean of \(x_r\)
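The formula can be followed step by step with the plain-Python sketch below. The helper names are hypothetical and ties receive the mean of their positions, which is an assumption of this sketch; the library itself calls scipy.stats.kruskal.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def kruskal_h(x, y):
    """H = (n - 1) * between-modality rank dispersion / total rank dispersion."""
    ranks = average_ranks(x)
    n = len(ranks)
    mean_rank = sum(ranks) / n
    # group the ranks of x by modality of y
    groups = {}
    for rank, modality in zip(ranks, y):
        groups.setdefault(modality, []).append(rank)
    between = sum(len(g) * (sum(g) / len(g) - mean_rank) ** 2 for g in groups.values())
    total = sum((rank - mean_rank) ** 2 for rank in ranks)
    return (n - 1) * between / total


h = kruskal_h([1, 2, 3, 4, 5, 6], ["a", "a", "a", "b", "b", "b"])
print(h)
```

Here the two modalities occupy disjoint rank ranges, which gives \(H = 27/7 \approx 3.857\).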

class AutoCarver.selectors.measures.KruskalMeasure(threshold: float = 0.0)

Kruskal-Wallis’ test statistic between a Quantitative feature and a Qualitative target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_reversible = True

Whether the measure’s inputs can be reversed depending on their type or not

is_x_quantitative = True

Whether x is quantitative or not

is_y_qualitative = True

Whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • KruskalMeasure is the default measure for each QualitativeFeature when using RegressionSelector.

  • KruskalMeasure is the default measure for each QuantitativeFeature when using ClassificationSelector.

Coefficient of determination \(R\)

For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:

\[x = \alpha + \beta y + \epsilon\]

where:

  • \(\alpha\) and \(\beta\) are the coefficients of the linear regression model

  • \(\epsilon\) is the residual of the linear regression model

The coefficient of determination, often denoted \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). Its square root \(R\) is computed with the following formula:

\[R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }\]

where:

  • \(n\) is the number of observations

  • \(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares

  • \(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares

  • \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
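A minimal plain-Python sketch of this computation follows, assuming a binary \(y\) encoded as 0 and 1 and the closed-form least-squares estimates of \(\alpha\) and \(\beta\). The helper name r_coefficient is hypothetical; the library itself fits the model with statsmodels.formula.api.ols.

```python
from math import sqrt


def r_coefficient(x, y):
    """Square root of R^2 for the regression x = alpha + beta * y + epsilon."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # closed-form ordinary least squares estimates for a single regressor
    beta = sum((yi - mean_y) * (xi - mean_x) for xi, yi in zip(x, y)) / sum(
        (yi - mean_y) ** 2 for yi in y
    )
    alpha = mean_x - beta * mean_y
    ss_res = sum((xi - alpha - beta * yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((xi - mean_x) ** 2 for xi in x)
    # guard against tiny negative values caused by floating-point rounding
    return sqrt(max(0.0, 1 - ss_res / ss_tot))


r_value = r_coefficient([1, 2, 3, 4], [0, 0, 1, 1])
print(r_value)
```

In this example \(SS_{res} = 1\) and \(SS_{tot} = 5\), so \(R = \sqrt{0.8} \approx 0.894\).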

class AutoCarver.selectors.measures.RMeasure(threshold: float = 0.0)

Square root of the coefficient of determination of linear regression model of a Quantitative feature by a Binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_x_quantitative = True

Whether x is quantitative or not

is_y_binary = True

Whether y is binary or not

is_y_qualitative = True

Whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Outlier Detection Measures

Standard Score

Standard score can be applied as a measure of deviation to detect outliers in quantitative features. For a feature \(x\), it is computed for any observation \(x_i\) as follows:

\[z_i = \frac{x_i - \bar{x}}{S}\]

where:

  • \(n\) is the number of observations

  • \(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)

  • \(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
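The per-observation score can be sketched with the standard library as shown below. The helper name z_scores is hypothetical, and how ZscoreOutlierMeasure aggregates these scores into a single value is not covered here.

```python
from statistics import mean, stdev


def z_scores(x):
    """Standard score of each observation, using the sample standard deviation."""
    center = mean(x)
    spread = stdev(x)  # n - 1 denominator, as in the definition of S
    return [(xi - center) / spread for xi in x]


scores = z_scores([1, 2, 3, 4, 5])
print(scores)
```

Observations whose absolute score exceeds a chosen cut-off are the usual outlier candidates.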

class AutoCarver.selectors.measures.ZscoreOutlierMeasure(threshold: float = 1.0)

Z-Score based outlier measure

Parameters:

threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series = None) → float

Computes outlier measure on x

higher_is_better = False

Whether higher values are better or not

is_x_qualitative = False

Whether x is qualitative or not

is_x_quantitative = True

Whether x is quantitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Interquartile range

Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:

\[IQR = Q_3 - Q_1\]

where:

  • \(Q_1\) is the 25th percentile of \(x\)

  • \(Q_3\) is the 75th percentile of \(x\)

Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:

\[Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR\]

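The rule can be sketched with the standard library as follows. The helper name iqr_outliers is hypothetical, and the "inclusive" quantile method is an assumption: the percentile convention used by IqrOutlierMeasure is not stated here.

```python
from statistics import quantiles


def iqr_outliers(x, factor=1.5):
    """Observations falling outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, _, q3 = quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [xi for xi in x if not low <= xi <= high]


outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(outliers)
```

With this sample, \(Q_1 = 3\) and \(Q_3 = 7\), so the accepted interval is \([-3, 13]\) and only 100 falls outside it.
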
class AutoCarver.selectors.measures.IqrOutlierMeasure(threshold: float = 1.0)

Interquartile range based outlier measure

Parameters:

threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series = None) → float

Computes outlier measure on x

higher_is_better = False

Whether higher values are better or not

is_x_qualitative = False

Whether x is qualitative or not

is_x_quantitative = True

Whether x is quantitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Qualitative measures

Pearson’s \(\chi^2\) test statistic

For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed from their pandas.crosstab contingency table.

Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)

  • \(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)

  • \(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)

  • \(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
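The double sum can be evaluated directly from the observed and expected counts, as in the plain-Python sketch below. The helper name chi2_statistic is hypothetical and no continuity correction is applied.

```python
from collections import Counter


def chi2_statistic(x, y):
    """Pearson's chi-squared statistic of the x by y contingency table."""
    n = len(x)
    joint = Counter(zip(x, y))  # n_ij: observed counts per pair of modalities
    row_totals = Counter(x)     # n_i.: counts per modality of x
    col_totals = Counter(y)     # n_.j: counts per modality of y
    chi2 = 0.0
    for xi in row_totals:
        for yj in col_totals:
            expected = row_totals[xi] * col_totals[yj] / n
            chi2 += (joint[(xi, yj)] - expected) ** 2 / expected
    return chi2


x = ["a", "a", "b", "b", "c", "c"]
y = [0, 0, 1, 1, 0, 1]
chi2 = chi2_statistic(x, y)
print(chi2)
```

Every expected count equals 1 in this example, which gives \(\chi^2 = 4\). Note that scipy.stats.chi2_contingency applies Yates' continuity correction by default on 2 by 2 tables, so its statistic can differ from this uncorrected sum in that case.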

class AutoCarver.selectors.measures.Chi2Measure(threshold: float = 0.0)

Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_x_qualitative = True

Whether x is qualitative or not

is_y_qualitative = True

Whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Cramér’s \(V\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)
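As a small worked example, the sketch below applies the formula to given values of \(\chi^2\), \(n\), \(n_x\) and \(n_y\). The helper name cramers_v and the input values are illustrative assumptions.

```python
from math import sqrt


def cramers_v(chi2, n, n_x, n_y):
    """Cramér's V from a chi-squared statistic and the table dimensions."""
    return sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# e.g. a 3 by 2 contingency table built from 6 observations, with chi2 = 4
v = cramers_v(chi2=4.0, n=6, n_x=3, n_y=2)
print(v)
```

The normalization by \(n\min(n_x-1, n_y-1)\) keeps \(V\) between 0 and 1, which makes features with different numbers of modalities comparable.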

class AutoCarver.selectors.measures.CramervMeasure(threshold: float = 0.0)

Computes Cramér’s V between a Qualitative feature and a binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_x_qualitative = True

Whether x is qualitative or not

is_y_qualitative = True

Whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)
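The same kind of worked example applies here. The helper name tschuprows_t and the input values are illustrative assumptions.

```python
from math import sqrt


def tschuprows_t(chi2, n, n_x, n_y):
    """Tschuprow's T from a chi-squared statistic and the table dimensions."""
    return sqrt(chi2 / (n * sqrt((n_x - 1) * (n_y - 1))))


# e.g. a 3 by 2 contingency table built from 6 observations, with chi2 = 4
t = tschuprows_t(chi2=4.0, n=6, n_x=3, n_y=2)
# on a square table, sqrt((n_x - 1) * (n_y - 1)) equals min(n_x - 1, n_y - 1)
t_square = tschuprows_t(chi2=4.0, n=4, n_x=2, n_y=2)
print(t, t_square)
```

Because \(\sqrt{(n_x-1)(n_y-1)} \geq \min(n_x-1, n_y-1)\), \(T\) never exceeds Cramér’s \(V\) and coincides with it when \(n_x = n_y\).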

class AutoCarver.selectors.measures.TschuprowtMeasure(threshold: float = 0.0)

Computes Tschuprow’s T between a Qualitative feature and a binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) → float

Computes association measure between x and y

higher_is_better = True

Whether higher values are better or not

is_x_qualitative = True

Whether x is qualitative or not

is_y_qualitative = True

Whether y is qualitative or not

validate() → bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • TschuprowtMeasure is the default measure for each QualitativeFeature when using ClassificationSelector.

Association filters, X by X

Quantitative filters

Pearson’s \(r\)

For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}\]

where:

  • \(n\) is the number of observations

  • \(x_1^i\) is the \(i\) th observation of \(x_1\)

  • \(x_2^i\) is the \(i\) th observation of \(x_2\)

  • \(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)

  • \(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)

class AutoCarver.selectors.filters.PearsonFilter(threshold: float = 1.0)

Computes maximum Pearson’s r between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

Whether higher values are better or not

is_absolute = True

Whether the measure needs absolute value for comparison or not

is_x_quantitative = True

Whether x is quantitative or not
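The filtering idea, dropping a feature that is too associated with a better-ranked one, can be pictured with the greedy sketch below. It is not the library's code: greedy_filter, the dictionary of columns, and treating "reaching the threshold" as an absolute correlation greater than or equal to it are assumptions made for illustration.

```python
from math import sqrt


def pearson_r(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (sqrt(ss_a) * sqrt(ss_b))


def greedy_filter(data, ranks, threshold):
    """Keep a feature only if it stays below the threshold with every kept feature."""
    kept = []
    for name in ranks:  # ranks is ordered from best to worst
        too_associated = any(
            abs(pearson_r(data[name], data[better])) >= threshold for better in kept
        )
        if not too_associated:
            kept.append(name)
    return kept


data = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],  # nearly a multiple of "a"
    "c": [4, 1, 3, 2],    # weakly related to "a"
}
kept = greedy_filter(data, ranks=["a", "b", "c"], threshold=0.9)
print(kept)
```

Feature "b" is dropped because its correlation with the better-ranked "a" is close to 1, while "c" is kept since its absolute correlation with "a" is 0.4.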

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. Thus, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{1_{r}}x_{2_{r}}}\]

where:

  • \(x_{1_{r}}\) is the ranked version of \(x_1\)

  • \(x_{2_{r}}\) is the ranked version of \(x_2\)

  • \(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)

class AutoCarver.selectors.filters.SpearmanFilter(threshold: float = 1.0)

Computes maximum Spearman’s rho between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

Whether higher values are better or not

is_absolute = True

Whether the measure needs absolute value for comparison or not

is_x_quantitative = True

Whether x is quantitative or not

Note

  • SpearmanFilter is the default filter as inter-QuantitativeFeature association measure.

Qualitative filters

Cramér’s \(V\)

For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed from their pandas.crosstab contingency table.

Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

  • \(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)

  • \(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)

  • \(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.CramervFilter(threshold: float = 1.0)

Computes maximum Cramér’s V between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

Whether higher values are better or not

is_x_qualitative = True

Whether x is qualitative or not

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.TschuprowtFilter(threshold: float = 1.0)

Computes maximum Tschuprow’s T between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) → list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

Whether higher values are better or not

is_x_qualitative = True

Whether x is qualitative or not

Note

  • TschuprowtFilter is the default filter as inter-QualitativeFeature association measure.