Selectors

AutoCarver implements Selectors, which provide the following association-centric data selection steps:

  1. Measuring association with a target and ranking features accordingly.

  2. Filtering out features too associated with a better-ranked feature.

It allows one to select features:

In general, associations are computed according to the provided data types of \(x\) and \(y\):

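\(x\) \ \(y\)  | Qualitative                                             | Quantitative
---------------|---------------------------------------------------------|------------------------------------------
Qualitative    | Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\) | Kruskal-Wallis’ \(H\), \(R\) coefficient
Quantitative   | Kruskal-Wallis’ \(H\), \(R\) coefficient                | Pearson’s \(r\), Spearman’s \(\rho\)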

See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.

Selectors are scikit-learn transformers built like the carvers: from a Features set, a per-type budget n_best_per_type, a swappable set of measures / filters, and a ProcessingConfig carrying behavioral toggles (verbose …).

  • fit() scores every feature against the target, ranks them per measure, and filters out redundant ones.

  • transform() restricts X to the selected columns; selected_features returns the selected Features directly.

Selection is exhaustive — every feature is scored exactly — but fast: each measure scores all features of a type in a single vectorized pass rather than one call per feature.

from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig
from AutoCarver.selectors import ClassificationSelector

selector = ClassificationSelector(
    features=features,
    n_best_per_type=25,                       # best features kept per data type
    config=ProcessingConfig(verbose=True),   # behavioral toggles, as for carvers
)
selector.fit(X, y)  # or selector.fit_transform(X, y) to keep only selected features in X
best_features = selector.selected_features
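
The two steps listed at the top of this page (rank by association with the target, then drop features too associated with a better-ranked one) can be pictured with a small standalone sketch. It does not use AutoCarver: `select_features`, its parameters and the toy scores are hypothetical, and truncating to the budget before filtering is an assumption made for illustration only.

```python
from typing import Callable, Sequence


def select_features(
    names: Sequence[str],
    target_score: Callable[[str], float],
    pair_score: Callable[[str, str], float],
    n_best: int,
    max_pair_association: float,
) -> list[str]:
    """Rank features by target association, then drop redundant ones."""
    # Step 1: rank by decreasing association with the target, keep the budget
    # (assumption: the budget is applied before the redundancy filter)
    ranked = sorted(names, key=target_score, reverse=True)[:n_best]
    # Step 2: keep a feature only if it is not too associated with
    # any better-ranked feature that was already kept
    kept: list[str] = []
    for name in ranked:
        if all(pair_score(name, better) < max_pair_association for better in kept):
            kept.append(name)
    return kept


# Made-up association scores, for illustration only
target = {"age": 0.6, "age_months": 0.59, "income": 0.4, "noise": 0.01}
pairs = {frozenset(("age", "age_months")): 0.99}

selected = select_features(
    list(target),
    target_score=target.__getitem__,
    pair_score=lambda a, b: pairs.get(frozenset((a, b)), 0.0),
    n_best=3,
    max_pair_association=0.9,
)
print(selected)
```

Here `age_months` is dropped because it is almost a copy of the better-ranked `age`, while `income` survives.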

Classification tasks

class AutoCarver.selectors.ClassificationSelector(features: Features | list[BaseFeature], n_best_per_type: int, *, measures: list[BaseMeasure] | None = None, filters: list[BaseFilter] | None = None, config: ProcessingConfig | None = None)

A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.

Parameters:
  • features (Features) – A set of Features to select from.

  • n_best_per_type (int) – Number of quantitative and/or qualitative Features to select.

  • measures (list[BaseMeasure], optional) – Association measures (the swappable decision boundary). Defaults to a task-appropriate set provided by the subclass. NanMeasure and ModeMeasure are always added if missing.

  • filters (list[BaseFilter], optional) – Redundancy filters. Defaults to the task-appropriate set; the validity filters are always added if missing.

  • config (ProcessingConfig, optional) – Behavioral toggles shared with the discretizers (verbose is the one consumed here). ordinal_encoding / dropna are ignored by the selector.

fit(X: DataFrame, y: Series) Self

Scores, ranks and filters features; stores the selected ones.

Parameters:
  • X (pd.DataFrame) – Dataset to select from.

  • y (pd.Series) – Target the association is evaluated against.

property selected_features: Features

The selected Features (available after fit()).

transform(X: DataFrame, y: Series | None = None) DataFrame

Restricts X to the selected features’ columns.

Regression tasks

class AutoCarver.selectors.RegressionSelector(features: Features | list[BaseFeature], n_best_per_type: int, *, measures: list[BaseMeasure] | None = None, filters: list[BaseFilter] | None = None, config: ProcessingConfig | None = None)

A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.

Parameters:
  • features (Features) – A set of Features to select from.

  • n_best_per_type (int) – Number of quantitative and/or qualitative Features to select.

  • measures (list[BaseMeasure], optional) – Association measures (the swappable decision boundary). Defaults to a task-appropriate set provided by the subclass. NanMeasure and ModeMeasure are always added if missing.

  • filters (list[BaseFilter], optional) – Redundancy filters. Defaults to the task-appropriate set; the validity filters are always added if missing.

  • config (ProcessingConfig, optional) – Behavioral toggles shared with the discretizers (verbose is the one consumed here). ordinal_encoding / dropna are ignored by the selector.

fit(X: DataFrame, y: Series) Self

Scores, ranks and filters features; stores the selected ones.

Parameters:
  • X (pd.DataFrame) – Dataset to select from.

  • y (pd.Series) – Target the association is evaluated against.

property selected_features: Features

The selected Features (available after fit()).

transform(X: DataFrame, y: Series | None = None) DataFrame

Restricts X to the selected features’ columns.

Association measures, X by y

Quantitative measures

Pearson’s \(r\)

For a quantitative feature \(x\), the association with a quantitative target \(y\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{xy}= \frac{\sum_{i=1}^{n}{(x^i-\bar{x})(y^i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x^i-\bar{x})^2}} \sqrt{\sum_{i=1}^{n}{(y^i-\bar{y})^2}}}\]

where:

  • \(n\) is the number of observations

  • \(x^i\) is the \(i\) th observation of \(x\)

  • \(y^i\) is the \(i\) th observation of \(y\)

  • \(\bar{x}=\frac{1}{n}\sum_{i=1}^n{x^i}\) is the sample mean of \(x\)

  • \(\bar{y}=\frac{1}{n}\sum_{i=1}^n{y^i}\) is the sample mean of \(y\)
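
To make the formula concrete, here is a pure-Python transcription of it. This is only a sketch: `pearson_r` is a hypothetical helper name, and the library itself relies on pandas.DataFrame.corr as stated above.

```python
import math


def pearson_r(x, y):
    """Pearson's r, written term by term as in the formula."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # numerator: sum of products of deviations from the means
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # denominator: product of the square roots of the sums of squared deviations
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))


r_linear = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])   # increasing linear relation
r_inverse = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # decreasing linear relation
print(r_linear, r_inverse)
```

Both relations are equally strong but of opposite sign, which is why a measure flagged with is_absolute compares absolute values.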

class AutoCarver.selectors.measures.PearsonMeasure(threshold: float = 0.0)

Pearson’s linear correlation coefficient between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_absolute = True

whether the measure needs absolute value for comparison or not

is_x_quantitative = True

whether x is quantitative or not

is_y_quantitative = True

whether y is quantitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{r}y_{r}}\]

where:

  • \(x_{r}\) is the ranked version of \(x\)

  • \(y_{r}\) is the ranked version of \(y\)

  • \(r_{x_{r}y_{r}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{r}\) and \(y_{r}\)
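
The definition above can be sketched in pure Python by ranking both series and reusing Pearson’s formula. The helper names are hypothetical, and giving tied values the mean of their positions is an assumption (the text does not specify tie handling).

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranked series."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]  # y = x ** 3: monotonic but not linear
rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

Because ranks only keep the ordering, a monotonic but non-linear relation reaches the maximum \(\rho\) while \(r\) stays below 1.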

class AutoCarver.selectors.measures.SpearmanMeasure(threshold: float = 0.0)

Spearman’s rank correlation coefficient between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_absolute = True

whether the measure needs absolute value for comparison or not

is_x_quantitative = True

whether x is quantitative or not

is_y_quantitative = True

whether y is quantitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • SpearmanMeasure is the default measure for each QuantitativeFeature when using RegressionSelector.

Distance Correlation

For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:

\[1 - \frac{ (x - \bar{x}) \cdot (y - \bar{y}) } { ||x - \bar{x}||_2 \, ||y - \bar{y}||_2 }\]

where:

  • \(n_x\) is the number of observations of \(x\)

  • \(n_y\) is the number of observations of \(y\) (equal to \(n_x\), since observations are paired)

  • \(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)

  • \(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)

  • \(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the euclidean norm of the centered \(x\)

  • \(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the euclidean norm of the centered \(y\)

The Distance Correlation is computed using scipy.spatial.distance.correlation.
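
As a rough illustration of the formula above, the following pure-Python sketch centers both series, takes their dot product and divides by the product of their norms. `correlation_distance` is a hypothetical name; the quantity it returns is one minus Pearson’s \(r\).

```python
import math


def correlation_distance(x, y):
    """1 - (centered x . centered y) / (norm of centered x * norm of centered y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cx = [xi - mean_x for xi in x]
    cy = [yi - mean_y for yi in y]
    dot = sum(a * b for a, b in zip(cx, cy))
    norm_x = math.sqrt(sum(a * a for a in cx))
    norm_y = math.sqrt(sum(b * b for b in cy))
    return 1 - dot / (norm_x * norm_y)


d_same = correlation_distance([1, 2, 3, 4], [10, 20, 30, 40])
d_opposite = correlation_distance([1, 2, 3, 4], [40, 30, 20, 10])
print(d_same, d_opposite)
```

The value ranges from 0 (perfect positive linear relation) to 2 (perfect negative linear relation).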

class AutoCarver.selectors.measures.DistanceMeasure(threshold: float = 0.0)

Distance correlation between a Quantitative feature and target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_absolute = True

whether the measure needs absolute value for comparison or not

is_x_quantitative = True

whether x is quantitative or not

is_y_quantitative = True

whether y is quantitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Kruskal-Wallis’ \(H\) test statistic

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.

Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether \(x\) is distributed the same for \(y=y_0\) to \(y=y_{n_y-1}\), where \(n_y\) is the number of modalities taken by \(y\). It is computed using the following formula:

\[H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }\]

where:

  • \(n\) is the number of observations

  • \(n_y\) is the number of modalities of \(y\)

  • \(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality

  • \(x_r\) is the ranked version of \(x\)

  • \(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality

  • \(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality

  • \(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}\) is the sample mean of \(x_r\)
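
The formula for \(H\) can be transcribed directly: rank \(x\), group the ranks by modality of \(y\), then compare the between-group spread of mean ranks with the total spread. This is a sketch with hypothetical helper names, not the scipy.stats.kruskal call used by the library; averaging the ranks of tied values is an assumption.

```python
from collections import defaultdict


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions (assumption)."""
    return [
        sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
        for x in values
    ]


def kruskal_h(x, y):
    """H = (n - 1) * between-group sum of squares / total sum of squares, on ranks."""
    ranks = average_ranks(x)
    n = len(ranks)
    mean_rank = sum(ranks) / n
    # collect the ranks of x observed for each modality of y
    groups = defaultdict(list)
    for rank, label in zip(ranks, y):
        groups[label].append(rank)
    between = sum(
        len(g) * (sum(g) / len(g) - mean_rank) ** 2 for g in groups.values()
    )
    total = sum((rank - mean_rank) ** 2 for rank in ranks)
    return (n - 1) * between / total


x = [1, 2, 3, 4, 5, 6]
y = ["a", "a", "a", "b", "b", "b"]
h = kruskal_h(x, y)
print(h)
```

In this toy sample the two groups are fully separated, so the between-group term accounts for most of the total rank variance.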

class AutoCarver.selectors.measures.KruskalMeasure(threshold: float = 0.0)

Kruskal-Wallis’ test statistic between a Quantitative feature and a Qualitative target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_reversible = True

whether the measure’s input can be reversed depending on their type or not

is_x_quantitative = True

whether x is quantitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • KruskalEtaSquaredMeasure is the default measure for each QualitativeFeature when using RegressionSelector.

  • KruskalEtaSquaredMeasure is the default measure for each QuantitativeFeature when using ClassificationSelector.

Kruskal-Wallis’ \(\varepsilon^2\) effect size

The raw Kruskal-Wallis’ H statistic grows with the number of observations, which makes it unsuitable for comparing features of differing sample sizes. The epsilon-squared effect size normalizes it to \([0, 1]\):

\[\varepsilon^2 = \frac{H}{N - 1}\]

where:

  • \(H\) is Kruskal-Wallis’ H test statistic

  • \(N\) is the number of pooled (non-missing) observations

It is to Kruskal-Wallis what Cramér’s V is to Pearson’s chi²: a sample-size-normalized effect size meant for cross-feature ranking.

class AutoCarver.selectors.measures.KruskalEpsilonSquaredMeasure(threshold: float = 0.0)

Epsilon-squared effect size derived from Kruskal-Wallis’ H statistic.

Unlike the raw H statistic (which grows with the number of observations, making it unsuitable for comparing features of differing sample sizes), \(\varepsilon^2 = H / (N - 1)\) is bounded in \([0, 1]\). It is to Kruskal-Wallis what Cramér’s V is to Chi2 — a sample-size-normalized effect size meant for cross-feature ranking.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_reversible = True

whether the measure’s input can be reversed depending on their type or not

is_x_quantitative = True

whether x is quantitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Kruskal-Wallis’ \(\eta^2\) effect size

Like epsilon-squared, the eta-squared effect size removes the sample-size inflation of the raw H statistic, but it additionally corrects for the number of groups \(k\):

\[\eta^2 = \frac{H - k + 1}{N - k}\]

where:

  • \(H\) is Kruskal-Wallis’ H test statistic

  • \(k\) is the number of groups (modalities of \(y\))

  • \(N\) is the number of pooled (non-missing) observations

The correction for \(k\) is useful in the reversed (regression) case, where \(k\) is the feature’s modality count and therefore varies across features. The result is clamped to \([0, 1]\).
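
A short sketch of both effect sizes, computed from an already known \(H\), may help compare them. The function names are hypothetical and the input values are made up for illustration.

```python
def kruskal_epsilon_squared(h, n_obs):
    """epsilon^2 = H / (N - 1)."""
    return h / (n_obs - 1)


def kruskal_eta_squared(h, n_groups, n_obs):
    """eta^2 = (H - k + 1) / (N - k), clamped to [0, 1]."""
    raw = (h - n_groups + 1) / (n_obs - n_groups)
    return min(1.0, max(0.0, raw))


# Same H seen through both effect sizes (H = 27/7, N = 6, k = 2)
eps = kruskal_epsilon_squared(27 / 7, n_obs=6)
eta = kruskal_eta_squared(27 / 7, n_groups=2, n_obs=6)

# A weak H with several groups gives a negative raw value, clamped to 0
weak = kruskal_eta_squared(0.5, n_groups=3, n_obs=30)
print(eps, eta, weak)
```

The clamp matters for weak associations: whenever \(H < k - 1\), the raw \(\eta^2\) is negative.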

class AutoCarver.selectors.measures.KruskalEtaSquaredMeasure(threshold: float = 0.0)

Eta-squared effect size derived from Kruskal-Wallis’ H statistic.

\(\eta^2 = (H - k + 1) / (N - k)\), where k is the number of groups and N the number of pooled observations. Like KruskalEpsilonSquaredMeasure it removes the sample-size inflation of the raw H statistic, but it additionally corrects for k — useful in the reversed (regression) case where k is the feature’s modality count and therefore varies across features. Clamped to \([0, 1]\).

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_reversible = True

whether the measure’s input can be reversed depending on their type or not

is_x_quantitative = True

whether x is quantitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Coefficient of determination \(R\)

For a binary feature \(y\) and a quantitative feature \(x\) the following linear regression model is fitted using statsmodels.formula.api.ols:

\[x = \alpha + \beta y + \epsilon\]

where:

  • \(\alpha\) and \(\beta\) are the coefficients of the linear regression model

  • \(\epsilon\) is the residual of the linear regression model

The coefficient of determination, often denoted \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). The measure used here is its square root \(R\), computed with the following formula:

\[R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }\]

where:

  • \(n\) is the number of observations

  • \(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares

  • \(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares

  • \(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
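
The computation can be sketched without statsmodels by using the closed-form least-squares solution for a single regressor. `r_measure` is a hypothetical name and the data is made up; the library fits the model with statsmodels.formula.api.ols as stated above.

```python
import math


def r_measure(x, y):
    """R = sqrt(1 - SS_res / SS_tot) for the regression x = alpha + beta * y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # closed-form ordinary least squares with one regressor
    beta = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (yi - mean_y) ** 2 for yi in y
    )
    alpha = mean_x - beta * mean_y
    ss_res = sum((xi - alpha - beta * yi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((xi - mean_x) ** 2 for xi in x)
    # guard against tiny negative values caused by rounding
    return math.sqrt(max(0.0, 1 - ss_res / ss_tot))


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0, 0, 0, 1, 1, 1]  # binary target
r_value = r_measure(x, y)
print(r_value)
```

With a binary \(y\), the fitted values are simply the mean of \(x\) in each class, so \(R\) measures how far apart the two class means are relative to the overall spread of \(x\).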

class AutoCarver.selectors.measures.RMeasure(threshold: float = 0.0)

Square root of the coefficient of determination of linear regression model of a Quantitative feature by a Binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_x_quantitative = True

whether x is quantitative or not

is_y_binary = True

whether y is binary or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Outlier Detection Measures

Standard Score

The standard score can be applied as a measure of deviation to detect outliers in quantitative features. For a feature \(x\), it is computed for any observation \(x_i\) as follows:

\[z_i = \frac{x_i - \bar{x}}{S}\]

where:

  • \(n\) is the number of observations

  • \(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)

  • \(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
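
A minimal sketch of the per-observation score, using the standard library. The helper name and the cutoff of 1.5 are illustrative assumptions; how ZscoreOutlierMeasure aggregates the scores into a single value is not described here and is not reproduced.

```python
import math
import statistics


def z_scores(x):
    """Standard score of each observation, using the sample standard deviation."""
    mean_x = statistics.mean(x)
    s = statistics.stdev(x)  # divides by n - 1, as in the formula
    return [(xi - mean_x) / s for xi in x]


x = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_scores(x)

# Illustrative cutoff (assumption): flag observations with |z| above 1.5
flagged = [xi for xi, zi in zip(x, z) if abs(zi) > 1.5]
print(z, flagged)
```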

class AutoCarver.selectors.measures.ZscoreOutlierMeasure(threshold: float = 1.0)

Z-Score based outlier measure

Parameters:

threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series | None = None) float

Computes outlier measure on x

higher_is_better = False

whether higher values are better or not

is_x_qualitative = False

whether x is qualitative or not

is_x_quantitative = True

whether x is quantitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Interquartile range

Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:

\[IQR = Q_3 - Q_1\]

where:

  • \(Q_1\) is the 25th percentile of \(x\)

  • \(Q_3\) is the 75th percentile of \(x\)

Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:

\[Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR\]

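
The rule above can be sketched with the standard library. The helper name is hypothetical, and the quantile interpolation method ("inclusive") is an assumption, since the text does not say which one is used.

```python
import statistics


def iqr_outliers(x):
    """Observations outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]."""
    # quartiles with linear interpolation between data points (assumption)
    q1, _median, q3 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return iqr, [xi for xi in x if not lower <= xi <= upper]


x = [5, 1, 100, 3, 7, 2, 8, 4, 6]
iqr, outliers = iqr_outliers(x)
print(iqr, outliers)
```

Unlike the standard score, the quartiles are barely affected by the extreme value itself, so the fences stay tight around the bulk of the data.
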
class AutoCarver.selectors.measures.IqrOutlierMeasure(threshold: float = 1.0)

Interquartile range based outlier measure

Parameters:

threshold (float, optional) – Maximum threshold to reach, by default 1.0

compute_association(x: Series, y: Series | None = None) float

Computes outlier measure on x

higher_is_better = False

whether higher values are better or not

is_x_qualitative = False

whether x is qualitative or not

is_x_quantitative = True

whether x is quantitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Qualitative measures

Pearson’s \(\chi^2\) test statistic

For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed based on pandas.crosstab.

Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)

  • \(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)

  • \(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)

  • \(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
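
As an illustration of the formula, the statistic can be accumulated cell by cell from observed and expected counts. This pure-Python sketch uses a hypothetical helper name and applies no continuity correction, whereas scipy.stats.chi2_contingency applies Yates’ correction by default on 2x2 tables, so values may differ in that case.

```python
from collections import Counter


def chi2_statistic(x, y):
    """Pearson's chi-squared statistic of the x by y contingency table."""
    n = len(x)
    observed = Counter(zip(x, y))  # n_ij
    row_totals = Counter(x)        # n_i.
    col_totals = Counter(y)        # n_.j
    chi2 = 0.0
    for xi, n_i in row_totals.items():
        for yj, n_j in col_totals.items():
            expected = n_i * n_j / n  # count expected under independence
            chi2 += (observed[(xi, yj)] - expected) ** 2 / expected
    return chi2


x = ["a"] * 10 + ["b"] * 10
y = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
chi2 = chi2_statistic(x, y)
print(chi2)
```

Every expected count is 5 here, and each observed count is 3 away from it.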

class AutoCarver.selectors.measures.Chi2Measure(threshold: float = 0.0)

Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_x_qualitative = True

whether x is qualitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Cramér’s \(V\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)
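
Given a \(\chi^2\) value, the normalization is a one-liner. The sketch below uses a hypothetical function name and made-up inputs.

```python
import math


def cramers_v(chi2, n, n_x, n_y):
    """V = sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


# chi2 = 7.2 measured on 20 observations of a 2 x 2 table
v = cramers_v(7.2, n=20, n_x=2, n_y=2)
print(v)
```

Dividing by \(n\) removes the dependence on sample size, so features observed on different numbers of rows become comparable.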

class AutoCarver.selectors.measures.CramervMeasure(threshold: float = 0.0)

Computes Cramér’s V between a Qualitative feature and a binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_x_qualitative = True

whether x is qualitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}\]

where:

  • \(n\) is the number of observations

  • \(n_x\) is the number of modalities of \(x\)

  • \(n_y\) is the number of modalities of \(y\)
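
The difference with Cramér’s \(V\) lies only in the denominator, which a small side-by-side sketch makes visible. Function names and inputs are hypothetical.

```python
import math


def cramers_v(chi2, n, n_x, n_y):
    """V = sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))."""
    return math.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


def tschuprows_t(chi2, n, n_x, n_y):
    """T = sqrt(chi2 / (n * sqrt((n_x - 1) * (n_y - 1))))."""
    return math.sqrt(chi2 / (n * math.sqrt((n_x - 1) * (n_y - 1))))


# Rectangular table (4 modalities of x, binary y): T is smaller than V
t_rect = tschuprows_t(12.0, n=100, n_x=4, n_y=2)
v_rect = cramers_v(12.0, n=100, n_x=4, n_y=2)

# Square table (3 x 3): both denominators coincide
t_square = tschuprows_t(12.0, n=100, n_x=3, n_y=3)
v_square = cramers_v(12.0, n=100, n_x=3, n_y=3)
print(t_rect, v_rect, t_square, v_square)
```

For the same \(\chi^2\), \(T\) decreases as the number of modalities grows even when the other variable is binary, while \(V\) does not.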

class AutoCarver.selectors.measures.TschuprowtMeasure(threshold: float = 0.0)

Computes Tschuprow’s T between a Qualitative feature and a binary target.

Parameters:

threshold (float, optional) – Minimum threshold to reach, by default 0.0

compute_association(x: Series, y: Series) float

Computes association measure between x and y

higher_is_better = True

whether higher values are better or not

is_x_qualitative = True

whether x is qualitative or not

is_y_qualitative = True

whether y is qualitative or not

validate() bool

Checks if threshold is reached

Returns:

Whether the test is passed or not

Return type:

bool

Note

  • TschuprowtMeasure is the default measure for each QualitativeFeature when using ClassificationSelector.

Association filters, X by X

Quantitative filters

Pearson’s \(r\)

For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.

Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:

\[r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}\]

where:

  • \(n\) is the number of observations

  • \(x_1^i\) is the \(i\) th observation of \(x_1\)

  • \(x_2^i\) is the \(i\) th observation of \(x_2\)

  • \(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)

  • \(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)

class AutoCarver.selectors.filters.PearsonFilter(threshold: float = 1.0)

Computes maximum Pearson’s r between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

whether higher values are better or not

is_absolute = True

whether the measure needs absolute value for comparison or not

is_x_quantitative = True

whether x is quantitative or not

Spearman’s \(\rho\)

For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).

Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:

\[\rho=r_{x_{1_{r}}x_{2_{r}}}\]

where:

  • \(x_{1_{r}}\) is the ranked version of \(x_1\)

  • \(x_{2_{r}}\) is the ranked version of \(x_2\)

  • \(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)

class AutoCarver.selectors.filters.SpearmanFilter(threshold: float = 1.0)

Computes maximum Spearman’s rho between quantitative features of X

filter(X: DataFrame, ranks: list[BaseFeature]) list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

whether higher values are better or not

is_absolute = True

whether the measure needs absolute value for comparison or not

is_x_quantitative = True

whether x is quantitative or not

Note

  • SpearmanFilter is the default filter used for inter-QuantitativeFeature association measure.

Qualitative filters

Cramér’s \(V\)

For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on pandas.crosstab.

Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:

\[\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

  • \(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)

  • \(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)

  • \(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)

Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:

\[V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.CramervFilter(threshold: float = 1.0)

Computes maximum Cramér’s V between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

whether higher values are better or not

is_x_qualitative = True

whether x is qualitative or not

Tschuprow’s \(T\)

Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:

\[T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}\]

where:

  • \(n\) is the number of observations

  • \(n_{x_1}\) is the number of modalities of \(x_1\)

  • \(n_{x_2}\) is the number of modalities of \(x_2\)

class AutoCarver.selectors.filters.TschuprowtFilter(threshold: float = 1.0)

Computes maximum Tschuprow’s T between qualitative features of X

Filters out ranked features that reach the association threshold

filter(X: DataFrame, ranks: list[BaseFeature]) list[BaseFeature]

Filters out ranked features that reach the association threshold

higher_is_better = False

whether higher values are better or not

is_x_qualitative = True

whether x is qualitative or not

Note

  • TschuprowtFilter is the default filter used for inter-QualitativeFeature association measure.