Selectors
AutoCarver implements Selectors, which provide the following association-centric data selection steps:
Measuring association with a target and ranking features accordingly.
Filtering out features too associated to a better ranked feature.
It allows one to select features:
Whatever their type: quantitative or qualitative
Whatever the optimization task: Classification tasks or Regression tasks
By default, quantitative features are:
Ranked according to Kruskal-Wallis’ H test statistic
Filtered according to Spearman’s rho correlation coefficient
By default, qualitative features are:
Ranked according to Tschuprow’s T
Filtered according to Tschuprow’s T
In general, associations are computed according to the provided data types of \(x\) and \(y\):
\(x\) \ \(y\) | Qualitative | Quantitative |
Qualitative | Pearson’s \(\chi^2\), Cramér’s \(V\), Tschuprow’s \(T\) | Kruskal-Wallis’ \(H\), \(R\) coefficient |
Quantitative | Kruskal-Wallis’ \(H\), \(R\) coefficient | Pearson’s \(r\), Spearman’s \(\rho\) |
See Association measures, X by y and Association filters, X by X, for details on measures and filters’ implementation.
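The two steps described above (rank by association with the target, then filter out features too associated to a better ranked one) can be sketched with plain pandas. This is only an illustrative toy, not AutoCarver's implementation: the function `rank_then_filter`, the `_target` column name and the `0.7` threshold are assumptions made for the example, and Spearman's rho is used for both steps to keep it short.

```python
import numpy as np
import pandas as pd


def rank_then_filter(X, y, n_best, thresh_corr=0.7):
    """Toy selection for quantitative features and a quantitative target."""
    # Absolute Spearman correlation between every pair of columns, target included
    corr = X.assign(_target=y).corr(method="spearman").abs()

    # Step 1: rank features by decreasing association with the target
    ranked = corr["_target"].drop("_target").sort_values(ascending=False).index

    # Step 2: keep a feature only if it is not too correlated to an already kept one
    selected = []
    for feature in ranked:
        if all(corr.loc[feature, kept] < thresh_corr for kept in selected):
            selected.append(feature)
    return selected[:n_best]


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})
y = pd.Series(2 * a + b + 0.1 * rng.normal(size=200))

selected = rank_then_filter(X, y, n_best=2)
print(selected)  # one of "a"/"a_copy" is kept, the near-duplicate is filtered out
```

Only one of the two near-duplicate columns survives the filtering step, which leaves room for the weaker but non-redundant feature `b`.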
Classification tasks
- class AutoCarver.selectors.ClassificationSelector(n_best: int, qualitative_features: list[str] | None = None, quantitative_features: list[str] | None = None, *, quantitative_measures: list[Callable] | None = None, qualitative_measures: list[Callable] | None = None, quantitative_filters: list[Callable] | None = None, qualitative_filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, **kwargs)
A pipeline of measures to perform a feature pre-selection that maximizes association with a qualitative target.
Get your best features with ClassificationSelector.select()!
- Parameters:
n_best (int) –
Number of features to select.
Best features are the n_best of each provided data type (set in quantitative_features and/or qualitative_features).
Best features are the n_best for each provided measure (set in quantitative_measures and/or qualitative_measures).
quantitative_features (list[str]) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str]) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
quantitative_measures (list[Callable], optional) –
List of association measures to be used for quantitative_features. Implemented measures are:
For association evaluation: Kruskal-Wallis’ H (default), Coefficient of determination R
For outlier detection: Standard score, Interquartile range
qualitative_measures (list[Callable], optional) –
List of association measures to be used for qualitative_features. Implemented measures are:
For association evaluation: Pearson’s chi², Cramér’s V, Tschuprow’s T (default)
quantitative_filters (list[Callable], optional) –
List of filters to be used for quantitative_features. Implemented filters are:
For linear correlation: Pearson’s r, Spearman’s rho (default)
qualitative_filters (list[Callable], optional) –
List of filters to be used for qualitative_features. Implemented filters are:
For correlation: Cramér’s V, Tschuprow’s T (default)
colsample (float, optional) –
Size of sampled list of features for sped up computation, between 0 and 1, by default 1.0.
By default, all features are used.
For colsample=0.5, Selector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:].
Tip: for better performance, should be set such that len(features)//2 < 200.
verbose (bool, optional) –
True, without IPython installed: prints raw feature selection steps for X, by default False.
True, with IPython installed: adds HTML tables to the output.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Allows one to set thresholds for provided quantitative_measures/qualitative_measures and quantitative_filters/qualitative_filters (see Association measures, X by y and Association filters, X by X) passed as keyword arguments.
Examples
- select(X: DataFrame, y: Series) list[str]
Selects the n_best features of the DataFrame, by association with the target.
- Parameters:
X (DataFrame) – Dataset to determine optimal features. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is evaluated.
- Returns:
List of selected features
- Return type:
list[str]
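The colsample parameter described above splits the list of features into consecutive chunks that are searched one after the other. A minimal sketch of such a split is given below. The helper `feature_chunks` is hypothetical and only mirrors the `colsample=0.5` case given in the parameter description; how AutoCarver handles remainders or other values is not specified here.

```python
import math


def feature_chunks(features, colsample):
    """Splits features into consecutive chunks of about colsample * len(features) names."""
    if not 0 < colsample <= 1:
        raise ValueError("colsample must be in (0, 1]")
    # Assumed rounding rule: at least one feature per chunk, rounded up
    chunk_size = max(1, math.ceil(len(features) * colsample))
    return [features[i:i + chunk_size] for i in range(0, len(features), chunk_size)]


features = [f"feature_{i}" for i in range(10)]
chunks = feature_chunks(features, colsample=0.5)
print(chunks)  # two chunks: features[:5] and features[5:]
```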
Regression tasks
- class AutoCarver.selectors.RegressionSelector(n_best: int, qualitative_features: list[str] | None = None, quantitative_features: list[str] | None = None, *, quantitative_measures: list[Callable] | None = None, qualitative_measures: list[Callable] | None = None, quantitative_filters: list[Callable] | None = None, qualitative_filters: list[Callable] | None = None, colsample: float = 1.0, verbose: bool = False, **kwargs)
A pipeline of measures to perform a feature pre-selection that maximizes association with a quantitative target.
Get your best features with RegressionSelector.select()!
- Parameters:
n_best (int) –
Number of features to select.
Best features are the n_best of each provided data type (set in quantitative_features and/or qualitative_features).
Best features are the n_best for each provided measure (set in quantitative_measures and/or qualitative_measures).
quantitative_features (list[str]) – List of column names of quantitative features to choose from, by default None. Must be set if qualitative_features=None.
qualitative_features (list[str]) – List of column names of qualitative features to choose from, by default None. Must be set if quantitative_features=None.
quantitative_measures (list[Callable], optional) –
List of association measures to be used for quantitative_features. Implemented measures are:
For association evaluation: Kruskal-Wallis’ H (default), Coefficient of determination R
For outlier detection: Standard score, Interquartile range
qualitative_measures (list[Callable], optional) –
List of association measures to be used for qualitative_features. Implemented measures are:
For association evaluation: Pearson’s chi², Cramér’s V, Tschuprow’s T (default)
quantitative_filters (list[Callable], optional) –
List of filters to be used for quantitative_features. Implemented filters are:
For linear correlation: Pearson’s r, Spearman’s rho (default)
qualitative_filters (list[Callable], optional) –
List of filters to be used for qualitative_features. Implemented filters are:
For correlation: Cramér’s V, Tschuprow’s T (default)
colsample (float, optional) –
Size of sampled list of features for sped up computation, between 0 and 1, by default 1.0.
By default, all features are used.
For colsample=0.5, Selector will search for the best features in features[:len(features)//2] and then in features[len(features)//2:].
Tip: for better performance, should be set such that len(features)//2 < 200.
verbose (bool, optional) –
True, without IPython installed: prints raw feature selection steps for X, by default False.
True, with IPython installed: adds HTML tables to the output.
Tip: IPython displaying can be turned off by setting pretty_print=False.
**kwargs – Allows one to set thresholds for provided quantitative_measures/qualitative_measures and quantitative_filters/qualitative_filters (see Association measures, X by y and Association filters, X by X) passed as keyword arguments.
Examples
- select(X: DataFrame, y: Series) list[str]
Selects the n_best features of the DataFrame, by association with the target.
- Parameters:
X (DataFrame) – Dataset to determine optimal features. Needs to have columns as specified in the features attribute.
y (Series) – Target with which the association is evaluated.
- Returns:
List of selected features
- Return type:
list[str]
Association measures, X by y
Quantitative measures
Distance Correlation
For two quantitative features \(x\) and \(y\), the Distance Correlation can be computed using the following formula:
\[ d(x, y) = 1 - \frac{(x - \bar{x}) \cdot (y - \bar{y})}{||x - \bar{x}||_2 \, ||y - \bar{y}||_2} \]
where:
\(n_x\) is the number of observations of \(x\)
\(n_y\) is the number of observations of \(y\)
\(\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}\) is the sample mean of \(y\)
\(\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}\) is the sample mean of \(x\)
\(||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }\) is the euclidean norm of the centered \(x\)
\(||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }\) is the euclidean norm of the centered \(y\)
The Distance Correlation is computed using scipy.spatial.distance.correlation.
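As a quick check of what scipy.spatial.distance.correlation returns, the sketch below recomputes it by hand from the centered vectors and their euclidean norms. The data is synthetic and the variable names are chosen for the example only.

```python
import numpy as np
from scipy.spatial.distance import correlation

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# Center both features, then divide their dot product by the product of their norms
x_centered = x - x.mean()
y_centered = y - y.mean()
manual = 1 - (x_centered @ y_centered) / (
    np.linalg.norm(x_centered) * np.linalg.norm(y_centered)
)

scipy_value = correlation(x, y)
pearson_r = np.corrcoef(x, y)[0, 1]
print(manual, scipy_value, 1 - pearson_r)  # the three values agree
```

The value is one minus Pearson's \(r\), so it lies between 0 (perfect positive linear correlation) and 2 (perfect negative linear correlation).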
- AutoCarver.selectors.measures.distance_measure(x: Series, y: Series, thresh_distance: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Distance correlation between x and y.
- Parameters:
x (Series) – Quantitative feature
y (Series) – Quantitative target feature
thresh_distance (float, optional) – Minimum distance association, by default 0
- Returns:
Whether x is sufficiently associated to y and Distance Correlation
- Return type:
tuple[bool, dict[str, Any]]
Note
distance_measure is the default measure for quantitative features in regression tasks (i.e. when RegressionSelector.quantitative_measures=None and RegressionSelector.quantitative_features is provided).
If thresh_distance is reached, feature will automatically be dropped.
Kruskal-Wallis’ \(H\) test statistic
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
The association with a qualitative target \(y\) is computed using scipy.stats.kruskal.
Kruskal-Wallis’ \(H\) test statistic, also known as one-way ANOVA on ranks, allows one to check whether two or more samples originate from the same distribution. It is used to determine whether or not \(x\) is distributed the same when \(y=y_0\) to \(y=y_{n_y-1}\) where \(n_y\) is the number of modalities taken by \(y\). It is computed using the following formula:
\[ H = (n - 1) \frac{\sum_{i=1}^{n_y}{n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2}}{\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}{(x_r^{ij} - \bar{x_r})^2}}} \]
where:
\(n\) is the number of observations
\(n_y\) is the number of modalities of \(y\)
\(n_{y=i}\) is the number of observations taking \(y\)’s \(i\) th modality
\(x_r\) is the ranked version of \(x\)
\(x_r^{ij}\) is the \(j\) th observation of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}\) is the sample mean of \(x_r\) when \(y\) takes its \(i\) th modality
\(\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}\) is the sample mean of \(x_r\)
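To make the rank-based definition concrete, the sketch below computes the statistic as \((n-1)\) times the ratio of the between-group to the total sum of squares of the ranks, then compares it with scipy.stats.kruskal. The data is synthetic and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import kruskal, rankdata

rng = np.random.default_rng(0)
y = np.repeat(["a", "b", "c"], 30)           # qualitative target, 3 modalities
x = rng.normal(size=90) + (y == "c") * 1.0   # quantitative feature, shifted for "c"

ranks = rankdata(x)                          # x_r, the ranked version of x
n = len(x)
grand_mean = ranks.mean()

# Between-group sum of squares of the ranks, weighted by group size
between = sum(
    (y == modality).sum() * (ranks[y == modality].mean() - grand_mean) ** 2
    for modality in np.unique(y)
)
# Total sum of squares of the ranks
total = ((ranks - grand_mean) ** 2).sum()
h_manual = (n - 1) * between / total

h_scipy = kruskal(*(x[y == modality] for modality in np.unique(y))).statistic
print(h_manual, h_scipy)  # both values agree
```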
- AutoCarver.selectors.measures.kruskal_measure(x: Series, y: Series, thresh_kruskal: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Kruskal-Wallis’ test statistic between x for each value taken by y.
- Parameters:
x (Series) – Quantitative feature
y (Series) – Qualitative target feature
thresh_kruskal (float, optional) – Minimum Kruskal-Wallis association, by default 0
- Returns:
Whether x is sufficiently associated to y and Kruskal-Wallis’ test statistic
- Return type:
tuple[bool, dict[str, Any]]
Note
kruskal_measure is the default measure for quantitative features in classification tasks (i.e. when ClassificationSelector.quantitative_measures=None and ClassificationSelector.quantitative_features is provided).
kruskal_measure is the default measure for qualitative features in regression tasks (i.e. when RegressionSelector.qualitative_measures=None and RegressionSelector.qualitative_features is provided).
If thresh_kruskal is reached, feature will automatically be dropped.
Coefficient of determination \(R\)
For a binary feature \(y\) and a quantitative feature \(x\), the following linear regression model is fitted using statsmodels.formula.api.ols:
\[ x = \alpha + \beta y + \epsilon \]
where:
\(\alpha\) and \(\beta\) are the coefficients of the linear regression model
\(\epsilon\) is the residual of the linear regression model
The determination coefficient, often denoted as \(R^2\), is a statistical measure that quantifies the goodness of fit of a linear regression model. In this specific case, it is equal to the square of Pearson’s \(r\) correlation coefficient between \(x\) and \(y\). It is computed with the following formula:
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]
where:
\(n\) is the number of observations
\(SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}\) is the residual sum of squares
\(SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}\) is the total sum of squares
\(\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}\) is the sample mean of \(x\)
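The claim that the determination coefficient equals the square of Pearson's \(r\) in this single-regressor case can be checked numerically. The sketch below uses numpy.polyfit rather than statsmodels to keep dependencies light. That substitution and the synthetic data are assumptions of the example, not the library's code path.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=200)           # binary target
x = 1.0 + 2.0 * y + rng.normal(size=200)   # quantitative feature

# Least squares fit of x = alpha + beta * y (polyfit returns the slope first)
beta, alpha = np.polyfit(y, x, deg=1)

residuals = x - (alpha + beta * y)
ss_res = (residuals ** 2).sum()            # residual sum of squares
ss_tot = ((x - x.mean()) ** 2).sum()       # total sum of squares
r_squared = 1 - ss_res / ss_tot

pearson_r = np.corrcoef(x, y)[0, 1]
print(r_squared, pearson_r ** 2)  # both values agree
```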
- AutoCarver.selectors.measures.R_measure(x: Series, y: Series, thresh_R: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Square root of the coefficient of determination of linear regression model of x by y.
- Parameters:
x (Series) – Quantitative feature
y (Series) – Binary target feature
thresh_R (float, optional) – Minimum R association, by default 0
- Returns:
Whether x is sufficiently associated to y and the square root of the determination coefficient
- Return type:
tuple[bool, dict[str, Any]]
Note
If thresh_R is reached, feature will automatically be dropped.
Outlier Detection: Standard score
Standard score can be applied as a measure of deviation to determine outliers for quantitative features. For a feature \(x\) it is computed for any observation \(x_i\) as follows:
\[ z_i = \frac{x_i - \bar{x}}{S} \]
where:
\(n\) is the number of observations
\(\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}\) is the sample mean of \(x\)
\(S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}\) is the sample standard deviation of \(x\)
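A minimal sketch of a standard-score outlier share follows. The cut-off `z_max=3.0` and the helper name `zscore_outlier_share` are assumptions for illustration; the text above does not state which cut-off the library applies.

```python
import pandas as pd


def zscore_outlier_share(x, z_max=3.0):
    """Share of observations whose absolute standard score exceeds z_max."""
    # pandas' std() uses n - 1 in the denominator, i.e. the sample standard deviation S
    z = (x - x.mean()) / x.std()
    return float((z.abs() > z_max).mean())


x = pd.Series([10.0] * 19 + [1000.0])
share = zscore_outlier_share(x)
print(share)  # 1 observation out of 20 is flagged
```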
- AutoCarver.selectors.measures.zscore_measure(x: Series, y: Series | None = None, thresh_zscore: float = 1.0, **kwargs) tuple[bool, dict[str, Any]]
Computes outliers percentage based on the z-score
- Parameters:
x (Series) – Quantitative feature
y (Series, optional) – Any target feature, by default None
thresh_zscore (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0
- Returns:
Whether or not there are too many outliers and the outlier measurement
- Return type:
tuple[bool, dict[str, Any]]
Note
If thresh_zscore is reached, feature will automatically be dropped.
Outlier Detection: Interquartile range
Interquartile range is widely used as an outlier detection metric for quantitative features. For a feature \(x\) it is computed as follows:
\[ IQR = Q_3 - Q_1 \]
where:
\(Q_1\) is the 25th percentile of \(x\)
\(Q_3\) is the 75th percentile of \(x\)
Any observation \(x_i\) of feature \(x\) can be considered an outlier if it does not verify:
\[ Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR \]
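The interquartile rule can be sketched as follows, assuming the usual factor of 1.5 on each side of the quartiles (Tukey's fences). The helper name `iqr_outlier_share` and the `factor` argument are hypothetical.

```python
import pandas as pd


def iqr_outlier_share(x, factor=1.5):
    """Share of observations outside [Q1 - factor * IQR, Q3 + factor * IQR]."""
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    inside = x.between(q1 - factor * iqr, q3 + factor * iqr)
    return float((~inside).mean())


x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])
share = iqr_outlier_share(x)
print(share)  # only the value 100.0 falls outside the fences
```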
- AutoCarver.selectors.measures.iqr_measure(x: Series, y: Series | None = None, thresh_iqr: float = 1.0, **kwargs) tuple[bool, dict[str, Any]]
Computes outliers percentage based on the interquartile range
- Parameters:
x (Series) – Quantitative feature
y (Series, optional) – Any target feature, by default None
thresh_iqr (float, optional) – Maximum percentage of Outliers in a feature, by default 1.0
- Returns:
Whether or not there are too many outliers and the outlier measurement
- Return type:
tuple[bool, dict[str, Any]]
Note
If thresh_iqr is reached, feature will automatically be dropped.
Qualitative measures
Pearson’s \(\chi^2\) test statistic
For a qualitative feature \(x\), the association with a qualitative target \(y\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) test statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[ \chi^2 = \sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}}}} \]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x\) and modality \(j\) of \(y\)
\(n_{i.}=\sum_{j=1}^{n_y}n_{ij}\) is the total number of observations that take modality \(i\) of \(x\)
\(n_{.j}=\sum_{i=1}^{n_x}n_{ij}\) is the total number of observations that take modality \(j\) of \(y\)
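The sketch below builds the contingency table with pandas.crosstab, computes the statistic from observed and expected counts, and compares it with scipy.stats.chi2_contingency. Passing `correction=False` is a choice made for this example so that no continuity correction is applied; whether AutoCarver applies one is not stated here.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

x = pd.Series(["a", "a", "b", "b", "c", "c", "a", "b", "c", "a", "b", "c"])
y = pd.Series([0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1])

observed = pd.crosstab(x, y).to_numpy()   # n_ij, here [[4, 0], [2, 2], [0, 4]]
n = observed.sum()

# Expected counts under independence: n_i. * n_.j / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2_manual = ((observed - expected) ** 2 / expected).sum()

chi2_scipy = chi2_contingency(observed, correction=False)[0]
print(chi2_manual, chi2_scipy)  # both equal 8.0
```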
- AutoCarver.selectors.measures.chi2_measure(x: Series, y: Series, thresh_chi2: float = 0, **kwargs) tuple[bool, dict[str, Any]]
Wrapper for scipy.stats.chi2_contingency. Computes Chi2 statistic on the x by y pandas.crosstab.
- Parameters:
x (Series) – Qualitative feature
y (Series) – Qualitative target feature
thresh_chi2 (float, optional) – Minimum Chi2 association, by default 0
- Returns:
Whether x is sufficiently associated to y and Pearson’s chi2 between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Note
If thresh_chi2 is reached, feature will automatically be dropped.
Cramér’s \(V\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[ V = \sqrt{\frac{\chi^2}{n \cdot \min(n_x - 1, n_y - 1)}} \]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
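A short sketch of Cramér's \(V\), using its standard definition (Pearson's \(\chi^2\) divided by \(n\) times the smaller of \(n_x - 1\) and \(n_y - 1\), under a square root). The helper `cramers_v` is written for this example and is not the library function. It checks the two extreme cases: identical features and independent features.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V from the x by y contingency table (no continuity correction)."""
    observed = pd.crosstab(x, y).to_numpy()
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_x, n_y = observed.shape
    return float(np.sqrt(chi2 / (n * min(n_x - 1, n_y - 1))))


x = pd.Series(["a", "a", "b", "b", "c", "c"])
y_independent = pd.Series([0, 1, 0, 1, 0, 1])

v_identical = cramers_v(x, x)                # perfect association
v_independent = cramers_v(x, y_independent)  # no association
print(v_identical, v_independent)
```

The helper assumes both features take at least two modalities; a constant feature would lead to a division by zero.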
- AutoCarver.selectors.measures.cramerv_measure(x: Series, y: Series, thresh_cramerv: float = 0, chi2_statistic: float | None = None, **kwargs) tuple[bool, dict[str, Any]]
Computes Cramér’s V between x and y from chi2_measure.
- Parameters:
x (Series) – Qualitative feature
y (Series) – Qualitative target feature
thresh_cramerv (float, optional) – Minimum Cramér’s V association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None
- Returns:
Whether x is sufficiently associated to y and Cramér’s V between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Note
If thresh_cramerv is reached, feature will automatically be dropped.
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[ T = \sqrt{\frac{\chi^2}{n \sqrt{(n_x - 1)(n_y - 1)}}} \]
where:
\(n\) is the number of observations
\(n_x\) is the number of modalities of \(x\)
\(n_y\) is the number of modalities of \(y\)
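Tschuprow's \(T\) differs from Cramér's \(V\) only in its normalisation: it divides \(\chi^2 / n\) by \(\sqrt{(n_x - 1)(n_y - 1)}\) rather than by \(\min(n_x - 1, n_y - 1)\), so it is never larger than \(V\). The sketch below, whose function name is an assumption for the example, computes both from a small contingency table.

```python
import numpy as np


def cramers_v_and_tschuprows_t(observed):
    """Returns (V, T) for a contingency table given as a 2D array of counts."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_x, n_y = observed.shape
    v = np.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))
    t = np.sqrt(chi2 / (n * np.sqrt((n_x - 1) * (n_y - 1))))
    return float(v), float(t)


table = [[4, 0], [2, 2], [0, 4]]   # 3 modalities of x, 2 modalities of y
v, t = cramers_v_and_tschuprows_t(table)
print(v, t)  # T is smaller than V because the table is not square
```

For square tables the two normalisations coincide and \(T = V\).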
- AutoCarver.selectors.measures.tschuprowt_measure(x: Series, y: Series, thresh_tschuprowt: float = 0, chi2_statistic: float | None = None, **kwargs) tuple[bool, dict[str, Any]]
Computes Tschuprow’s T between x and y from chi2_measure.
- Parameters:
x (Series) – Feature to measure
y (Series) – Binary target feature
thresh_tschuprowt (float, optional) – Minimum Tschuprow’s T association, by default 0
chi2_statistic (float, optional) – Pearson’s chi2 between x and y, by default None
- Returns:
Whether x is sufficiently associated to y and Tschuprow’s T between x and y.
- Return type:
tuple[bool, dict[str, Any]]
Note
tschuprowt_measure is the default measure for qualitative features in classification tasks (i.e. when ClassificationSelector.qualitative_measures=None and ClassificationSelector.qualitative_features is provided).
If thresh_tschuprowt is reached, feature will automatically be dropped.
Base data information
Note
Those measures are performed by default and don’t need to be added in the attributes.
Missing values
- AutoCarver.selectors.measures.nans_measure(x: Series, y: Series | None = None, thresh_nan: float = 0.999, **kwargs) tuple[bool, dict[str, Any]]
Measure of the percentage of NaNs
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_nan (float, optional) – Maximum percentage of NaNs in a feature, by default 0.999
- Returns:
Whether or not there are too many NaNs and the percentage of NaNs
- Return type:
tuple[bool, dict[str, Any]]
Note
nans_measure is evaluated by default in all Selectors.
If thresh_nan is reached, feature will automatically be dropped.
Data types
- AutoCarver.selectors.measures.dtype_measure(x: Series, y: Series | None = None, **kwargs) tuple[bool, dict[str, Any]]
Feature’s dtype
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
- Returns:
True and the feature’s dtype
- Return type:
tuple[bool, dict[str, Any]]
Note
dtype_measure is evaluated by default in all Selectors.
Mode
- AutoCarver.selectors.measures.mode_measure(x: Series, y: Series | None = None, thresh_mode: float = 0.999, **kwargs) tuple[bool, dict[str, Any]]
Measure of the percentage of the Mode
- Parameters:
x (Series) – Feature to measure
y (Series, optional) – Binary target feature, by default None
thresh_mode (float, optional) – Maximum percentage of a feature’s mode, by default 0.999
- Returns:
Whether or not the mode is overrepresented and the percentage of mode
- Return type:
tuple[bool, dict[str, Any]]
Note
mode_measure is evaluated by default in all Selectors.
If thresh_mode is reached, feature will automatically be dropped.
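The base checks on missing values and on the mode amount to two simple shares, sketched below with pandas. Computing the mode share over all observations (missing ones included in the denominator) is an assumption of this example, as are the variable names. The `0.999` thresholds are the documented defaults.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 1.0, 1.0, np.nan, 2.0, np.nan, 1.0, 1.0])

# Share of missing values in the feature
nan_share = float(x.isna().mean())

# Share of observations equal to the most frequent non-missing value
mode = x.mode().iloc[0]
mode_share = float((x == mode).mean())

# A feature is kept only while both shares stay below their thresholds
keep_feature = nan_share < 0.999 and mode_share < 0.999
print(nan_share, mode_share, keep_feature)
```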
Association filters, X by X
Quantitative filters
Pearson’s \(r\)
For a quantitative feature \(x_1\), the association with a quantitative feature \(x_2\) is computed using pandas.DataFrame.corr.
Pearson’s \(r\), also known as the bivariate correlation, is a measure of linear correlation between quantitative features. It is computed using the following formula:
\[ r = \frac{\sum_{i=1}^n{(x_1^i - \bar{x_1})(x_2^i - \bar{x_2})}}{\sqrt{\sum_{i=1}^n{(x_1^i - \bar{x_1})^2}} \sqrt{\sum_{i=1}^n{(x_2^i - \bar{x_2})^2}}} \]
where:
\(n\) is the number of observations
\(x_1^i\) is the \(i\) th observation of \(x_1\)
\(x_2^i\) is the \(i\) th observation of \(x_2\)
\(\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}\) is the sample mean of \(x_1\)
\(\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}\) is the sample mean of \(x_2\)
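For reference, Pearson's \(r\) can be recomputed from the centered observations and compared with pandas.DataFrame.corr. The data below is synthetic and the column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
X = pd.DataFrame({"x1": rng.normal(size=50)})
X["x2"] = 0.8 * X["x1"] + rng.normal(size=50)

# Deviations from the sample means
d1 = X["x1"] - X["x1"].mean()
d2 = X["x2"] - X["x2"].mean()
r_manual = (d1 * d2).sum() / np.sqrt((d1 ** 2).sum() * (d2 ** 2).sum())

r_pandas = X.corr(method="pearson").loc["x1", "x2"]
print(r_manual, r_pandas)  # both values agree
```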
- AutoCarver.selectors.filters.pearson_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) dict[str, Any]
Computes maximum Pearson’s r between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Pearson’s r between features, by default 1
- Returns:
Maximum Pearson’s r with a better feature
- Return type:
dict[str, Any]
Note
If thresh_corr is reached, feature will automatically be dropped.
Spearman’s \(\rho\)
For a quantitative feature \(x\), the corresponding order feature \(x_o\) is the sorted sample of \(x\) such that any \(i\) in \((1, n-1)\) verifies \(x_o^i \leq x_o^{i+1}\), where \(n\) is the number of observations. For the same feature \(x\), the corresponding rank \(x_r\) is the index of \(x\)’s values in \(x_o\).
Spearman’s \(\rho\) is Pearson’s \(r\) computed on the rank features. As such, Spearman’s \(\rho\) is computed with the following formula:
\[ \rho = r_{x_{1_{r}}x_{2_{r}}} \]
where:
\(x_{1_{r}}\) is the ranked version of \(x_1\)
\(x_{2_{r}}\) is the ranked version of \(x_2\)
\(r_{x_{1_{r}}x_{2_{r}}}\) is Pearson’s \(r\) linear correlation coefficient between \(x_{1_{r}}\) and \(x_{2_{r}}\)
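The equivalence between Spearman's \(\rho\) and Pearson's \(r\) on ranks is easy to verify with pandas. The sketch below uses synthetic data with a monotonic but non-linear relationship; names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
X = pd.DataFrame({"x1": rng.normal(size=80)})
X["x2"] = np.exp(X["x1"]) + 0.1 * rng.normal(size=80)

# Pearson's r computed on the ranked features
rho_from_ranks = X.rank().corr(method="pearson").loc["x1", "x2"]

# Spearman's rho computed directly
rho_pandas = X.corr(method="spearman").loc["x1", "x2"]
print(rho_from_ranks, rho_pandas)  # both values agree
```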
- AutoCarver.selectors.filters.spearman_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **params) dict[str, Any]
Computes maximum Spearman’s rho between X and X (quantitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Spearman’s rho between features, by default 1
- Returns:
Maximum Spearman’s rho with a better feature
- Return type:
dict[str, Any]
Note
spearman_filter is the default filter for quantitative features (i.e. when quantitative_filters=None and quantitative_features is provided).
If thresh_corr is reached, feature will automatically be dropped.
Qualitative filters
Cramér’s \(V\)
For a qualitative feature \(x_1\), the association with a qualitative feature \(x_2\) is computed based on the pandas.crosstab.
Pearson’s \(\chi^2\) statistic is then computed using scipy.stats.chi2_contingency to perform association measuring. The formula is the following:
\[ \chi^2 = \sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{\left(n_{ij} - \frac{n_{i.} n_{.j}}{n}\right)^2}{\frac{n_{i.} n_{.j}}{n}}}} \]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
\(n_{ij}\) is the number of observations that take modality \(i\) of \(x_1\) and modality \(j\) of \(x_2\)
\(n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}\) is the total number of observations that take modality \(i\) of \(x_1\)
\(n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}\) is the total number of observations that take modality \(j\) of \(x_2\)
Based on Pearson’s \(\chi^2\), Cramér’s \(V\) is computed using the following formula:
\[ V = \sqrt{\frac{\chi^2}{n \cdot \min(n_{x_1} - 1, n_{x_2} - 1)}} \]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- AutoCarver.selectors.filters.cramerv_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) dict[str, Any]
Computes maximum Cramér’s V between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Cramér’s V between features, by default 1
- Returns:
Maximum Cramér’s V with a better feature
- Return type:
dict[str, Any]
Note
If thresh_corr is reached, feature will automatically be dropped.
Tschuprow’s \(T\)
Based on Pearson’s \(\chi^2\), Tschuprow’s \(T\) is computed using the following formula:
\[ T = \sqrt{\frac{\chi^2}{n \sqrt{(n_{x_1} - 1)(n_{x_2} - 1)}}} \]
where:
\(n\) is the number of observations
\(n_{x_1}\) is the number of modalities of \(x_1\)
\(n_{x_2}\) is the number of modalities of \(x_2\)
- AutoCarver.selectors.filters.tschuprowt_filter(X: DataFrame, ranks: DataFrame, thresh_corr: float = 1, **kwargs) dict[str, Any]
Computes maximum Tschuprow’s T between X and X (qualitative). Features too correlated to a feature more associated with the target are excluded (according to provided ranks).
- Parameters:
X (DataFrame) – Contains columns named after ranks’s index (feature names)
ranks (DataFrame) – Ranked features as index of the association table
thresh_corr (float, optional) – Maximum Tschuprow’s T between features, by default 1
- Returns:
Maximum Tschuprow’s T with a better feature
- Return type:
dict[str, Any]
Note
tschuprowt_filter is the default filter for qualitative features (i.e. when qualitative_filters=None and qualitative_features is provided).
If thresh_corr is reached, feature will automatically be dropped.
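The filtering logic shared by the filters above (walk through the features from best to worst ranked, and drop any feature whose association with an already kept one reaches thresh_corr) can be sketched for qualitative features as follows. Both helpers, `tschuprows_t` and `filter_ranked`, are hypothetical stand-ins written for this example, and the `0.5` threshold is arbitrary; the library's filters take a ranks DataFrame instead of a plain list.

```python
import numpy as np
import pandas as pd


def tschuprows_t(x1, x2):
    """Tschuprow's T between two qualitative features (at least 2 modalities each)."""
    observed = pd.crosstab(x1, x2).to_numpy()
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_1, n_2 = observed.shape
    return float(np.sqrt(chi2 / (n * np.sqrt((n_1 - 1) * (n_2 - 1)))))


def filter_ranked(X, ranked, thresh_corr):
    """Keeps features, best ranked first, whose T with every kept feature is below thresh_corr."""
    kept = []
    for feature in ranked:
        worst = max((tschuprows_t(X[feature], X[better]) for better in kept), default=0.0)
        if worst < thresh_corr:
            kept.append(feature)
    return kept


X = pd.DataFrame({
    "color": ["r", "r", "g", "g", "b", "b"] * 5,
    "colour": ["r", "r", "g", "g", "b", "b"] * 5,   # exact duplicate of "color"
    "size": ["s", "l", "l", "s", "s", "l"] * 5,     # independent of "color"
})

kept = filter_ranked(X, ranked=["color", "colour", "size"], thresh_corr=0.5)
print(kept)  # the duplicate "colour" is dropped
```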