.. _Selectors:
Selectors
=========
**AutoCarver** implements **Selectors**, which provide the following association-centric data selection steps:
1. Measuring association with a target and ranking features accordingly.
2. Filtering out features too strongly associated with a better-ranked feature.
Selectors allow one to select features:
* Whatever their type: quantitative or qualitative
* Whatever the optimization task: :ref:`ClassificationSelector` or :ref:`RegressionSelector`
In general, associations are computed according to the provided data types of :math:`x` and :math:`y`:
+-----------------------+---------------------------------------------------------------------+--------------------------------------------------+
| :math:`x` \\ :math:`y`| Qualitative                                                         | Quantitative                                     |
+-----------------------+---------------------------------------------------------------------+--------------------------------------------------+
| Qualitative           | Pearson's :math:`\chi^2`, Cramér's :math:`V`, Tschuprow's :math:`T` | Kruskal-Wallis' :math:`H`, :math:`R` coefficient |
+-----------------------+---------------------------------------------------------------------+--------------------------------------------------+
| Quantitative          | Kruskal-Wallis' :math:`H`, :math:`R` coefficient                    | Pearson's :math:`r`, Spearman's :math:`\rho`     |
+-----------------------+---------------------------------------------------------------------+--------------------------------------------------+
See :ref:`Measures` and :ref:`Filters` for details on the implementation of measures and filters.
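The two selection steps (ranking by association with the target, then filtering redundant features) can be illustrated with a minimal sketch. This is not AutoCarver's implementation: ``rank_then_filter``, its ``n_best`` and ``threshold`` parameters, and the toy data are all hypothetical, and Spearman's :math:`\rho` is used for both steps purely for illustration.

```python
import numpy as np
import pandas as pd


def rank_then_filter(X, y, n_best=2, threshold=0.8):
    """Hypothetical helper illustrating the two selection steps."""
    # Step 1: measure association with the target (|Spearman rho|, i.e.
    # Pearson's r on ranks) and rank features from best to worst
    scores = X.rank().corrwith(y.rank()).abs().sort_values(ascending=False)
    # Step 2: walk down the ranking, skipping any feature too strongly
    # associated with an already kept (better-ranked) feature
    pairwise = X.corr(method="spearman").abs()
    kept = []
    for name in scores.index:
        if all(pairwise.loc[name, other] < threshold for other in kept):
            kept.append(name)
        if len(kept) == n_best:
            break
    return kept


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = pd.DataFrame({"a": a, "a_copy": a + 0.01 * rng.normal(size=200), "b": b})
y = pd.Series(2 * a + b + 0.1 * rng.normal(size=200))

# "a" and "a_copy" are near duplicates, so only one of them survives
selected = rank_then_filter(X, y)
```

In this toy setting the weaker but non-redundant feature ``b`` is kept in place of the duplicate, which is the purpose of the filtering step.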
Selectors are `scikit-learn `_ transformers built like
the :ref:`carvers `: from a :class:`Features` set, a per-type budget
``n_best_per_type``, a swappable set of ``measures`` / ``filters``, and a
:class:`ProcessingConfig` carrying behavioral toggles (``verbose`` …).
* :meth:`~AutoCarver.selectors.BaseSelector.fit` scores every feature against
the target, ranks them per measure, and filters out redundant ones.
* :meth:`~AutoCarver.selectors.BaseSelector.transform` restricts ``X`` to the
selected columns; :attr:`~AutoCarver.selectors.BaseSelector.selected_features`
returns the selected :class:`Features` directly.
Selection is **exhaustive** — every feature is scored exactly — but fast: each
measure scores all features of a type in a single vectorized pass rather than one
call per feature.
.. code-block:: python
from AutoCarver.discretizers.utils.base_discretizer import ProcessingConfig
from AutoCarver.selectors import ClassificationSelector
selector = ClassificationSelector(
features=features,
n_best_per_type=25, # best features kept per data type
config=ProcessingConfig(verbose=True), # behavioral toggles, as for carvers
)
selector.fit(X, y) # or selector.fit_transform(X, y) to keep only selected features in X
best_features = selector.selected_features
.. _ClassificationSelector:
Classification tasks
--------------------
.. autoclass:: AutoCarver.selectors.ClassificationSelector
:members: fit, transform, select, selected_features
.. _RegressionSelector:
Regression tasks
----------------
.. autoclass:: AutoCarver.selectors.RegressionSelector
:members: fit, transform, select, selected_features
.. _Measures:
Association measures, X by y
----------------------------
.. _QuantiMeasures:
Quantitative measures
.....................
Pearson's :math:`r`
^^^^^^^^^^^^^^^^^^^
For a **quantitative** feature :math:`x`, the association with a **quantitative** target :math:`y` is computed using `pandas.DataFrame.corr `_.
Pearson's :math:`r`, also known as the bivariate correlation, is a measure of linear correlation between quantitative features.
It is computed using the following formula:
.. math::
r_{xy}= \frac{\sum_{i=1}^{n}{(x^i-\bar{x})(y^i-\bar{y})}}{\sqrt{\sum_{i=1}^{n}{(x^i-\bar{x})^2}} \sqrt{\sum_{i=1}^{n}{(y^i-\bar{y})^2}}}
where:
* :math:`n` is the number of observations
* :math:`x^i` is the :math:`i` th observation of :math:`x`
* :math:`y^i` is the :math:`i` th observation of :math:`y`
* :math:`\bar{x}=\frac{1}{n}\sum_{i=1}^n{x^i}` is the sample mean of :math:`x`
* :math:`\bar{y}=\frac{1}{n}\sum_{i=1}^n{y^i}` is the sample mean of :math:`y`
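As a sanity check, the formula above can be transcribed term by term with NumPy. This is only an illustrative sketch; ``pearson_r`` is a hypothetical helper, not part of AutoCarver.

```python
import numpy as np


def pearson_r(x, y):
    """Hypothetical helper: direct transcription of the formula above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()  # centered observations
    return (dx * dy).sum() / (np.sqrt((dx**2).sum()) * np.sqrt((dy**2).sum()))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
r = pearson_r(x, y)  # 8 / (sqrt(10) * sqrt(10)) = 0.8
```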
.. autoclass:: AutoCarver.selectors.measures.PearsonMeasure
:members: compute_association, validate, is_x_quantitative, is_y_quantitative, higher_is_better, is_absolute
Spearman's :math:`\rho`
^^^^^^^^^^^^^^^^^^^^^^^
For a **quantitative** feature :math:`x`, the corresponding order feature :math:`x_o` is the sorted sample of :math:`x`, such that :math:`x_o^i \leq x_o^{i+1}` for any :math:`i` in :math:`\{1, \dots, n-1\}`, where :math:`n` is the number of observations. For the same feature :math:`x`, the corresponding rank feature :math:`x_r` gives the index of each value of :math:`x` in :math:`x_o`.
Spearman's :math:`\rho` is Pearson's :math:`r` computed on the rank features. As such, it is computed with the following formula:
.. math::
\rho=r_{x_{r}y_{r}}
where:
* :math:`x_{r}` is the ranked version of :math:`x`
* :math:`y_{r}` is the ranked version of :math:`y`
* :math:`r_{x_{r}y_{r}}` is Pearson's :math:`r` linear correlation coefficient between :math:`x_{r}` and :math:`y_{r}`
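The definition can be checked numerically: ranking both features and applying Pearson's :math:`r` should match a direct Spearman computation. The sketch below uses ``scipy.stats.rankdata`` and ``scipy.stats.spearmanr`` on made-up data; it is an illustration, not the library's code path.

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([1.0, 8.0, 27.0, 64.0, 60.0])

# Rank features: x_r = [1, 2, 3, 4, 5], y_r = [1, 2, 3, 5, 4]
x_r, y_r = rankdata(x), rankdata(y)

# Pearson's r on the ranks is Spearman's rho
rho = np.corrcoef(x_r, y_r)[0, 1]
rho_scipy, _ = spearmanr(x, y)
```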
.. autoclass:: AutoCarver.selectors.measures.SpearmanMeasure
:members: compute_association, validate, is_x_quantitative, is_y_quantitative, higher_is_better, is_absolute
.. note::
* :class:`SpearmanMeasure` is the default measure for each :class:`QuantitativeFeature` when using :class:`RegressionSelector`.
.. _distance:
Distance Correlation
^^^^^^^^^^^^^^^^^^^^
For two **quantitative** features :math:`x` and :math:`y`, the Distance Correlation can be computed using the following formula:
.. math::
1 - \frac{ (x - \bar{x}) \cdot (y - \bar{y}) } { ||x - \bar{x}||_2 \, ||y - \bar{y}||_2 }
where:
* :math:`n_x` is the number of observations of :math:`x`
* :math:`n_y` is the number of observations of :math:`y`
* :math:`\bar{y}=\frac{1}{n_y}\sum_{i=1}^{n_y}{y_{i}}` is the sample mean of :math:`y`
* :math:`\bar{x}=\frac{1}{n_x}\sum_{i=1}^{n_x}{x_{i}}` is the sample mean of :math:`x`
* :math:`||x - \bar{x}||_2 = \sqrt{ \sum_{i=1}^{n_x}{ (x_i - \bar{x})^2 } }` is the Euclidean norm of the centered :math:`x`
* :math:`||y - \bar{y}||_2 = \sqrt{ \sum_{i=1}^{n_y}{ (y_i - \bar{y})^2 } }` is the Euclidean norm of the centered :math:`y`
The Distance Correlation is computed using `scipy.spatial.distance.correlation `_, which returns one minus Pearson's :math:`r` between :math:`x` and :math:`y`.
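A short numerical check, on made-up data, that the formula agrees with ``scipy.spatial.distance.correlation``. The numerator is read as a dot product between the centered vectors.

```python
import numpy as np
from scipy.spatial.distance import correlation

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Centered vectors and their Euclidean norms
dx, dy = x - x.mean(), y - y.mean()
manual = 1 - dx @ dy / (np.linalg.norm(dx) * np.linalg.norm(dy))

# SciPy's value for the same pair of features
from_scipy = correlation(x, y)
```

Here Pearson's :math:`r` equals 0.8, so both values equal 0.2.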
.. autoclass:: AutoCarver.selectors.measures.DistanceMeasure
:members: compute_association, validate, is_x_quantitative, is_y_quantitative, higher_is_better, is_absolute
.. _kruskal:
Kruskal-Wallis' :math:`H` test statistic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a **quantitative** feature :math:`x`, the corresponding order feature :math:`x_o` is the sorted sample of :math:`x`, such that :math:`x_o^i \leq x_o^{i+1}` for any :math:`i` in :math:`\{1, \dots, n-1\}`, where :math:`n` is the number of observations. For the same feature :math:`x`, the corresponding rank feature :math:`x_r` gives the index of each value of :math:`x` in :math:`x_o`.
The association with a **qualitative** target :math:`y` is computed using `scipy.stats.kruskal `_.
Kruskal-Wallis' :math:`H` test statistic, also known as one-way ANOVA on ranks, tests whether two or more samples originate from the same distribution.
It is used to determine whether :math:`x` follows the same distribution across the modalities :math:`y_0, \dots, y_{n_y-1}` of :math:`y`, where :math:`n_y` is the number of modalities taken by :math:`y`.
It is computed using the following formula:
.. math::
H = (n-1) \frac{ \sum_{i=1}^{n_y}{ n_{y=i} (\bar{x_r^{i.}} - \bar{x_r})^2 } } { \sum_{i=1}^{n_y}{ \sum_{j=1}^{n_{y=i}}{ (x_r^{ij} - \bar{x_r})^2 } } }
where:
* :math:`n` is the number of observations
* :math:`n_y` is the number of modalities of :math:`y`
* :math:`n_{y=i}` is the number of observations taking :math:`y`'s :math:`i` th modality
* :math:`x_r` is the ranked version of :math:`x`
* :math:`x_r^{ij}` is the :math:`j` th observation of :math:`x_r` when :math:`y` takes its :math:`i` th modality
* :math:`\bar{x_r^{i.}}=\frac{1}{n_{y=i}}\sum_{j=1}^{n_{y=i}}x_r^{ij}` is the sample mean of :math:`x_r` when :math:`y` takes its :math:`i` th modality
* :math:`\bar{x_r}=\frac{1}{n}\sum_{i=1}^{n_y}{\sum_{j=1}^{n_{y=i}}}x_r^{ij}` is the sample mean of :math:`x_r`
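The formula can be verified against ``scipy.stats.kruskal`` on a small made-up sample. The sketch below is an illustration under the definitions above, not the library's implementation.

```python
import numpy as np
from scipy.stats import kruskal, rankdata

x = np.array([2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 7.7, 8.1, 6.9])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])  # three modalities

x_r = rankdata(x)  # ranked version of x
n = len(x_r)
groups = [x_r[y == level] for level in np.unique(y)]

# Between-group and total sums of squares, computed on the ranks
between = sum(len(g) * (g.mean() - x_r.mean()) ** 2 for g in groups)
total = ((x_r - x_r.mean()) ** 2).sum()
H = (n - 1) * between / total  # 8 * 54 / 60 = 7.2

H_scipy, _ = kruskal(*[x[y == level] for level in np.unique(y)])
```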
.. autoclass:: AutoCarver.selectors.measures.KruskalMeasure
:members: compute_association, validate, is_x_quantitative, is_y_qualitative, higher_is_better, is_reversible
.. note::
* :class:`KruskalEtaSquaredMeasure` is the default measure for each :class:`QualitativeFeature` when using :class:`RegressionSelector`.
* :class:`KruskalEtaSquaredMeasure` is the default measure for each :class:`QuantitativeFeature` when using :class:`ClassificationSelector`.
.. _kruskal_epsilon2:
Kruskal-Wallis' :math:`\varepsilon^2` effect size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The raw :ref:`Kruskal-Wallis' H ` statistic grows with the number of observations, which makes it unsuitable for comparing features of differing sample sizes. The epsilon-squared effect size normalizes it to :math:`[0, 1]`:
.. math::
\varepsilon^2 = \frac{H}{N - 1}
where:
* :math:`H` is :ref:`Kruskal-Wallis' H ` test statistic
* :math:`N` is the number of pooled (non-missing) observations
It is to Kruskal-Wallis what :ref:`Cramér's V ` is to :ref:`Pearson's chi² `: a sample-size-normalized effect size meant for cross-feature ranking.
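A minimal sketch of the normalization, assuming missing values of :math:`x` are dropped before pooling. ``kruskal_epsilon_squared`` is a hypothetical helper built on ``scipy.stats.kruskal``, not AutoCarver's own code.

```python
import numpy as np
from scipy.stats import kruskal


def kruskal_epsilon_squared(x, y):
    """Hypothetical helper: H / (N - 1) for x grouped by the modalities of y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    keep = ~np.isnan(x)  # pool only non-missing observations
    x, y = x[keep], y[keep]
    H, _ = kruskal(*[x[y == level] for level in np.unique(y)])
    return H / (len(x) - 1)


x = [2.1, 3.4, 1.9, 5.6, 4.8, 6.2, 7.7, 8.1, 6.9, float("nan")]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
eps2 = kruskal_epsilon_squared(x, y)  # H = 7.2, N = 9, so 7.2 / 8 = 0.9
```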
.. autoclass:: AutoCarver.selectors.measures.KruskalEpsilonSquaredMeasure
:members: compute_association, validate, is_x_quantitative, is_y_qualitative, higher_is_better, is_reversible
.. _kruskal_eta2:
Kruskal-Wallis' :math:`\eta^2` effect size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Like :ref:`epsilon-squared `, the eta-squared effect size removes the sample-size inflation of the raw :ref:`H ` statistic, but it additionally corrects for the number of groups :math:`k`:
.. math::
\eta^2 = \frac{H - k + 1}{N - k}
where:
* :math:`H` is :ref:`Kruskal-Wallis' H ` test statistic
* :math:`k` is the number of groups (modalities of :math:`y`)
* :math:`N` is the number of pooled (non-missing) observations
The correction for :math:`k` is useful in the reversed (regression) case, where :math:`k` is the feature's modality count and therefore varies across features. The result is clamped to :math:`[0, 1]`.
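The correction and the clamping can be illustrated with plain arithmetic. ``kruskal_eta_squared`` is a hypothetical helper and the input values are made up.

```python
def kruskal_eta_squared(H, k, N):
    """Hypothetical helper: (H - k + 1) / (N - k), clamped to [0, 1]."""
    eta2 = (H - k + 1) / (N - k)
    return min(max(eta2, 0.0), 1.0)


# Strong association: (7.2 - 3 + 1) / (9 - 3) = 5.2 / 6
strong = kruskal_eta_squared(H=7.2, k=3, N=9)

# Weak association: (0.5 - 3 + 1) / 6 is negative, so it is clamped to 0
weak = kruskal_eta_squared(H=0.5, k=3, N=9)
```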
.. autoclass:: AutoCarver.selectors.measures.KruskalEtaSquaredMeasure
:members: compute_association, validate, is_x_quantitative, is_y_qualitative, higher_is_better, is_reversible
.. _R:
Coefficient of determination :math:`R`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a **binary** feature :math:`y` and a **quantitative** feature :math:`x` the following linear regression model is fitted using `statsmodels.formula.api.ols `_:
.. math::
x = \alpha + \beta y + \epsilon
where:
* :math:`\alpha` and :math:`\beta` are the coefficients of the linear regression model
* :math:`\epsilon` is the residual of the linear regression model
The coefficient of determination, often denoted :math:`R^2`, is a statistical measure that quantifies the goodness of fit of a linear regression model.
In this specific case, it is equal to the square of Pearson's :math:`r` correlation coefficient between :math:`x` and :math:`y`.
The :math:`R` coefficient is the square root of :math:`R^2`. It is computed with the following formula:
.. math::
R = \sqrt{ 1 - \frac{ SS_{res} }{ SS_{tot} } }
where:
* :math:`n` is the number of observations
* :math:`SS_{res} = \sum_{i=1}^n{(x_i - \alpha - \beta y_i)^2} = \sum_{i=1}^n{\epsilon_i^2}` is the residual sum of squares
* :math:`SS_{tot} = \sum_{i=1}^n{(x_i - \bar{x})^2}` is the total sum of squares
* :math:`\bar{x}=\frac{1}{n}\sum_{i=1}^{n}{x_i}` is the sample mean of :math:`x`
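The sketch below reproduces the computation on made-up data. It swaps in ``numpy.polyfit`` for the least-squares fit (an assumption made to keep the example dependency-light) and checks that :math:`R` matches the absolute value of Pearson's :math:`r`.

```python
import numpy as np

y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # binary target
x = np.array([1.0, 2.0, 3.0, 3.0, 4.0, 5.0])  # quantitative feature

# Least-squares fit of x = alpha + beta * y (polyfit returns the slope first)
beta, alpha = np.polyfit(y, x, deg=1)

residuals = x - (alpha + beta * y)
ss_res = (residuals**2).sum()          # residual sum of squares: 4
ss_tot = ((x - x.mean()) ** 2).sum()   # total sum of squares: 10
R = np.sqrt(1 - ss_res / ss_tot)       # sqrt(0.6)
```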
.. autoclass:: AutoCarver.selectors.measures.RMeasure
:members: compute_association, validate, is_x_quantitative, is_y_qualitative, is_y_binary, higher_is_better
.. _OutliersMeasures:
Outlier Detection Measures
^^^^^^^^^^^^^^^^^^^^^^^^^^
.. _zscore:
Standard Score
""""""""""""""
The standard score can be applied as a measure of deviation to detect outliers in **quantitative** features.
For a feature :math:`x`, it is computed for any observation :math:`x_i` as follows:
.. math::
z_i = \frac{x_i - \bar{x}}{S}
where:
* :math:`n` is the number of observations
* :math:`\bar{x}=\frac{1}{n}\sum_{j=1}^n{x_j}` is the sample mean of :math:`x`
* :math:`S=\sqrt{\frac{1}{n-1}\sum_{j=1}^n{(x_j - \bar{x})^2}}` is the sample standard deviation of :math:`x`
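A minimal NumPy sketch of the formula on made-up data. How the resulting scores are turned into an outlier decision is not covered here.

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 50.0])

x_bar = x.mean()
S = x.std(ddof=1)  # ddof=1 gives the 1 / (n - 1) sample standard deviation
z = (x - x_bar) / S

# Index of the observation that deviates most from the mean
most_extreme = int(np.argmax(np.abs(z)))
```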
.. autoclass:: AutoCarver.selectors.measures.ZscoreOutlierMeasure
:members: compute_association, validate, is_x_quantitative, is_x_qualitative, higher_is_better
.. _iqr:
Interquartile range
"""""""""""""""""""
Interquartile range is widely used as an outlier detection metric for **quantitative** features.
For a feature :math:`x` it is computed as follows:
.. math::
IQR = Q_3 - Q_1
where:
* :math:`Q_1` is the 25th percentile of :math:`x`
* :math:`Q_3` is the 75th percentile of :math:`x`
Any observation :math:`x_i` of feature :math:`x` can be considered an outlier if it does not satisfy:
.. math::
Q_1 - 1.5 \, IQR \leq x_i \leq Q_3 + 1.5 \, IQR
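The rule can be sketched with ``numpy.percentile`` on made-up data. The default linear interpolation of percentiles is assumed here, and other quantile conventions may give slightly different bounds.

```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 50.0])

q1, q3 = np.percentile(x, [25, 75])  # 11.25 and 12.75
iqr = q3 - q1                        # 1.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 9.0 and 15.0

# An observation is flagged when it falls outside [lower, upper]
is_outlier = (x < lower) | (x > upper)
```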
.. autoclass:: AutoCarver.selectors.measures.IqrOutlierMeasure
:members: compute_association, validate, is_x_quantitative, is_x_qualitative, higher_is_better
.. _QualiMeasures:
Qualitative measures
....................
.. _chi2:
Pearson's :math:`\chi^2` test statistic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For a **qualitative** feature :math:`x`, the association with a **qualitative** target :math:`y` is computed based on the `pandas.crosstab `_.
Pearson's :math:`\chi^2` test statistic is then computed using `scipy.stats.chi2_contingency `_ to perform association measuring.
The formula is the following:
.. math::
\chi^2=\sum_{i=1}^{n_x}{\sum_{j=1}^{n_y}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}
where:
* :math:`n` is the number of observations
* :math:`n_x` is the number of modalities of :math:`x`
* :math:`n_y` is the number of modalities of :math:`y`
* :math:`n_{ij}` is the number of observations that take modality :math:`i` of :math:`x` and modality :math:`j` of :math:`y`
* :math:`n_{i.}=\sum_{j=1}^{n_y}n_{ij}` is the total number of observations that take modality :math:`i` of :math:`x`
* :math:`n_{.j}=\sum_{i=1}^{n_x}n_{ij}` is the total number of observations that take modality :math:`j` of :math:`y`
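The formula can be checked against ``scipy.stats.chi2_contingency`` on a made-up sample. ``correction=False`` is passed so that no continuity correction is applied; SciPy would otherwise apply one for tables with a single degree of freedom.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

x = pd.Series(["a", "a", "a", "b", "b", "b", "c", "c", "c", "c"])
y = pd.Series([0, 0, 1, 0, 1, 1, 1, 1, 1, 0])

observed = pd.crosstab(x, y).to_numpy()  # n_ij, shape (n_x, n_y)
n = observed.sum()

# Expected counts under independence: n_i. * n_.j / n
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
chi2 = ((observed - expected) ** 2 / expected).sum()

chi2_scipy = chi2_contingency(observed, correction=False)[0]
```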
.. autoclass:: AutoCarver.selectors.measures.Chi2Measure
:members: compute_association, validate, is_x_qualitative, is_y_qualitative, higher_is_better
.. _Cramerv:
Cramér's :math:`V`
^^^^^^^^^^^^^^^^^^
Based on Pearson's :math:`\chi^2`, Cramér's :math:`V` is computed using the following formula:
.. math::
V=\sqrt{\frac{\chi^2}{n\min(n_x-1, n_y-1)}}
where:
* :math:`n` is the number of observations
* :math:`n_x` is the number of modalities of :math:`x`
* :math:`n_y` is the number of modalities of :math:`y`
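A small sketch of the formula applied to contingency tables. ``cramers_v`` is a hypothetical helper and the tables are made up.

```python
import numpy as np


def cramers_v(observed):
    """Hypothetical helper: Cramér's V from a table of counts n_ij."""
    observed = np.asarray(observed, dtype=float)
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    n_x, n_y = observed.shape
    return np.sqrt(chi2 / (n * min(n_x - 1, n_y - 1)))


v_perfect = cramers_v([[5, 0], [0, 5]])          # x determines y
v_none = cramers_v([[2, 2], [2, 2]])             # x and y independent
v_partial = cramers_v([[2, 1], [1, 2], [1, 3]])  # chi2 = 95 / 72, n = 10
```

The perfectly associated table yields :math:`V=1` and the independent one yields :math:`V=0`, which is what makes :math:`V` comparable across features.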
.. autoclass:: AutoCarver.selectors.measures.CramervMeasure
:members: compute_association, validate, is_x_qualitative, is_y_qualitative, higher_is_better
.. _Tschuprowt:
Tschuprow's :math:`T`
^^^^^^^^^^^^^^^^^^^^^
Based on Pearson's :math:`\chi^2`, Tschuprow's :math:`T` is computed using the following formula:
.. math::
T=\sqrt{\frac{\chi^2}{n\sqrt{(n_x-1)(n_y-1)}}}
where:
* :math:`n` is the number of observations
* :math:`n_x` is the number of modalities of :math:`x`
* :math:`n_y` is the number of modalities of :math:`y`
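The sketch below applies the formula to two made-up tables in which :math:`x` fully determines :math:`y`. ``chi2_statistic`` and ``tschuprows_t`` are hypothetical helpers.

```python
import numpy as np


def chi2_statistic(observed):
    """Pearson's chi-squared statistic from a table of counts n_ij."""
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return ((observed - expected) ** 2 / expected).sum()


def tschuprows_t(observed):
    """Hypothetical helper: Tschuprow's T from a table of counts n_ij."""
    observed = np.asarray(observed, dtype=float)
    n_x, n_y = observed.shape
    denominator = observed.sum() * np.sqrt((n_x - 1) * (n_y - 1))
    return np.sqrt(chi2_statistic(observed) / denominator)


t_square = tschuprows_t([[5, 0], [0, 5]])        # 2 x 2 table
t_rect = tschuprows_t([[5, 0], [0, 5], [0, 5]])  # 3 x 2 table
```

On the square table :math:`T=1`, while on the rectangular one :math:`T=2^{-1/4}\approx 0.84`: :math:`T` can only reach 1 when both features have the same number of modalities.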
.. autoclass:: AutoCarver.selectors.measures.TschuprowtMeasure
:members: compute_association, validate, is_x_qualitative, is_y_qualitative, higher_is_better
.. note::
* :class:`TschuprowtMeasure` is the default measure for each :class:`QualitativeFeature` when using :class:`ClassificationSelector`.
.. _Filters:
Association filters, X by X
---------------------------
.. _QuantiFilters:
Quantitative filters
....................
.. _pearson_filter:
Pearson's :math:`r`
^^^^^^^^^^^^^^^^^^^
For a **quantitative** feature :math:`x_1`, the association with a **quantitative** feature :math:`x_2` is computed using `pandas.DataFrame.corr `_.
Pearson's :math:`r`, also known as the bivariate correlation, is a measure of linear correlation between quantitative features.
It is computed using the following formula:
.. math::
r_{x_1x_2}= \frac{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})(x_2^i-\bar{x_2})}}{\sqrt{\sum_{i=1}^{n}{(x_1^i-\bar{x_1})^2}} \sqrt{\sum_{i=1}^{n}{(x_2^i-\bar{x_2})^2}}}
where:
* :math:`n` is the number of observations
* :math:`x_1^i` is the :math:`i` th observation of :math:`x_1`
* :math:`x_2^i` is the :math:`i` th observation of :math:`x_2`
* :math:`\bar{x_1}=\frac{1}{n}\sum_{i=1}^n{x_1^i}` is the sample mean of :math:`x_1`
* :math:`\bar{x_2}=\frac{1}{n}\sum_{i=1}^n{x_2^i}` is the sample mean of :math:`x_2`
.. autoclass:: AutoCarver.selectors.filters.PearsonFilter
:members: filter, is_x_quantitative, higher_is_better, is_absolute
.. _spearman_filter:
Spearman's :math:`\rho`
^^^^^^^^^^^^^^^^^^^^^^^
For a **quantitative** feature :math:`x`, the corresponding order feature :math:`x_o` is the sorted sample of :math:`x`, such that :math:`x_o^i \leq x_o^{i+1}` for any :math:`i` in :math:`\{1, \dots, n-1\}`, where :math:`n` is the number of observations. For the same feature :math:`x`, the corresponding rank feature :math:`x_r` gives the index of each value of :math:`x` in :math:`x_o`.
Spearman's :math:`\rho` is Pearson's :math:`r` computed on the rank features. As such, it is computed with the following formula:
.. math::
\rho=r_{x_{1_{r}}x_{2_{r}}}
where:
* :math:`x_{1_{r}}` is the ranked version of :math:`x_1`
* :math:`x_{2_{r}}` is the ranked version of :math:`x_2`
* :math:`r_{x_{1_{r}}x_{2_{r}}}` is Pearson's :math:`r` linear correlation coefficient between :math:`x_{1_{r}}` and :math:`x_{2_{r}}`
.. autoclass:: AutoCarver.selectors.filters.SpearmanFilter
:members: filter, is_x_quantitative, higher_is_better, is_absolute
.. note::
* :class:`SpearmanFilter` is the default filter used to measure association between :class:`QuantitativeFeature` pairs.
.. _QualiFilters:
Qualitative filters
...................
.. _cramerv_filter:
Cramér's :math:`V`
^^^^^^^^^^^^^^^^^^
For a **qualitative** feature :math:`x_1`, the association with a **qualitative** feature :math:`x_2` is computed based on the `pandas.crosstab `_.
Pearson's :math:`\chi^2` statistic is then computed using `scipy.stats.chi2_contingency `_ to perform association measuring.
The formula is the following:
.. math::
\chi^2=\sum_{i=1}^{n_{x_1}}{\sum_{j=1}^{n_{x_2}}{\frac{(n_{ij} - \frac{n_{i.}n_{.j}}{n})^2}{\frac{n_{i.}n_{.j}}{n}}}}
where:
* :math:`n` is the number of observations
* :math:`n_{x_1}` is the number of modalities of :math:`x_1`
* :math:`n_{x_2}` is the number of modalities of :math:`x_2`
* :math:`n_{ij}` is the number of observations that take modality :math:`i` of :math:`x_1` and modality :math:`j` of :math:`x_2`
* :math:`n_{i.}=\sum_{j=1}^{n_{x_2}}n_{ij}` is the total number of observations that take modality :math:`i` of :math:`x_1`
* :math:`n_{.j}=\sum_{i=1}^{n_{x_1}}n_{ij}` is the total number of observations that take modality :math:`j` of :math:`x_2`
Based on Pearson's :math:`\chi^2`, Cramér's :math:`V` is computed using the following formula:
.. math::
V=\sqrt{ \frac{ \chi^2 }{ n\min(n_{x_1}-1, n_{x_2}-1) } }
where:
* :math:`n` is the number of observations
* :math:`n_{x_1}` is the number of modalities of :math:`x_1`
* :math:`n_{x_2}` is the number of modalities of :math:`x_2`
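For filtering, the same statistic is computed between pairs of features instead of against the target. The sketch below builds pairwise values on a made-up ``DataFrame``; ``cramers_v`` is a hypothetical helper and this is not the library's implementation.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def cramers_v(x1, x2):
    """Hypothetical helper: Cramér's V between two qualitative columns."""
    observed = pd.crosstab(x1, x2).to_numpy()
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return np.sqrt(chi2 / (n * min(observed.shape[0] - 1, observed.shape[1] - 1)))


X = pd.DataFrame(
    {
        "color": ["red", "red", "blue", "blue", "green", "green"],
        "shade": ["warm", "warm", "cold", "cold", "cold", "cold"],
        "size": ["S", "L", "S", "L", "S", "L"],
    }
)

# One value per unordered pair of features
pairwise = {
    (a, b): cramers_v(X[a], X[b]) for a, b in combinations(X.columns, 2)
}
```

Here ``shade`` is fully determined by ``color`` (:math:`V=1`), so a filter would drop whichever of the two is ranked lower, while ``size`` is independent of both (:math:`V=0`).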
.. autoclass:: AutoCarver.selectors.filters.CramervFilter
:members: filter, is_x_qualitative, higher_is_better
.. _tschuprowt_filter:
Tschuprow's :math:`T`
^^^^^^^^^^^^^^^^^^^^^
Based on Pearson's :math:`\chi^2`, Tschuprow's :math:`T` is computed using the following formula:
.. math::
T=\sqrt{\frac{\chi^2}{n\sqrt{(n_{x_1}-1)(n_{x_2}-1)}}}
where:
* :math:`n` is the number of observations
* :math:`n_{x_1}` is the number of modalities of :math:`x_1`
* :math:`n_{x_2}` is the number of modalities of :math:`x_2`
.. autoclass:: AutoCarver.selectors.filters.TschuprowtFilter
:members: filter, is_x_qualitative, higher_is_better
.. note::
* :class:`TschuprowtFilter` is the default filter used to measure association between :class:`QualitativeFeature` pairs.