.. _Combinations: Combinations ============ **Combinations** are at the core of **Carvers**. They are used to identify the best combination from all possible combinations with up to :attr:`max_n_mod` modalities. A pre-built :class:`CombinationEvaluator` instance can be passed to any carver via the ``combination_evaluator`` keyword. Each subclass defaults to a task-appropriate metric: :class:`TschuprowtCombinations` for :class:`BinaryCarver` / :class:`MulticlassCarver`, :class:`KruskalCombinations` for :class:`ContinuousCarver`. The animation below starts from the six ordered bins a :class:`QuantitativeDiscretizer` produces (its final state — see :ref:`QuantitativeDiscretizer`) and shows the core step: every consecutive grouping into ``max_n_mod`` groups is scored by its association with the binary target (Tschuprow's T) and the table fills best-first in growing ``top_k`` batches (the :ref:`progressive top-K DP ` search). The highest-scoring grouping that passes the :ref:`viability filter ` is kept (gold row). Adjacent bins sharing a colour in a row are merged into one group. .. image:: _static/animations/combinations.svg :alt: Combinations search animation — ordered bins, then consecutive groupings ranked by Tschuprow's T filling a table best-first, with the selected grouping highlighted :width: 100% :align: center .. autoclass:: AutoCarver.combinations.CombinationEvaluator :members: get_best_combination, save, load .. _MinFreqViability: Minimum-frequency viability test (Wilson score interval) -------------------------------------------------------- A candidate combination is *viable* on a sample only if every grouped modality is sufficiently frequent. Comparing :math:`\hat p = \text{count} / n_{obs}` directly against ``min_freq`` is noisy on small modalities — a modality with :math:`\hat p = 4\%` out of :math:`n_{obs}=100` would be rejected against ``min_freq=5%``, even though its 95% confidence interval comfortably straddles 5%. **AutoCarver** instead tests the one-sided question *"is this modality's true proportion significantly below* ``min_freq`` *?"* at level :math:`\alpha`, using a Wilson score interval — the small-sample-stable proportion interval recommended over Wald in Brown, Cai & DasGupta (2001). **Decision rule.** Modality :math:`m` is declared under-represented iff the **upper bound** of the two-sided Wilson interval for :math:`\hat p_m` is strictly below ``min_freq``: .. math:: \text{UB}(\hat p, n, \alpha) = \frac{\hat p + z^2/(2n)}{1 + z^2/n} + \frac{z}{1 + z^2/n}\sqrt{\frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2}} with :math:`z = \Phi^{-1}(1 - \alpha/2)` (two-sided z-score; :math:`\alpha=0.05` gives :math:`z \approx 1.96`). Reject iff :math:`\text{UB}(\hat p_m, n_{obs}, \alpha) < \text{min_freq}`. **Properties.** * **Asymptotic equivalence:** as :math:`n_{obs} \to \infty`, :math:`\text{UB} \to \hat p`, so the test converges to the legacy strict threshold :math:`\hat p < \text{min_freq}`. * **Small-sample conservativity:** a modality with very few observations cannot be rejected (the CI is too wide to fall below ``min_freq``), preventing premature merges driven by sampling noise. * :math:`n_{obs} = 0` returns :math:`\text{UB} = 1.0`, so empty groups are never rejected by this test (other checks catch them). **Where the test fires.** * Inside each :ref:`Discretizer ` to gate raw modalities **before** the combination search. Carvers discretize at ``min_freq / 2`` so this gate runs at the halved threshold, giving the combination evaluator a finer granularity to recombine. * Inside :class:`CombinationEvaluator` viability checks on both **train** and **dev** samples for every candidate combination during the search. **Tuning.** Set via :attr:`DiscretizerConfig.min_freq_alpha` (default ``0.05``). Smaller :math:`\alpha` → wider CI → fewer rejections → less merging; larger :math:`\alpha` → tighter CI → more rejections → more aggressive merging. :math:`\alpha = 1` recovers the legacy strict-threshold behaviour (:math:`\text{UB}` collapses to :math:`\hat p`). .. autofunction:: AutoCarver.discretizers.utils.frequency_ci.wilson_upper_bound .. autofunction:: AutoCarver.discretizers.utils.frequency_ci.is_significantly_below .. _DPTopK: Search strategy — interval dynamic programming (DP) with progressive top-K -------------------------------------------------------------------------- For fixed ``min_freq``, ``max_n_mod`` and association metric, **AutoCarver returns the partition that maximises the metric among admissible candidates**. The DP described below is a search-strategy optimisation; it does **not** prune the candidate set and does **not** change the statistical claim. Bit-exact agreement with the legacy enumerate-and-score path is pinned by parity tests (``tests/combinations/binary/test_dp_chi2_parity.py``, ``tests/combinations/continuous/test_dp_kruskal_parity.py``). The search problem ^^^^^^^^^^^^^^^^^^ For a feature with raw modalities :math:`m_0, \dots, m_{n-1}` already ordered (by ordinal rank, target rate, or numeric quantile), the carver searches over **consecutive segmentations** with at most ``max_n_mod`` groups: a partition is fully determined by integer split positions :math:`0 = s_0 < s_1 < \dots < s_k = n` with :math:`k \le \text{max_n_mod}`. The chosen partition maximises the association metric subject to the :ref:`viability filter ` (Wilson ``min_freq`` on train + dev, distinct target rates, preserved rank between train and dev when dev is provided). The legacy path enumerated every admissible partition, scored each, then walked them in metric-desc order. This is correct but wasteful — only the top handful of candidates ever survive the viability walk. The DP idea ^^^^^^^^^^^ The DP exploits two properties shared by both supported metrics (Kruskal-Wallis :math:`H` for continuous targets, Pearson :math:`\chi^2` for binary targets): 1. **Segmentation structure.** A partition is a sequence of disjoint consecutive intervals :math:`[s_g, s_{g+1})`. Sub-problems factorise over the right boundary :math:`j` and the number of groups :math:`k`. 2. **Additive decomposability of the metric over groups, given fixed** :math:`k`. Both :math:`H` and :math:`\chi^2` reduce — at fixed :math:`k` and after factoring out :math:`k`-dependent normalising constants — to a sum over groups of a quantity that depends **only on a single interval** :math:`[i, j)` of raw modalities. We therefore run an interval DP indexed by :math:`(k, j)` whose state is the **top-K** prefixes (by partial score) ending at split :math:`j` with :math:`k` groups: .. math:: \text{dp}[k][j] = \operatorname*{top\text{-}K}_{i \in [k-1,\, j)} \big\{\, \text{dp}[k-1][i] \oplus \text{seg_cost}(i,\, j)\, \big\} The final candidate list is :math:`\bigcup_k \text{dp}[k][n]`, sorted desc and truncated to ``top_k``. .. raw:: html
How the DP fills its table — a worked sketch
Why it beats enumerate-and-score, in one picture.
1. The search space. Take a feature with 6 ordered raw modalities. A partition with k = 3 groups is fully determined by 2 internal split positions. Two candidate partitions:
A: m₀ m₁ m₂ m₃ m₄ m₅
B: m₀ m₁ m₂ m₃ m₄ m₅
2. The shared-prefix insight. A and B share the first group m₀ m₁. The naïve path re-scores that prefix once per candidate that contains it; for n = 6, k = 3 that's C(5, 2) = 10 partitions and a lot of wasted work. The DP scores each prefix once and reuses it.
3. The DP table. dp[k][j] holds the top-K best scores of partitioning the first j modalities into exactly k groups. We fill it row-by-row, left-to-right:
k \ j 1 2 3 4 5 6
k=1
k=2
k=3
The answer for k = 3 lives at the ★ cell: dp[3][6]. Cells marked are infeasible (can't make k groups out of fewer than k modalities).
4. The recurrence — one cell at a time. To fill dp[k][j], try every possible last split position i and combine:
dp[k][j]  =  top-K over  i  of  {  dp[k−1][i]  ⊕  seg_cost(i, j)  }
dp[k−1][i] is already computed (previous row). seg_cost(i, j) is the contribution of the final group [i, j) — closed-form from prefix sums (no inner enumeration). So each cell costs O(j) work, and the whole table is O(K · max_n_mod · n²) instead of the O(2ⁿ) enumerate-and-score path.
Why "progressive" top-K? The DP returns the top top_k partitions by score, then the viability filter (Wilson min_freq, distinct target rates, train/dev rank) walks them in order. If none pass, top_k doubles and the DP re-runs — keeping the common case (a viable winner in the first batch) cheap, while preserving the optimality guarantee in the worst case.
Progressive top-K ^^^^^^^^^^^^^^^^^ The DP returns the **top-K** scored partitions, not all of them. The viability walk consumes that list in metric-desc order; if no candidate is viable in the current top-K we **double** ``top_k`` and re-run the DP, walking only the newly-appeared entries. Repeats until either: * a viable candidate is found, or * the DP returns fewer than ``top_k`` entries — every consecutive partition has been emitted; no viable exists for this feature. Doubling guarantees the search is **exhaustive in the worst case**: the same admissible candidate set is eventually considered as in the legacy enumerate-and-score path. Total work is bounded by :math:`\sim 2 \times` a single DP run at the final ``top_k``. The common case (viable found in the initial top-K) costs :math:`O(K \cdot n^2 \cdot \text{top\_k} \cdot \log \text{top\_k})` ops, **independent of the total combination count** — which scales combinatorially in :math:`n` and ``max_n_mod`` and reached :math:`\sim 8\text{M}` at :math:`n=40,\,\text{max_n_mod}=7` previously. The initial top-K is configurable via the class attribute :attr:`CombinationEvaluator.dp_top_k_initial` (default ``1000``). NaN fan-out path ^^^^^^^^^^^^^^^^ When ``dropna=True`` and the feature has NaNs, the DP runs on the **non-NaN sub-index** to produce base partitions. Each base is then **fanned out** across NaN placements: * NaN folded into each existing group; * NaN as its own group when ``len(base) < max_n_mod``; * plus the degenerate ``[all_non_nan, [NaN]]`` partition. Each variant is scored in closed form (``_kruskal_h_for_combination`` / ``_chi2_assoc_for_combination``) against the **full** per-modality stats — the NaN row is still in the aggregated sample because ``_apply_best_combination`` rebuilt it from raw after the non-NaN DP. Variants are walked sorted desc, dedup'd by partition key across progressive iterations so combinations carried over from a smaller ``top_k`` are not re-tested. What does **not** change ^^^^^^^^^^^^^^^^^^^^^^^^ * The admissible candidate set: consecutive segmentations with :math:`k \le \text{max_n_mod}`. * The :ref:`viability filter `: Wilson ``min_freq`` on train + dev, distinct target rates, rank preservation. * The optimality claim: **for fixed** ``min_freq``\ **,** ``max_n_mod``\ **, and metric, no admissible combination scores higher than the one returned.** The DP is a **search-strategy optimisation**, not a statistical change. Classification tasks -------------------- .. _DPChi2: Pearson :math:`\chi^2` (binary targets) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ For a 2-column contingency table (binary target), each group :math:`g` contributes counts :math:`(n_{0,g},\, n_{1,g})`. With row marginals :math:`R_g = n_{0,g} + n_{1,g}`, column marginals :math:`C_c = \sum_g n_{c,g}`, and grand total :math:`N = \sum_g R_g`, Pearson's statistic is .. math:: \chi^2 = \sum_{g, c} \frac{(O_{g, c} - E_{g, c})^2}{E_{g, c}}, \quad E_{g, c} = \frac{R_g \cdot C_c}{N}. Two key observations: * **Given a fixed number of groups** :math:`k`, the column marginals :math:`C_c` and total :math:`N` depend **only on** :math:`k` (and a constant ``tol`` shift applied to every cell — matching the legacy ``chi2_contingency(xagg + tol)`` call): :math:`C_c = N_c + k\cdot\text{tol}`, :math:`N = N_0 + N_1 + 2k\cdot\text{tol}`. They are **invariant under re-partitioning at fixed** :math:`k`. * Therefore, at fixed :math:`k`, :math:`\chi^2` is **additive over groups**: each group contributes :math:`(O - E)^2 / E` summed over its two cells, with :math:`E` derivable from :math:`(n_{0,g},\, n_{1,g})` and the constants :math:`(C_0, C_1, N)`. The **Yates correction** (subtract :math:`0.5` from :math:`|O - E|` iff the table is exactly :math:`2 \times 2`, matching scipy's default) is applied **only when** :math:`k = 2`, which is again known at the DP level. The DP is therefore run **once per** :math:`k \in [2,\, \text{max_n_mod}]` with the constants :math:`(C_0, C_1, N, \text{yates_flag})` fixed; per-:math:`k` top-K lists are merged and re-truncated: .. math:: \text{seg_cost}_k(i,\, j) = \chi^2\text{ contribution of }[i,\, j)\text{ under } (C_0,\, C_1,\, N,\, \text{yates_flag} = (k = 2)). Cramér's :math:`V = \sqrt{\chi^2 / N_{obs}}` and Tschuprow's :math:`T = V / \sqrt[4]{k - 1}` are monotone transforms of :math:`\chi^2` at fixed :math:`k`, so sorting by either is equivalent to sorting by :math:`\chi^2` **within each** :math:`k` **slice**. The cross-:math:`k` merge re-applies the configured ``sort_by`` so the global top-K is correct under either metric. Statistical equivalence to :func:`scipy.stats.chi2_contingency` is **bit-exact** (parity tests pin the :math:`+\text{tol}` shift, the Yates handling, and the :math:`\text{round}(x / \text{tol}) \cdot \text{tol}` quantisation). .. _CramervCombinations: Cramér's V Combinations ^^^^^^^^^^^^^^^^^^^^^^^ See :ref:`Cramerv` for more details on the metric. .. autoclass:: AutoCarver.combinations.CramervCombinations :members: save, load .. _TschuprowtCombinations: Tschuprow's T Combinations ^^^^^^^^^^^^^^^^^^^^^^^^^^ See :ref:`Tschuprowt` for more details on the metric. .. autoclass:: AutoCarver.combinations.TschuprowtCombinations :members: save, load Regression tasks ---------------- .. _DPKruskal: Kruskal-Wallis H (continuous targets) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Given a partition with :math:`n_g` observations per group, rank sum :math:`R_g`, total :math:`N = \sum_g n_g`, and **tie correction** :math:`T = 1 - \sum_i (t_i^3 - t_i) / (N^3 - N)` (depends only on the pooled :math:`y` multiset), the Kruskal-Wallis statistic is .. math:: H = \frac{1}{T}\left[\,\frac{12}{N(N+1)} \sum_g \frac{R_g^2}{n_g} - 3(N+1)\,\right]. Two key observations: * **Per-modality** :math:`(R_i, n_i)`, the total :math:`N`, and the tie correction :math:`T` depend only on the raw feature ranking — *not* on the partition. They are computed **once** by ranking :math:`y` once over the pooled sample (see ``_modality_rank_stats``). * :math:`\sum_g R_g^2 / n_g` is **additive over groups**. With prefix sums ``R_prefix`` and ``n_prefix`` over the raw modalities, a single interval's contribution is closed-form: .. math:: \text{seg_cost}(i, j) = \frac{\big(\text{R_prefix}[j] - \text{R_prefix}[i]\big)^2} {\text{n_prefix}[j] - \text{n_prefix}[i]}. The DP maximises :math:`\sum_g \text{seg_cost}` over partitions; :math:`H` is recovered at the end by applying the constant prefactor :math:`12 / (N(N+1))`, the constant offset :math:`-3(N+1)`, and dividing by :math:`T`. Statistical equivalence to :func:`scipy.stats.kruskal` is **bit-exact** — the DP only re-orders the search. .. _KruskalCombinations: Kruskal's H Combinations ^^^^^^^^^^^^^^^^^^^^^^^^ See :ref:`kruskal` for more details on the metric. .. autoclass:: AutoCarver.combinations.KruskalCombinations :members: save, load