Combinations

Combinations are at the core of Carvers. They are used to identify the best combination from all possible combinations with up to max_n_mod modalities.

A pre-built CombinationEvaluator instance can be passed to any carver via the combination_evaluator keyword. Each subclass defaults to a task-appropriate metric: TschuprowtCombinations for BinaryCarver / MulticlassCarver, KruskalCombinations for ContinuousCarver.

The animation below starts from the six ordered bins a QuantitativeDiscretizer produces (its final state — see Complete pipeline for continuous and discrete features) and shows the core step: every consecutive grouping into max_n_mod groups is scored by its association with the binary target (Tschuprow’s T) and the table fills best-first in growing top_k batches (the progressive top-K DP search). The highest-scoring grouping that passes the viability filter is kept (gold row). Adjacent bins sharing a colour in a row are merged into one group.

Combinations search animation — ordered bins, then consecutive groupings ranked by Tschuprow's T filling a table best-first, with the selected grouping highlighted
class AutoCarver.combinations.CombinationEvaluator(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

CombinationEvaluator class to evaluate the best combination of modalities for a feature.

Parameters:
  • verbose (bool, optional) – Whether to print progress / statistics, by default False.

  • target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

get_best_combination(feature: BaseFeature, xagg: Series | DataFrame | None, xagg_dev: Series | DataFrame | None = None, *, max_n_mod: int, min_freq: float, dropna: bool, min_freq_alpha: float = 0.05) dict | None

Computes the best combination of modalities for the feature.

Parameters:
  • feature (BaseFeature) – Feature being carved.

  • xagg (pd.Series | pd.DataFrame | None) – Train-sample aggregation (per-modality y-lists for continuous, crosstab for binary).

  • xagg_dev (pd.Series | pd.DataFrame | None, optional) – Dev-sample aggregation, used for the robustness veto.

  • max_n_mod (int) – Maximum number of modalities allowed in the returned combination.

  • min_freq (float) – Minimum per-modality frequency, tested via a Wilson upper bound at significance min_freq_alpha (see Minimum-frequency viability test (Wilson score interval)).

  • dropna (bool) – Whether to group NaN as another modality (fans out NaN placements after the non-NaN DP, see Search strategy — interval dynamic programming (DP) with progressive top-K).

  • min_freq_alpha (float, default 0.05) – Two-sided significance level of the Wilson score interval. Smaller → wider CI → fewer rejections. alpha=1 recovers the legacy strict-threshold behaviour.

classmethod load(file: str | dict) CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:

file (str | dict) – String of .json file name or content of the file.

Returns:

A ready-to-use CombinationEvaluator

Return type:

CombinationEvaluator

save(file_name: str) None

Saves CombinationEvaluator to .json file.

Parameters:

file_name (str) – String of .json file name

Minimum-frequency viability test (Wilson score interval)

A candidate combination is viable on a sample only if every grouped modality is sufficiently frequent. Comparing \(\hat p = \text{count} / n_{obs}\) directly against min_freq is noisy on small modalities — a modality with \(\hat p = 4\%\) out of \(n_{obs}=100\) would be rejected against min_freq=5%, even though its 95% confidence interval comfortably straddles 5%. AutoCarver instead tests the one-sided question “is this modality’s true proportion significantly below min_freq ?” at level \(\alpha\), using a Wilson score interval — the small-sample-stable proportion interval recommended over Wald in Brown, Cai & DasGupta (2001).

Decision rule. Modality \(m\) is declared under-represented iff the upper bound of the two-sided Wilson interval for \(\hat p_m\) is strictly below min_freq:

\[\text{UB}(\hat p, n, \alpha) = \frac{\hat p + z^2/(2n)}{1 + z^2/n} + \frac{z}{1 + z^2/n}\sqrt{\frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2}}\]

with \(z = \Phi^{-1}(1 - \alpha/2)\) (two-sided z-score; \(\alpha=0.05\) gives \(z \approx 1.96\)). Reject iff \(\text{UB}(\hat p_m, n_{obs}, \alpha) < \text{min_freq}\).

Properties.

  • Asymptotic equivalence: as \(n_{obs} \to \infty\), \(\text{UB} \to \hat p\), so the test converges to the legacy strict threshold \(\hat p < \text{min_freq}\).

  • Small-sample conservativity: a modality with very few observations cannot be rejected (the CI is too wide to fall below min_freq), preventing premature merges driven by sampling noise.

  • \(n_{obs} = 0\) returns \(\text{UB} = 1.0\), so empty groups are never rejected by this test (other checks catch them).

Where the test fires.

  • Inside each Discretizer to gate raw modalities before the combination search. Carvers discretize at min_freq / 2 so this gate runs at the halved threshold, giving the combination evaluator a finer granularity to recombine.

  • Inside CombinationEvaluator viability checks on both train and dev samples for every candidate combination during the search.

Tuning. Set via DiscretizerConfig.min_freq_alpha (default 0.05). Smaller \(\alpha\) → wider CI → fewer rejections → less merging; larger \(\alpha\) → tighter CI → more rejections → more aggressive merging. \(\alpha = 1\) recovers the legacy strict-threshold behaviour (\(\text{UB}\) collapses to \(\hat p\)).

AutoCarver.discretizers.utils.frequency_ci.wilson_upper_bound(count: ndarray | int | float, nobs: int, alpha: float) ndarray | float

Upper bound of the two-sided Wilson score interval for count / nobs.

Parameters:
  • count (array-like or scalar) – Observed successes. Accepts integer counts or float counts (e.g. weighted/aggregated frequencies).

  • nobs (int) – Number of trials. Must be >= 0; returns 1.0 when nobs == 0 so callers treat empty samples as non-significant.

  • alpha (float) – Two-sided significance level (e.g. 0.05 for a 95% interval).

Returns:

Wilson upper bound, same shape as count.

Return type:

array-like or scalar

AutoCarver.discretizers.utils.frequency_ci.is_significantly_below(count: ndarray | int | float, nobs: int, min_freq: float, alpha: float) ndarray | bool

Whether the observed proportion count / nobs is significantly below min_freq.

A modality is significantly below min_freq when the Wilson upper bound of its observed proportion is strictly below min_freq.

Search strategy — interval dynamic programming (DP) with progressive top-K

For fixed min_freq, max_n_mod and association metric, AutoCarver returns the partition that maximises the metric among admissible candidates. The DP described below is a search-strategy optimisation; it does not prune the candidate set and does not change the statistical claim. Bit-exact agreement with the legacy enumerate-and-score path is pinned by parity tests (tests/combinations/binary/test_dp_chi2_parity.py, tests/combinations/continuous/test_dp_kruskal_parity.py).

The search problem

For a feature with raw modalities \(m_0, \dots, m_{n-1}\) already ordered (by ordinal rank, target rate, or numeric quantile), the carver searches over consecutive segmentations with at most max_n_mod groups: a partition is fully determined by integer split positions \(0 = s_0 < s_1 < \dots < s_k = n\) with \(k \le \text{max_n_mod}\). The chosen partition maximises the association metric subject to the viability filter (Wilson min_freq on train + dev, distinct target rates, preserved rank between train and dev when dev is provided).

The legacy path enumerated every admissible partition, scored each, then walked them in metric-desc order. This is correct but wasteful — only the top handful of candidates ever survive the viability walk.

The DP idea

The DP exploits two properties shared by both supported metrics (Kruskal-Wallis \(H\) for continuous targets, Pearson \(\chi^2\) for binary targets):

  1. Segmentation structure. A partition is a sequence of disjoint consecutive intervals \([s_g, s_{g+1})\). Sub-problems factorise over the right boundary \(j\) and the number of groups \(k\).

  2. Additive decomposability of the metric over groups, given fixed \(k\). Both \(H\) and \(\chi^2\) reduce — at fixed \(k\) and after factoring out \(k\)-dependent normalising constants — to a sum over groups of a quantity that depends only on a single interval \([i, j)\) of raw modalities.

We therefore run an interval DP indexed by \((k, j)\) whose state is the top-K prefixes (by partial score) ending at split \(j\) with \(k\) groups:

\[\text{dp}[k][j] = \operatorname*{top\text{-}K}_{i \in [k-1,\, j)} \big\{\, \text{dp}[k-1][i] \oplus \text{seg_cost}(i,\, j)\, \big\}\]

The final candidate list is \(\bigcup_k \text{dp}[k][n]\), sorted desc and truncated to top_k.

How the DP fills its table — a worked sketch
Why it beats enumerate-and-score, in one picture.
1. The search space. Take a feature with 6 ordered raw modalities. A partition with k = 3 groups is fully determined by 2 internal split positions. Two candidate partitions:
A: m₀ m₁ m₂ m₃ m₄ m₅
B: m₀ m₁ m₂ m₃ m₄ m₅
2. The shared-prefix insight. A and B share the first group m₀ m₁. The naïve path re-scores that prefix once per candidate that contains it; for n = 6, k = 3 that's C(5, 2) = 10 partitions and a lot of wasted work. The DP scores each prefix once and reuses it.
3. The DP table. dp[k][j] holds the top-K best scores of partitioning the first j modalities into exactly k groups. We fill it row-by-row, left-to-right:
k \ j 1 2 3 4 5 6
k=1
k=2
k=3
The answer for k = 3 lives at the ★ cell: dp[3][6]. Cells marked are infeasible (can't make k groups out of fewer than k modalities).
4. The recurrence — one cell at a time. To fill dp[k][j], try every possible last split position i and combine:
dp[k][j]  =  top-K over  i  of  {  dp[k−1][i]  ⊕  seg_cost(i, j)  }
dp[k−1][i] is already computed (previous row). seg_cost(i, j) is the contribution of the final group [i, j) — closed-form from prefix sums (no inner enumeration). So each cell costs O(j) work, and the whole table is O(K · max_n_mod · n²) instead of the O(2ⁿ) enumerate-and-score path.
Why "progressive" top-K? The DP returns the top top_k partitions by score, then the viability filter (Wilson min_freq, distinct target rates, train/dev rank) walks them in order. If none pass, top_k doubles and the DP re-runs — keeping the common case (a viable winner in the first batch) cheap, while preserving the optimality guarantee in the worst case.

Progressive top-K

The DP returns the top-K scored partitions, not all of them. The viability walk consumes that list in metric-desc order; if no candidate is viable in the current top-K we double top_k and re-run the DP, walking only the newly-appeared entries. Repeats until either:

  • a viable candidate is found, or

  • the DP returns fewer than top_k entries — every consecutive partition has been emitted; no viable exists for this feature.

Doubling guarantees the search is exhaustive in the worst case: the same admissible candidate set is eventually considered as in the legacy enumerate-and-score path. Total work is bounded by \(\sim 2 \times\) a single DP run at the final top_k. The common case (viable found in the initial top-K) costs \(O(K \cdot n^2 \cdot \text{top\_k} \cdot \log \text{top\_k})\) ops, independent of the total combination count — which scales combinatorially in \(n\) and max_n_mod and reached \(\sim 8\text{M}\) at \(n=40,\,\text{max_n_mod}=7\) previously.

The initial top-K is configurable via the class attribute CombinationEvaluator.dp_top_k_initial (default 1000).

NaN fan-out path

When dropna=True and the feature has NaNs, the DP runs on the non-NaN sub-index to produce base partitions. Each base is then fanned out across NaN placements:

  • NaN folded into each existing group;

  • NaN as its own group when len(base) < max_n_mod;

  • plus the degenerate [all_non_nan, [NaN]] partition.

Each variant is scored in closed form (_kruskal_h_for_combination / _chi2_assoc_for_combination) against the full per-modality stats — the NaN row is still in the aggregated sample because _apply_best_combination rebuilt it from raw after the non-NaN DP. Variants are walked sorted desc, dedup’d by partition key across progressive iterations so combinations carried over from a smaller top_k are not re-tested.

What does not change

  • The admissible candidate set: consecutive segmentations with \(k \le \text{max_n_mod}\).

  • The viability filter: Wilson min_freq on train + dev, distinct target rates, rank preservation.

  • The optimality claim: for fixed min_freq, max_n_mod, and metric, no admissible combination scores higher than the one returned.

The DP is a search-strategy optimisation, not a statistical change.

Classification tasks

Pearson \(\chi^2\) (binary targets)

For a 2-column contingency table (binary target), each group \(g\) contributes counts \((n_{0,g},\, n_{1,g})\). With row marginals \(R_g = n_{0,g} + n_{1,g}\), column marginals \(C_c = \sum_g n_{c,g}\), and grand total \(N = \sum_g R_g\), Pearson’s statistic is

\[\chi^2 = \sum_{g, c} \frac{(O_{g, c} - E_{g, c})^2}{E_{g, c}}, \quad E_{g, c} = \frac{R_g \cdot C_c}{N}.\]

Two key observations:

  • Given a fixed number of groups \(k\), the column marginals \(C_c\) and total \(N\) depend only on \(k\) (and a constant tol shift applied to every cell — matching the legacy chi2_contingency(xagg + tol) call): \(C_c = N_c + k\cdot\text{tol}\), \(N = N_0 + N_1 + 2k\cdot\text{tol}\). They are invariant under re-partitioning at fixed \(k\).

  • Therefore, at fixed \(k\), \(\chi^2\) is additive over groups: each group contributes \((O - E)^2 / E\) summed over its two cells, with \(E\) derivable from \((n_{0,g},\, n_{1,g})\) and the constants \((C_0, C_1, N)\). The Yates correction (subtract \(0.5\) from \(|O - E|\) iff the table is exactly \(2 \times 2\), matching scipy’s default) is applied only when \(k = 2\), which is again known at the DP level.

The DP is therefore run once per \(k \in [2,\, \text{max_n_mod}]\) with the constants \((C_0, C_1, N, \text{yates_flag})\) fixed; per-\(k\) top-K lists are merged and re-truncated:

\[\text{seg_cost}_k(i,\, j) = \chi^2\text{ contribution of }[i,\, j)\text{ under } (C_0,\, C_1,\, N,\, \text{yates_flag} = (k = 2)).\]

Cramér’s \(V = \sqrt{\chi^2 / N_{obs}}\) and Tschuprow’s \(T = V / \sqrt[4]{k - 1}\) are monotone transforms of \(\chi^2\) at fixed \(k\), so sorting by either is equivalent to sorting by \(\chi^2\) within each \(k\) slice. The cross-\(k\) merge re-applies the configured sort_by so the global top-K is correct under either metric. Statistical equivalence to scipy.stats.chi2_contingency() is bit-exact (parity tests pin the \(+\text{tol}\) shift, the Yates handling, and the \(\text{round}(x / \text{tol}) \cdot \text{tol}\) quantisation).

Cramér’s V Combinations

See Cramér’s V for more details on the metric.

class AutoCarver.combinations.CramervCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Cramér’s V based combination evaluation toolkit.

Same DP search as TschuprowtCombinations (see Pearson \chi^2 (binary targets)); only the sort_by key differs. \(V = \sqrt{\chi^2 / N_{obs}}\) is a monotone transform of \(\chi^2\) at fixed \(k\).

Parameters:
  • verbose (bool, optional) – Whether to print progress / statistics, by default False.

  • target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:

file (str | dict) – String of .json file name or content of the file.

Returns:

A ready-to-use CombinationEvaluator

Return type:

CombinationEvaluator

save(file_name: str) None

Saves CombinationEvaluator to .json file.

Parameters:

file_name (str) – String of .json file name

Tschuprow’s T Combinations

See Tschuprow’s T for more details on the metric.

class AutoCarver.combinations.TschuprowtCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Tschuprow’s T based combination evaluation toolkit.

Search uses progressive top-K interval DP over the closed-form Pearson \(\chi^2\) decomposition (per-k DP with constant column marginals, Yates correction iff k == 2). Statistically equivalent to scipy.stats.chi2_contingency() — bit-exact agreement pinned by parity tests.

Parameters:
  • verbose (bool, optional) – Whether to print progress / statistics, by default False.

  • target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:

file (str | dict) – String of .json file name or content of the file.

Returns:

A ready-to-use CombinationEvaluator

Return type:

CombinationEvaluator

save(file_name: str) None

Saves CombinationEvaluator to .json file.

Parameters:

file_name (str) – String of .json file name

Regression tasks

Kruskal-Wallis H (continuous targets)

Given a partition with \(n_g\) observations per group, rank sum \(R_g\), total \(N = \sum_g n_g\), and tie correction \(T = 1 - \sum_i (t_i^3 - t_i) / (N^3 - N)\) (depends only on the pooled \(y\) multiset), the Kruskal-Wallis statistic is

\[H = \frac{1}{T}\left[\,\frac{12}{N(N+1)} \sum_g \frac{R_g^2}{n_g} - 3(N+1)\,\right].\]

Two key observations:

  • Per-modality \((R_i, n_i)\), the total \(N\), and the tie correction \(T\) depend only on the raw feature ranking — not on the partition. They are computed once by ranking \(y\) once over the pooled sample (see _modality_rank_stats).

  • \(\sum_g R_g^2 / n_g\) is additive over groups. With prefix sums R_prefix and n_prefix over the raw modalities, a single interval’s contribution is closed-form:

    \[\text{seg_cost}(i, j) = \frac{\big(\text{R_prefix}[j] - \text{R_prefix}[i]\big)^2} {\text{n_prefix}[j] - \text{n_prefix}[i]}.\]

The DP maximises \(\sum_g \text{seg_cost}\) over partitions; \(H\) is recovered at the end by applying the constant prefactor \(12 / (N(N+1))\), the constant offset \(-3(N+1)\), and dividing by \(T\). Statistical equivalence to scipy.stats.kruskal() is bit-exact — the DP only re-orders the search.

Kruskal’s H Combinations

See Kruskal-Wallis’ H test statistic for more details on the metric.

class AutoCarver.combinations.KruskalCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Kruskal-Wallis’ H based combination evaluation toolkit.

Search uses progressive top-K interval DP over the closed-form Kruskal-Wallis H decomposition (rank once over pooled y, prefix-sum per-modality rank stats). Statistically equivalent to scipy.stats.kruskal() — bit-exact agreement pinned by parity tests.

Parameters:
  • verbose (bool, optional) – Whether to print progress / statistics, by default False.

  • target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:

file (str | dict) – String of .json file name or content of the file.

Returns:

A ready-to-use CombinationEvaluator

Return type:

CombinationEvaluator

save(file_name: str) None

Saves CombinationEvaluator to .json file.

Parameters:

file_name (str) – String of .json file name