Combinations

Combinations are at the core of Carvers. They are used to identify the best combination from all possible combinations with up to max_n_mod modalities.

A pre-built CombinationEvaluator instance can be passed to any carver via the combination_evaluator keyword. Each subclass defaults to a task-appropriate metric: TschuprowtCombinations for BinaryCarver / MulticlassCarver, KruskalCombinations for ContinuousCarver.

The animation below starts from the six ordered bins a QuantitativeDiscretizer produces (its final state — see Complete pipeline for continuous and discrete features) and shows the core step: every consecutive grouping into max_n_mod groups is scored by its association with the binary target (Tschuprow’s T) and the table fills best-first in growing top_k batches (the progressive top-K DP search). The highest-scoring grouping that passes the viability filter is kept (gold row). Adjacent bins sharing a colour in a row are merged into one group.

class AutoCarver.combinations.CombinationEvaluator(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

CombinationEvaluator class to evaluate the best combination of modalities for a feature.

Parameters:

verbose (bool, optional) – Whether to print progress / statistics, by default False.
target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

Computes the best combination of modalities for the feature.

Parameters:

feature (BaseFeature) – Feature being carved.
xagg (pd.Series | pd.DataFrame | None) – Train-sample aggregation (per-modality y-lists for continuous, crosstab for binary).
xagg_dev (pd.Series | pd.DataFrame | None, optional) – Dev-sample aggregation, used for the robustness veto.
max_n_mod (int) – Maximum number of modalities allowed in the returned combination.
min_freq (float) – Minimum per-modality frequency, tested via a Wilson upper bound at significance min_freq_alpha (see Minimum-frequency test (Wilson score interval)).
dropna (bool) – Whether to group NaN as another modality (fans out NaN placements after the non-NaN DP, see Search strategy — interval dynamic programming (DP) with progressive top-K).
min_freq_alpha (float, default 0.05) – Two-sided significance level of the Wilson score interval. Smaller → wider CI → fewer rejections. alpha=1 recovers the legacy strict-threshold behaviour.

classmethod load(file: str | dict) → CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:: file (str | dict) – String of .json file name or content of the file.
Returns:: A ready-to-use CombinationEvaluator
Return type:: CombinationEvaluator

save(file_name: str) → None

Saves CombinationEvaluator to .json file.

Parameters:: file_name (str) – String of .json file name

The highest-scoring grouping is not necessarily the one that is kept: each candidate must clear the viability filter (minimum frequency via a Wilson score interval, distinct consecutive target rates, and train/dev rank preservation). That filter is documented on its own page.

Search strategy — interval dynamic programming (DP) with progressive top-K

For fixed min_freq, max_n_mod and association metric, AutoCarver returns the partition that maximises the metric among admissible candidates. The DP described below is a search-strategy optimisation; it does not prune the candidate set and does not change the statistical claim. Bit-exact agreement with the legacy enumerate-and-score path is pinned by parity tests (tests/combinations/binary/test_dp_chi2_parity.py, tests/combinations/continuous/test_dp_kruskal_parity.py).

The search problem

For a feature with raw modalities \(m_0, \dots, m_{n-1}\) already ordered (by ordinal rank, target rate, or numeric quantile), the carver searches over consecutive segmentations with at most max_n_mod groups: a partition is fully determined by integer split positions \(0 = s_0 < s_1 < \dots < s_k = n\) with \(k \le \text{max_n_mod}\). The chosen partition maximises the association metric subject to the viability filter (Wilson min_freq on train + dev, distinct target rates, preserved rank between train and dev when dev is provided).

The legacy path enumerated every admissible partition, scored each, then walked them in metric-desc order. This is correct but wasteful — only the top handful of candidates ever survive the viability walk.

The DP idea

The DP exploits two properties shared by both supported metrics (Kruskal-Wallis \(H\) for continuous targets, Pearson \(\chi^2\) for binary targets):

Segmentation structure. A partition is a sequence of disjoint consecutive intervals \([s_g, s_{g+1})\). Sub-problems factorise over the right boundary \(j\) and the number of groups \(k\).
Additive decomposability of the metric over groups, given fixed \(k\). Both \(H\) and \(\chi^2\) reduce — at fixed \(k\) and after factoring out \(k\)-dependent normalising constants — to a sum over groups of a quantity that depends only on a single interval \([i, j)\) of raw modalities.

We therefore run an interval DP indexed by \((k, j)\) whose state is the top-K prefixes (by partial score) ending at split \(j\) with \(k\) groups:

\[\text{dp}[k][j] = \operatorname*{top\text{-}K}_{i \in [k-1,\, j)} \big\{\, \text{dp}[k-1][i] \oplus \text{seg_cost}(i,\, j)\, \big\}\]

The final candidate list is \(\bigcup_k \text{dp}[k][n]\), sorted desc and truncated to top_k.

How the DP fills its table — a worked sketch

Why it beats enumerate-and-score, in one picture.

1. The search space. Take a feature with 6 ordered raw modalities. A partition with k = 3 groups is fully determined by 2 internal split positions. Two candidate partitions:

      A:
      m₀ m₁
      │
      m₂ m₃
      │
      m₄ m₅
    

      B:
      m₀ m₁
      │
      m₂ m₃ m₄
      │
      m₅
    

2. The shared-prefix insight. A and B share the first group m₀ m₁. The naïve path re-scores that prefix once per candidate that contains it; for n = 6, k = 3 that's C(5, 2) = 10 partitions and a lot of wasted work. The DP scores each prefix once and reuses it.

3. The DP table. dp[k][j] holds the top-K best scores of partitioning the first j modalities into exactly k groups. We fill it row-by-row, left-to-right:

k \ j	1	2	3	4	5	6
k=1	●	●	●	●	●	●
k=2	—	●	●	●	●	●
k=3	—	—	●	●	●	★

The answer for k = 3 lives at the ★ cell: dp[3][6]. Cells marked — are infeasible (can't make k groups out of fewer than k modalities).

4. The recurrence — one cell at a time. To fill dp[k][j], try every possible last split position i and combine:

    dp[k][j]
     = 
    top-K over
     i 
    of
     { 
    dp[k−1][i]
     ⊕ 
    seg_cost(i, j)
     }
  

dp[k−1][i] is already computed (previous row). seg_cost(i, j) is the contribution of the final group [i, j) — closed-form from prefix sums (no inner enumeration). So each cell costs O(j) work, and the whole table is O(K · max_n_mod · n²) instead of the O(2ⁿ) enumerate-and-score path.

Why "progressive" top-K? The DP returns the top top_k partitions by score, then the viability filter (Wilson min_freq, distinct target rates, train/dev rank) walks them in order. If none pass, top_k doubles and the DP re-runs — keeping the common case (a viable winner in the first batch) cheap, while preserving the optimality guarantee in the worst case.

Progressive top-K

The DP returns the top-K scored partitions, not all of them. The viability walk consumes that list in metric-desc order; if no candidate is viable in the current top-K we double top_k and re-run the DP, walking only the newly-appeared entries. Repeats until either:

a viable candidate is found, or
the DP returns fewer than top_k entries — every consecutive partition has been emitted; no viable exists for this feature.

Doubling guarantees the search is exhaustive in the worst case: the same admissible candidate set is eventually considered as in the legacy enumerate-and-score path. Total work is bounded by \(\sim 2 \times\) a single DP run at the final top_k. The common case (viable found in the initial top-K) costs \(O(K \cdot n^2 \cdot \text{top\_k} \cdot \log \text{top\_k})\) ops, independent of the total combination count — which scales combinatorially in \(n\) and max_n_mod and reached \(\sim 8\text{M}\) at \(n=40,\,\text{max_n_mod}=7\) previously.

The initial top-K is configurable via the class attribute CombinationEvaluator.dp_top_k_initial (default 1000).

NaN fan-out path

When dropna=True and the feature has NaNs, the DP runs on the non-NaN sub-index to produce base partitions. Each base is then fanned out across NaN placements:

NaN folded into each existing group;
NaN as its own group when len(base) < max_n_mod;
plus the degenerate [all_non_nan, [NaN]] partition.

Each variant is scored in closed form (_kruskal_h_for_combination / _chi2_assoc_for_combination) against the full per-modality stats — the NaN row is still in the aggregated sample because _apply_best_combination rebuilt it from raw after the non-NaN DP. Variants are walked sorted desc, dedup’d by partition key across progressive iterations so combinations carried over from a smaller top_k are not re-tested.

What does not change

The admissible candidate set: consecutive segmentations with \(k \le \text{max_n_mod}\).
The viability filter: Wilson min_freq on train + dev, distinct target rates, rank preservation.
The optimality claim: for fixed min_freq, max_n_mod, and metric, no admissible combination scores higher than the one returned.

The DP is a search-strategy optimisation, not a statistical change.

Target rates

Every combination evaluator carries a target rate — the per-modality summary of the target that the carver reports and, crucially, the scalar the viability filter orders by. It is passed via the target_rate keyword and defaults to a task-appropriate choice (the target mean, TargetMean, for every built-in evaluator).

A target rate plays two distinct roles:

Display statistic. One value per modality, stored on feature.statistics and surfaced in the carved-feature summary, so the grouping can be read off in interpretable units (an event rate, an odds ratio, a mean target, …).
Ordering key for viability. The same per-modality value is what the distinct-rate test requires to differ between consecutive modalities, and what the train/dev rank-preservation veto sorts on. A combination whose target-rate ordering collapses or flips is rejected.

Because of this dual role, two properties decide whether a candidate rate is a good fit:

It must be an orderable scalar. The viability checks need a single value per modality with a meaningful monotone ordering. A symmetric measure (e.g. a Gini-style impurity, maximal at \(p = 0.5\)) is fine as a display statistic but a poor ordering key.
Decomposability buys a fast path. When a rate can be reconstructed from per-raw-modality sufficient statistics it can opt into a closed-form path (compute_from_stats) that costs \(O(k)\) per combination instead of re-aggregating the raw sample on every candidate. The continuous mean does this; rates that need the full value multiset (median, quantiles) cannot and fall back to the general aggregation path.

class AutoCarver.combinations.utils.TargetRate

Target rate class.

Generic over XAgg for the inner _compute() worker so that binary (DataFrame crosstabs) and continuous (Series of y-lists) subclasses don’t violate LSP by narrowing the worker’s parameter type. The outer compute() keeps a wide Series | DataFrame | None signature because call sites in CombinationEvaluator and BaseCarver carry that union directly from AggregatedSample.raw / pretty-printer plumbing.

abstract compute(xagg: Series | DataFrame) → DataFrame

abstract compute(xagg: None) → None

Computes the target rate.

Parameters:: xagg (pd.Series | pd.DataFrame | None) – A crosstab (binary) or Series-of-y-lists (continuous).
Returns:: Target rate frame, or None if xagg was None.
Return type:: pd.DataFrame | None

Binary target rates

For binary (and multiclass) targets the per-modality input is a two-column crosstab \((n_0, n_1)\), so every rate below is closed-form. The default is the event rate \(p = n_1 / (n_0 + n_1)\) (TargetMean).

class AutoCarver.combinations.binary.binary_target_rates.TargetMean: Mean target rate class.

Continuous target rates

For continuous targets the per-modality input is the multiset of target values. The default TargetMean is decomposable from per-modality \((n, \sum y)\) aggregates and therefore implements the closed-form compute_from_stats fast path; TargetMedian is not decomposable from sums and uses the general aggregation path.

class AutoCarver.combinations.continuous.continuous_target_rates.TargetMean: Mean target rate class.

class AutoCarver.combinations.continuous.continuous_target_rates.TargetMedian: Median of target per class.

Custom target rates

A custom rate subclasses BinaryTargetRate or ContinuousTargetRate and implements _compute (one value per modality). To make it serialisable through save / load, add it to the evaluator’s _target_rate_classes registry. When the rate is additively decomposable from per-modality sums, override compute_from_stats so the search uses the closed-form path.

Candidate extensions that fit this contract (not yet implemented):

Binary, closed-form: logit / log-odds \(\log(n_1 / n_0)\), and a column-normalised weight-of-evidence \(\log\!\big((n_1/\textstyle\sum n_1)\,/\,(n_0/\textstyle\sum n_0)\big)\).
Continuous, decomposable (extend the carried stats with \(\sum y^2\)): variance and standard deviation.
Continuous, non-decomposable (general path only): inter-quartile range, arbitrary quantiles, and trimmed/robust location — useful for heavy-tailed targets where the mean ordering is unstable.

Classification tasks

Pearson \(\chi^2\) (binary targets)

For a 2-column contingency table (binary target), each group \(g\) contributes counts \((n_{0,g},\, n_{1,g})\). With row marginals \(R_g = n_{0,g} + n_{1,g}\), column marginals \(C_c = \sum_g n_{c,g}\), and grand total \(N = \sum_g R_g\), Pearson’s statistic is

\[\chi^2 = \sum_{g, c} \frac{(O_{g, c} - E_{g, c})^2}{E_{g, c}}, \quad E_{g, c} = \frac{R_g \cdot C_c}{N}.\]

Two key observations:

Given a fixed number of groups \(k\), the column marginals \(C_c\) and total \(N\) depend only on \(k\) (and a constant tol shift applied to every cell — matching the legacy chi2_contingency(xagg + tol) call): \(C_c = N_c + k\cdot\text{tol}\), \(N = N_0 + N_1 + 2k\cdot\text{tol}\). They are invariant under re-partitioning at fixed \(k\).
Therefore, at fixed \(k\), \(\chi^2\) is additive over groups: each group contributes \((O - E)^2 / E\) summed over its two cells, with \(E\) derivable from \((n_{0,g},\, n_{1,g})\) and the constants \((C_0, C_1, N)\). The Yates correction (subtract \(0.5\) from \(|O - E|\) iff the table is exactly \(2 \times 2\), matching scipy’s default) is applied only when \(k = 2\), which is again known at the DP level.

The DP is therefore run once per \(k \in [2,\, \text{max_n_mod}]\) with the constants \((C_0, C_1, N, \text{yates_flag})\) fixed; per-\(k\) top-K lists are merged and re-truncated:

\[\text{seg_cost}_k(i,\, j) = \chi^2\text{ contribution of }[i,\, j)\text{ under } (C_0,\, C_1,\, N,\, \text{yates_flag} = (k = 2)).\]

Cramér’s \(V = \sqrt{\chi^2 / N_{obs}}\) and Tschuprow’s \(T = V / \sqrt[4]{k - 1}\) are monotone transforms of \(\chi^2\) at fixed \(k\), so sorting by either is equivalent to sorting by \(\chi^2\) within each \(k\) slice. The cross-\(k\) merge re-applies the configured sort_by so the global top-K is correct under either metric. Statistical equivalence to scipy.stats.chi2_contingency() is bit-exact (parity tests pin the \(+\text{tol}\) shift, the Yates handling, and the \(\text{round}(x / \text{tol}) \cdot \text{tol}\) quantisation).

Cramér’s V Combinations

See Cramér’s V for more details on the metric.

class AutoCarver.combinations.CramervCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Cramér’s V based combination evaluation toolkit.

Same DP search as TschuprowtCombinations (see Pearson \chi^2 (binary targets)); only the sort_by key differs. \(V = \sqrt{\chi^2 / N_{obs}}\) is a monotone transform of \(\chi^2\) at fixed \(k\).

Parameters:

verbose (bool, optional) – Whether to print progress / statistics, by default False.
target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) → CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:: file (str | dict) – String of .json file name or content of the file.
Returns:: A ready-to-use CombinationEvaluator
Return type:: CombinationEvaluator

save(file_name: str) → None

Saves CombinationEvaluator to .json file.

Parameters:: file_name (str) – String of .json file name

Tschuprow’s T Combinations

See Tschuprow’s T for more details on the metric.

class AutoCarver.combinations.TschuprowtCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Tschuprow’s T based combination evaluation toolkit.

Search uses progressive top-K interval DP over the closed-form Pearson \(\chi^2\) decomposition (per-k DP with constant column marginals, Yates correction iff k == 2). Statistically equivalent to scipy.stats.chi2_contingency() — bit-exact agreement pinned by parity tests.

Parameters:

verbose (bool, optional) – Whether to print progress / statistics, by default False.
target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) → CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:: file (str | dict) – String of .json file name or content of the file.
Returns:: A ready-to-use CombinationEvaluator
Return type:: CombinationEvaluator

save(file_name: str) → None

Saves CombinationEvaluator to .json file.

Parameters:: file_name (str) – String of .json file name

Regression tasks

Kruskal-Wallis H (continuous targets)

Given a partition with \(n_g\) observations per group, rank sum \(R_g\), total \(N = \sum_g n_g\), and tie correction \(T = 1 - \sum_i (t_i^3 - t_i) / (N^3 - N)\) (depends only on the pooled \(y\) multiset), the Kruskal-Wallis statistic is

\[H = \frac{1}{T}\left[\,\frac{12}{N(N+1)} \sum_g \frac{R_g^2}{n_g} - 3(N+1)\,\right].\]

Two key observations:

Per-modality \((R_i, n_i)\), the total \(N\), and the tie correction \(T\) depend only on the raw feature ranking — not on the partition. They are computed once by ranking \(y\) once over the pooled sample (see _modality_rank_stats).
\(\sum_g R_g^2 / n_g\) is additive over groups. With prefix sums R_prefix and n_prefix over the raw modalities, a single interval’s contribution is closed-form:

\[\text{seg_cost}(i, j) = \frac{\big(\text{R_prefix}[j] - \text{R_prefix}[i]\big)^2} {\text{n_prefix}[j] - \text{n_prefix}[i]}.\]

The DP maximises \(\sum_g \text{seg_cost}\) over partitions; \(H\) is recovered at the end by applying the constant prefactor \(12 / (N(N+1))\), the constant offset \(-3(N+1)\), and dividing by \(T\). Statistical equivalence to scipy.stats.kruskal() is bit-exact — the DP only re-orders the search.

Kruskal’s H Combinations

See Kruskal-Wallis’ H test statistic for more details on the metric.

class AutoCarver.combinations.KruskalCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)

Kruskal-Wallis’ H based combination evaluation toolkit.

Search uses progressive top-K interval DP over the closed-form Kruskal-Wallis H decomposition (rank once over pooled y, prefix-sum per-modality rank stats). Statistically equivalent to scipy.stats.kruskal() — bit-exact agreement pinned by parity tests.

Parameters:

verbose (bool, optional) – Whether to print progress / statistics, by default False.
target_rate (TargetRate, optional) – Target rate strategy. If None, each evaluator subclass picks its own default in _init_target_rate().

classmethod load(file: str | dict) → CombinationEvaluator

Allows one to load a CombinationEvaluator saved as a .json file.

Parameters:: file (str | dict) – String of .json file name or content of the file.
Returns:: A ready-to-use CombinationEvaluator
Return type:: CombinationEvaluator

save(file_name: str) → None

Saves CombinationEvaluator to .json file.

Parameters:: file_name (str) – String of .json file name