Combinations
Combinations are at the core of Carvers. They are used to identify the best combination
from all possible combinations with up to max_n_mod modalities.
A pre-built CombinationEvaluator instance can be passed to any carver via the
combination_evaluator keyword. Each subclass defaults to a task-appropriate metric:
TschuprowtCombinations for BinaryCarver / MulticlassCarver,
KruskalCombinations for ContinuousCarver.
The animation below starts from the six ordered bins a
QuantitativeDiscretizer produces (its final state — see
Complete pipeline for continuous and discrete features) and shows the core step: every consecutive
grouping into max_n_mod groups is scored by its association with the binary
target (Tschuprow’s T) and the table fills best-first in growing top_k
batches (the progressive top-K DP search). The highest-scoring
grouping that passes the viability filter is kept
(gold row). Adjacent bins sharing a colour in a row are merged into one group.
- class AutoCarver.combinations.CombinationEvaluator(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
CombinationEvaluator class to evaluate the best combination of modalities for a feature.
- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- get_best_combination(feature: BaseFeature, xagg: Series | DataFrame | None, xagg_dev: Series | DataFrame | None = None, *, max_n_mod: int, min_freq: float, dropna: bool, min_freq_alpha: float = 0.05) dict | None
Computes the best combination of modalities for the feature.
- Parameters:
feature (BaseFeature) – Feature being carved.
xagg (pd.Series | pd.DataFrame | None) – Train-sample aggregation (per-modality y-lists for continuous, crosstab for binary).
xagg_dev (pd.Series | pd.DataFrame | None, optional) – Dev-sample aggregation, used for the robustness veto.
max_n_mod (int) – Maximum number of modalities allowed in the returned combination.
min_freq (float) – Minimum per-modality frequency, tested via a Wilson upper bound at significance
min_freq_alpha(see Minimum-frequency viability test (Wilson score interval)).dropna (bool) – Whether to group NaN as another modality (fans out NaN placements after the non-NaN DP, see Search strategy — interval dynamic programming (DP) with progressive top-K).
min_freq_alpha (float, default
0.05) – Two-sided significance level of the Wilson score interval. Smaller → wider CI → fewer rejections.alpha=1recovers the legacy strict-threshold behaviour.
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
Minimum-frequency viability test (Wilson score interval)
A candidate combination is viable on a sample only if every grouped modality is
sufficiently frequent. Comparing \(\hat p = \text{count} / n_{obs}\) directly
against min_freq is noisy on small modalities — a modality with
\(\hat p = 4\%\) out of \(n_{obs}=100\) would be rejected against
min_freq=5%, even though its 95% confidence interval comfortably straddles
5%. AutoCarver instead tests the one-sided question “is this modality’s
true proportion significantly below min_freq ?” at level \(\alpha\),
using a Wilson score interval — the small-sample-stable proportion interval
recommended over Wald in Brown, Cai & DasGupta (2001).
Decision rule. Modality \(m\) is declared under-represented iff the
upper bound of the two-sided Wilson interval for \(\hat p_m\) is
strictly below min_freq:
with \(z = \Phi^{-1}(1 - \alpha/2)\) (two-sided z-score; \(\alpha=0.05\) gives \(z \approx 1.96\)). Reject iff \(\text{UB}(\hat p_m, n_{obs}, \alpha) < \text{min_freq}\).
Properties.
Asymptotic equivalence: as \(n_{obs} \to \infty\), \(\text{UB} \to \hat p\), so the test converges to the legacy strict threshold \(\hat p < \text{min_freq}\).
Small-sample conservativity: a modality with very few observations cannot be rejected (the CI is too wide to fall below
min_freq), preventing premature merges driven by sampling noise.\(n_{obs} = 0\) returns \(\text{UB} = 1.0\), so empty groups are never rejected by this test (other checks catch them).
Where the test fires.
Inside each Discretizer to gate raw modalities before the combination search. Carvers discretize at
min_freq / 2so this gate runs at the halved threshold, giving the combination evaluator a finer granularity to recombine.Inside
CombinationEvaluatorviability checks on both train and dev samples for every candidate combination during the search.
Tuning. Set via DiscretizerConfig.min_freq_alpha (default
0.05). Smaller \(\alpha\) → wider CI → fewer rejections → less merging;
larger \(\alpha\) → tighter CI → more rejections → more aggressive merging.
\(\alpha = 1\) recovers the legacy strict-threshold behaviour
(\(\text{UB}\) collapses to \(\hat p\)).
- AutoCarver.discretizers.utils.frequency_ci.wilson_upper_bound(count: ndarray | int | float, nobs: int, alpha: float) ndarray | float
Upper bound of the two-sided Wilson score interval for
count / nobs.- Parameters:
count (array-like or scalar) – Observed successes. Accepts integer counts or float counts (e.g. weighted/aggregated frequencies).
nobs (int) – Number of trials. Must be
>= 0; returns1.0whennobs == 0so callers treat empty samples as non-significant.alpha (float) – Two-sided significance level (e.g.
0.05for a 95% interval).
- Returns:
Wilson upper bound, same shape as
count.- Return type:
array-like or scalar
- AutoCarver.discretizers.utils.frequency_ci.is_significantly_below(count: ndarray | int | float, nobs: int, min_freq: float, alpha: float) ndarray | bool
Whether the observed proportion
count / nobsis significantly belowmin_freq.A modality is significantly below
min_freqwhen the Wilson upper bound of its observed proportion is strictly belowmin_freq.
Search strategy — interval dynamic programming (DP) with progressive top-K
For fixed min_freq, max_n_mod and association metric, AutoCarver
returns the partition that maximises the metric among admissible candidates.
The DP described below is a search-strategy optimisation; it does not
prune the candidate set and does not change the statistical claim.
Bit-exact agreement with the legacy enumerate-and-score path is pinned by
parity tests (tests/combinations/binary/test_dp_chi2_parity.py,
tests/combinations/continuous/test_dp_kruskal_parity.py).
The search problem
For a feature with raw modalities \(m_0, \dots, m_{n-1}\) already ordered
(by ordinal rank, target rate, or numeric quantile), the carver searches over
consecutive segmentations with at most max_n_mod groups: a partition is
fully determined by integer split positions
\(0 = s_0 < s_1 < \dots < s_k = n\) with \(k \le \text{max_n_mod}\).
The chosen partition maximises the association metric subject to the
viability filter (Wilson min_freq on train + dev,
distinct target rates, preserved rank between train and dev when dev is
provided).
The legacy path enumerated every admissible partition, scored each, then walked them in metric-desc order. This is correct but wasteful — only the top handful of candidates ever survive the viability walk.
The DP idea
The DP exploits two properties shared by both supported metrics (Kruskal-Wallis \(H\) for continuous targets, Pearson \(\chi^2\) for binary targets):
Segmentation structure. A partition is a sequence of disjoint consecutive intervals \([s_g, s_{g+1})\). Sub-problems factorise over the right boundary \(j\) and the number of groups \(k\).
Additive decomposability of the metric over groups, given fixed \(k\). Both \(H\) and \(\chi^2\) reduce — at fixed \(k\) and after factoring out \(k\)-dependent normalising constants — to a sum over groups of a quantity that depends only on a single interval \([i, j)\) of raw modalities.
We therefore run an interval DP indexed by \((k, j)\) whose state is the top-K prefixes (by partial score) ending at split \(j\) with \(k\) groups:
The final candidate list is \(\bigcup_k \text{dp}[k][n]\), sorted desc and
truncated to top_k.
| k \ j | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| k=1 | ● | ● | ● | ● | ● | ● |
| k=2 | — | ● | ● | ● | ● | ● |
| k=3 | — | — | ● | ● | ● | ★ |
top_k
partitions by score, then the viability filter (Wilson min_freq,
distinct target rates, train/dev rank) walks them in order. If none pass,
top_k
doubles and the DP re-runs — keeping the common case (a viable winner in the
first batch) cheap, while preserving the optimality guarantee in the worst case.
Progressive top-K
The DP returns the top-K scored partitions, not all of them. The viability
walk consumes that list in metric-desc order; if no candidate is viable in the
current top-K we double top_k and re-run the DP, walking only the
newly-appeared entries. Repeats until either:
a viable candidate is found, or
the DP returns fewer than
top_kentries — every consecutive partition has been emitted; no viable exists for this feature.
Doubling guarantees the search is exhaustive in the worst case: the same
admissible candidate set is eventually considered as in the legacy
enumerate-and-score path. Total work is bounded by \(\sim 2 \times\) a
single DP run at the final top_k. The common case (viable found in the
initial top-K) costs \(O(K \cdot n^2 \cdot \text{top\_k} \cdot \log
\text{top\_k})\) ops, independent of the total combination count — which
scales combinatorially in \(n\) and max_n_mod and reached \(\sim
8\text{M}\) at \(n=40,\,\text{max_n_mod}=7\) previously.
The initial top-K is configurable via the class attribute
CombinationEvaluator.dp_top_k_initial (default 1000).
NaN fan-out path
When dropna=True and the feature has NaNs, the DP runs on the non-NaN
sub-index to produce base partitions. Each base is then fanned out across
NaN placements:
NaN folded into each existing group;
NaN as its own group when
len(base) < max_n_mod;plus the degenerate
[all_non_nan, [NaN]]partition.
Each variant is scored in closed form (_kruskal_h_for_combination /
_chi2_assoc_for_combination) against the full per-modality stats — the
NaN row is still in the aggregated sample because _apply_best_combination
rebuilt it from raw after the non-NaN DP. Variants are walked sorted desc,
dedup’d by partition key across progressive iterations so combinations carried
over from a smaller top_k are not re-tested.
What does not change
The admissible candidate set: consecutive segmentations with \(k \le \text{max_n_mod}\).
The viability filter: Wilson
min_freqon train + dev, distinct target rates, rank preservation.The optimality claim: for fixed
min_freq,max_n_mod, and metric, no admissible combination scores higher than the one returned.
The DP is a search-strategy optimisation, not a statistical change.
Classification tasks
Pearson \(\chi^2\) (binary targets)
For a 2-column contingency table (binary target), each group \(g\) contributes counts \((n_{0,g},\, n_{1,g})\). With row marginals \(R_g = n_{0,g} + n_{1,g}\), column marginals \(C_c = \sum_g n_{c,g}\), and grand total \(N = \sum_g R_g\), Pearson’s statistic is
Two key observations:
Given a fixed number of groups \(k\), the column marginals \(C_c\) and total \(N\) depend only on \(k\) (and a constant
tolshift applied to every cell — matching the legacychi2_contingency(xagg + tol)call): \(C_c = N_c + k\cdot\text{tol}\), \(N = N_0 + N_1 + 2k\cdot\text{tol}\). They are invariant under re-partitioning at fixed \(k\).Therefore, at fixed \(k\), \(\chi^2\) is additive over groups: each group contributes \((O - E)^2 / E\) summed over its two cells, with \(E\) derivable from \((n_{0,g},\, n_{1,g})\) and the constants \((C_0, C_1, N)\). The Yates correction (subtract \(0.5\) from \(|O - E|\) iff the table is exactly \(2 \times 2\), matching scipy’s default) is applied only when \(k = 2\), which is again known at the DP level.
The DP is therefore run once per \(k \in [2,\, \text{max_n_mod}]\) with the constants \((C_0, C_1, N, \text{yates_flag})\) fixed; per-\(k\) top-K lists are merged and re-truncated:
Cramér’s \(V = \sqrt{\chi^2 / N_{obs}}\) and Tschuprow’s
\(T = V / \sqrt[4]{k - 1}\) are monotone transforms of \(\chi^2\) at
fixed \(k\), so sorting by either is equivalent to sorting by \(\chi^2\)
within each \(k\) slice. The cross-\(k\) merge re-applies the
configured sort_by so the global top-K is correct under either metric.
Statistical equivalence to scipy.stats.chi2_contingency() is bit-exact
(parity tests pin the \(+\text{tol}\) shift, the Yates handling, and the
\(\text{round}(x / \text{tol}) \cdot \text{tol}\) quantisation).
Cramér’s V Combinations
See Cramér’s V for more details on the metric.
- class AutoCarver.combinations.CramervCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Cramér’s V based combination evaluation toolkit.
Same DP search as
TschuprowtCombinations(see Pearson \chi^2 (binary targets)); only thesort_bykey differs. \(V = \sqrt{\chi^2 / N_{obs}}\) is a monotone transform of \(\chi^2\) at fixed \(k\).- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
Tschuprow’s T Combinations
See Tschuprow’s T for more details on the metric.
- class AutoCarver.combinations.TschuprowtCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Tschuprow’s T based combination evaluation toolkit.
Search uses progressive top-K interval DP over the closed-form Pearson \(\chi^2\) decomposition (per-k DP with constant column marginals, Yates correction iff
k == 2). Statistically equivalent toscipy.stats.chi2_contingency()— bit-exact agreement pinned by parity tests.- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
Regression tasks
Kruskal-Wallis H (continuous targets)
Given a partition with \(n_g\) observations per group, rank sum \(R_g\), total \(N = \sum_g n_g\), and tie correction \(T = 1 - \sum_i (t_i^3 - t_i) / (N^3 - N)\) (depends only on the pooled \(y\) multiset), the Kruskal-Wallis statistic is
Two key observations:
Per-modality \((R_i, n_i)\), the total \(N\), and the tie correction \(T\) depend only on the raw feature ranking — not on the partition. They are computed once by ranking \(y\) once over the pooled sample (see
_modality_rank_stats).\(\sum_g R_g^2 / n_g\) is additive over groups. With prefix sums
R_prefixandn_prefixover the raw modalities, a single interval’s contribution is closed-form:\[\text{seg_cost}(i, j) = \frac{\big(\text{R_prefix}[j] - \text{R_prefix}[i]\big)^2} {\text{n_prefix}[j] - \text{n_prefix}[i]}.\]
The DP maximises \(\sum_g \text{seg_cost}\) over partitions; \(H\)
is recovered at the end by applying the constant prefactor
\(12 / (N(N+1))\), the constant offset \(-3(N+1)\), and dividing by
\(T\). Statistical equivalence to scipy.stats.kruskal() is
bit-exact — the DP only re-orders the search.
Kruskal’s H Combinations
See Kruskal-Wallis’ H test statistic for more details on the metric.
- class AutoCarver.combinations.KruskalCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Kruskal-Wallis’ H based combination evaluation toolkit.
Search uses progressive top-K interval DP over the closed-form Kruskal-Wallis H decomposition (rank once over pooled
y, prefix-sum per-modality rank stats). Statistically equivalent toscipy.stats.kruskal()— bit-exact agreement pinned by parity tests.- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name