Combinations
Combinations are at the core of Carvers. They are used to identify the best combination
from all possible combinations with up to max_n_mod modalities.
A pre-built CombinationEvaluator instance can be passed to any carver via the
combination_evaluator keyword. Each subclass defaults to a task-appropriate metric:
TschuprowtCombinations for BinaryCarver / MulticlassCarver,
KruskalCombinations for ContinuousCarver.
The animation below starts from the six ordered bins a
QuantitativeDiscretizer produces (its final state — see
Complete pipeline for continuous and discrete features) and shows the core step: every consecutive
grouping into max_n_mod groups is scored by its association with the binary
target (Tschuprow’s T) and the table fills best-first in growing top_k
batches (the progressive top-K DP search). The highest-scoring
grouping that passes the viability filter is kept
(gold row). Adjacent bins sharing a colour in a row are merged into one group.
- class AutoCarver.combinations.CombinationEvaluator(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
CombinationEvaluator class to evaluate the best combination of modalities for a feature.
- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- get_best_combination(feature: BaseFeature, xagg: Series | DataFrame | None, xagg_dev: Series | DataFrame | None = None, *, max_n_mod: int, min_freq: float, dropna: bool, min_freq_alpha: float = 0.05) dict | None
Computes the best combination of modalities for the feature.
- Parameters:
feature (BaseFeature) – Feature being carved.
xagg (pd.Series | pd.DataFrame | None) – Train-sample aggregation (per-modality y-lists for continuous, crosstab for binary).
xagg_dev (pd.Series | pd.DataFrame | None, optional) – Dev-sample aggregation, used for the robustness veto.
max_n_mod (int) – Maximum number of modalities allowed in the returned combination.
min_freq (float) – Minimum per-modality frequency, tested via a Wilson upper bound at significance
min_freq_alpha(see Minimum-frequency test (Wilson score interval)).dropna (bool) – Whether to group NaN as another modality (fans out NaN placements after the non-NaN DP, see Search strategy — interval dynamic programming (DP) with progressive top-K).
min_freq_alpha (float, default
0.05) – Two-sided significance level of the Wilson score interval. Smaller → wider CI → fewer rejections.alpha=1recovers the legacy strict-threshold behaviour.
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
The highest-scoring grouping is not necessarily the one that is kept: each candidate must clear the viability filter (minimum frequency via a Wilson score interval, distinct consecutive target rates, and train/dev rank preservation). That filter is documented on its own page.
Search strategy — interval dynamic programming (DP) with progressive top-K
For fixed min_freq, max_n_mod and association metric, AutoCarver
returns the partition that maximises the metric among admissible candidates.
The DP described below is a search-strategy optimisation; it does not
prune the candidate set and does not change the statistical claim.
Bit-exact agreement with the legacy enumerate-and-score path is pinned by
parity tests (tests/combinations/binary/test_dp_chi2_parity.py,
tests/combinations/continuous/test_dp_kruskal_parity.py).
The search problem
For a feature with raw modalities \(m_0, \dots, m_{n-1}\) already ordered
(by ordinal rank, target rate, or numeric quantile), the carver searches over
consecutive segmentations with at most max_n_mod groups: a partition is
fully determined by integer split positions
\(0 = s_0 < s_1 < \dots < s_k = n\) with \(k \le \text{max_n_mod}\).
The chosen partition maximises the association metric subject to the
viability filter (Wilson min_freq on train + dev,
distinct target rates, preserved rank between train and dev when dev is
provided).
The legacy path enumerated every admissible partition, scored each, then walked them in metric-desc order. This is correct but wasteful — only the top handful of candidates ever survive the viability walk.
The DP idea
The DP exploits two properties shared by both supported metrics (Kruskal-Wallis \(H\) for continuous targets, Pearson \(\chi^2\) for binary targets):
Segmentation structure. A partition is a sequence of disjoint consecutive intervals \([s_g, s_{g+1})\). Sub-problems factorise over the right boundary \(j\) and the number of groups \(k\).
Additive decomposability of the metric over groups, given fixed \(k\). Both \(H\) and \(\chi^2\) reduce — at fixed \(k\) and after factoring out \(k\)-dependent normalising constants — to a sum over groups of a quantity that depends only on a single interval \([i, j)\) of raw modalities.
We therefore run an interval DP indexed by \((k, j)\) whose state is the top-K prefixes (by partial score) ending at split \(j\) with \(k\) groups:
The final candidate list is \(\bigcup_k \text{dp}[k][n]\), sorted desc and
truncated to top_k.
| k \ j | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| k=1 | ● | ● | ● | ● | ● | ● |
| k=2 | — | ● | ● | ● | ● | ● |
| k=3 | — | — | ● | ● | ● | ★ |
top_k
partitions by score, then the viability filter (Wilson min_freq,
distinct target rates, train/dev rank) walks them in order. If none pass,
top_k
doubles and the DP re-runs — keeping the common case (a viable winner in the
first batch) cheap, while preserving the optimality guarantee in the worst case.
Progressive top-K
The DP returns the top-K scored partitions, not all of them. The viability
walk consumes that list in metric-desc order; if no candidate is viable in the
current top-K we double top_k and re-run the DP, walking only the
newly-appeared entries. Repeats until either:
a viable candidate is found, or
the DP returns fewer than
top_kentries — every consecutive partition has been emitted; no viable exists for this feature.
Doubling guarantees the search is exhaustive in the worst case: the same
admissible candidate set is eventually considered as in the legacy
enumerate-and-score path. Total work is bounded by \(\sim 2 \times\) a
single DP run at the final top_k. The common case (viable found in the
initial top-K) costs \(O(K \cdot n^2 \cdot \text{top\_k} \cdot \log
\text{top\_k})\) ops, independent of the total combination count — which
scales combinatorially in \(n\) and max_n_mod and reached \(\sim
8\text{M}\) at \(n=40,\,\text{max_n_mod}=7\) previously.
The initial top-K is configurable via the class attribute
CombinationEvaluator.dp_top_k_initial (default 1000).
NaN fan-out path
When dropna=True and the feature has NaNs, the DP runs on the non-NaN
sub-index to produce base partitions. Each base is then fanned out across
NaN placements:
NaN folded into each existing group;
NaN as its own group when
len(base) < max_n_mod;plus the degenerate
[all_non_nan, [NaN]]partition.
Each variant is scored in closed form (_kruskal_h_for_combination /
_chi2_assoc_for_combination) against the full per-modality stats — the
NaN row is still in the aggregated sample because _apply_best_combination
rebuilt it from raw after the non-NaN DP. Variants are walked sorted desc,
dedup’d by partition key across progressive iterations so combinations carried
over from a smaller top_k are not re-tested.
What does not change
The admissible candidate set: consecutive segmentations with \(k \le \text{max_n_mod}\).
The viability filter: Wilson
min_freqon train + dev, distinct target rates, rank preservation.The optimality claim: for fixed
min_freq,max_n_mod, and metric, no admissible combination scores higher than the one returned.
The DP is a search-strategy optimisation, not a statistical change.
Target rates
Every combination evaluator carries a target rate — the per-modality summary
of the target that the carver reports and, crucially, the scalar the
viability filter orders by. It is passed via the
target_rate keyword and defaults to a task-appropriate choice (the target
mean, TargetMean, for every built-in evaluator).
A target rate plays two distinct roles:
Display statistic. One value per modality, stored on
feature.statisticsand surfaced in the carved-feature summary, so the grouping can be read off in interpretable units (an event rate, an odds ratio, a mean target, …).Ordering key for viability. The same per-modality value is what the distinct-rate test requires to differ between consecutive modalities, and what the train/dev rank-preservation veto sorts on. A combination whose target-rate ordering collapses or flips is rejected.
Because of this dual role, two properties decide whether a candidate rate is a good fit:
It must be an orderable scalar. The viability checks need a single value per modality with a meaningful monotone ordering. A symmetric measure (e.g. a Gini-style impurity, maximal at \(p = 0.5\)) is fine as a display statistic but a poor ordering key.
Decomposability buys a fast path. When a rate can be reconstructed from per-raw-modality sufficient statistics it can opt into a closed-form path (
compute_from_stats) that costs \(O(k)\) per combination instead of re-aggregating the raw sample on every candidate. The continuous mean does this; rates that need the full value multiset (median, quantiles) cannot and fall back to the general aggregation path.
- class AutoCarver.combinations.utils.TargetRate
Target rate class.
Generic over
XAggfor the inner_compute()worker so that binary (DataFramecrosstabs) and continuous (Seriesof y-lists) subclasses don’t violate LSP by narrowing the worker’s parameter type. The outercompute()keeps a wideSeries | DataFrame | Nonesignature because call sites inCombinationEvaluatorandBaseCarvercarry that union directly fromAggregatedSample.raw/ pretty-printer plumbing.- abstract compute(xagg: Series | DataFrame) DataFrame
- abstract compute(xagg: None) None
Computes the target rate.
- Parameters:
xagg (pd.Series | pd.DataFrame | None) – A crosstab (binary) or Series-of-y-lists (continuous).
- Returns:
Target rate frame, or
NoneifxaggwasNone.- Return type:
pd.DataFrame | None
Binary target rates
For binary (and multiclass) targets the per-modality input is a two-column
crosstab \((n_0, n_1)\), so every rate below is closed-form. The default is
the event rate \(p = n_1 / (n_0 + n_1)\) (TargetMean).
- class AutoCarver.combinations.binary.binary_target_rates.TargetMean
Mean target rate class.
Continuous target rates
For continuous targets the per-modality input is the multiset of target values.
The default TargetMean is decomposable from per-modality \((n, \sum y)\)
aggregates and therefore implements the closed-form compute_from_stats fast
path; TargetMedian is not decomposable from sums and uses the general
aggregation path.
- class AutoCarver.combinations.continuous.continuous_target_rates.TargetMean
Mean target rate class.
- class AutoCarver.combinations.continuous.continuous_target_rates.TargetMedian
Median of target per class.
Custom target rates
A custom rate subclasses BinaryTargetRate or ContinuousTargetRate and
implements _compute (one value per modality). To make it serialisable
through save / load, add it to the evaluator’s _target_rate_classes
registry. When the rate is additively decomposable from per-modality sums,
override compute_from_stats so the search uses the closed-form path.
Candidate extensions that fit this contract (not yet implemented):
Binary, closed-form: logit / log-odds \(\log(n_1 / n_0)\), and a column-normalised weight-of-evidence \(\log\!\big((n_1/\textstyle\sum n_1)\,/\,(n_0/\textstyle\sum n_0)\big)\).
Continuous, decomposable (extend the carried stats with \(\sum y^2\)): variance and standard deviation.
Continuous, non-decomposable (general path only): inter-quartile range, arbitrary quantiles, and trimmed/robust location — useful for heavy-tailed targets where the mean ordering is unstable.
Classification tasks
Pearson \(\chi^2\) (binary targets)
For a 2-column contingency table (binary target), each group \(g\) contributes counts \((n_{0,g},\, n_{1,g})\). With row marginals \(R_g = n_{0,g} + n_{1,g}\), column marginals \(C_c = \sum_g n_{c,g}\), and grand total \(N = \sum_g R_g\), Pearson’s statistic is
Two key observations:
Given a fixed number of groups \(k\), the column marginals \(C_c\) and total \(N\) depend only on \(k\) (and a constant
tolshift applied to every cell — matching the legacychi2_contingency(xagg + tol)call): \(C_c = N_c + k\cdot\text{tol}\), \(N = N_0 + N_1 + 2k\cdot\text{tol}\). They are invariant under re-partitioning at fixed \(k\).Therefore, at fixed \(k\), \(\chi^2\) is additive over groups: each group contributes \((O - E)^2 / E\) summed over its two cells, with \(E\) derivable from \((n_{0,g},\, n_{1,g})\) and the constants \((C_0, C_1, N)\). The Yates correction (subtract \(0.5\) from \(|O - E|\) iff the table is exactly \(2 \times 2\), matching scipy’s default) is applied only when \(k = 2\), which is again known at the DP level.
The DP is therefore run once per \(k \in [2,\, \text{max_n_mod}]\) with the constants \((C_0, C_1, N, \text{yates_flag})\) fixed; per-\(k\) top-K lists are merged and re-truncated:
Cramér’s \(V = \sqrt{\chi^2 / N_{obs}}\) and Tschuprow’s
\(T = V / \sqrt[4]{k - 1}\) are monotone transforms of \(\chi^2\) at
fixed \(k\), so sorting by either is equivalent to sorting by \(\chi^2\)
within each \(k\) slice. The cross-\(k\) merge re-applies the
configured sort_by so the global top-K is correct under either metric.
Statistical equivalence to scipy.stats.chi2_contingency() is bit-exact
(parity tests pin the \(+\text{tol}\) shift, the Yates handling, and the
\(\text{round}(x / \text{tol}) \cdot \text{tol}\) quantisation).
Cramér’s V Combinations
See Cramér’s V for more details on the metric.
- class AutoCarver.combinations.CramervCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Cramér’s V based combination evaluation toolkit.
Same DP search as
TschuprowtCombinations(see Pearson \chi^2 (binary targets)); only thesort_bykey differs. \(V = \sqrt{\chi^2 / N_{obs}}\) is a monotone transform of \(\chi^2\) at fixed \(k\).- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
Tschuprow’s T Combinations
See Tschuprow’s T for more details on the metric.
- class AutoCarver.combinations.TschuprowtCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Tschuprow’s T based combination evaluation toolkit.
Search uses progressive top-K interval DP over the closed-form Pearson \(\chi^2\) decomposition (per-k DP with constant column marginals, Yates correction iff
k == 2). Statistically equivalent toscipy.stats.chi2_contingency()— bit-exact agreement pinned by parity tests.- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name
Regression tasks
Kruskal-Wallis H (continuous targets)
Given a partition with \(n_g\) observations per group, rank sum \(R_g\), total \(N = \sum_g n_g\), and tie correction \(T = 1 - \sum_i (t_i^3 - t_i) / (N^3 - N)\) (depends only on the pooled \(y\) multiset), the Kruskal-Wallis statistic is
Two key observations:
Per-modality \((R_i, n_i)\), the total \(N\), and the tie correction \(T\) depend only on the raw feature ranking — not on the partition. They are computed once by ranking \(y\) once over the pooled sample (see
_modality_rank_stats).\(\sum_g R_g^2 / n_g\) is additive over groups. With prefix sums
R_prefixandn_prefixover the raw modalities, a single interval’s contribution is closed-form:\[\text{seg_cost}(i, j) = \frac{\big(\text{R_prefix}[j] - \text{R_prefix}[i]\big)^2} {\text{n_prefix}[j] - \text{n_prefix}[i]}.\]
The DP maximises \(\sum_g \text{seg_cost}\) over partitions; \(H\)
is recovered at the end by applying the constant prefactor
\(12 / (N(N+1))\), the constant offset \(-3(N+1)\), and dividing by
\(T\). Statistical equivalence to scipy.stats.kruskal() is
bit-exact — the DP only re-orders the search.
Kruskal’s H Combinations
See Kruskal-Wallis’ H test statistic for more details on the metric.
- class AutoCarver.combinations.KruskalCombinations(*, verbose: bool = False, target_rate: TargetRate[XAgg] | None = None)
Kruskal-Wallis’ H based combination evaluation toolkit.
Search uses progressive top-K interval DP over the closed-form Kruskal-Wallis H decomposition (rank once over pooled
y, prefix-sum per-modality rank stats). Statistically equivalent toscipy.stats.kruskal()— bit-exact agreement pinned by parity tests.- Parameters:
verbose (bool, optional) – Whether to print progress / statistics, by default
False.target_rate (TargetRate, optional) – Target rate strategy. If
None, each evaluator subclass picks its own default in_init_target_rate().
- classmethod load(file: str | dict) CombinationEvaluator
Allows one to load a
CombinationEvaluatorsaved as a .json file.- Parameters:
file (str | dict) – String of .json file name or content of the file.
- Returns:
A ready-to-use
CombinationEvaluator- Return type:
- save(file_name: str) None
Saves
CombinationEvaluatorto .json file.- Parameters:
file_name (str) – String of .json file name