Viability testing

The combination search ranks every candidate grouping by its association with the target, but the highest-scoring grouping is not necessarily the one that is kept. Before a combination is selected it must pass the viability filter — a set of statistical guardrails that reject groupings which are over-fit, degenerate, or driven by sampling noise. The top-K DP search walks candidates in metric-descending order and the first one that is viable on both the train and dev samples wins.

The filter bundles three independent tests:

Minimum-frequency — every grouped modality is frequent enough (Wilson score interval).
Distinct target rates — consecutive modalities carry different target rates.
Train/dev rank preservation — the modality ordering by target rate is stable between train and dev (robustness veto).

Decision rule. The min-frequency and distinct-rate tests run on both samples; rank preservation runs on dev only (it compares dev against train). A combination is viable iff it passes on train and — when a dev sample was provided — passes on dev:

\[\text{viable} \;=\; \text{viable}_{\text{train}} \;\wedge\; \big(\text{viable}_{\text{dev}} \;\vee\; \text{no dev sample}\big).\]
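The decision rule and the "first viable candidate wins" walk can be sketched as follows. This is a minimal illustration only: `combine_viability`, `first_viable`, and the `(name, train_viable, dev_viable)` tuples are hypothetical names and shapes chosen for the example, not the structures AutoCarver actually passes around.

```python
from typing import Iterable, Optional, Tuple


def combine_viability(train_viable: bool, dev_viable: Optional[bool] = None) -> bool:
    """Hypothetical helper: viable = viable_train AND (viable_dev OR no dev sample)."""
    no_dev_sample = dev_viable is None
    return train_viable and (no_dev_sample or bool(dev_viable))


def first_viable(candidates: Iterable[Tuple[str, bool, Optional[bool]]]) -> Optional[str]:
    """Return the first viable candidate, assuming `candidates` is already
    sorted by metric in descending order; None means nothing survived."""
    for name, train_viable, dev_viable in candidates:
        if combine_viability(train_viable, dev_viable):
            return name
    return None


# Illustrative ranking: the two best-scoring groupings fail on one sample each.
ranked = [
    ("A|B|C|D", True, False),
    ("A|B|CD", False, True),
    ("AB|CD", True, True),
]
print(first_viable(ranked))  # AB|CD
```

In this sketch a return value of `None` corresponds to the case where no candidate survives the filter.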

When no candidate survives the filter, the feature is dropped; see Dropped features (no robust combination). Each failing test contributes a human-readable reason (“Non-representative modality for min_freq=…”, “Non-distinct target rates per consecutive modalities”, “Inversion of target rates per modality”) that surfaces in carver.history and dropped_reason.

AutoCarver.combinations.utils.testing.test_viability(rates: Series | DataFrame, min_freq: int | float | None, target_rate: str, alpha: float, train_target_rate: Series | None = None) → dict

Tests the viability of the given rates.

rates must carry per-modality count and frequency columns (added by the binary/continuous target-rate builders); CI tests use the counts and nobs = counts.sum().

AutoCarver.combinations.utils.testing.is_viable(test_results: dict)

Checks whether the combination is viable on train and on dev (if provided).

Minimum-frequency test (Wilson score interval)

A candidate combination is viable on a sample only if every grouped modality is sufficiently frequent. Comparing \(\hat p = \text{count} / n_{obs}\) directly against min_freq is noisy on small modalities: a modality with \(\hat p = 4\%\) out of \(n_{obs}=100\) would be rejected against min_freq=5%, even though its 95% confidence interval comfortably straddles 5%. AutoCarver instead tests the one-sided question “is this modality’s true proportion significantly below min_freq?” using the upper bound of a Wilson score interval at level \(\alpha\), the small-sample-stable proportion interval recommended over Wald in Brown, Cai & DasGupta (2001). Because only the upper bound of the two-sided interval is used, the nominal one-sided level is \(\alpha/2\).

Decision rule. Modality \(m\) is declared under-represented iff the upper bound of the two-sided Wilson interval for \(\hat p_m\) is strictly below min_freq:

\[\text{UB}(\hat p, n, \alpha) = \frac{\hat p + z^2/(2n)}{1 + z^2/n} + \frac{z}{1 + z^2/n}\sqrt{\frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2}}\]

with \(z = \Phi^{-1}(1 - \alpha/2)\) (two-sided z-score; \(\alpha=0.05\) gives \(z \approx 1.96\)). Reject iff \(\text{UB}(\hat p_m, n_{obs}, \alpha) < \text{min_freq}\).
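The formula can be transcribed almost term by term. The sketch below is an illustration written from the equation above, not the library's own `wilson_upper_bound`; it is scalar-only, uses the standard library's `statistics.NormalDist` for \(z\), and the name `wilson_upper_bound_sketch` is made up for the example.

```python
import math
from statistics import NormalDist


def wilson_upper_bound_sketch(count: float, nobs: int, alpha: float) -> float:
    """Upper bound of the two-sided Wilson score interval for count / nobs."""
    if nobs == 0:
        return 1.0  # convention from the text: an empty sample is never rejected
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z-score
    p_hat = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p_hat + z**2 / (2 * nobs)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / nobs + z**2 / (4 * nobs**2)
    )
    return centre + half_width


# The 4-out-of-100 case: the upper bound sits well above min_freq=5%,
# so the modality is not declared under-represented.
ub = wilson_upper_bound_sketch(count=4, nobs=100, alpha=0.05)
print(round(ub, 3))  # 0.098
print(ub < 0.05)     # False
```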

Properties.

Asymptotic equivalence: as \(n_{obs} \to \infty\), \(\text{UB} \to \hat p\), so the test converges to the legacy strict threshold \(\hat p < \text{min_freq}\).
Small-sample conservativity: when \(n_{obs}\) is small, no modality can be rejected, however rare (the CI is too wide to fall below min_freq), preventing premature merges driven by sampling noise.
\(n_{obs} = 0\) returns \(\text{UB} = 1.0\), so empty groups are never rejected by this test (other checks catch them).
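The first two properties can be read off the formula directly. For fixed \(\alpha\) (hence fixed \(z\)), every correction term vanishes as \(n\) grows:

\[\frac{z^2}{2n} \to 0,\quad \frac{z^2}{n} \to 0,\quad \frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2} \to 0 \;\;\Longrightarrow\;\; \text{UB}(\hat p, n, \alpha) \to \frac{\hat p}{1} + \frac{z}{1}\sqrt{0} = \hat p.\]

For the small-sample side, the most extreme case is a modality that is never observed, \(\hat p = 0\), where the bound reduces to

\[\text{UB}(0, n, \alpha) = \frac{z^2/(2n)}{1 + z^2/n} + \frac{z}{1 + z^2/n}\cdot\frac{z}{2n} = \frac{z^2}{n + z^2}.\]

As a worked example under the default \(\alpha = 0.05\) (\(z^2 \approx 3.84\)) and an illustrative min_freq of 5%, this bound falls below 0.05 only when \(n > z^2 \cdot 0.95 / 0.05 \approx 73\). Since the upper bound increases with \(\hat p\), no modality can be rejected on a sample of 72 observations or fewer in that setting.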

Where the test fires.

Inside each Discretizer, to gate raw modalities before the combination search. Carvers discretize at min_freq / 2, so this gate runs at the halved threshold, giving the combination evaluator a finer granularity to recombine.
Inside the CombinationEvaluator viability checks, on both the train and dev samples, for every candidate combination evaluated during the search.

Tuning. Set via DiscretizerConfig.min_freq_alpha (default 0.05). Smaller \(\alpha\) → wider CI → fewer rejections → less merging; larger \(\alpha\) → tighter CI → more rejections → more aggressive merging. \(\alpha = 1\) recovers the legacy strict-threshold behaviour (\(\text{UB}\) collapses to \(\hat p\)).
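The effect of \(\alpha\) can be seen on a single modality by sweeping it. The sketch below re-derives the bound from the formula in this section (it does not call AutoCarver); the counts and the 5% threshold are illustrative values, and `wilson_ub` is a made-up name.

```python
import math
from statistics import NormalDist


def wilson_ub(count: float, nobs: int, alpha: float) -> float:
    """Wilson upper bound, written from the formula in the text (illustrative)."""
    if nobs == 0:
        return 1.0
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = count / nobs
    denom = 1 + z**2 / nobs
    centre = (p_hat + z**2 / (2 * nobs)) / denom
    spread = math.sqrt(p_hat * (1 - p_hat) / nobs + z**2 / (4 * nobs**2))
    return centre + z / denom * spread


count, nobs, min_freq = 30, 1000, 0.05  # 3% observed against a 5% threshold
for alpha in (0.001, 0.05, 0.5, 1.0):
    ub = wilson_ub(count, nobs, alpha)
    # smaller alpha -> larger z -> higher upper bound -> harder to reject
    print(f"alpha={alpha}: UB={ub:.4f} rejected={ub < min_freq}")
```

With these numbers the modality survives at \(\alpha = 0.001\) and is rejected from \(\alpha = 0.05\) upward; at \(\alpha = 1\) the bound equals the observed 3%.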

AutoCarver.discretizers.utils.frequency_ci.wilson_upper_bound(count: ndarray | int | float, nobs: int, alpha: float) → ndarray | float

Upper bound of the two-sided Wilson score interval for count / nobs.

Parameters:

count (array-like or scalar) – Observed successes. Accepts integer counts or float counts (e.g. weighted/aggregated frequencies).
nobs (int) – Number of trials. Must be >= 0; returns 1.0 when nobs == 0 so callers treat empty samples as non-significant.
alpha (float) – Two-sided significance level (e.g. 0.05 for a 95% interval).

Returns:

Wilson upper bound, same shape as count.

Return type:

array-like or scalar

AutoCarver.discretizers.utils.frequency_ci.is_significantly_below(count: ndarray | int | float, nobs: int, min_freq: float, alpha: float) → ndarray | bool

Whether the observed proportion count / nobs is significantly below min_freq.

A modality is significantly below min_freq when the Wilson upper bound of its observed proportion is strictly below min_freq.

Distinct-target-rate test

Modalities are searched as consecutive groupings of an ordered feature (by ordinal rank, target rate, or numeric quantile — see Search strategy — interval dynamic programming (DP) with progressive top-K). Two adjacent groups that end up with the same target rate are statistically indistinguishable: the split between them carries no information and the two groups should have been merged into one. A combination is rejected as soon as any consecutive pair shares its target rate:

\[\text{distinct} \;=\; \neg \, \exists\, m \,:\, \tau_m \approx \tau_{m-1},\]

where \(\tau_m\) is the target rate of modality \(m\) and \(\approx\) is a floating-point closeness check (numpy.isclose). Keeping the test on consecutive modalities (rather than all pairs) matches the ordered nature of the search: non-adjacent groups are allowed to coincide, only neighbours that would collapse are forbidden. Failing this test favours the coarser, more parsimonious combination the search will reach next.
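A minimal sketch of the consecutive-pair check, assuming the target rates arrive as a plain sequence already in modality order. The function name is hypothetical and numpy.isclose is called with its default tolerances, which may differ from what the library configures.

```python
import numpy as np


def has_distinct_consecutive_rates(target_rates) -> bool:
    """Illustrative check: True when no two consecutive rates are numerically close."""
    rates = np.asarray(target_rates, dtype=float)
    # compare each modality with its predecessor only, never all pairs
    return not np.isclose(rates[1:], rates[:-1]).any()


print(has_distinct_consecutive_rates([0.10, 0.25, 0.40]))  # True
print(has_distinct_consecutive_rates([0.10, 0.25, 0.25]))  # False: neighbours collapse
print(has_distinct_consecutive_rates([0.25, 0.10, 0.25]))  # True: equal rates are not adjacent
```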

Train/dev rank-preservation test (robustness veto)

When a dev sample is provided, a viable combination must be robust: the modalities, ranked by their target rate, must keep the same order on train and on dev. A combination whose target-rate ordering flips between the two samples is over-fit to train and is vetoed:

\[\text{rank ok} \;=\; \big[\, \operatorname{argsort}_m \tau^{\text{train}}_m \;=\; \operatorname{argsort}_m \tau^{\text{dev}}_m \,\big].\]

Both target-rate series are aligned on the same modality index before sorting by value, so the comparison is purely about order, not about the rates’ absolute magnitudes. This is the test that most often drives a feature into dropped_features: when every train-viable combination inverts on dev, it usually signals that X_dev is too small or not representative of X for that feature. The three levers are enlarging the dev sample, relaxing max_n_mod, or dropping the feature.
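The ordering comparison can be sketched with plain dictionaries standing in for the two target-rate series. `preserves_rank` is a hypothetical name, and the alignment step here simply looks up each train modality in dev, which is one possible reading of "aligned on the same modality index".

```python
def preserves_rank(train_rates: dict, dev_rates: dict) -> bool:
    """Illustrative check: same modality ordering by target rate on train and dev."""
    # align dev on train's modalities (raises KeyError if one is missing from dev)
    modalities = list(train_rates)
    aligned_dev = {m: dev_rates[m] for m in modalities}
    # only the order of modalities matters, not the magnitude of the rates
    train_order = sorted(modalities, key=train_rates.get)
    dev_order = sorted(modalities, key=aligned_dev.get)
    return train_order == dev_order


train = {"low": 0.05, "mid": 0.12, "high": 0.30}
print(preserves_rank(train, {"high": 0.22, "low": 0.07, "mid": 0.15}))  # True
print(preserves_rank(train, {"low": 0.16, "mid": 0.09, "high": 0.28}))  # False: low and mid invert
```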