Viability testing
The combination search ranks every candidate grouping by its association with the target, but the highest-scoring grouping is not necessarily the one that is kept. Before a combination is selected it must pass the viability filter — a set of statistical guardrails that reject groupings which are over-fit, degenerate, or driven by sampling noise. The top-K DP search walks candidates in metric-descending order and the first one that is viable on both the train and dev samples wins.
The filter bundles three independent tests:
Minimum-frequency — every grouped modality is frequent enough (Wilson score interval).
Distinct target rates — consecutive modalities carry different target rates.
Train/dev rank preservation — the modality ordering by target rate is stable between train and dev (robustness veto).
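Decision rule. The min-frequency and distinct-rate tests run on both samples; rank preservation runs on dev only (it compares dev against train). A combination is viable iff it passes on train and, when a dev sample was provided, passes on dev:
\[\text{viable} \iff \text{viable}_{\text{train}} \wedge \left(\text{no dev sample} \vee \text{viable}_{\text{dev}}\right)\]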
When no candidate survives the filter the feature is dropped — see
Dropped features (no robust combination). Each failing test contributes a human-readable reason
(“Non-representative modality for min_freq=…”, “Non-distinct target rates per
consecutive modalities”, “Inversion of target rates per modality”) that
surfaces in carver.history and dropped_reason.
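The selection logic described above can be sketched in a few lines. This is a minimal illustration and not AutoCarver's implementation: `first_viable`, `passes_train`, and `passes_dev` are hypothetical names, and the combinations are plain strings standing in for real groupings.

```python
from typing import Callable, Hashable, Iterable, Optional


def first_viable(
    ranked: Iterable[Hashable],
    passes_train: Callable[[Hashable], bool],
    passes_dev: Optional[Callable[[Hashable], bool]] = None,
) -> Optional[Hashable]:
    """Return the first combination that is viable on train and, if given, on dev.

    `ranked` is assumed to be sorted by association metric, best first.
    Returns None when no candidate survives, i.e. the feature would be dropped.
    """
    for combination in ranked:
        if not passes_train(combination):
            continue  # fails a test on the train sample
        if passes_dev is not None and not passes_dev(combination):
            continue  # fails a test on the dev sample
        return combination
    return None


# Hypothetical candidates, already ranked by metric (best first).
ranked = ["A|B|C|D", "A|B|CD", "AB|CD"]
train_ok = {"A|B|C|D": False, "A|B|CD": True, "AB|CD": True}
dev_ok = {"A|B|C|D": True, "A|B|CD": False, "AB|CD": True}

selected = first_viable(ranked, train_ok.__getitem__, dev_ok.__getitem__)
train_only = first_viable(ranked, train_ok.__getitem__)
print(selected)    # AB|CD
print(train_only)  # A|B|CD
```

With a dev sample the second-ranked candidate is vetoed and the third one is kept; without a dev sample the second-ranked candidate wins.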
- AutoCarver.combinations.utils.testing.test_viability(rates: Series | DataFrame, min_freq: int | float | None, target_rate: str, alpha: float, train_target_rate: Series | None = None) dict
tests viability of the rates.
rates must carry per-modality count and frequency columns (added by the binary/continuous target-rate builders); CI tests use the counts and nobs = counts.sum().
- AutoCarver.combinations.utils.testing.is_viable(test_results: dict)
checks if combination is viable on train and dev (if provided)
Minimum-frequency test (Wilson score interval)
A candidate combination is viable on a sample only if every grouped modality is
sufficiently frequent. Comparing \(\hat p = \text{count} / n_{obs}\) directly
against min_freq is noisy on small modalities — a modality with
\(\hat p = 4\%\) out of \(n_{obs}=100\) would be rejected against
min_freq=5%, even though its 95% confidence interval comfortably straddles
5%. AutoCarver instead tests the one-sided question “is this modality’s
true proportion significantly below min_freq?” at level \(\alpha\),
using a Wilson score interval — the small-sample-stable proportion interval
recommended over Wald in Brown, Cai & DasGupta (2001).
Decision rule. Modality \(m\) is declared under-represented iff the
upper bound of the two-sided Wilson interval for \(\hat p_m\) is
strictly below min_freq:
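\[\text{UB}(\hat p_m, n_{obs}, \alpha) = \frac{\hat p_m + \dfrac{z^2}{2 n_{obs}} + z \sqrt{\dfrac{\hat p_m (1 - \hat p_m)}{n_{obs}} + \dfrac{z^2}{4 n_{obs}^2}}}{1 + \dfrac{z^2}{n_{obs}}}\]
with \(z = \Phi^{-1}(1 - \alpha/2)\) (two-sided z-score; \(\alpha=0.05\) gives \(z \approx 1.96\)). Reject iff \(\text{UB}(\hat p_m, n_{obs}, \alpha) < \text{min_freq}\).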
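As a worked example of the decision rule, the sketch below evaluates the standard closed form of the Wilson upper bound with the Python standard library only. The function name `wilson_upper_bound_sketch` is hypothetical and the code is an illustration of the formula, not the library's `wilson_upper_bound` source.

```python
from math import sqrt
from statistics import NormalDist


def wilson_upper_bound_sketch(count: float, nobs: int, alpha: float) -> float:
    """Upper bound of the two-sided Wilson score interval for count / nobs."""
    if nobs == 0:
        return 1.0  # empty sample: never significantly below any threshold
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided z-score
    p_hat = count / nobs
    centre = p_hat + z**2 / (2 * nobs)
    margin = z * sqrt(p_hat * (1 - p_hat) / nobs + z**2 / (4 * nobs**2))
    return (centre + margin) / (1 + z**2 / nobs)


min_freq = 0.05

# 4 observations out of 100: the interval is wide and reaches above 5%.
ub_small = wilson_upper_bound_sketch(4, 100, alpha=0.05)
rejected_small = ub_small < min_freq

# Same 4% proportion on 100,000 observations: the interval is tight.
ub_large = wilson_upper_bound_sketch(4_000, 100_000, alpha=0.05)
rejected_large = ub_large < min_freq

print(round(ub_small, 4), rejected_small)
print(round(ub_large, 4), rejected_large)
```

The 4% modality is kept on the small sample (upper bound near 9.8%) and rejected on the large one (upper bound near 4.1%), which is the behaviour the rule is designed to produce.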
Properties.
Asymptotic equivalence: as \(n_{obs} \to \infty\), \(\text{UB} \to \hat p\), so the test converges to the legacy strict threshold \(\hat p < \text{min_freq}\).
Small-sample conservativity: a modality with very few observations cannot be rejected (the CI is too wide to fall below min_freq), preventing premature merges driven by sampling noise.
\(n_{obs} = 0\) returns \(\text{UB} = 1.0\), so empty groups are never rejected by this test (other checks catch them).
Where the test fires.
Inside each Discretizer to gate raw modalities before the combination search. Carvers discretize at min_freq / 2 so this gate runs at the halved threshold, giving the combination evaluator a finer granularity to recombine.
Inside CombinationEvaluator viability checks on both train and dev samples for every candidate combination during the search.
Tuning. Set via DiscretizerConfig.min_freq_alpha (default
0.05). Smaller \(\alpha\) → wider CI → fewer rejections → less merging;
larger \(\alpha\) → tighter CI → more rejections → more aggressive merging.
\(\alpha = 1\) recovers the legacy strict-threshold behaviour
(\(\text{UB}\) collapses to \(\hat p\)).
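The effect of \(\alpha\) can be checked numerically. The sketch below re-implements the Wilson upper bound with the standard library (the helper `upper_bound` is a hypothetical stand-in, not the library function) and evaluates one modality with 40 observations out of 1000 against min_freq=5% at several levels.

```python
from math import sqrt
from statistics import NormalDist


def upper_bound(count: float, nobs: int, alpha: float) -> float:
    """Wilson upper bound for count / nobs (illustrative re-implementation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = count / nobs
    centre = p_hat + z**2 / (2 * nobs)
    margin = z * sqrt(p_hat * (1 - p_hat) / nobs + z**2 / (4 * nobs**2))
    return (centre + margin) / (1 + z**2 / nobs)


count, nobs, min_freq = 40, 1000, 0.05
alphas = (0.001, 0.05, 0.5, 1.0)

# Upper bound and rejection decision for each significance level.
bounds = {alpha: upper_bound(count, nobs, alpha) for alpha in alphas}
rejected = {alpha: ub < min_freq for alpha, ub in bounds.items()}

for alpha in alphas:
    print(f"alpha={alpha}: UB={bounds[alpha]:.4f} rejected={rejected[alpha]}")
```

The upper bound shrinks as \(\alpha\) grows: the modality is kept at \(\alpha = 0.001\) and \(\alpha = 0.05\), rejected at \(\alpha = 0.5\), and at \(\alpha = 1\) the bound equals the raw proportion of 4%.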
- AutoCarver.discretizers.utils.frequency_ci.wilson_upper_bound(count: ndarray | int | float, nobs: int, alpha: float) ndarray | float
Upper bound of the two-sided Wilson score interval for count / nobs.
- Parameters:
count (array-like or scalar) – Observed successes. Accepts integer counts or float counts (e.g. weighted/aggregated frequencies).
nobs (int) – Number of trials. Must be >= 0; returns 1.0 when nobs == 0 so callers treat empty samples as non-significant.
alpha (float) – Two-sided significance level (e.g. 0.05 for a 95% interval).
- Returns:
Wilson upper bound, same shape as count.
- Return type:
array-like or scalar
- AutoCarver.discretizers.utils.frequency_ci.is_significantly_below(count: ndarray | int | float, nobs: int, min_freq: float, alpha: float) ndarray | bool
Whether the observed proportion count / nobs is significantly below min_freq.
A modality is significantly below min_freq when the Wilson upper bound of its observed proportion is strictly below min_freq.
Distinct-target-rate test
Modalities are searched as consecutive groupings of an ordered feature (by ordinal rank, target rate, or numeric quantile — see Search strategy — interval dynamic programming (DP) with progressive top-K). Two adjacent groups that end up with the same target rate are statistically indistinguishable: the split between them carries no information and the two groups should have been merged into one. A combination is rejected as soon as any consecutive pair shares its target rate:
\[\exists\, m : \tau_m \approx \tau_{m+1}\]
where \(\tau_m\) is the target rate of modality \(m\) and
\(\approx\) is a floating-point closeness check (numpy.isclose). Keeping
the test on consecutive modalities (rather than all pairs) matches the
ordered nature of the search: non-adjacent groups are allowed to coincide, only
neighbours that would collapse are forbidden. Failing this test favours the
coarser, more parsimonious combination the search will reach next.
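A minimal sketch of the consecutive-pair check follows. It is an illustration only: the function name is hypothetical, and `math.isclose` from the standard library is used as a stand-in for `numpy.isclose`, with tolerances chosen to mirror NumPy's defaults (the two functions combine relative and absolute tolerance slightly differently).

```python
from math import isclose
from typing import Sequence


def has_distinct_consecutive_rates(rates: Sequence[float]) -> bool:
    """True when no two neighbouring modalities share (approximately) a target rate.

    `rates` is assumed to follow the modality order used by the search.
    """
    return not any(
        isclose(left, right, rel_tol=1e-5, abs_tol=1e-8)
        for left, right in zip(rates, rates[1:])
    )


adjacent_tie = has_distinct_consecutive_rates([0.10, 0.10, 0.25])
non_adjacent_tie = has_distinct_consecutive_rates([0.10, 0.25, 0.10])
all_distinct = has_distinct_consecutive_rates([0.05, 0.12, 0.30])

print(adjacent_tie)      # False: neighbours 1 and 2 coincide
print(non_adjacent_tie)  # True: only non-adjacent groups coincide
print(all_distinct)      # True
```

Only the first case would be rejected, since the tie there sits between neighbouring groups.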
Train/dev rank-preservation test (robustness veto)
When a dev sample is provided, a viable combination must be robust: the modalities, ranked by their target rate, must keep the same order on train and on dev. A combination whose target-rate ordering flips between the two samples is over-fit to train and is vetoed:
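\[\operatorname{argsort}_m\left(\tau_m^{\text{train}}\right) = \operatorname{argsort}_m\left(\tau_m^{\text{dev}}\right)\]
where \(\tau_m^{\text{train}}\) and \(\tau_m^{\text{dev}}\) are the target rates of modality \(m\) on each sample.
Both target-rate series are aligned on the same modality index before sorting by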
value, so the comparison is purely about order, not about the rates’ absolute
magnitudes. This is the test that most often drives a feature into
dropped_features: when every train-viable combination
inverts on dev, it usually signals that X_dev is too small or not
representative of X for that feature. The three levers are enlarging the dev
sample, relaxing max_n_mod, or dropping the feature.
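The rank-preservation check can be sketched as follows, assuming target rates are held in plain dictionaries keyed by modality label. The function name and the data are hypothetical; the library operates on pandas Series, which this sketch does not reproduce.

```python
from typing import Dict, Hashable


def preserves_rank(
    train_rates: Dict[Hashable, float], dev_rates: Dict[Hashable, float]
) -> bool:
    """True when modalities sorted by target rate come out in the same order
    on train and on dev. Only the ordering matters, not the magnitudes."""
    modalities = list(train_rates)  # align dev on the train modality index
    train_order = sorted(modalities, key=lambda m: train_rates[m])
    dev_order = sorted(modalities, key=lambda m: dev_rates[m])
    return train_order == dev_order


train = {"low": 0.05, "mid": 0.12, "high": 0.30}
dev_stable = {"low": 0.07, "mid": 0.10, "high": 0.41}   # same order, other values
dev_flipped = {"low": 0.11, "mid": 0.09, "high": 0.41}  # low and mid swap

stable = preserves_rank(train, dev_stable)
flipped = preserves_rank(train, dev_flipped)
print(stable)   # True
print(flipped)  # False
```

The second dev sample would veto the combination even though its rates are numerically close to train, because the ordering of "low" and "mid" inverts.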