Comparison with other binning libraries

Three Python libraries are usually considered for feature discretization:

  • AutoCarver — supervised, target-association-driven binning with dev-set robustness validation.

  • optbinning — supervised binning solved as a mixed-integer program.

  • sklearn.preprocessing.KBinsDiscretizer — unsupervised quantile / uniform / k-means binning.

This page compares them on scope, algorithm, and ergonomics so you can pick the right tool for your problem. The runnable code snippets are unit-tested in tests/examples/test_comparison_snippets.py.

Scope at a glance

AutoCarver

optbinning

KBinsDiscretizer

Supervised (uses y)

yes

yes

no

Binary classification

BinaryCarver

OptimalBinning

n/a

Multiclass classification

MulticlassCarver

MulticlassOptimalBinning

n/a

Regression / continuous target

ContinuousCarver

ContinuousOptimalBinning

n/a

Quantitative features

yes

yes

yes

Categorical features

yes

yes

no (must encode first)

Ordinal features (with known order)

yes (OrdinalDiscretizer enforces the declared order)

via user_splits workaround

no

NaN as own modality

yes

yes

no (raises)

Held-out dev-set robustness check

yes (built in)

no

no

Optimality guarantee for fixed min_freq / max_n_mod / metric

yes — exhaustive search over admissible combinations

yes (MIP, under its own constraints)

n/a (no objective)

Per-bin stats + carving history after fit

yes — ``Features.summary`` and ``Features.history``

via binning_table

no

JSON round-trip persistence

yes

via pickle

via pickle

sklearn Pipeline compatible

yes (BaseEstimator / TransformerMixin)

yes

yes

Feature pre-selection helpers

ClassificationSelector, RegressionSelector

no

no

Algorithmic axis

The three libraries answer “what’s a good bin?” with very different objectives:

Library

Objective

Constraint surface

AutoCarver

Maximize Tschuprow’s T (default) or Cramér’s V between the carved feature and the target via exhaustive search over every admissible combination of consecutive modalities — so for fixed min_freq, max_n_mod and metric, no other combination scores higher. NaN groupings are explored as a separate combinatorial pass.

min_freq (minimum bucket share), max_n_mod (cap on number of modalities), monotonic ordering for ordinal features (enforced by OrdinalDiscretizer), and a dev-set veto: any candidate that flips its target-rate ordering on the dev set is rejected.

optbinning

Maximize Information Value (IV) (binary) or split-gain analogues, solved as a mixed-integer program (CBC by default).

User-declarable monotonicity, minimum bin size, maximum number of bins, optional WoE smoothing, and constraint blocks (e.g. PSI-based stability).

KBinsDiscretizer

No target awareness. Splits are placed on the marginal distribution of X only: equal-frequency (quantile), equal-width (uniform), or 1-D k-means.

n_bins per feature; that’s it.

The takeaway: AutoCarver and optbinning both optimize against the target, but AutoCarver’s robustness step (the dev-set veto) is something optbinning does not do natively — you’d have to script it yourself with cross-validation. KBinsDiscretizer is a different category: it’s a fast preprocessing primitive, not a supervised binner.

Side-by-side: bin a mixed feature set on the same data

The same problem — discretize four numeric columns and one categorical column of the Titanic data — solved three ways. All three blocks are runnable; the optbinning and KBinsDiscretizer blocks are skipped automatically in CI when those libraries are not installed.

AutoCarver

import pandas as pd
from sklearn.model_selection import train_test_split

from AutoCarver import BinaryCarver, Features

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, dev = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

features = Features(
    categoricals=["Sex"],
    quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},
)
carver = BinaryCarver(features=features, min_freq=0.05, max_n_mod=5)
carver.fit(train, train[target], X_dev=dev, y_dev=dev[target])
train_binned = carver.transform(train)
  • One call covers numeric, categorical, and ordinal columns.

  • The dev set is consumed at fit time: any bin combination whose target-rate ordering doesn’t survive on the dev sample is discarded.

  • Persisting the fitted state is carver.save("titanic_carver.json").

optbinning

import pandas as pd
from sklearn.model_selection import train_test_split
from optbinning import OptimalBinning

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, _ = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

# one binner per column, dtype declared explicitly
columns = {
    "Age": "numerical",
    "Fare": "numerical",
    "Siblings/Spouses Aboard": "numerical",
    "Parents/Children Aboard": "numerical",
    "Sex": "categorical",
    "Pclass": "categorical",  # optbinning has no first-class ordinal type
}
binners = {}
train_binned = pd.DataFrame(index=train.index)
for name, dtype in columns.items():
    ob = OptimalBinning(name=name, dtype=dtype, solver="cbc")
    ob.fit(train[name].to_numpy(), train[target].to_numpy())
    train_binned[name] = ob.transform(train[name].to_numpy(), metric="bins")
    binners[name] = ob
  • Fits one binner per feature — you manage the loop.

  • No held-out validation step; you’d add cross-validation yourself.

  • Ordinal columns must be passed as categorical (with optional user_splits), losing the known order.

KBinsDiscretizer

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, _ = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

numeric_cols = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
train_numeric = train[numeric_cols].fillna(train[numeric_cols].median())

kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
train_binned = pd.DataFrame(
    kbd.fit_transform(train_numeric),
    columns=numeric_cols,
    index=train.index,
)
  • Unsupervised — the target is never used, so the bins do not maximize anything related to y.

  • No support for categoricals or NaN — you must impute and encode first.

  • Strong baseline when you need fast, model-agnostic binning and you accept that bins won’t be target-optimal.

When to pick which

Pick

When

AutoCarver

You want supervised binning and you have (or can carve out) a dev sample, you mix numeric / categorical / ordinal columns, you need a JSON-portable artifact to ship to a scorecard or production model, or you also need feature pre-selection.

optbinning

You want IV-driven binning solved as a true optimization problem, you need fine-grained per-feature constraints (monotonicity, WoE smoothing, PSI-based stability), and you are comfortable looping over features and managing validation yourself.

KBinsDiscretizer

You need a fast, unsupervised preprocessing step inside an sklearn Pipeline — e.g. as input to a tree-free linear model — and you don’t need target-aware bins.

A reasonable rule of thumb: reach for KBinsDiscretizer when binning is a preprocessing concern, AutoCarver when binning is a modelling concern with a held-out validation budget, and optbinning when you need to encode hard business constraints into each feature’s bin definition.

Benchmark notebook

A runnable side-by-side benchmark on two public datasets — German Credit (binary, mixed dtypes) and California Housing (regression, all-numeric) — comparing the three libraries on fit time, downstream-model score, and train``→``test score drop:

The numbers are illustrative — single run, single machine, fixed seed — and are not an IV / Tschuprow’s T leaderboard, since those metrics structurally favour the library whose objective they are. Re-run on your own data before drawing conclusions.

Caveats

  • All three libraries are actively maintained; the table reflects the public APIs as of AutoCarver Beta release. Open an issue if anything has drifted.