Comparison with other binning libraries

Three Python libraries are usually considered for feature discretization:

AutoCarver — supervised, target-association-driven binning with dev-set robustness validation.
optbinning — supervised binning solved as a mixed-integer program.
sklearn.preprocessing.KBinsDiscretizer — unsupervised quantile / uniform / k-means binning.

This page compares them on scope, algorithm, and ergonomics so you can pick the right tool for your problem. The runnable code snippets are unit-tested in tests/examples/test_comparison_snippets.py.

Scope at a glance

	AutoCarver	optbinning	KBinsDiscretizer
Supervised (uses `y`)	yes	yes	no
Binary classification	`BinaryCarver`	`OptimalBinning`	n/a
Multiclass classification	`MulticlassCarver`	`MulticlassOptimalBinning`	n/a
Regression / continuous target	`ContinuousCarver`	`ContinuousOptimalBinning`	n/a
Quantitative features	yes	yes	yes
Categorical features	yes	yes	no (must encode first)
Ordinal features (with known order)	yes (`OrdinalDiscretizer` enforces the declared order)	via `user_splits` workaround	no
`NaN` as own modality	yes	yes	no (raises)
Held-out dev-set robustness check	yes (built in)	no	no
Optimality guarantee for fixed `min_freq` / `max_n_mod` / metric	yes — exhaustive search over admissible combinations	yes (MIP, under its own constraints)	n/a (no objective)
Per-bin stats + carving history after `fit`	yes (`Features.summary` and `Features.history`)	via `binning_table`	no
JSON round-trip persistence	yes	via pickle	via pickle
sklearn `Pipeline` compatible	yes (`BaseEstimator` / `TransformerMixin`)	yes	yes
Feature pre-selection helpers	`ClassificationSelector`, `RegressionSelector`	no	no

Algorithmic axis

The three libraries answer “what’s a good bin?” with very different objectives:

Library	Objective	Constraint surface
AutoCarver	Maximize Tschuprow’s T (default) or Cramér’s V between the carved feature and the target via exhaustive search over every admissible combination of consecutive modalities — so for fixed `min_freq`, `max_n_mod` and metric, no other combination scores higher. NaN groupings are explored as a separate combinatorial pass.	`min_freq` (minimum bucket share), `max_n_mod` (cap on number of modalities), monotonic ordering for ordinal features (enforced by `OrdinalDiscretizer`), and a dev-set veto: any candidate that flips its target-rate ordering on the dev set is rejected.
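optbinning	Maximize Information Value (IV) for binary targets, or split-gain analogues for other target types, solved as a mathematical program (constraint programming, `solver="cp"`, by default; mixed-integer programming via `solver="mip"`, with CBC as one available backend).	User-declarable monotonicity, minimum bin size, maximum number of bins, optional WoE smoothing, and constraint blocks (e.g. PSI-based stability).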
KBinsDiscretizer	No target awareness. Splits are placed on the marginal distribution of `X` only: equal-frequency (`quantile`), equal-width (`uniform`), or 1-D k-means.	`n_bins` per feature; that’s it.
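To make the "exhaustive search over consecutive modalities" objective concrete, here is a minimal sketch in plain NumPy. It is not AutoCarver's implementation: the function names `tschuprow_t` and `best_consecutive_grouping` are hypothetical, the contingency table is toy data, and the sketch assumes every target class and every group has at least one observation.

```python
from itertools import combinations

import numpy as np


def tschuprow_t(table: np.ndarray) -> float:
    """Tschuprow's T for an r x c contingency table of counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    # Expected counts under independence of rows and columns.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1)))))


def best_consecutive_grouping(counts, min_freq=0.05, max_n_mod=5):
    """Try every grouping of consecutive rows and keep the highest-T one.

    counts: (n_modalities, n_classes) table, rows already in modality order.
    Returns (cut positions, score); a cut at k separates rows k-1 and k.
    """
    counts = np.asarray(counts, dtype=float)
    n_mod, total = counts.shape[0], counts.sum()
    best_score, best_cuts = -1.0, None
    for n_groups in range(2, min(max_n_mod, n_mod) + 1):
        for cuts in combinations(range(1, n_mod), n_groups - 1):
            edges = (0, *cuts, n_mod)
            grouped = np.array(
                [counts[a:b].sum(axis=0) for a, b in zip(edges, edges[1:])]
            )
            # Reject candidates with a group below the minimum share.
            if (grouped.sum(axis=1) / total).min() < min_freq:
                continue
            score = tschuprow_t(grouped)
            if score > best_score:
                best_score, best_cuts = score, cuts
    return best_cuts, best_score


# Toy table: six ordered modalities, columns are [non-event, event] counts.
toy_counts = np.array(
    [[90, 10], [85, 15], [60, 40], [55, 45], [20, 80], [15, 85]]
)
best_cuts, best_score = best_consecutive_grouping(toy_counts)
print(best_cuts, round(best_score, 3))
```

Because the denominator of Tschuprow's T grows with the number of groups, a coarse grouping can outscore a finer one even when the finer one has a larger chi-square statistic.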

The takeaway: AutoCarver and optbinning both optimize against the target, but AutoCarver’s robustness step (the dev-set veto) is something optbinning does not do natively — you’d have to script it yourself with cross-validation. KBinsDiscretizer is a different category: it’s a fast preprocessing primitive, not a supervised binner.
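If you want a comparable check around a binner that lacks one, the veto can be approximated by comparing the target-rate ordering of the bins on the two samples. The sketch below uses only pandas; `target_rate_ordering` and `survives_dev_veto` are hypothetical helper names, the data is a toy example, and ties between bin rates are not handled specially.

```python
import pandas as pd


def target_rate_ordering(bins: pd.Series, y: pd.Series) -> list:
    """Bin labels sorted by ascending mean target rate."""
    return y.groupby(bins).mean().sort_values(kind="stable").index.tolist()


def survives_dev_veto(train_bins, y_train, dev_bins, y_dev) -> bool:
    """True when the bins rank identically by target rate on train and dev."""
    return target_rate_ordering(train_bins, y_train) == target_rate_ordering(
        dev_bins, y_dev
    )


# Toy train sample: target rates are a=0.25, b=0.50, c=0.75.
train_bins = pd.Series(["a"] * 4 + ["b"] * 4 + ["c"] * 4)
y_train = pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0])

# Dev sample that preserves the ordering, and one that reverses it.
dev_bins = pd.Series(["a", "a", "b", "b", "c", "c"])
y_dev_stable = pd.Series([0, 0, 0, 1, 1, 1])
y_dev_flipped = pd.Series([1, 1, 0, 1, 0, 0])

print(survives_dev_veto(train_bins, y_train, dev_bins, y_dev_stable))
print(survives_dev_veto(train_bins, y_train, dev_bins, y_dev_flipped))
```

A candidate that fails this check would be discarded and the next-best candidate examined; a bin that is missing from the dev sample also fails, since the two orderings then differ.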

Side-by-side: bin a mixed feature set on the same data

The same problem, discretizing the Titanic data's four numeric columns plus `Sex` and `Pclass`, solved three ways. AutoCarver and optbinning bin all six columns; KBinsDiscretizer handles only the four numeric ones. All three blocks are runnable; the optbinning and KBinsDiscretizer blocks are skipped automatically in CI when those libraries are not installed.

AutoCarver

import pandas as pd
from sklearn.model_selection import train_test_split

from AutoCarver import BinaryCarver, Features

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, dev = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

features = Features(
    categoricals=["Sex"],
    quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
    ordinals={"Pclass": ["1", "2", "3"]},
)
carver = BinaryCarver(features=features, min_freq=0.05, max_n_mod=5)
carver.fit(train, train[target], X_dev=dev, y_dev=dev[target])
train_binned = carver.transform(train)

One call covers numeric, categorical, and ordinal columns.
The dev set is consumed at fit time: any bin combination whose target-rate ordering doesn’t survive on the dev sample is discarded.
To persist the fitted state, call carver.save("titanic_carver.json").

optbinning

import pandas as pd
from sklearn.model_selection import train_test_split
from optbinning import OptimalBinning

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, _ = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

# one binner per column, dtype declared explicitly
columns = {
    "Age": "numerical",
    "Fare": "numerical",
    "Siblings/Spouses Aboard": "numerical",
    "Parents/Children Aboard": "numerical",
    "Sex": "categorical",
    "Pclass": "categorical",  # optbinning has no first-class ordinal type
}
binners = {}
train_binned = pd.DataFrame(index=train.index)
for name, dtype in columns.items():
    # solver accepts "cp", "mip" or "ls"; CBC is selected through mip_solver
    ob = OptimalBinning(name=name, dtype=dtype, solver="mip", mip_solver="cbc")
    ob.fit(train[name].to_numpy(), train[target].to_numpy())
    train_binned[name] = ob.transform(train[name].to_numpy(), metric="bins")
    binners[name] = ob

Fits one binner per feature — you manage the loop.
No held-out validation step; you’d add cross-validation yourself.
Ordinal columns must be passed as categorical (with optional user_splits), losing the known order.

KBinsDiscretizer

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
data = pd.read_csv(url)
target = "Survived"
train, _ = train_test_split(data, test_size=0.33, random_state=42, stratify=data[target])

numeric_cols = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
train_numeric = train[numeric_cols].fillna(train[numeric_cols].median())

kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
train_binned = pd.DataFrame(
    kbd.fit_transform(train_numeric),
    columns=numeric_cols,
    index=train.index,
)

Unsupervised — the target is never used, so the bins do not maximize anything related to y.
No support for categoricals or NaN — you must impute and encode first.
Strong baseline when you need fast, model-agnostic binning and you accept that bins won’t be target-optimal.
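The "impute and encode first" step can be folded into a single sklearn transformer. The following is a minimal sketch on synthetic data rather than the Titanic file; the column names and the choice of median imputation and `OrdinalEncoder` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, OrdinalEncoder

# Synthetic frame with two numeric columns and one categorical column.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "age": rng.normal(40, 12, size=200),
        "fare": rng.exponential(30, size=200),
        "sex": rng.choice(["female", "male"], size=200),
    }
)
toy.loc[toy.index[::10], "age"] = np.nan  # inject missing values

# Numeric columns: impute, then cut into 5 equal-frequency bins.
numeric_binner = make_pipeline(
    SimpleImputer(strategy="median"),
    KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
)
# Categorical column: encoded separately, since KBinsDiscretizer is numeric-only.
preprocess = ColumnTransformer(
    [
        ("num", numeric_binner, ["age", "fare"]),
        ("cat", OrdinalEncoder(), ["sex"]),
    ]
)
binned = preprocess.fit_transform(toy)
print(binned.shape)
```

Keeping imputation inside the pipeline means the medians are learned at fit time and reused at transform time, instead of being recomputed on each new sample.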

When to pick which

Pick	When
AutoCarver	You want supervised binning and you have (or can carve out) a dev sample, you mix numeric / categorical / ordinal columns, you need a JSON-portable artifact to ship to a scorecard or production model, or you also need feature pre-selection.
optbinning	You want IV-driven binning solved as a true optimization problem, you need fine-grained per-feature constraints (monotonicity, WoE smoothing, PSI-based stability), and you are comfortable looping over features and managing validation yourself.
KBinsDiscretizer	You need a fast, unsupervised preprocessing step inside an sklearn `Pipeline` (e.g. as input to a linear model) and you don’t need target-aware bins.

A reasonable rule of thumb: reach for KBinsDiscretizer when binning is a preprocessing concern, AutoCarver when binning is a modelling concern with a held-out validation budget, and optbinning when you need to encode hard business constraints into each feature’s bin definition.

Benchmark notebook

A runnable side-by-side benchmark on two public datasets, German Credit (binary, mixed dtypes) and California Housing (regression, all-numeric), compares the three libraries on fit time, downstream-model score, and train→test score drop:

Benchmark: AutoCarver vs. optbinning vs. KBinsDiscretizer

The numbers are illustrative — single run, single machine, fixed seed — and are not an IV / Tschuprow’s T leaderboard, since those metrics structurally favour the library whose objective they are. Re-run on your own data before drawing conclusions.
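For a quick check on your own data, the three quantities can be measured with a few lines of sklearn. This is a sketch under stated assumptions, not the notebook's code: `fit_time_and_score_drop` is a hypothetical helper, ROC AUC stands in for the downstream-model score, the data is synthetic, and KBinsDiscretizer is used only because it ships with sklearn.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer


def fit_time_and_score_drop(model, X_train, y_train, X_test, y_test) -> dict:
    """Time the fit, then score the fitted model on both samples."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return {
        "fit_seconds": fit_seconds,
        "train_auc": train_auc,
        "test_auc": test_auc,
        "drop": train_auc - test_auc,  # positive when the model degrades on test
    }


# Synthetic binary-classification data with a fixed seed.
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

# Binning step followed by a downstream linear model.
model = make_pipeline(
    KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile"),
    LogisticRegression(max_iter=1000),
)
report = fit_time_and_score_drop(model, X_train, y_train, X_test, y_test)
print(report)
```

Swapping the first pipeline step for another binner keeps the measurement identical, which is what makes the comparison fair; repeating the run over several seeds gives a sense of the variance that a single run hides.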

Caveats

All three libraries are actively maintained; the tables reflect the public APIs as of the AutoCarver Beta release. Open an issue if anything has drifted.