.. _comparison:

Comparison with other binning libraries
=======================================

Three Python libraries are usually considered for feature discretization:

* **AutoCarver**: supervised, target-association-driven binning with dev-set robustness validation.
* **optbinning**: supervised binning solved as a mathematical optimization problem.
* **sklearn.preprocessing.KBinsDiscretizer**: unsupervised quantile / uniform / k-means binning.

This page compares them on scope, algorithm, and ergonomics so you can pick the right tool for your problem. The runnable code snippets are unit-tested in ``tests/examples/test_comparison_snippets.py``.

Scope at a glance
-----------------

.. list-table::
   :header-rows: 1
   :widths: 30 25 25 20

   * -
     - **AutoCarver**
     - **optbinning**
     - **KBinsDiscretizer**
   * - Supervised (uses ``y``)
     - yes
     - yes
     - no
   * - Binary classification
     - :class:`BinaryCarver`
     - ``OptimalBinning``
     - n/a
   * - Multiclass classification
     - :class:`MulticlassCarver`
     - ``MulticlassOptimalBinning``
     - n/a
   * - Regression / continuous target
     - :class:`ContinuousCarver`
     - ``ContinuousOptimalBinning``
     - n/a
   * - Quantitative features
     - yes
     - yes
     - yes
   * - Categorical features
     - yes
     - yes
     - no (must encode first)
   * - Ordinal features (with known order)
     - yes (:class:`OrdinalDiscretizer` enforces the declared order)
     - via ``user_splits`` workaround
     - no
   * - ``NaN`` as own modality
     - yes
     - yes
     - no (raises)
   * - Held-out dev-set robustness check
     - **yes (built in)**
     - no
     - no
   * - Optimality guarantee for fixed ``min_freq`` / ``max_n_mod`` / metric
     - **yes**, by exhaustive search over admissible combinations
     - yes (MIP, under its own constraints)
     - n/a (no objective)
   * - Per-bin stats + carving history after ``fit``
     - **yes**: ``Features.summary`` and ``Features.history``
     - via ``binning_table``
     - no
   * - JSON round-trip persistence
     - yes
     - via pickle
     - via pickle
   * - sklearn ``Pipeline`` compatible
     - yes (``BaseEstimator`` / ``TransformerMixin``)
     - yes
     - yes
   * - Feature pre-selection helpers
     - :class:`ClassificationSelector`, :class:`RegressionSelector`
     - no
     - no

Algorithmic axis
----------------

The three libraries answer "what's a good bin?" with very different objectives:

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Library
     - Objective
     - Constraint surface
   * - **AutoCarver**
     - Maximize **Tschuprow's T** (default) or **Cramér's V** between the carved feature and the target via **exhaustive search** over every admissible combination of consecutive modalities, so that for fixed ``min_freq``, ``max_n_mod`` and metric, no other combination scores higher. ``NaN`` groupings are explored as a separate combinatorial pass.
     - ``min_freq`` (minimum bucket share), ``max_n_mod`` (cap on number of modalities), monotonic ordering for ordinal features (enforced by :class:`OrdinalDiscretizer`), and a dev-set veto: any candidate that flips its target-rate ordering on the dev set is rejected.
   * - **optbinning**
     - Maximize **Information Value (IV)** (binary) or split-gain analogues, solved as a mathematical program (mixed-integer or constraint programming).
     - User-declarable monotonicity, minimum bin size, maximum number of bins, optional WoE smoothing, and constraint blocks (e.g. PSI-based stability).
   * - **KBinsDiscretizer**
     - **No target awareness.** Splits are placed on the marginal distribution of ``X`` only: equal-frequency (``quantile``), equal-width (``uniform``), or 1-D k-means.
     - ``n_bins`` per feature; that's it.

The takeaway: **AutoCarver and optbinning both optimize against the target**, but AutoCarver's robustness step (the dev-set veto) is something optbinning does not do natively; you would have to script it yourself, for example with cross-validation. KBinsDiscretizer is a different category: it's a fast preprocessing primitive, not a supervised binner.

Side-by-side: bin a mixed feature set on the same data
-------------------------------------------------------

The same problem, solved three ways: discretize four numeric columns and one categorical column of the Titanic data, plus the ordinal ``Pclass`` column where the library can handle it.
All three blocks are runnable; the optbinning and KBinsDiscretizer blocks are skipped automatically in CI when those libraries are not installed.

AutoCarver
^^^^^^^^^^

.. code-block:: python

   import pandas as pd
   from sklearn.model_selection import train_test_split

   from AutoCarver import BinaryCarver, Features

   url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
   data = pd.read_csv(url)
   target = "Survived"

   # stratified split: the dev sample is used for the robustness check at fit time
   train, dev = train_test_split(
       data, test_size=0.33, random_state=42, stratify=data[target]
   )

   # declare every column once, by type
   features = Features(
       categoricals=["Sex"],
       quantitatives=["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"],
       ordinals={"Pclass": ["1", "2", "3"]},
   )

   carver = BinaryCarver(features=features, min_freq=0.05, max_n_mod=5)
   carver.fit(train, train[target], X_dev=dev, y_dev=dev[target])
   train_binned = carver.transform(train)

* **One call** covers numeric, categorical, and ordinal columns.
* The dev set is consumed at ``fit`` time: any bin combination whose target-rate ordering doesn't survive on the dev sample is discarded.
* Persisting the fitted state takes one call: ``carver.save("titanic_carver.json")``.

optbinning
^^^^^^^^^^

.. code-block:: python

   import pandas as pd
   from sklearn.model_selection import train_test_split

   from optbinning import OptimalBinning

   url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
   data = pd.read_csv(url)
   target = "Survived"

   train, _ = train_test_split(
       data, test_size=0.33, random_state=42, stratify=data[target]
   )

   # one binner per column, dtype declared explicitly
   columns = {
       "Age": "numerical",
       "Fare": "numerical",
       "Siblings/Spouses Aboard": "numerical",
       "Parents/Children Aboard": "numerical",
       "Sex": "categorical",
       "Pclass": "categorical",  # optbinning has no first-class ordinal type
   }

   binners = {}
   train_binned = pd.DataFrame(index=train.index)
   for name, dtype in columns.items():
       # ``solver`` accepts "cp", "mip" or "ls"; "mip" selects the mixed-integer formulation
       ob = OptimalBinning(name=name, dtype=dtype, solver="mip")
       ob.fit(train[name].to_numpy(), train[target].to_numpy())
       # metric="bins" returns the bin label of each row instead of its WoE
       train_binned[name] = ob.transform(train[name].to_numpy(), metric="bins")
       binners[name] = ob

* Fits **one binner per feature**, so you manage the loop.
* No held-out validation step; you'd add cross-validation yourself.
* Ordinal columns must be passed as ``categorical`` (with optional ``user_splits``), losing the known order.

KBinsDiscretizer
^^^^^^^^^^^^^^^^

.. code-block:: python

   import pandas as pd
   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import KBinsDiscretizer

   url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
   data = pd.read_csv(url)
   target = "Survived"

   train, _ = train_test_split(
       data, test_size=0.33, random_state=42, stratify=data[target]
   )

   # numeric columns only; KBinsDiscretizer raises on NaN, so impute first
   numeric_cols = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
   train_numeric = train[numeric_cols].fillna(train[numeric_cols].median())

   kbd = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
   train_binned = pd.DataFrame(
       kbd.fit_transform(train_numeric),
       columns=numeric_cols,
       index=train.index,
   )

* **Unsupervised**: the target is never used, so the bins do not maximize anything related to ``y``.
* No support for categoricals or ``NaN``; you must impute and encode first.
* Strong baseline when you need fast, model-agnostic binning and you accept that bins won't be target-optimal.

When to pick which
------------------

.. list-table::
   :header-rows: 1
   :widths: 30 70

   * - Pick
     - When
   * - **AutoCarver**
     - You want supervised binning **and** you have (or can carve out) a dev sample, you mix numeric / categorical / ordinal columns, you need a JSON-portable artifact to ship to a scorecard or production model, or you also need feature pre-selection.
   * - **optbinning**
     - You want IV-driven binning solved as a true optimization problem, you need fine-grained per-feature constraints (monotonicity, WoE smoothing, PSI-based stability), and you are comfortable looping over features and managing validation yourself.
   * - **KBinsDiscretizer**
     - You need a fast, unsupervised preprocessing step inside an sklearn ``Pipeline`` (for example, as input to a linear model) and you don't need target-aware bins.

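
The ``Pipeline`` use case in the last row can be sketched as follows. This is a minimal illustration on synthetic data, not part of the benchmark: the data-generating rule, the variable names, and the choice of ``LogisticRegression`` are assumptions made for the example. It shows why one-hot encoded bins can help a linear model when the target depends non-monotonically on a feature.

.. code-block:: python

   import numpy as np
   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import train_test_split
   from sklearn.pipeline import make_pipeline
   from sklearn.preprocessing import KBinsDiscretizer

   # synthetic data (illustrative only): y is 1 when the first feature is far from 0
   rng = np.random.default_rng(0)
   X = rng.normal(size=(1000, 2))
   y = (np.abs(X[:, 0]) > 1.0).astype(int)

   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.33, random_state=42, stratify=y
   )

   # one-hot quantile bins let the linear model learn one weight per bin
   binned_model = make_pipeline(
       KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile"),
       LogisticRegression(max_iter=1000),
   )
   binned_model.fit(X_train, y_train)

   # baseline: the same linear model on the raw, unbinned features
   raw_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

   binned_score = binned_model.score(X_test, y_test)
   raw_score = raw_model.score(X_test, y_test)
   print(f"binned accuracy: {binned_score:.3f}, raw accuracy: {raw_score:.3f}")

Because the binning step sits inside the pipeline, the bin edges are learned on the training split only and reused unchanged at prediction time.
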

A reasonable rule of thumb: reach for **KBinsDiscretizer** when binning is a *preprocessing* concern, **AutoCarver** when binning is a *modelling* concern with a held-out validation budget, and **optbinning** when you need to encode hard business constraints into each feature's bin definition.

Benchmark notebook
------------------

A runnable side-by-side benchmark on two public datasets, German Credit (binary, mixed dtypes) and California Housing (regression, all-numeric), compares the three libraries on fit time, downstream-model score, and the score drop from ``train`` to ``test``:

.. toctree::
   :glob:
   :maxdepth: 1

   examples/Comparison/comparison_notebook

The numbers are illustrative (single run, single machine, fixed seed) and are **not** an IV / Tschuprow's T leaderboard, since those metrics structurally favour the library whose objective they are. Re-run on your own data before drawing conclusions.

Caveats
-------

* All three libraries are actively maintained; the tables on this page reflect the public APIs as of AutoCarver |release|. Open an issue if anything has drifted.