Benchmark: AutoCarver vs. optbinning vs. KBinsDiscretizer

This notebook runs the three binning libraries side-by-side on two public datasets:

  1. German Credit — binary classification, mixed numeric / categorical features, 1,000 rows.

  2. California Housing — regression, all-numeric features, 20,640 rows.

For each library and dataset, we report:

  • ``fit`` and ``transform`` wall-clock time (seconds)

  • Downstream-model score — AUC for binary, R² for regression — using a linear model (logistic regression / ridge) on the one-hot-encoded bin output

  • ``train`` → ``test`` score drop as a coarse proxy for drift sensitivity

All three libraries see the same train + dev data and are evaluated on the same held-out test set. AutoCarver uses the dev sample for its built-in robustness veto. optbinning and KBinsDiscretizer have no dev-set concept, so they treat the union of train + dev as one pooled training set, which is the comparison practitioners actually run.

This is not an IV / Tschuprow’s T leaderboard. Those metrics structurally favour whichever library optimises them directly. The downstream-model score is the metric a real scorecard team would use to pick a binner.
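
For readers unfamiliar with the two association metrics named above, here is a minimal sketch of how each can be computed from a bin-by-target contingency table. The function names `information_value` and `tschuprow_t` are illustrative, not part of any of the three libraries, and the counts are reconstructed (count × target_mean, rounded) from the `checking_status` training distribution printed in the verbose log further down.

```python
import numpy as np

# Rows = bins, columns = (bad, good) counts on the training sample.
counts = np.array([
    [32, 211],   # no checking
    [10, 26],    # >=200
    [60, 94],    # 0<=X<200
    [78, 89],    # <0
])


def information_value(table):
    """IV = sum over bins of (good% - bad%) * ln(good% / bad%)."""
    bad = table[:, 0] / table[:, 0].sum()
    good = table[:, 1] / table[:, 1].sum()
    return float(np.sum((good - bad) * np.log(good / bad)))


def tschuprow_t(table):
    """T = sqrt(chi2 / (n * sqrt((r - 1) * (c - 1))))."""
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1)))))


iv = information_value(counts)
t = tschuprow_t(counts)
print(f'IV={iv:.4f}, Tschuprow T={t:.4f}')
```

Both quantities are computed on the binned feature alone, which is why a binner that maximises one of them directly will tend to look good on it regardless of downstream performance.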

Numbers come from a single run on a single machine with a fixed seed; treat them as illustrative, not as authoritative benchmark figures. Re-run on your own data before drawing conclusions.

Setup

[13]:
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing, fetch_openml
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import r2_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

from AutoCarver import BinaryCarver, ContinuousCarver, Features
from AutoCarver.discretizers.utils.base_discretizer import DiscretizerConfig

try:
    from optbinning import ContinuousOptimalBinning, OptimalBinning

    HAS_OPTBINNING = True
except ImportError:
    HAS_OPTBINNING = False
    print('optbinning is not installed \u2014 its rows will be skipped.')

SEED = 42
warnings.filterwarnings('ignore')
plt.rcParams['figure.figsize'] = (10, 3.5)
[14]:
def one_hot(df, drop_first=True):
    """Treat every bin label as a categorical level and one-hot encode it.

    Lets a linear downstream model consume any of the three libraries' outputs
    uniformly, without us computing WoE per bin.
    """
    return pd.get_dummies(df.astype(str), drop_first=drop_first).astype(float)


def one_hot_like(X_test, train_columns):
    """One-hot encode the test frame and align it to the training columns.

    The test frame keeps every level (drop_first=False) so that the reference
    level is decided by the training frame alone. Levels unseen in training
    are dropped and training levels absent from test are filled with zeros.
    """
    return one_hot(X_test, drop_first=False).reindex(columns=train_columns, fill_value=0.0)


def fit_eval_binary(X_train, X_test, y_train, y_test):
    Xtr = one_hot(X_train)
    Xte = one_hot_like(X_test, Xtr.columns)
    model = LogisticRegression(max_iter=1000, random_state=SEED).fit(Xtr, y_train)
    return {
        'train_auc': roc_auc_score(y_train, model.predict_proba(Xtr)[:, 1]),
        'test_auc': roc_auc_score(y_test, model.predict_proba(Xte)[:, 1]),
    }


def fit_eval_regression(X_train, X_test, y_train, y_test):
    Xtr = one_hot(X_train)
    Xte = one_hot_like(X_test, Xtr.columns)
    model = Ridge(random_state=SEED).fit(Xtr, y_train)
    return {
        'train_r2': r2_score(y_train, model.predict(Xtr)),
        'test_r2': r2_score(y_test, model.predict(Xte)),
    }


def plot_bars(results_df, score_cols, title):
    fig, axes = plt.subplots(1, len(score_cols), figsize=(4 * len(score_cols), 3.5))
    if len(score_cols) == 1:
        axes = [axes]
    for ax, col in zip(axes, score_cols):
        results_df.plot.bar(x='library', y=col, ax=ax, legend=False, color='#4C72B0')
        ax.set_title(col)
        ax.set_xlabel('')
        ax.tick_params(axis='x', rotation=0)
    fig.suptitle(title)
    fig.tight_layout()
    plt.show()
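
The evaluation helpers rely on aligning the test frame's one-hot columns to the training columns with `reindex`. A minimal sketch of that step on toy bin labels (hypothetical data, not taken from either dataset): the test frame is encoded with all of its levels before reindexing, so that the reference level is decided by the training frame alone.

```python
import pandas as pd

# Toy bin labels: 'very_high' appears only in test, 'low' and 'high' only in train.
train_bins = pd.DataFrame({'age': ['low', 'mid', 'high', 'mid']})
test_bins = pd.DataFrame({'age': ['mid', 'very_high']})

# Training side: drop the first (alphabetical) level as the reference level.
Xtr = pd.get_dummies(train_bins.astype(str), drop_first=True).astype(float)

# Test side: keep every level, then align to the training columns.
Xte = pd.get_dummies(test_bins.astype(str)).astype(float)
Xte = Xte.reindex(columns=Xtr.columns, fill_value=0.0)

print(list(Xtr.columns))
print(Xte)
```

The unseen `very_high` row ends up as all zeros, which the linear model reads as the reference level. Applying `drop_first=True` to the test frame on its own would instead drop `age_mid` here, because `mid` is the first level present in test, and the `mid` row would silently be scored as the reference level as well.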
[15]:
from AutoCarver.combinations.binary import CramervCombinations

MAX_N_MOD = 5
MIN_FREQ = 0.05

def bin_with_autocarver(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, kind):
    Carver = BinaryCarver if kind == 'binary' else ContinuousCarver
    features = Features(categoricals=categoricals, quantitatives=quantitatives)
    config = DiscretizerConfig(verbose=True)  # showing statistics
    combination_evaluator = CramervCombinations() if kind == 'binary' else None
    carver = Carver(
        features=features,
        min_freq=MIN_FREQ,
        max_n_mod=MAX_N_MOD,
        config=config,
        combination_evaluator=combination_evaluator,
    )

    t0 = time.perf_counter()
    X_tr = carver.fit_transform(X_train.copy(), y_train, X_dev=X_dev.copy(), y_dev=y_dev)
    fit_t = time.perf_counter() - t0

    X_dv = carver.transform(X_dev.copy())
    t1 = time.perf_counter()
    X_te = carver.transform(X_test.copy())
    transform_t = time.perf_counter() - t1
    return pd.concat([X_tr, X_dv]), X_te, fit_t, transform_t, carver


def bin_with_optbinning(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, kind):
    Cls = OptimalBinning if kind == 'binary' else ContinuousOptimalBinning
    X_all = pd.concat([X_train, X_dev])
    y_all = pd.concat([y_train, y_dev])
    binners = {}
    train_binned = pd.DataFrame(index=X_all.index)
    test_binned = pd.DataFrame(index=X_test.index)

    t0 = time.perf_counter()
    for col in X_all.columns:
        dtype = 'categorical' if col in categoricals else 'numerical'
        binner = Cls(name=col, dtype=dtype, min_prebin_size=MIN_FREQ/2, max_n_bins=MAX_N_MOD)
        binner.fit(X_all[col].to_numpy(), y_all.to_numpy())
        binners[col] = binner
        train_binned[col] = binner.transform(X_all[col].to_numpy(), metric='bins')
    fit_t = time.perf_counter() - t0

    t1 = time.perf_counter()
    for col, b in binners.items():
        test_binned[col] = b.transform(X_test[col].to_numpy(), metric='bins')
    transform_t = time.perf_counter() - t1
    return train_binned, test_binned, fit_t, transform_t, binners


def bin_with_kbins(X_train, X_dev, X_test, categoricals, quantitatives, n_bins=5):
    X_all = pd.concat([X_train, X_dev])
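    # Impute with medians learned on the pooled training data only, so the
    # test frame is filled with the same values the binner was fitted on.
    medians = X_all[quantitatives].median()
    num_train = X_all[quantitatives].fillna(medians)
    num_test = X_test[quantitatives].fillna(medians)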
    kbd = KBinsDiscretizer(n_bins=n_bins, encode='ordinal', strategy='quantile')

    t0 = time.perf_counter()
    binned_num_train = pd.DataFrame(
        kbd.fit_transform(num_train), columns=quantitatives, index=X_all.index
    )
    fit_t = time.perf_counter() - t0

    t1 = time.perf_counter()
    binned_num_test = pd.DataFrame(
        kbd.transform(num_test), columns=quantitatives, index=X_test.index
    )
    transform_t = time.perf_counter() - t1

    # KBins has no opinion on categoricals — pass them through as labels
    train = pd.concat([binned_num_train, X_all[categoricals].astype(str)], axis=1)
    test = pd.concat([binned_num_test, X_test[categoricals].astype(str)], axis=1)
    return train, test, fit_t, transform_t, kbd
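
`bin_with_kbins` imputes missing values before quantile binning. A minimal sketch of that pattern on synthetic data, with the fill values taken from the training frame and reused at transform time. The frames and the column name `x` are made up for illustration, and neither benchmark dataset is assumed to contain missing values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
train = pd.DataFrame({'x': rng.normal(size=200)})
test = pd.DataFrame({'x': rng.normal(size=50)})
test.loc[test.index[:5], 'x'] = np.nan  # inject a few missing values

# Medians are learned on the training frame and reused for the test frame.
train_medians = train.median()

kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
train_bins = kbd.fit_transform(train.fillna(train_medians))
test_bins = kbd.transform(test.fillna(train_medians))

print(np.unique(train_bins))
print(kbd.bin_edges_[0])
```

With `encode='ordinal'` each value is replaced by its bin index, so the downstream one-hot step treats the indices as categorical labels rather than as numbers.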

Binary classification — German Credit

20 features (numeric + categorical), 1,000 rows, target = class == 'bad'. Train / dev / test split = 60 / 20 / 20 %.
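
The 60 / 20 / 20 split comes from two chained calls to `train_test_split`: 40 % is held out first, then that remainder is halved. A minimal sketch on a synthetic target with a 30 % positive rate (the arrays below are placeholders, not the German Credit data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 30 % positives.
y = np.array([1] * 300 + [0] * 700)
X = np.arange(1000).reshape(-1, 1)

# Step 1: hold out 40 % of the rows.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y,
)
# Step 2: split the held-out 40 % evenly into dev and test.
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest,
)

sizes = (len(X_train), len(X_dev), len(X_test))
print(sizes, y_train.mean(), y_dev.mean(), y_test.mean())
```

Stratifying both calls keeps the positive rate close to the overall rate in all three samples, which matters when the dev sample is used to check bin stability.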

[16]:
credit = fetch_openml(data_id=31, as_frame=True)
df = credit.frame.copy()

y_binary = (df['class'] == 'bad').astype(int)
X_binary = df.drop(columns=['class'])

X_train, X_rest, y_train, y_rest = train_test_split(
    X_binary, y_binary, test_size=0.4, random_state=SEED, stratify=y_binary,
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=SEED, stratify=y_rest,
)

categoricals = [c for c in X_binary.columns if X_binary[c].dtype == object or isinstance(X_binary[c].dtype, pd.CategoricalDtype)]
quantitatives = [c for c in X_binary.columns if c not in categoricals]

print(f'train={len(X_train)}, dev={len(X_dev)}, test={len(X_test)}')
print(f'categoricals={len(categoricals)}, quantitatives={len(quantitatives)}')
print(f'bad rate (train)={y_train.mean():.3f}, (test)={y_test.mean():.3f}')
train=600, dev=200, test=200
categoricals=13, quantitatives=7
bad rate (train)=0.300, (test)=0.300
[17]:
y_train_full = pd.concat([y_train, y_dev])

runs = [(
    'AutoCarver',
    lambda: bin_with_autocarver(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, 'binary'),
)]
if HAS_OPTBINNING:
    runs.append((
        'optbinning',
        lambda: bin_with_optbinning(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, 'binary'),
    ))
runs.append((
    'KBinsDiscretizer',
    lambda: bin_with_kbins(X_train, X_dev, X_test, categoricals, quantitatives),
))

rows = []
for name, run in runs:
    X_tr, X_te, fit_t, transform_t, carver = run()
    scores = fit_eval_binary(X_tr, X_te, y_train_full, y_test)
    rows.append({
        'library': name,
        'fit_s': round(fit_t, 3),
        'transform_s': round(transform_t, 4),
        'train_auc': round(scores['train_auc'], 4),
        'test_auc': round(scores['test_auc'], 4),
        'auc_drop': round(scores['train_auc'] - scores['test_auc'], 4),
    })

binary_results = pd.DataFrame(rows)
binary_results
------
--- [QuantitativeDiscretizer] Fit Features(['duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents'])
 - [ContinuousDiscretizer] Fit Features(['duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents'])
 - [OrdinalDiscretizer] Fit Features(['duration', 'installment_commitment', 'residence_since', 'existing_credits', 'num_dependents'])
------

------
--- [QualitativeDiscretizer] Fit Features(['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker'])
 - [CategoricalDiscretizer] Fit Features(['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker'])
------

---------
------ [BinaryCarver] Fit Features(['checking_status', 'credit_history', 'purpose', 'savings_status', 'employment', 'personal_status', 'other_parties', 'property_magnitude', 'other_payment_plans', 'housing', 'job', 'own_telephone', 'foreign_worker', 'duration', 'credit_amount', 'installment_commitment', 'residence_since', 'age', 'existing_credits', 'num_dependents'])
--- [BinaryCarver] Fit Categorical('checking_status') (1/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
no checking 0.1317 0.4050 243
>=200 0.2778 0.0600 36
0<=X<200 0.3896 0.2567 154
<0 0.4671 0.2783 167
X_dev distribution
target_mean frequency count
0.0694 0.3600 72
0.0833 0.0600 12
0.3710 0.3100 62
0.5741 0.2700 54
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
no checking 0.1317 0.4050 243
>=200 0.2778 0.0600 36
0<=X<200 0.3896 0.2567 154
<0 0.4671 0.2783 167
X_dev distribution
target_mean frequency count
0.0694 0.3600 72
0.0833 0.0600 12
0.3710 0.3100 62
0.5741 0.2700 54
--- [BinaryCarver] Fit Categorical('credit_history') (2/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
critical/other existing credit 0.1676 0.2883 173
existing paid 0.3185 0.5233 314
delayed previously 0.3621 0.0967 58
all paid 0.5455 0.0550 33
no credits/all paid 0.5455 0.0367 22
X_dev distribution
target_mean frequency count
0.2241 0.2900 58
0.2703 0.5550 111
0.3571 0.0700 14
0.7273 0.0550 11
0.6667 0.0300 6
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
critical/other existing credit 0.1676 0.2883 173
existing paid 0.3185 0.5233 314
delayed previously 0.3621 0.0967 58
all paid, no credits/all paid 0.5455 0.0917 55
X_dev distribution
target_mean frequency count
0.2241 0.2900 58
0.2703 0.5550 111
0.3571 0.0700 14
0.7059 0.0850 17
--- [BinaryCarver] Fit Categorical('purpose') (3/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
used car 0.1875 0.1067 64
other, retraining 0.2222 0.0150 9
radio/tv 0.2303 0.2750 165
domestic appliance 0.3000 0.0167 10
furniture/equipment 0.3333 0.1700 102
new car 0.3401 0.2450 147
business 0.3729 0.0983 59
repairs 0.3750 0.0267 16
education 0.4643 0.0467 28
X_dev distribution
target_mean frequency count
0.1250 0.0800 16
0.3000 0.0500 10
0.2295 0.3050 61
0.0000 0.0050 1
0.3235 0.1700 34
0.4222 0.2250 45
0.2778 0.0900 18
0.0000 0.0100 2
0.4615 0.0650 13
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
used car 0.1875 0.1067 64
radio/tv, other, retraining 0.2299 0.2900 174
furniture/equipment, domestic appliance 0.3304 0.1867 112
new car, business, repairs 0.3514 0.3700 222
education 0.4643 0.0467 28
X_dev distribution
target_mean frequency count
0.1250 0.0800 16
0.2394 0.3550 71
0.3143 0.1750 35
0.3692 0.3250 65
0.4615 0.0650 13
--- [BinaryCarver] Fit Categorical('savings_status') (4/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
>=1000 0.0667 0.0500 30
500<=X<1000 0.1622 0.0617 37
no known savings 0.1714 0.1750 105
100<=X<500 0.3333 0.1150 69
<100 0.3649 0.5983 359
X_dev distribution
target_mean frequency count
0.3333 0.0300 6
0.1250 0.0800 16
0.1667 0.1800 36
0.3889 0.0900 18
0.3468 0.6200 124
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
no known savings, >=1000, 500<=X<1000 0.1512 0.2867 172
<100, 100<=X<500 0.3598 0.7133 428
X_dev distribution
target_mean frequency count
0.1724 0.2900 58
0.3521 0.7100 142
--- [BinaryCarver] Fit Categorical('employment') (5/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
4<=X<7 0.1935 0.1550 93
>=7 0.2516 0.2650 159
1<=X<4 0.2911 0.3550 213
<1 0.4272 0.1717 103
unemployed 0.5000 0.0533 32
X_dev distribution
target_mean frequency count
0.2632 0.1900 38
0.2600 0.2500 50
0.3621 0.2900 58
0.3333 0.1800 36
0.2222 0.0900 18
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
>=7, 4<=X<7 0.2302 0.4200 252
unemployed, 1<=X<4, <1 0.3506 0.5800 348
X_dev distribution
target_mean frequency count
0.2614 0.4400 88
0.3304 0.5600 112
--- [BinaryCarver] Fit Categorical('personal_status') (6/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
male single 0.2679 0.5600 336
male mar/wid 0.2778 0.0900 54
female div/dep/mar 0.3559 0.2950 177
male div/sep 0.3636 0.0550 33
X_dev distribution
target_mean frequency count
0.2830 0.5300 106
0.2381 0.1050 21
0.3385 0.3250 65
0.3750 0.0400 8
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
male single, male mar/wid 0.2692 0.6500 390
female div/dep/mar 0.3559 0.2950 177
male div/sep 0.3636 0.0550 33
X_dev distribution
target_mean frequency count
0.2756 0.6350 127
0.3385 0.3250 65
0.3750 0.0400 8
--- [BinaryCarver] Fit Categorical('other_parties') (7/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
guarantor 0.1786 0.0467 28
none 0.2996 0.9067 544
co applicant 0.4286 0.0467 28
X_dev distribution
target_mean frequency count
0.2500 0.0400 8
0.2989 0.9200 184
0.3750 0.0400 8
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
guarantor 0.1786 0.0467 28
none 0.2996 0.9067 544
co applicant 0.4286 0.0467 28
X_dev distribution
target_mean frequency count
0.2500 0.0400 8
0.2989 0.9200 184
0.3750 0.0400 8
--- [BinaryCarver] Fit Categorical('property_magnitude') (8/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
real estate 0.2130 0.2817 169
life insurance 0.3125 0.2133 128
car 0.3143 0.3500 210
no known property 0.4086 0.1550 93
X_dev distribution
target_mean frequency count
0.2182 0.2750 55
0.2600 0.2500 50
0.3281 0.3200 64
0.4516 0.1550 31
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
real estate 0.2130 0.2817 169
life insurance 0.3125 0.2133 128
car 0.3143 0.3500 210
no known property 0.4086 0.1550 93
X_dev distribution
target_mean frequency count
0.2182 0.2750 55
0.2600 0.2500 50
0.3281 0.3200 64
0.4516 0.1550 31
--- [BinaryCarver] Fit Categorical('other_payment_plans') (9/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
none 0.2619 0.8083 485
stores 0.4375 0.0533 32
bank 0.4699 0.1383 83
X_dev distribution
target_mean frequency count
0.2866 0.8200 164
0.4444 0.0450 9
0.3333 0.1350 27
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
none 0.2619 0.8083 485
bank, stores 0.4609 0.1917 115
X_dev distribution
target_mean frequency count
0.2866 0.8200 164
0.3611 0.1800 36
--- [BinaryCarver] Fit Categorical('housing') (10/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
own 0.2558 0.7233 434
for free 0.3750 0.1067 64
rent 0.4412 0.1700 102
X_dev distribution
target_mean frequency count
0.2857 0.7350 147
0.4348 0.1150 23
0.2667 0.1500 30
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
own 0.2558 0.7233 434
for free, rent 0.4157 0.2767 166
X_dev distribution
target_mean frequency count
0.2857 0.7350 147
0.3396 0.2650 53
--- [BinaryCarver] Fit Categorical('job') (11/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
skilled 0.2898 0.6383 383
unskilled resident 0.2966 0.1967 118
high qualif/self emp/mgmt 0.3258 0.1483 89
unemp/unskilled non res 0.5000 0.0167 10
X_dev distribution
target_mean frequency count
0.2541 0.6100 122
0.3171 0.2050 41
0.4839 0.1550 31
0.1667 0.0300 6
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
skilled 0.2898 0.6383 383
unskilled resident 0.2966 0.1967 118
high qualif/self emp/mgmt, unemp/unskilled non res 0.3434 0.1650 99
X_dev distribution
target_mean frequency count
0.2541 0.6100 122
0.3171 0.2050 41
0.4324 0.1850 37
--- [BinaryCarver] Fit Categorical('own_telephone') (12/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
yes 0.2645 0.4033 242
none 0.3240 0.5967 358
X_dev distribution
target_mean frequency count
0.3125 0.4000 80
0.2917 0.6000 120
WARNING: No robust combination for Categorical('own_telephone'). Consider increasing the size of X_dev or dropping the feature (X not representative of X_dev for this feature).
--- [BinaryCarver] Fit Categorical('foreign_worker') (13/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
no 0.0435 0.0383 23
yes 0.3102 0.9617 577
X_dev distribution
target_mean frequency count
0.3333 0.0300 6
0.2990 0.9700 194
WARNING: No robust combination for Categorical('foreign_worker'). Consider increasing the size of X_dev or dropping the feature (X not representative of X_dev for this feature).
--- [BinaryCarver] Fit Quantitative('duration') (14/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 8.00e+00 0.0980 0.0850 51
8.00e+00 < x <= 9.00e+00 0.2333 0.0500 30
9.00e+00 < x <= 1.10e+01 0.0870 0.0383 23
1.10e+01 < x <= 1.20e+01 0.2883 0.1850 111
1.20e+01 < x <= 1.50e+01 0.2273 0.0733 44
1.50e+01 < x <= 1.80e+01 0.3692 0.1083 65
1.80e+01 < x <= 2.20e+01 0.2381 0.0350 21
2.20e+01 < x <= 2.40e+01 0.3333 0.1950 117
2.40e+01 < x <= 2.80e+01 0.2222 0.0150 9
2.80e+01 < x <= 3.30e+01 0.3846 0.0433 26
3.30e+01 < x <= 3.60e+01 0.4727 0.0917 55
3.60e+01 < x <= 4.70e+01 0.2667 0.0250 15
4.70e+01 < x 0.4242 0.0550 33
X_dev distribution
target_mean frequency count
0.1000 0.1000 20
0.3077 0.0650 13
0.0000 0.0400 8
0.2432 0.1850 37
0.0714 0.0700 14
0.3043 0.1150 23
0.4444 0.0450 9
0.3548 0.1550 31
0.7500 0.0200 4
0.4286 0.0350 7
0.3529 0.0850 17
0.6667 0.0150 3
0.5714 0.0700 14
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 8.00e+00 0.0980 0.0850 51
8.00e+00 < x <= 1.10e+01 0.1698 0.0883 53
1.10e+01 < x <= 1.50e+01 0.2710 0.2583 155
1.50e+01 < x <= 2.80e+01 0.3302 0.3533 212
2.80e+01 < x 0.4186 0.2150 129
X_dev distribution
target_mean frequency count
0.1000 0.1000 20
0.1905 0.1050 21
0.1961 0.2550 51
0.3731 0.3350 67
0.4634 0.2050 41
--- [BinaryCarver] Fit Quantitative('credit_amount') (15/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 6.18e+02 0.2000 0.0250 15
6.18e+02 < x <= 7.08e+02 0.4000 0.0250 15
7.08e+02 < x <= 7.97e+02 0.3333 0.0250 15
7.97e+02 < x <= 9.09e+02 0.4000 0.0250 15
9.09e+02 < x <= 1.03e+03 0.4000 0.0250 15
1.03e+03 < x <= 1.16e+03 0.2000 0.0250 15
1.16e+03 < x <= 1.21e+03 0.2667 0.0250 15
1.21e+03 < x <= 1.26e+03 0.2000 0.0250 15
1.26e+03 < x <= 1.31e+03 0.3333 0.0250 15
1.31e+03 < x <= 1.37e+03 0.4667 0.0250 15
1.37e+03 < x <= 1.41e+03 0.1250 0.0267 16
1.41e+03 < x <= 1.47e+03 0.1429 0.0233 14
1.47e+03 < x <= 1.53e+03 0.2667 0.0250 15
1.53e+03 < x <= 1.60e+03 0.2000 0.0250 15
1.60e+03 < x <= 1.82e+03 0.2000 0.0250 15
1.82e+03 < x <= 1.92e+03 0.5000 0.0267 16
1.92e+03 < x <= 1.98e+03 0.2857 0.0233 14
1.98e+03 < x <= 2.12e+03 0.3333 0.0250 15
2.12e+03 < x <= 2.21e+03 0.2667 0.0250 15
2.21e+03 < x <= 2.30e+03 0.2667 0.0250 15
2.30e+03 < x <= 2.38e+03 0.2000 0.0250 15
2.38e+03 < x <= 2.48e+03 0.4000 0.0250 15
2.48e+03 < x <= 2.62e+03 0.2667 0.0250 15
2.62e+03 < x <= 2.75e+03 0.3333 0.0250 15
2.75e+03 < x <= 2.92e+03 0.2000 0.0250 15
2.92e+03 < x <= 3.07e+03 0.2000 0.0250 15
3.07e+03 < x <= 3.35e+03 0.4000 0.0250 15
3.35e+03 < x <= 3.51e+03 0.1333 0.0250 15
3.51e+03 < x <= 3.63e+03 0.1333 0.0250 15
3.63e+03 < x <= 3.91e+03 0.0667 0.0250 15
3.91e+03 < x <= 4.24e+03 0.4667 0.0250 15
4.24e+03 < x <= 4.66e+03 0.4000 0.0250 15
4.66e+03 < x <= 5.08e+03 0.4667 0.0250 15
5.08e+03 < x <= 5.80e+03 0.2000 0.0250 15
5.80e+03 < x <= 6.36e+03 0.2667 0.0250 15
6.36e+03 < x <= 6.85e+03 0.4667 0.0250 15
6.85e+03 < x <= 7.48e+03 0.2000 0.0250 15
7.48e+03 < x <= 8.23e+03 0.4667 0.0250 15
8.23e+03 < x <= 9.57e+03 0.4000 0.0250 15
9.57e+03 < x 0.5333 0.0250 15
X_dev distribution
target_mean frequency count
0.2000 0.0250 5
0.5000 0.0200 4
0.5000 0.0300 6
0.0000 0.0100 2
0.3333 0.0300 6
0.1429 0.0350 7
0.5000 0.0100 2
0.3333 0.0600 12
0.0000 0.0100 2
0.2857 0.0350 7
0.0000 0.0150 3
0.3333 0.0300 6
0.2500 0.0200 4
0.0000 0.0150 3
0.3333 0.0300 6
0.2857 0.0350 7
0.2500 0.0200 4
0.0000 0.0400 8
0.5000 0.0100 2
0.5000 0.0100 2
0.0000 0.0150 3
0.0000 0.0050 1
0.6667 0.0150 3
0.0000 0.0200 4
0.0000 0.0200 4
0.3333 0.0150 3
0.2000 0.0500 10
0.5000 0.0400 8
0.0000 0.0300 6
0.1000 0.0500 10
0.2500 0.0200 4
0.8000 0.0250 5
0.3333 0.0150 3
0.4000 0.0250 5
0.2857 0.0350 7
0.0000 0.0200 4
0.6667 0.0150 3
0.6667 0.0150 3
0.6667 0.0150 3
0.6154 0.0650 13
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 1.03e+03 0.3467 0.1250 75
1.03e+03 < x <= 3.35e+03 0.2758 0.5500 330
3.35e+03 < x <= 3.91e+03 0.1111 0.0750 45
3.91e+03 < x <= 7.48e+03 0.3524 0.1750 105
7.48e+03 < x 0.4667 0.0750 45
X_dev distribution
target_mean frequency count
0.3478 0.1150 23
0.2233 0.5150 103
0.2083 0.1200 24
0.3871 0.1550 31
0.6316 0.0950 19
--- [BinaryCarver] Fit Quantitative('installment_commitment') (16/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.00e+00 0.2436 0.1300 78
1.00e+00 < x <= 2.00e+00 0.2606 0.2367 142
2.00e+00 < x <= 3.00e+00 0.2979 0.1567 94
3.00e+00 < x 0.3357 0.4767 286
X_dev distribution
target_mean frequency count
0.1071 0.1400 28
0.2667 0.2250 45
0.2414 0.1450 29
0.3878 0.4900 98
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.00e+00 0.2545 0.3667 220
2.00e+00 < x <= 3.00e+00 0.2979 0.1567 94
3.00e+00 < x 0.3357 0.4767 286
X_dev distribution
target_mean frequency count
0.2055 0.3650 73
0.2414 0.1450 29
0.3878 0.4900 98
--- [BinaryCarver] Fit Quantitative('residence_since') (17/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.00e+00 0.3117 0.1283 77
1.00e+00 < x <= 2.00e+00 0.2905 0.2983 179
2.00e+00 < x <= 3.00e+00 0.3000 0.1667 100
3.00e+00 < x 0.3033 0.4067 244
X_dev distribution
target_mean frequency count
0.2174 0.1150 23
0.3529 0.3400 68
0.3333 0.1500 30
0.2658 0.3950 79
WARNING: No robust combination for Quantitative('residence_since'). Consider increasing the size of X_dev or dropping the feature (X not representative of X_dev for this feature).
--- [BinaryCarver] Fit Quantitative('age') (18/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 2.10e+01 0.4000 0.0250 15
2.10e+01 < x <= 2.20e+01 0.3684 0.0317 19
2.20e+01 < x <= 2.30e+01 0.4500 0.0333 20
2.30e+01 < x <= 2.40e+01 0.3333 0.0350 21
2.40e+01 < x <= 2.50e+01 0.5161 0.0517 31
2.50e+01 < x <= 2.60e+01 0.2500 0.0467 28
2.60e+01 < x <= 2.70e+01 0.2258 0.0517 31
2.70e+01 < x <= 2.80e+01 0.4091 0.0367 22
2.80e+01 < x <= 2.90e+01 0.3913 0.0383 23
2.90e+01 < x <= 3.00e+01 0.2143 0.0467 28
3.00e+01 < x <= 3.10e+01 0.2308 0.0433 26
3.10e+01 < x <= 3.20e+01 0.2500 0.0333 20
3.20e+01 < x <= 3.30e+01 0.3636 0.0367 22
3.30e+01 < x <= 3.40e+01 0.3636 0.0367 22
3.40e+01 < x <= 3.50e+01 0.1724 0.0483 29
3.50e+01 < x <= 3.60e+01 0.2083 0.0400 24
3.60e+01 < x <= 3.70e+01 0.3333 0.0250 15
3.70e+01 < x <= 3.80e+01 0.1875 0.0267 16
3.80e+01 < x <= 3.90e+01 0.2941 0.0283 17
3.90e+01 < x <= 4.10e+01 0.3182 0.0367 22
4.10e+01 < x <= 4.20e+01 0.2727 0.0183 11
4.20e+01 < x <= 4.40e+01 0.1905 0.0350 21
4.40e+01 < x <= 4.60e+01 0.2632 0.0317 19
4.60e+01 < x <= 4.70e+01 0.4000 0.0167 10
4.70e+01 < x <= 4.90e+01 0.1429 0.0233 14
4.90e+01 < x <= 5.10e+01 0.1429 0.0233 14
5.10e+01 < x <= 5.40e+01 0.2941 0.0283 17
5.40e+01 < x <= 5.70e+01 0.3333 0.0200 12
5.70e+01 < x <= 6.30e+01 0.4375 0.0267 16
6.30e+01 < x 0.2667 0.0250 15
X_dev distribution
target_mean frequency count
0.3333 0.0300 6
0.5000 0.0200 4
0.3333 0.0750 15
0.6364 0.0550 11
0.3333 0.0150 3
0.3333 0.0600 12
0.1538 0.0650 13
0.1429 0.0350 7
0.4000 0.0250 5
0.5000 0.0500 10
0.3333 0.0300 6
0.2000 0.0250 5
0.3750 0.0400 8
0.3333 0.0150 3
0.2500 0.0200 4
0.1429 0.0350 7
0.2500 0.0400 8
0.2500 0.0200 4
0.0000 0.0050 1
0.2308 0.0650 13
0.6000 0.0250 5
0.3333 0.0300 6
0.1250 0.0400 8
0.0000 0.0200 4
0.2000 0.0250 5
0.5000 0.0100 2
0.6000 0.0250 5
0.2500 0.0200 4
0.2500 0.0400 8
0.0000 0.0400 8
 [BinaryCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.50e+01 0.4245 0.1767 106
2.50e+01 < x <= 3.20e+01 0.2753 0.2967 178
3.20e+01 < x <= 3.40e+01 0.3636 0.0733 44
3.40e+01 < x <= 3.60e+01 0.1887 0.0883 53
3.60e+01 < x 0.2740 0.3650 219
X_dev distribution
target_mean frequency count
0.4359 0.1950 39
0.2931 0.2900 58
0.3636 0.0550 11
0.1818 0.0550 11
0.2469 0.4050 81
--- [BinaryCarver] Fit Quantitative('existing_credits') (19/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.00e+00 0.3061 0.6317 379
1.00e+00 < x <= 2.00e+00 0.2899 0.3450 207
2.00e+00 < x 0.2857 0.0233 14
X_dev distribution
target_mean frequency count
0.3000 0.6500 130
0.3016 0.3150 63
0.2857 0.0350 7
WARNING: No robust combination for Quantitative('existing_credits'). Consider increasing the size of X_dev or dropping the feature (X not representative of X_dev for this feature).
--- [BinaryCarver] Fit Quantitative('num_dependents') (20/20)
 [BinaryCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.0e+00 0.2984 0.8433 506
1.0e+00 < x 0.3085 0.1567 94
X_dev distribution
target_mean frequency count
0.3000 0.8500 170
0.3000 0.1500 30
WARNING: No robust combination for Quantitative('num_dependents'). Consider increasing the size of X_dev or dropping the feature (X not representative of X_dev for this feature).
[17]:
            library  fit_s  transform_s  train_auc  test_auc  auc_drop
0        AutoCarver  1.948       0.0115     0.8474    0.8118    0.0356
1        optbinning  1.025       0.0141     0.8523    0.7931    0.0592
2  KBinsDiscretizer  0.002       0.0009     0.8401    0.7943    0.0458
[18]:
plot_bars(binary_results, ['fit_s', 'test_auc', 'auc_drop'], 'German Credit \u2014 binary classification')
[Figure: bar charts of fit_s, test_auc and auc_drop per library for the German Credit benchmark]

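Here, AutoCarver has dropped the 5 features that were not stable on the dev set (own_telephone, foreign_worker, residence_since, existing_credits and num_dependents), each flagged with a "No robust combination" warning in the log above.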
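
To make the idea of dev-set stability concrete, here is a minimal sketch of one plausible check: the bins, ordered by target rate on train, must keep the same strict ordering on dev. This is an assumption for illustration only, not AutoCarver's documented criterion, and `same_rank_order` is a hypothetical helper. The target rates are copied from the verbose log above.

```python
import numpy as np


def same_rank_order(train_means, dev_means):
    """True when bins sorted by train target rate are strictly increasing on dev."""
    order = np.argsort(train_means)
    dev_sorted = np.asarray(dev_means)[order]
    return bool(np.all(np.diff(dev_sorted) > 0))


# own_telephone (yes, none): the ordering flips between train and dev.
own_telephone_ok = same_rank_order([0.2645, 0.3240], [0.3125, 0.2917])

# housing after carving (own; for free + rent): the ordering is preserved.
housing_ok = same_rank_order([0.2558, 0.4157], [0.2857, 0.3396])

print(own_telephone_ok, housing_ok)
```

Under this simplified rule a two-level feature such as `own_telephone` has no alternative grouping to fall back on once its ordering flips, which is consistent with the warning printed for it.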

Regression — California Housing

8 numeric features, including Latitude and Longitude (see the printed feature list below), 20,640 rows, target = median house value. Same 60 / 20 / 20 split.

[19]:
housing = fetch_california_housing(as_frame=True)
X_reg = housing.frame.drop(columns=['MedHouseVal'])
y_reg = housing.frame['MedHouseVal']

X_train, X_rest, y_train, y_rest = train_test_split(X_reg, y_reg, test_size=0.4, random_state=SEED)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=SEED)

quantitatives = list(X_reg.columns)
categoricals = []

print(f'train={len(X_train)}, dev={len(X_dev)}, test={len(X_test)}')
print(f'quantitatives={len(quantitatives)} ({quantitatives})')
train=12384, dev=4128, test=4128
quantitatives=8 (['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'])
[20]:
y_train_full = pd.concat([y_train, y_dev])

runs = [(
    'AutoCarver',
    lambda: bin_with_autocarver(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, 'continuous'),
)]
if HAS_OPTBINNING:
    runs.append((
        'optbinning',
        lambda: bin_with_optbinning(X_train, y_train, X_dev, y_dev, X_test, categoricals, quantitatives, 'continuous'),
    ))
runs.append((
    'KBinsDiscretizer',
    lambda: bin_with_kbins(X_train, X_dev, X_test, categoricals, quantitatives),
))

rows = []
for name, run in runs:
    X_tr, X_te, fit_t, transform_t, carver = run()
    scores = fit_eval_regression(X_tr, X_te, y_train_full, y_test)
    rows.append({
        'library': name,
        'fit_s': round(fit_t, 3),
        'transform_s': round(transform_t, 4),
        'train_r2': round(scores['train_r2'], 4),
        'test_r2': round(scores['test_r2'], 4),
        'r2_drop': round(scores['train_r2'] - scores['test_r2'], 4),
    })

regression_results = pd.DataFrame(rows)
regression_results
------
--- [QuantitativeDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'])
 - [ContinuousDiscretizer] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'])
 - [OrdinalDiscretizer] Fit Features(['HouseAge', 'Latitude', 'Longitude'])
------

---------
------ [ContinuousCarver] Fit Features(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude'])
--- [ContinuousCarver] Fit Quantitative('MedInc') (1/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.335e+00 1.1984 0.0250 310
1.335e+00 < x <= 1.593e+00 1.0105 0.0250 310
1.593e+00 < x <= 1.740e+00 1.1133 0.0250 309
1.740e+00 < x <= 1.906e+00 1.1535 0.0252 312
1.906e+00 < x <= 2.029e+00 1.2090 0.0248 307
2.029e+00 < x <= 2.152e+00 1.2141 0.0251 311
2.152e+00 < x <= 2.243e+00 1.2417 0.0250 310
2.243e+00 < x <= 2.350e+00 1.3827 0.0249 308
2.350e+00 < x <= 2.468e+00 1.3614 0.0250 310
2.468e+00 < x <= 2.569e+00 1.4190 0.0250 309
2.569e+00 < x <= 2.655e+00 1.5264 0.0250 310
2.655e+00 < x <= 2.737e+00 1.5428 0.0250 309
2.737e+00 < x <= 2.862e+00 1.5708 0.0250 310
2.862e+00 < x <= 2.974e+00 1.6630 0.0250 310
2.974e+00 < x <= 3.054e+00 1.6270 0.0250 309
3.054e+00 < x <= 3.135e+00 1.7079 0.0250 310
3.135e+00 < x <= 3.216e+00 1.8554 0.0250 309
3.216e+00 < x <= 3.315e+00 1.8373 0.0250 310
3.315e+00 < x <= 3.423e+00 1.9121 0.0250 309
3.423e+00 < x <= 3.531e+00 1.9162 0.0251 311
3.531e+00 < x <= 3.633e+00 1.9678 0.0250 309
3.633e+00 < x <= 3.723e+00 2.0226 0.0250 309
3.723e+00 < x <= 3.839e+00 1.9891 0.0251 311
3.839e+00 < x <= 3.971e+00 2.0493 0.0249 308
3.971e+00 < x <= 4.073e+00 2.0538 0.0252 312
4.073e+00 < x <= 4.179e+00 2.2004 0.0249 308
4.179e+00 < x <= 4.315e+00 2.2417 0.0250 309
4.315e+00 < x <= 4.464e+00 2.2394 0.0250 310
4.464e+00 < x <= 4.611e+00 2.2577 0.0252 312
4.611e+00 < x <= 4.757e+00 2.4351 0.0248 307
4.757e+00 < x <= 4.946e+00 2.3482 0.0250 309
4.946e+00 < x <= 5.117e+00 2.4592 0.0250 310
5.117e+00 < x <= 5.308e+00 2.5784 0.0250 309
5.308e+00 < x <= 5.538e+00 2.6892 0.0250 310
5.538e+00 < x <= 5.828e+00 2.7867 0.0251 311
5.828e+00 < x <= 6.148e+00 3.0943 0.0249 308
6.148e+00 < x <= 6.599e+00 3.3031 0.0250 310
6.599e+00 < x <= 7.313e+00 3.6064 0.0250 309
7.313e+00 < x <= 8.433e+00 4.0191 0.0250 310
8.433e+00 < x 4.7343 0.0250 310
X_dev distribution
target_mean frequency count
1.2507 0.0247 102
1.0319 0.0262 108
1.1587 0.0257 106
1.0855 0.0252 104
1.2523 0.0225 93
1.2606 0.0293 121
1.2643 0.0208 86
1.3335 0.0274 113
1.4528 0.0257 106
1.4887 0.0305 126
1.5142 0.0237 98
1.6485 0.0208 86
1.5544 0.0293 121
1.6189 0.0257 106
1.7433 0.0233 96
1.6369 0.0213 88
1.7802 0.0276 114
1.9721 0.0283 117
1.8287 0.0279 115
1.8295 0.0242 100
1.9907 0.0300 124
1.9517 0.0216 89
2.0220 0.0269 111
2.1509 0.0269 111
2.0977 0.0291 120
2.2054 0.0225 93
2.2979 0.0274 113
2.3553 0.0274 113
2.2924 0.0184 76
2.4401 0.0213 88
2.2931 0.0250 103
2.4940 0.0237 98
2.6133 0.0250 103
2.7177 0.0189 78
2.9110 0.0276 114
3.0729 0.0213 88
3.0759 0.0271 112
3.5985 0.0228 94
4.0385 0.0206 85
4.6131 0.0264 109
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.47e+00 1.2093 0.2250 2787
2.47e+00 < x <= 3.13e+00 1.5796 0.1750 2167
3.13e+00 < x <= 4.07e+00 1.9560 0.2251 2788
4.07e+00 < x <= 5.83e+00 2.4238 0.2499 3095
5.83e+00 < x 3.7524 0.1249 1547
X_dev distribution
target_mean frequency count
1.2323 0.2275 939
1.5934 0.1747 721
1.9604 0.2425 1001
2.4652 0.2372 979
3.6870 0.1182 488
--- [ContinuousCarver] Fit Quantitative('HouseAge') (2/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 5.00e+00 2.2358 0.0271 336
5.00e+00 < x <= 8.00e+00 1.9727 0.0263 326
8.00e+00 < x <= 1.10e+01 1.8133 0.0352 436
1.10e+01 < x <= 1.40e+01 1.8538 0.0468 579
1.40e+01 < x <= 1.60e+01 1.9355 0.0652 807
1.60e+01 < x <= 1.70e+01 1.8929 0.0319 395
1.70e+01 < x <= 1.80e+01 1.9455 0.0276 342
1.80e+01 < x <= 2.00e+01 1.9470 0.0470 582
2.00e+01 < x <= 2.30e+01 1.9934 0.0632 783
2.30e+01 < x <= 2.50e+01 2.1713 0.0480 595
2.50e+01 < x <= 2.60e+01 2.0937 0.0304 377
2.60e+01 < x <= 2.70e+01 2.0568 0.0245 303
2.70e+01 < x <= 2.80e+01 1.9827 0.0241 299
2.80e+01 < x <= 2.90e+01 2.0203 0.0232 287
2.90e+01 < x <= 3.00e+01 2.0515 0.0236 292
3.00e+01 < x <= 3.20e+01 2.0453 0.0484 599
3.20e+01 < x <= 3.30e+01 2.0343 0.0316 391
3.30e+01 < x <= 3.40e+01 2.1357 0.0320 396
3.40e+01 < x <= 3.50e+01 2.0004 0.0399 494
3.50e+01 < x <= 3.60e+01 2.1148 0.0437 541
3.60e+01 < x <= 3.70e+01 2.0004 0.0257 318
3.70e+01 < x <= 3.90e+01 2.0133 0.0355 440
3.90e+01 < x <= 4.20e+01 2.0148 0.0440 545
4.20e+01 < x <= 4.40e+01 2.0742 0.0351 435
4.40e+01 < x <= 4.70e+01 2.0852 0.0343 425
4.70e+01 < x 2.5848 0.0857 1061
X_dev distribution
target_mean frequency count
2.0720 0.0245 101
1.9201 0.0269 111
1.9054 0.0344 142
1.8581 0.0412 170
1.8826 0.0606 250
1.8592 0.0375 155
1.8799 0.0283 117
1.8746 0.0436 180
2.1128 0.0577 238
2.0847 0.0579 239
2.0778 0.0296 122
2.1784 0.0216 89
2.2242 0.0208 86
1.7802 0.0213 88
1.7629 0.0233 96
2.0493 0.0504 208
1.9343 0.0259 107
2.0837 0.0349 144
2.1957 0.0417 172
2.0157 0.0431 178
2.2006 0.0296 122
2.0026 0.0351 145
1.9358 0.0499 206
2.0117 0.0312 129
2.0839 0.0380 157
2.5968 0.0911 376
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.30e+01 1.9466 0.3703 4586
2.30e+01 < x <= 2.60e+01 2.1412 0.0785 972
2.60e+01 < x <= 3.60e+01 2.0526 0.2909 3602
3.60e+01 < x <= 4.70e+01 2.0381 0.1747 2163
4.70e+01 < x 2.5848 0.0857 1061
X_dev distribution
target_mean frequency count
1.9316 0.3547 1464
2.0824 0.0875 361
2.0383 0.2829 1168
2.0347 0.1839 759
2.5968 0.0911 376
--- [ContinuousCarver] Fit Quantitative('AveRooms') (3/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 3.066e+00 1.9506 0.0250 310
3.066e+00 < x <= 3.432e+00 1.8880 0.0250 310
3.432e+00 < x <= 3.647e+00 1.8233 0.0250 309
3.647e+00 < x <= 3.792e+00 1.8292 0.0250 310
3.792e+00 < x <= 3.933e+00 1.7847 0.0250 309
3.933e+00 < x <= 4.052e+00 1.8499 0.0250 310
4.052e+00 < x <= 4.168e+00 1.8718 0.0250 310
4.168e+00 < x <= 4.276e+00 1.8333 0.0250 309
4.276e+00 < x <= 4.365e+00 1.7965 0.0250 310
4.365e+00 < x <= 4.454e+00 1.6952 0.0250 309
4.454e+00 < x <= 4.536e+00 1.7535 0.0250 310
4.536e+00 < x <= 4.621e+00 1.7952 0.0250 309
4.621e+00 < x <= 4.705e+00 1.8465 0.0250 310
4.705e+00 < x <= 4.794e+00 1.7486 0.0250 310
4.794e+00 < x <= 4.874e+00 1.7719 0.0250 309
4.874e+00 < x <= 4.941e+00 1.7219 0.0251 311
4.941e+00 < x <= 5.014e+00 1.7176 0.0249 308
5.014e+00 < x <= 5.088e+00 1.7707 0.0250 310
5.088e+00 < x <= 5.160e+00 1.7918 0.0250 309
5.160e+00 < x <= 5.233e+00 1.7791 0.0250 310
5.233e+00 < x <= 5.315e+00 1.8209 0.0250 310
5.315e+00 < x <= 5.384e+00 1.9107 0.0250 309
5.384e+00 < x <= 5.460e+00 1.7728 0.0250 310
5.460e+00 < x <= 5.532e+00 1.8996 0.0250 309
5.532e+00 < x <= 5.616e+00 1.8872 0.0250 310
5.616e+00 < x <= 5.694e+00 1.9905 0.0250 309
5.694e+00 < x <= 5.778e+00 2.0029 0.0250 310
5.778e+00 < x <= 5.858e+00 2.0107 0.0250 310
5.858e+00 < x <= 5.959e+00 2.1137 0.0250 309
5.959e+00 < x <= 6.059e+00 2.0469 0.0250 310
6.059e+00 < x <= 6.157e+00 2.1450 0.0250 309
6.157e+00 < x <= 6.270e+00 2.2477 0.0250 310
6.270e+00 < x <= 6.396e+00 2.3495 0.0250 309
6.396e+00 < x <= 6.543e+00 2.4232 0.0250 310
6.543e+00 < x <= 6.717e+00 2.6241 0.0250 310
6.717e+00 < x <= 6.946e+00 2.7573 0.0250 309
6.946e+00 < x <= 7.233e+00 3.0763 0.0250 310
7.233e+00 < x <= 7.637e+00 3.1118 0.0250 309
7.637e+00 < x <= 8.324e+00 3.5846 0.0250 310
8.324e+00 < x 2.7391 0.0250 310
X_dev distribution
target_mean frequency count
2.0908 0.0233 96
1.8579 0.0264 109
2.0031 0.0242 100
1.8060 0.0274 113
1.8137 0.0240 99
1.7725 0.0211 87
1.7723 0.0283 117
1.7839 0.0247 102
1.7902 0.0286 118
1.8121 0.0264 109
1.6265 0.0264 109
1.8349 0.0276 114
1.8339 0.0247 102
1.7725 0.0342 141
1.8188 0.0254 105
1.8480 0.0191 79
1.8333 0.0235 97
1.8191 0.0266 110
1.7419 0.0266 110
1.7642 0.0220 91
1.7645 0.0303 125
1.7917 0.0266 110
1.8651 0.0262 108
1.8645 0.0274 113
1.8082 0.0286 118
1.8483 0.0177 73
2.0778 0.0240 99
2.0005 0.0187 77
1.9724 0.0291 120
2.2623 0.0235 97
2.0818 0.0230 95
2.2889 0.0250 103
2.3280 0.0213 88
2.5373 0.0254 105
2.6787 0.0201 83
2.7457 0.0211 87
3.0108 0.0303 125
3.1596 0.0233 96
3.4340 0.0235 97
2.7568 0.0245 101
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 3.43e+00 1.9193 0.0501 620
3.43e+00 < x <= 5.62e+00 1.8031 0.5749 7120
5.62e+00 < x <= 6.16e+00 2.0516 0.1500 1857
6.16e+00 < x <= 6.54e+00 2.3401 0.0750 929
6.54e+00 < x 2.9823 0.1500 1858
X_dev distribution
target_mean frequency count
1.9670 0.0497 205
1.8045 0.6000 2477
2.0474 0.1359 561
2.3886 0.0717 296
2.9752 0.1427 589
--- [ContinuousCarver] Fit Quantitative('AveBedrms') (4/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 9.1220e-01 2.0511 0.0250 310
9.1220e-01 < x <= 9.4022e-01 2.1264 0.0250 310
9.4022e-01 < x <= 9.5595e-01 2.0638 0.0250 309
9.5595e-01 < x <= 9.6743e-01 2.0756 0.0251 311
9.6743e-01 < x <= 9.7590e-01 2.2562 0.0249 308
9.7590e-01 < x <= 9.8343e-01 2.1709 0.0250 310
9.8343e-01 < x <= 9.8987e-01 2.1450 0.0250 310
9.8987e-01 < x <= 9.9592e-01 2.1772 0.0250 309
9.9592e-01 < x <= 1.0019e+00 2.1915 0.0251 311
1.0019e+00 < x <= 1.0068e+00 2.0949 0.0249 308
1.0068e+00 < x <= 1.0112e+00 2.2440 0.0250 310
1.0112e+00 < x <= 1.0156e+00 2.1687 0.0250 310
1.0156e+00 < x <= 1.0204e+00 2.1723 0.0250 309
1.0204e+00 < x <= 1.0250e+00 2.2003 0.0254 314
1.0250e+00 < x <= 1.0290e+00 2.1324 0.0246 305
1.0290e+00 < x <= 1.0331e+00 2.1840 0.0250 310
1.0331e+00 < x <= 1.0369e+00 2.0321 0.0250 309
1.0369e+00 < x <= 1.0412e+00 2.1746 0.0250 310
1.0412e+00 < x <= 1.0453e+00 2.2536 0.0250 309
1.0453e+00 < x <= 1.0493e+00 2.1546 0.0250 310
1.0493e+00 < x <= 1.0534e+00 2.0738 0.0251 311
1.0534e+00 < x <= 1.0574e+00 2.1224 0.0249 308
1.0574e+00 < x <= 1.0615e+00 2.0414 0.0250 310
1.0615e+00 < x <= 1.0662e+00 2.1569 0.0251 311
1.0662e+00 < x <= 1.0712e+00 2.0972 0.0250 309
1.0712e+00 < x <= 1.0763e+00 2.0714 0.0249 308
1.0763e+00 < x <= 1.0816e+00 2.0244 0.0250 310
1.0816e+00 < x <= 1.0874e+00 2.0135 0.0252 312
1.0874e+00 < x <= 1.0933e+00 2.2239 0.0249 308
1.0933e+00 < x <= 1.1000e+00 2.0244 0.0262 324
1.1000e+00 < x <= 1.1071e+00 2.0077 0.0242 300
1.1071e+00 < x <= 1.1160e+00 1.9564 0.0245 304
1.1160e+00 < x <= 1.1267e+00 2.0077 0.0250 310
1.1267e+00 < x <= 1.1387e+00 1.9305 0.0250 309
1.1387e+00 < x <= 1.1538e+00 1.8130 0.0258 319
1.1538e+00 < x <= 1.1739e+00 1.8060 0.0242 300
1.1739e+00 < x <= 1.2074e+00 1.9109 0.0250 310
1.2074e+00 < x <= 1.2730e+00 1.8950 0.0250 309
1.2730e+00 < x <= 1.5018e+00 1.7962 0.0250 310
1.5018e+00 < x 1.4931 0.0250 310
X_dev distribution
target_mean frequency count
1.7961 0.0252 104
2.0098 0.0298 123
2.3039 0.0257 106
2.2390 0.0262 108
2.3293 0.0240 99
1.9318 0.0194 80
2.1575 0.0199 82
2.1740 0.0291 120
2.2207 0.0337 139
2.1811 0.0233 96
2.0475 0.0262 108
2.2743 0.0218 90
2.2627 0.0293 121
2.1068 0.0247 102
2.4459 0.0228 94
2.1280 0.0269 111
2.1193 0.0240 99
2.2280 0.0259 107
2.0336 0.0237 98
2.0195 0.0216 89
1.9898 0.0235 97
2.2270 0.0216 89
1.9244 0.0254 105
2.1509 0.0237 98
2.2223 0.0274 113
1.9654 0.0271 112
2.1085 0.0257 106
2.0332 0.0240 99
1.9262 0.0264 109
2.1139 0.0274 113
1.9025 0.0225 93
1.8628 0.0271 112
1.9501 0.0259 107
2.0231 0.0206 85
1.8622 0.0271 112
1.8137 0.0250 103
2.0399 0.0259 107
1.6392 0.0218 90
1.7221 0.0250 103
1.6019 0.0240 99
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 1.049e+00 2.1535 0.5000 6192
1.049e+00 < x <= 1.093e+00 2.0915 0.2250 2787
1.093e+00 < x <= 1.139e+00 1.9857 0.1249 1547
1.139e+00 < x <= 1.273e+00 1.8563 0.1000 1238
1.273e+00 < x 1.6446 0.0501 620
X_dev distribution
target_mean frequency count
2.1526 0.5029 2076
2.0582 0.2248 928
1.9707 0.1235 510
1.8475 0.0998 412
1.6632 0.0489 202
--- [ContinuousCarver] Fit Quantitative('Population') (5/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 2.08e+02 1.9050 0.0251 311
2.08e+02 < x <= 3.53e+02 2.0277 0.0251 311
3.53e+02 < x <= 4.42e+02 2.0655 0.0250 310
4.42e+02 < x <= 5.12e+02 2.2067 0.0249 308
5.12e+02 < x <= 5.75e+02 2.1327 0.0250 310
5.75e+02 < x <= 6.27e+02 2.0731 0.0250 310
6.27e+02 < x <= 6.75e+02 2.3627 0.0249 308
6.75e+02 < x <= 7.16e+02 2.2006 0.0250 309
7.16e+02 < x <= 7.56e+02 2.0900 0.0253 313
7.56e+02 < x <= 7.94e+02 2.0191 0.0251 311
7.94e+02 < x <= 8.32e+02 2.3248 0.0251 311
8.32e+02 < x <= 8.67e+02 2.0763 0.0253 313
8.67e+02 < x <= 9.02e+02 2.0313 0.0247 306
9.02e+02 < x <= 9.40e+02 2.1185 0.0247 306
9.40e+02 < x <= 9.78e+02 2.1790 0.0253 313
9.78e+02 < x <= 1.02e+03 2.0746 0.0249 308
1.02e+03 < x <= 1.06e+03 1.9522 0.0247 306
1.06e+03 < x <= 1.09e+03 2.1186 0.0250 310
1.09e+03 < x <= 1.13e+03 2.0592 0.0252 312
1.13e+03 < x <= 1.17e+03 2.0640 0.0252 312
1.17e+03 < x <= 1.22e+03 2.0134 0.0249 308
1.22e+03 < x <= 1.26e+03 2.1690 0.0250 310
1.26e+03 < x <= 1.30e+03 2.0558 0.0248 307
1.30e+03 < x <= 1.35e+03 1.9711 0.0249 308
1.35e+03 < x <= 1.41e+03 2.0185 0.0250 310
1.41e+03 < x <= 1.46e+03 2.0004 0.0251 311
1.46e+03 < x <= 1.52e+03 2.0911 0.0248 307
1.52e+03 < x <= 1.59e+03 2.1322 0.0254 315
1.59e+03 < x <= 1.66e+03 1.9949 0.0246 305
1.66e+03 < x <= 1.73e+03 2.0233 0.0250 309
1.73e+03 < x <= 1.82e+03 1.8946 0.0253 313
1.82e+03 < x <= 1.91e+03 1.9504 0.0247 306
1.91e+03 < x <= 2.02e+03 2.0074 0.0250 310
2.02e+03 < x <= 2.16e+03 2.0213 0.0250 310
2.16e+03 < x <= 2.32e+03 2.0541 0.0250 309
2.32e+03 < x <= 2.56e+03 2.0757 0.0250 310
2.56e+03 < x <= 2.86e+03 2.0142 0.0250 309
2.86e+03 < x <= 3.28e+03 1.9196 0.0250 309
3.28e+03 < x <= 4.25e+03 2.0439 0.0250 310
4.25e+03 < x 2.0010 0.0250 310
X_dev distribution
target_mean frequency count
1.9895 0.0269 111
1.8189 0.0271 112
2.1479 0.0271 112
2.2434 0.0266 110
2.1281 0.0269 111
2.2908 0.0257 106
2.0926 0.0283 117
2.1757 0.0213 88
2.2182 0.0259 107
2.1433 0.0286 118
2.0769 0.0293 121
2.1889 0.0240 99
2.0488 0.0218 90
2.1585 0.0247 102
2.0699 0.0259 107
2.0396 0.0247 102
1.9843 0.0254 105
2.1062 0.0213 88
1.9823 0.0242 100
2.1353 0.0271 112
2.1132 0.0230 95
1.9696 0.0252 104
2.1243 0.0196 81
1.9774 0.0245 101
1.8002 0.0245 101
2.1500 0.0264 109
1.9471 0.0293 121
1.9535 0.0262 108
2.0915 0.0274 113
2.0390 0.0228 94
2.1380 0.0211 87
1.9706 0.0203 84
1.8717 0.0264 109
1.9082 0.0247 102
2.0895 0.0233 96
1.8131 0.0266 110
2.0019 0.0269 111
2.0234 0.0201 83
2.1558 0.0262 108
2.0339 0.0225 93
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 3.53e+02 1.9663 0.0502 622
3.53e+02 < x <= 8.32e+02 2.1636 0.2253 2790
8.32e+02 < x <= 1.73e+03 2.0604 0.4745 5876
1.73e+03 < x <= 2.16e+03 1.9683 0.1000 1239
2.16e+03 < x 2.0181 0.1500 1857
X_dev distribution
target_mean frequency count
1.9038 0.0540 223
2.1659 0.2398 990
2.0445 0.4680 1932
1.9639 0.0925 382
2.0169 0.1456 601
--- [ContinuousCarver] Fit Quantitative('AveOccup') (6/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 1.699e+00 2.6141 0.0250 310
1.699e+00 < x <= 1.868e+00 2.7986 0.0250 310
1.868e+00 < x <= 1.976e+00 2.6979 0.0250 309
1.976e+00 < x <= 2.071e+00 2.5558 0.0250 310
2.071e+00 < x <= 2.161e+00 2.4582 0.0250 309
2.161e+00 < x <= 2.228e+00 2.2757 0.0250 310
2.228e+00 < x <= 2.288e+00 2.3592 0.0250 310
2.288e+00 < x <= 2.341e+00 2.2507 0.0250 309
2.341e+00 < x <= 2.388e+00 2.1371 0.0250 310
2.388e+00 < x <= 2.435e+00 2.2708 0.0250 309
2.435e+00 < x <= 2.475e+00 2.1989 0.0250 310
2.475e+00 < x <= 2.515e+00 2.1564 0.0250 309
2.515e+00 < x <= 2.557e+00 2.1279 0.0250 310
2.557e+00 < x <= 2.598e+00 2.2428 0.0250 310
2.598e+00 < x <= 2.639e+00 2.1116 0.0250 309
2.639e+00 < x <= 2.674e+00 2.2343 0.0250 310
2.674e+00 < x <= 2.712e+00 2.0489 0.0250 309
2.712e+00 < x <= 2.746e+00 2.2196 0.0250 310
2.746e+00 < x <= 2.784e+00 2.1211 0.0250 309
2.784e+00 < x <= 2.824e+00 2.2645 0.0250 310
2.824e+00 < x <= 2.861e+00 2.1565 0.0251 311
2.861e+00 < x <= 2.899e+00 2.2323 0.0250 309
2.899e+00 < x <= 2.943e+00 2.0714 0.0250 309
2.943e+00 < x <= 2.984e+00 2.0495 0.0250 309
2.984e+00 < x <= 3.026e+00 1.9917 0.0250 310
3.026e+00 < x <= 3.071e+00 1.9623 0.0250 309
3.071e+00 < x <= 3.117e+00 2.0491 0.0250 310
3.117e+00 < x <= 3.168e+00 1.9336 0.0250 310
3.168e+00 < x <= 3.221e+00 1.9472 0.0250 310
3.221e+00 < x <= 3.279e+00 1.8938 0.0250 309
3.279e+00 < x <= 3.344e+00 1.8804 0.0250 309
3.344e+00 < x <= 3.424e+00 1.8724 0.0250 310
3.424e+00 < x <= 3.508e+00 1.8000 0.0250 309
3.508e+00 < x <= 3.606e+00 1.6571 0.0250 310
3.606e+00 < x <= 3.719e+00 1.5624 0.0250 310
3.719e+00 < x <= 3.870e+00 1.5709 0.0250 309
3.870e+00 < x <= 4.089e+00 1.4854 0.0250 310
4.089e+00 < x <= 4.317e+00 1.4240 0.0250 309
4.317e+00 < x <= 4.705e+00 1.3233 0.0250 310
4.705e+00 < x 1.5280 0.0250 310
X_dev distribution
target_mean frequency count
2.7524 0.0220 91
2.7763 0.0293 121
2.6502 0.0257 106
2.5990 0.0242 100
2.4828 0.0296 122
2.4039 0.0247 102
2.2567 0.0281 116
2.4137 0.0230 95
2.3471 0.0211 87
2.2425 0.0300 124
2.0911 0.0252 104
2.2072 0.0259 107
2.1370 0.0262 108
2.0973 0.0281 116
2.0188 0.0230 95
2.0825 0.0225 93
2.2615 0.0247 102
2.0114 0.0213 88
2.2314 0.0257 106
2.0203 0.0233 96
2.0908 0.0286 118
1.8887 0.0233 96
1.9894 0.0250 103
2.2316 0.0228 94
2.0891 0.0291 120
1.9787 0.0223 92
2.0818 0.0279 115
1.8602 0.0203 84
1.9611 0.0189 78
1.7265 0.0230 95
1.7789 0.0259 107
1.8341 0.0274 113
1.6481 0.0211 87
1.6989 0.0247 102
1.6267 0.0271 112
1.5547 0.0250 103
1.4150 0.0293 121
1.5364 0.0220 91
1.4245 0.0262 108
1.5598 0.0266 110
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 2.16e+00 2.6250 0.1250 1548
2.16e+00 < x <= 2.90e+00 2.2005 0.4251 5264
2.90e+00 < x <= 3.51e+00 1.9501 0.2749 3404
3.51e+00 < x <= 3.87e+00 1.5968 0.0750 929
3.87e+00 < x 1.4402 0.1000 1239
X_dev distribution
target_mean frequency count
2.6484 0.1308 540
2.1665 0.4247 1753
1.9311 0.2636 1088
1.6265 0.0768 317
1.4801 0.1042 430
--- [ContinuousCarver] Fit Quantitative('Latitude') (7/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= 3.275e+01 1.5912 0.0287 355
3.275e+01 < x <= 3.321e+01 2.0299 0.0466 577
3.321e+01 < x <= 3.365e+01 2.7833 0.0279 345
3.365e+01 < x <= 3.374e+01 2.4326 0.0268 332
3.374e+01 < x <= 3.379e+01 2.1829 0.0262 325
3.379e+01 < x <= 3.383e+01 2.4232 0.0229 283
3.383e+01 < x <= 3.387e+01 2.3003 0.0241 299
3.387e+01 < x <= 3.391e+01 2.1570 0.0279 345
3.391e+01 < x <= 3.394e+01 1.6300 0.0242 300
3.394e+01 < x <= 3.397e+01 1.8594 0.0225 279
3.397e+01 < x <= 3.400e+01 1.9482 0.0224 278
3.400e+01 < x <= 3.403e+01 2.1267 0.0277 343
3.403e+01 < x <= 3.406e+01 2.4021 0.0339 420
3.406e+01 < x <= 3.410e+01 2.1760 0.0417 516
3.410e+01 < x <= 3.413e+01 2.3646 0.0242 300
3.413e+01 < x <= 3.417e+01 2.7771 0.0301 373
3.417e+01 < x <= 3.427e+01 2.4100 0.0435 539
3.427e+01 < x <= 3.453e+01 2.4559 0.0240 297
3.453e+01 < x <= 3.532e+01 1.4914 0.0246 305
3.532e+01 < x <= 3.623e+01 0.9208 0.0250 310
3.623e+01 < x <= 3.672e+01 1.2441 0.0262 324
3.672e+01 < x <= 3.697e+01 1.3129 0.0253 313
3.697e+01 < x <= 3.729e+01 2.6241 0.0239 296
3.729e+01 < x <= 3.737e+01 2.6574 0.0258 320
3.737e+01 < x <= 3.753e+01 3.0105 0.0255 316
3.753e+01 < x <= 3.765e+01 2.4197 0.0243 301
3.765e+01 < x <= 3.772e+01 2.1174 0.0256 317
3.772e+01 < x <= 3.777e+01 2.5537 0.0286 354
3.777e+01 < x <= 3.793e+01 2.6887 0.0459 569
3.793e+01 < x <= 3.800e+01 1.7622 0.0250 310
3.800e+01 < x <= 3.826e+01 1.5924 0.0243 301
3.826e+01 < x <= 3.850e+01 1.8570 0.0254 315
3.850e+01 < x <= 3.863e+01 1.3981 0.0241 298
3.863e+01 < x <= 3.898e+01 1.3962 0.0251 311
3.898e+01 < x <= 3.975e+01 1.1241 0.0255 316
3.975e+01 < x 0.8442 0.0244 302
X_dev distribution
target_mean frequency count
1.5761 0.0320 132
2.0768 0.0552 228
2.7115 0.0264 109
2.4368 0.0262 108
2.2910 0.0291 120
2.3528 0.0220 91
2.3233 0.0233 96
2.0937 0.0368 152
1.6319 0.0230 95
1.7992 0.0235 97
1.9408 0.0250 103
2.1292 0.0250 103
2.3261 0.0334 138
2.2762 0.0443 183
2.2228 0.0216 89
2.8224 0.0303 125
2.2938 0.0465 192
2.5025 0.0252 104
1.3719 0.0201 83
0.9336 0.0218 90
1.2516 0.0259 107
1.2597 0.0274 113
2.5507 0.0240 99
2.5351 0.0266 110
2.9827 0.0283 117
2.6519 0.0194 80
2.0869 0.0203 84
2.6145 0.0242 100
2.5853 0.0516 213
1.6630 0.0250 103
1.5156 0.0206 85
1.7549 0.0225 93
1.3101 0.0196 81
1.3997 0.0279 115
1.1114 0.0235 97
0.8671 0.0225 93
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= 3.45e+01 2.2311 0.5254 6506
3.45e+01 < x <= 3.70e+01 1.2415 0.1011 1252
3.70e+01 < x <= 3.79e+01 2.5927 0.1997 2473
3.79e+01 < x <= 3.90e+01 1.6035 0.1240 1535
3.90e+01 < x 0.9873 0.0499 618
X_dev distribution
target_mean frequency count
2.2111 0.5487 2265
1.2065 0.0952 393
2.5902 0.1945 803
1.5312 0.1156 477
0.9918 0.0460 190
--- [ContinuousCarver] Fit Quantitative('Longitude') (8/8)
 [ContinuousCarver] Raw distribution
X distribution
  target_mean frequency count
x <= -1.2269e+02 1.4063 0.0259 321
-1.2269e+02 < x <= -1.2247e+02 2.8878 0.0259 321
-1.2247e+02 < x <= -1.2241e+02 3.2397 0.0245 303
-1.2241e+02 < x <= -1.2229e+02 2.1582 0.0262 324
-1.2229e+02 < x <= -1.2215e+02 2.3071 0.0476 589
-1.2215e+02 < x <= -1.2206e+02 2.5665 0.0263 326
-1.2206e+02 < x <= -1.2199e+02 2.6265 0.0253 313
-1.2199e+02 < x <= -1.2191e+02 2.6924 0.0237 294
-1.2191e+02 < x <= -1.2181e+02 2.2919 0.0255 316
-1.2181e+02 < x <= -1.2157e+02 1.7103 0.0242 300
-1.2157e+02 < x <= -1.2139e+02 1.1736 0.0252 312
-1.2139e+02 < x <= -1.2127e+02 1.3270 0.0263 326
-1.2127e+02 < x <= -1.2101e+02 1.4857 0.0238 295
-1.2101e+02 < x <= -1.2064e+02 1.4716 0.0245 304
-1.2064e+02 < x <= -1.2007e+02 1.3376 0.0254 314
-1.2007e+02 < x <= -1.1972e+02 1.2624 0.0258 319
-1.1972e+02 < x <= -1.1929e+02 1.3332 0.0239 296
-1.1929e+02 < x <= -1.1897e+02 1.3300 0.0250 310
-1.1897e+02 < x <= -1.1852e+02 2.7211 0.0258 319
-1.1852e+02 < x <= -1.1843e+02 3.1653 0.0284 352
-1.1843e+02 < x <= -1.1838e+02 3.4432 0.0238 295
-1.1838e+02 < x <= -1.1834e+02 2.7480 0.0249 308
-1.1834e+02 < x <= -1.1830e+02 2.3435 0.0271 336
-1.1830e+02 < x <= -1.1822e+02 1.7476 0.0480 594
-1.1822e+02 < x <= -1.1818e+02 1.8055 0.0227 281
-1.1818e+02 < x <= -1.1813e+02 2.1480 0.0287 356
-1.1813e+02 < x <= -1.1808e+02 2.2494 0.0243 301
-1.1808e+02 < x <= -1.1801e+02 2.4079 0.0245 303
-1.1801e+02 < x <= -1.1790e+02 2.2304 0.0468 580
-1.1790e+02 < x <= -1.1780e+02 2.4820 0.0266 329
-1.1780e+02 < x <= -1.1766e+02 2.2864 0.0248 307
-1.1766e+02 < x <= -1.1739e+02 1.6791 0.0237 294
-1.1739e+02 < x <= -1.1725e+02 1.6380 0.0290 359
-1.1725e+02 < x <= -1.1716e+02 2.0512 0.0229 284
-1.1716e+02 < x <= -1.1708e+02 1.5113 0.0249 308
-1.1708e+02 < x <= -1.1696e+02 1.6669 0.0235 291
-1.1696e+02 < x 1.1769 0.0245 304
X_dev distribution
target_mean frequency count
1.3927 0.0216 89
3.0129 0.0233 96
3.1899 0.0225 93
2.1911 0.0271 112
2.3035 0.0453 187
2.9862 0.0240 99
2.5471 0.0240 99
2.6969 0.0230 95
2.1464 0.0250 103
1.7105 0.0218 90
1.0959 0.0220 91
1.2918 0.0291 120
1.3781 0.0230 95
1.4767 0.0225 93
1.2441 0.0252 104
1.2810 0.0281 116
1.2813 0.0252 104
1.4223 0.0274 113
2.7081 0.0218 90
3.2548 0.0266 110
3.3604 0.0242 100
2.8064 0.0262 108
2.2395 0.0305 126
1.7631 0.0434 179
1.6175 0.0298 123
2.0881 0.0264 109
2.3487 0.0245 101
2.4322 0.0235 97
2.1850 0.0497 205
2.5202 0.0288 119
2.2701 0.0235 97
1.7464 0.0225 93
1.8748 0.0310 128
2.1466 0.0266 110
1.4479 0.0279 115
1.5746 0.0271 112
1.2465 0.0259 107
 [ContinuousCarver] Carved distribution
X distribution
  target_mean frequency count
x <= -1.218e+02 2.4438 0.2509 3107
-1.218e+02 < x <= -1.190e+02 1.3787 0.2242 2776
-1.190e+02 < x <= -1.183e+02 3.0175 0.1029 1274
-1.183e+02 < x <= -1.177e+02 2.1601 0.2735 3387
-1.177e+02 < x 1.6155 0.1486 1840
X_dev distribution
target_mean frequency count
2.4780 0.2357 973
1.3487 0.2243 926
3.0414 0.0988 408
2.1328 0.2800 1156
1.6763 0.1611 665
[20]:
   library           fit_s  transform_s  train_r2  test_r2  r2_drop
0  AutoCarver        5.245       0.0778    0.6652   0.6595   0.0057
1  optbinning        2.404       0.0083    0.5145   0.5077   0.0068
2  KBinsDiscretizer  0.007       0.0015    0.6181   0.6192  -0.0011
[21]:
plot_bars(regression_results, ['fit_s', 'test_r2', 'r2_drop'], 'California Housing \u2014 regression')
[Figure: bar charts of fit_s, test_r2 and r2_drop per library, California Housing regression]
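
The carved distributions logged above report a per-bin target_mean on both the train sample (X) and the dev sample (X_dev). As a rough illustration of what "robust" bins look like, the sketch below checks whether the bins are ranked in the same order on both samples. The helper names ``bin_ranking`` and ``same_ranking`` are hypothetical, and this is only one plausible consistency check; AutoCarver's actual veto criteria may differ.

```python
def bin_ranking(target_means):
    """Return bin indices ordered from lowest to highest target mean."""
    return sorted(range(len(target_means)), key=lambda i: target_means[i])


def same_ranking(train_means, dev_means):
    """True when the bins are ordered identically on train and dev."""
    return bin_ranking(train_means) == bin_ranking(dev_means)


# Per-bin target_mean values copied from the "Carved distribution" logs above,
# as (train, dev) pairs. Only three of the eight features are shown.
carved = {
    "MedInc": (
        [1.2093, 1.5796, 1.9560, 2.4238, 3.7524],
        [1.2323, 1.5934, 1.9604, 2.4652, 3.6870],
    ),
    "HouseAge": (
        [1.9466, 2.1412, 2.0526, 2.0381, 2.5848],
        [1.9316, 2.0824, 2.0383, 2.0347, 2.5968],
    ),
    "Latitude": (
        [2.2311, 1.2415, 2.5927, 1.6035, 0.9873],
        [2.2111, 1.2065, 2.5902, 1.5312, 0.9918],
    ),
}

consistent = {
    name: same_ranking(train, dev) for name, (train, dev) in carved.items()
}
print(consistent)
```

A ranking check like this ignores how far apart the per-bin means are, so it should be read alongside the frequencies and counts in the log rather than on its own.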

How to read these numbers

  • ``fit_s`` / ``transform_s`` measure only .fit / .transform wall-clock — not data loading, not one-hot encoding, not the downstream model.

  • ``test_auc`` / ``test_r2`` are the headline metrics. They reflect how well a simple linear downstream model (logistic regression or ridge) performs on each library’s one-hot-encoded bin output. A tree-based downstream model would tell a different (and less binning-sensitive) story.

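  • ``auc_drop`` / ``r2_drop`` are the train score minus the test score and give a coarse measure of how much each library’s bins overfit. Values closer to zero are more robust; a slightly negative value, such as KBinsDiscretizer’s -0.0011 in the regression table, means the test score marginally exceeded the train score and is best read as noise. AutoCarver’s dev-set veto is designed to keep this small.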

  • Same data, same seed, same downstream model across libraries — but a single run, on one machine, with one set of hyper-parameters. Treat as illustrative.
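
As a minimal sketch of the two measurements described above, the snippet below times a single call and computes the train-minus-test drop. The helpers ``timed`` and ``score_drop`` are hypothetical names, and the use of ``time.perf_counter`` is an assumption, since the notebook's own timing helper is not shown in this section. The scores are the rounded values from the California Housing table.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def score_drop(train_score, test_score, ndigits=4):
    """Train score minus test score; positive means worse on test."""
    return round(train_score - test_score, ndigits)


# (train_r2, test_r2) pairs copied from the regression results table.
scores = {
    "AutoCarver": (0.6652, 0.6595),
    "optbinning": (0.5145, 0.5077),
    "KBinsDiscretizer": (0.6181, 0.6192),
}

drops = {name: score_drop(train, test) for name, (train, test) in scores.items()}
print(drops)

# Timing wraps only the call of interest, mirroring fit_s / transform_s.
total, elapsed = timed(sum, range(1_000_000))
print(f"sum took {elapsed:.4f}s")
```

Because the drop is recomputed here from rounded scores, it could differ in the last digit from a value computed on unrounded scores.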

When the result will move

  • Bigger ``max_n_mod`` / smaller ``min_freq``. Allowing more, or smaller, bins tends to improve AutoCarver’s and optbinning’s in-sample scores at the cost of a larger *_drop. KBinsDiscretizer never looks at the target when placing its bin edges, so it is mostly insensitive to this.

  • Different downstream model. Gradient-boosted trees on the raw features will typically outscore a binning + linear pipeline. The point of binning is interpretability, not raw accuracy.

  • Different dataset. German Credit has only 1,000 rows and California Housing 20,640; on a 10M-row credit-risk dataset, ``fit_s`` would likely dominate the comparison.
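
The first point above, that more bins raise the in-sample score while widening the train/test gap, can be illustrated with a toy experiment. The sketch below is not AutoCarver or optbinning: it uses synthetic data, equal-width bins and a per-bin mean as the "model", and ``n_bins`` is only a stand-in for a setting such as ``max_n_mod``. All function names are hypothetical.

```python
import random


def bin_index(x, n_bins):
    """Equal-width bin for x in [0, 1); the right edge is clamped."""
    return min(int(x * n_bins), n_bins - 1)


def fit_bin_means(xs, ys, n_bins):
    """Per-bin mean of y on the training data (overall mean for empty bins)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = bin_index(x, n_bins)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)
    return [s / c if c else overall for s, c in zip(sums, counts)]


def r2(xs, ys, means):
    """R² of the bin-mean predictor on (xs, ys)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - means[bin_index(x, len(means))]) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)  # fixed seed, single run: illustrative only


def sample(n):
    """Synthetic data: a linear signal in x plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


x_train, y_train = sample(400)
x_test, y_test = sample(400)

results = {}
for n_bins in (4, 16, 64):  # nested grids: each one refines the previous
    means = fit_bin_means(x_train, y_train, n_bins)
    train_r2 = r2(x_train, y_train, means)
    test_r2 = r2(x_test, y_test, means)
    results[n_bins] = (train_r2, test_r2, train_r2 - test_r2)
    print(f"n_bins={n_bins:>2}  train_r2={train_r2:.4f}  "
          f"test_r2={test_r2:.4f}  r2_drop={train_r2 - test_r2:.4f}")
```

Because each grid refines the previous one, the train R² cannot decrease as ``n_bins`` grows; the test R² carries no such guarantee, which is why the drop tends to widen.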

See comparison.rst for the qualitative scope and algorithmic comparison.