Quick Start

Setting things up

Target type and Carver selection

Depending on one’s desired modelling task, several Carvers are implemented:

In the following quick start example, we will consider a binary classification problem:

target = "binary_target"

Hence the use of BinaryCarver and ClassificationSelector in the following code blocks.

Data Sampling

AutoCarver tests the robustness of carved modalities on X_dev while maximizing the association between X_train and y_train.

# defining training and testing sets
train_set = ...  # used to fit the AutoCarver and the model
dev_set = ...  # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
test_set = ...  # used to evaluate the final model's performance
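The three samples above can be built in many ways. As a minimal sketch, assuming the data fits in a single pandas DataFrame, the split could look like the following. The toy DataFrame, the 60/20/20 proportions and the random seed are illustrative assumptions, not AutoCarver requirements.

```python
import pandas as pd

# hypothetical toy dataset standing in for the real data
data = pd.DataFrame(
    {
        "quantitative1": range(100),
        "binary_target": [i % 2 for i in range(100)],
    }
)

# shuffle once with a fixed seed so that the split is reproducible
shuffled = data.sample(frac=1.0, random_state=42).reset_index(drop=True)

# 60% train, 20% dev, 20% test (illustrative proportions)
n_train = int(0.6 * len(shuffled))
n_dev = int(0.2 * len(shuffled))

train_set = shuffled.iloc[:n_train]
dev_set = shuffled.iloc[n_train : n_train + n_dev]
test_set = shuffled.iloc[n_train + n_dev :]

print(len(train_set), len(dev_set), len(test_set))
```

For imbalanced targets or time-dependent data, a stratified or chronological split may be more appropriate than a plain shuffle.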

Setting up Features to Carve

from AutoCarver import Features

features = Features(
    quantitatives=['quantitative1', 'quantitative2', 'discrete1', 'discrete2_with_nan'],
    categoricals=['categorical1', 'categorical2', 'categorical3_with_nan'],
    ordinals={'ordinal1': ['low', 'medium', 'high'], 'ordinal2_with_nan': ['low', 'medium', 'high']},
)

Qualitative features will automatically be converted to str if necessary. Ordinal features are added alongside their expected ordering.

To wrap already-instantiated feature objects (e.g. CategoricalFeature, OrdinalFeature, QuantitativeFeature) use Features.from_list() instead. Collection-level state (nan / default / ordinal_encoding / dropna) can be propagated to every feature via a FeaturesConfig.

Using AutoCarver

Fitting AutoCarver

from AutoCarver import BinaryCarver

# instantiating AutoCarver
binary_carver = BinaryCarver(
    features=features,
    min_freq=0.02,  # minimum frequency per modality
    max_n_mod=5,  # maximum number of modalities per carved feature (mandatory)
)

# fitting on the training sample; a dev sample can be specified to evaluate carving robustness
train_set_discretized = binary_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])

Note

Behavioral toggles (copy, ordinal_encoding, dropna, verbose, n_jobs, min_freq_alpha) are now grouped in DiscretizerConfig. Carvers default to DiscretizerConfig(dropna=True, ordinal_encoding=True).

min_freq is gated by a Wilson score confidence interval at significance min_freq_alpha (default 0.05): raise it for a stricter representativity test, lower it for more lenient merging. See Minimum-frequency viability test (Wilson score interval) for the formula. n_jobs > 1 parallelises the per-feature combination search via multiprocessing.Pool, which is useful only with hundreds to thousands of features (see Per-feature parallelism (n_jobs)).
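To give an intuition of the interval involved, here is a minimal sketch of the standard two-sided Wilson score interval for the observed frequency of a modality. The function name wilson_interval and the example counts are hypothetical. How AutoCarver compares this interval to min_freq is described in the section referenced above and is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, n_obs, alpha=0.05):
    """Two-sided Wilson score interval for the proportion count / n_obs."""
    if n_obs <= 0:
        raise ValueError("n_obs must be positive")

    # standard normal quantile associated with the significance level
    z = NormalDist().inv_cdf(1 - alpha / 2)

    p_hat = count / n_obs
    denominator = 1 + z**2 / n_obs
    center = (p_hat + z**2 / (2 * n_obs)) / denominator
    half_width = z * sqrt(p_hat * (1 - p_hat) / n_obs + z**2 / (4 * n_obs**2)) / denominator

    return center - half_width, center + half_width


# hypothetical modality observed 10 times out of 100 rows
lower, upper = wilson_interval(10, 100, alpha=0.05)
print(f"observed frequency 0.10, interval [{lower:.4f}, {upper:.4f}]")
```

A smaller alpha widens the interval, and a larger sample narrows it.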

To pick a different association metric, pass a pre-built combination evaluator via the combination_evaluator keyword (e.g. CramervCombinations for binary targets). The search uses progressive top-K interval dynamic programming (DP) for both Kruskal-H and Pearson \(\chi^2\), and is statistically equivalent to the legacy enumerate-and-score path.
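For readers unfamiliar with Cramér's V, the sketch below computes it from a contingency table of modalities (rows) against target classes (columns), using the Pearson \(\chi^2\) statistic. It only illustrates the metric itself. The function name cramers_v and the example table are hypothetical, and this is not the code behind CramervCombinations.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V of a contingency table given as a list of rows of counts."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    # Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    # normalise by the sample size and the smallest dimension minus one
    min_dim = min(n_rows - 1, n_cols - 1)
    return sqrt(chi2 / (total * min_dim)) if min_dim > 0 else 0.0


# hypothetical carved feature with 3 modalities against a binary target
association = cramers_v([[30, 10], [20, 20], [10, 30]])
print(round(association, 4))
```

Values close to 0 indicate no association between modalities and target, values close to 1 a near-perfect one.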

Applying AutoCarver

# transforming dev/test sample accordingly
dev_set_discretized = binary_carver.transform(dev_set)
test_set_discretized = binary_carver.transform(test_set)
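Once both samples are discretized, it can be useful to inspect by hand how stable the carved modalities are. The sketch below assumes one possible reading of robustness: modalities should rank in the same order by target rate on the training and dev samples. This is an illustrative check written with plain pandas, with hypothetical names and toy data, and not necessarily the test AutoCarver applies internally.

```python
import pandas as pd


def target_rate_order(frame, feature, target):
    """Modalities of `feature` sorted by increasing mean of `target`."""
    rates = frame.groupby(feature)[target].mean()
    return list(rates.sort_values().index)


# hypothetical discretized samples with a single carved feature
train_toy = pd.DataFrame(
    {"feature": ["a", "a", "b", "b", "c", "c"], "binary_target": [0, 0, 0, 1, 1, 1]}
)
dev_toy = pd.DataFrame(
    {"feature": ["a", "a", "b", "b", "c", "c"], "binary_target": [0, 0, 1, 0, 1, 1]}
)

# same ranking of modalities on both samples: the carving looks stable
same_order = target_rate_order(train_toy, "feature", "binary_target") == target_rate_order(
    dev_toy, "feature", "binary_target"
)
print(same_order)
```

Ties between target rates are not handled here, so a real check would need an explicit tie-breaking rule.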

Saving AutoCarver

All Carvers can safely be serialized as a .json file.

binary_carver.save('my_carver.json')

Loading AutoCarver

Carvers can safely be loaded from a .json file.

from AutoCarver import BinaryCarver

binary_carver = BinaryCarver.load('my_carver.json')

Feature Selection

from AutoCarver.selectors import ClassificationSelector

# select the 25 features most associated with the target, per data type
classification_selector = ClassificationSelector(
    features=features,  # features to select from
    n_best_per_type=25,  # number of features to select per data type
    verbose=True,  # displays statistics
)
best_features = classification_selector.select(train_set_discretized, train_set_discretized[target])