Quick Start

Setting things up

Target type and Carver selection

Depending on one’s desired modelling task, several Carvers are implemented:

In the following quick start example, we will consider a binary classification problem:

target = "binary_target"

Hence the use of BinaryCarver and ClassificationSelector in the following code blocks.

Data Sampling

AutoCarver enables testing the robustness of carved modalities on X_dev while maximizing the association between X_train and y_train.

# defining training and testing sets
train_set = ...  # used to fit the AutoCarver and the model
dev_set = ...  # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
test_set = ...  # used to evaluate the final model's performances
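
One way to obtain the three samples, as a minimal sketch assuming the data sits in a single pandas DataFrame: shuffle once, then slice. The 60/20/20 proportions, the split_train_dev_test helper and the synthetic columns are illustrative assumptions, not part of AutoCarver.

```python
import numpy as np
import pandas as pd


def split_train_dev_test(data, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle `data` once and cut it into train, dev and test samples."""
    shuffled = data.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_dev = round(len(shuffled) * dev_frac)
    n_test = round(len(shuffled) * test_frac)
    n_train = len(shuffled) - n_dev - n_test
    train = shuffled.iloc[:n_train]
    dev = shuffled.iloc[n_train:n_train + n_dev]
    test = shuffled.iloc[n_train + n_dev:]
    return train, dev, test


# synthetic data standing in for a real dataset (hypothetical columns)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "numerical1": rng.normal(size=1000),
    "binary_target": rng.integers(0, 2, size=1000),
})

train_set, dev_set, test_set = split_train_dev_test(data)
print(len(train_set), len(dev_set), len(test_set))
```

With a rare positive class, a stratified split (for instance scikit-learn's train_test_split with its stratify argument) may be preferable to a plain shuffle.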

Setting up Features to Carve

from AutoCarver import Features

features = Features(
    numericals=['numerical1', 'numerical2', 'discrete1', 'discrete2_with_nan'],
    categoricals=['categorical1', 'categorical2', 'categorical3_with_nan'],
    ordinals={'ordinal1': ['low', 'medium', 'high'], 'ordinal2_with_nan': ['low', 'medium', 'high']},
)

Qualitative features will automatically be converted to str if necessary. Ordinal features are added alongside their expected ordering.
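
Before fitting, it can be worth checking that the data only contains values listed in each declared ordering. The sketch below does this with pandas. The unexpected_ordinal_values helper and the sample data are hypothetical, and how AutoCarver itself reacts to unlisted values is not covered here.

```python
import pandas as pd


def unexpected_ordinal_values(data, ordinals):
    """Return, per ordinal column, the non-missing values absent from its declared ordering."""
    unexpected = {}
    for column, ordering in ordinals.items():
        # compare as strings, ignoring missing values
        observed = set(data[column].dropna().astype(str).unique())
        extra = observed - set(ordering)
        if extra:
            unexpected[column] = sorted(extra)
    return unexpected


ordinals = {
    "ordinal1": ["low", "medium", "high"],
    "ordinal2_with_nan": ["low", "medium", "high"],
}
# toy sample: "very high" is not part of the declared ordering
sample = pd.DataFrame({
    "ordinal1": ["low", "high", "medium", "very high"],
    "ordinal2_with_nan": ["low", None, "high", "medium"],
})
print(unexpected_ordinal_values(sample, ordinals))
```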

To wrap already-instantiated feature objects (e.g. CategoricalFeature, OrdinalFeature, NumericalFeature) use Features.from_list() instead. Collection-level state (nan / default / ordinal_encoding / dropna) can be propagated to every feature via a FeaturesConfig.

Using AutoCarver

Fitting AutoCarver

from AutoCarver import BinaryCarver

# instantiating AutoCarver
binary_carver = BinaryCarver(
    features=features,
    min_freq=0.02,  # minimum frequency per modality
    max_n_mod=5,  # maximum number of modalities per carved feature (mandatory)
)

# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_discretized = binary_carver.fit_transform(
    train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target]
)
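
The robustness criterion is not spelled out here. One common reading, assumed in the sketch below, is that buckets are robust when they rank in the same order by target rate on both samples. The helpers target_rates and same_target_rate_ordering are hypothetical and the data is a toy example; this is not AutoCarver's own test.

```python
import pandas as pd


def target_rates(buckets, y):
    """Mean target per bucket, sorted by bucket label."""
    return y.groupby(buckets).mean().sort_index()


def same_target_rate_ordering(train_buckets, y_train, dev_buckets, y_dev):
    """True when buckets rank identically by target rate on train and dev."""
    train_rank = target_rates(train_buckets, y_train).sort_values(kind="stable").index.tolist()
    dev_rank = target_rates(dev_buckets, y_dev).sort_values(kind="stable").index.tolist()
    return train_rank == dev_rank


# toy bucket labels (0, 1, 2) for one carved feature, with a binary target
train_buckets = pd.Series([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y_train = pd.Series([0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0])  # rates 0.25, 0.50, 0.75
dev_buckets = pd.Series([0, 0, 1, 1, 2, 2])
y_dev_stable = pd.Series([0, 0, 1, 0, 1, 1])  # rates 0.0, 0.5, 1.0
y_dev_flipped = pd.Series([1, 1, 1, 0, 0, 0])  # rates 1.0, 0.5, 0.0

print(same_target_rate_ordering(train_buckets, y_train, dev_buckets, y_dev_stable))
print(same_target_rate_ordering(train_buckets, y_train, dev_buckets, y_dev_flipped))
```

On real data, the bucket labels would be a column of the discretized output and the target the matching column of the original sample.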

Note

Behavioral toggles (copy, ordinal_encoding, dropna, verbose, n_jobs, min_freq_alpha) are now grouped in ProcessingConfig. Carvers default to ProcessingConfig(dropna=True, ordinal_encoding=True).

min_freq is gated by a Wilson score confidence interval at significance level min_freq_alpha (default 0.05): raise it for a stricter representativity test, lower it for more lenient merging. See Minimum-frequency test (Wilson score interval) for the formula. n_jobs > 1 parallelises the per-feature combination search via multiprocessing.Pool, which is useful only with hundreds to thousands of features (see Per-feature parallelism (n_jobs)).
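
For intuition, the standard Wilson score interval can be computed with the standard library alone. This is a minimal sketch of the textbook two-sided interval; whether AutoCarver uses a one- or two-sided bound, and how exactly it compares the interval to min_freq, is left to the referenced section. wilson_interval is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, n, alpha=0.05):
    """Two-sided Wilson score interval for the proportion count / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # normal quantile for the chosen alpha
    p_hat = count / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# a modality observed 20 times out of 1000 rows (observed frequency 0.02)
for alpha in (0.01, 0.05, 0.20):
    low, high = wilson_interval(20, 1000, alpha)
    print(f"alpha={alpha:.2f}: [{low:.4f}, {high:.4f}]")
```

A larger alpha gives a narrower interval, so the same observed frequency is judged with less tolerance for sampling noise.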

To pick a different association metric, pass a pre-built combination evaluator via the combination_evaluator keyword (e.g. CramervCombinations for binary). The search uses progressive top-K interval dynamic programming (DP) for both Kruskal-H and Pearson \(\chi^2\); it is statistically equivalent to the legacy enumerate-and-score path.
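
As a reminder of what a Cramér's V based metric measures, here is a minimal, dependency-free sketch that derives it from the Pearson \(\chi^2\) statistic of a contingency table (rows: modalities, columns: target classes). It illustrates the statistic only; cramers_v is a hypothetical helper and not the code behind CramervCombinations.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1 or 0 in row_totals or 0 in col_totals:
        raise ValueError("need at least a 2x2 table without empty rows or columns")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# two modalities crossed with a binary target (toy counts)
print(cramers_v([[10, 20], [20, 10]]))  # moderate association
print(cramers_v([[10, 10], [10, 10]]))  # independence
print(cramers_v([[10, 0], [0, 10]]))    # perfect association
```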

Applying AutoCarver

# transforming dev/test sample accordingly
dev_set_discretized = binary_carver.transform(dev_set)
test_set_discretized = binary_carver.transform(test_set)

Saving AutoCarver

All Carvers can safely be serialized as a .json file.

binary_carver.save('my_carver.json')

Loading AutoCarver

Carvers can safely be loaded from a .json file.

from AutoCarver import BinaryCarver

binary_carver = BinaryCarver.load('my_carver.json')

Feature Selection

from AutoCarver.selectors import ClassificationSelector

# select the 25 most target-associated features per data type
classification_selector = ClassificationSelector(
    features=features,  # features to select from
    n_best_per_type=25,  # number of features to select per data type
)
best_features = classification_selector.fit(train_set_discretized, train_set_discretized[target]).selected_features

Note

Selectors mirror the carver API: fit scores/ranks/filters and stores the selection, transform restricts X to the selected columns, and selected_features returns the selected Features directly. Behavioral toggles (verbose …) are grouped in ProcessingConfig, exactly as for the carvers. Every feature is scored exactly (no sampling), yet selection stays fast through vectorized, all-columns-at-once measures.
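
The idea behind n_best_per_type can be sketched as "rank by score, then keep the top n within each data type". The example below is an illustration under assumptions: the scores are made up, the n_best_per_type function is a hypothetical helper, and the filtering step mentioned above is not modelled.

```python
from collections import defaultdict


def n_best_per_type(scores, feature_types, n_best):
    """Keep the `n_best` highest-scoring features within each data type."""
    by_type = defaultdict(list)
    for name, score in scores.items():
        by_type[feature_types[name]].append((score, name))
    selected = []
    for dtype in sorted(by_type):
        # highest score first, feature name as a deterministic tie-breaker
        ranked = sorted(by_type[dtype], key=lambda pair: (-pair[0], pair[1]))
        selected.extend(name for _, name in ranked[:n_best])
    return selected


# made-up association scores, for illustration only
scores = {
    "numerical1": 0.30,
    "numerical2": 0.05,
    "discrete1": 0.12,
    "categorical1": 0.22,
    "categorical2": 0.40,
}
feature_types = {
    "numerical1": "numerical",
    "numerical2": "numerical",
    "discrete1": "numerical",
    "categorical1": "categorical",
    "categorical2": "categorical",
}
print(n_best_per_type(scores, feature_types, n_best=2))
```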