Quick Start
Setting things up
Target type and Carver selection
Depending on one’s desired modelling task, several Carvers are implemented:
In the following quick start example, we will consider a binary classification problem:
target = "binary_target"
Hence the use of BinaryCarver and ClassificationSelector in following code blocks.
Data Sampling
AutoCarver unables testing for robustness of carved modalities on X_dev while maximizing the association between X_train and y_train.
# defining training and testing sets
train_set = ... # used to fit the AutoCarver and the model
dev_set = ... # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
test_set = ... # used to evaluate the final model's performances
Setting up Features to Carve
from AutoCarver import Features
features = Features(
numericals=['numerical1', 'numerical2', 'discrete1', 'discrete2_with_nan'],
categoricals=['categorical1', 'categorical2', 'categorical3_with_nan'],
ordinals={'ordinal1': ['low', 'medium', 'high'], 'ordinal2_with_nan': ['low', 'medium', 'high']},
)
Qualitative features will automatically be converted to str if necessary.
Ordinal features are added, alongside there expected ordering.
To wrap already-instantiated feature objects (e.g. CategoricalFeature,
OrdinalFeature, NumericalFeature) use Features.from_list()
instead. Collection-level state (nan / default / ordinal_encoding / dropna)
can be propagated to every feature via a FeaturesConfig.
Using AutoCarver
Fitting AutoCarver
from AutoCarver import BinaryCarver
# intiating AutoCarver
binary_carver = BinaryCarver(
features=features,
min_freq=0.02, # minimum frequency per modality
max_n_mod=5, # maximum number of modality per Carved feature (mandatory)
)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
x_discretized = binary_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
Note
Behavioral toggles (copy, ordinal_encoding, dropna, verbose,
n_jobs, min_freq_alpha) are now grouped in ProcessingConfig.
Carvers default to ProcessingConfig(dropna=True, ordinal_encoding=True).
min_freq is gated by a Wilson score confidence interval at significance
min_freq_alpha (default 0.05): raise it for a stricter representativity
test, lower it for more lenient merging — see Minimum-frequency test (Wilson score interval) for the
formula. n_jobs > 1 parallelises the per-feature combination search via
multiprocessing.Pool; useful only on hundreds-to-thousands of features
(see Per-feature parallelism (n_jobs)).
To pick a different association metric, pass a pre-built
combination evaluator via the
combination_evaluator keyword (e.g. CramervCombinations for binary).
The search uses progressive top-K interval dynamic programming (DP)
for both Kruskal-H and Pearson \(\chi^2\); statistically equivalent to the
legacy enumerate-and-score path.
Applying AutoCarver
# transforming dev/test sample accordingly
dev_set_discretized = binary_carver.transform(dev_set)
test_set_discretized = binary_carver.transform(tes_set)
Saving AutoCarver
All Carvers can safely be serialized as a .json file.
binary_carver.save('my_carver.json')
Loading AutoCarver
Carvers can safely be loaded from a .json file.
from AutoCarver import BinaryCarver
binary_carver = BinaryCarver.load('my_carver.json')
Feature Selection
from AutoCarver.selectors import ClassificationSelector
# select the best 25 most target associated features
classification_selector = ClassificationSelector(
features=features, # features to select from
n_best_per_type=25, # number of features to select per data type
)
best_features = classification_selector.fit(train_set_discretized, train_set_discretized[target]).selected_features
Note
Selectors mirror the carver API: fit scores/ranks/filters and stores the
selection, transform restricts X to the selected columns, and
selected_features returns the selected Features directly.
Behavioral toggles (verbose …) are grouped in ProcessingConfig,
exactly as for the carvers. Every feature is scored exactly (no sampling),
yet selection stays fast through vectorized, all-columns-at-once measures.