Quick Start
===========

Setting things up
-----------------

Target type and Carver selection
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Depending on the modelling task, several Carvers are implemented:

* :ref:`BinaryCarver`
* :ref:`MulticlassCarver`
* :ref:`ContinuousCarver`

In the following quick start example, we will consider a binary classification problem:

.. code-block:: python

    target = "binary_target"

Hence the use of :class:`BinaryCarver` and :class:`ClassificationSelector` in the following code blocks.

Data Sampling
^^^^^^^^^^^^^

**AutoCarver** enables testing the robustness of carved modalities on ``X_dev`` while maximizing the association between ``X_train`` and ``y_train``.

.. code-block:: python

    # defining training and testing sets
    train_set = ...  # used to fit the AutoCarver and the model
    dev_set = ...  # used to validate the AutoCarver's buckets and optimize the model's parameters/hyperparameters
    test_set = ...  # used to evaluate the final model's performances

Setting up Features to Carve
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from AutoCarver import Features

    features = Features(
        quantitatives=['quantitative1', 'quantitative2', 'discrete1', 'discrete2_with_nan'],
        categoricals=['categorical1', 'categorical2', 'categorical3_with_nan'],
        ordinals={
            'ordinal1': ['low', 'medium', 'high'],
            'ordinal2_with_nan': ['low', 'medium', 'high'],
        },
    )

Qualitative features will automatically be converted to :class:`str` if necessary.
Ordinal features are declared alongside their expected ordering.

To wrap already-instantiated feature objects (e.g. :class:`CategoricalFeature`, :class:`OrdinalFeature`, :class:`QuantitativeFeature`) use :meth:`Features.from_list` instead. Collection-level state (``nan`` / ``default`` / ``ordinal_encoding`` / ``dropna``) can be propagated to every feature via a :class:`FeaturesConfig`.

Using AutoCarver
----------------

Fitting AutoCarver
^^^^^^^^^^^^^^^^^^

.. code-block:: python

    from AutoCarver import BinaryCarver

    # initiating AutoCarver
    binary_carver = BinaryCarver(
        features=features,
        min_freq=0.02,  # minimum frequency per modality
        max_n_mod=5,  # maximum number of modalities per carved feature (mandatory)
    )

    # fitting on the training sample; a dev sample can be specified to evaluate carving robustness
    train_set_discretized = binary_carver.fit_transform(
        train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target]
    )

.. note::

    Behavioral toggles (``copy``, ``ordinal_encoding``, ``dropna``, ``verbose``, ``n_jobs``, ``min_freq_alpha``) are now grouped in :class:`DiscretizerConfig`. Carvers default to ``DiscretizerConfig(dropna=True, ordinal_encoding=True)``.

    ``min_freq`` is gated by a Wilson score confidence interval at significance ``min_freq_alpha`` (default ``0.05``): raise it for a stricter representativity test, lower it for more lenient merging. See :ref:`MinFreqViability` for the formula.

    ``n_jobs > 1`` parallelises the per-feature combination search via ``multiprocessing.Pool``; this is useful only with hundreds to thousands of features (see :ref:`CarverParallelism`).

    To pick a different association metric, pass a pre-built :ref:`combination evaluator ` via the ``combination_evaluator`` keyword (e.g. :class:`CramervCombinations` for binary).

    The search uses :ref:`progressive top-K interval dynamic programming (DP) ` for both Kruskal-H and Pearson :math:`\chi^2`; it is statistically equivalent to the legacy enumerate-and-score path.

Applying AutoCarver
^^^^^^^^^^^^^^^^^^^

.. code-block:: python

    # transforming dev/test samples accordingly
    dev_set_discretized = binary_carver.transform(dev_set)
    test_set_discretized = binary_carver.transform(test_set)

Saving AutoCarver
^^^^^^^^^^^^^^^^^

All **Carvers** can safely be serialized as a ``.json`` file.

.. code-block:: python

    binary_carver.save('my_carver.json')

Loading AutoCarver
^^^^^^^^^^^^^^^^^^

**Carvers** can safely be loaded from a ``.json`` file.

.. code-block:: python

    from AutoCarver import BinaryCarver

    binary_carver = BinaryCarver.load('my_carver.json')

Feature Selection
-----------------

.. code-block:: python

    from AutoCarver.selectors import ClassificationSelector

    # select the 25 most target-associated features per data type
    classification_selector = ClassificationSelector(
        features=features,  # features to select from
        n_best_per_type=25,  # number of features to select per data type
        verbose=True,  # displays statistics
    )
    best_features = classification_selector.select(train_set_discretized, train_set_discretized[target])
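
The selector above ranks features by how strongly they are associated with the target. The measures that :class:`ClassificationSelector` actually applies are not detailed in this quick start, so the sketch below only illustrates one such measure, Cramér's V, which is named earlier through :class:`CramervCombinations`. ``cramers_v`` is a hypothetical helper written in pure Python for illustration; it is not part of the AutoCarver API, and the sample data is made up.

.. code-block:: python

    from collections import Counter
    from math import sqrt


    def cramers_v(feature_values, target_values):
        """Cramér's V between two categorical sequences (hypothetical helper, not AutoCarver API)."""
        if len(feature_values) != len(target_values):
            raise ValueError("feature_values and target_values must have the same length")

        n = len(feature_values)
        joint = Counter(zip(feature_values, target_values))  # observed cell counts
        row_totals = Counter(feature_values)  # counts per bucket
        col_totals = Counter(target_values)  # counts per target class

        # V is undefined when either variable has a single level; return 0.0 by convention here
        min_dim = min(len(row_totals), len(col_totals)) - 1
        if n == 0 or min_dim == 0:
            return 0.0

        # Pearson chi-squared statistic over the full contingency table
        chi2 = 0.0
        for row, row_total in row_totals.items():
            for col, col_total in col_totals.items():
                expected = row_total * col_total / n
                observed = joint.get((row, col), 0)
                chi2 += (observed - expected) ** 2 / expected

        return sqrt(chi2 / (n * min_dim))


    # made-up discretized feature whose two buckets separate the binary target perfectly
    buckets = ["low"] * 10 + ["high"] * 10
    labels = [0] * 10 + [1] * 10
    print(cramers_v(buckets, labels))  # 1.0

A value close to 1 means the buckets almost determine the target, while a value close to 0 means the buckets carry nearly no information about it.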