Features

The AutoCarver.features module defines the feature types used by AutoCarver: qualitative features (categorical, ordinal and nested) and quantitative features (datetimes included). It also provides the Features collection that groups them and the FeaturesConfig applied to each of them.

Features

class AutoCarver.features.Features(categoricals: list[str] | None = None, quantitatives: list[str] | None = None, ordinals: dict[str, list[str]] | None = None, datetimes: list[tuple[str, str]] | None = None, nested: dict[str, list[str]] | None = None, config: FeaturesConfig | None = None)

A set of typed features

Build a Features collection from column names.

Parameters:

categoricals (list[str], optional) – Categorical column names, by default None.
quantitatives (list[str], optional) – Quantitative column names, by default None.
ordinals (dict[str, list[str]], optional) – Ordinal column names mapped to their ordered value list, by default None.
datetimes (list[tuple[str, str]], optional) – Datetime features as (column name, reference_date) pairs, by default None. Values are discretized as the number of seconds elapsed since reference_date.
nested (dict[str, list[str]], optional) – Nested features as {output column: [parent columns, ordered from finest to coarsest]}, by default None. The output column is the finest level; parents are listed from nearest to farthest. Rare modalities of the output column are rolled up to their data-derived parent until they are frequent enough (see NestedDiscretizer).
config (FeaturesConfig, optional) – Collection-level config propagated to each feature, by default None.

Warning

At least one of categoricals, quantitatives, ordinals, datetimes or nested must be provided. To build a Features from already-instantiated feature objects, use Features.from_list() instead.
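The roll-up described for the nested parameter can be pictured with a small standard-library sketch. It is an illustration under assumptions, not the NestedDiscretizer implementation: the helper roll_up, the min_freq threshold, and the sample columns (city, region, country) are all hypothetical, and the real discretizer may group rare modalities differently.

```python
from collections import Counter


def roll_up(rows, levels, min_freq):
    """Return one label per row: the finest level whose value is frequent enough.

    rows:     list of dicts, one per observation
    levels:   column names ordered from finest to coarsest
    min_freq: minimum share of rows a value needs to be kept at its level
    """
    n = len(rows)
    # Frequency of every value, counted separately for each level.
    freqs = {level: Counter(row[level] for row in rows) for level in levels}
    labels = []
    for row in rows:
        for level in levels:
            value = row[level]
            if freqs[level][value] / n >= min_freq:
                break  # frequent enough at this level, stop climbing
        # When no level qualifies, the coarsest value is kept.
        labels.append(value)
    return labels


# Hypothetical declaration: nested={"city": ["region", "country"]}
nested = {"city": ["region", "country"]}
levels = ["city"] + nested["city"]

rows = [
    {"city": "Lyon", "region": "ARA", "country": "FR"},
    {"city": "Lyon", "region": "ARA", "country": "FR"},
    {"city": "Lyon", "region": "ARA", "country": "FR"},
    {"city": "Annecy", "region": "ARA", "country": "FR"},
    {"city": "Nice", "region": "PACA", "country": "FR"},
]
labels = roll_up(rows, levels, min_freq=0.4)
print(labels)  # ['Lyon', 'Lyon', 'Lyon', 'ARA', 'FR']
```

Here "Annecy" is too rare and is replaced by its region, while "Nice" is rare at both the city and the region level and climbs up to the country.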

property categoricals: list[CategoricalFeature]: Returns all categorical features

property datetimes: list[DatetimeFeature]: Returns all datetime features (also part of quantitatives)

classmethod from_list(features: Iterable[BaseFeature] | Features, config: FeaturesConfig | None = None) → Features

Build a Features from already-instantiated feature objects.

Parameters:

features (Iterable[BaseFeature] | Features) – Feature instances to wrap. Iterating an existing Features is supported.
config (FeaturesConfig, optional) – Collection-level config propagated to each feature, by default None.

property history: DataFrame: Combined history of all features (concatenated, with a feature column).

classmethod load(features_json: dict) → Features

Loads a set of Features from its serialized dictionary.

Parameters: features_json (dict) – Dictionary of serialized Features
Returns: Loaded Features.
Return type: Features

property names: list[str]: Returns names of all features

property ordinals: list[OrdinalFeature]: Returns all ordinal features

property qualitatives: list[OrdinalFeature | CategoricalFeature | NestedFeature]: Returns all qualitative features (categoricals, ordinals and nested)

property quantitatives: list[QuantitativeFeature]: Returns all quantitative features (datetimes included)

property summary: DataFrame: Summary of discretization process for all features

to_json(light_mode: bool = False) → dict

Serializes Features for JSON saving

Parameters: light_mode (bool, optional) – Whether to serialize in light mode (without statistics and history), by default False

property versions: list[str]: Returns versions of all features

Note

Use the default constructor when you only have column names; use Features.from_list() to wrap already-instantiated feature objects.

FeaturesConfig

Collection-level state propagated to every feature in a Features. Internal feature attributes (nan, default, ordinal_encoding, has_nan, has_default, dropna, is_fitted) are not part of the public BaseFeature constructor — set them via FeaturesConfig and pass the instance to Features or Features.from_list().

class AutoCarver.features.FeaturesConfig(nan: str | None = None, default: str | None = None, ordinal_encoding: bool = False, is_fitted: bool = False, has_nan: bool = False, has_default: bool = False, dropna: bool = False)

Collection-level config applied to each feature in a Features.

Internal feature state (nan/default/ordinal_encoding/…) is not part of the public BaseFeature constructor. Pass it via this dataclass to Features or Features.from_list(), and it is propagated to each constituent feature.

Qualitative features

class AutoCarver.features.CategoricalFeature(name: str, *, max_n_chars: int = 50)

Defines a categorical feature

property has_default: bool: Whether the feature has default values.

property history: DataFrame

Combination history as a DataFrame (empty when no history yet).

Stored internally as a list of dicts in _history for JSON serialization; the DataFrame is rebuilt on access. Append entries with historize().

property summary: list[dict]: Summary of feature’s discretization process.

class AutoCarver.features.OrdinalFeature(name: str, values: list[str])

Defines an ordinal feature

Parameters: values (list[str]) – Ordered list of all unique values for the feature

property has_default: bool: Whether the feature has default values.

property history: DataFrame

Combination history as a DataFrame (empty when no history yet).

Stored internally as a list of dicts in _history for JSON serialization; the DataFrame is rebuilt on access. Append entries with historize().

property summary: list[dict]: Summary of feature’s discretization process.

Quantitative features

class AutoCarver.features.QuantitativeFeature(name: str)

Defines a quantitative feature

property has_default: bool: Whether the feature has default values.

property history: DataFrame

Combination history as a DataFrame (empty when no history yet).

Stored internally as a list of dicts in _history for JSON serialization; the DataFrame is rebuilt on access. Append entries with historize().

property summary: list[dict]: Summary of feature’s discretization process.

Datetime features

A DatetimeFeature is a quantitative feature backed by a datetime column. It is discretized as the number of seconds elapsed since a user-provided reference_date (see DatetimeFeature.to_timedelta()), after which it behaves exactly like any other quantitative feature (quantile bucketization, carving, …).

reference_date may be either a fixed date literal or the name of another datetime column in X. The two are disambiguated at fit time: if reference_date matches a column of the fitted X, the elapsed seconds are computed row-wise against that column; otherwise it is parsed as a fixed date. A row whose reference column value is missing (NaT) yields NaN.
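The disambiguation rule above can be sketched with the standard library alone. This is a minimal illustration, not AutoCarver's code: the helper elapsed_seconds is hypothetical, X is modelled as a plain dict of columns, and None stands in for a missing datetime (NaT).

```python
import math
from datetime import datetime


def elapsed_seconds(X, column, reference_date):
    """Seconds elapsed between X[column] and its reference, row by row."""
    if reference_date in X:
        # reference_date names a column of X: subtract it row-wise.
        references = X[reference_date]
    else:
        # Otherwise it is parsed as a fixed date, shared by every row.
        fixed = datetime.fromisoformat(reference_date)
        references = [fixed] * len(X[column])
    return [
        # A missing value or a missing reference yields NaN.
        math.nan if value is None or ref is None else (value - ref).total_seconds()
        for value, ref in zip(X[column], references)
    ]


X = {
    "signup_date": [datetime(2020, 1, 2), datetime(2020, 1, 3), None],
    "churn_date": [datetime(2020, 1, 3), datetime(2020, 1, 3, 12), datetime(2020, 2, 1)],
}
since_fixed = elapsed_seconds(X, "signup_date", "2020-01-01")
since_column = elapsed_seconds(X, "churn_date", "signup_date")
print(since_fixed)   # [86400.0, 172800.0, nan]
print(since_column)  # [86400.0, 43200.0, nan]
```

The last row of since_column is NaN because its reference value in signup_date is missing, which matches the behavior described for NaT reference values.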

Datetimes can be declared from the Features constructor as (column name, reference_date) pairs:

from AutoCarver.features import Features

features = Features(
    quantitatives=["age"],
    datetimes=[
        ("signup_date", "2020-01-01"),   # seconds since a fixed date
        ("churn_date", "signup_date"),   # seconds since another column
    ],
)

They are tracked under Features.datetimes and are also part of Features.quantitatives, so the quantitative pipeline processes them transparently. The datetime-to-seconds conversion is performed by the TimedeltaDiscretizer.

class AutoCarver.features.DatetimeFeature(name: str, reference_date: str)

Defines a datetime feature.

A datetime feature is processed as a QuantitativeFeature after its values have been converted to a number of seconds elapsed since reference_date (see to_timedelta()). The conversion is applied by the TimedeltaDiscretizer before continuous discretization.

reference_date may be either a fixed date literal (e.g. "2020-01-01") or the name of another datetime column in X. The two are disambiguated at fit time: if reference_date matches a column of the fitted X, the conversion is computed row-wise against that column; otherwise it is parsed as a fixed date.

property history: DataFrame

Combination history as a DataFrame (empty when no history yet).

Stored internally as a list of dicts in _history for JSON serialization; the DataFrame is rebuilt on access. Append entries with historize().

property summary: list[dict]: Summary of feature’s discretization process.

to_timedelta(series: Series, reference: Series | None = None) → Series

Converts datetime values to a float number of seconds since the reference.

When reference is None the fixed reference_date literal is used; otherwise reference is a datetime Series subtracted row-wise (column reference). Non-datetime entries (numpy.nan, the nan placeholder, unparseable values) are coerced to numpy.nan so the result is a plain float Series.
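The coercion of non-datetime entries can be illustrated as follows. This is a hedged sketch of the idea rather than the method's actual body: coerce_to_datetime and to_seconds are hypothetical helpers, plain lists replace the Series, and "__NAN__" is only an assumed example of a nan placeholder.

```python
import math
from datetime import datetime


def coerce_to_datetime(value):
    """Return a datetime, or None when the entry is not a usable datetime."""
    if isinstance(value, datetime):
        return value
    if isinstance(value, str):
        try:
            return datetime.fromisoformat(value)
        except ValueError:
            return None  # placeholder strings and unparseable text
    return None  # float NaN, None, or any other type


def to_seconds(values, reference):
    """Float seconds since a fixed reference; unusable entries become NaN."""
    seconds = []
    for value in values:
        parsed = coerce_to_datetime(value)
        if parsed is None:
            seconds.append(math.nan)
        else:
            seconds.append((parsed - reference).total_seconds())
    return seconds


values = [datetime(2020, 1, 1, 0, 1), "2020-01-02", math.nan, "__NAN__", "not a date"]
seconds = to_seconds(values, reference=datetime(2020, 1, 1))
print(seconds)  # [60.0, 86400.0, nan, nan, nan]
```

Every element of the result is a float, so downstream quantitative steps do not need to handle mixed types.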