Setting things up
Installation
[1]:
%pip install AutoCarver[jupyter]
Titanic Data
In this example notebook, we will use the Titanic dataset.
The Titanic dataset is a well-known and frequently used dataset in the field of machine learning and data science. It provides information about the passengers on board the Titanic, the famous ship that sank on its maiden voyage in 1912. The dataset is often used for predictive modeling, classification, and regression tasks.
The dataset includes various features such as passengers’ names, ages, genders, ticket classes, cabin information, and whether they survived or not. The primary goal when working with the Titanic dataset is often to build predictive models that can infer whether a passenger survived or perished based on their individual characteristics (binary classification).
[3]:
import pandas as pd
# URL to the Titanic dataset (hosted by Stanford CS109)
titanic_url = "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"
# Use pandas to read the CSV file directly from the URL
titanic_data = pd.read_csv(titanic_url)
# Display the first few rows of the dataset
titanic_data.head()
[3]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | Mr. Owen Harris Braund | male | 22.0 | 1 | 0 | 7.2500 |
| 1 | 1 | 1 | Mrs. John Bradley (Florence Briggs Thayer) Cum... | female | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | Miss. Laina Heikkinen | female | 26.0 | 0 | 0 | 7.9250 |
| 3 | 1 | 1 | Mrs. Jacques Heath (Lily May Peel) Futrelle | female | 35.0 | 1 | 0 | 53.1000 |
| 4 | 0 | 3 | Mr. William Henry Allen | male | 35.0 | 0 | 0 | 8.0500 |
Target type and Carver selection
[4]:
target = "Survived"
titanic_data[target].value_counts(dropna=False)
[4]:
Survived
0 545
1 342
Name: count, dtype: int64
The target "Survived" is a binary target of type int64 used in a classification task. Hence we will use AutoCarver.BinaryCarver and AutoCarver.selectors.ClassificationSelector in the following code blocks.
Data Sampling
[5]:
from sklearn.model_selection import train_test_split
# stratified sampling by target
train_set, dev_set = train_test_split(titanic_data, test_size=0.33, random_state=42, stratify=titanic_data[target])
[6]:
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[6]:
(0.38552188552188554, 0.3856655290102389)
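The two target rates above match closely because the split is stratified. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation: the helper name `stratified_split` and the toy label list are assumptions made for illustration, and only the class counts (545 and 342) come from the notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed=42):
    """Split row indices so that each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, dev_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_dev = round(len(indices) * test_size)  # dev share taken inside each class
        dev_idx.extend(indices[:n_dev])
        train_idx.extend(indices[n_dev:])
    return sorted(train_idx), sorted(dev_idx)


def rate(indices, labels):
    """Share of positive labels among the given rows."""
    return sum(labels[i] for i in indices) / len(indices)


# toy target with the same class counts as the notebook: 545 zeros, 342 ones
labels = [0] * 545 + [1] * 342
train_idx, dev_idx = stratified_split(labels, test_size=0.33)
print(len(train_idx), len(dev_idx))
print(rate(train_idx, labels), rate(dev_idx, labels))
```

Because the dev share is taken class by class, both samples end up with nearly the same target rate whatever the random seed.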
Picking columns to carve
[7]:
train_set.head()
[7]:
| | Survived | Pclass | Name | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|---|---|
| 617 | 0 | 3 | Mr. Antoni Yasbeck | male | 27.0 | 1 | 0 | 14.4542 |
| 489 | 0 | 1 | Mr. Harry Markland Molson | male | 55.0 | 0 | 0 | 30.5000 |
| 871 | 1 | 3 | Miss. Adele Kiamie Najib | female | 15.0 | 0 | 0 | 7.2250 |
| 654 | 0 | 3 | Mrs. John (Catherine) Bourke | female | 32.0 | 1 | 1 | 15.5000 |
| 653 | 0 | 3 | Mr. Alexander Radeff | male | 27.0 | 0 | 0 | 7.8958 |
[8]:
# column data types
train_set.dtypes
[8]:
Survived int64
Pclass int64
Name object
Sex object
Age float64
Siblings/Spouses Aboard int64
Parents/Children Aboard int64
Fare float64
dtype: object
[9]:
# values taken by Parents/Children Aboard
train_set["Parents/Children Aboard"].value_counts()
[9]:
Parents/Children Aboard
0 438
1 87
2 60
3 3
5 3
4 2
6 1
Name: count, dtype: int64
[10]:
# values taken by Pclass
train_set["Pclass"].value_counts()
[10]:
Pclass
3 326
1 142
2 126
Name: count, dtype: int64
The feature "Pclass" is of type "int64", but it can be considered a qualitative ordinal feature rather than a quantitative discrete feature (ranking of named passenger classes). Thus we will add it to the list of ordinal_features and set the ordering of its values in values_orders (string values).
"Sex" is the only qualitative categorical feature; it is added to the list of qualitative_features.
"Age" and "Fare" are quantitative continuous features, whilst "Siblings/Spouses Aboard" and "Parents/Children Aboard" can be considered quantitative discrete features. Those four features will be added to the list of quantitative_features.
[11]:
# lists of features per data type
quantitative_features = ["Age", "Fare", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
qualitative_features = ["Sex"]
ordinal_features = ["Pclass"]
# user-specified ordering for ordinal features
values_orders = {
    "Pclass": ["1", "2", "3"]
}
Using AutoCarver
AutoCarver settings
Representativeness of modalities
The attribute min_freq allows one to choose the minimum frequency per basic modality. It is used by Discretizers:
For quantitative features, it defines the number of quantiles to initially discretize the features with.
For qualitative features, it defines the threshold under which a modality is grouped to either a default value or its closest modality.
[12]:
min_freq = 0.05
Tip: should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
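To make the role of min_freq concrete, here is a small sketch of the qualitative case: modalities rarer than the threshold are replaced by a default value. This is an illustration, not AutoCarver's code. The helper `group_rare_modalities` and the toy values are hypothetical, and the relation "number of quantiles is about 1 / min_freq" is an assumption about how the threshold translates into quantiles.

```python
from collections import Counter


def group_rare_modalities(values, min_freq, default="__OTHER__"):
    """Replace every modality observed in less than min_freq of rows by a default value."""
    counts = Counter(values)
    n_rows = len(values)
    rare = {modality for modality, count in counts.items() if count / n_rows < min_freq}
    return [default if value in rare else value for value in values]


min_freq = 0.05

# assumption: quantile discretization starts from roughly 1 / min_freq quantiles
n_quantiles = round(1 / min_freq)

# toy qualitative feature: "C" covers 2 % of rows, which is below min_freq
values = ["A"] * 60 + ["B"] * 38 + ["C"] * 2
grouped = group_rare_modalities(values, min_freq)
print(n_quantiles, Counter(grouped))
```

A lower min_freq keeps more of the rare modalities (and more quantiles) as candidates, which is why the tip above associates low values with slower and less robust carving.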
Desired number of modalities
The attribute max_n_mod allows one to choose the maximum number of modalities per carved feature. It is used by Carvers as the upper limit of the number of modalities per consecutive combination of modalities.
[13]:
max_n_mod = 5
Tip: should be set between 3 (faster, more robust) and 7 (slower, more precise, less robust)
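A "consecutive combination" can be pictured as cutting an ordered list of modalities into adjacent groups. The sketch below enumerates such cuts for 2 up to max_n_mod groups; the function names are hypothetical and this is not taken from the library. Under that assumption, the counts coincide with the totals shown in the progress bars of the fitting output further down (3 for Pclass with 3 raw modalities, 2516 for Fare with 17, 3213 for Age with 18), which suggests, without proving it, that candidates are enumerated this way.

```python
from itertools import combinations
from math import comb


def consecutive_groupings(modalities, max_n_mod):
    """Yield every split of an ordered list into 2..max_n_mod groups of adjacent items."""
    n = len(modalities)
    for n_groups in range(2, min(max_n_mod, n) + 1):
        # choosing n_groups - 1 cut points among the n - 1 gaps defines one grouping
        for cuts in combinations(range(1, n), n_groups - 1):
            bounds = (0, *cuts, n)
            yield [modalities[start:end] for start, end in zip(bounds, bounds[1:])]


def n_groupings(n_modalities, max_n_mod):
    """Closed-form count of the groupings yielded by consecutive_groupings."""
    top = min(max_n_mod, n_modalities)
    return sum(comb(n_modalities - 1, k - 1) for k in range(2, top + 1))


pclass_groupings = list(consecutive_groupings(["1", "2", "3"], max_n_mod=5))
print(pclass_groupings)
print(n_groupings(17, 5), n_groupings(18, 5))
```

The count grows quickly with max_n_mod, which is the cost side of the tip above.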
Association metric
The attribute sort_by allows one to choose the association metric used to sort combinations. Combinations of grouped modalities are ranked according to the specified metric and the best ranked viable combination is returned by Carvers.
[14]:
# For BinaryCarver, to be chosen amongst ["tschuprowt", "cramerv"]
sort_by = "tschuprowt" # "cramerv"
Tip: use "tschuprowt" for more robust results or fewer output modalities; use "cramerv" for more output modalities.
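The difference between the two metrics can be seen from their textbook definitions, both derived from the chi-squared statistic of the modality-by-target contingency table. The sketch below implements those standard formulas in plain Python; it is not the library's code, the function names are hypothetical, and the counts are illustrative (an assumption).

```python
from math import sqrt


def chi2_statistic(table):
    """Pearson chi-squared of a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (observed - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for observed, c in zip(row, col_totals)
    )


def cramers_v(table):
    """Cramér's V: chi2 normalized by n * min(rows - 1, columns - 1)."""
    n = sum(map(sum, table))
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2_statistic(table) / (n * k))


def tschuprows_t(table):
    """Tschuprow's T: chi2 normalized by n * sqrt((rows - 1) * (columns - 1))."""
    n = sum(map(sum, table))
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return sqrt(chi2_statistic(table) / (n * sqrt(dof)))


# illustrative counts: one row per modality, columns are (target == 0, target == 1)
three_groups = [[179, 52], [172, 130], [14, 47]]
print(cramers_v(three_groups), tschuprows_t(three_groups))
```

With a binary target, the denominator of Cramér's V does not depend on the number of modalities, whereas the denominator of Tschuprow's T grows with it. This is consistent with the tip above: "tschuprowt" tends to favor combinations with fewer modalities.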
Grouping NaNs
The attribute dropna allows one to choose whether or not numpy.nan should be grouped with another modality. If set to True, Carvers will first find the most suitable combination of non-NaN values, and then test out all possible combinations with numpy.nan.
[15]:
dropna = False # anyway, there are no numpy.nan in this dataset
Optional attributes
Minimal frequency per carved modality
The attribute min_freq_mod allows one to choose the minimum frequency per output modality. It is used by Carvers in viability tests to put aside combinations that are not frequent enough in train or dev sets. By default, it is set to min_freq/2.
[16]:
min_freq_mod = None # None falls back to the default min_freq/2 (0.025 here); setting 0.05 would require at least 5 % of observations per output modality in train and dev sets
Type of output carved features
The attribute output_dtype allows one to choose the output type:
- Use "float" for integer output (default)
- Use "str" for string output
[17]:
output_dtype = "float" # "str"
Fitting AutoCarver
First, all qualitative features are discretized:
- Using StringDiscretizer to convert them to str if not already the case
- For qualitative ordinal features: using OrdinalDiscretizer for under-represented values (less frequent than min_freq=0.05) to be grouped with their closest modality
- For qualitative categorical features: using CategoricalDiscretizer for under-represented values (less frequent than min_freq=0.05) to be grouped with a default value (str_default="__OTHER__")
Second, all quantitative features are discretized:
- Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.05)
- Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2=0.025) to be grouped with their closest modality
Third, all features are carved following this recipe, for all classes of train_set[target] (except one):
1. The raw distribution is printed out on provided train_set and dev_set. It’s the output of the discretization step
2. Grouping modalities: all consecutive combinations of modalities are applied to train_set
3. Computing associations: the association metric (sort_by="tschuprowt") is computed with the provided target train_set[target]. Combinations are sorted in descending order by association value
4. Testing robustness: finds the first combination that checks the following:
   - Representativeness of modalities on train_set and dev_set (all should be more frequent than min_freq_mod)
   - Distinct target rates per consecutive modalities on train_set and dev_set
   - No inversion of target rates between train_set and dev_set (same ordering of modalities by target rate)
5. (Optional) If requested via dropna=True, and if any, all combinations of modalities with numpy.nan are applied to train_set and steps 3. and 4. are run
6. The carved distribution is printed out on provided train_set and dev_set. It’s the output of the carving step
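The robustness test of the recipe above can be sketched as a single function. This is a hedged illustration of the three checks as they are described here, not AutoCarver's implementation: `is_viable` is a hypothetical helper, and "distinct target rates" is read as strict inequality between neighbouring modalities (an assumption). The figures are the Siblings/Spouses Aboard distributions printed in the fitting output below.

```python
def is_viable(train, dev, min_freq_mod):
    """Check one combination given per-modality (target_rate, frequency) pairs."""
    # 1. every modality is frequent enough on both samples
    if any(frequency < min_freq_mod for _, frequency in train + dev):
        return False

    train_rates = [target_rate for target_rate, _ in train]
    dev_rates = [target_rate for target_rate, _ in dev]

    # 2. neighbouring modalities have distinct target rates
    for rates in (train_rates, dev_rates):
        if any(left == right for left, right in zip(rates, rates[1:])):
            return False

    # 3. modalities are ordered the same way by target rate on both samples
    def ranking(rates):
        return sorted(range(len(rates)), key=rates.__getitem__)

    return ranking(train_rates) == ranking(dev_rates)


min_freq_mod = 0.05 / 2  # default: min_freq / 2

# raw distribution of Siblings/Spouses Aboard (train, then dev)
raw_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.4138, 0.0488), (0.0870, 0.0387)]
raw_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.3333, 0.0512), (0.1429, 0.0239)]

# carved distribution of Siblings/Spouses Aboard (train, then dev)
carved_train = [(0.3614, 0.6801), (0.5000, 0.2323), (0.2692, 0.0875)]
carved_dev = [(0.3200, 0.6826), (0.6056, 0.2423), (0.2727, 0.0751)]

print(is_viable(raw_train, raw_dev, min_freq_mod))        # False
print(is_viable(carved_train, carved_dev, min_freq_mod))  # True
```

Under these assumptions the raw four-modality split is rejected because its last modality covers only 2.39 % of dev_set, below the 2.5 % threshold, while the carved three-modality split passes all three checks.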
[18]:
from AutoCarver import BinaryCarver
# initiating AutoCarver
auto_carver = BinaryCarver(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    min_freq_mod=min_freq_mod,
    max_n_mod=max_n_mod,
    dropna=dropna,
    sort_by=sort_by,
    output_dtype=output_dtype,
    verbose=True, # showing statistics
    copy=True, # whether or not to return a copy of the input dataset
)
# fitting on training sample, a dev sample can be specified to evaluate carving robustness
train_set_processed = auto_carver.fit_transform(train_set, train_set[target], X_dev=dev_set, y_dev=dev_set[target])
------
[Discretizer] Fit Qualitative Features
---
- [StringDiscretizer] Fit ['Pclass']
- [OrdinalDiscretizer] Fit ['Pclass']
- [CategoricalDiscretizer] Fit ['Sex']
------
------
[Discretizer] Fit Quantitative Features
---
- [ContinuousDiscretizer] Fit ['Parents/Children Aboard', 'Fare', 'Siblings/Spouses Aboard', 'Age']
- [OrdinalDiscretizer] Fit ['Age', 'Fare', 'Siblings/Spouses Aboard', 'Parents/Children Aboard']
------
------
[AutoCarver] Fit Sex (1/6)
---
- [AutoCarver] Raw distribution
c:\Users\defra\Desktop\git\PROJECTS\AutoCarver\docs\source\examples\BinaryClassification\../../../../../AutoCarver\AutoCarver\discretizers\discretizers.py:325: UserWarning: - [QualitativeDiscretizer] Non-string features: ['Pclass']. Trying to convert them using type_discretizers.StringDiscretizer, otherwise convert them manually. Unexpected data types: [<class 'int'>].
warn(
| | target_rate | frequency |
|---|---|---|
| male | 0.1878 | 0.6364 |
| female | 0.7315 | 0.3636 |
| target_rate | frequency |
|---|---|
| 0.1949 | 0.6655 |
| 0.7653 | 0.3345 |
Grouping modalities : 100%|██████████| 1/1 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 1/1 [00:00<00:00, 663.24it/s]
Testing robustness : 0%| | 0/1 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| male | 0.1878 | 0.6364 |
| female | 0.7315 | 0.3636 |
| target_rate | frequency |
|---|---|
| 0.1949 | 0.6655 |
| 0.7653 | 0.3345 |
------
------
[AutoCarver] Fit Siblings/Spouses Aboard (2/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.000e+00 | 0.3614 | 0.6801 |
| 0.000e+00 < x <= 1.000e+00 | 0.5000 | 0.2323 |
| 1.000e+00 < x <= 3.000e+00 | 0.4138 | 0.0488 |
| 3.000e+00 < x | 0.0870 | 0.0387 |
| target_rate | frequency |
|---|---|
| 0.3200 | 0.6826 |
| 0.6056 | 0.2423 |
| 0.3333 | 0.0512 |
| 0.1429 | 0.0239 |
Grouping modalities : 100%|██████████| 7/7 [00:00<00:00, 7025.63it/s]
Computing associations: 100%|██████████| 7/7 [00:00<00:00, 3453.32it/s]
Testing robustness : 29%|██▊ | 2/7 [00:00<00:00, 155.62it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.000e+00 | 0.3614 | 0.6801 |
| 0.000e+00 < x <= 1.000e+00 | 0.5000 | 0.2323 |
| 1.000e+00 < x | 0.2692 | 0.0875 |
| target_rate | frequency |
|---|---|
| 0.3200 | 0.6826 |
| 0.6056 | 0.2423 |
| 0.2727 | 0.0751 |
------
------
[AutoCarver] Fit Fare (3/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 7.125e+00 | 0.0333 | 0.0505 |
| 7.125e+00 < x <= 7.250e+00 | 0.2258 | 0.0522 |
| 7.250e+00 < x <= 7.796e+00 | 0.3462 | 0.0875 |
| 7.796e+00 < x <= 7.896e+00 | 0.2000 | 0.0673 |
| 7.896e+00 < x <= 8.050e+00 | 0.2340 | 0.0791 |
| 8.050e+00 < x <= 1.046e+01 | 0.2258 | 0.0522 |
| 1.046e+01 < x <= 1.400e+01 | 0.4833 | 0.1010 |
| 1.400e+01 < x <= 1.610e+01 | 0.2812 | 0.0539 |
| 1.610e+01 < x <= 2.300e+01 | 0.5333 | 0.0505 |
| 2.300e+01 < x <= 2.600e+01 | 0.3333 | 0.0606 |
| 2.600e+01 < x <= 2.772e+01 | 0.5417 | 0.0404 |
| 2.772e+01 < x <= 3.127e+01 | 0.3125 | 0.0539 |
| 3.127e+01 < x <= 4.012e+01 | 0.3929 | 0.0471 |
| 4.012e+01 < x <= 5.590e+01 | 0.4333 | 0.0505 |
| 5.590e+01 < x <= 7.673e+01 | 0.5667 | 0.0505 |
| 7.673e+01 < x <= 1.109e+02 | 0.7419 | 0.0522 |
| 1.109e+02 < x | 0.8000 | 0.0505 |
| target_rate | frequency |
|---|---|
| 0.0833 | 0.0410 |
| 0.2000 | 0.0341 |
| 0.2222 | 0.0922 |
| 0.0556 | 0.0614 |
| 0.2000 | 0.0512 |
| 0.0870 | 0.0785 |
| 0.3947 | 0.1297 |
| 0.4167 | 0.0819 |
| 0.5263 | 0.0648 |
| 0.5294 | 0.0580 |
| 0.6667 | 0.0307 |
| 0.4667 | 0.0512 |
| 0.4167 | 0.0410 |
| 0.6667 | 0.0307 |
| 0.5714 | 0.0478 |
| 0.7500 | 0.0546 |
| 0.6667 | 0.0512 |
Grouping modalities : 100%|██████████| 2516/2516 [00:00<00:00, 7771.79it/s]
Computing associations: 100%|██████████| 2516/2516 [00:00<00:00, 4375.16it/s]
Testing robustness : 0%| | 0/2516 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 1.046e+01 | 0.2251 | 0.3889 |
| 1.046e+01 < x <= 7.673e+01 | 0.4305 | 0.5084 |
| 7.673e+01 < x | 0.7705 | 0.1027 |
| target_rate | frequency |
|---|---|
| 0.1429 | 0.3584 |
| 0.4841 | 0.5358 |
| 0.7097 | 0.1058 |
------
------
[AutoCarver] Fit Age (4/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 4.000e+00 | 0.7333 | 0.0505 |
| 4.000e+00 < x <= 1.400e+01 | 0.3103 | 0.0488 |
| 1.400e+01 < x <= 1.700e+01 | 0.4286 | 0.0471 |
| 1.700e+01 < x <= 1.900e+01 | 0.3636 | 0.0741 |
| 1.900e+01 < x <= 2.000e+01 | 0.1176 | 0.0286 |
| 2.000e+01 < x <= 2.200e+01 | 0.3273 | 0.0926 |
| 2.200e+01 < x <= 2.400e+01 | 0.5000 | 0.0572 |
| 2.400e+01 < x <= 2.700e+01 | 0.3556 | 0.0758 |
| 2.700e+01 < x <= 2.800e+01 | 0.2632 | 0.0320 |
| 2.800e+01 < x <= 3.100e+01 | 0.3571 | 0.0943 |
| 3.100e+01 < x <= 3.300e+01 | 0.4483 | 0.0488 |
| 3.300e+01 < x <= 3.600e+01 | 0.4146 | 0.0690 |
| 3.600e+01 < x <= 3.800e+01 | 0.4118 | 0.0286 |
| 3.800e+01 < x <= 4.100e+01 | 0.3871 | 0.0522 |
| 4.100e+01 < x <= 4.500e+01 | 0.4167 | 0.0606 |
| 4.500e+01 < x <= 4.900e+01 | 0.5000 | 0.0438 |
| 4.900e+01 < x <= 5.600e+01 | 0.3448 | 0.0488 |
| 5.600e+01 < x | 0.1786 | 0.0471 |
| target_rate | frequency |
|---|---|
| 0.5385 | 0.0444 |
| 0.6250 | 0.0546 |
| 0.3571 | 0.0478 |
| 0.3200 | 0.0853 |
| 0.3333 | 0.0205 |
| 0.1579 | 0.0648 |
| 0.3077 | 0.0887 |
| 0.4074 | 0.0922 |
| 0.2778 | 0.0614 |
| 0.4400 | 0.0853 |
| 0.5455 | 0.0375 |
| 0.6190 | 0.0717 |
| 0.2500 | 0.0273 |
| 0.1875 | 0.0546 |
| 0.2727 | 0.0375 |
| 0.3333 | 0.0410 |
| 0.5833 | 0.0410 |
| 0.3846 | 0.0444 |
Grouping modalities : 100%|██████████| 3213/3213 [00:00<00:00, 7916.00it/s]
Computing associations: 100%|██████████| 3213/3213 [00:00<00:00, 4466.30it/s]
Testing robustness : 0%| | 0/3213 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 4.000e+00 | 0.7333 | 0.0505 |
| 4.000e+00 < x | 0.3670 | 0.9495 |
| target_rate | frequency |
|---|---|
| 0.5385 | 0.0444 |
| 0.3786 | 0.9556 |
------
------
[AutoCarver] Fit Pclass (5/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| 1, 1 | 0.6197 | 0.2391 |
| 2, 2 | 0.4683 | 0.2121 |
| 3, 3 | 0.2515 | 0.5488 |
| target_rate | frequency |
|---|---|
| 0.6486 | 0.2526 |
| 0.4828 | 0.1980 |
| 0.2298 | 0.5495 |
Grouping modalities : 100%|██████████| 3/3 [00:00<?, ?it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3000.22it/s]
Testing robustness : 0%| | 0/3 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| 1 to 2 | 0.5485 | 0.4512 |
| 3, 3 | 0.2515 | 0.5488 |
| target_rate | frequency |
|---|---|
| 0.5758 | 0.4505 |
| 0.2298 | 0.5495 |
------
------
[AutoCarver] Fit Parents/Children Aboard (6/6)
---
- [AutoCarver] Raw distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.000e+00 | 0.3447 | 0.7374 |
| 0.000e+00 < x <= 1.000e+00 | 0.5057 | 0.1465 |
| 1.000e+00 < x | 0.4928 | 0.1162 |
| target_rate | frequency |
|---|---|
| 0.3475 | 0.8055 |
| 0.6774 | 0.1058 |
| 0.3846 | 0.0887 |
Grouping modalities : 100%|██████████| 3/3 [00:00<00:00, 2901.96it/s]
Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3006.67it/s]
Testing robustness : 0%| | 0/3 [00:00<?, ?it/s]
- [AutoCarver] Carved distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.000e+00 | 0.3447 | 0.7374 |
| 0.000e+00 < x | 0.5000 | 0.2626 |
| target_rate | frequency |
|---|---|
| 0.3475 | 0.8055 |
| 0.5439 | 0.1945 |
------
AutoCarver analysis
Carving Summary
[19]:
auto_carver.summary()
[19]:
| feature | dtype | label | content |
|---|---|---|---|
| Age | float | 0 | [x <= 4.000e+00] |
| | float | 1 | [4.000e+00 < x] |
| Fare | float | 0 | [x <= 1.046e+01] |
| | float | 1 | [1.046e+01 < x <= 7.673e+01] |
| | float | 2 | [7.673e+01 < x] |
| Parents/Children Aboard | float | 0 | [x <= 0.000e+00] |
| | float | 1 | [0.000e+00 < x] |
| Siblings/Spouses Aboard | float | 0 | [x <= 0.000e+00] |
| | float | 1 | [0.000e+00 < x <= 1.000e+00] |
| | float | 2 | [1.000e+00 < x] |
| Pclass | str | 0 | [1, 2] |
| | str | 1 | [3] |
| Sex | str | 0 | [male] |
| | str | 1 | [female] |
- As requested with output_dtype="float", output labels are integers of ranks of modalities
- For quantitative feature Age, the selected combination of modalities groups ages as follows:
  - modality 0: lower or equal to 4 years old (content==["x <= 4.000e+00"])
  - modality 1: ages higher than 4 years old (content==["4.000e+00 < x"])
- For qualitative categorical feature Sex, the selected combination of modalities has left modalities content=["male"] in modality 0 and content=["female"] in modality 1 (no combination possible)
- For qualitative ordinal feature Pclass, the selected combination of modalities groups socio-economic status as follows:
  - modality 0: upper and middle classes (content==[1, 2])
  - modality 1: lower class (content==[3])
  - The user-provided ordering of modalities has been preserved.
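To read the summary for quantitative features, it can help to see how a raw value maps to its label. The sketch below does this with the standard library's bisect module; `carved_label` is a hypothetical helper, not part of AutoCarver, and the Fare bounds are approximate because the summary prints them with four significant digits.

```python
from bisect import bisect_left


def carved_label(value, upper_bounds):
    """Rank of the interval containing value; intervals are closed on the right."""
    return bisect_left(upper_bounds, value)


age_bounds = [4.0]            # from content "x <= 4.000e+00"
fare_bounds = [10.46, 76.73]  # from the Fare rows of the summary (rounded values)

age_labels = [carved_label(age, age_bounds) for age in (0.5, 4.0, 22.0)]
fare_labels = [carved_label(fare, fare_bounds) for fare in (7.25, 53.1, 512.0)]
print(age_labels)   # [0, 0, 1]
print(fare_labels)  # [0, 1, 2]
```

Using bisect_left keeps a value equal to a bound in the lower interval, which matches the "x <= bound" convention of the content column.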
Detailed overview of tested combinations
[20]:
auto_carver.history(feature="Pclass")
[20]:
| | combination | tschuprowt | viability | viability_message | grouping_nan |
|---|---|---|---|---|---|
| 0 | [[1, 1], [2, 2], [3, 3]] | 0.269965 | None | [Raw X distribution] | False |
| 1 | [[1, 1, 2, 2], [3, 3]] | 0.300144 | True | [Combination robust between X and X_dev] | False |
| 2 | [[1, 1], [2, 2], [3, 3]] | 0.269965 | None | [Not checked] | False |
| 3 | [[1, 1], [2, 2, 3, 3]] | 0.265643 | None | [Not checked] | False |
- The most associated combination (the first tested out, where viability_message!=["Raw X distribution"]) groups Pclass==1 with Pclass==2 and leaves Pclass==3 as its own modality
- For feature Pclass, the 1st combination passes the tests: viability_message==["Combination robust between X and X_dev"]
  - Tschuprow’s T with Survived is 0.300144 for this combination
  - Following combinations (less associated with the target) were not tested: viability_message==["Not checked"]
- For all combinations, grouping_nan==False means that it is not a combination in which NaNs are being grouped with other modalities (as requested with dropna=False)
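The "Not checked" rows follow from an early-stopping search: candidates are tested by decreasing association and the search ends at the first viable one. Below is a minimal sketch of that logic using the association values of the history table above. The helper `first_viable` is hypothetical, the combinations are written in a simplified form, and the robustness test is a stand-in that accepts whatever is tested first (an assumption).

```python
def first_viable(candidates, is_viable):
    """Test candidates by decreasing association and stop at the first viable one."""
    ranked = sorted(candidates, key=lambda c: c["tschuprowt"], reverse=True)
    for candidate in ranked:
        candidate["viability"] = is_viable(candidate)
        if candidate["viability"]:
            return candidate
    return None


# association values copied from the history table; combinations simplified
candidates = [
    {"combination": [["1"], ["2"], ["3"]], "tschuprowt": 0.269965},
    {"combination": [["1"], ["2", "3"]], "tschuprowt": 0.265643},
    {"combination": [["1", "2"], ["3"]], "tschuprowt": 0.300144},
]

# stand-in robustness test (assumption): the first tested combination passes
best = first_viable(candidates, is_viable=lambda candidate: True)
not_checked = [c["combination"] for c in candidates if "viability" not in c]
print(best["combination"], len(not_checked))
```

Stopping early means the less associated combinations never go through the robustness test, which is what the "Not checked" message reports.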
Saving and Loading AutoCarver
Saving
All Carvers can safely be stored as a .json file.
[21]:
import json
# storing as json file
with open('binary_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)
Loading
Carvers can safely be loaded from a .json file.
[22]:
import json
from AutoCarver import load_carver
# loading json file
with open('binary_carver.json', 'r') as my_carver_json:
    auto_carver = load_carver(json.load(my_carver_json))
Applying AutoCarver
[23]:
dev_set_processed = auto_carver.transform(dev_set)
[24]:
dev_set_processed[auto_carver.features].apply(lambda u: u.value_counts(dropna=False, normalize=True))
[24]:
| | Sex | Siblings/Spouses Aboard | Fare | Age | Pclass | Parents/Children Aboard |
|---|---|---|---|---|---|---|
| 0.0 | 0.665529 | 0.682594 | 0.358362 | 0.044369 | 0.450512 | 0.805461 |
| 1.0 | 0.334471 | 0.242321 | 0.535836 | 0.955631 | 0.549488 | 0.194539 |
| 2.0 | NaN | 0.075085 | 0.105802 | NaN | NaN | NaN |
Feature Selection
Selectors settings
Features to select from
Here all features have been carved using BinaryCarver, hence all features are qualitative.
[25]:
features = qualitative_features + quantitative_features + ordinal_features
Number of features to select
The attribute n_best allows one to choose the number of features to be selected per data type (quantitative and qualitative).
[26]:
n_best = 6 # here the number of features is low, ClassificationSelector will only be used to compute useful statistics
Using Selectors
[27]:
from AutoCarver.selectors import ClassificationSelector
# select the most target associated qualitative features
feature_selector = ClassificationSelector(
    qualitative_features=features,
    n_best=n_best,
    verbose=True, # displays statistics
)
best_features = feature_selector.select(train_set_processed, train_set_processed[target])
------
[Selector] Selecting from qualitative features: ['Sex', 'Siblings/Spouses Aboard', 'Fare', 'Age', 'Pclass', 'Parents/Children Aboard']
---
- [Selector] Association between X and y
| | dtype | pct_nan | pct_mode | mode | chi2_statistic | tschuprowt_measure |
|---|---|---|---|---|---|---|
| Sex | int64 | 0.0000 | 0.6364 | 0 | 169.2047 | 0.5337 |
| Pclass | int64 | 0.0000 | 0.5488 | 1 | 53.5114 | 0.3001 |
| Fare | float64 | 0.0000 | 0.5084 | 1.0000 | 65.8288 | 0.2799 |
| Age | float64 | 0.0000 | 0.9495 | 1.0000 | 14.6254 | 0.1569 |
| Parents/Children Aboard | int64 | 0.0000 | 0.7374 | 0 | 11.0576 | 0.1364 |
| Siblings/Spouses Aboard | int64 | 0.0000 | 0.6801 | 0 | 11.5963 | 0.1175 |
- [Selector] Association between X and y, filtered for inter-feature assocation
| | dtype | pct_nan | pct_mode | mode | chi2_statistic | tschuprowt_measure |
|---|---|---|---|---|---|---|
| Sex | int64 | 0.0000 | 0.6364 | 0 | 169.2047 | 0.5337 |
| Pclass | int64 | 0.0000 | 0.5488 | 1 | 53.5114 | 0.3001 |
| Fare | float64 | 0.0000 | 0.5084 | 1.0000 | 65.8288 | 0.2799 |
| Age | float64 | 0.0000 | 0.9495 | 1.0000 | 14.6254 | 0.1569 |
| Parents/Children Aboard | int64 | 0.0000 | 0.7374 | 0 | 11.0576 | 0.1364 |
| Siblings/Spouses Aboard | int64 | 0.0000 | 0.6801 | 0 | 11.5963 | 0.1175 |
- [Selector] Selected qualitative features: ['Sex', 'Pclass', 'Fare', 'Age', 'Parents/Children Aboard', 'Siblings/Spouses Aboard']
------
- Feature Sex is the most associated with the target Survived:
  - Tschuprow’s T value is tschuprowt_measure=0.5337
  - It has 0 % of NaNs (pct_nan=0.0)
  - Its mode, 0, represents 64 % of observed data (pct_mode=0.6364)
- Here, no features were filtered out for their inter-feature association or over-represented values (no thresholds were set)
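As a rough picture of what n_best does, the sketch below ranks the features by a chosen statistic and keeps the top ones. It only uses the figures printed in the selector output above; `top_features` is a hypothetical helper that ignores the inter-feature filtering step, so it is a simplification rather than the selector's actual logic.

```python
# (chi2_statistic, tschuprowt_measure) copied from the selector output above
stats = {
    "Sex": (169.2047, 0.5337),
    "Pclass": (53.5114, 0.3001),
    "Fare": (65.8288, 0.2799),
    "Age": (14.6254, 0.1569),
    "Parents/Children Aboard": (11.0576, 0.1364),
    "Siblings/Spouses Aboard": (11.5963, 0.1175),
}


def top_features(stats, n_best, column):
    """Names of the n_best features, ranked by the statistic at index column."""
    ranked = sorted(stats, key=lambda feature: stats[feature][column], reverse=True)
    return ranked[:n_best]


by_t = top_features(stats, n_best=3, column=1)
by_chi2 = top_features(stats, n_best=3, column=0)
print(by_t)     # ['Sex', 'Pclass', 'Fare']
print(by_chi2)  # ['Sex', 'Fare', 'Pclass']
```

Fare has a larger chi-squared statistic than Pclass but a lower Tschuprow's T: Fare was carved into three modalities and Pclass into two, and T normalizes by the number of modalities.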