Examples
Credit Scoring Example
Settings
Installation
In [1]:
!pip install --upgrade autocarver
Requirement already satisfied: autocarver in c:\users\defra\.conda\envs\py39\lib\site-packages (5.1.2)
Requirement already satisfied: ipython in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (8.11.0)
Requirement already satisfied: numpy in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (1.24.2)
Requirement already satisfied: tqdm in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (4.65.0)
Requirement already satisfied: scipy in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (1.10.1)
Requirement already satisfied: statsmodels in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (0.14.0)
Requirement already satisfied: pandas in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (1.5.3)
Requirement already satisfied: scikit-learn in c:\users\defra\.conda\envs\py39\lib\site-packages (from autocarver) (1.2.2)
Requirement already satisfied: decorator in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (5.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.18.2)
Requirement already satisfied: traitlets>=5 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (5.9.0)
Requirement already satisfied: stack-data in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.6.2)
Requirement already satisfied: matplotlib-inline in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.1.6)
Requirement already satisfied: colorama in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.4.6)
Requirement already satisfied: backcall in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.2.0)
Requirement already satisfied: pickleshare in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (0.7.5)
Requirement already satisfied: pygments>=2.4.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (2.14.0)
Requirement already satisfied: prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30 in c:\users\defra\.conda\envs\py39\lib\site-packages (from ipython->autocarver) (3.0.38)
Requirement already satisfied: pytz>=2020.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from pandas->autocarver) (2022.7.1)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from pandas->autocarver) (2.8.2)
Requirement already satisfied: joblib>=1.1.1 in c:\users\defra\.conda\envs\py39\lib\site-packages (from scikit-learn->autocarver) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from scikit-learn->autocarver) (3.1.0)
Requirement already satisfied: patsy>=0.5.2 in c:\users\defra\.conda\envs\py39\lib\site-packages (from statsmodels->autocarver) (0.5.3)
Requirement already satisfied: packaging>=21.3 in c:\users\defra\.conda\envs\py39\lib\site-packages (from statsmodels->autocarver) (23.0)
Requirement already satisfied: parso<0.9.0,>=0.8.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from jedi>=0.16->ipython->autocarver) (0.8.3)
Requirement already satisfied: six in c:\users\defra\.conda\envs\py39\lib\site-packages (from patsy>=0.5.2->statsmodels->autocarver) (1.16.0)
Requirement already satisfied: wcwidth in c:\users\defra\.conda\envs\py39\lib\site-packages (from prompt-toolkit!=3.0.37,<3.1.0,>=3.0.30->ipython->autocarver) (0.2.6)
Requirement already satisfied: pure-eval in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->autocarver) (0.2.2)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->autocarver) (2.2.1)
Requirement already satisfied: executing>=1.2.0 in c:\users\defra\.conda\envs\py39\lib\site-packages (from stack-data->ipython->autocarver) (1.2.0)
Setting up samples
The dataset comes from the Kaggle competition Give Me Some Credit, available at https://www.kaggle.com/competitions/GiveMeSomeCredit/
In [1]:
import pandas as pd
data_path = "GiveMeSomeCredit"
credit_data = pd.read_csv(f"{data_path}/cs-training.csv", index_col=0)
print(credit_data.shape)
credit_data.head()
(150000, 11)
Out[1]:
| | SeriousDlqin2yrs | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.766127 | 45 | 2 | 0.802982 | 9120.0 | 13 | 0 | 6 | 0 | 2.0 |
| 2 | 0 | 0.957151 | 40 | 0 | 0.121876 | 2600.0 | 4 | 0 | 0 | 0 | 1.0 |
| 3 | 0 | 0.658180 | 38 | 1 | 0.085113 | 3042.0 | 2 | 1 | 0 | 0 | 0.0 |
| 4 | 0 | 0.233810 | 30 | 0 | 0.036050 | 3300.0 | 5 | 0 | 0 | 0 | 0.0 |
| 5 | 0 | 0.907239 | 49 | 1 | 0.024926 | 63588.0 | 7 | 0 | 1 | 0 | 0.0 |
In [2]:
from sklearn.model_selection import train_test_split
X_train, X_dev = train_test_split(credit_data, test_size=0.33, random_state=42)
Picking columns to carve
In [3]:
X_train.dtypes
Out[3]:
SeriousDlqin2yrs                        int64
RevolvingUtilizationOfUnsecuredLines    float64
age                                     int64
NumberOfTime30-59DaysPastDueNotWorse    int64
DebtRatio                               float64
MonthlyIncome                           float64
NumberOfOpenCreditLinesAndLoans         int64
NumberOfTimes90DaysLate                 int64
NumberRealEstateLoansOrLines            int64
NumberOfTime60-89DaysPastDueNotWorse    int64
NumberOfDependents                      float64
dtype: object
In [4]:
X_train.isna().mean()
Out[4]:
SeriousDlqin2yrs                        0.000000
RevolvingUtilizationOfUnsecuredLines    0.000000
age                                     0.000000
NumberOfTime30-59DaysPastDueNotWorse    0.000000
DebtRatio                               0.000000
MonthlyIncome                           0.197383
NumberOfOpenCreditLinesAndLoans         0.000000
NumberOfTimes90DaysLate                 0.000000
NumberRealEstateLoansOrLines            0.000000
NumberOfTime60-89DaysPastDueNotWorse    0.000000
NumberOfDependents                      0.026129
dtype: float64
In [5]:
target = "SeriousDlqin2yrs"
quantitative_features = [feature for feature in X_train if feature != target]
In [6]:
from AutoCarver import AutoCarver
auto_carver = AutoCarver(
    quantitative_features=quantitative_features,
    qualitative_features=[],
    sort_by='cramerv',  # rank combinations by Cramér's V
    dropna=False,  # don't group NaNs with other values; leave that to XGBoost
    min_freq=0.1,  # minimum frequency per modality
    max_n_mod=5,  # maximum number of modalities per carved feature
    copy=True,  # so that X_train is not modified in place
    pretty_print=True,  # prints nice tables
)
x_discretized = auto_carver.fit_transform(
    # dataset to carve
    X_train, X_train[target],
    # dataset used to test robustness
    X_dev=X_dev, y_dev=X_dev[target]
)
------ [Discretizer] Fit Quantitative Features ---
- [QuantileDiscretizer] Fit ['age', 'NumberOfDependents', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome']
- [BaseDiscretizer] Transform Quantitative ['age', 'NumberOfDependents', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome']
- [OrdinalDiscretizer] Fit ['NumberOfDependents', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'age', 'MonthlyIncome']
------
- [BaseDiscretizer] Transform Quantitative ['NumberOfDependents', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'age', 'MonthlyIncome']
- [BaseDiscretizer] Transform Quantitative ['NumberOfDependents', 'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'age', 'MonthlyIncome']
------ [AutoCarver] Fit NumberOfDependents (1/10) ---
- [AutoCarver] Raw feature distribution
| NumberOfDependents | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.060000 | 0.580000 |
| 0.0 < x <= 1.0 | 0.073000 | 0.175000 |
| 1.0 < x | 0.085000 | 0.219000 |
| __NAN__ | 0.048000 | 0.026000 |

| NumberOfDependents | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.057000 | 0.579000 |
| 0.0 < x <= 1.0 | 0.074000 | 0.177000 |
| 1.0 < x | 0.086000 | 0.218000 |
| __NAN__ | 0.042000 | 0.026000 |
Grouping modalities : 100%|██████████| 3/3 [00:00<?, ?it/s] Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3078.76it/s] Testing robustness : 0%| | 0/3 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.060000 | 0.580000 |
| 0.0 < x <= 1.0 | 0.073000 | 0.175000 |
| 1.0 < x | 0.085000 | 0.219000 |
| __NAN__ | 0.048000 | 0.026000 |

| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.057000 | 0.579000 |
| 0.0 < x <= 1.0 | 0.074000 | 0.177000 |
| 1.0 < x | 0.086000 | 0.218000 |
| __NAN__ | 0.042000 | 0.026000 |
------
------ [AutoCarver] Fit NumberOfTime30-59DaysPastDueNotWorse (2/10) ---
- [AutoCarver] Raw feature distribution
| NumberOfTime30-59DaysPastDueNotWorse | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.040000 | 0.839000 |
| 0.0 < x | 0.208000 | 0.161000 |

| NumberOfTime30-59DaysPastDueNotWorse | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.039000 | 0.842000 |
| 0.0 < x | 0.207000 | 0.158000 |
Grouping modalities : 100%|██████████| 1/1 [00:00<?, ?it/s] Computing associations: 100%|██████████| 1/1 [00:00<00:00, 1001.98it/s] Testing robustness : 0%| | 0/1 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.040000 | 0.839000 |
| 0.0 < x | 0.208000 | 0.161000 |

| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.039000 | 0.842000 |
| 0.0 < x | 0.207000 | 0.158000 |
------
------ [AutoCarver] Fit DebtRatio (3/10) ---
- [AutoCarver] Raw feature distribution
| DebtRatio | target_rate | frequency |
|---|---|---|
| x <= 31.1m | 0.052000 | 0.100000 |
| 31.1m < x <= 133.6m | 0.070000 | 0.100000 |
| 133.6m < x <= 213.9m | 0.061000 | 0.100000 |
| 213.9m < x <= 287.6m | 0.054000 | 0.100000 |
| 287.6m < x <= 467.4m | 0.061000 | 0.200000 |
| 467.4m < x <= 648.0m | 0.088000 | 0.100000 |
| 648.0m < x <= 3.8 | 0.114000 | 0.100000 |
| 3.8 < x | 0.056000 | 0.200000 |

| DebtRatio | target_rate | frequency |
|---|---|---|
| x <= 31.1m | 0.056000 | 0.101000 |
| 31.1m < x <= 133.6m | 0.064000 | 0.099000 |
| 133.6m < x <= 213.9m | 0.058000 | 0.101000 |
| 213.9m < x <= 287.6m | 0.055000 | 0.100000 |
| 287.6m < x <= 467.4m | 0.062000 | 0.199000 |
| 467.4m < x <= 648.0m | 0.077000 | 0.099000 |
| 648.0m < x <= 3.8 | 0.115000 | 0.100000 |
| 3.8 < x | 0.054000 | 0.202000 |
Grouping modalities : 100%|██████████| 98/98 [00:00<00:00, 7372.02it/s] Computing associations: 100%|██████████| 98/98 [00:00<00:00, 3828.24it/s] Testing robustness : 2%|▏ | 2/98 [00:00<00:00, 799.98it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 31.1m | 0.059000 | 0.400000 |
| 287.6m < x <= 467.4m | 0.061000 | 0.200000 |
| 467.4m < x <= 648.0m | 0.088000 | 0.100000 |
| 648.0m < x <= 3.8 | 0.114000 | 0.100000 |
| 3.8 < x | 0.056000 | 0.200000 |

| | target_rate | frequency |
|---|---|---|
| x <= 31.1m | 0.059000 | 0.401000 |
| 287.6m < x <= 467.4m | 0.062000 | 0.199000 |
| 467.4m < x <= 648.0m | 0.077000 | 0.099000 |
| 648.0m < x <= 3.8 | 0.115000 | 0.100000 |
| 3.8 < x | 0.054000 | 0.202000 |
------
------ [AutoCarver] Fit RevolvingUtilizationOfUnsecuredLines (4/10) ---
- [AutoCarver] Raw feature distribution
| RevolvingUtilizationOfUnsecuredLines | target_rate | frequency |
|---|---|---|
| x <= 3.0m | 0.025000 | 0.100000 |
| 3.0m < x <= 19.2m | 0.014000 | 0.100000 |
| 19.2m < x <= 43.6m | 0.014000 | 0.100000 |
| 43.6m < x <= 83.8m | 0.019000 | 0.100000 |
| 83.8m < x <= 155.4m | 0.025000 | 0.100000 |
| 155.4m < x <= 273.6m | 0.037000 | 0.100000 |
| 273.6m < x <= 447.4m | 0.053000 | 0.100000 |
| 447.4m < x <= 700.7m | 0.088000 | 0.100000 |
| 700.7m < x | 0.199000 | 0.200000 |

| RevolvingUtilizationOfUnsecuredLines | target_rate | frequency |
|---|---|---|
| x <= 3.0m | 0.025000 | 0.100000 |
| 3.0m < x <= 19.2m | 0.012000 | 0.099000 |
| 19.2m < x <= 43.6m | 0.014000 | 0.103000 |
| 43.6m < x <= 83.8m | 0.019000 | 0.103000 |
| 83.8m < x <= 155.4m | 0.022000 | 0.100000 |
| 155.4m < x <= 273.6m | 0.031000 | 0.100000 |
| 273.6m < x <= 447.4m | 0.052000 | 0.099000 |
| 447.4m < x <= 700.7m | 0.090000 | 0.098000 |
| 700.7m < x | 0.199000 | 0.198000 |
Grouping modalities : 100%|██████████| 162/162 [00:00<00:00, 6461.12it/s] Computing associations: 100%|██████████| 162/162 [00:00<00:00, 3819.84it/s] Testing robustness : 0%| | 0/162 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 3.0m | 0.020000 | 0.500000 |
| 155.4m < x <= 273.6m | 0.037000 | 0.100000 |
| 273.6m < x <= 447.4m | 0.053000 | 0.100000 |
| 447.4m < x <= 700.7m | 0.088000 | 0.100000 |
| 700.7m < x | 0.199000 | 0.200000 |

| | target_rate | frequency |
|---|---|---|
| x <= 3.0m | 0.018000 | 0.504000 |
| 155.4m < x <= 273.6m | 0.031000 | 0.100000 |
| 273.6m < x <= 447.4m | 0.052000 | 0.099000 |
| 447.4m < x <= 700.7m | 0.090000 | 0.098000 |
| 700.7m < x | 0.199000 | 0.198000 |
------
------ [AutoCarver] Fit NumberOfTimes90DaysLate (5/10) ---
- [AutoCarver] Raw feature distribution
| NumberOfTimes90DaysLate | target_rate | frequency |
|---|---|---|
| x <= nan | 0.067000 | 1.000000 |

| NumberOfTimes90DaysLate | target_rate | frequency |
|---|---|---|
| x <= nan | 0.066000 | 1.000000 |
- [AutoCarver] No robust combination for feature 'NumberOfTimes90DaysLate' could be found. It will be ignored. You might have to increase the size of your test sample (test sample not representative of test sample for this feature) or you should consider dropping this features.
------
------ [AutoCarver] Fit NumberOfTime60-89DaysPastDueNotWorse (6/10) ---
- [AutoCarver] Raw feature distribution
| NumberOfTime60-89DaysPastDueNotWorse | target_rate | frequency |
|---|---|---|
| x <= nan | 0.067000 | 1.000000 |

| NumberOfTime60-89DaysPastDueNotWorse | target_rate | frequency |
|---|---|---|
| x <= nan | 0.066000 | 1.000000 |
- [AutoCarver] No robust combination for feature 'NumberOfTime60-89DaysPastDueNotWorse' could be found. It will be ignored. You might have to increase the size of your test sample (test sample not representative of test sample for this feature) or you should consider dropping this features.
------
------ [AutoCarver] Fit NumberOfOpenCreditLinesAndLoans (7/10) ---
- [AutoCarver] Raw feature distribution
| NumberOfOpenCreditLinesAndLoans | target_rate | frequency |
|---|---|---|
| x <= 3.0 | 0.107000 | 0.147000 |
| 3.0 < x <= 5.0 | 0.064000 | 0.163000 |
| 5.0 < x <= 8.0 | 0.053000 | 0.262000 |
| 8.0 < x <= 10.0 | 0.061000 | 0.141000 |
| 10.0 < x <= 12.0 | 0.061000 | 0.102000 |
| 12.0 < x | 0.067000 | 0.185000 |

| NumberOfOpenCreditLinesAndLoans | target_rate | frequency |
|---|---|---|
| x <= 3.0 | 0.107000 | 0.147000 |
| 3.0 < x <= 5.0 | 0.063000 | 0.164000 |
| 5.0 < x <= 8.0 | 0.053000 | 0.265000 |
| 8.0 < x <= 10.0 | 0.055000 | 0.138000 |
| 10.0 < x <= 12.0 | 0.057000 | 0.102000 |
| 12.0 < x | 0.066000 | 0.183000 |
Grouping modalities : 100%|██████████| 30/30 [00:00<00:00, 3600.36it/s] Computing associations: 100%|██████████| 30/30 [00:00<00:00, 3562.24it/s] Testing robustness : 0%| | 0/30 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 3.0 | 0.107000 | 0.147000 |
| 3.0 < x <= 5.0 | 0.064000 | 0.163000 |
| 5.0 < x <= 8.0 | 0.053000 | 0.262000 |
| 8.0 < x <= 10.0 | 0.061000 | 0.243000 |
| 12.0 < x | 0.067000 | 0.185000 |

| | target_rate | frequency |
|---|---|---|
| x <= 3.0 | 0.107000 | 0.147000 |
| 3.0 < x <= 5.0 | 0.063000 | 0.164000 |
| 5.0 < x <= 8.0 | 0.053000 | 0.265000 |
| 8.0 < x <= 10.0 | 0.056000 | 0.240000 |
| 12.0 < x | 0.066000 | 0.183000 |
------
------ [AutoCarver] Fit NumberRealEstateLoansOrLines (8/10) ---
- [AutoCarver] Raw feature distribution
| NumberRealEstateLoansOrLines | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.083000 | 0.376000 |
| 0.0 < x <= 1.0 | 0.053000 | 0.348000 |
| 1.0 < x | 0.065000 | 0.277000 |

| NumberRealEstateLoansOrLines | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.083000 | 0.373000 |
| 0.0 < x <= 1.0 | 0.052000 | 0.352000 |
| 1.0 < x | 0.059000 | 0.276000 |
Grouping modalities : 100%|██████████| 3/3 [00:00<00:00, 1506.03it/s] Computing associations: 100%|██████████| 3/3 [00:00<00:00, 3008.83it/s] Testing robustness : 0%| | 0/3 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.083000 | 0.376000 |
| 0.0 < x <= 1.0 | 0.053000 | 0.348000 |
| 1.0 < x | 0.065000 | 0.277000 |

| | target_rate | frequency |
|---|---|---|
| x <= 0.0 | 0.083000 | 0.373000 |
| 0.0 < x <= 1.0 | 0.052000 | 0.352000 |
| 1.0 < x | 0.059000 | 0.276000 |
------
------ [AutoCarver] Fit age (9/10) ---
- [AutoCarver] Raw feature distribution
| age | target_rate | frequency |
|---|---|---|
| x <= 33.0 | 0.115000 | 0.114000 |
| 33.0 < x <= 39.0 | 0.097000 | 0.100000 |
| 39.0 < x <= 48.0 | 0.085000 | 0.203000 |
| 48.0 < x <= 56.0 | 0.071000 | 0.193000 |
| 56.0 < x <= 61.0 | 0.051000 | 0.112000 |
| 61.0 < x | 0.028000 | 0.277000 |

| age | target_rate | frequency |
|---|---|---|
| x <= 33.0 | 0.110000 | 0.114000 |
| 33.0 < x <= 39.0 | 0.094000 | 0.098000 |
| 39.0 < x <= 48.0 | 0.082000 | 0.204000 |
| 48.0 < x <= 56.0 | 0.072000 | 0.194000 |
| 56.0 < x <= 61.0 | 0.048000 | 0.113000 |
| 61.0 < x | 0.029000 | 0.277000 |
Grouping modalities : 100%|██████████| 30/30 [00:00<00:00, 3313.21it/s] Computing associations: 100%|██████████| 30/30 [00:00<00:00, 2360.64it/s] Testing robustness : 0%| | 0/30 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 33.0 | 0.115000 | 0.114000 |
| 33.0 < x <= 39.0 | 0.089000 | 0.304000 |
| 48.0 < x <= 56.0 | 0.071000 | 0.193000 |
| 56.0 < x <= 61.0 | 0.051000 | 0.112000 |
| 61.0 < x | 0.028000 | 0.277000 |

| | target_rate | frequency |
|---|---|---|
| x <= 33.0 | 0.110000 | 0.114000 |
| 33.0 < x <= 39.0 | 0.086000 | 0.302000 |
| 48.0 < x <= 56.0 | 0.072000 | 0.194000 |
| 56.0 < x <= 61.0 | 0.048000 | 0.113000 |
| 61.0 < x | 0.029000 | 0.277000 |
------
------ [AutoCarver] Fit MonthlyIncome (10/10) ---
- [AutoCarver] Raw feature distribution
| MonthlyIncome | target_rate | frequency |
|---|---|---|
| x <= 2.3K | 0.085000 | 0.101000 |
| 2.3K < x <= 3.4K | 0.097000 | 0.101000 |
| 3.4K < x <= 5.4K | 0.080000 | 0.201000 |
| 5.4K < x <= 8.2K | 0.061000 | 0.199000 |
| 8.2K < x <= 10.7K | 0.051000 | 0.100000 |
| 10.7K < x | 0.044000 | 0.100000 |
| __NAN__ | 0.056000 | 0.197000 |

| MonthlyIncome | target_rate | frequency |
|---|---|---|
| x <= 2.3K | 0.089000 | 0.101000 |
| 2.3K < x <= 3.4K | 0.098000 | 0.101000 |
| 3.4K < x <= 5.4K | 0.075000 | 0.199000 |
| 5.4K < x <= 8.2K | 0.060000 | 0.199000 |
| 8.2K < x <= 10.7K | 0.042000 | 0.098000 |
| 10.7K < x | 0.047000 | 0.102000 |
| __NAN__ | 0.056000 | 0.200000 |
Grouping modalities : 100%|██████████| 30/30 [00:00<00:00, 3022.63it/s] Computing associations: 100%|██████████| 30/30 [00:00<00:00, 2698.00it/s] Testing robustness : 0%| | 0/30 [00:00<?, ?it/s]
- [AutoCarver] Carved feature distribution
| | target_rate | frequency |
|---|---|---|
| x <= 2.3K | 0.085000 | 0.101000 |
| 2.3K < x <= 3.4K | 0.097000 | 0.101000 |
| 3.4K < x <= 5.4K | 0.080000 | 0.201000 |
| 5.4K < x <= 8.2K | 0.061000 | 0.199000 |
| 8.2K < x <= 10.7K | 0.047000 | 0.201000 |
| __NAN__ | 0.056000 | 0.197000 |

| | target_rate | frequency |
|---|---|---|
| x <= 2.3K | 0.089000 | 0.101000 |
| 2.3K < x <= 3.4K | 0.098000 | 0.101000 |
| 3.4K < x <= 5.4K | 0.075000 | 0.199000 |
| 5.4K < x <= 8.2K | 0.060000 | 0.199000 |
| 8.2K < x <= 10.7K | 0.044000 | 0.200000 |
| __NAN__ | 0.056000 | 0.200000 |
------
- [BaseDiscretizer] Transform Quantitative ['age', 'NumberOfDependents', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome']
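The carver was configured with `sort_by='cramerv'`, so candidate groupings are ranked by Cramér's V. As a rough illustration of the statistic itself (not AutoCarver's internal code), the sketch below computes it from a contingency table of counts in plain Python. The function name `cramers_v` and the counts are hypothetical, and the sketch assumes no row or column total is zero.

```python
import math


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: compare observed counts with counts expected
    # under independence of rows and columns.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by sample size and by the smaller table dimension minus one.
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are bins of a feature, columns are target 0 / 1.
counts = [[800, 40], [120, 40]]
print(round(cramers_v(counts), 3))
```

The value lies between 0 (bins carry no information about the target) and 1 (bins determine the target), which makes groupings with different numbers of bins comparable.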
Inspecting Discretization
In [7]:
x_discretized[quantitative_features].head()
Out[7]:
| | RevolvingUtilizationOfUnsecuredLines | age | NumberOfTime30-59DaysPastDueNotWorse | DebtRatio | MonthlyIncome | NumberOfOpenCreditLinesAndLoans | NumberOfTimes90DaysLate | NumberRealEstateLoansOrLines | NumberOfTime60-89DaysPastDueNotWorse | NumberOfDependents |
|---|---|---|---|---|---|---|---|---|---|---|
| 87936 | 0.0 | 4 | 0 | 4.0 | NaN | 2 | 0 | 1 | 0 | 0.0 |
| 3893 | 3.0 | 0 | 0 | 2.0 | 2.0 | 1 | 0 | 0 | 0 | 0.0 |
| 41405 | 4.0 | 1 | 0 | 4.0 | NaN | 2 | 0 | 1 | 0 | 0.0 |
| 91125 | 2.0 | 2 | 0 | 2.0 | 3.0 | 3 | 0 | 0 | 0 | 2.0 |
| 67373 | 3.0 | 2 | 3 | 4.0 | NaN | 2 | 2 | 1 | 1 | NaN |
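The `target_rate` and `frequency` columns printed during carving can be recomputed from discretized labels like these. The sketch below does so in plain Python; the helper `bin_stats` is hypothetical and the labels and targets are made-up values, not rows from the dataset.

```python
from collections import defaultdict


def bin_stats(labels, targets):
    """Return {label: (target_rate, frequency)} for discretized labels."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for label, target in zip(labels, targets):
        counts[label] += 1
        positives[label] += target

    total = len(labels)
    return {
        label: (positives[label] / counts[label], counts[label] / total)
        for label in sorted(counts)
    }


# Made-up discretized feature and binary target.
labels = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
targets = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]

for label, (rate, freq) in bin_stats(labels, targets).items():
    print(label, round(rate, 3), freq)
```

Running the same helper on a training and a development sample gives the two tables to compare when judging whether a grouping is stable.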
In [7]:
auto_carver.summary()
Out[7]:
| feature | dtype | label | content |
|---|---|---|---|
| DebtRatio | float | 0 | [x <= 287.6m] |
| | float | 1 | [287.6m < x <= 467.4m] |
| | float | 2 | [467.4m < x <= 648.0m] |
| | float | 3 | [648.0m < x <= 3.8] |
| | float | 4 | [3.8 < x] |
| MonthlyIncome | float | 0 | [x <= 2.3K] |
| | float | 1 | [2.3K < x <= 3.4K] |
| | float | 2 | [3.4K < x <= 5.4K] |
| | float | 3 | [5.4K < x <= 8.2K] |
| | float | 4 | [8.2K < x] |
| | float | 5 | [__NAN__] |
| NumberOfDependents | float | 0 | [x <= 0.0] |
| | float | 1 | [0.0 < x <= 1.0] |
| | float | 2 | [1.0 < x] |
| | float | 3 | [__NAN__] |
| NumberOfOpenCreditLinesAndLoans | float | 0 | [x <= 3.0] |
| | float | 1 | [3.0 < x <= 5.0] |
| | float | 2 | [5.0 < x <= 8.0] |
| | float | 3 | [8.0 < x <= 12.0] |
| | float | 4 | [12.0 < x] |
| NumberOfTime30-59DaysPastDueNotWorse | float | 0 | [x <= 0.0] |
| | float | 1 | [0.0 < x] |
| NumberRealEstateLoansOrLines | float | 0 | [x <= 0.0] |
| | float | 1 | [0.0 < x <= 1.0] |
| | float | 2 | [1.0 < x] |
| RevolvingUtilizationOfUnsecuredLines | float | 0 | [x <= 155.4m] |
| | float | 1 | [155.4m < x <= 273.6m] |
| | float | 2 | [273.6m < x <= 447.4m] |
| | float | 3 | [447.4m < x <= 700.7m] |
| | float | 4 | [700.7m < x] |
| age | float | 0 | [x <= 33.0] |
| | float | 1 | [33.0 < x <= 48.0] |
| | float | 2 | [48.0 < x <= 56.0] |
| | float | 3 | [56.0 < x <= 61.0] |
| | float | 4 | [61.0 < x] |
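Each label in the summary corresponds to a right-closed interval. As a small illustration of how a raw value maps to its label, the sketch below uses the `age` edges read from the table (33, 48, 56, 61) with the standard `bisect` module. The helper `age_label` is hypothetical and is not how AutoCarver applies its transform.

```python
from bisect import bisect_left

# Upper bounds of the first four `age` bins, read from the summary table.
AGE_EDGES = [33.0, 48.0, 56.0, 61.0]


def age_label(age):
    """Return the label of the right-closed bin that contains `age`."""
    # bisect_left counts the edges strictly below `age`, so a value equal
    # to an edge stays in the lower bin (x <= edge).
    return bisect_left(AGE_EDGES, age)


for age in (30, 33, 45, 61, 70):
    print(age, age_label(age))
```

Missing values are not handled here; in the summary they get their own `__NAN__` label where they occur.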
Saving for later use
In [9]:
import json
# storing as json file
with open('my_carver.json', 'w') as my_carver_json:
    json.dump(auto_carver.to_json(), my_carver_json)
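To reuse the file later, its content can be read back with `json.load`. The sketch below shows the round trip with a hypothetical stand-in dictionary, because the structure returned by `auto_carver.to_json()` is not shown here. Rebuilding a carver from the loaded content depends on the installed AutoCarver version and is left out.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the content returned by `auto_carver.to_json()`.
carver_content = {"age": {"edges": [33.0, 48.0, 56.0, 61.0]}, "min_freq": 0.1}

# Write to a temporary directory so the sketch leaves no file behind in the
# working directory.
path = os.path.join(tempfile.mkdtemp(), "my_carver.json")
with open(path, "w") as carver_file:
    json.dump(carver_content, carver_file)

# Read the content back; JSON-compatible content survives unchanged.
with open(path, "r") as carver_file:
    loaded_content = json.load(carver_file)

print(loaded_content == carver_content)
```

Note that JSON only stores string keys, so a dictionary keyed by integers would come back with string keys.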
Feature Selection
Setting up measures and filters
In [10]:
from AutoCarver.feature_selection import FeatureSelector
n_best = 10 # number of features to select
feature_selector = FeatureSelector(
    quantitative_features=quantitative_features,
    n_best=n_best,
    pretty_print=True,
)
best_features = feature_selector.select(x_discretized, x_discretized[target])
------ [FeatureSelector] Selecting from Features: ['age', 'NumberOfDependents', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'MonthlyIncome'] ---
- Association between X and y
| | dtype | pct_nan | pct_mode | mode | kruskal_measure |
|---|---|---|---|---|---|
| NumberOfTimes90DaysLate | int64 | 0.000000 | 0.943960 | 0 | 11854.250713 |
| NumberOfTime60-89DaysPastDueNotWorse | int64 | 0.000000 | 0.949254 | 0 | 7636.377475 |
| RevolvingUtilizationOfUnsecuredLines | float64 | 0.000000 | 0.500000 | 0.000000 | 6158.299923 |
| NumberOfTime30-59DaysPastDueNotWorse | int64 | 0.000000 | 0.839035 | 0 | 6076.412797 |
| age | int64 | 0.000000 | 0.303592 | 1 | 1370.032527 |
| MonthlyIncome | float64 | 0.197383 | 0.200945 | 2.000000 | 337.382149 |
| NumberOfDependents | float64 | 0.026129 | 0.579741 | 0.000000 | 175.384642 |
| NumberOfOpenCreditLinesAndLoans | int64 | 0.000000 | 0.261602 | 2 | 121.575254 |
| NumberRealEstateLoansOrLines | int64 | 0.000000 | 0.375512 | 0 | 120.839911 |
| DebtRatio | float64 | 0.000000 | 0.400000 | 0.000000 | 47.123007 |
- Association between X and y, filtered for inter-feature assocation
| | dtype | pct_nan | pct_mode | mode | kruskal_measure |
|---|---|---|---|---|---|
| NumberOfTimes90DaysLate | int64 | 0.000000 | 0.943960 | 0 | 11854.250713 |
| NumberOfTime60-89DaysPastDueNotWorse | int64 | 0.000000 | 0.949254 | 0 | 7636.377475 |
| RevolvingUtilizationOfUnsecuredLines | float64 | 0.000000 | 0.500000 | 0.000000 | 6158.299923 |
| NumberOfTime30-59DaysPastDueNotWorse | int64 | 0.000000 | 0.839035 | 0 | 6076.412797 |
| age | int64 | 0.000000 | 0.303592 | 1 | 1370.032527 |
| MonthlyIncome | float64 | 0.197383 | 0.200945 | 2.000000 | 337.382149 |
| NumberOfDependents | float64 | 0.026129 | 0.579741 | 0.000000 | 175.384642 |
| NumberOfOpenCreditLinesAndLoans | int64 | 0.000000 | 0.261602 | 2 | 121.575254 |
| NumberRealEstateLoansOrLines | int64 | 0.000000 | 0.375512 | 0 | 120.839911 |
| DebtRatio | float64 | 0.000000 | 0.400000 | 0.000000 | 47.123007 |
------
Out[10]:
['NumberOfTimes90DaysLate', 'NumberOfTime60-89DaysPastDueNotWorse', 'RevolvingUtilizationOfUnsecuredLines', 'NumberOfTime30-59DaysPastDueNotWorse', 'age', 'MonthlyIncome', 'NumberOfDependents', 'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines', 'DebtRatio']
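The `kruskal_measure` column name suggests a Kruskal-Wallis H statistic. That is an assumption, since the output does not say how the measure is computed. Under that assumption, the sketch below computes H in plain Python with average ranks and the usual tie correction, where each group would hold the feature values observed for one target class. The function `kruskal_h` and the sample values are hypothetical, and the sketch assumes the pooled values are not all identical.

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic with average ranks and tie correction."""
    pooled = sorted(value for group in groups for value in group)
    n = len(pooled)

    # Average 1-based rank of each distinct value, plus the tie term
    # sum(t**3 - t) over runs of t equal values.
    ranks, tie_term, start = {}, 0, 0
    while start < n:
        end = start
        while end < n and pooled[end] == pooled[start]:
            end += 1
        ranks[pooled[start]] = (start + 1 + end) / 2
        ties = end - start
        tie_term += ties ** 3 - ties
        start = end

    # Sum over groups of (rank sum squared) / (group size).
    rank_sums = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_sums - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


# Made-up discretized feature values, split by target class.
values_target_0 = [0, 0, 1, 1, 2, 0, 1]
values_target_1 = [2, 3, 3, 4, 2]
print(round(kruskal_h(values_target_0, values_target_1), 3))
```

A larger H means the feature's values are ranked more differently across target classes, which is consistent with the delinquency counters topping the table above.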