Setting things up
About this notebook
Welcome to this notebook where we will explore the Census Adult Income dataset, a rich source of socio-economic information derived from the 1994 U.S. Census Bureau database. Our focus in this analysis is on employing a robust Python data discretization pipeline: Discretizer. This versatile tool is able to discretize various types of data, both quantitative and qualitative, making it an ideal companion for preprocessing tasks.
Data discretization is a crucial step in preparing datasets for machine learning models. It involves transforming continuous or categorical variables into discrete bins, allowing for improved interpretability, handling non-linearity, and addressing potential outliers. The Discretizer we will employ is designed to handle a diverse range of data types, including quantitative variables such as age and education level, as well as qualitative variables like marital status and occupation.
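To make the idea of binning concrete, here is a minimal sketch in plain Python. It is not how Discretizer works internally; the cut points in `age_edges` and the helper `to_bin` are hypothetical and only illustrate how a continuous value is mapped to an interval label.

```python
from bisect import bisect_left


def to_bin(value, edges):
    """Return the label of the interval containing value, given sorted upper edges."""
    position = bisect_left(edges, value)  # index of the first edge >= value
    if position == 0:
        return f"x <= {edges[0]}"
    if position == len(edges):
        return f"{edges[-1]} < x"
    return f"{edges[position - 1]} < x <= {edges[position]}"


age_edges = [25, 40, 60]  # hypothetical cut points, chosen by hand
ages = [19, 25, 39, 53, 72]
binned = [to_bin(age, age_edges) for age in ages]
print(binned)
```

Each age ends up in exactly one interval, and intervals are closed on the right, which mirrors the `a < x <= b` labels used later in this notebook.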
Throughout this notebook, we will delve into the intricacies of the Census Adult Income dataset, exploring the distribution of features and employing the Discretizer to discretize the data effectively. By the end of this preprocessing journey, we aim to create a dataset that is well-suited for subsequent machine learning tasks, enabling the development of robust models for predicting income levels based on socio-economic attributes.
Let’s embark on this exploration and witness the power of Discretizer in preparing our data for insightful analysis and accurate modeling.
Installation
[1]:
%pip install autocarver
Census Data
In this example notebook, we will use the Census dataset.
The Census Adult Income dataset, commonly referred to as the “Adult” dataset, is a well-known dataset in the realm of machine learning and data analysis. It is frequently used for tasks related to classification and predictive modeling. The dataset is derived from the 1994 U.S. Census Bureau database and contains a diverse set of features that aim to predict whether an individual earns more than $50,000 annually, making it a binary classification problem.
The features in the Adult dataset include demographic information such as age, education level, marital status, occupation, and work-related details like hours worked per week. The primary objective when working with this dataset is typically to build a predictive model capable of discerning between individuals with annual incomes above and below the $50,000 threshold.
[3]:
from ucimlrepo import fetch_ucirepo
# fetch dataset
adult = fetch_ucirepo(id=2)
# data (as pandas dataframes)
adult_data = adult.data.features
adult_data = adult_data.join(adult.data.targets)
# Display the first few rows of the dataset
adult_data.head()
[3]:
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Target type
[4]:
target = "income"
# cleaning target
adult_data[target] = adult_data[target].apply(lambda u: u.replace(".", ""))
# conversion to 0/1
adult_data[target] = (adult_data[target] == ">50K").astype(int)
# target rate
adult_data[target].value_counts(dropna=False)
[4]:
income
0 37155
1 11687
Name: count, dtype: int64
The target "income" is a binary target used in a classification task.
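The cleaning step above strips "." from the labels, which suggests that part of the raw data carries a trailing period. The sketch below reproduces the same two steps on a small hypothetical list of labels (`raw_labels` is made up for illustration), without pandas.

```python
# hypothetical sample of raw labels, some with a trailing period
raw_labels = ["<=50K", ">50K", "<=50K.", ">50K."]

# step 1: remove the period so that both spellings collapse to one label
cleaned = [label.replace(".", "") for label in raw_labels]

# step 2: encode the positive class (">50K") as 1 and everything else as 0
binary = [int(label == ">50K") for label in cleaned]

target_rate = sum(binary) / len(binary)
print(cleaned, binary, target_rate)
```

Without the first step, ">50K." would be encoded as 0 and the target rate would be understated.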
Data Sampling
[5]:
from sklearn.model_selection import train_test_split
# stratified sampling by target
train_set, dev_set = train_test_split(adult_data, test_size=0.20, random_state=42, stratify=adult_data[target])
# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()
[5]:
(0.23927008420136667, 0.23932848807452145)
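The two target rates are nearly identical because the split is stratified. As a rough illustration of what stratification does (a simplified sketch, not scikit-learn's implementation; `stratified_split` is a hypothetical helper), each class can be shuffled and split separately:

```python
import random


def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) keeping the label proportions in both parts."""
    rng = random.Random(seed)
    by_label = {}
    for index, label in enumerate(labels):
        by_label.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # random assignment within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 240 + [0] * 760  # toy target with a 24% positive rate
train_idx, test_idx = stratified_split(labels, test_size=0.20, seed=42)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_rate, test_rate)
```

Because every class is split with the same ratio, both parts keep the 24% positive rate of the toy target.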
Picking columns to discretize
[6]:
train_set.head()
[6]:
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34495 | 37 | Private | 193106 | Bachelors | 13 | Never-married | Sales | Not-in-family | White | Female | 0 | 0 | 30 | United-States | 0 |
| 18591 | 56 | Self-emp-inc | 216636 | 12th | 8 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 1651 | 40 | United-States | 0 |
| 12562 | 53 | Private | 126977 | HS-grad | 9 | Separated | Craft-repair | Not-in-family | White | Male | 0 | 0 | 35 | United-States | 0 |
| 552 | 72 | Private | 205343 | 11th | 7 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | 0 |
| 3479 | 46 | State-gov | 106705 | Masters | 14 | Never-married | Exec-managerial | Not-in-family | White | Female | 0 | 0 | 38 | United-States | 0 |
[7]:
# column data types
train_set.dtypes
[7]:
age int64
workclass object
fnlwgt int64
education object
education-num int64
marital-status object
occupation object
relationship object
race object
sex object
capital-gain int64
capital-loss int64
hours-per-week int64
native-country object
income int32
dtype: object
- "education" is the only qualitative ordinal feature. It will be added to the list of ordinal_features, and its ordering has to be set in values_orders.
- "sex", "marital-status", "occupation", "relationship", "race", "native-country" and "workclass" are qualitative categorical features. They will be added to the list of qualitative_features.
- "capital-gain" and "capital-loss" are quantitative continuous features, whilst "education-num", "age" and "hours-per-week" can be considered quantitative discrete features. They will be added to the list of quantitative_features.
- "fnlwgt" is the weighting column. It is not currently usable in AutoCarver.
[8]:
# lists of features per data type
quantitative_features = ["capital-gain", "capital-loss", "education-num", "hours-per-week", "age"]
qualitative_features = ["sex", "workclass", "marital-status", "occupation", "relationship", "race", "native-country"]
ordinal_features = ["education"]
weighting = ["fnlwgt"]
# user-specified ordering for ordinal features
values_orders = {
"education": ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate"]
}
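Since the ordering is typed by hand, it may be worth checking that it has no duplicates and covers every value actually observed. The helper below is a hypothetical sanity check, not part of AutoCarver; `observed` is a small made-up sample standing in for the column's values.

```python
def missing_from_order(observed_values, order):
    """Return observed values that are absent from the user-supplied ordering."""
    known = set(order)
    return sorted({value for value in observed_values if value not in known})


education_order = [
    "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th",
    "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
    "Prof-school", "Doctorate",
]

# hypothetical sample of observed values
observed = ["Bachelors", "HS-grad", "11th", "Masters", "Some-college"]

has_duplicates = len(set(education_order)) != len(education_order)
print(has_duplicates, missing_from_order(observed, education_order))
```

In the notebook, the same check could be run on the values of `train_set["education"]` before fitting.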
Using Discretizer
Discretizer Settings
Representativeness of modalities
The attribute min_freq sets the minimum frequency of each basic modality. It is used by the Discretizers as follows:
- For quantitative features, it defines the number of quantiles used to initially discretize the features.
- For qualitative features, it defines the threshold below which a modality is grouped with either a default value or its closest modality.
[9]:
min_freq = 0.02
Tip: min_freq should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust)
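To get a feel for the trade-off, the sketch below converts a few min_freq values into a number of initial quantiles. It assumes that quantiles of size min_freq translate into roughly 1 / min_freq bins; `initial_quantile_count` is a hypothetical helper and the exact rounding used by AutoCarver may differ.

```python
def initial_quantile_count(min_freq):
    """Approximate number of quantiles of size min_freq (assumed to be 1 / min_freq)."""
    if not 0 < min_freq < 1:
        raise ValueError("min_freq must be strictly between 0 and 1")
    return round(1 / min_freq)


quantile_counts = {freq: initial_quantile_count(freq) for freq in (0.01, 0.02, 0.05, 0.2)}
print(quantile_counts)
```

A smaller min_freq means more, finer bins to process (slower, more precise), while a larger one yields a handful of well-populated bins (faster, more robust).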
Fitting Discretizer
First, all qualitative features are discretized using QualitativeDiscretizer:
- (Optionally) using StringDiscretizer to convert them to str if not already the case
- For qualitative ordinal features: using OrdinalDiscretizer for under-represented values (less frequent than min_freq=0.02) to be grouped with their closest modality
- For qualitative categorical features: using CategoricalDiscretizer for under-represented values (less frequent than min_freq=0.02) to be grouped with a default value (str_default="__OTHER__")
Second, all quantitative features are discretized using QuantitativeDiscretizer:
- Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.02)
- Using OrdinalDiscretizer for any remaining under-represented values (less frequent than min_freq/2=0.01) to be grouped with their closest modality
[10]:
from AutoCarver.discretizers import Discretizer
# initializing the Discretizer
discretizer = Discretizer(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)
# fitting on training sample
train_set_processed = discretizer.fit_transform(train_set, train_set[target])
------
[Discretizer] Fit Qualitative Features
---
- [OrdinalDiscretizer] Fit ['education']
- [CategoricalDiscretizer] Fit ['workclass', 'occupation', 'marital-status', 'race', 'native-country', 'relationship', 'sex']
------
------
[Discretizer] Fit Quantitative Features
---
- [ContinuousDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
- [OrdinalDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
------
Discretizer Analysis
Quantitative Continuous Feature
[11]:
# Discretization Summary
feature = "capital-gain"
discretizer.summary(feature)
[11]:
| dtype | label | content |
|---|---|---|
| float | 0.000e+00 < x <= 3.411e+03 | [0.000e+00 < x <= 3.411e+03] |
| float | 1.355e+04 < x | [1.355e+04 < x] |
| float | 3.411e+03 < x <= 7.298e+03 | [3.411e+03 < x <= 7.298e+03] |
| float | 7.298e+03 < x <= 1.355e+04 | [7.298e+03 < x <= 1.355e+04] |
| float | x <= 0.000e+00 | [x <= 0.000e+00] |
[12]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
Over-represented values of capital-gain:
capital-gain
0 0.918358
Name: proportion, dtype: float64
[13]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)
Discretized distribution of capital-gain:
capital-gain
x <= 0.000e+00 0.918358
3.411e+03 < x <= 7.298e+03 0.026847
0.000e+00 < x <= 3.411e+03 0.020474
1.355e+04 < x 0.019835
7.298e+03 < x <= 1.355e+04 0.014486
Name: proportion, dtype: float64
For quantitative continuous feature capital-gain:
- An over-represented value has been identified and kept by itself: value 0 represents 91.8% of observed data (more than min_freq=0.02)
- The remaining 8.2% of values have been discretized in quantiles of size 2% (as specified with min_freq=0.02)
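The behaviour described for capital-gain can be sketched in plain Python: set aside the values that are frequent enough to stand alone, then cut the rest into quantiles. This is a simplified illustration under stated assumptions, not AutoCarver's implementation; `over_represented`, `quantile_edges` and the toy `gains` sample are hypothetical.

```python
from collections import Counter


def over_represented(values, min_freq):
    """Values whose share of the sample is at least min_freq."""
    total = len(values)
    return sorted(v for v, c in Counter(values).items() if c / total >= min_freq)


def quantile_edges(values, n_bins):
    """Upper edges of n_bins roughly equal-sized bins (the last bin is open-ended)."""
    ordered = sorted(values)
    edges = []
    for k in range(1, n_bins):
        edge = ordered[k * len(ordered) // n_bins - 1]
        if not edges or edge > edges[-1]:  # skip repeated edges
            edges.append(edge)
    return edges


# toy capital-gain sample: 92 zeros and 8 scattered positive amounts
gains = [0] * 92 + [1000, 2000, 3000, 4000, 5000, 7000, 10000, 20000]

kept = over_represented(gains, min_freq=0.02)  # values kept as their own modality
rest = [g for g in gains if g not in kept]  # values left for quantile binning
edges = quantile_edges(rest, n_bins=4)
print(kept, edges)
```

Here 0 is kept by itself and the positive amounts are split into four bins, which gives the same shape of result as the summary above: one bin for `x <= 0` followed by quantile intervals.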
Quantitative Discrete Feature
[14]:
# Discretization Summary
feature = "hours-per-week"
discretizer.summary(feature)
[14]:
| dtype | label | content |
|---|---|---|
| float | 1.000e+01 < x <= 1.500e+01 | [1.000e+01 < x <= 1.500e+01] |
| float | 1.500e+01 < x <= 2.000e+01 | [1.500e+01 < x <= 2.000e+01] |
| float | 2.000e+01 < x <= 2.500e+01 | [2.000e+01 < x <= 2.500e+01] |
| float | 2.500e+01 < x <= 3.000e+01 | [2.500e+01 < x <= 3.000e+01] |
| float | 3.000e+01 < x <= 3.400e+01 | [3.000e+01 < x <= 3.400e+01] |
| float | 3.400e+01 < x <= 3.500e+01 | [3.400e+01 < x <= 3.500e+01] |
| float | 3.500e+01 < x <= 3.900e+01 | [3.500e+01 < x <= 3.900e+01] |
| float | 3.900e+01 < x <= 4.000e+01 | [3.900e+01 < x <= 4.000e+01] |
| float | 4.000e+01 < x <= 4.400e+01 | [4.000e+01 < x <= 4.400e+01] |
| float | 4.400e+01 < x <= 4.500e+01 | [4.400e+01 < x <= 4.500e+01] |
| float | 4.500e+01 < x <= 4.900e+01 | [4.500e+01 < x <= 4.900e+01] |
| float | 4.900e+01 < x <= 5.400e+01 | [4.900e+01 < x <= 5.400e+01] |
| float | 5.400e+01 < x <= 5.500e+01 | [5.400e+01 < x <= 5.500e+01] |
| float | 5.500e+01 < x <= 6.000e+01 | [5.500e+01 < x <= 6.000e+01] |
| float | 6.000e+01 < x <= 7.000e+01 | [6.000e+01 < x <= 7.000e+01] |
| float | 7.000e+01 < x | [7.000e+01 < x] |
| float | x <= 1.000e+01 | [x <= 1.000e+01] |
[15]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
Over-represented values of hours-per-week:
hours-per-week
40 0.464976
50 0.086965
45 0.056970
60 0.044993
35 0.039567
20 0.037084
30 0.034730
55 0.021959
Name: proportion, dtype: float64
[16]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)
Discretized distribution of hours-per-week:
hours-per-week
3.900e+01 < x <= 4.000e+01 0.464976
4.900e+01 < x <= 5.400e+01 0.093901
4.400e+01 < x <= 4.500e+01 0.056970
5.500e+01 < x <= 6.000e+01 0.049267
1.500e+01 < x <= 2.000e+01 0.046835
2.500e+01 < x <= 3.000e+01 0.039669
3.400e+01 < x <= 3.500e+01 0.039567
2.000e+01 < x <= 2.500e+01 0.029816
3.500e+01 < x <= 3.900e+01 0.027948
x <= 1.000e+01 0.023367
5.400e+01 < x <= 5.500e+01 0.021959
4.500e+01 < x <= 4.900e+01 0.020884
1.000e+01 < x <= 1.500e+01 0.019911
4.000e+01 < x <= 4.400e+01 0.019169
6.000e+01 < x <= 7.000e+01 0.018504
7.000e+01 < x 0.016072
3.000e+01 < x <= 3.400e+01 0.011184
Name: proportion, dtype: float64
[17]:
# values between 50 and 55 hours per week are under-represented
print(f"Observed data for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55').shape[0] / len(train_set):.2%}")
Observed data for hours-per-week values strictly between 50 and 55: 0.69%
[18]:
print(f"Target rate for {feature} values equal to 50: {train_set.query(f'`{feature}`==50')[target].mean():.2%}")
print(f"Target rate for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55')[target].mean():.2%}")
print(f"Target rate for {feature} values equal to 55: {train_set.query(f'`{feature}`==55')[target].mean():.2%}")
Target rate for hours-per-week values equal to 50: 44.29%
Target rate for hours-per-week values strictly between 50 and 55: 32.10%
Target rate for hours-per-week values equal to 55: 47.32%
For quantitative discrete feature hours-per-week:
- Some over-represented values have been identified:
  - values 20, 30, 35, 40, 45, 50, 55 and 60 each represent more than 2.0% of observed data (more than min_freq=0.02)
  - they are kept as their own modality
- Some under-represented values have been identified:
  - values strictly between 50 and 55 represent only 0.7% of observed data, which is not enough to make a whole quantile (at least min_freq/2=0.01)
  - hence their grouping with their closest modality, 50, in terms of target rate (32.1% is closer to 44.3% than to 47.3%)
- Remaining values have been discretized in quantiles of size 2% (as specified with min_freq=0.02)
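The choice between the two neighbours comes down to comparing distances between target rates. The sketch below redoes that comparison with the rates printed above; `closest_neighbour` is a hypothetical helper written for illustration, not an AutoCarver function.

```python
def closest_neighbour(rate, left_rate, right_rate):
    """Return 'left' or 'right', whichever neighbour has the nearer target rate."""
    return "left" if abs(rate - left_rate) <= abs(rate - right_rate) else "right"


# target rates observed on the training set (see the cell output above)
rates = {"50": 0.4429, "51-54": 0.3210, "55": 0.4732}

gap_to_50 = abs(rates["51-54"] - rates["50"])
gap_to_55 = abs(rates["51-54"] - rates["55"])
choice = closest_neighbour(rates["51-54"], rates["50"], rates["55"])
print(round(gap_to_50, 4), round(gap_to_55, 4), choice)
```

The gap to 50 is about 12.2 points against 15.2 points for 55, so the under-represented values join the modality on their left.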
Qualitative Categorical Feature
[19]:
# Discretization Summary
feature = "workclass"
discretizer.summary(feature)
[19]:
| dtype | label | content |
|---|---|---|
| str | ? | [?] |
| str | Federal-gov | [Federal-gov] |
| str | Local-gov | [Local-gov] |
| str | Private | [Private] |
| str | Self-emp-inc | [Self-emp-inc] |
| str | Self-emp-not-inc | [Self-emp-not-inc] |
| str | State-gov | [State-gov] |
| str | __NAN__ | [__NAN__] |
| str | __OTHER__ | [Never-worked, Without-pay] |
[20]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])
Over-represented values of workclass:
workclass
Private 0.708568
Self-emp-not-inc 0.080096
Local-gov 0.066181
State-gov 0.040622
? 0.038221
Self-emp-inc 0.035584
Federal-gov 0.030023
Name: proportion, dtype: float64
Under-represented values of workclass:
workclass
Without-pay 0.000496
Never-worked 0.000209
Name: proportion, dtype: float64
[21]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)
Discretized distribution of workclass:
workclass
Private 0.694623
Self-emp-not-inc 0.078520
Local-gov 0.064879
State-gov 0.039823
? 0.037468
Self-emp-inc 0.034883
Federal-gov 0.029432
__NAN__ 0.019681
__OTHER__ 0.000691
Name: proportion, dtype: float64
For qualitative categorical feature workclass:
- Some over-represented categories have been identified:
  - categories "Private", "Self-emp-not-inc", "Local-gov", "State-gov", "?", "Self-emp-inc" and "Federal-gov" each represent more than 2.0% of observed data (more than min_freq=0.02)
  - they are kept as their own modality
- Some under-represented categories have been identified:
  - categories "Never-worked" and "Without-pay" each represent less than 2.0% of observed data (less than min_freq=0.02)
  - they are grouped in the default value str_default="__OTHER__"
- Missing values are kept as their own modality (nan value str_nan="__NAN__")
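The grouping rule for categorical features can be sketched as follows. This is an illustration only, assuming that frequencies are computed over non-missing values and that missing values are represented by `None`; `group_rare` and the toy sample are hypothetical, and AutoCarver's own logic may differ in its details.

```python
from collections import Counter


def group_rare(values, min_freq, default="__OTHER__", nan="__NAN__"):
    """Map rare categories to `default` and missing values (None) to `nan`."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    # assumption: frequencies are measured among non-missing values
    rare = {v for v, c in counts.items() if c / len(present) < min_freq}
    return [nan if v is None else default if v in rare else v for v in values]


# toy workclass sample with two rare categories and three missing values
workclass = (
    ["Private"] * 70 + ["Local-gov"] * 25
    + ["Without-pay"] + ["Never-worked"] + [None] * 3
)
grouped = Counter(group_rare(workclass, min_freq=0.02))
print(grouped)
```

The two rare categories end up together under `__OTHER__`, while missing values stay apart under `__NAN__`.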
Qualitative Ordinal Feature
[22]:
# Discretization Summary
feature = "education"
discretizer.summary(feature)
[22]:
| dtype | label | content |
|---|---|---|
| str | 10th | [10th] |
| str | 11th | [11th, 12th] |
| str | 7th-8th | [1st-4th, 5th-6th, 7th-8th, 9th, Preschool] |
| str | Assoc-acdm | [Assoc-acdm] |
| str | Assoc-voc | [Assoc-voc] |
| str | Bachelors | [Bachelors] |
| str | HS-grad | [HS-grad] |
| str | Masters | [Masters] |
| str | Prof-school | [Doctorate, Prof-school] |
| str | Some-college | [Some-college] |
[23]:
# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])
Over-represented values of education:
education
HS-grad 0.322013
Some-college 0.223658
Bachelors 0.163259
Masters 0.054539
Assoc-voc 0.042408
11th 0.037136
Assoc-acdm 0.033348
10th 0.028357
Name: proportion, dtype: float64
Under-represented values of education:
education
7th-8th 0.019348
Prof-school 0.016738
9th 0.015637
12th 0.013616
Doctorate 0.011978
5th-6th 0.010928
1st-4th 0.005451
Preschool 0.001587
Name: proportion, dtype: float64
[24]:
discretizer.values_orders[feature].content
[24]:
{'7th-8th': ['9th', '5th-6th', '1st-4th', 'Preschool', '7th-8th'],
'10th': ['10th'],
'11th': ['12th', '11th'],
'HS-grad': ['HS-grad'],
'Some-college': ['Some-college'],
'Assoc-voc': ['Assoc-voc'],
'Assoc-acdm': ['Assoc-acdm'],
'Bachelors': ['Bachelors'],
'Masters': ['Masters'],
'Prof-school': ['Doctorate', 'Prof-school']}
[25]:
# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)
Discretized distribution of education:
education
HS-grad 0.322013
Some-college 0.223658
Bachelors 0.163259
Masters 0.054539
7th-8th 0.052952
11th 0.050751
Assoc-voc 0.042408
Assoc-acdm 0.033348
Prof-school 0.028715
10th 0.028357
Name: proportion, dtype: float64
[26]:
print("Provided ordering:", values_orders[feature])
Provided ordering: ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']
For qualitative ordinal feature education:
- Some over-represented categories have been identified:
  - categories "10th", "11th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors" and "Masters" each represent more than 2.0% of observed data (more than min_freq=0.02)
  - they are kept as their own modality
- Some under-represented categories have been identified:
  - categories "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "12th", "Prof-school" and "Doctorate" each represent less than 2.0% of observed data (less than min_freq=0.02)
  - starting from the least represented category, they are grouped with their respective closest modality:
    - "Preschool" is grouped with "1st-4th" as it is the only adjacent modality in the specified order (see definition of values_orders)
    - In the same manner they are then grouped successively with "5th-6th", "7th-8th" and "9th"
    - The same goes for "12th" with "11th" and "Prof-school" with "Doctorate"
- Missing values are left by themselves (nan value str_nan="__NAN__")
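The ordinal grouping can be pictured as a loop that repeatedly merges the rarest group with one of its neighbours in the specified order. The sketch below is a simplified illustration, not AutoCarver's code: `group_ordinal` is hypothetical, the counts are made up, and picking the neighbour with the closest target rate is an assumption borrowed from the hours-per-week example.

```python
def group_ordinal(order, counts, positives, min_freq):
    """Merge under-represented modalities with an adjacent one, rarest first."""
    total = sum(counts.values())
    # each group holds [member labels, count, positive count]; counts must be > 0
    groups = [[[label], counts[label], positives[label]] for label in order]
    while len(groups) > 1:
        rarest = min(range(len(groups)), key=lambda i: groups[i][1])
        if groups[rarest][1] / total >= min_freq:
            break  # every remaining group is representative enough
        rate = groups[rarest][2] / groups[rarest][1]
        neighbours = [i for i in (rarest - 1, rarest + 1) if 0 <= i < len(groups)]
        target = min(neighbours, key=lambda i: abs(groups[i][2] / groups[i][1] - rate))
        low, high = sorted((rarest, target))
        merged = [
            groups[low][0] + groups[high][0],
            groups[low][1] + groups[high][1],
            groups[low][2] + groups[high][2],
        ]
        groups[low:high + 1] = [merged]  # replace both groups by the merged one
    return [members for members, _, _ in groups]


# hypothetical counts and positive counts for a shortened education scale
order = ["Preschool", "1st-4th", "5th-6th", "11th", "12th",
         "HS-grad", "Bachelors", "Masters", "Doctorate"]
counts = dict(zip(order, [2, 5, 15, 40, 12, 543, 300, 70, 13]))
positives = dict(zip(order, [0, 0, 1, 2, 1, 87, 120, 38, 9]))

result = group_ordinal(order, counts, positives, min_freq=0.02)
print(result)
```

In this toy run, "Preschool" has a single neighbour and is merged upwards step by step, "12th" joins "11th" because their target rates are closer, and "Doctorate" can only join "Masters".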
Saving and Loading Discretizer
Saving
All Discretizers can safely be stored as a .json file.
[27]:
import json
# storing as json file
with open('discretizer.json', 'w') as my_discretizer_json:
    json.dump(discretizer.to_json(), my_discretizer_json)
Loading
Discretizers can safely be loaded from a .json file.
[28]:
import json
from AutoCarver.discretizers import load_discretizer
# loading json file
with open('discretizer.json', 'r') as my_discretizer_json:
    discretizer = load_discretizer(json.load(my_discretizer_json))
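The save and load steps rely on the learned groupings being expressible as plain JSON. The sketch below shows such a round trip with the standard library only. The `learned` dictionary is a hypothetical stand-in built from the education groupings shown earlier; the actual structure returned by `to_json()` is not reproduced here.

```python
import json
import os
import tempfile

# hypothetical stand-in for a learned grouping (not the real to_json() output)
learned = {
    "education": {
        "11th": ["12th", "11th"],
        "Prof-school": ["Doctorate", "Prof-school"],
    }
}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "groupings.json")
    with open(path, "w") as handle:
        json.dump(learned, handle)  # write the groupings to disk
    with open(path, "r") as handle:
        restored = json.load(handle)  # read them back

print(restored == learned)
```

Dictionaries of strings and lists survive the round trip unchanged. Tuples and non-string keys do not (they come back as lists and strings), which is one reason to go through a dedicated serialization method rather than dumping Python objects directly.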
Applying Discretizer
[30]:
dev_set_processed = discretizer.transform(dev_set)
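After transforming the dev set, a quick check is to compare the modalities of a discretized column across both samples. The helpers and toy columns below are hypothetical and written in plain Python; they only illustrate the kind of check one might run.

```python
from collections import Counter


def modality_shares(values):
    """Share of each modality in a discretized column."""
    total = len(values)
    return {modality: count / total for modality, count in Counter(values).items()}


def unseen_modalities(train_values, dev_values):
    """Modalities present in the dev column but never produced on train."""
    return sorted(set(dev_values) - set(train_values))


# hypothetical discretized columns
train_col = ["Private"] * 70 + ["Local-gov"] * 20 + ["__OTHER__"] * 10
dev_col = ["Private"] * 15 + ["Local-gov"] * 4 + ["__OTHER__"] * 1

train_shares = modality_shares(train_col)
dev_shares = modality_shares(dev_col)
largest_gap = max(abs(train_shares[m] - dev_shares.get(m, 0.0)) for m in train_shares)
print(unseen_modalities(train_col, dev_col), round(largest_gap, 4))
```

An empty list of unseen modalities and small gaps between shares suggest that the groupings learned on the training sample carry over to the dev sample.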
What’s next?
Thanks to Discretizers, all of your features are now quantitative ordinal features with sufficiently represented modalities!
Discretizers are directly integrated in Carvers for a better user experience.
Carvers build on this discretization step to find the combination of consecutive modalities that is most associated with the target, so make sure to check out the Carvers Examples!
Well done!
You have successfully navigated the intricacies of feature discretization using the AutoCarver package, creating a dataset with finely tuned discrete representations that enhance the interpretability and effectiveness of your features.
Your meticulous approach to discretizing both quantitative and qualitative attributes, utilizing the QuantitativeDiscretizer and QualitativeDiscretizer components, showcases your commitment to preparing data for meaningful analysis and modeling.
We appreciate your trust in the AutoCarver package, and we hope that the discretized features contribute to the success of your machine learning endeavors. As you move forward with your analyses and predictive modeling tasks, may the insights gained from this well-crafted dataset lead to impactful and informed decisions.
Thank you for choosing AutoCarver as your companion in the data preprocessing journey. Your dedication to refining and optimizing your datasets reflects a commitment to excellence in data science. We look forward to being part of your future data adventures and wish you continued success in your endeavors.