Setting things up

About this notebook

Welcome to this notebook where we will explore the Census Adult Income dataset, a rich source of socio-economic information derived from the 1994 U.S. Census Bureau database. Our focus in this analysis is on employing a robust Python data discretization pipeline: Discretizer. This versatile tool is able to discretize various types of data, both quantitative and qualitative, making it an ideal companion for preprocessing tasks.

Data discretization is a crucial step in preparing datasets for machine learning models. It involves transforming continuous or categorical variables into discrete bins, allowing for improved interpretability, handling non-linearity, and addressing potential outliers. The Discretizer we will employ is designed to handle a diverse range of data types, including quantitative variables such as age and education level, as well as qualitative variables like marital status and occupation.

Throughout this notebook, we will delve into the intricacies of the Census Adult Income dataset, exploring the distribution of features and employing the Discretizer to discretize the data effectively. By the end of this preprocessing journey, we aim to create a dataset that is well-suited for subsequent machine learning tasks, enabling the development of robust models for predicting income levels based on socio-economic attributes.

Let’s embark on this exploration and witness the power of Discretizer in preparing our data for insightful analysis and accurate modeling.

Installation

[1]:

%pip install autocarver

Census Data

In this example notebook, we will use the Census dataset.

The Census Adult Income dataset, commonly referred to as the “Adult” dataset, is a well-known dataset in the realm of machine learning and data analysis. It is frequently used for tasks related to classification and predictive modeling. The dataset is derived from the 1994 U.S. Census Bureau database and contains a diverse set of features that aim to predict whether an individual earns more than $50,000 annually, making it a binary classification problem.

The features in the Adult dataset include demographic information such as age, education level, marital status, occupation, and work-related details like hours worked per week. The primary objective when working with this dataset is typically to build a predictive model capable of discerning between individuals with annual incomes above and below the $50,000 threshold.

[3]:

from ucimlrepo import fetch_ucirepo

# fetch dataset
adult = fetch_ucirepo(id=2)

# data (as pandas dataframes)
adult_data = adult.data.features
adult_data = adult_data.join(adult.data.targets)

# Display the first few rows of the dataset
adult_data.head()

[3]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	income
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

Target type

[4]:

target = "income"

# cleaning target
adult_data[target] = adult_data[target].apply(lambda u: u.replace(".", ""))

# conversion to 0/1
adult_data[target] = (adult_data[target] == ">50K").astype(int)

# target rate
adult_data[target].value_counts(dropna=False)

[4]:

income
0    37155
1    11687
Name: count, dtype: int64

The target "income" is a binary target used in a classification task.
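
The cleaning step above suggests that some raw labels carry a trailing period (for instance ">50K." next to ">50K"). The plain-Python sketch below reproduces the same normalization and 0/1 encoding on a handful of hand-written labels. The sample values are hypothetical and only meant to show why the period is stripped before the comparison.

```python
# Hypothetical raw labels mimicking the two spellings of each class
raw_labels = ["<=50K", ">50K", "<=50K.", ">50K.", "<=50K"]

# Strip the trailing period so both spellings collapse to the same label
clean_labels = [label.replace(".", "") for label in raw_labels]

# Encode the positive class (">50K") as 1 and everything else as 0
binary_target = [int(label == ">50K") for label in clean_labels]

print(sorted(set(clean_labels)))
print(binary_target)
```

Without the cleaning step, ">50K." would not match ">50K" and would be silently encoded as 0.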

Data Sampling

[5]:

from sklearn.model_selection import train_test_split

# stratified sampling by target
train_set, dev_set = train_test_split(adult_data, test_size=0.20, random_state=42, stratify=adult_data[target])

# checking target rate per dataset
train_set[target].mean(), dev_set[target].mean()

[5]:

(0.23927008420136667, 0.23932848807452145)

Picking columns to discretize

[6]:

train_set.head()

[6]:

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-loss	hours-per-week	native-country
34495	37	Private	193106	Bachelors	13	Never-married	Sales	Not-in-family	White	Female	0	30	United-States
18591	56	Self-emp-inc	216636	12th	8	Married-civ-spouse	Exec-managerial	Husband	White	Male	1651	40	United-States
12562	53	Private	126977	HS-grad	9	Separated	Craft-repair	Not-in-family	White	Male	0	35	United-States
552	72	Private	205343	11th	7	Widowed	Adm-clerical	Unmarried	White	Female	0	40	United-States
3479	46	State-gov	106705	Masters	14	Never-married	Exec-managerial	Not-in-family	White	Female	0	38	United-States

[7]:

# column data types
train_set.dtypes

[7]:

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income             int32
dtype: object

"education" is the only qualitative ordinal feature. It will be added to the list of ordinal_features, and its ordering has to be provided in values_orders.
"sex", "marital-status", "occupation", "relationship", "race", "native-country" and "workclass" are qualitative categorical features. Those features will be added to the list of qualitative_features.
"capital-gain" and "capital-loss" are quantitative continuous features, whilst "education-num", "age" and "hours-per-week" can be considered as quantitative discrete features. Those features will be added to the list of quantitative_features.
"fnlwgt" is the weighting column. It is not currently usable in AutoCarver.

[8]:

# lists of features per data type
quantitative_features = ["capital-gain", "capital-loss", "education-num", "hours-per-week", "age"]
qualitative_features = ["sex", "workclass", "marital-status", "occupation", "relationship", "race", "native-country"]
ordinal_features = ["education"]
weighting = ["fnlwgt"]

# user-specified ordering for ordinal features
values_orders = {
    "education": ["Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters", "Prof-school", "Doctorate"]
}

Using Discretizer

Discretizer Settings

Representativeness of modalities

The attribute min_freq sets the minimum frequency required for each basic modality. Discretizers use it as follows:

For quantitative features, it defines the number of quantiles used to initially discretize the features.
For qualitative features, it defines the threshold under which a modality is grouped with either a default value or its closest modality.

[9]:

min_freq = 0.02

Tip: min_freq should be set between 0.01 (slower, more precise, less robust) and 0.2 (faster, more robust).
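
As a rough illustration of the two roles of min_freq, the plain-Python sketch below (independent of AutoCarver) derives a quantile count and flags rare modalities. The 1 / min_freq rule and the toy sample are assumptions made for this example, not the library's documented behaviour.

```python
from collections import Counter

min_freq = 0.02

# Assumption: quantiles of size min_freq, hence about 1 / min_freq of them
n_quantiles = round(1 / min_freq)
print(n_quantiles)

# Hypothetical categorical sample of 100 observations
values = ["a"] * 90 + ["b"] * 9 + ["c"] * 1
frequencies = {value: count / len(values) for value, count in Counter(values).items()}

# Modalities rarer than min_freq are the candidates for grouping
rare = sorted(value for value, freq in frequencies.items() if freq < min_freq)
print(rare)
```

A smaller min_freq means more initial quantiles and fewer modalities flagged as rare, which is consistent with the "slower, more precise, less robust" end of the tip.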

Fitting Discretizer

First, all qualitative features are discretized using QualitativeDiscretizer:
1. (Optionally) Using StringDiscretizer to convert them to str if not already the case
2. For qualitative ordinal features: using OrdinalDiscretizer to group each under-represented value (less frequent than min_freq=0.02) with its closest modality
3. For qualitative categorical features: using CategoricalDiscretizer to group under-represented values (less frequent than min_freq=0.02) into a default value (str_default="__OTHER__")
Second, all quantitative features are discretized using QuantitativeDiscretizer:
1. Using ContinuousDiscretizer for quantile discretization that keeps track of over-represented values (more frequent than min_freq=0.02)
2. Using OrdinalDiscretizer to group any remaining under-represented value (less frequent than min_freq/2=0.01) with its closest modality

[10]:

from AutoCarver.discretizers import Discretizer

# initiating Discretizer
discretizer = Discretizer(
    quantitative_features=quantitative_features,
    qualitative_features=qualitative_features,
    ordinal_features=ordinal_features,
    values_orders=values_orders,
    min_freq=min_freq,
    verbose=True,  # showing statistics
    copy=True,  # whether or not to return a copy of the input dataset
)

# fitting on training sample
train_set_processed = discretizer.fit_transform(train_set, train_set[target])

------
[Discretizer] Fit Qualitative Features
---
 - [OrdinalDiscretizer] Fit ['education']
 - [CategoricalDiscretizer] Fit ['workclass', 'occupation', 'marital-status', 'race', 'native-country', 'relationship', 'sex']
------

------
[Discretizer] Fit Quantitative Features
---
 - [ContinuousDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
 - [OrdinalDiscretizer] Fit ['capital-loss', 'hours-per-week', 'capital-gain', 'education-num', 'age']
------

Discretizer Analysis

Quantitative Continuous Feature

[11]:

# Discretization Summary
feature = "capital-gain"
discretizer.summary(feature)

[11]:

	label	content
dtype
float	0.000e+00 < x <= 3.411e+03	[0.000e+00 < x <= 3.411e+03]
float	1.355e+04 < x	[1.355e+04 < x]
float	3.411e+03 < x <= 7.298e+03	[3.411e+03 < x <= 7.298e+03]
float	7.298e+03 < x <= 1.355e+04	[7.298e+03 < x <= 1.355e+04]
float	x <= 0.000e+00	[x <= 0.000e+00]

[12]:

# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])

Over-represented values of capital-gain:
 capital-gain
0    0.918358
Name: proportion, dtype: float64

[13]:

# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of capital-gain:
 capital-gain
x <= 0.000e+00                0.918358
3.411e+03 < x <= 7.298e+03    0.026847
0.000e+00 < x <= 3.411e+03    0.020474
1.355e+04 < x                 0.019835
7.298e+03 < x <= 1.355e+04    0.014486
Name: proportion, dtype: float64

For quantitative continuous feature capital-gain:

An over-represented value has been identified and kept by itself: value 0 represents 91.8% of observed data (more than min_freq=0.02)
The remaining 8.2% of values have been discretized into quantiles of about 2% each (as specified with min_freq=0.02)
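
The behaviour described above can be mimicked in a few lines of plain Python: isolate the values that are frequent enough on their own, then cut what is left into equal-sized bins. This is a simplified sketch on a hypothetical zero-inflated sample; the helper names split_over_represented and quantile_cuts are made up for the example and are not part of AutoCarver.

```python
from collections import Counter

def split_over_represented(values, min_freq):
    """Separate values frequent enough to stand alone from the rest."""
    counts = Counter(values)
    frequent = sorted(v for v, c in counts.items() if c / len(values) >= min_freq)
    remaining = sorted(v for v in values if v not in frequent)
    return frequent, remaining

def quantile_cuts(sorted_values, bin_size):
    """Upper bound of each full bin; the last bin stays open-ended."""
    return sorted_values[bin_size - 1::bin_size][:-1]

# Hypothetical sample: 90 zeros and 10 distinct positive amounts
sample = [0] * 90 + [100 * k for k in range(1, 11)]
min_freq = 0.05

frequent, remaining = split_over_represented(sample, min_freq)

# Each bin should hold about min_freq of the whole sample
bin_size = round(min_freq * len(sample))
cuts = quantile_cuts(remaining, bin_size)

print(frequent)
print(cuts)
```

Here the value 0 is kept by itself and the ten remaining amounts are split into two bins of five observations, which mirrors the "x <= 0" modality followed by quantile bins in the summary above.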

Quantitative Discrete Feature

[14]:

# Discretization Summary
feature = "hours-per-week"
discretizer.summary(feature)

[14]:

	label	content
dtype
float	1.000e+01 < x <= 1.500e+01	[1.000e+01 < x <= 1.500e+01]
float	1.500e+01 < x <= 2.000e+01	[1.500e+01 < x <= 2.000e+01]
float	2.000e+01 < x <= 2.500e+01	[2.000e+01 < x <= 2.500e+01]
float	2.500e+01 < x <= 3.000e+01	[2.500e+01 < x <= 3.000e+01]
float	3.000e+01 < x <= 3.400e+01	[3.000e+01 < x <= 3.400e+01]
float	3.400e+01 < x <= 3.500e+01	[3.400e+01 < x <= 3.500e+01]
float	3.500e+01 < x <= 3.900e+01	[3.500e+01 < x <= 3.900e+01]
float	3.900e+01 < x <= 4.000e+01	[3.900e+01 < x <= 4.000e+01]
float	4.000e+01 < x <= 4.400e+01	[4.000e+01 < x <= 4.400e+01]
float	4.400e+01 < x <= 4.500e+01	[4.400e+01 < x <= 4.500e+01]
float	4.500e+01 < x <= 4.900e+01	[4.500e+01 < x <= 4.900e+01]
float	4.900e+01 < x <= 5.400e+01	[4.900e+01 < x <= 5.400e+01]
float	5.400e+01 < x <= 5.500e+01	[5.400e+01 < x <= 5.500e+01]
float	5.500e+01 < x <= 6.000e+01	[5.500e+01 < x <= 6.000e+01]
float	6.000e+01 < x <= 7.000e+01	[6.000e+01 < x <= 7.000e+01]
float	7.000e+01 < x	[7.000e+01 < x]
float	x <= 1.000e+01	[x <= 1.000e+01]

[15]:

# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])

Over-represented values of hours-per-week:
 hours-per-week
40    0.464976
50    0.086965
45    0.056970
60    0.044993
35    0.039567
20    0.037084
30    0.034730
55    0.021959
Name: proportion, dtype: float64

[16]:

# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of hours-per-week:
 hours-per-week
3.900e+01 < x <= 4.000e+01    0.464976
4.900e+01 < x <= 5.400e+01    0.093901
4.400e+01 < x <= 4.500e+01    0.056970
5.500e+01 < x <= 6.000e+01    0.049267
1.500e+01 < x <= 2.000e+01    0.046835
2.500e+01 < x <= 3.000e+01    0.039669
3.400e+01 < x <= 3.500e+01    0.039567
2.000e+01 < x <= 2.500e+01    0.029816
3.500e+01 < x <= 3.900e+01    0.027948
x <= 1.000e+01                0.023367
5.400e+01 < x <= 5.500e+01    0.021959
4.500e+01 < x <= 4.900e+01    0.020884
1.000e+01 < x <= 1.500e+01    0.019911
4.000e+01 < x <= 4.400e+01    0.019169
6.000e+01 < x <= 7.000e+01    0.018504
7.000e+01 < x                 0.016072
3.000e+01 < x <= 3.400e+01    0.011184
Name: proportion, dtype: float64

[17]:

# values between 50 and 55 hours per week are under-represented
print(f"Observed data for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55').shape[0] / len(train_set):.2%}")

Observed data for hours-per-week values strictly between 50 and 55: 0.69%

[18]:

print(f"Target rate for {feature} values equal to 50: {train_set.query(f'`{feature}`==50')[target].mean():.2%}")
print(f"Target rate for {feature} values strictly between 50 and 55: {train_set.query(f'50<`{feature}`<55')[target].mean():.2%}")
print(f"Target rate for {feature} values equal to 55: {train_set.query(f'`{feature}`==55')[target].mean():.2%}")

Target rate for hours-per-week values equal to 50: 44.29%
Target rate for hours-per-week values strictly between 50 and 55: 32.10%
Target rate for hours-per-week values equal to 55: 47.32%

For quantitative discrete feature hours-per-week:

Some over-represented values have been identified:
- values 20, 30, 35, 40, 45, 50, 55 and 60 each represent more than 2.0% of observed data (more than min_freq=0.02)
- they are kept as their own modality
Some under-represented values have been identified:
- values strictly between 50 and 55 represent only 0.7% of observed data, which is not enough to form a modality of their own (at least min_freq/2=0.01 is required)
- hence they are grouped with their closest modality in terms of target rate, 50 (32.1% is closer to 44.3% than to 47.3%)
The remaining values have been discretized into quantiles of about 2% each (as specified with min_freq=0.02)
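
The choice of the closest modality can be checked by hand with the target rates printed above. The helper below is a hypothetical illustration of that comparison (the tie-breaking rule in favour of the left neighbour is an assumption), not AutoCarver's own code.

```python
def closest_neighbour(rate, left_rate, right_rate):
    """Return which adjacent modality has the closest target rate."""
    return "left" if abs(rate - left_rate) <= abs(rate - right_rate) else "right"

# Target rates reported above for hours-per-week
rate_50 = 0.4429        # modality 50 (left neighbour)
rate_between = 0.3210   # under-represented values strictly between 50 and 55
rate_55 = 0.4732        # modality 55 (right neighbour)

choice = closest_neighbour(rate_between, rate_50, rate_55)
print(choice)
```

The gap to 50 is about 12.2 points against about 15.2 points to 55, so the under-represented values join the modality of 50.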

Qualitative Categorical Feature

[19]:

# Discretization Summary
feature = "workclass"
discretizer.summary(feature)

[19]:

	label	content
dtype
str	?	[?]
str	Federal-gov	[Federal-gov]
str	Local-gov	[Local-gov]
str	Private	[Private]
str	Self-emp-inc	[Self-emp-inc]
str	Self-emp-not-inc	[Self-emp-not-inc]
str	State-gov	[State-gov]
str	__NAN__	[__NAN__]
str	__OTHER__	[Never-worked, Without-pay]

[20]:

# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])

Over-represented values of workclass:
 workclass
Private             0.708568
Self-emp-not-inc    0.080096
Local-gov           0.066181
State-gov           0.040622
?                   0.038221
Self-emp-inc        0.035584
Federal-gov         0.030023
Name: proportion, dtype: float64

Under-represented values of workclass:
 workclass
Without-pay     0.000496
Never-worked    0.000209
Name: proportion, dtype: float64

[21]:

# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of workclass:
 workclass
Private             0.694623
Self-emp-not-inc    0.078520
Local-gov           0.064879
State-gov           0.039823
?                   0.037468
Self-emp-inc        0.034883
Federal-gov         0.029432
__NAN__             0.019681
__OTHER__           0.000691
Name: proportion, dtype: float64

For qualitative categorical feature workclass:

Some over-represented categories have been identified:
- categories "Private", "Self-emp-not-inc", "Local-gov", "State-gov", "?", "Self-emp-inc" and "Federal-gov", each represent more than 2.0% of observed data (more than min_freq=0.02)
- they are kept as their own modality
Some under-represented categories have been identified:
- categories "Never-worked" and "Without-pay", each represent less than 2.0% of observed data (less than min_freq=0.02)
- they are grouped in the default value str_default="__OTHER__"
Missing values are kept in a modality of their own, whatever their frequency (nan value str_nan="__NAN__")
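
The grouping of rare categories into a default value can be sketched in plain Python as follows. The function group_rare_categories and the sample are hypothetical and only illustrate the thresholding logic; missing values are left out of the sketch.

```python
from collections import Counter

def group_rare_categories(values, min_freq, str_default="__OTHER__"):
    """Replace categories rarer than min_freq with a default label."""
    counts = Counter(values)
    n = len(values)
    return [v if counts[v] / n >= min_freq else str_default for v in values]

# Hypothetical sample of 100 observations
sample = ["Private"] * 95 + ["Local-gov"] * 3 + ["Without-pay"] + ["Never-worked"]

grouped = Counter(group_rare_categories(sample, min_freq=0.02))
print(grouped)
```

As in the notebook output, the default modality can itself stay below min_freq when the rare categories are few.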

Qualitative Ordinal Feature

[22]:

# Discretization Summary
feature = "education"
discretizer.summary(feature)

[22]:

	label	content
dtype
str	10th	[10th]
str	11th	[11th, 12th]
str	7th-8th	[1st-4th, 5th-6th, 7th-8th, 9th, Preschool]
str	Assoc-acdm	[Assoc-acdm]
str	Assoc-voc	[Assoc-voc]
str	Bachelors	[Bachelors]
str	HS-grad	[HS-grad]
str	Masters	[Masters]
str	Prof-school	[Doctorate, Prof-school]
str	Some-college	[Some-college]

[23]:

# Not discretized distribution
stats = train_set[feature].value_counts(dropna=True, normalize=True)
print(f"Over-represented values of {feature}:\n", stats[stats >= min_freq])
print(f"\nUnder-represented values of {feature}:\n", stats[stats < min_freq])

Over-represented values of education:
 education
HS-grad         0.322013
Some-college    0.223658
Bachelors       0.163259
Masters         0.054539
Assoc-voc       0.042408
11th            0.037136
Assoc-acdm      0.033348
10th            0.028357
Name: proportion, dtype: float64

Under-represented values of education:
 education
7th-8th        0.019348
Prof-school    0.016738
9th            0.015637
12th           0.013616
Doctorate      0.011978
5th-6th        0.010928
1st-4th        0.005451
Preschool      0.001587
Name: proportion, dtype: float64

[24]:

discretizer.values_orders[feature].content

[24]:

{'7th-8th': ['9th', '5th-6th', '1st-4th', 'Preschool', '7th-8th'],
 '10th': ['10th'],
 '11th': ['12th', '11th'],
 'HS-grad': ['HS-grad'],
 'Some-college': ['Some-college'],
 'Assoc-voc': ['Assoc-voc'],
 'Assoc-acdm': ['Assoc-acdm'],
 'Bachelors': ['Bachelors'],
 'Masters': ['Masters'],
 'Prof-school': ['Doctorate', 'Prof-school']}

[25]:

# Discretized distribution
disc_stats = train_set_processed[feature].value_counts(dropna=True, normalize=True)
print(f"Discretized distribution of {feature}:\n", disc_stats)

Discretized distribution of education:
 education
HS-grad         0.322013
Some-college    0.223658
Bachelors       0.163259
Masters         0.054539
7th-8th         0.052952
11th            0.050751
Assoc-voc       0.042408
Assoc-acdm      0.033348
Prof-school     0.028715
10th            0.028357
Name: proportion, dtype: float64

[26]:

print("Provided ordering:", values_orders[feature])

Provided ordering: ['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-voc', 'Assoc-acdm', 'Bachelors', 'Masters', 'Prof-school', 'Doctorate']

For qualitative ordinal feature education:

Some over-represented categories have been identified:
- categories "10th", "11th", "HS-grad", "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors" and "Masters" each represent more than 2.0% of observed data (more than min_freq=0.02)
- they are kept as their own modality
Some under-represented categories have been identified:
- categories "Preschool", "1st-4th", "5th-6th", "7th-8th", "9th", "12th", "Prof-school" and "Doctorate" each represent less than 2.0% of observed data (less than min_freq=0.02)
- starting from the least represented category, they are grouped with their respective closest modality:
  - "Preschool" is grouped with "1st-4th" as it is the only adjacent modality in the specified order (see definition of values_orders)
  - In the same manner, they are then grouped successively with "5th-6th", "7th-8th" and "9th"
  - Likewise, "12th" is grouped with "11th" and "Doctorate" with "Prof-school"
Missing values are left by themselves (nan value str_nan="__NAN__")
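
The ordinal grouping described above can be approximated with a short loop: repeatedly take the least represented group and merge it with an adjacent group in the provided order. The sketch below is a simplified, hypothetical reconstruction on made-up counts. It assumes the neighbour is chosen by closest target rate when two are available; group_ordinal is not an AutoCarver function.

```python
def group_ordinal(order, counts, positives, min_freq):
    """Merge under-represented ordered modalities with an adjacent group."""
    total = sum(counts.values())
    # one group per modality, kept in the user-specified order
    groups = [{"members": [m], "n": counts[m], "pos": positives[m]} for m in order]

    def rate(group):
        return group["pos"] / group["n"]

    while len(groups) > 1:
        # least represented group
        idx = min(range(len(groups)), key=lambda i: groups[i]["n"])
        if groups[idx]["n"] / total >= min_freq:
            break  # every group is representative enough
        # adjacent groups only (a single one at either end of the ordering)
        neighbours = [i for i in (idx - 1, idx + 1) if 0 <= i < len(groups)]
        target = min(neighbours, key=lambda i: abs(rate(groups[i]) - rate(groups[idx])))
        first, second = sorted((idx, target))
        merged = {
            "members": groups[first]["members"] + groups[second]["members"],
            "n": groups[first]["n"] + groups[second]["n"],
            "pos": groups[first]["pos"] + groups[second]["pos"],
        }
        groups[first:second + 1] = [merged]
    return [g["members"] for g in groups]

# Hypothetical counts and positive-target counts for 100 observations
order = ["Preschool", "1st-4th", "11th", "12th", "HS-grad", "Prof-school", "Doctorate"]
counts = {"Preschool": 1, "1st-4th": 5, "11th": 10, "12th": 3,
          "HS-grad": 67, "Prof-school": 10, "Doctorate": 4}
positives = {"Preschool": 0, "1st-4th": 0, "11th": 1, "12th": 0,
             "HS-grad": 14, "Prof-school": 7, "Doctorate": 3}

result = group_ordinal(order, counts, positives, min_freq=0.05)
print(result)
```

Categories at either end of the ordering have a single neighbour, which is why "Preschool" and "Doctorate" have only one possible partner, whereas "12th" is compared against both "11th" and "HS-grad".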

Saving and Loading Discretizer

Saving

All Discretizers can safely be stored as a .json file.

[27]:

import json

# storing as json file
with open('discretizer.json', 'w') as my_discretizer_json:
    json.dump(discretizer.to_json(), my_discretizer_json)

Loading

Discretizers can safely be loaded from a .json file.

[28]:

import json

from AutoCarver.discretizers import load_discretizer

# loading json file
with open('discretizer.json', 'r') as my_discretizer_json:
    discretizer = load_discretizer(json.load(my_discretizer_json))

Applying Discretizer

[30]:

dev_set_processed = discretizer.transform(dev_set)

What’s next?

Thanks to Discretizers, all of your features are now quantitative ordinal features with representative enough modalities!
Discretizers are directly integrated in Carvers for a better user experience.
Carvers make good use of this discretization step to find the consecutive combination of modalities most associated with the target, so make sure to check out Carvers Examples!

Well done!

You have successfully navigated the intricacies of feature discretization using the AutoCarver package, creating a dataset with finely tuned discrete representations that enhance the interpretability and effectiveness of your features.

Your meticulous approach to discretizing both quantitative and qualitative attributes, utilizing the QuantitativeDiscretizer and QualitativeDiscretizer components, showcases your commitment to preparing data for meaningful analysis and modeling.

We appreciate your trust in the AutoCarver package, and we hope that the discretized features contribute to the success of your machine learning endeavors. As you move forward with your analyses and predictive modeling tasks, may the insights gained from this well-crafted dataset lead to impactful and informed decisions.

Thank you for choosing AutoCarver as your companion in the data preprocessing journey. Your dedication to refining and optimizing your datasets reflects a commitment to excellence in data science. We look forward to being part of your future data adventures and wish you continued success in your endeavors.