Predicting Age in Census Data¶
Introduction¶
The objective of this toy project is to predict the age of an individual with the 1994 US Census Data using multiple linear regression. We use the Statsmodels
and Patsy
modules for this task with Pyhon version >= 3.6. The dataset was sourced from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/adult (Lichman, 2013).
This report is organized as follows:
- Overview section describes the dataset used and the features in this dataset.
- Data Preparation section covers data cleaning and data preparation steps.
- Data Exploration section explores dataset features and their inter-relationships.
- Statistical Modeling and Performance Evaluation section first fits a full multiple linear regression model and performs diagnostic checks. Next, we perform backwards variable selection using p-values to obtain a reduced model, after which we perform another set of diagnostic checks on the reduced model.
- Summary and Conclusions section provides a summary of our work and presents our findings.
Overview¶
Data Source¶
The UCI Machine Learning Repository provides five datasets, but only adult.data
, adult.test
, and adult.names
were useful in this project. The adult.data
and adult.test
are the training and test datasets respectively. The adult.names
file contains the details of the variables (a.k.a. features or attributes). The training dataset has 32,561 observations (a.k.a. instances or records) and the test dataset has 16,281 observations. Both datasets consist of 14 descriptive (a.k.a. independent) features and one target (a.k.a. response or dependent) feature. In this project, we combine both training and test data into one.
Project Objective¶
Our goal is to see if we can predict an individual's age within an acceptable margin of error using multiple linear regression primarily with just main affects.
Target Feature¶
Our target feature is age
, which is a continuous numerical feature. Hence, our project is on a regression problem.
Descriptive Features¶
The variable descriptions below are from the adult.names
file:
workclass
: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.fnlwgt
: continuous.education
: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.education-num
: continuous.marital-status
: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.occupation
: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-*inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.relationship
: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.race
: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.sex
: Female, Male.capital-gain
: continuous.capital-loss
: continuous.hours-per-week
: continuous.native-country
: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.income
: binary, 1: earns over \$50k a year, 0: earns less than \\$50k a year.
Most of the descriptive features are self-explanatory, except fnlwgt
which stands for "Final Weight" defined by the US Census. This weight is an "estimate of the number of units in the target population that the responding unit represents" (Lichman, 2013). This feature aims to allocate similar weights to people with similar demographic characteristics.
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
We read the training and test datasets directly from the data URLs. Also, since the datasets do not contain the attribute names, they are explicitly specified during data loading process. The adultData
dataset is read first and then it is concatenated with adultTest
as just data
.
# Gettting data from the web
url = (
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
"http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)
# Specifying the attribute names
attributeNames = [
'age',
'workclass',
'fnlwgt',
'education',
'education-num',
'marital-status',
'occupation',
'relationship',
'race',
'sex',
'capital-gain',
'capital-loss',
'hours-per-week',
'native-country',
'income',
]
# Read in data
adultData = pd.read_csv(url[0], sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv(url[1] , sep = ',', names = attributeNames, skiprows = 1)
# Join the two datasets together
data = pd.concat([adultData,adultTest])
# we will not need the datasets below anymore, so let's delete them to save memory
del adultData, adultTest
# Display randomly selected 10 rows
data.sample(10, random_state=999)
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
77 | 67 | ? | 212759 | 10th | 6 | Married-civ-spouse | ? | Husband | White | Male | 0 | 0 | 2 | United-States | <=50K |
30807 | 62 | Self-emp-not-inc | 224520 | HS-grad | 9 | Married-civ-spouse | Sales | Husband | White | Male | 0 | 0 | 90 | United-States | >50K |
23838 | 32 | Private | 262153 | HS-grad | 9 | Divorced | Craft-repair | Unmarried | White | Male | 0 | 0 | 35 | United-States | <=50K |
24124 | 41 | Private | 39581 | Some-college | 10 | Divorced | Adm-clerical | Not-in-family | Black | Female | 0 | 0 | 24 | El-Salvador | <=50K |
22731 | 66 | ? | 186032 | Assoc-voc | 11 | Widowed | ? | Not-in-family | White | Female | 2964 | 0 | 30 | United-States | <=50K |
2975 | 27 | Private | 37933 | Some-college | 10 | Never-married | Adm-clerical | Unmarried | Black | Female | 0 | 0 | 40 | United-States | <=50K |
4457 | 31 | State-gov | 93589 | HS-grad | 9 | Divorced | Protective-serv | Own-child | Other | Male | 0 | 0 | 40 | United-States | <=50K |
32318 | 34 | Private | 117963 | Bachelors | 13 | Married-civ-spouse | Craft-repair | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
12685 | 61 | Self-emp-not-inc | 176965 | Prof-school | 15 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 50 | United-States | >50K |
2183 | 47 | Private | 156926 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Husband | White | Male | 0 | 0 | 40 | United-States | >50K. |
Note: Alternatively, you can download the datasets from UCI Repository as txt
files to your computer and then read them in as below.
adultData = pd.read_csv('adult.data.txt', sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv('adult.test.txt', sep = ',', names = attributeNames, skiprows = 1)
Data Cleaning and Transformation¶
We first confirm that the feature types match the descriptions outlined in the documentation.
print(f"Shape of the dataset is {data.shape} \n")
print(f"Data types are below where 'object' indicates a string type: ")
print(data.dtypes)
Shape of the dataset is (48842, 15) Data types are below where 'object' indicates a string type: age int64 workclass object fnlwgt int64 education object education-num int64 marital-status object occupation object relationship object race object sex object capital-gain int64 capital-loss int64 hours-per-week int64 native-country object income object dtype: object
Checking for Missing Values¶
print(f"\nNumber of missing values for each feature:")
print(data.isnull().sum())
Number of missing values for each feature: age 0 workclass 0 fnlwgt 0 education 0 education-num 0 marital-status 0 occupation 0 relationship 0 race 0 sex 0 capital-gain 0 capital-loss 0 hours-per-week 0 native-country 0 income 0 dtype: int64
On surface, no attribute contains any missing values, though we shall see below that the missing values are coded with a question mark. We will address this issue later.
Summary Statistics¶
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of continuous features</b>'))
data.describe(include='int64')
age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | |
---|---|---|---|---|---|---|
count | 48842.000000 | 4.884200e+04 | 48842.000000 | 48842.000000 | 48842.000000 | 48842.000000 |
mean | 38.643585 | 1.896641e+05 | 10.078089 | 1079.067626 | 87.502314 | 40.422382 |
std | 13.710510 | 1.056040e+05 | 2.570973 | 7452.019058 | 403.004552 | 12.391444 |
min | 17.000000 | 1.228500e+04 | 1.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 28.000000 | 1.175505e+05 | 9.000000 | 0.000000 | 0.000000 | 40.000000 |
50% | 37.000000 | 1.781445e+05 | 10.000000 | 0.000000 | 0.000000 | 40.000000 |
75% | 48.000000 | 2.376420e+05 | 12.000000 | 0.000000 | 0.000000 | 45.000000 |
max | 90.000000 | 1.490400e+06 | 16.000000 | 99999.000000 | 4356.000000 | 99.000000 |
display(HTML('<b>Table 2: Summary of categorical features</b>'))
data.describe(include='object')
workclass | education | marital-status | occupation | relationship | race | sex | native-country | income | |
---|---|---|---|---|---|---|---|---|---|
count | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 | 48842 |
unique | 9 | 16 | 7 | 15 | 6 | 5 | 2 | 42 | 4 |
top | Private | HS-grad | Married-civ-spouse | Prof-specialty | Husband | White | Male | United-States | <=50K |
freq | 33906 | 15784 | 22379 | 6172 | 19716 | 41762 | 32650 | 43832 | 24720 |
Table 2 shows the feature income
has four categories (or a cardinality of 4). It was supposed to be 2 since income must be binary. We shall explain how to fix this cardinality issue later.
Continuous Features¶
As discussed earlier, the fnlwgt
variable has no predictive power, so it is removed. In addition, since we have an education categorical feature, we will go ahead and remove the education-num
feature as it relays the same information and therefore redundant.
data = data.drop(columns=['fnlwgt', 'education-num'])
data['age'].describe()
count 48842.000000 mean 38.643585 std 13.710510 min 17.000000 25% 28.000000 50% 37.000000 75% 48.000000 max 90.000000 Name: age, dtype: float64
The range of age
appears reasonable as the minimum and maximum ages are 17 and 90 respectively.
Next, we define capital = capital-gain - capital-loss
and then remove the individual gain and loss variables. The summary statistic for capital
is displayed below.
data['capital'] = data['capital-gain'] - data['capital-loss']
data = data.drop(columns=['capital-gain', 'capital-loss'])
data['capital'].describe()
count 48842.000000 mean 991.565313 std 7475.549906 min -4356.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 99999.000000 Name: capital, dtype: float64
Fixing Column Names¶
Some column names contain a minus sign, which can be problematic when modeling. Basically, when we write the regression formula, a minus sign will mean "exclude this variable", which is clearly not our intent here. For this reason, we modify the column names so that the minus signs are replaced with an underscore sign.
data.columns = [colname.replace('-', '_') for colname in list(data.columns)]
data.head()
age | workclass | education | marital_status | occupation | relationship | race | sex | hours_per_week | native_country | income | capital | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | Bachelors | Never-married | Adm-clerical | Not-in-family | White | Male | 40 | United-States | <=50K | 2174 |
1 | 50 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | Male | 13 | United-States | <=50K | 0 |
2 | 38 | Private | HS-grad | Divorced | Handlers-cleaners | Not-in-family | White | Male | 40 | United-States | <=50K | 0 |
3 | 53 | Private | 11th | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 40 | United-States | <=50K | 0 |
4 | 28 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 40 | Cuba | <=50K | 0 |
Categorical Features¶
Let's have a look at the unique values of the categorical columns. In Pandas
, string types are of data type "object", and usually these would be the categorical features.
categoricalColumns = data.columns[data.dtypes==object].tolist()
for col in categoricalColumns:
print('Unique values for ' + col)
print(data[col].unique())
print('')
Unique values for workclass [' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov' ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked'] Unique values for education [' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college' ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school' ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th'] Unique values for marital_status [' Never-married' ' Married-civ-spouse' ' Divorced' ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed'] Unique values for occupation [' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty' ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving' ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?' ' Protective-serv' ' Armed-Forces' ' Priv-house-serv'] Unique values for relationship [' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried' ' Other-relative'] Unique values for race [' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other'] Unique values for sex [' Male' ' Female'] Unique values for native_country [' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South' ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran' ' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand' ' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic' ' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia' ' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago' ' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary' ' Holand-Netherlands'] Unique values for income [' <=50K' ' >50K' ' <=50K.' ' >50K.']
Some categorical attributes contain excessive white spaces, which makes life hard when filtering data. We will apply the strip()
function to remove extra white spaces.
for col in categoricalColumns:
data[col] = data[col].str.strip()
WARNING: The Statsmodels
module does not play nice when you have a minus sign in levels of categorical variables, especially when you try to some sort of automatic variable selection. The reason is that the minus sign has a special meaning in the underlying Patsy
module: it means remove this feature. So, we will replace all the minus signs in categorical variable level names with an underscore sign. In addition, a dot sign is not allowed either.
for col in categoricalColumns:
data[col] = data[col].str.replace('-', '_')
data.head()
age | workclass | education | marital_status | occupation | relationship | race | sex | hours_per_week | native_country | income | capital | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State_gov | Bachelors | Never_married | Adm_clerical | Not_in_family | White | Male | 40 | United_States | <=50K | 2174 |
1 | 50 | Self_emp_not_inc | Bachelors | Married_civ_spouse | Exec_managerial | Husband | White | Male | 13 | United_States | <=50K | 0 |
2 | 38 | Private | HS_grad | Divorced | Handlers_cleaners | Not_in_family | White | Male | 40 | United_States | <=50K | 0 |
3 | 53 | Private | 11th | Married_civ_spouse | Handlers_cleaners | Husband | Black | Male | 40 | United_States | <=50K | 0 |
4 | 28 | Private | Bachelors | Married_civ_spouse | Prof_specialty | Wife | Black | Female | 40 | Cuba | <=50K | 0 |
The workclass
, occupation
, and native-country
features contain some missing values encoded as "?". These observations comprise 7.4% of the total number of observations.
mask = (data['workclass'] == '?') | (data['occupation'] == '?') | (data['native_country'] == '?')
mask.value_counts(normalize = True)*100
False 92.588346 True 7.411654 dtype: float64
We now remove the rows with missing occupation, workclass, and native-country where missing values are encoded as "?".
data = data[data['workclass'] != "?"]
data = data[data['occupation'] != "?"]
data = data[data['native_country'] != "?"]
Since native-country
is too granular and unbalanced, we group countries as "Other" and "United_States". Likewise, we also categorize races as "Other" and "White".
data.loc[data['native_country'] != 'United_States', 'native_country'] = 'Other'
data.loc[data['race'] != 'White', 'race'] = 'Other'
TIP 1: As a general comment, sometimes numerical features in a dataset actually represent categorical features. As an example, suppose each state of Australia is encoded as an integer between 1 and 8, say as StateID
column in a dataframe called df
. This would still be a categorical variable. For your code to work correctly in such cases, you need to change the data type of this column from numeric to string as follows:
df['StateID'] = df['StateID'].astype(str)
TIP 2: If you have a string column, say named col
, that is actually real-valued, you can change its data type as follows:
df['col'] = df['col'].astype(float)
Dependent Variable¶
We need to correct the levels of the income
categorical feature, which is our dependent variable, to make sure it is binary.
print('Before correction, the number of unique income labels are: ')
print(data['income'].value_counts())
print("")
data['income'] = data['income'].str.rstrip(".")
print('After removing the dot, the number of unique income labels are: ')
print(data['income'].value_counts())
print("")
Before correction, the number of unique income labels are: <=50K 22654 <=50K. 11360 >50K 7508 >50K. 3700 Name: income, dtype: int64 After removing the dot, the number of unique income labels are: <=50K 34014 >50K 11208 Name: income, dtype: int64
WARNING: The Statsmodels
module does not play nice when you have mathematical symbols such as "-", "+", "<" and ">" in levels of categorical variables (try it and you will get a syntax error). For this reason, we will re-code income as low
and high
as below using the replace()
function in Pandas
.
data['income'] = data['income'].replace({'<=50K': 'low', '>50K': 'high'})
data['income'].value_counts()
low 34014 high 11208 Name: income, dtype: int64
Data Exploration¶
Our dataset can now be considered "clean" and ready for visualisation and statistical modeling.
Univariate Visualisation¶
Let's get a histogram of native country.
ax = data['native_country'].value_counts().plot(kind = 'bar')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.tight_layout()
plt.title('Figure 1: Bar Chart of Native Country', fontsize = 15)
plt.show();
Let's display a boxplot and histogram for age
. Figure 2 shows that this variable is right-skewed.
# get a box plot of age
sns.boxplot(data['age']).set_title('Figure 2: Box Plot of Age', fontsize = 15)
plt.show();
# get a histogram of age with kernel density estimate
sns.distplot(data['age'], kde = True).set_title('Figure 3: Histogram of Age', fontsize = 15)
plt.show();
# store the values of hours-per-week
hpw = data['hours_per_week']
# get a scatter plot
plt.scatter(hpw, data['age'], alpha = 0.3)
plt.title('Figure 4: Scatterplot of hours per week and age', fontsize = 15)
plt.xlabel('hours_per_week')
plt.ylabel('age')
plt.show();
Categorical Attributes by Age¶
We can see that the distribution of age between each gender is similar.
# Creating a boxplot
sns.boxplot(data['sex'], data['age']);
plt.title('Figure 5: Boxplot of age by gender', fontsize = 15)
plt.show();
The distribution of married women and single women differ but have a similar median as seen in Figure 6. The whiskers suggest that women above the age of 70 are more likely to be single.
# Storing a list of booleans corresponding to whether the person is female and a wife or Not_in_family
family_female_mask = (data['relationship'].isin(['Not_in_family','Wife'])) & (data['sex'].isin(['Female']))
# Using the list of booleans previously found to select the index of rows
family_female = data[family_female_mask]
# Creating the boxplot
sns.boxplot(family_female['relationship'], family_female['age']);
plt.title('Figure 6: Boxplot of female relationship status by age', fontsize = 15)
plt.show();
Facet plots¶
From Figure 7, we see that generally the distribution of male bachelors and high school graduates is somewhat comparable within each relationship level.
# Getting the index of those who have completed their Bachelors or HS graduate
edu_mask = data['education'].isin(['Bachelors','HS_grad'])
# Getting the index of those who are male and Not_in_family or a Husband
family_male_mask = (data['relationship'].isin(['Not_in_family','Husband'])) & (data['sex'].isin(['Male']))
# Selecting the rows of those who are Not_in_family, husband or wife and
# have completed either a Bachelors or just graduated high school
education_relationship = data[(edu_mask & family_female_mask) | (edu_mask & family_male_mask)]
# Creating the boxplot
sns.boxplot(education_relationship['relationship'], education_relationship['age'],
hue = education_relationship['education'])
plt.title('Figure 7: Boxplot of age broken down by relationship and education', fontsize = 15)
plt.show();
Although there is no clear overall pattern, we observe in Figure 8 that no government worker under the age of 40 earns over \$50k a year unless they work more than 20 hours a week.
# Getting the index of those who work in the government
gov_mask = data['workclass'].isin(['Federal_gov','Local_gov','State_gov'])
# creating a dataframe of those who work in the government
gov = data[gov_mask]
# creating a scatterplot
sns.scatterplot(gov['hours_per_week'], gov['age'], hue = gov['income'])
plt.title('Figure 8: Scatterplot of hours per week by age coloured by income for government workers', fontsize = 15);
plt.legend(loc = 'upper right')
plt.show();