Regression Case Study

Regression_Case_Study1_web

Predicting Age in Census Data

Introduction

The objective of this toy project is to predict the age of an individual with the 1994 US Census Data using multiple linear regression. We use the Statsmodels and Patsy modules for this task with Pyhon version >= 3.6. The dataset was sourced from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/adult (Lichman, 2013).

This report is organized as follows:

  • Overview section describes the dataset used and the features in this dataset.
  • Data Preparation section covers data cleaning and data preparation steps.
  • Data Exploration section explores dataset features and their inter-relationships.
  • Statistical Modeling and Performance Evaluation section first fits a full multiple linear regression model and performs diagnostic checks. Next, we perform backwards variable selection using p-values to obtain a reduced model, after which we perform another set of diagnostic checks on the reduced model.
  • Summary and Conclusions section provides a summary of our work and presents our findings.

Overview

Data Source

The UCI Machine Learning Repository provides five datasets, but only adult.data, adult.test, and adult.names were useful in this project. The adult.data and adult.test are the training and test datasets respectively. The adult.names file contains the details of the variables (a.k.a. features or attributes). The training dataset has 32,561 observations (a.k.a. instances or records) and the test dataset has 16,281 observations. Both datasets consist of 14 descriptive (a.k.a. independent) features and one target (a.k.a. response or dependent) feature. In this project, we combine both training and test data into one.

Project Objective

Our goal is to see if we can predict an individual's age within an acceptable margin of error using multiple linear regression primarily with just main affects.

Target Feature

Our target feature is age, which is a continuous numerical feature. Hence, our project is on a regression problem.

Descriptive Features

The variable descriptions below are from the adult.names file:

  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-*inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • income: binary, 1: earns over \$50k a year, 0: earns less than \\$50k a year.

Most of the descriptive features are self-explanatory, except fnlwgt which stands for "Final Weight" defined by the US Census. This weight is an "estimate of the number of units in the target population that the responding unit represents" (Lichman, 2013). This feature aims to allocate similar weights to people with similar demographic characteristics.

Data Preparation

Preliminaries

For further information on how to prepare your data for statistical modeling, please refer to this page on our website.

First, let's import all the common modules we will be using.

In [1]:
# Importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import patsy
import warnings
###
warnings.filterwarnings('ignore')
###
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")

We read the training and test datasets directly from the data URLs. Also, since the datasets do not contain the attribute names, they are explicitly specified during data loading process. The adultData dataset is read first and then it is concatenated with adultTest as just data.

In [2]:
# Gettting data from the web
url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)

# Specifying the attribute names
attributeNames = [
    'age',
    'workclass',
    'fnlwgt',
    'education',
    'education-num',
    'marital-status',
    'occupation',
    'relationship',
    'race',
    'sex',
    'capital-gain',
    'capital-loss',
    'hours-per-week',
    'native-country',
    'income',
]

# Read in data
adultData = pd.read_csv(url[0], sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv(url[1] , sep = ',', names = attributeNames, skiprows = 1)

# Join the two datasets together
data = pd.concat([adultData,adultTest])

# we will not need the datasets below anymore, so let's delete them to save memory
del adultData, adultTest

# Display randomly selected 10 rows
data.sample(10, random_state=999)
Out[2]:
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
77 67 ? 212759 10th 6 Married-civ-spouse ? Husband White Male 0 0 2 United-States <=50K
30807 62 Self-emp-not-inc 224520 HS-grad 9 Married-civ-spouse Sales Husband White Male 0 0 90 United-States >50K
23838 32 Private 262153 HS-grad 9 Divorced Craft-repair Unmarried White Male 0 0 35 United-States <=50K
24124 41 Private 39581 Some-college 10 Divorced Adm-clerical Not-in-family Black Female 0 0 24 El-Salvador <=50K
22731 66 ? 186032 Assoc-voc 11 Widowed ? Not-in-family White Female 2964 0 30 United-States <=50K
2975 27 Private 37933 Some-college 10 Never-married Adm-clerical Unmarried Black Female 0 0 40 United-States <=50K
4457 31 State-gov 93589 HS-grad 9 Divorced Protective-serv Own-child Other Male 0 0 40 United-States <=50K
32318 34 Private 117963 Bachelors 13 Married-civ-spouse Craft-repair Husband White Male 0 0 40 United-States >50K
12685 61 Self-emp-not-inc 176965 Prof-school 15 Married-civ-spouse Prof-specialty Husband White Male 0 0 50 United-States >50K
2183 47 Private 156926 Bachelors 13 Married-civ-spouse Prof-specialty Husband White Male 0 0 40 United-States >50K.

Note: Alternatively, you can download the datasets from UCI Repository as txt files to your computer and then read them in as below.

adultData = pd.read_csv('adult.data.txt', sep = ',', names = attributeNames, header = None)
adultTest = pd.read_csv('adult.test.txt', sep = ',', names = attributeNames, skiprows = 1)

Data Cleaning and Transformation

We first confirm that the feature types match the descriptions outlined in the documentation.

In [3]:
print(f"Shape of the dataset is {data.shape} \n")
print(f"Data types are below where 'object' indicates a string type: ")
print(data.dtypes)
Shape of the dataset is (48842, 15) 

Data types are below where 'object' indicates a string type: 
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
income            object
dtype: object

Checking for Missing Values

In [4]:
print(f"\nNumber of missing values for each feature:")
print(data.isnull().sum())
Number of missing values for each feature:
age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
income            0
dtype: int64

On surface, no attribute contains any missing values, though we shall see below that the missing values are coded with a question mark. We will address this issue later.

Summary Statistics

In [5]:
from IPython.display import display, HTML
display(HTML('<b>Table 1: Summary of continuous features</b>'))
data.describe(include='int64')
Table 1: Summary of continuous features
Out[5]:
age fnlwgt education-num capital-gain capital-loss hours-per-week
count 48842.000000 4.884200e+04 48842.000000 48842.000000 48842.000000 48842.000000
mean 38.643585 1.896641e+05 10.078089 1079.067626 87.502314 40.422382
std 13.710510 1.056040e+05 2.570973 7452.019058 403.004552 12.391444
min 17.000000 1.228500e+04 1.000000 0.000000 0.000000 1.000000
25% 28.000000 1.175505e+05 9.000000 0.000000 0.000000 40.000000
50% 37.000000 1.781445e+05 10.000000 0.000000 0.000000 40.000000
75% 48.000000 2.376420e+05 12.000000 0.000000 0.000000 45.000000
max 90.000000 1.490400e+06 16.000000 99999.000000 4356.000000 99.000000
In [6]:
display(HTML('<b>Table 2: Summary of categorical features</b>'))
data.describe(include='object')
Table 2: Summary of categorical features
Out[6]:
workclass education marital-status occupation relationship race sex native-country income
count 48842 48842 48842 48842 48842 48842 48842 48842 48842
unique 9 16 7 15 6 5 2 42 4
top Private HS-grad Married-civ-spouse Prof-specialty Husband White Male United-States <=50K
freq 33906 15784 22379 6172 19716 41762 32650 43832 24720

Table 2 shows the feature income has four categories (or a cardinality of 4). It was supposed to be 2 since income must be binary. We shall explain how to fix this cardinality issue later.

Continuous Features

As discussed earlier, the fnlwgt variable has no predictive power, so it is removed. In addition, since we have an education categorical feature, we will go ahead and remove the education-num feature as it relays the same information and therefore redundant.

In [7]:
data = data.drop(columns=['fnlwgt', 'education-num'])
In [8]:
data['age'].describe()
Out[8]:
count    48842.000000
mean        38.643585
std         13.710510
min         17.000000
25%         28.000000
50%         37.000000
75%         48.000000
max         90.000000
Name: age, dtype: float64

The range of age appears reasonable as the minimum and maximum ages are 17 and 90 respectively.

Next, we define capital = capital-gain - capital-loss and then remove the individual gain and loss variables. The summary statistic for capital is displayed below.

In [9]:
data['capital'] = data['capital-gain'] - data['capital-loss']
data = data.drop(columns=['capital-gain', 'capital-loss'])
data['capital'].describe()
Out[9]:
count    48842.000000
mean       991.565313
std       7475.549906
min      -4356.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      99999.000000
Name: capital, dtype: float64

Fixing Column Names

Some column names contain a minus sign, which can be problematic when modeling. Basically, when we write the regression formula, a minus sign will mean "exclude this variable", which is clearly not our intent here. For this reason, we modify the column names so that the minus signs are replaced with an underscore sign.

In [10]:
data.columns = [colname.replace('-', '_') for colname in list(data.columns)]
data.head()
Out[10]:
age workclass education marital_status occupation relationship race sex hours_per_week native_country income capital
0 39 State-gov Bachelors Never-married Adm-clerical Not-in-family White Male 40 United-States <=50K 2174
1 50 Self-emp-not-inc Bachelors Married-civ-spouse Exec-managerial Husband White Male 13 United-States <=50K 0
2 38 Private HS-grad Divorced Handlers-cleaners Not-in-family White Male 40 United-States <=50K 0
3 53 Private 11th Married-civ-spouse Handlers-cleaners Husband Black Male 40 United-States <=50K 0
4 28 Private Bachelors Married-civ-spouse Prof-specialty Wife Black Female 40 Cuba <=50K 0

Categorical Features

Let's have a look at the unique values of the categorical columns. In Pandas, string types are of data type "object", and usually these would be the categorical features.

In [11]:
categoricalColumns = data.columns[data.dtypes==object].tolist()

for col in categoricalColumns:
    print('Unique values for ' + col)
    print(data[col].unique())
    print('')
Unique values for workclass
[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked']

Unique values for education
[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th']

Unique values for marital_status
[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed']

Unique values for occupation
[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv']

Unique values for relationship
[' Not-in-family' ' Husband' ' Wife' ' Own-child' ' Unmarried'
 ' Other-relative']

Unique values for race
[' White' ' Black' ' Asian-Pac-Islander' ' Amer-Indian-Eskimo' ' Other']

Unique values for sex
[' Male' ' Female']

Unique values for native_country
[' United-States' ' Cuba' ' Jamaica' ' India' ' ?' ' Mexico' ' South'
 ' Puerto-Rico' ' Honduras' ' England' ' Canada' ' Germany' ' Iran'
 ' Philippines' ' Italy' ' Poland' ' Columbia' ' Cambodia' ' Thailand'
 ' Ecuador' ' Laos' ' Taiwan' ' Haiti' ' Portugal' ' Dominican-Republic'
 ' El-Salvador' ' France' ' Guatemala' ' China' ' Japan' ' Yugoslavia'
 ' Peru' ' Outlying-US(Guam-USVI-etc)' ' Scotland' ' Trinadad&Tobago'
 ' Greece' ' Nicaragua' ' Vietnam' ' Hong' ' Ireland' ' Hungary'
 ' Holand-Netherlands']

Unique values for income
[' <=50K' ' >50K' ' <=50K.' ' >50K.']

Some categorical attributes contain excessive white spaces, which makes life hard when filtering data. We will apply the strip() function to remove extra white spaces.

In [12]:
for col in categoricalColumns:
    data[col] = data[col].str.strip()

WARNING: The Statsmodels module does not play nice when you have a minus sign in levels of categorical variables, especially when you try to some sort of automatic variable selection. The reason is that the minus sign has a special meaning in the underlying Patsy module: it means remove this feature. So, we will replace all the minus signs in categorical variable level names with an underscore sign. In addition, a dot sign is not allowed either.

In [13]:
for col in categoricalColumns:
    data[col] = data[col].str.replace('-', '_')
data.head()
Out[13]:
age workclass education marital_status occupation relationship race sex hours_per_week native_country income capital
0 39 State_gov Bachelors Never_married Adm_clerical Not_in_family White Male 40 United_States <=50K 2174
1 50 Self_emp_not_inc Bachelors Married_civ_spouse Exec_managerial Husband White Male 13 United_States <=50K 0
2 38 Private HS_grad Divorced Handlers_cleaners Not_in_family White Male 40 United_States <=50K 0
3 53 Private 11th Married_civ_spouse Handlers_cleaners Husband Black Male 40 United_States <=50K 0
4 28 Private Bachelors Married_civ_spouse Prof_specialty Wife Black Female 40 Cuba <=50K 0

The workclass, occupation, and native-country features contain some missing values encoded as "?". These observations comprise 7.4% of the total number of observations.

In [14]:
mask = (data['workclass'] == '?') | (data['occupation'] == '?') | (data['native_country'] == '?')
mask.value_counts(normalize = True)*100
Out[14]:
False    92.588346
True      7.411654
dtype: float64

We now remove the rows with missing occupation, workclass, and native-country where missing values are encoded as "?".

In [15]:
data = data[data['workclass'] != "?"] 
data = data[data['occupation'] != "?"] 
data = data[data['native_country'] != "?"]

Since native-country is too granular and unbalanced, we group countries as "Other" and "United_States". Likewise, we also categorize races as "Other" and "White".

In [16]:
data.loc[data['native_country'] != 'United_States', 'native_country'] = 'Other'
data.loc[data['race'] != 'White', 'race'] = 'Other'

TIP 1: As a general comment, sometimes numerical features in a dataset actually represent categorical features. As an example, suppose each state of Australia is encoded as an integer between 1 and 8, say as StateID column in a dataframe called df. This would still be a categorical variable. For your code to work correctly in such cases, you need to change the data type of this column from numeric to string as follows:

df['StateID'] = df['StateID'].astype(str)

TIP 2: If you have a string column, say named col, that is actually real-valued, you can change its data type as follows:

df['col'] = df['col'].astype(float)

Dependent Variable

We need to correct the levels of the income categorical feature, which is our dependent variable, to make sure it is binary.

In [17]:
print('Before correction, the number of unique income labels are: ')
print(data['income'].value_counts())
print("")

data['income'] = data['income'].str.rstrip(".")

print('After removing the dot, the number of unique income labels are: ')
print(data['income'].value_counts())
print("")
Before correction, the number of unique income labels are: 
<=50K     22654
<=50K.    11360
>50K       7508
>50K.      3700
Name: income, dtype: int64

After removing the dot, the number of unique income labels are: 
<=50K    34014
>50K     11208
Name: income, dtype: int64

WARNING: The Statsmodels module does not play nice when you have mathematical symbols such as "-", "+", "<" and ">" in levels of categorical variables (try it and you will get a syntax error). For this reason, we will re-code income as low and high as below using the replace() function in Pandas.

In [18]:
data['income'] = data['income'].replace({'<=50K': 'low', '>50K': 'high'})
data['income'].value_counts()
Out[18]:
low     34014
high    11208
Name: income, dtype: int64

Data Exploration

Our dataset can now be considered "clean" and ready for visualisation and statistical modeling.

Univariate Visualisation

Let's get a histogram of native country.

In [19]:
ax = data['native_country'].value_counts().plot(kind = 'bar')
ax.set_xticklabels(ax.get_xticklabels(), rotation = 90)
plt.tight_layout()
plt.title('Figure 1: Bar Chart of Native Country', fontsize = 15)
plt.show();

Let's display a boxplot and histogram for age. Figure 2 shows that this variable is right-skewed.

In [20]:
# get a box plot of age
sns.boxplot(data['age']).set_title('Figure 2: Box Plot of Age', fontsize = 15)
plt.show();
In [21]:
# get a histogram of age with kernel density estimate
sns.distplot(data['age'], kde = True).set_title('Figure 3: Histogram of Age', fontsize = 15)
plt.show();

Multivariate Visualisation

Scatterplot of Numeric Features and Age

The scatterplot in Figure 4 shows no clear correlation between the age and hours_per_week numeric variables.

In [22]:
# store the values of hours-per-week
hpw = data['hours_per_week']

# get a scatter plot
plt.scatter(hpw, data['age'], alpha = 0.3)
plt.title('Figure 4: Scatterplot of hours per week and age', fontsize = 15)
plt.xlabel('hours_per_week')
plt.ylabel('age')
plt.show();

Categorical Attributes by Age

We can see that the distribution of age between each gender is similar.

In [23]:
# Creating a boxplot
sns.boxplot(data['sex'], data['age']);
plt.title('Figure 5: Boxplot of age by gender', fontsize = 15)
plt.show();

The distribution of married women and single women differ but have a similar median as seen in Figure 6. The whiskers suggest that women above the age of 70 are more likely to be single.

In [24]:
# Storing a list of booleans corresponding to whether the person is female and a wife or Not_in_family
family_female_mask = (data['relationship'].isin(['Not_in_family','Wife'])) & (data['sex'].isin(['Female']))

# Using the list of booleans previously found to select the index of rows
family_female = data[family_female_mask]

# Creating the boxplot
sns.boxplot(family_female['relationship'], family_female['age']);
plt.title('Figure 6: Boxplot of female relationship status by age', fontsize = 15)
plt.show();

Facet plots

From Figure 7, we see that generally the distribution of male bachelors and high school graduates is somewhat comparable within each relationship level.

In [25]:
# Getting the index of those who have completed their Bachelors or HS graduate
edu_mask = data['education'].isin(['Bachelors','HS_grad'])

# Getting the index of those who are male and Not_in_family or a Husband
family_male_mask = (data['relationship'].isin(['Not_in_family','Husband'])) & (data['sex'].isin(['Male']))

# Selecting the rows of those who are Not_in_family, husband or wife and 
# have completed either a Bachelors or just graduated high school
education_relationship = data[(edu_mask & family_female_mask) | (edu_mask & family_male_mask)]

# Creating the boxplot
sns.boxplot(education_relationship['relationship'], education_relationship['age'], 
            hue = education_relationship['education'])
plt.title('Figure 7: Boxplot of age broken down by relationship and education', fontsize = 15)
plt.show();

Although there is no clear overall pattern, we observe in Figure 8 that no government worker under the age of 40 earns over \$50k a year unless they work more than 20 hours a week.

In [26]:
# Getting the index of those who work in the government
gov_mask = data['workclass'].isin(['Federal_gov','Local_gov','State_gov'])

# creating a dataframe of those who work in the government
gov = data[gov_mask]

# creating a scatterplot
sns.scatterplot(gov['hours_per_week'], gov['age'], hue = gov['income'])
plt.title('Figure 8: Scatterplot of hours per week by age coloured by income for government workers', fontsize = 15);
plt.legend(loc = 'upper right')
plt.show();