
# Introduction to data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you'll also learn the indispensable skills of data processing and subsetting.

## Getting started

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by importing the dataset of 20,000 observations from the Cloud.

In [1]:
# for Mac OS users only!
# if you run into any SSL certification issues,
# you may need to run the following command for a Mac OS installation.
# \$/Applications/Python 3.x/Install Certificates.command
import os, ssl
# Fall back to an unverified HTTPS context. Note that this disables certificate checks.
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
        getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context

In [2]:
import pandas as pd

cdc

Out[2]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f
5 very good 1 1 0 64 114 114 55 f
6 very good 1 1 0 71 194 185 31 m
7 very good 0 1 0 67 170 160 45 m
8 good 0 1 1 65 150 130 27 f
9 good 1 1 0 70 180 170 44 m
10 excellent 1 1 1 69 186 175 46 m
11 fair 1 1 1 69 168 148 62 m
12 excellent 1 0 1 66 185 220 21 m
13 excellent 1 1 1 70 170 170 69 m
14 fair 1 0 0 69 170 170 23 m
15 good 1 1 1 73 185 175 79 m
16 good 0 0 1 67 156 150 47 m
17 fair 0 1 1 71 185 185 76 m
18 good 1 1 1 75 200 190 43 m
19 very good 1 1 0 67 125 120 33 f
20 very good 1 1 0 69 200 150 48 f
21 good 1 1 1 65 160 140 54 f
22 very good 0 1 1 73 160 160 43 m
23 good 1 1 1 67 165 158 30 m
24 very good 0 0 1 64 105 120 27 f
25 good 0 1 0 68 190 150 33 f
26 excellent 1 0 1 67 190 165 44 f
27 excellent 1 0 1 69 160 150 42 f
28 very good 1 1 0 61 115 105 32 f
29 excellent 1 1 1 74 185 175 63 m
... ... ... ... ... ... ... ... ... ...
19970 excellent 1 1 0 73 168 178 19 m
19971 very good 1 0 1 59 85 100 36 f
19972 excellent 1 1 1 65 145 135 42 f
19973 fair 1 1 0 67 110 120 35 f
19974 good 0 1 0 66 156 150 79 f
19975 fair 1 1 0 65 230 190 45 m
19976 good 1 1 0 67 198 140 39 f
19977 very good 0 1 1 65 180 150 87 f
19978 good 0 1 1 64 135 135 39 m
19979 good 1 1 0 69 265 200 28 m
19980 excellent 1 1 1 75 195 195 36 m
19981 very good 0 1 1 74 210 210 40 m
19982 very good 1 0 1 63 171 130 31 f
19983 very good 1 1 1 71 190 180 41 m
19984 good 1 1 1 69 180 160 57 m
19985 very good 1 1 1 64 120 115 71 f
19986 excellent 1 1 1 63 140 115 38 f
19987 good 1 1 0 72 200 195 40 m
19988 very good 1 1 1 74 230 190 53 m
19989 good 1 1 0 73 230 200 38 m
19990 excellent 1 1 0 71 195 190 43 m
19991 very good 1 1 1 72 210 175 52 m
19992 very good 1 1 0 71 180 180 36 m
19993 very good 0 1 1 63 165 120 31 f
19994 good 0 1 1 69 224 224 73 m
19995 good 1 1 0 66 215 140 23 f
19996 excellent 0 1 0 73 200 185 35 m
19997 poor 0 1 0 65 216 150 57 f
19998 good 1 1 0 67 165 165 81 f
19999 good 1 1 1 69 170 165 83 m

20000 rows × 9 columns

The data set cdc that shows up is a data matrix, with each row representing a case and each column representing a variable. This kind of data format is called a data frame (a DataFrame in pandas), a term that will be used throughout the labs.
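The import command itself is not shown in the cell above. As a minimal sketch of how such a table is typically loaded, the example below reads CSV text with pd.read_csv. The CSV format, the name cdc_sample, and the in-memory text (the first five rows of the output above, standing in for the real file) are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for the real file: the first five rows shown above, as CSV text.
# In practice you would pass a file path or URL to pd.read_csv instead.
csv_text = """genhlth,exerany,hlthplan,smoke100,height,weight,wtdesire,age,gender
good,0,1,0,70,175,175,77,m
good,0,1,1,64,125,115,33,f
good,1,1,1,60,105,105,49,f
good,1,1,0,66,132,124,42,f
very good,0,1,0,61,150,130,55,f
"""

cdc_sample = pd.read_csv(io.StringIO(csv_text))

print(cdc_sample.shape)   # (number of rows, number of columns)
print(cdc_sample.dtypes)  # storage type pandas inferred for each column
```

The dtypes output reports how pandas stores each column (integers or text), which is related to, but not the same as, the statistical data type of the variable.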

To view the names of the variables, use columns.values:

In [3]:
cdc.columns.values

Out[3]:
array(['genhlth', 'exerany', 'hlthplan', 'smoke100', 'height', 'weight',
'wtdesire', 'age', 'gender'], dtype=object)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or had not (0). The other variables record the respondent’s height in inches (height), weight in pounds (weight), desired weight in pounds (wtdesire), age in years (age), and gender.

#### Exercise 1

How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

We can have a look at the first few entries (rows) of our data with the command

In [4]:
cdc.head()

Out[4]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f

and similarly we can look at the last few by typing.

In [5]:
cdc.tail()

Out[5]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
19995 good 1 1 0 66 215 140 23 f
19996 excellent 0 1 0 73 200 185 35 m
19997 poor 0 1 0 65 216 150 57 f
19998 good 1 1 0 67 165 165 81 f
19999 good 1 1 1 69 170 165 83 m

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

## Summaries and tables

The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function describe returns a numerical summary: count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. For weight this is

In [6]:
cdc['weight'].describe()

Out[6]:
count    20000.00000
mean       169.68295
std         40.08097
min         68.00000
25%        140.00000
50%        165.00000
75%        190.00000
max        500.00000
Name: weight, dtype: float64

If you wanted to compute the interquartile range for the respondents’ weight, you would read the first quartile (25%) and third quartile (75%) from the describe output above and then enter

In [7]:
190 - 140

Out[7]:
50
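Rather than copying the quartiles by hand, the same subtraction can be done with the quantile method. The sketch below uses only the weights of the first ten respondents shown earlier, so its result differs from the full-sample value of 50. The variable names are illustrative.

```python
import pandas as pd

# Weights of the first ten respondents, copied from the output shown earlier.
weight = pd.Series([175, 125, 105, 132, 150, 114, 194, 170, 150, 180], name="weight")

q1 = weight.quantile(0.25)  # first quartile
q3 = weight.quantile(0.75)  # third quartile
iqr = q3 - q1               # interquartile range

print(q1, q3, iqr)
```

On the full data set, the same two calls on cdc['weight'] should reproduce the 25% and 75% rows of the describe output.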

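pandas also provides methods to compute summary statistics one by one. For instance, to calculate the mean, variance, and median of weight, type the following. Note that a notebook cell displays only the value of its last expression, so only the median appears in the output; wrap each call in print() to see all three.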

In [8]:
cdc['weight'].mean()
cdc['weight'].var()
cdc['weight'].median()

Out[8]:
165.0
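To see all three statistics at once, one option is to print each value or to pass a list of statistic names to agg. This is a small sketch on the first ten weights from the data shown earlier, not on the full data set.

```python
import pandas as pd

# Weights of the first ten respondents, copied from the output shown earlier.
weight = pd.Series([175, 125, 105, 132, 150, 114, 194, 170, 150, 180], name="weight")

# Option 1: print each statistic explicitly.
print(weight.mean())
print(weight.var())     # sample variance (divides by n - 1)
print(weight.median())

# Option 2: compute several statistics in one call.
summary = weight.agg(["mean", "var", "median"])
print(summary)
```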

While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function value_counts does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked 100 cigarettes in their lifetime, type

In [9]:
cdc['smoke100'].value_counts()

Out[9]:
0    10559
1     9441
Name: smoke100, dtype: int64

or instead look at the relative frequency distribution by typing

In [10]:
cdc['smoke100'].value_counts(normalize = True)

Out[10]:
0    0.52795
1    0.47205
Name: smoke100, dtype: float64

Notice that setting the parameter normalize to True makes value_counts return proportions (relative frequencies) instead of raw counts.
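The relative frequencies are simply the counts divided by their total. The sketch below checks this on the smoke100 values of the first ten respondents shown earlier; the variable names are illustrative.

```python
import pandas as pd

# smoke100 for the first ten respondents, copied from the output shown earlier.
smoke100 = pd.Series([0, 1, 1, 0, 0, 0, 0, 0, 1, 0], name="smoke100")

counts = smoke100.value_counts()                # raw counts per response
props = smoke100.value_counts(normalize=True)   # proportions per response
manual = counts / counts.sum()                  # the same proportions by hand

print(counts)
print(props)
```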

Now let's import the matplotlib library to create plots. When running Python from the command line, the graphs are typically shown in a separate window. In a Jupyter Notebook, you can display the graphs within the notebook itself by running the %matplotlib inline magic command.

In [11]:
import matplotlib.pyplot as plt
%matplotlib inline


You can change the format to svg for better quality figures. You can also try the retina format and see which one looks better on your computer's screen.

In [12]:
%config InlineBackend.figure_format = 'retina'


You can also change the default style of plots. Let's go for our favourite style, ggplot from R.

In [13]:
plt.style.use('ggplot')


Let's also make the size of plots bigger.

In [14]:
plt.rcParams['figure.figsize'] = (10,5)


Now we can make a bar plot of the entries in the table by calling plot with kind = 'bar' on the result of value_counts.

In [15]:
cdc['smoke100'].value_counts().plot(kind = 'bar')
plt.show();


Notice what we’ve done here! We computed the table of cdc['smoke100'] with value_counts and then immediately applied the plotting method with kind = 'bar'. You could also break this into two steps by typing the following:

In [16]:
smoke = cdc['smoke100'].value_counts()
smoke.plot(kind = 'bar')
plt.show();


Here, we’ve made a new object called smoke (the contents of which we can see by typing smoke into a cell) and then used it as the input for plot.

#### Exercise 2

Create a numerical summary for height and age, and compute the interquartile range for each. Compute the relative frequency distribution for gender and exerany. How many males are in the sample? What proportion of the sample reports being in excellent health?

Combining groupby with value_counts() lets you tabulate one variable within the levels of another. For example, to examine how many participants have smoked within each gender, we could use the following.

In [17]:
cdc.groupby('gender')['smoke100'].value_counts().unstack() #  By doing unstack we are transforming the last level of the index to the columns.

Out[17]:
smoke100 0 1
gender
f 6012 4419
m 4547 5022

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

In [18]:
from statsmodels.graphics.mosaicplot import mosaic
mosaic(cdc, ['gender', 'smoke100'])
plt.show();
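The gender-by-smoking table shown earlier can also be built with pd.crosstab, which may read more directly than the groupby, value_counts, and unstack chain. This is an alternative sketch, not the lab's own method, and it uses only the first ten respondents, so the counts differ from the full table.

```python
import pandas as pd

# gender and smoke100 for the first ten respondents, copied from the output shown earlier.
sample = pd.DataFrame({
    "gender": list("mfffffmmfm"),
    "smoke100": [0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
})

# Contingency table of counts: rows are gender, columns are smoke100.
table = pd.crosstab(sample["gender"], sample["smoke100"])

# Row proportions: within each gender, the share of each smoke100 value.
row_props = pd.crosstab(sample["gender"], sample["smoke100"], normalize="index")

print(table)
print(row_props)
```

Row proportions are often easier to compare across groups than raw counts when the groups differ in size.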


#### Exercise 3

What does the mosaic plot reveal about smoking habits and gender?

## Interlude: How Python thinks about data

DataFrames are like a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the DataFrame by typing

In [19]:
cdc.shape

Out[19]:
(20000, 9)

which will return the number of rows and columns. Now, if we want to access a subset of the full DataFrame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

In [20]:
cdc.iloc[566,5] # This is the equivalent of cdc[567,6] in R.

Out[20]:
160

which gives us the weight of the 567th person (or observation). Remember that indexing in Python starts at 0, so the first element of a list or DataFrame is selected with index 0, and the 567th row is at position 566.
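Hard-coding a column position such as 5 is easy to get wrong. As a sketch, the position can be looked up from the column name with columns.get_loc, or the name can be used directly with loc. The three-column frame below is a reduced, illustrative stand-in for cdc, so weight sits at a different position than in the full data.

```python
import pandas as pd

# A reduced stand-in for cdc with three of its columns (first ten respondents).
sample = pd.DataFrame({
    "height": [70, 64, 60, 66, 61, 64, 71, 67, 65, 70],
    "weight": [175, 125, 105, 132, 150, 114, 194, 170, 150, 180],
    "age": [77, 33, 49, 42, 55, 55, 31, 45, 27, 44],
})

pos = sample.columns.get_loc("weight")  # integer position of the weight column

by_position = sample.iloc[2, pos]       # third row, by integer positions
by_label = sample.loc[2, "weight"]      # row label 2, column name

print(pos, by_position, by_label)
```

iloc always counts positions from 0, while loc uses the row labels and column names, which is why the two agree here only because the row labels happen to be 0, 1, 2, and so on.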

To see the weights for the first 10 respondents we can type

In [21]:
cdc.iloc[0:10, 5] # Keep in mind that the ending index is excluded in Python.

Out[21]:
0    175
1    125
2    105
3    132
4    150
5    114
6    194
7    170
8    150
9    180
Name: weight, dtype: int64

Finally, if we want all of the data for the first 10 respondents, type

In [22]:
cdc.iloc[0:10,]

Out[22]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f
5 very good 1 1 0 64 114 114 55 f
6 very good 1 1 0 71 194 185 31 m
7 very good 0 1 0 67 170 160 45 m
8 good 0 1 1 65 150 130 27 f
9 good 1 1 0 70 180 170 44 m

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. As a rule, we omit the column number to see all columns in a DataFrame. To access all the observations, put a colon in place of the row index. Try the following to see the weights for all 20,000 respondents fly by on your screen:

In [23]:
cdc.iloc[:, 5]

Out[23]:
0        175
1        125
2        105
3        132
4        150
5        114
6        194
7        170
8        150
9        180
10       186
11       168
12       185
13       170
14       170
15       185
16       156
17       185
18       200
19       125
20       200
21       160
22       160
23       165
24       105
25       190
26       190
27       160
28       115
29       185
...
19970    168
19971     85
19972    145
19973    110
19974    156
19975    230
19976    198
19977    180
19978    135
19979    265
19980    195
19981    210
19982    171
19983    190
19984    180
19985    120
19986    140
19987    200
19988    230
19989    230
19990    195
19991    210
19992    180
19993    165
19994    224
19995    215
19996    200
19997    216
19998    165
19999    170
Name: weight, Length: 20000, dtype: int64

Recall that the sixth column (position 5) holds respondents’ weight, so the command above reported all of the weights in the data set. An alternative way to access the weight data is by referring to the column name. Previously, we used cdc.columns.values to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.

In [24]:
cdc['weight']

Out[24]:
0        175
1        125
2        105
3        132
4        150
5        114
6        194
7        170
8        150
9        180
10       186
11       168
12       185
13       170
14       170
15       185
16       156
17       185
18       200
19       125
20       200
21       160
22       160
23       165
24       105
25       190
26       190
27       160
28       115
29       185
...
19970    168
19971     85
19972    145
19973    110
19974    156
19975    230
19976    198
19977    180
19978    135
19979    265
19980    195
19981    210
19982    171
19983    190
19984    180
19985    120
19986    140
19987    200
19988    230
19989    230
19990    195
19991    210
19992    180
19993    165
19994    224
19995    215
19996    200
19997    216
19998    165
19999    170
Name: weight, Length: 20000, dtype: int64

This tells Python to look in the DataFrame cdc for the column called weight. Since that’s a single column (a pandas Series), we can subset it by just adding another single index inside square brackets. We see the weight for the 567th respondent by typing

In [25]:
cdc['weight'][566]

Out[25]:
160

Similarly, for just the first 10 respondents

In [26]:
cdc['weight'][0:10]

Out[26]:
0    175
1    125
2    105
3    132
4    150
5    114
6    194
7    170
8    150
9    180
Name: weight, dtype: int64

The command above returns the same result as the cdc.iloc[0:10, 5] command.

## A little more on subsetting

It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

In [27]:
cdc['gender'] == 'm'

Out[27]:
0         True
1        False
2        False
3        False
4        False
5        False
6         True
7         True
8        False
9         True
10        True
11        True
12        True
13        True
14        True
15        True
16        True
17        True
18        True
19       False
20       False
21       False
22        True
23        True
24       False
25       False
26       False
27       False
28       False
29        True
...
19970     True
19971    False
19972    False
19973    False
19974    False
19975     True
19976    False
19977    False
19978     True
19979     True
19980     True
19981     True
19982    False
19983     True
19984     True
19985    False
19986    False
19987     True
19988     True
19989     True
19990     True
19991     True
19992     True
19993    False
19994     True
19995    False
19996     True
19997    False
19998    False
19999     True
Name: gender, Length: 20000, dtype: bool

or

In [28]:
cdc['age'] > 30

Out[28]:
0         True
1         True
2         True
3         True
4         True
5         True
6         True
7         True
8        False
9         True
10        True
11        True
12       False
13        True
14       False
15        True
16        True
17        True
18        True
19        True
20        True
21        True
22        True
23       False
24       False
25        True
26        True
27        True
28        True
29        True
...
19970    False
19971     True
19972     True
19973     True
19974     True
19975     True
19976     True
19977     True
19978     True
19979    False
19980     True
19981     True
19982     True
19983     True
19984     True
19985     True
19986     True
19987     True
19988     True
19989     True
19990     True
19991     True
19992     True
19993     True
19994     True
19995    False
19996     True
19997     True
19998     True
19999     True
Name: age, Length: 20000, dtype: bool

These commands produce a Series of True and False values. There is one value for each respondent, where True indicates that the person was male (via the first command) or older than 30 (second command).
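Because Python treats True as 1 and False as 0, summing such a boolean Series counts the matching respondents, and taking its mean gives their proportion. A short sketch on the first ten respondents shown earlier, with illustrative variable names:

```python
import pandas as pd

# gender and age for the first ten respondents, copied from the output shown earlier.
sample = pd.DataFrame({
    "gender": list("mfffffmmfm"),
    "age": [77, 33, 49, 42, 55, 55, 31, 45, 27, 44],
})

is_male = sample["gender"] == "m"
over_30 = sample["age"] > 30

n_male = is_male.sum()        # number of True values
share_male = is_male.mean()   # proportion of True values
n_over_30 = over_30.sum()

print(n_male, share_male, n_over_30)
```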

Suppose we want to extract just the data for the men in the sample, or just for those over 30. For example, the command

In [29]:
mdata = cdc[cdc['gender'] == 'm']


will create a new data set called mdata that contains only the men from the cdc data set. You can check its dimensions with mdata.shape and take a peek at the first several rows as usual

In [30]:
mdata.head()

Out[30]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
6 very good 1 1 0 71 194 185 31 m
7 very good 0 1 0 67 170 160 45 m
9 good 1 1 0 70 180 170 44 m
10 excellent 1 1 1 69 186 175 46 m

This new data set contains all the same variables but just under half the rows. It is also possible to tell Python to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.
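One detail worth noticing in the output above is that the subset keeps the original row labels (0, 6, 7, 9, 10, ...). The sketch below, on a ten-row stand-in for cdc, shows how position-based and label-based selection then differ, and how reset_index can renumber the rows. The names sample, men, and men_renumbered are illustrative.

```python
import pandas as pd

# weight and gender for the first ten respondents, copied from the output shown earlier.
sample = pd.DataFrame({
    "weight": [175, 125, 105, 132, 150, 114, 194, 170, 150, 180],
    "gender": list("mfffffmmfm"),
})

men = sample[sample["gender"] == "m"]  # keeps the original labels 0, 6, 7, 9

second_by_position = men["weight"].iloc[1]  # second row of the subset
same_by_label = men["weight"].loc[6]        # the row whose label is 6

# Renumber the rows 0, 1, 2, ... if the original labels are no longer needed.
men_renumbered = men.reset_index(drop=True)

print(list(men.index), second_by_position, same_by_label)
print(list(men_renumbered.index))
```

In this subset there is no row labelled 1, so a label-based lookup for 1 would fail, while iloc[1] still returns the second row.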

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

In [31]:
m_and_over30 = cdc[(cdc['gender'] == 'm') & (cdc['age'] > 30)]
m_and_over30.head()  # display the first rows of the subset

Out[31]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
6 very good 1 1 0 71 194 185 31 m
7 very good 0 1 0 67 170 160 45 m
9 good 1 1 0 70 180 170 44 m
10 excellent 1 1 1 69 186 175 46 m

will give you the data for men over the age of 30. The | character is read “or” so that

In [32]:
m_or_over30 = cdc[(cdc['gender'] == 'm') | (cdc['age'] > 30)]
m_or_over30.head()  # display the first rows of the subset

Out[32]:
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.
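Two practical notes on combining conditions, sketched on the first ten respondents shown earlier. First, each condition needs its own parentheses, because & and | bind more tightly than == and >. Second, the query method offers an alternative spelling that some find easier to read; it is shown here as an option, not as the lab's method, and the variable names are illustrative.

```python
import pandas as pd

# gender and age for the first ten respondents, copied from the output shown earlier.
sample = pd.DataFrame({
    "gender": list("mfffffmmfm"),
    "age": [77, 33, 49, 42, 55, 55, 31, 45, 27, 44],
})

# Boolean masks: parentheses around each condition are required.
men_and_over30 = sample[(sample["gender"] == "m") & (sample["age"] > 30)]
men_or_over30 = sample[(sample["gender"] == "m") | (sample["age"] > 30)]

# The same "and" subset written as a query string.
via_query = sample.query("gender == 'm' and age > 30")

print(len(men_and_over30), len(men_or_over30), len(via_query))
```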

#### Exercise 4

Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.

## Quantitative data

With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke100 and gender, so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.

In [33]:
cdc['height'].plot(kind='box')
plt.show();


You can compare the locations of the components of the box by examining the summary statistics.

In [34]:
cdc['height'].describe()

Out[34]:
count    20000.000000
mean        67.182900
std          4.125954
min         48.000000
25%         64.000000
50%         67.000000
75%         70.000000
max         93.000000
Name: height, dtype: float64

Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a boxplot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

In [35]:
cdc.boxplot(column = 'height', by = 'gender')
plt.show();


Instead of calling plot on a single column, here we use pandas' boxplot() to give us box plots of heights where the groups are defined by gender.
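A numerical companion to the side-by-side box plots is to compute the same summary separately for each group with groupby. The sketch below uses the heights of the first ten respondents shown earlier, so the numbers are only illustrative of the full data.

```python
import pandas as pd

# height and gender for the first ten respondents, copied from the output shown earlier.
sample = pd.DataFrame({
    "height": [70, 64, 60, 66, 61, 64, 71, 67, 65, 70],
    "gender": list("mfffffmmfm"),
})

# One row of summary statistics per gender.
by_gender = sample.groupby("gender")["height"].describe()

print(by_gender)
```

The 25%, 50%, and 75% columns correspond to the edges and center line of each box in the plot.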

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight to height ratio and can be calculated as:

In [36]:
from IPython.display import Image
Image(url= 'https://wikimedia.org/api/rest_v1/media/math/render/svg/a25f48e7bcb8270653f7b027e6dce80f0b6fcd90')

Out[36]:

Here, 703 is the approximate conversion factor that lets the metric definition of BMI (kilograms divided by meters squared) be computed from imperial measurements (pounds and inches).
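Since the formula in the cell above is loaded as an image, it is written out here, together with a sketch of where the factor comes from. The unit conversions used (1 lb ≈ 0.4536 kg and 1 in = 0.0254 m) are standard values rather than something stated in the lab.

```latex
\[
\mathrm{BMI}
  = \frac{\text{weight (kg)}}{\text{height (m)}^{2}}
  = \frac{\text{weight (lb)}}{\text{height (in)}^{2}} \times 703,
\qquad
\frac{0.4536}{0.0254^{2}} \approx 703 .
\]
```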

The following two cells first make a new object called bmi and then create box plots of these values using the seaborn library, defining groups by the variable genhlth.

In [37]:
bmi = (cdc['weight'] / (cdc['height'])**2) * 703

In [38]:
import seaborn as sns
sns.boxplot(x = cdc['genhlth'], y = bmi)
plt.show();


Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height-squared and then multiply by 703. The result is 20,000 BMI values, one for each respondent.
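To make the element-wise nature of this arithmetic concrete, the sketch below computes BMI for the first ten respondents shown earlier in two ways: with the vectorized expression and with an explicit loop over each respondent. The two should agree value for value. The names sample and loop_bmi are illustrative.

```python
import pandas as pd

# height (inches) and weight (pounds) for the first ten respondents shown earlier.
sample = pd.DataFrame({
    "height": [70, 64, 60, 66, 61, 64, 71, 67, 65, 70],
    "weight": [175, 125, 105, 132, 150, 114, 194, 170, 150, 180],
})

# Vectorized: one expression applied to every row at once.
bmi = sample["weight"] / sample["height"] ** 2 * 703

# Equivalent explicit loop, one respondent at a time.
loop_bmi = [w / h ** 2 * 703 for w, h in zip(sample["weight"], sample["height"])]

print(bmi.round(2).tolist())
```

The vectorized form is shorter and, on 20,000 rows, typically much faster than the loop.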

#### Exercise 5

What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Finally, let's make some histograms. We can look at the histogram for the age of our respondents with the command

In [39]:
cdc['age'].plot(kind = 'hist', edgecolor = 'black', linewidth = 1.2)
plt.show();


Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding the bins argument to the command. In the next cell, we first make a default histogram of bmi and then one with 50 bins.

In [40]:
bmi.plot(kind = 'hist', color = 'yellow', edgecolor = 'black', linewidth = 1.2)
plt.show();
bmi.plot(kind = 'hist', color = 'purple', edgecolor = 'black', linewidth = 1.2, bins = 50)
plt.show();


How do these two histograms compare?
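To see what changing the number of bins does numerically, the sketch below counts the same values into 5 and then 10 equal-width bins with np.histogram. It uses the ages of the first ten respondents shown earlier, and the bin counts chosen here are arbitrary illustrations.

```python
import numpy as np

# Ages of the first ten respondents, copied from the output shown earlier.
age = np.array([77, 33, 49, 42, 55, 55, 31, 45, 27, 44])

# Each call returns the count per bin and the bin edges (one more edge than bins).
counts_5, edges_5 = np.histogram(age, bins=5)
counts_10, edges_10 = np.histogram(age, bins=10)

print(counts_5, edges_5)
print(counts_10, edges_10)
```

Every observation lands in exactly one bin either way; more bins give narrower intervals and a more detailed, but noisier, picture of the shape.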

At this point, we've done a good first pass at analyzing the information in the BRFSS questionnaire. We've found an interesting association between smoking and gender, and we can say something about the relationship between people's assessment of their general health and their own BMI. We've also picked up essential computing tools – summary statistics, subsetting, and plots – that will serve us well throughout this course.

1. Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

2. Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight). Create this new variable by subtracting the two columns in the DataFrame and assigning them to a new object called wdiff.

3. What type of data is wdiff? If an observation of wdiff is 0, what does this mean about the person's weight and desired weight? What if wdiff is positive or negative?

4. Describe the distribution of wdiff in terms of its center, shape, and spread, including any plots you use. What does this tell us about how people feel about their current weight?

5. Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

6. Now it’s time to get creative. Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

This lab was adapted by Vural Aksakalli and Imran Ture from OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

www.featureranking.com