Seaborn¶
Seaborn is a Python data visualization library based on Matplotlib. Some tasks are a bit cumbersome to perform using just Matplotlib, such as automatically plotting one numeric variable vs. another numeric variable for each level of a third (categorical) variable. Using Matplotlib functions behind the scenes, Seaborn provides an easy-to-use syntax to perform such tasks. An additional benefit of Seaborn is that it works natively with Pandas dataframes, whereas Matplotlib's support for them is more limited. This tutorial provides a brief overview of Seaborn for basic data visualization.
The BRFSS Dataset¶
The dataset used in this tutorial is collected by the Centers for Disease Control and Prevention (CDC). The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS website contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data. We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this dataset, we will work with a small subset.
The variables in the dataset are genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime. The other variables record the respondent's height in inches (height), weight in pounds (weight), desired weight in pounds (wtdesire), age in years (age), and gender.
We begin by importing the necessary modules and setting up the Matplotlib environment, and then we load the dataset from a GitHub repository.
import io
import random
import warnings
import numpy as np
import pandas as pd
import requests
# import the pyplot module from the matplotlib package
import matplotlib.pyplot as plt
import seaborn as sns
# suppress library warnings to keep the notebook output clean
warnings.filterwarnings("ignore")
# Jupyter magics: render plots inline at retina resolution
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# apply Seaborn's default theme (this replaces plt.style.use("seaborn"),
# a style name that newer Matplotlib releases no longer accept)
sns.set()
# so that we can see all the columns
pd.set_option('display.max_columns', None)
# how to read a csv file from a github account
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/cdc.csv'
# keep SSL certificate verification enabled (the default) when downloading
url_content = requests.get(url_name).content
cdc = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
cdc.describe(include='all')
| | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|
count | 20000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.000000 | 20000.00000 | 20000.000000 | 20000.000000 | 20000 |
unique | 5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 |
top | very good | NaN | NaN | NaN | NaN | NaN | NaN | NaN | f |
freq | 6972 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 10431 |
mean | NaN | 0.745700 | 0.873800 | 0.472050 | 67.182900 | 169.68295 | 155.093850 | 45.068250 | NaN |
std | NaN | 0.435478 | 0.332083 | 0.499231 | 4.125954 | 40.08097 | 32.013306 | 17.192689 | NaN |
min | NaN | 0.000000 | 0.000000 | 0.000000 | 48.000000 | 68.00000 | 68.000000 | 18.000000 | NaN |
25% | NaN | 0.000000 | 1.000000 | 0.000000 | 64.000000 | 140.00000 | 130.000000 | 31.000000 | NaN |
50% | NaN | 1.000000 | 1.000000 | 0.000000 | 67.000000 | 165.00000 | 150.000000 | 43.000000 | NaN |
75% | NaN | 1.000000 | 1.000000 | 1.000000 | 70.000000 | 190.00000 | 175.000000 | 57.000000 | NaN |
max | NaN | 1.000000 | 1.000000 | 1.000000 | 93.000000 | 500.00000 | 680.000000 | 99.000000 | NaN |
cdc.sample(10, random_state=999)
| | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|
6743 | very good | 1 | 1 | 0 | 71 | 170 | 170 | 23 | m |
19360 | very good | 1 | 0 | 1 | 64 | 120 | 117 | 45 | f |
8104 | good | 1 | 1 | 1 | 70 | 192 | 170 | 64 | m |
8535 | excellent | 1 | 1 | 1 | 64 | 165 | 140 | 67 | f |
8275 | very good | 1 | 1 | 0 | 69 | 130 | 140 | 69 | m |
3511 | very good | 0 | 1 | 0 | 63 | 128 | 128 | 37 | f |
1521 | good | 1 | 1 | 0 | 68 | 176 | 135 | 37 | f |
976 | fair | 0 | 1 | 1 | 64 | 150 | 125 | 43 | f |
14484 | good | 1 | 1 | 1 | 68 | 185 | 185 | 78 | m |
3591 | fair | 1 | 1 | 0 | 71 | 165 | 175 | 34 | m |
cdc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   genhlth   20000 non-null  object
 1   exerany   20000 non-null  int64
 2   hlthplan  20000 non-null  int64
 3   smoke100  20000 non-null  int64
 4   height    20000 non-null  int64
 5   weight    20000 non-null  int64
 6   wtdesire  20000 non-null  int64
 7   age       20000 non-null  int64
 8   gender    20000 non-null  object
dtypes: int64(7), object(2)
memory usage: 1.4+ MB
Let's get a value count for each one of the categorical variables.
categorical_cols = cdc.columns[cdc.dtypes == object].tolist()
for col in categorical_cols:
    print(f"Column Name: {col}")
    print(cdc[col].value_counts())
    print("\n")
Column Name: genhlth
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64

Column Name: gender
f    10431
m     9569
Name: gender, dtype: int64
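Raw counts can be harder to compare across variables than proportions. As a small sketch, `value_counts(normalize=True)` returns relative frequencies instead of counts. The tiny `toy` data frame below is a hypothetical stand-in for `cdc` (it is not the survey data), used only so that the snippet runs without downloading anything.

```python
import pandas as pd

# Hypothetical stand-in for the cdc data frame (NOT the real survey data)
toy = pd.DataFrame({
    "genhlth": ["good", "good", "fair", "excellent"],
    "gender": ["f", "m", "f", "f"],
})

# Select the non-numeric (categorical) columns
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Relative frequency of each level instead of the raw count
proportions = {
    col: toy[col].value_counts(normalize=True) for col in categorical_cols
}

for col, props in proportions.items():
    print(f"Column Name: {col}")
    print(props.round(3))
```

Replacing `toy` with `cdc` should give the share of respondents in each general-health and gender category.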
Scatter Plots¶
Let's first plot height vs. weight using jointplot(), which combines a bivariate scatter plot with the univariate distribution of each variable along the margins.
sns.jointplot(x="height",
              y="weight",
              data=cdc);
Let's plot height vs. weight again, this time broken down by gender. For this task, we use relplot(), which is a workhorse function in Seaborn. Its kind option defaults to "scatter"; setting kind to "line" produces a line plot instead. Notice the height option below, which makes the plot slightly bigger. The hue option specifies the variable used to color the data points.
sns.relplot(x="height",
            y="weight",
            hue="gender",
            height=6,
            data=cdc);
We are curious whether the above plot differs for smokers vs. nonsmokers, as well as for respondents who exercised in the past month vs. those who did not. For this, we use the col and row options, which split the data into a grid of subplots, one for each combination of levels. This kind of plot is called a faceted plot.
sns.relplot(x="height",
            y="weight",
            hue="gender",
            col="smoke100",
            row="exerany",
            data=cdc);