Seaborn


Seaborn is a Python data visualization library based on Matplotlib. Some tasks are cumbersome to perform using Matplotlib alone, such as automatically plotting one numeric variable against another for each level of a third (categorical) variable. Using Matplotlib functions behind the scenes, Seaborn provides an easy-to-use syntax for such tasks. An additional benefit of Seaborn is that it is designed around Pandas DataFrames: you pass a DataFrame and refer to its columns by name, whereas Matplotlib's support for DataFrames is more limited. This tutorial provides a brief overview of Seaborn for basic data visualization.
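
As a rough illustration of the point above, the sketch below draws the same grouped scatter plot twice: once with an explicit Matplotlib loop and once with a single Seaborn call. The tiny DataFrame and its column names (x, y, group) are made up for illustration and are not part of the dataset used later.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: two numeric columns and one categorical column
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 5, 4, 6, 7],
    "group": ["a", "a", "a", "b", "b", "b"],
})

fig, (ax_mpl, ax_sns) = plt.subplots(1, 2, figsize=(10, 4))

# Matplotlib only: loop over the levels of the categorical variable
for level, subset in toy.groupby("group"):
    ax_mpl.scatter(subset["x"], subset["y"], label=level)
ax_mpl.legend(title="group")
ax_mpl.set_title("Matplotlib loop")

# Seaborn: the hue option handles the grouping and the legend
sns.scatterplot(data=toy, x="x", y="y", hue="group", ax=ax_sns)
ax_sns.set_title("Seaborn hue")
```

Both panels show the same information; the Seaborn version simply moves the grouping logic into the hue option.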

The BRFSS Dataset

The dataset used in this tutorial was collected by the Centers for Disease Control and Prevention (CDC). The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS website contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data. We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in the full dataset, we will work with a small subset.

The variables in the dataset are genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each of these variables corresponds to a question asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health as excellent, very good, good, fair, or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or had not (0). The remaining variables record the respondent's height in inches (height), weight in pounds (weight), desired weight in pounds (wtdesire), age in years (age), and gender (gender).
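
Because genhlth is an ordinal variable, it can help to tell Pandas about the ordering of its levels so that summaries follow it. A minimal sketch, assuming the five levels listed above and using a small made-up sample rather than the real data:

```python
import pandas as pd

# Levels of genhlth from best to worst, as listed in the text
health_levels = ["excellent", "very good", "good", "fair", "poor"]

# Hypothetical sample of responses (not taken from the BRFSS data)
sample = pd.DataFrame({"genhlth": ["good", "poor", "excellent", "good", "fair"]})

# Convert the text column to an ordered categorical
sample["genhlth"] = pd.Categorical(
    sample["genhlth"], categories=health_levels, ordered=True
)

# Sorting by index now follows the declared order instead of the alphabet
counts = sample["genhlth"].value_counts().sort_index()
print(counts)
```

Levels that do not occur in the sample (here "very good") are still listed, with a count of zero.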

We begin by importing the necessary modules and setting up the Matplotlib environment, and then we load the dataset from a GitHub repository.

In [1]:
import random
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

# import the pyplot module from the matplotlib package
import matplotlib.pyplot as plt
%matplotlib inline 
%config InlineBackend.figure_format = 'retina'
plt.style.use("seaborn-v0_8")  # this style was named "seaborn" before Matplotlib 3.6

import seaborn as sns
sns.set()
In [2]:
import io
import requests

# so that we can see all the columns
pd.set_option('display.max_columns', None)

# read a CSV file hosted on GitHub
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/cdc.csv'
# note: SSL certificate verification is left on (the requests default)
response = requests.get(url_name, timeout=30)
response.raise_for_status()  # stop early if the download failed
cdc = pd.read_csv(io.StringIO(response.content.decode('utf-8')))
In [3]:
cdc.describe(include='all')
Out[3]:
          genhlth       exerany      hlthplan      smoke100        height       weight      wtdesire           age  gender
count       20000  20000.000000  20000.000000  20000.000000  20000.000000  20000.00000  20000.000000  20000.000000   20000
unique          5           NaN           NaN           NaN           NaN          NaN           NaN           NaN       2
top     very good           NaN           NaN           NaN           NaN          NaN           NaN           NaN       f
freq         6972           NaN           NaN           NaN           NaN          NaN           NaN           NaN   10431
mean          NaN      0.745700      0.873800      0.472050     67.182900    169.68295    155.093850     45.068250     NaN
std           NaN      0.435478      0.332083      0.499231      4.125954     40.08097     32.013306     17.192689     NaN
min           NaN      0.000000      0.000000      0.000000     48.000000     68.00000     68.000000     18.000000     NaN
25%           NaN      0.000000      1.000000      0.000000     64.000000    140.00000    130.000000     31.000000     NaN
50%           NaN      1.000000      1.000000      0.000000     67.000000    165.00000    150.000000     43.000000     NaN
75%           NaN      1.000000      1.000000      1.000000     70.000000    190.00000    175.000000     57.000000     NaN
max           NaN      1.000000      1.000000      1.000000     93.000000    500.00000    680.000000     99.000000     NaN
In [4]:
cdc.sample(10, random_state=999)
Out[4]:
         genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
6743   very good        1         1         0      71     170       170   23       m
19360  very good        1         0         1      64     120       117   45       f
8104        good        1         1         1      70     192       170   64       m
8535   excellent        1         1         1      64     165       140   67       f
8275   very good        1         1         0      69     130       140   69       m
3511   very good        0         1         0      63     128       128   37       f
1521        good        1         1         0      68     176       135   37       f
976         fair        0         1         1      64     150       125   43       f
14484       good        1         1         1      68     185       185   78       m
3591        fair        1         1         0      71     165       175   34       m
In [5]:
cdc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   genhlth   20000 non-null  object
 1   exerany   20000 non-null  int64 
 2   hlthplan  20000 non-null  int64 
 3   smoke100  20000 non-null  int64 
 4   height    20000 non-null  int64 
 5   weight    20000 non-null  int64 
 6   wtdesire  20000 non-null  int64 
 7   age       20000 non-null  int64 
 8   gender    20000 non-null  object
dtypes: int64(7), object(2)
memory usage: 1.4+ MB

Let's get a value count for each one of the categorical variables.

In [6]:
categorical_cols = cdc.columns[cdc.dtypes==object].tolist()
for col in categorical_cols:
    print(f"Column Name: {col}")
    print(cdc[col].value_counts())
    print("\n")
Column Name: genhlth
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64


Column Name: gender
f    10431
m     9569
Name: gender, dtype: int64


Scatter Plots

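Let's first plot height vs. weight using jointplot(), which combines a bivariate plot (a scatter plot of the two variables) with univariate plots (a histogram of each variable along the margins).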

In [7]:
sns.jointplot(x="height", 
              y="weight", 
              data=cdc);

Let's plot height vs. weight again, but this time broken down by gender. For this task, we use relplot(), which is a workhorse function in Seaborn. Its default plot kind is "scatter"; setting the kind option to "line" produces a line plot instead. Notice the height option below, which makes the plot slightly bigger. The hue option specifies the variable used to color the data points.

In [8]:
sns.relplot(x="height", 
            y="weight", 
            hue="gender", 
            height=6, 
            data=cdc);

We are curious whether the above plot would look different for smokers vs. nonsmokers, and for respondents who exercised in the past month vs. those who did not. For this, we use the col and row options.

This kind of plot is called a faceted plot.

In [9]:
sns.relplot(x="height", 
            y="weight", 
            hue="gender", 
            col='smoke100',
            row='exerany',
            data=cdc);
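
relplot() returns a FacetGrid object, which can be used to adjust a faceted plot after it is drawn. The sketch below uses a small made-up DataFrame with the same column names as the dataset, so it runs without downloading the data; the axis label text is illustrative.

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows that mimic the columns used above (not real BRFSS records)
toy = pd.DataFrame({
    "height":   [64, 70, 68, 62, 72, 66, 69, 63],
    "weight":   [130, 180, 165, 120, 200, 150, 175, 125],
    "gender":   ["f", "m", "m", "f", "m", "f", "m", "f"],
    "smoke100": [0, 0, 1, 1, 0, 0, 1, 1],
    "exerany":  [0, 0, 0, 0, 1, 1, 1, 1],
})

g = sns.relplot(x="height", y="weight", hue="gender",
                col="smoke100", row="exerany", height=3, data=toy)

# One Axes per (row, col) combination: 2 levels of exerany x 2 of smoke100
print(g.axes.shape)

# Relabel the axes on every facet at once
g.set_axis_labels("Height (inches)", "Weight (pounds)")
```

The rows of the grid correspond to the levels of exerany and the columns to the levels of smoke100.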