Pandas (Extended)¶
The Pandas module is Python's fundamental data analytics library, providing high-performance, easy-to-use data structures and tools for data analysis. Many Excel features are available in Pandas, including creating pivot tables, computing columns based on other columns, and so on. Pandas also facilitates grouping rows by column values and joining tables as in SQL. Pandas is a comprehensive and mature module that can be used for advanced data analytics; this tutorial presents a basic overview of its capabilities.
Table of Contents¶
- Series
- DataFrame
- Creating a DataFrame
- Selecting and filtering in DataFrame
- Transposing
- Adding and removing columns
- Assigning new columns
- Evaluating an expression
- Querying a DataFrame
- Sorting a DataFrame
- Operations on DataFrame
- Automatic alignment for DataFrames
- Plotting a DataFrame
- Handling missing data
- Aggregating with groupby
- Pivot tables
- Overview functions
- Combining DataFrames
- Categories
- Saving and loading
- Exercises
Let's import Pandas with the usual convention as pd.
In [1]:
# let's suppress warnings, as they can get annoying sometimes
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
Series¶
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
Creating a Series¶
The simplest Series is formed from only an array of data.
In [2]:
obj = pd.Series([2,-1,3,5])
obj
Out[2]:
0    2
1   -1
2    3
3    5
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values on the right. You can get the array representation and index object of the Series via its values and index attributes, respectively.
In [3]:
obj.values
Out[3]:
array([ 2, -1, 3, 5], dtype=int64)
In [4]:
obj.index # Like range(4)
Out[4]:
RangeIndex(start=0, stop=4, step=1)
In [5]:
list(obj.index) # if you want the index as a list of values
Out[5]:
[0, 1, 2, 3]
We may want to create a Series with an index identifying each data point with a label.
In [6]:
obj2 = pd.Series([2,-1,3,5], index=['a', 'b', 'c', 'd'])
obj2
Out[6]:
a    2
b   -1
c    3
d    5
dtype: int64
In [7]:
list(obj2.index)
Out[7]:
['a', 'b', 'c', 'd']
A Series can have a name.
In [8]:
obj_withName = pd.Series([83, 68], index=["bob", "alice"], name="weights")
obj_withName
Out[8]:
bob      83
alice    68
Name: weights, dtype: int64
You can also create a Series object from a dict. The keys will be used as index labels.
In [9]:
weightdata = {"john": 86, "michael": 68, "alice": 68, "bob": 83}
obj3 = pd.Series(weightdata)
obj3
Out[9]:
john       86
michael    68
alice      68
bob        83
dtype: int64
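If you pass an index when building a Series from a dict, only the listed keys are kept, in that order; a key that is missing from the dict becomes NaN. A small sketch (colin is a made-up name that is not in weightdata):

pd.Series(weightdata, index=["alice", "bob", "colin"])  # colin is not a key, so its value is NaN
# alice    68.0
# bob      83.0
# colin     NaN
# dtype: float64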
Selecting and filtering in Series¶
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values.
In [10]:
obj2
Out[10]:
a    2
b   -1
c    3
d    5
dtype: int64
In [11]:
obj2['a']
Out[11]:
2
In [12]:
obj2[['b','c','d']]
# Here ['b', 'c', 'd'] is interpreted as a list of indices, even though it contains strings instead of integers.
Out[12]:
b   -1
c    3
d    5
dtype: int64
You can still access the items by integer location, like in a regular array. By default, the rank of the item in the Series starts at 0.
In [13]:
obj2[0]
Out[13]:
2
In [14]:
obj2[[1,3]]
Out[14]:
b   -1
d    5
dtype: int64
In [15]:
obj2[obj2 < 0]
Out[15]:
b   -1
dtype: int64
Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive.
In [16]:
obj2['b':'c']
Out[16]:
b   -1
c    3
dtype: int64
To make it clear whether you are accessing by label or by integer location, it is recommended to always use the loc attribute when accessing by label, and the iloc attribute when accessing by integer location.
In [17]:
obj2.loc["a"]
Out[17]:
2
In [18]:
obj2.iloc[0]
Out[18]:
2
Slicing a Series also slices the index labels.
In [19]:
obj2.iloc[1:3]
Out[19]:
b   -1
c    3
dtype: int64
This can lead to unexpected results when using the default numeric labels, so be careful:
In [20]:
surprise = pd.Series([1000, 1001, 1002, 1003])
surprise
Out[20]:
0    1000
1    1001
2    1002
3    1003
dtype: int64
In [21]:
surprise_slice = surprise[2:]
surprise_slice
Out[21]:
2    1002
3    1003
dtype: int64
Oh look! The first element has index label 2. The element with index label 0 is absent from the slice.
But remember that you can access elements by integer location using the iloc attribute. This illustrates another reason why it's always better to use loc and iloc to access Series objects.
In [22]:
surprise_slice.iloc[0]
Out[22]:
1002
Operations on Series¶
Series objects behave much like one-dimensional ndarrays, and you can often pass them as parameters to NumPy functions.
In [23]:
import numpy as np
np.exp(obj2)
Out[23]:
a      7.389056
b      0.367879
c     20.085537
d    148.413159
dtype: float64
Arithmetic operations on Series are also possible, and they apply elementwise, just like for ndarrays.
In [24]:
obj2 + [1000,2000,3000,4000]
Out[24]:
a    1002
b    1999
c    3003
d    4005
dtype: int64
Similar to NumPy, if you add a single number to a Series, that number is added to all items in the Series. This is called broadcasting.
In [25]:
obj2 + 1000
Out[25]:
a    1002
b     999
c    1003
d    1005
dtype: int64
The same is true for all binary operations and even conditional operations.
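For example, a comparison operator is also applied elementwise and returns a boolean Series. A minimal sketch using obj2:

obj2 > 2  # compares each item to 2
# a    False
# b    False
# c     True
# d     True
# dtype: bool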
Plotting a Series¶
pandas makes it easy to plot Series data using matplotlib. Just import matplotlib and call the plot() method:
In [26]:
%matplotlib inline
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
plt.style.use("ggplot")
###
temperatures = [4.4,5.1,6.1,6.2,6.1,6.1,5.7,5.2,4.7,4.1,3.9,3.5]
s7 = pd.Series(temperatures, name="Temperature")
s7.plot()
plt.show();
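The plot() method also accepts a kind argument to select other plot types; for instance, a quick sketch drawing the same data as a bar chart:

s7.plot(kind="bar")  # same Series, rendered as a bar chart
plt.show()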
DataFrame¶
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). You can see DataFrames as a dict of Series.
Creating a DataFrame¶
There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays.
In [27]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
df = pd.DataFrame(data)
df
Out[27]:
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
5 | Nevada | 2003 | 3.2 |
For large DataFrames, the head method selects only the first five rows.
In [28]:
df.head()
Out[28]:
state | year | pop | |
---|---|---|---|
0 | Ohio | 2000 | 1.5 |
1 | Ohio | 2001 | 1.7 |
2 | Ohio | 2002 | 3.6 |
3 | Nevada | 2001 | 2.4 |
4 | Nevada | 2002 | 2.9 |
You can also create a DataFrame by passing a dict of Series objects.
In [29]:
people_dict = {
"weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
"birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
"children": pd.Series([0, 3], index=["charles", "bob"]),
"hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(people_dict)
people
Out[29]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
alice | 68 | 1985 | NaN | Biking |
bob | 83 | 1984 | 3.0 | Dancing |
charles | 112 | 1992 | 0.0 | NaN |
A few things to note:
- The Series were automatically aligned based on their index.
- Missing values are represented as NaN.
- Series names are ignored (the name "year" was dropped).
If you pass a list of columns and/or index row labels to the DataFrame constructor, it will guarantee that these columns and/or rows will exist, in that order, and no other column/row will exist.
In [30]:
df2 = pd.DataFrame(
people_dict,
columns=["birthyear", "weight", "height"],
index=["bob", "alice", "eugene"]
)
df2
Out[30]:
birthyear | weight | height | |
---|---|---|---|
bob | 1984.0 | 83.0 | NaN |
alice | 1985.0 | 68.0 | NaN |
eugene | NaN | NaN | NaN |
Another convenient way to create a DataFrame is to pass all the values to the constructor as an ndarray, or a list of lists, and specify the column names and row index labels separately:
In [31]:
values = [
[1985, np.nan, "Biking", 68],
[1984, 3, "Dancing", 83],
[1992, 0, np.nan, 112]
]
df3 = pd.DataFrame(
values,
columns=["birthyear", "children", "hobby", "weight"],
index=["alice", "bob", "charles"]
)
df3
Out[31]:
birthyear | children | hobby | weight | |
---|---|---|---|---|
alice | 1985 | NaN | Biking | 68 |
bob | 1984 | 3.0 | Dancing | 83 |
charles | 1992 | 0.0 | NaN | 112 |
You can access columns pretty much as you would expect. They are returned as Series objects.
In [32]:
people["birthyear"]
Out[32]:
alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64
You can also get multiple columns at once.
In [33]:
people[["birthyear", "hobby"]]
Out[33]:
birthyear | hobby | |
---|---|---|
alice | 1985 | Biking |
bob | 1984 | Dancing |
charles | 1992 | NaN |
To specify missing values, you can either use np.nan or NumPy's masked arrays:
In [34]:
masked_array = np.ma.asarray(values, dtype=object)  # plain object dtype (np.object is a removed alias in recent NumPy)
masked_array[(0, 2), (1, 2)] = np.ma.masked
df3 = pd.DataFrame(
masked_array,
columns=["birthyear", "children", "hobby", "weight"],
index=["alice", "bob", "charles"]
)
df3
Out[34]:
birthyear | children | hobby | weight | |
---|---|---|---|---|
alice | 1985 | NaN | Biking | 68 |
bob | 1984 | 3 | Dancing | 83 |
charles | 1992 | 0 | NaN | 112 |
Instead of an ndarray, you can also pass a DataFrame object:
In [35]:
df4 = pd.DataFrame(
df3,
columns=["hobby", "children"],
index=["alice", "bob"]
)
df4
Out[35]:
hobby | children | |
---|---|---|
alice | Biking | NaN |
bob | Dancing | 3 |
Selecting and filtering in DataFrame¶
Let's go back to the people DataFrame:
In [36]:
people
Out[36]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
alice | 68 | 1985 | NaN | Biking |
bob | 83 | 1984 | 3.0 | Dancing |
charles | 112 | 1992 | 0.0 | NaN |
The loc attribute lets you access rows instead of columns. The result is a Series object in which the DataFrame's column names are mapped to row index labels.
In [37]:
people.loc["charles"]
Out[37]:
weight        112
birthyear    1992
children        0
hobby         NaN
Name: charles, dtype: object
You can also access rows by integer location using the iloc attribute.
In [38]:
people.iloc[2]
Out[38]:
weight        112
birthyear    1992
children        0
hobby         NaN
Name: charles, dtype: object
You can also get a slice of rows, and this returns a DataFrame object.
In [39]:
people.iloc[1:3]
Out[39]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
bob | 83 | 1984 | 3.0 | Dancing |
charles | 112 | 1992 | 0.0 | NaN |
Finally, you can pass a boolean array to get the matching rows.
In [40]:
people[np.array([True, False, True])]
Out[40]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
alice | 68 | 1985 | NaN | Biking |
charles | 112 | 1992 | 0.0 | NaN |
This is most useful when combined with boolean expressions.
In [41]:
people[people["birthyear"] < 1990]
Out[41]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
alice | 68 | 1985 | NaN | Biking |
bob | 83 | 1984 | 3.0 | Dancing |
Transposing¶
You can swap columns and indices using the T attribute.
In [42]:
people.T
Out[42]:
alice | bob | charles | |
---|---|---|---|
weight | 68 | 83 | 112 |
birthyear | 1985 | 1984 | 1992 |
children | NaN | 3 | 0 |
hobby | Biking | Dancing | NaN |
Adding and removing columns¶
You can generally treat DataFrame objects like dictionaries of Series, so the following operations work fine.
In [43]:
people
Out[43]:
weight | birthyear | children | hobby | |
---|---|---|---|---|
alice | 68 | 1985 | NaN | Biking |
bob | 83 | 1984 | 3.0 | Dancing |
charles | 112 | 1992 | 0.0 | NaN |
In [44]:
people["age"] = 2018 - people["birthyear"] # adds a new column "age"
people["over 30"] = people["age"] > 30 # adds another column "over 30"
birthyears = people.pop("birthyear")
del people["children"]
people
Out[44]:
weight | hobby | age | over 30 | |
---|---|---|---|---|
alice | 68 | Biking | 33 | True |
bob | 83 | Dancing | 34 | True |
charles | 112 | NaN | 26 | False |
In [45]:
birthyears
Out[45]:
alice      1985
bob        1984
charles    1992
Name: birthyear, dtype: int64
When you add a new column from a Series, it is aligned on the DataFrame's index: missing rows are filled with NaN, and extra rows are ignored.
In [46]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1}) # alice is missing, eugene is ignored
people
Out[46]:
weight | hobby | age | over 30 | pets | |
---|---|---|---|---|---|
alice | 68 | Biking | 33 | True | NaN |
bob | 83 | Dancing | 34 | True | 0.0 |
charles | 112 | NaN | 26 | False | 5.0 |
When adding a new column, it is added at the end (on the right) by default. You can also insert a column anywhere else using the insert() method.
In [47]:
people.insert(1, "height", [172, 181, 185])
people
Out[47]:
weight | height | hobby | age | over 30 | pets | |
---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN |
bob | 83 | 181 | Dancing | 34 | True | 0.0 |
charles | 112 | 185 | NaN | 26 | False | 5.0 |
Assigning new columns¶
You can also create new columns by calling the assign() method. Note that this returns a new DataFrame object; the original is not modified.
In [48]:
people.assign(
body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
has_pets = people["pets"] > 0
)
Out[48]:
weight | height | hobby | age | over 30 | pets | body_mass_index | has_pets | |
---|---|---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 | False |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | False |
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 | True |
Note that you cannot access columns created within the same assignment.
In [49]:
try:
people.assign(
body_mass_index = people["weight"] / (people["height"] / 100) ** 2,
overweight = people["body_mass_index"] > 25
)
except KeyError as e:
print("Key error:", e)
Key error: 'body_mass_index'
The solution is to split this assignment into two consecutive assignments.
In [50]:
df5 = people.assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
df5.assign(overweight = df5["body_mass_index"] > 25)
Out[50]:
weight | height | hobby | age | over 30 | pets | body_mass_index | overweight | |
---|---|---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 | False |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | True |
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 | True |
Having to create a temporary variable df5 is not very convenient. You may want to just chain the assign() calls, but this does not work because the people object is not actually modified by the first assignment:
In [51]:
try:
(people
.assign(body_mass_index = people["weight"] / (people["height"] / 100) ** 2)
.assign(overweight = people["body_mass_index"] > 25)
)
except KeyError as e:
print("Key error:", e)
Key error: 'body_mass_index'
But fear not, there is a simple solution. You can pass a function to the assign() method (typically a lambda function), and this function will be called with the DataFrame as a parameter.
In [52]:
(people
.assign(body_mass_index = lambda df: df["weight"] / (df["height"] / 100) ** 2)
.assign(overweight = lambda df: df["body_mass_index"] > 25)
)
Out[52]:
weight | height | hobby | age | over 30 | pets | body_mass_index | overweight | |
---|---|---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 | False |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | True |
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 | True |
Problem solved!
Evaluating an expression¶
A great feature supported by pandas is expression evaluation. This relies on the numexpr library, which must be installed.
In [53]:
people.eval("weight / (height/100) ** 2 > 25")
Out[53]:
alice      False
bob         True
charles     True
dtype: bool
Assignment expressions are also supported. Let's set inplace=True to directly modify the DataFrame rather than getting a modified copy:
In [54]:
people.eval("body_mass_index = weight / (height/100) ** 2", inplace=True)
people
Out[54]:
weight | height | hobby | age | over 30 | pets | body_mass_index | |
---|---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 |
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 |
You can use a local or global variable in an expression by prefixing it with '@'.
In [55]:
overweight_threshold = 30
people.eval("overweight = body_mass_index > @overweight_threshold", inplace=True)
people
Out[55]:
weight | height | hobby | age | over 30 | pets | body_mass_index | overweight | |
---|---|---|---|---|---|---|---|---|
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 | False |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | False |
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 | True |
Querying a DataFrame¶
The query() method lets you filter a DataFrame based on a boolean query expression:
In [56]:
people.query("age > 30 and pets == 0")
Out[56]:
weight | height | hobby | age | over 30 | pets | body_mass_index | overweight | |
---|---|---|---|---|---|---|---|---|
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | False |
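Like eval(), the query() method understands the '@' prefix for local variables. A minimal sketch:

min_age = 30
people.query("age > @min_age and pets == 0")  # same filter, with the threshold taken from a Python variable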
Sorting a DataFrame¶
You can sort a DataFrame by calling its sort_index method. By default it sorts the rows by their index label, in ascending order, but let's reverse the order.
In [57]:
people.sort_index(ascending=False)
Out[57]:
weight | height | hobby | age | over 30 | pets | body_mass_index | overweight | |
---|---|---|---|---|---|---|---|---|
charles | 112 | 185 | NaN | 26 | False | 5.0 | 32.724617 | True |
bob | 83 | 181 | Dancing | 34 | True | 0.0 | 25.335002 | False |
alice | 68 | 172 | Biking | 33 | True | NaN | 22.985398 | False |
Note that sort_index returned a sorted copy of the DataFrame. To modify people directly, we can set the inplace argument to True. Also, we can sort the columns instead of the rows by setting axis=1.
In [58]:
people.sort_index(axis=1, inplace=True)
people
Out[58]:
age | body_mass_index | height | hobby | over 30 | overweight | pets | weight | |
---|---|---|---|---|---|---|---|---|
alice | 33 | 22.985398 | 172 | Biking | True | False | NaN | 68 |
bob | 34 | 25.335002 | 181 | Dancing | True | False | 0.0 | 83 |
charles | 26 | 32.724617 | 185 | NaN | False | True | 5.0 | 112 |
To sort the DataFrame by the values instead of the labels, we can use sort_values and specify the column to sort by.
In [59]:
people.sort_values(by="age", inplace=True)
people
Out[59]:
age | body_mass_index | height | hobby | over 30 | overweight | pets | weight | |
---|---|---|---|---|---|---|---|---|
charles | 26 | 32.724617 | 185 | NaN | False | True | 5.0 | 112 |
alice | 33 | 22.985398 | 172 | Biking | True | False | NaN | 68 |
bob | 34 | 25.335002 | 181 | Dancing | True | False | 0.0 | 83 |
Operations on DataFrame¶
Although DataFrames do not try to mimic NumPy arrays, there are a few similarities. Let's create a DataFrame to demonstrate this:
In [60]:
grades_array = np.array([[8,8,9],[10,9,9],[4, 8, 2], [9, 10, 10]])
grades = pd.DataFrame(grades_array, columns=["sep", "oct", "nov"], index=["alice","bob","charles","darwin"])
grades
Out[60]:
sep | oct | nov | |
---|---|---|---|
alice | 8 | 8 | 9 |
bob | 10 | 9 | 9 |
charles | 4 | 8 | 2 |
darwin | 9 | 10 | 10 |
You can apply NumPy mathematical functions on a DataFrame. The function is applied to all values.
In [61]:
np.sqrt(grades)
Out[61]:
sep | oct | nov | |
---|---|---|---|
alice | 2.828427 | 2.828427 | 3.000000 |
bob | 3.162278 | 3.000000 | 3.000000 |
charles | 2.000000 | 2.828427 | 1.414214 |
darwin | 3.000000 | 3.162278 | 3.162278 |
Similarly, adding a single value to a DataFrame will add that value to all elements in the DataFrame. This is called broadcasting.
In [62]:
grades + 1
Out[62]:
sep | oct | nov | |
---|---|---|---|
alice | 9 | 9 | 10 |
bob | 11 | 10 | 10 |
charles | 5 | 9 | 3 |
darwin | 10 | 11 | 11 |
Of course, the same is true for all other binary operations, including arithmetic (*, /, ** ...) and conditional (>, == ...) operations.
In [63]:
grades >= 5
Out[63]:
sep | oct | nov | |
---|---|---|---|
alice | True | True | True |
bob | True | True | True |
charles | False | True | False |
darwin | True | True | True |
Aggregation operations, such as computing the max, the sum or the mean of a DataFrame, apply to each column, and you get back a Series object.
In [64]:
grades.mean()
Out[64]:
sep    7.75
oct    8.75
nov    7.50
dtype: float64
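The other aggregations work the same way; for example, sum and max also reduce each column to a single value:

grades.sum()  # sep 31, oct 35, nov 30
grades.max()  # sep 10, oct 10, nov 10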
The all method is also an aggregation operation: it checks whether all values are True or not. Let's see during which months all students got a grade greater than 5.
In [65]:
(grades > 5).all()
Out[65]:
sep    False
oct     True
nov    False
dtype: bool
Most of these functions take an optional axis parameter which lets you specify along which axis of the DataFrame you want the operation executed. The default is axis=0, meaning that the operation is executed vertically (on each column). You can set axis=1 to execute the operation horizontally (on each row). For example, let's find out which students had all grades greater than 5:
In [66]:
(grades > 5).all(axis = 1)
Out[66]:
alice       True
bob         True
charles    False
darwin      True
dtype: bool
The any method returns True if any value is True. Let's see who got at least one grade of 10:
In [67]:
(grades == 10).any(axis = 1)
Out[67]:
alice      False
bob         True
charles    False
darwin      True
dtype: bool
If you add a Series object to a DataFrame (or execute any other binary operation), pandas attempts to broadcast the operation to all rows in the DataFrame. This only works if the Series has the same size as the DataFrame's rows. For example, let's subtract the mean of the DataFrame (a Series object) from the DataFrame:
In [68]:
grades - grades.mean() # equivalent to: grades - [7.75, 8.75, 7.50]
Out[68]:
sep | oct | nov | |
---|---|---|---|
alice | 0.25 | -0.75 | 1.5 |
bob | 2.25 | 0.25 | 1.5 |
charles | -3.75 | -0.75 | -5.5 |
darwin | 1.25 | 1.25 | 2.5 |
We subtracted 7.75 from all September grades, 8.75 from October grades and 7.50 from November grades. It is equivalent to subtracting this DataFrame:
In [69]:
pd.DataFrame([[7.75, 8.75, 7.50]]*4, index=grades.index, columns=grades.columns)
Out[69]:
sep | oct | nov | |
---|---|---|---|
alice | 7.75 | 8.75 | 7.5 |
bob | 7.75 | 8.75 | 7.5 |
charles | 7.75 | 8.75 | 7.5 |
darwin | 7.75 | 8.75 | 7.5 |
If you want to subtract the global mean from every grade, here is one way to do it:
In [70]:
grades - grades.values.mean() # subtracts the global mean (8.00) from all grades
Out[70]:
sep | oct | nov | |
---|---|---|---|
alice | 0.0 | 0.0 | 1.0 |
bob | 2.0 | 1.0 | 1.0 |
charles | -4.0 | 0.0 | -6.0 |
darwin | 1.0 | 2.0 | 2.0 |
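Note that grades - grades.mean(axis=1) would not subtract each student's own mean: the operator aligns the Series labels with the column names, producing NaNs. To broadcast along the rows instead, you can use the method form with an explicit axis argument. A small sketch:

grades.sub(grades.mean(axis=1), axis=0)  # subtract each student's own mean from their grades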
Automatic alignment for DataFrames¶
Similar to Series, when operating on multiple DataFrames, pandas automatically aligns them by row index label, but also by column names. Let's create a DataFrame with bonus points for each person from October to December:
In [71]:
bonus_array = np.array([[0,np.nan,2],[np.nan,1,0],[0, 1, 0], [3, 3, 0]])
bonus_points = pd.DataFrame(bonus_array, columns=["oct", "nov", "dec"], index=["bob","colin", "darwin", "charles"])
bonus_points
Out[71]:
oct | nov | dec | |
---|---|---|---|
bob | 0.0 | NaN | 2.0 |
colin | NaN | 1.0 | 0.0 |
darwin | 0.0 | 1.0 | 0.0 |
charles | 3.0 | 3.0 | 0.0 |
In [72]:
grades + bonus_points
Out[72]:
dec | nov | oct | sep | |
---|---|---|---|---|
alice | NaN | NaN | NaN | NaN |
bob | NaN | NaN | 9.0 | NaN |
charles | NaN | 5.0 | 11.0 | NaN |
colin | NaN | NaN | NaN | NaN |
darwin | NaN | 11.0 | 10.0 | NaN |
Looks like the addition worked in some cases but way too many elements are now empty. That's because when aligning the DataFrames, some columns and rows were only present on one side, and thus they were considered missing on the other side (NaN). Then adding NaN to a number results in NaN, hence the result.
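If you would rather treat values that exist on only one side as zeros during the addition, the add() method takes a fill_value argument (cells missing on both sides, such as colin's September, remain NaN). A sketch:

grades.add(bonus_points, fill_value=0)  # missing entries are treated as 0 before adding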
Plotting a DataFrame¶
Just like for Series, pandas makes it easy to draw nice graphs based on a DataFrame. For example, it is easy to create a bar plot from a DataFrame's data by calling its plot method.
In [73]:
people.plot(kind = "bar", y = ["body_mass_index"])
plt.show();
You can pass extra arguments supported by matplotlib's functions. For example, we can create a scatterplot and pass it a list of sizes using the s argument of matplotlib's scatter() function.
In [74]:
people.plot(kind = "scatter", x = "height", y = "weight", s=[40, 120, 200])
plt.show();
Again, there are way too many options to list here: the best option is to scroll through the Visualization page in the pandas documentation and find the plot you are interested in.
Handling missing data¶
Dealing with missing data is a frequent task when working with real-life data. pandas offers a few tools to handle missing data.
The isnull and notnull functions in pandas can be used to detect missing data.
In [75]:
pd.isnull(grades)
Out[75]:
sep | oct | nov | |
---|---|---|---|
alice | False | False | False |
bob | False | False | False |
charles | False | False | False |
darwin | False | False | False |
In [76]:
pd.notnull(grades)
Out[76]:
sep | oct | nov | |
---|---|---|---|
alice | True | True | True |
bob | True | True | True |
charles | True | True | True |
darwin | True | True | True |
We could simply replace every missing value with zero using fillna(0), but setting missing grades to zero (e.g. in September) would be a bit unfair. Perhaps we should decide that missing grades are missing grades, but missing bonus points should be replaced by zeros.
In [77]:
grades
Out[77]:
sep | oct | nov | |
---|---|---|---|
alice | 8 | 8 | 9 |
bob | 10 | 9 | 9 |
charles | 4 | 8 | 2 |
darwin | 9 | 10 | 10 |
In [78]:
fixed_bonus_points = bonus_points.fillna(0)
fixed_bonus_points.insert(0, "sep", 0)
fixed_bonus_points.loc["alice"] = 0
grades + fixed_bonus_points
Out[78]:
dec | nov | oct | sep | |
---|---|---|---|---|
alice | NaN | 9.0 | 8.0 | 8.0 |
bob | NaN | 9.0 | 9.0 | 10.0 |
charles | NaN | 5.0 | 11.0 | 4.0 |
colin | NaN | NaN | NaN | NaN |
darwin | NaN | 11.0 | 10.0 | 9.0 |
That's much better: although we made up some data, we have not been too unfair.
Another way to handle missing data is to interpolate. Let's look at the bonus_points DataFrame again:
In [79]:
bonus_points
Out[79]:
oct | nov | dec | |
---|---|---|---|
bob | 0.0 | NaN | 2.0 |
colin | NaN | 1.0 | 0.0 |
darwin | 0.0 | 1.0 | 0.0 |
charles | 3.0 | 3.0 | 0.0 |
Now let's call the interpolate method. By default, it interpolates vertically (axis=0), so let's tell it to interpolate horizontally (axis=1).
In [80]:
bonus_points.interpolate(axis=1)
Out[80]:
oct | nov | dec | |
---|---|---|---|
bob | 0.0 | 1.0 | 2.0 |
colin | NaN | 1.0 | 0.0 |
darwin | 0.0 | 1.0 | 0.0 |
charles | 3.0 | 3.0 | 0.0 |
Bob had 0 bonus points in October, and 2 in December. When we interpolate for November, we get the mean: 1 bonus point. Colin had 1 bonus point in November, but we do not know how many bonus points he had in September, so we cannot interpolate; this is why there is still a missing value in October after interpolation. To fix this, we can set the September bonus points to 0 before interpolation.
In [81]:
better_bonus_points = bonus_points.copy()
better_bonus_points.insert(0, "sep", 0)
better_bonus_points.loc["alice"] = 0
better_bonus_points = better_bonus_points.interpolate(axis=1)
better_bonus_points
Out[81]:
sep | oct | nov | dec | |
---|---|---|---|---|
bob | 0.0 | 0.0 | 1.0 | 2.0 |
colin | 0.0 | 0.5 | 1.0 | 0.0 |
darwin | 0.0 | 0.0 | 1.0 | 0.0 |
charles | 0.0 | 3.0 | 3.0 | 0.0 |
alice | 0.0 | 0.0 | 0.0 | 0.0 |
Great, now we have reasonable bonus points everywhere. Let's find out the final grades:
In [82]:
grades + better_bonus_points
Out[82]:
dec | nov | oct | sep | |
---|---|---|---|---|
alice | NaN | 9.0 | 8.0 | 8.0 |
bob | NaN | 10.0 | 9.0 | 10.0 |
charles | NaN | 5.0 | 11.0 | 4.0 |
colin | NaN | NaN | NaN | NaN |
darwin | NaN | 11.0 | 10.0 | 9.0 |
It is slightly annoying that the September column ends up on the right. This is because the DataFrames we are adding do not have the exact same columns (the grades DataFrame is missing the "dec" column), so to make things predictable, pandas orders the final columns alphabetically. To fix this, we can simply add the missing column before adding.
In [83]:
grades["dec"] = np.nan
final_grades = grades + better_bonus_points
final_grades
Out[83]:
sep | oct | nov | dec | |
---|---|---|---|---|
alice | 8.0 | 8.0 | 9.0 | NaN |
bob | 10.0 | 9.0 | 10.0 | NaN |
charles | 4.0 | 11.0 | 5.0 | NaN |
colin | NaN | NaN | NaN | NaN |
darwin | 9.0 | 10.0 | 11.0 | NaN |
There's not much we can do about December and Colin: it's bad enough that we are making up bonus points, but we can't reasonably make up grades (well, I guess some teachers probably do). So let's call the dropna() method to get rid of rows that are full of NaNs:
In [84]:
final_grades_clean = final_grades.dropna(how="all")
final_grades_clean
Out[84]:
sep | oct | nov | dec | |
---|---|---|---|---|
alice | 8.0 | 8.0 | 9.0 | NaN |
bob | 10.0 | 9.0 | 10.0 | NaN |
charles | 4.0 | 11.0 | 5.0 | NaN |
darwin | 9.0 | 10.0 | 11.0 | NaN |
Now let's remove columns that are full of NaNs by setting the axis argument to 1:
In [85]:
final_grades_clean = final_grades_clean.dropna(axis=1, how="all")
final_grades_clean
Out[85]:
sep | oct | nov | |
---|---|---|---|
alice | 8.0 | 8.0 | 9.0 |
bob | 10.0 | 9.0 | 10.0 |
charles | 4.0 | 11.0 | 5.0 |
darwin | 9.0 | 10.0 | 11.0 |
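The how="all" argument only drops rows (or columns) in which every value is NaN. dropna() also accepts a thresh argument to require a minimum number of non-NaN values; a quick sketch:

final_grades.dropna(thresh=3)  # keep only rows with at least 3 non-NaN values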
Overview functions¶
When dealing with large DataFrames, it is useful to get a quick overview of their content. pandas offers a few functions for this. First, let's create a large DataFrame with a mix of numeric values, missing values and text values. Notice how Jupyter displays only the corners of the DataFrame:
In [86]:
much_data = np.fromfunction(lambda x,y: (x+y*y)%17*11, (10000, 26))
large_df = pd.DataFrame(much_data, columns=list("ABCDEFGHIJKLMNOPQRSTUVWXYZ"))
large_df[large_df % 16 == 0] = np.nan
large_df.insert(3,"some_text", "Blabla")
large_df.head(10)
Out[86]:
A | B | C | some_text | D | E | F | G | H | I | ... | Q | R | S | T | U | V | W | X | Y | Z | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 11.0 | 44.0 | Blabla | 99.0 | NaN | 88.0 | 22.0 | 165.0 | 143.0 | ... | 11.0 | NaN | 11.0 | 44.0 | 99.0 | NaN | 88.0 | 22.0 | 165.0 | 143.0 |
1 | 11.0 | 22.0 | 55.0 | Blabla | 110.0 | NaN | 99.0 | 33.0 | NaN | 154.0 | ... | 22.0 | 11.0 | 22.0 | 55.0 | 110.0 | NaN | 99.0 | 33.0 | NaN | 154.0 |
2 | 22.0 | 33.0 | 66.0 | Blabla | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 | ... | 33.0 | 22.0 | 33.0 | 66.0 | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 |
3 | 33.0 | 44.0 | 77.0 | Blabla | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN | ... | 44.0 | 33.0 | 44.0 | 77.0 | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN |
4 | 44.0 | 55.0 | 88.0 | Blabla | 143.0 | 33.0 | 132.0 | 66.0 | 22.0 | NaN | ... | 55.0 | 44.0 | 55.0 | 88.0 | 143.0 | 33.0 | 132.0 | 66.0 | 22.0 | NaN |
5 | 55.0 | 66.0 | 99.0 | Blabla | 154.0 | 44.0 | 143.0 | 77.0 | 33.0 | 11.0 | ... | 66.0 | 55.0 | 66.0 | 99.0 | 154.0 | 44.0 | 143.0 | 77.0 | 33.0 | 11.0 |
6 | 66.0 | 77.0 | 110.0 | Blabla | 165.0 | 55.0 | 154.0 | 88.0 | 44.0 | 22.0 | ... | 77.0 | 66.0 | 77.0 | 110.0 | 165.0 | 55.0 | 154.0 | 88.0 | 44.0 | 22.0 |
7 | 77.0 | 88.0 | 121.0 | Blabla | NaN | 66.0 | 165.0 | 99.0 | 55.0 | 33.0 | ... | 88.0 | 77.0 | 88.0 | 121.0 | NaN | 66.0 | 165.0 | 99.0 | 55.0 | 33.0 |
8 | 88.0 | 99.0 | 132.0 | Blabla | NaN | 77.0 | NaN | 110.0 | 66.0 | 44.0 | ... | 99.0 | 88.0 | 99.0 | 132.0 | NaN | 77.0 | NaN | 110.0 | 66.0 | 44.0 |
9 | 99.0 | 110.0 | 143.0 | Blabla | 11.0 | 88.0 | NaN | 121.0 | 77.0 | 55.0 | ... | 110.0 | 99.0 | 110.0 | 143.0 | 11.0 | 88.0 | NaN | 121.0 | 77.0 | 55.0 |
By default, the head() method returns the top 5 rows.
In [87]:
large_df.head()
Out[87]:
A | B | C | some_text | D | E | F | G | H | I | ... | Q | R | S | T | U | V | W | X | Y | Z | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | 11.0 | 44.0 | Blabla | 99.0 | NaN | 88.0 | 22.0 | 165.0 | 143.0 | ... | 11.0 | NaN | 11.0 | 44.0 | 99.0 | NaN | 88.0 | 22.0 | 165.0 | 143.0 |
1 | 11.0 | 22.0 | 55.0 | Blabla | 110.0 | NaN | 99.0 | 33.0 | NaN | 154.0 | ... | 22.0 | 11.0 | 22.0 | 55.0 | 110.0 | NaN | 99.0 | 33.0 | NaN | 154.0 |
2 | 22.0 | 33.0 | 66.0 | Blabla | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 | ... | 33.0 | 22.0 | 33.0 | 66.0 | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 |
3 | 33.0 | 44.0 | 77.0 | Blabla | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN | ... | 44.0 | 33.0 | 44.0 | 77.0 | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN |
4 | 44.0 | 55.0 | 88.0 | Blabla | 143.0 | 33.0 | 132.0 | 66.0 | 22.0 | NaN | ... | 55.0 | 44.0 | 55.0 | 88.0 | 143.0 | 33.0 | 132.0 | 66.0 | 22.0 | NaN |
Of course there's also a tail() function to view the bottom 5 rows. You can pass the number of rows you want.
In [88]:
large_df.tail(n=2)
Out[88]:
A | B | C | some_text | D | E | F | G | H | I | ... | Q | R | S | T | U | V | W | X | Y | Z | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9998 | 22.0 | 33.0 | 66.0 | Blabla | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 | ... | 33.0 | 22.0 | 33.0 | 66.0 | 121.0 | 11.0 | 110.0 | 44.0 | NaN | 165.0 |
9999 | 33.0 | 44.0 | 77.0 | Blabla | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN | ... | 44.0 | 33.0 | 44.0 | 77.0 | 132.0 | 22.0 | 121.0 | 55.0 | 11.0 | NaN |
The info() method prints out a summary of each column's contents.
In [89]:
large_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 27 columns):
A            8823 non-null float64
B            8824 non-null float64
C            8824 non-null float64
some_text    10000 non-null object
D            8824 non-null float64
E            8822 non-null float64
F            8824 non-null float64
G            8824 non-null float64
H            8822 non-null float64
I            8823 non-null float64
J            8823 non-null float64
K            8822 non-null float64
L            8824 non-null float64
M            8824 non-null float64
N            8822 non-null float64
O            8824 non-null float64
P            8824 non-null float64
Q            8824 non-null float64
R            8823 non-null float64
S            8824 non-null float64
T            8824 non-null float64
U            8824 non-null float64
V            8822 non-null float64
W            8824 non-null float64
X            8824 non-null float64
Y            8822 non-null float64
Z            8823 non-null float64
dtypes: float64(26), object(1)
memory usage: 2.1+ MB
Finally, the describe() method gives a nice overview of the main aggregated values over each column:
- count: number of non-null (not NaN) values
- mean: mean of non-null values
- std: standard deviation of non-null values
- min: minimum of non-null values
- 25%, 50%, 75%: 25th, 50th and 75th percentiles of non-null values
- max: maximum of non-null values
In [90]:
large_df.describe()
Out[90]:
A | B | C | D | E | F | G | H | I | J | ... | Q | R | S | T | U | V | W | X | Y | Z | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8823.000000 | 8824.000000 | 8824.000000 | 8824.000000 | 8822.000000 | 8824.000000 | 8824.000000 | 8822.000000 | 8823.000000 | 8823.000000 | ... | 8824.000000 | 8823.000000 | 8824.000000 | 8824.000000 | 8824.000000 | 8822.000000 | 8824.000000 | 8824.000000 | 8822.000000 | 8823.000000 |
mean | 87.977559 | 87.972575 | 87.987534 | 88.012466 | 87.983791 | 88.007480 | 87.977561 | 88.000000 | 88.022441 | 88.022441 | ... | 87.972575 | 87.977559 | 87.972575 | 87.987534 | 88.012466 | 87.983791 | 88.007480 | 87.977561 | 88.000000 | 88.022441 |
std | 47.535911 | 47.535523 | 47.521679 | 47.521679 | 47.535001 | 47.519371 | 47.529755 | 47.536879 | 47.535911 | 47.535911 | ... | 47.535523 | 47.535911 | 47.535523 | 47.521679 | 47.521679 | 47.535001 | 47.519371 | 47.529755 | 47.536879 | 47.535911 |
min | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | ... | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 |
25% | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | ... | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 | 44.000000 |
50% | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | ... | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 | 88.000000 |
75% | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | ... | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 | 132.000000 |
max | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | ... | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 | 165.000000 |
Combining DataFrames¶
SQL-like joins¶
One powerful feature of pandas is its ability to perform SQL-like joins on DataFrames. Various types of joins are supported: inner joins, left/right outer joins and full joins. To illustrate this, let's start by creating a couple of simple DataFrames:
In [91]:
city_loc = pd.DataFrame(
[
["CA", "San Francisco", 37.781334, -122.416728],
["NY", "New York", 40.705649, -74.008344],
["FL", "Miami", 25.791100, -80.320733],
["OH", "Cleveland", 41.473508, -81.739791],
["UT", "Salt Lake City", 40.755851, -111.896657]
], columns=["state", "city", "lat", "lng"])
city_loc
Out[91]:
state | city | lat | lng | |
---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 |
1 | NY | New York | 40.705649 | -74.008344 |
2 | FL | Miami | 25.791100 | -80.320733 |
3 | OH | Cleveland | 41.473508 | -81.739791 |
4 | UT | Salt Lake City | 40.755851 | -111.896657 |
In [92]:
city_pop = pd.DataFrame(
[
[808976, "San Francisco", "California"],
[8363710, "New York", "New-York"],
[413201, "Miami", "Florida"],
[2242193, "Houston", "Texas"]
], index=[3,4,5,6], columns=["population", "city", "state"])
city_pop
Out[92]:
population | city | state | |
---|---|---|---|
3 | 808976 | San Francisco | California |
4 | 8363710 | New York | New-York |
5 | 413201 | Miami | Florida |
6 | 2242193 | Houston | Texas |
Now let's join these DataFrames using the merge() function:
In [93]:
pd.merge(left=city_loc, right=city_pop, on="city")
Out[93]:
state_x | city | lat | lng | population | state_y | |
---|---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | 808976 | California |
1 | NY | New York | 40.705649 | -74.008344 | 8363710 | New-York |
2 | FL | Miami | 25.791100 | -80.320733 | 413201 | Florida |
Note that both DataFrames have a column named state, so in the result they got renamed to state_x and state_y.
Also, note that Cleveland, Salt Lake City and Houston were dropped because they don't exist in both DataFrames. This is the equivalent of a SQL INNER JOIN. If you want a FULL OUTER JOIN, where no city gets dropped and NaN values are added, you must specify how="outer".
In [94]:
all_cities = pd.merge(left=city_loc, right=city_pop, on="city", how="outer")
all_cities
Out[94]:
state_x | city | lat | lng | population | state_y | |
---|---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | 808976.0 | California |
1 | NY | New York | 40.705649 | -74.008344 | 8363710.0 | New-York |
2 | FL | Miami | 25.791100 | -80.320733 | 413201.0 | Florida |
3 | OH | Cleveland | 41.473508 | -81.739791 | NaN | NaN |
4 | UT | Salt Lake City | 40.755851 | -111.896657 | NaN | NaN |
5 | NaN | Houston | NaN | NaN | 2242193.0 | Texas |
Of course, a LEFT OUTER JOIN is also available by setting how="left": only the cities present in the left DataFrame end up in the result. Similarly, with how="right" only cities in the right DataFrame appear in the result. For example:
In [95]:
pd.merge(left=city_loc, right=city_pop, on="city", how="right")
Out[95]:
state_x | city | lat | lng | population | state_y | |
---|---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | 808976 | California |
1 | NY | New York | 40.705649 | -74.008344 | 8363710 | New-York |
2 | FL | Miami | 25.791100 | -80.320733 | 413201 | Florida |
3 | NaN | Houston | NaN | NaN | 2242193 | Texas |
If the key to join on is actually in one (or both) DataFrame's index, you must use left_index=True and/or right_index=True. If the key column names differ, you must use left_on and right_on. For example:
In [96]:
city_pop2 = city_pop.copy()
city_pop2.columns = ["population", "name", "state"]
pd.merge(left=city_loc, right=city_pop2, left_on="city", right_on="name")
Out[96]:
state_x | city | lat | lng | population | name | state_y | |
---|---|---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | 808976 | San Francisco | California |
1 | NY | New York | 40.705649 | -74.008344 | 8363710 | New York | New-York |
2 | FL | Miami | 25.791100 | -80.320733 | 413201 | Miami | Florida |
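And here is a sketch of the index-based case, moving the key into city_pop's index and setting right_index=True:

pd.merge(left=city_loc, right=city_pop.set_index("city"), left_on="city", right_index=True)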
Concatenation¶
Rather than joining DataFrames, we may just want to concatenate them using the concat() function.
In [97]:
result_concat = pd.concat([city_loc, city_pop], sort = False)
result_concat
Out[97]:
state | city | lat | lng | population | |
---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | NaN |
1 | NY | New York | 40.705649 | -74.008344 | NaN |
2 | FL | Miami | 25.791100 | -80.320733 | NaN |
3 | OH | Cleveland | 41.473508 | -81.739791 | NaN |
4 | UT | Salt Lake City | 40.755851 | -111.896657 | NaN |
3 | California | San Francisco | NaN | NaN | 808976.0 |
4 | New-York | New York | NaN | NaN | 8363710.0 |
5 | Florida | Miami | NaN | NaN | 413201.0 |
6 | Texas | Houston | NaN | NaN | 2242193.0 |
Note that this operation aligned the data horizontally (by columns) but not vertically (by rows). In this example, we end up with multiple rows having the same index (e.g. 3). pandas handles this rather gracefully.
In [98]:
result_concat.loc[3]
Out[98]:
state | city | lat | lng | population | |
---|---|---|---|---|---|
3 | OH | Cleveland | 41.473508 | -81.739791 | NaN |
3 | California | San Francisco | NaN | NaN | 808976.0 |
Or you can tell pandas to just ignore the index.
In [99]:
pd.concat([city_loc, city_pop], ignore_index=True, sort = False)
Out[99]:
state | city | lat | lng | population | |
---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | NaN |
1 | NY | New York | 40.705649 | -74.008344 | NaN |
2 | FL | Miami | 25.791100 | -80.320733 | NaN |
3 | OH | Cleveland | 41.473508 | -81.739791 | NaN |
4 | UT | Salt Lake City | 40.755851 | -111.896657 | NaN |
5 | California | San Francisco | NaN | NaN | 808976.0 |
6 | New-York | New York | NaN | NaN | 8363710.0 |
7 | Florida | Miami | NaN | NaN | 413201.0 |
8 | Texas | Houston | NaN | NaN | 2242193.0 |
Notice that when a column does not exist in a DataFrame, it acts as if it was filled with NaN values. If we set join="inner", then only columns that exist in both DataFrames are returned.
In [100]:
pd.concat([city_loc, city_pop], join="inner")
Out[100]:
state | city | |
---|---|---|
0 | CA | San Francisco |
1 | NY | New York |
2 | FL | Miami |
3 | OH | Cleveland |
4 | UT | Salt Lake City |
3 | California | San Francisco |
4 | New-York | New York |
5 | Florida | Miami |
6 | Texas | Houston |
You can concatenate DataFrames horizontally instead of vertically by setting axis=1.
In [101]:
pd.concat([city_loc, city_pop], axis=1)
Out[101]:
state | city | lat | lng | population | city | state | |
---|---|---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | NaN | NaN | NaN |
1 | NY | New York | 40.705649 | -74.008344 | NaN | NaN | NaN |
2 | FL | Miami | 25.791100 | -80.320733 | NaN | NaN | NaN |
3 | OH | Cleveland | 41.473508 | -81.739791 | 808976.0 | San Francisco | California |
4 | UT | Salt Lake City | 40.755851 | -111.896657 | 8363710.0 | New York | New-York |
5 | NaN | NaN | NaN | NaN | 413201.0 | Miami | Florida |
6 | NaN | NaN | NaN | NaN | 2242193.0 | Houston | Texas |
In this case it really does not make much sense because the indices do not align well (e.g. Cleveland and San Francisco end up on the same row, because they shared the index label 3). So let's reindex the DataFrames by city name before concatenating:
In [102]:
pd.concat([city_loc.set_index("city"), city_pop.set_index("city")], axis=1, sort = False)
Out[102]:
state | lat | lng | population | state | |
---|---|---|---|---|---|
San Francisco | CA | 37.781334 | -122.416728 | 808976.0 | California |
New York | NY | 40.705649 | -74.008344 | 8363710.0 | New-York |
Miami | FL | 25.791100 | -80.320733 | 413201.0 | Florida |
Cleveland | OH | 41.473508 | -81.739791 | NaN | NaN |
Salt Lake City | UT | 40.755851 | -111.896657 | NaN | NaN |
Houston | NaN | NaN | NaN | 2242193.0 | Texas |
This looks a lot like a FULL OUTER JOIN, except that the state columns were not renamed to state_x and state_y, and the city column is now the index.
The append() method is a useful shorthand for concatenating DataFrames vertically:
In [103]:
city_loc.append(city_pop, sort = False)
Out[103]:
state | city | lat | lng | population | |
---|---|---|---|---|---|
0 | CA | San Francisco | 37.781334 | -122.416728 | NaN |
1 | NY | New York | 40.705649 | -74.008344 | NaN |
2 | FL | Miami | 25.791100 | -80.320733 | NaN |
3 | OH | Cleveland | 41.473508 | -81.739791 | NaN |
4 | UT | Salt Lake City | 40.755851 | -111.896657 | NaN |
3 | California | San Francisco | NaN | NaN | 808976.0 |
4 | New-York | New York | NaN | NaN | 8363710.0 |
5 | Florida | Miami | NaN | NaN | 413201.0 |
6 | Texas | Houston | NaN | NaN | 2242193.0 |
As always in pandas, the append() method does not actually modify city_loc: it works on a copy and returns the modified copy.
Categories¶
It is quite frequent to have values that represent categories, for example 1 for female and 2 for male, or "A" for Good, "B" for Average, "C" for Bad. These categorical values can be hard to read and cumbersome to handle, but fortunately pandas makes it easy. To illustrate this, let's take the city_pop DataFrame we created earlier, and add a column that represents a category:
In [104]:
city_econ = city_pop.copy()
city_econ["econ_code"] = [17, 17, 34, 20]
city_econ
Out[104]:
population | city | state | econ_code | |
---|---|---|---|---|
3 | 808976 | San Francisco | California | 17 |
4 | 8363710 | New York | New-York | 17 |
5 | 413201 | Miami | Florida | 34 |
6 | 2242193 | Houston | Texas | 20 |
Right now the econ_code column is full of apparently meaningless codes. Let's fix that. First, we will create a new categorical column based on the econ_codes:
In [105]:
city_econ["economy"] = city_econ["econ_code"].astype('category')
city_econ["economy"].cat.categories
Out[105]:
Int64Index([17, 20, 34], dtype='int64')
Now we can give each category a meaningful name:
In [106]:
city_econ["economy"].cat.categories = ["Finance", "Energy", "Tourism"]
city_econ
Out[106]:
population | city | state | econ_code | economy | |
---|---|---|---|---|---|
3 | 808976 | San Francisco | California | 17 | Finance |
4 | 8363710 | New York | New-York | 17 | Finance |
5 | 413201 | Miami | Florida | 34 | Tourism |
6 | 2242193 | Houston | Texas | 20 | Energy |
Note that categorical values are sorted according to their categorical order, not their alphabetical order:
In [107]:
city_econ.sort_values(by="economy", ascending=False)
Out[107]:
population | city | state | econ_code | economy | |
---|---|---|---|---|---|
5 | 413201 | Miami | Florida | 34 | Tourism |
6 | 2242193 | Houston | Texas | 20 | Energy |
3 | 808976 | San Francisco | California | 17 | Finance |
4 | 8363710 | New York | New-York | 17 | Finance |
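The categories above ended up in the order [Finance, Energy, Tourism], inherited from the sorted codes 17, 20, 34. If you want a different ordering, the reorder_categories() method lets you permute them (and optionally mark the categorical as ordered); a hedged sketch:

city_econ["economy"] = city_econ["economy"].cat.reorder_categories(["Tourism", "Energy", "Finance"], ordered=True)
city_econ.sort_values(by="economy")  # now sorts Tourism first, then Energy, then Finance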
Saving and loading¶
pandas can save DataFrames to various backends, including file formats such as CSV, Excel, JSON, HTML and HDF5, or to a SQL database. Let's create a DataFrame to demonstrate this:
In [108]:
my_df = pd.DataFrame(
[["Biking", 68.5, 1985, np.nan], ["Dancing", 83.1, 1984, 3]],
columns=["hobby","weight","birthyear","children"],
index=["alice", "bob"]
)
my_df
Out[108]:
hobby | weight | birthyear | children | |
---|---|---|---|---|
alice | Biking | 68.5 | 1985 | NaN |
bob | Dancing | 83.1 | 1984 | 3.0 |
Saving¶
Let's save it as a CSV file.
In [109]:
my_df.to_csv("my_df.csv")
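Other formats work much the same way; a quick sketch (these write files in the current directory):

my_df.to_json("my_df.json")   # JSON
my_df.to_html("my_df.html")   # static HTML table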
Loading¶
Now let's load our CSV file back into a DataFrame:
In [110]:
my_df_loaded = pd.read_csv("my_df.csv", index_col=0)
my_df_loaded
Out[110]:
hobby | weight | birthyear | children | |
---|---|---|---|---|
alice | Biking | 68.5 | 1985 | NaN |
bob | Dancing | 83.1 | 1984 | 3.0 |
As you might guess, there are similar read_json, read_html, and read_excel functions as well. We can also read data straight from the Internet. For example, let's load all New York State Zip codes from data.ny.gov:
In [111]:
ny_zip = None
try:
csv_url = "https://data.ny.gov/api/views/juva-r6g2/rows.csv"
ny_zip = pd.read_csv(csv_url, index_col=0)
ny_zip = ny_zip.head()
except IOError as e:
print(e)
ny_zip
Out[111]:
State FIPS | County Code | County FIPS | ZIP Code | File Date | |
---|---|---|---|---|---|
County Name | |||||
Albany | 36 | 1 | 36001 | 12046 | 07/25/2007 |
Albany | 36 | 1 | 36001 | 12083 | 07/25/2007 |
Albany | 36 | 1 | 36001 | 12085 | 07/25/2007 |
Albany | 36 | 1 | 36001 | 12201 | 07/25/2007 |
Albany | 36 | 1 | 36001 | 12203 | 07/25/2007 |
There are more options available, in particular regarding datetime format. Check out the documentation for more details.
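For example, read_csv can parse date columns on the fly via its parse_dates argument. A hedged sketch, assuming a hypothetical events.csv file with a timestamp column:

events = pd.read_csv("events.csv", parse_dates=["timestamp"])  # hypothetical file and column name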
Exercises¶
- Create a Series using the following numbers: 3.14, 2.718, 1.618, with the labels "pi", "euler's number", "golden ratio". Then filter it to keep only the values greater than 2.
- Create the following DataFrame (but use name as the index):
name | age | state | num_children | num_pets |
---|---|---|---|---|
john | 23 | iowa | 2 | 0 |
mary | 78 | dc | 2 | 4 |
peter | 22 | california | 0 | 0 |
jeff | 19 | texas | 1 | 5 |
bill | 45 | washington | 2 | 0 |
lisa | 33 | dc | 1 | 0 |
Then create a bar plot showing the age for each person in name.
- Add another person as a new row to the previous DataFrame with the following values (HINT: use pd.concat): name: mike, age: 0, state: new york, num_children: 1, num_pets: 0.
Since this new person has a child, his age cannot be zero. Replace it with the median age of all other people in the DataFrame.
Possible solutions¶
- Indexing and selecting in Series
s = pd.Series([3.14, 2.718, 1.618], index = ["pi", "euler's number", "golden ratio"])
s[s > 2]
- Creating and plotting a DataFrame
df1 = pd.DataFrame(data={'age':[23,78,22,19,45,33],'state':['iowa','dc','california','texas','washington','dc'],'num_children':[2,2,0,1,2,1],'num_pets':[0,4,0,5,0,0]},index=['john','mary','peter','jeff','bill','lisa'])
df1.plot(kind = "bar", y = "age")
- Adding a new row and handling missing data
df2 = pd.DataFrame(data={'age':[0],'state':['new york'],'num_children':[1],'num_pets':[0]}, index=['mike'])
df2['age'] = df1['age'].median() # replace mike's age with the median age of all other people
df3 = pd.concat([df1, df2]) # add mike as a new row