NumPy

NumPy (Numerical Python) is the core module for numerical computation in Python. NumPy contains a fast and memory-efficient implementation of a list-like array data structure and it contains useful linear algebra and random number functions. A large portion of NumPy is actually written in the C programming language.

A NumPy array is similar to Python's list data structure. A Python list can contain any combination of element types: integers, floats, strings, functions, objects, etc. A NumPy array, on the other hand, must contain only one element type at a time. This way, NumPy arrays can be much faster and more memory efficient.
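As a small illustration of the single-type rule, the sketch below passes a mixed list to np.array and inspects the result. The variable names are only for illustration.

```python
import numpy as np

mixed_list = [1, 2.5, 3]      # a Python list holding two ints and a float
arr = np.array(mixed_list)    # NumPy converts everything to one common type

print(arr)            # every element is now a float
print(arr.dtype)      # float64
print(arr.itemsize)   # bytes used per element (8 for float64)
```

Because every element has the same type and size, NumPy can store the data in one contiguous block of memory.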

Both the Pandas module (for data analysis) and the Scikit-Learn module (for machine learning) are built upon the NumPy module. The Matplotlib module (for plotting) also plays nicely with NumPy. These four modules plus base Python are practically all you need for basic to intermediate machine learning.

Two other fundamental Python modules closely related to machine learning are as follows - though we will not cover these in our tutorials:

  • SciPy: This module is for numerical computing including integration, differentiation, optimization, probability distributions, and parallel programming.
  • StatsModels: This module provides classes and functions for the estimation of many different statistical models, as well as for conducting statistical tests, and statistical data exploration.

Let's import numpy with the usual alias np.

In [1]:
import numpy as np

Creating arrays with NumPy

NumPy’s array class is called ndarray (the n-dimensional array). It is also known by the name array.

  • In a NumPy array, each dimension is called an axis and the number of axes is called the rank.
    • For example, a 3x4 matrix is an array of rank 2 (it is 2-dimensional).
    • The first axis has length 3, the second has length 4.
  • An array's list of axis lengths is called the shape of the array.
    • For example, a 3x4 matrix's shape is (3, 4).
    • The rank is equal to the shape's length.
  • The size of an array is the total number of elements, which is the product of all axis lengths (e.g., 3*4=12).

np.array

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data.

In [2]:
arr1 = np.array([2, 10.2, 5.4, 80, 0])
arr1
Out[2]:
array([ 2. , 10.2,  5.4, 80. ,  0. ])

Nested sequences, like a list of equal-length lists, will be converted into a multi-dimensional array:

In [3]:
data = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data)
arr2
Out[3]:
array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
In [4]:
arr2.shape
Out[4]:
(2, 4)
In [5]:
arr2.ndim  # equal to len(arr2.shape)
Out[5]:
2
In [6]:
arr2.size
Out[6]:
8

Other functions to create arrays

There are several other convenience NumPy functions to create arrays.

np.zeros

Creates an array containing any number of zeros.

In [7]:
np.zeros(5)
Out[7]:
array([0., 0., 0., 0., 0.])

It's just as easy to create a 2-D array (i.e., a matrix) by providing a tuple with the desired number of rows and columns. For example, here's a 2x3 matrix:

In [8]:
np.zeros((2, 3))  # notice the double parentheses
Out[8]:
array([[0., 0., 0.],
       [0., 0., 0.]])

You can also create an n-dimensional array of arbitrary rank. For example, here's a 3-D array (rank=3) with shape (2, 3, 2):

In [9]:
np.zeros((2, 3, 2))
Out[9]:
array([[[0., 0.],
        [0., 0.],
        [0., 0.]],

       [[0., 0.],
        [0., 0.],
        [0., 0.]]])

np.ones

Produces an array of all ones.

In [10]:
np.ones((2, 3))
Out[10]:
array([[1., 1., 1.],
       [1., 1., 1.]])

How to create an array with the same values:

In [11]:
(np.pi * np.ones((3,4))).round(2)
Out[11]:
array([[3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14],
       [3.14, 3.14, 3.14, 3.14]])
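An alternative worth knowing is np.full, which takes the shape and the fill value directly. The sketch below should produce the same array as the np.ones approach; the name filled is just illustrative.

```python
import numpy as np

filled = np.full((3, 4), np.pi)   # shape first, then the fill value
print(filled.round(2))

# Same result as scaling an array of ones
print(np.allclose(filled, np.pi * np.ones((3, 4))))
```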

np.arange

This is similar to Python's built-in range function, but it returns a NumPy array rather than a range object.

In [12]:
np.arange(5)
Out[12]:
array([0, 1, 2, 3, 4])
In [13]:
np.arange(1, 5)
Out[13]:
array([1, 2, 3, 4])

It also works with floats:

In [14]:
np.arange(1.0, 5.0)
Out[14]:
array([1., 2., 3., 4.])

Of course, you can provide a step parameter:

In [15]:
np.arange(1, 5, step = 0.5)
Out[15]:
array([1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])

np.linspace

This is similar to seq() in R. Its inputs are (start, stop, number of elements) and it returns evenly-spaced numbers over a specified interval. By default, the stop value is included.

In [16]:
np.linspace(0, 10, 6)
Out[16]:
array([ 0.,  2.,  4.,  6.,  8., 10.])
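Two optional arguments of np.linspace are handy in practice: endpoint=False drops the stop value, and retstep=True also returns the spacing between points. A short sketch (variable names are illustrative):

```python
import numpy as np

# Exclude the stop value: 5 points starting at 0 with spacing 2
half_open = np.linspace(0, 10, 5, endpoint=False)
print(half_open)          # [0. 2. 4. 6. 8.]

# Also return the spacing that linspace used
points, spacing = np.linspace(0, 10, 6, retstep=True)
print(points, spacing)    # spacing is 2.0
```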

np.quantile

Computes the q-th quantile of its input. It plays nicely with np.linspace.

In [17]:
a = np.arange(1, 51)
print('a =', a)
quartiles = np.linspace(0, 1, 5)
print('quartiles =', quartiles)
a = [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
 49 50]
quartiles = [0.   0.25 0.5  0.75 1.  ]
In [18]:
np.quantile(a, 0.5)  # how to compute the median
Out[18]:
25.5
In [19]:
np.quantile(a, quartiles)
Out[19]:
array([ 1.  , 13.25, 25.5 , 37.75, 50.  ])

np.random.rand and np.random.randn

A number of functions are available in NumPy's random module to create arrays initialized with random values. For example, here is a matrix initialized with random floats between 0 and 1 (uniform distribution):

In [20]:
np.random.rand(2,3).round(3)
Out[20]:
array([[0.228, 0.081, 0.971],
       [0.764, 0.925, 0.069]])

Here's a matrix containing random floats sampled from a univariate normal distribution (Gaussian distribution) with mean 0 and variance 1:

In [21]:
np.random.randn(2,3).round(3)
Out[21]:
array([[-0.428, -0.364, -0.81 ],
       [-0.132,  2.031, -0.017]])
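The outputs above change on every run. If you need reproducible random numbers, one option is NumPy's Generator interface created with np.random.default_rng and a fixed seed. The sketch below assumes an arbitrary seed of 42, chosen only for illustration.

```python
import numpy as np

seed = 42  # arbitrary seed chosen for illustration
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)

uniform_a = rng_a.random((2, 3))            # uniform floats in [0, 1)
uniform_b = rng_b.random((2, 3))            # same seed, so the same numbers
normal_a = rng_a.standard_normal((2, 3))    # normal with mean 0 and variance 1

print(np.array_equal(uniform_a, uniform_b))   # True
```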

Data types for arrays

Type      Description
int16     16-bit integer
int32     32-bit integer
int64     64-bit integer
float16   Half-precision floating point
float32   Standard single-precision floating point
float64   Standard double-precision floating point
bool      Boolean (True or False)
bytes_    Fixed-length byte string (named string_ before NumPy 2.0)
object    Any Python object

np.array.dtype

NumPy's arrays are also efficient in part because all their elements must have the same type (usually numbers). You can check what the data type is by looking at the dtype attribute.

In [22]:
arr1 = np.array([1, 2, 3], dtype = np.float64)
In [23]:
print("Data type name:", arr1.dtype.name)
Data type name: float64
In [24]:
arr2 = np.array([1, 2, 3], dtype = np.int32)
In [25]:
print(arr2.dtype, arr2)
int32 [1 2 3]

np.array.astype

You can explicitly convert or cast an array from one dtype to another using the astype method.

In [26]:
arr2.dtype
Out[26]:
dtype('int32')
In [27]:
arr2 = arr2.astype(np.float64)
In [28]:
arr2.dtype # integers are now cast to floating point
Out[28]:
dtype('float64')

If you have an array of strings representing numbers, you can use astype to convert them to numeric form.

In [29]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype = np.bytes_)  # np.bytes_ was called np.string_ before NumPy 2.0
numeric_strings
Out[29]:
array([b'1.25', b'-9.6', b'42'], dtype='|S4')
In [30]:
numeric_strings.astype(float)  # astype returns a new array; assign the result to a variable to keep it
Out[30]:
array([ 1.25, -9.6 , 42.  ])
In [31]:
numeric_strings.dtype
Out[31]:
dtype('S4')
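One thing to watch for when casting floats to integers: the fractional part is dropped rather than rounded. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([1.7, -1.7, 2.0])
as_int = values.astype(np.int64)   # fractional part is dropped (truncation toward zero)

print(as_int)          # [ 1 -1  2]
print(values.dtype)    # the original array is still float64
```

If you want rounding instead, you could call values.round() before casting.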

Arithmetic operations on arrays

All the usual arithmetic operators (+, -, *, /, //, **, etc.) can be used with arrays. They apply element-wise.

In [32]:
a = np.array([14, 23, 32, 41])
b = np.array([5,  4,  3,  2])
print("a + b  =", a + b)
print("a - b  =", a - b)
print("a * b  =", a * b)
print("a / b  =", a / b)
print("a // b  =", a // b)
print("a % b  =", a % b)
print("a ** b =", a ** b)
a + b  = [19 27 35 43]
a - b  = [ 9 19 29 39]
a * b  = [70 92 96 82]
a / b  = [ 2.8         5.75       10.66666667 20.5       ]
a // b  = [ 2  5 10 20]
a % b  = [4 3 2 1]
a ** b = [537824 279841  32768   1681]

Note that the multiplication is not a matrix multiplication.

The arrays must have the same shape. If they do not, NumPy will apply the broadcasting rules, which are discussed further below.

Reshaping arrays

In many cases, you can convert an array from one shape to another without copying any data.

np.array.shape

Changing the shape of an array is as simple as setting its shape attribute. However, the array's size must remain the same.

In [33]:
g = np.arange(12)
print(g)
print("Rank:", g.ndim)
[ 0  1  2  3  4  5  6  7  8  9 10 11]
Rank: 1
In [34]:
g.shape = (6, 2)
print(g)
print("Rank:", g.ndim)
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
Rank: 2
In [35]:
g.shape = (2, 3, 2)
print(g)
print("Rank:", g.ndim)
[[[ 0  1]
  [ 2  3]
  [ 4  5]]

 [[ 6  7]
  [ 8  9]
  [10 11]]]
Rank: 3

np.array.reshape

The reshape method returns a new array object that, whenever possible, points at the same data. This means that modifying one array will also modify the other.

In [36]:
g2 = g.reshape(4,3)  # reshape returns a new array object; g itself keeps its shape
print(g2)
print("Rank:", g2.ndim)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]
Rank: 2

How about we get lazy and let NumPy figure out the details?

In [37]:
g2 = g.reshape(4, -1)  
print(g2)
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]]

How to convert a multi-dimensional array back to 1-dimensional (a.k.a. array flattening): you can use reshape or flatten.

In [38]:
f = np.arange(6).reshape(3,2)
f
f.reshape(-1)
f.flatten()
Out[38]:
array([0, 1, 2, 3, 4, 5])
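The two approaches differ in one respect: flatten always returns a copy, while reshape(-1) returns a view when it can. The sketch below checks this with np.shares_memory; the names flat_view and flat_copy are illustrative.

```python
import numpy as np

f = np.arange(6).reshape(3, 2)
flat_view = f.reshape(-1)   # shares memory with f here
flat_copy = f.flatten()     # always an independent copy

print(np.shares_memory(f, flat_view))   # True
print(np.shares_memory(f, flat_copy))   # False

flat_copy[0] = 100   # does not touch f
flat_view[1] = 200   # writes through to f
print(f)
```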

Set item at row 1, col 2 to 999 (more about indexing below).

In [39]:
g2[1, 2] = 999
g2
Out[39]:
array([[  0,   1,   2],
       [  3,   4, 999],
       [  6,   7,   8],
       [  9,  10,  11]])

The corresponding element in g has been modified as well, even though g's shape is (2, 3, 2).

In [40]:
g
Out[40]:
array([[[  0,   1],
        [  2,   3],
        [  4, 999]],

       [[  6,   7],
        [  8,   9],
        [ 10,  11]]])

np.array.resize

If you want to reshape an array in-place, that is, change the shape of the original array, you can use resize().

In [41]:
g = np.arange(6)
g
g.resize(2,3)
# watch out! with resize(), you cannot use negative dimensions.
# so, this will not work: g.resize(2,-1)
g
g[0,0] = 111
g
Out[41]:
array([[111,   1,   2],
       [  3,   4,   5]])

Adding and removing elements

np.append and np.insert

In [42]:
a = np.arange(6)
print('original array:\n', a)

b = np.append(a, 111)
print('appending an element to the end:\n', b)

c = np.insert(a, 0, 111) 
print('inserting an element at a specific position:\n', c)

# watch out: these will NOT work: a.append(111), a.insert(0, 111)
original array:
 [0 1 2 3 4 5]
appending an element to the end:
 [  0   1   2   3   4   5 111]
inserting an element at a specific position:
 [111   0   1   2   3   4   5]
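Both functions return a new array and leave the original untouched. For 2-D input, np.append flattens everything unless you pass axis. A short sketch under those assumptions (grid and the other names are illustrative):

```python
import numpy as np

a = np.arange(3)
b = np.append(a, 111)
print(a)   # [0 1 2], unchanged
print(b)   # [  0   1   2 111]

grid = np.array([[1, 2, 3], [4, 5, 6]])
with_row = np.append(grid, [[7, 8, 9]], axis=0)   # add a row
no_axis = np.append(grid, [7, 8, 9])              # flattened result

print(with_row.shape)   # (3, 3)
print(no_axis.shape)    # (9,)
```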

np.delete

In [43]:
a = np.arange(6)
a
c = np.delete(a, [0,1])
print('deleting the first two elements:\n', c)

a.resize(2,3)
print('a after resize():\n', a)

e = np.delete(a, 0, axis=1) # you can delete an entire column by specifying axis=1
print('first column deleted:\n', e)

f = np.delete(a, 0, axis=0) # or you can delete an entire row by specifying axis=0
print('first row deleted:\n', f)
deleting the first two elements:
 [2 3 4 5]
a after resize():
 [[0 1 2]
 [3 4 5]]
first column deleted:
 [[1 2]
 [4 5]]
first row deleted:
 [[3 4 5]]

Copying arrays

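For efficiency, NumPy avoids copying data where it can. Plain assignment (b = a) only binds another name to the same array, and slices return views rather than copies. If you want an independent copy, you need to ask for one explicitly.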

You can use either np.array.copy or np.copy.

In [44]:
b = a = np.arange(6)
a_copy = a.copy()
# alternatively,
a_copy = np.copy(a)
a
b
a_copy
print(a == a_copy)  # element-wise comparison
print(a is a_copy)  # this is False
print(a is b)  # this is True
a[0] = -111  # changing a has no effect on a_copy
a
a_copy
[ True  True  True  True  True  True]
False
True
Out[44]:
array([0, 1, 2, 3, 4, 5])
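The same caution applies to slices. The sketch below shows that writing to a slice changes the original array, while writing to an explicit copy does not. The names window and safe are illustrative.

```python
import numpy as np

a = np.arange(6)
window = a[1:4]          # a slice is a view onto a
safe = a[1:4].copy()     # an explicit, independent copy

window[0] = -1           # also changes a[1]
safe[1] = 500            # a stays untouched

print(a)                              # [ 0 -1  2  3  4  5]
print(np.shares_memory(a, window))    # True
print(np.shares_memory(a, safe))      # False
```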

Broadcasting

Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Broadcasting can get complicated, so we recommend you avoid it altogether if you can and stick to one of the two cases below:

  • Broadcast only a scalar with an array
  • Broadcast arrays of the same shape
In [45]:
A = np.arange(6).reshape(3,2)
B = np.arange(6, 12).reshape(3,2)
A
B
Out[45]:
array([[ 6,  7],
       [ 8,  9],
       [10, 11]])
In [46]:
A + B
Out[46]:
array([[ 6,  8],
       [10, 12],
       [14, 16]])
In [47]:
3 * A
Out[47]:
array([[ 0,  3],
       [ 6,  9],
       [12, 15]])
In [48]:
(A / 3).round(2)  # float division
Out[48]:
array([[0.  , 0.33],
       [0.67, 1.  ],
       [1.33, 1.67]])
In [49]:
A // 3  # integer division
Out[49]:
array([[0, 0],
       [0, 1],
       [1, 1]])
In [50]:
11 + A
Out[50]:
array([[11, 12],
       [13, 14],
       [15, 16]])
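For completeness, here is a minimal sketch of what happens when the shapes differ. NumPy compares shapes from the last axis backwards, and two axis lengths are compatible when they are equal or one of them is 1. The arrays row and col are made up for illustration.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)          # shape (3, 2)
row = np.array([10, 20])                # shape (2,)
col = np.array([[100], [200], [300]])   # shape (3, 1)

print(A + row)   # row is added to every row of A
print(A + col)   # col is added to every column of A

# Incompatible shapes raise an error: (3, 2) and (3,) do not line up
try:
    A + np.array([1, 2, 3])
    failed = False
except ValueError:
    failed = True
print(failed)    # True
```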

Element-wise matrix multiplication is done by *.

In [51]:
A * B
Out[51]:
array([[ 0,  7],
       [16, 27],
       [40, 55]])

For usual matrix multiplication, you can use np.dot (or, equivalently for 2-D arrays, the @ operator).

In [52]:
B_new = B.reshape(2,-1)
B_new
np.dot(A, B_new)
Out[52]:
array([[ 9, 10, 11],
       [39, 44, 49],
       [69, 78, 87]])
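As a quick check, the sketch below rebuilds the same two matrices and confirms that the @ operator gives the same product as np.dot.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)
B_new = np.arange(6, 12).reshape(2, 3)

product = A @ B_new    # matrix product, shape (3, 3)
print(product)
print(np.array_equal(product, np.dot(A, B_new)))   # True
```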

Conditional expressions with arrays

In [53]:
x = np.array([10,20,30,40,50])
x >= 30
Out[53]:
array([False, False,  True,  True,  True])
In [54]:
x[x >= 30]
Out[54]:
array([30, 40, 50])
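Conditions can be combined with & (and), | (or), and ~ (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A short sketch with illustrative names:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

in_range = (x >= 20) & (x < 50)   # element-wise AND
outside = (x < 20) | (x >= 50)    # element-wise OR

print(x[in_range])    # [20 30 40]
print(x[outside])     # [10 50]
print(x[~in_range])   # [10 50]
```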

np.where

Returns the indices of elements in an input array where the given condition is satisfied.

In [55]:
y = np.arange(10)
print(y)
np.where(y < 5)
[0 1 2 3 4 5 6 7 8 9]
Out[55]:
(array([0, 1, 2, 3, 4]),)

Extremely useful: You can use where for vectorised if-else statements.

In [56]:
compared_to_5 = list(np.where(y < 5, 'smaller', 'bigger'))
print(compared_to_5)
['smaller', 'smaller', 'smaller', 'smaller', 'smaller', 'bigger', 'bigger', 'bigger', 'bigger', 'bigger']
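The same pattern works with numeric replacements, and with 2-D input the one-argument form returns one index array per axis. The arrays below are made up for illustration.

```python
import numpy as np

z = np.array([-2.5, 3.1, 7.0, -0.5])
clipped = np.where(z < 0, 0.0, z)   # replace negatives with 0, keep the rest
print(clipped)

grid = np.array([[1, 8], [9, 2]])
rows, cols = np.where(grid > 5)     # row indices and column indices of the matches
print(rows, cols)                   # [0 1] [1 0]
```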

Mathematical and statistical functions

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible both as top-level NumPy functions (e.g., np.max(a)) and as methods of the array class (e.g., a.max()).

In [57]:
a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
print(a)
[[-2.5  3.1  7. ]
 [10.  11.  12. ]]
In [58]:
np.max(a)
Out[58]:
12.0
In [59]:
np.min(a)
Out[59]:
-2.5
In [60]:
np.mean(a)
Out[60]:
6.766666666666667
In [61]:
np.prod(a)
Out[61]:
-71610.0
In [62]:
np.std(a)
Out[62]:
5.084835843520964
In [63]:
np.var(a)
Out[63]:
25.855555555555554
In [64]:
np.sum(a)
Out[64]:
40.6
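Note that np.std and np.var divide by n by default (the population formulas). If you want the sample versions that divide by n - 1, you can pass ddof=1. A minimal sketch using the same array:

```python
import numpy as np

a = np.array([[-2.5, 3.1, 7], [10, 11, 12]])
n = a.size

population_var = np.var(a)        # divides by n (ddof=0, the default)
sample_var = np.var(a, ddof=1)    # divides by n - 1

print(population_var, sample_var)
print(np.isclose(sample_var, population_var * n / (n - 1)))   # True
```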

These functions accept an optional argument axis which lets you ask for the operation to be performed on elements along the given axis. For example:

In [65]:
b = np.arange(12).reshape(2,-1)
b
Out[65]:
array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11]])
In [66]:
b.sum(axis=0)  # sum down the rows: one total per column
Out[66]:
array([ 6,  8, 10, 12, 14, 16])
In [67]:
b.sum(axis=1)  # sum along each row: one total per row
Out[67]:
array([15, 51])
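The axis argument works the same way for the other reductions. The sketch below also uses keepdims=True, which keeps the reduced axis with length 1 so the result can be combined with the original array. Variable names are illustrative.

```python
import numpy as np

b = np.arange(12).reshape(2, -1)            # shape (2, 6)

col_means = b.mean(axis=0)                  # one mean per column, shape (6,)
row_max = b.max(axis=1)                     # one maximum per row, shape (2,)
row_totals = b.sum(axis=1, keepdims=True)   # shape (2, 1)

print(col_means)         # [3. 4. 5. 6. 7. 8.]
print(row_max)           # [ 5 11]
print(b / row_totals)    # each row now sums to 1
```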

Universal functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

Many ufuncs are simple element-wise transformations, like sqrt or exp. These are referred to as unary ufuncs.

In [68]:
z = np.array([[-2.5, 3.1, 7], [10, 11, 12]])

np.square

Element-wise square of the input.

In [69]:
np.square(z)
Out[69]:
array([[  6.25,   9.61,  49.  ],
       [100.  , 121.  , 144.  ]])

np.exp

Calculate the exponential of all elements in the input array.

In [70]:
np.exp(z)
Out[70]:
array([[8.20849986e-02, 2.21979513e+01, 1.09663316e+03],
       [2.20264658e+04, 5.98741417e+04, 1.62754791e+05]])

Binary universal functions

Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:

In [71]:
x = np.array([3, 6, 1])
y = np.array([4, 2, 9])
print(x)
print(y)
[3 6 1]
[4 2 9]

np.maximum

Element-wise maximum of array elements - do not confuse with np.max which finds the max element in the array.

In [72]:
np.maximum(x,y)
Out[72]:
array([4, 6, 9])

np.minimum

Element-wise minimum of array elements - do not confuse with np.min which finds the min element in the array.

In [73]:
np.minimum(x,y)
Out[73]:
array([3, 2, 1])

np.power

First array elements raised to powers from second array, element-wise.

In [74]:
np.power(x,y)
Out[74]:
array([81, 36,  1])

Array indexing and slicing

One-dimensional arrays

One-dimensional NumPy arrays can be accessed more or less like regular Python lists:

In [75]:
a = np.array([1, 5, 3, 19, 13, 7, 3])
a[3]
Out[75]:
19
In [76]:
a[2:5]
Out[76]:
array([ 3, 19, 13])
In [77]:
a[2:-1]
Out[77]:
array([ 3, 19, 13,  7])
In [78]:
a[:2]
Out[78]:
array([1, 5])
In [79]:
a[2::2]
Out[79]:
array([ 3, 13,  3])
In [80]:
a[::-1]
Out[80]:
array([ 3,  7, 13, 19,  3,  5,  1])

Of course, you can modify elements:

In [81]:
a[3]=999
a
Out[81]:
array([  1,   5,   3, 999,  13,   7,   3])

You can also modify an array slice:

In [82]:
a[2:5] = [997, 998, 999]
a
Out[82]:
array([  1,   5, 997, 998, 999,   7,   3])

Multi-dimensional arrays

Multi-dimensional arrays can be accessed in a similar way by providing an index or slice for each axis, separated by commas:

In [83]:
b = np.arange(48).reshape(4, 12)
b
Out[83]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35],
       [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47]])
In [84]:
b[1, 1]  # row 1, col 1 (recall that indexing starts at 0)
Out[84]:
13
In [85]:
b[1, :]  # row 1, all columns
Out[85]:
array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
In [86]:
b[:, 1]  # all rows, column 1
Out[86]:
array([ 1, 13, 25, 37])

Caution: Note the subtle difference between these two expressions:

In [87]:
b[1, :]
Out[87]:
array([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])
In [88]:
b[1:2, :]
Out[88]:
array([[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])

The first expression returns row 1 as a 1D array of shape (12,), while the second returns that same row as a 2D array of shape (1, 12).
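Slices can be used on both axes at once, and a list of indices selects specific rows or columns. A short sketch on the same array (the names block, picked_rows, and last_col are illustrative):

```python
import numpy as np

b = np.arange(48).reshape(4, 12)

block = b[1:3, 2:5]          # rows 1-2, columns 2-4
picked_rows = b[[0, 3], :]   # rows 0 and 3, selected with a list of indices
last_col = b[:, -1]          # last column of every row

print(block)
print(picked_rows.shape)   # (2, 12)
print(last_col)            # [11 23 35 47]
```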

Transposing arrays

An array's transpose method or T attribute transposes the array.

In [89]:
a = np.arange(10).reshape(5,-1)
a
Out[89]:
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
In [90]:
a_t = a.transpose()
a_t
Out[90]:
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])
In [91]:
a_t = a.T  # this also works
a_t
Out[91]:
array([[0, 2, 4, 6, 8],
       [1, 3, 5, 7, 9]])

Combining arrays

np.vstack: stack arrays vertically

In [92]:
a = 1 + np.arange(3)
b = -1 * a
c = 10 + a
print(a)
print(b)
print(c)
d = np.vstack((a, b, c))  # notice the double parentheses
print('stack vertically:\n', d)
[1 2 3]
[-1 -2 -3]
[11 12 13]
stack vertically:
 [[ 1  2  3]
 [-1 -2 -3]
 [11 12 13]]

np.hstack: stack arrays horizontally

In [93]:
d = np.hstack((a, b, c))  # notice the double parentheses
print('stack horizontally:\n', d)
stack horizontally:
 [ 1  2  3 -1 -2 -3 11 12 13]
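Notice that np.hstack on 1-D arrays simply joins them end to end. If you want each 1-D array to become a column instead, np.column_stack is one option, and np.concatenate with an explicit axis covers the general case. A minimal sketch:

```python
import numpy as np

a = 1 + np.arange(3)
b = -1 * a
c = 10 + a

as_columns = np.column_stack((a, b, c))   # each 1-D array becomes a column
print(as_columns)

m = np.vstack((a, b))                      # shape (2, 3)
joined = np.concatenate((m, m), axis=1)    # side by side, shape (2, 6)
print(joined.shape)
```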

Sorting arrays

You can use an array's sort method, but pay attention as sorting is done in-place!

In [94]:
a = np.array([3, 5, -1, 0, 11])
print(a)
sort_output = a.sort()
print('a has been sorted in place:\n', a)
print(sort_output) # tricky: this will print None!
[ 3  5 -1  0 11]
a has been sorted in place:
 [-1  0  3  5 11]
None

If you do not want to sort in place, you need to use np.sort.

In [95]:
a = np.array([3, 5, -1, 0, 11])
print(a)
b = np.sort(a)
print(b)
print('Notice a is not changed:\n', a)
[ 3  5 -1  0 11]
[-1  0  3  5 11]
Notice a is not changed:
 [ 3  5 -1  0 11]

To sort in descending order, sort first and then reverse the result, as the sort functions have no option for descending order.

In [96]:
a_reverse_sorted = np.sort(a)[::-1]
print(a_reverse_sorted)
[11  5  3  0 -1]
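A related tool is np.argsort, which returns the indices that would sort the array. This is useful for reordering a second array in the same way. The labels array below is hypothetical and only there for illustration.

```python
import numpy as np

a = np.array([3, 5, -1, 0, 11])
order = np.argsort(a)      # indices that sort a in ascending order

print(order)       # [2 3 0 1 4]
print(a[order])    # [-1  0  3  5 11]

labels = np.array(['c', 'd', 'a', 'b', 'e'])   # hypothetical labels, one per element of a
print(labels[order])   # ['a' 'b' 'c' 'd' 'e']
```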

Exercises

1- Initialize a 5 $\times$ 3 2D array with all numbers divisible by 3 from 3 up to (but not including) 48. HINT: use np.arange's step argument. For example, you can create an array of 0, 2, 4, 6, 8 by calling np.arange(0, 10, step = 2). Then slice the last column of the array.

2- Create an array, say a = np.random.uniform(1, 10, 10). Find the location (index) of the maximum value in a. How about the location of the minimum value? HINT: use the argmax and argmin methods.

3- Create the following array and find the maximum values in each row. How about column-wise maximum values? HINT: use np.amax.

$$A = \begin{bmatrix} 1 & 3 & 4 \\ 2 & 7 & -1 \end{bmatrix}$$

4- Missing values such as NA and nan are not uncommon in data science (technically, nan is not a missing value. It stands for not-a-number.) Create the following matrix which contains one nan using np.nan.

$$B = \begin{bmatrix} 1 & 3 & \text{nan} \\ 2 & 7 & -1 \end{bmatrix}$$

5- Find the column-wise and the row-wise maximum values in B created in the previous question. Does np.amax return any value? HINT: Try the np.nanmax function.

Possible solutions

1- Initializing and slicing arrays

import numpy as np
# Create and reshape the array
myarray = np.arange(3, 48, step = 3)
myarray.shape = (5, 3)

# Slice the last column
myarray[:,2]

2- Indexing the maximum and minimum

import numpy as np
a = np.random.uniform(1, 10, 10)
a
a.argmax() # Index of the maximum value
a.argmin() # Index of the minimum value

3- Column-wise and row-wise maximum and minimum values.

import numpy as np
A = np.array([[1, 3, 4],[2, 7, -1]])
np.amax(A, axis = 0) # Column-wise
np.amax(A, axis = 1) # Row-wise

4- Creating nan with numpy.

import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])

5- Column-wise and row-wise maximum and minimum values in the presence of nan values.

import numpy as np
B = np.array([[1, 3, np.nan],[2, 7, -1]])
np.nanmax(B, axis = 0) # Column-wise
np.nanmax(B, axis = 1) # Row-wise
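To see why np.nanmax is needed, the sketch below compares it with np.amax on the same matrix. With np.amax, any row or column that contains nan yields nan.

```python
import numpy as np

B = np.array([[1, 3, np.nan], [2, 7, -1]])

plain_max = np.amax(B, axis=0)     # the third column's maximum is nan
nan_aware = np.nanmax(B, axis=0)   # nan is ignored

print(plain_max)
print(nan_aware)               # [ 2.  7. -1.]
print(np.nanmax(B, axis=1))    # [3. 7.]
```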

www.featureranking.com