Data Preparation for Machine Learning


Data Preparation for Statistical Modeling and Machine Learning

This tutorial covers data preparation for statistical modeling and machine learning. In our terminology, the feature we would like to predict is called the "target" feature, and the other features that we use for the prediction are called the "descriptive" features. The data must be in a certain format before we can perform modeling using Scikit-Learn. In general, the following steps will be necessary for data preparation, in this specific order:

  1. Outliers and unusual values (such as a negative age) are taken care of: they are either imputed, dropped, or set to missing values. We do not cover this subject and refer the reader to Chapters 2 and 3 in our FMLPDA textbook.
  2. Missing values are imputed or the rows containing them are dropped.
  3. Any categorical descriptive feature is encoded to be numeric as follows:
    • one-hot-encoding for nominals,
    • one-hot-encoding or integer-encoding for ordinals.
  4. All descriptive features (which are all numeric at this point) are scaled.
  5. In case of a classification problem, the target feature is label-encoded (in case of a binary problem, the positive class is encoded as 1).
  6. If the dataset has too many observations, only a small random subset of the entire dataset is selected to be used during model tuning and model comparison.
  7. Before fitting any Scikit-Learn models, any Pandas series or data frame is converted to a NumPy array using the values method in Pandas.

We now briefly describe the above steps. We will first have to make sure there are no missing values anywhere, neither in the descriptive features nor the target feature. Dealing with missing values can involve various imputation techniques in practice. However, we will stick to our topic of modeling and we will simply remove any rows with any missing values in this tutorial.

Next, we will have to ensure that all categorical features are encoded correctly. For nominal categorical descriptive features, we will use one-hot-encoding where a categorical feature with $q$ levels is replaced with $q$ binary variables, the $i$-th of which indicates membership of the $i$-th level. For ordinal categorical descriptive features, we can use either one-hot-encoding or integer-encoding.

For categorical target features, we will use label-encoding so that a categorical target feature with $t$ levels is encoded as integers in the range $0, 1, \ldots, t-1$. Here, the encoding is done in alphabetical order. That is, the level that would be the first after an alphabetical sorting will be encoded as 0, and so forth. However, as we will see in tutorial SK Part 4: Evaluation, in the case of a binary classification, we need the "positive" class to be encoded as "1" and the negative class to be encoded as "0". So, we will have to make sure that regardless of the alphabetical order, the positive class is always encoded as 1.

For regression problems where the target feature is numerical, no label-encoding is necessary. On the other hand, if the range of the numerical target feature is quite large (salaries might be a good example here), then a log transformation might be useful.
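As a minimal sketch of the log transformation mentioned above, the snippet below uses a made-up `salaries` array (the values are hypothetical and not from any dataset in this tutorial). It uses NumPy's `log1p` and `expm1` so that the transformation can be undone after prediction.

```python
import numpy as np

# Hypothetical salary values spanning several orders of magnitude
salaries = np.array([25_000.0, 48_000.0, 90_000.0, 350_000.0, 2_000_000.0])

# log1p computes log(1 + x), which also handles zero values safely
log_salaries = np.log1p(salaries)

# expm1 is the inverse of log1p, so predictions can be mapped back to dollars
recovered = np.expm1(log_salaries)

print("max/min ratio, original scale:", salaries.max() / salaries.min())
print("max/min ratio, log scale:     ", log_salaries.max() / log_salaries.min())
```

If a model is fitted on the log-transformed target, its predictions are on the log scale and would need `np.expm1` applied before being reported.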

Many machine learning algorithms (such as nearest neighbor methods) require numerical descriptive features to be scaled in some fashion, and scaling usually does not hurt the other types of algorithms. So, it is always a good idea to scale the numerical descriptive features before modeling.

Sometimes your dataset will have too many observations for your computer to handle (as in millions of rows). If this is the case, it might be a good idea to select a small random subset of the entire set of observations before trying out any models. Once you decide on which model to use, you can then use the entire dataset for final training before deploying your model.
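A rough sketch of selecting such a random subset is given below. The data frame `big_df`, its column names, and the subset size of 5,000 rows are all assumptions made for illustration; the only Pandas call relied upon is `DataFrame.sample`.

```python
import numpy as np
import pandas as pd

# Hypothetical large dataset: 100,000 rows, two descriptive features and a target
rng = np.random.default_rng(seed=999)
big_df = pd.DataFrame({
    "x1": rng.normal(size=100_000),
    "x2": rng.uniform(size=100_000),
    "target": rng.integers(0, 2, size=100_000),
})

# Draw 5,000 rows at random without replacement;
# random_state makes the selection repeatable
small_df = big_df.sample(n=5_000, random_state=999)

print(small_df.shape)
```

Setting `random_state` is a design choice that keeps the model comparison reproducible across runs.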

Scikit-Learn models do not play well with Pandas series or data frames. For this reason, before fitting any models, we have to ensure that any Pandas variable is converted to a NumPy array using its values method.

In this tutorial, we will see how the data preparation steps described above can be performed using the NumPy, Pandas, and Scikit-Learn modules.

Learning Objectives

  • Load datasets from sklearn as well as from the Cloud
  • Perform initial data preparation steps
  • Deal with missing values
  • Discretize numeric features and make categorical features numeric
  • Encode categorical target features in classification problems
  • Scale numerical descriptive features: min-max, standard, and robust scaling
  • Select a small random subset of available rows

Binary Classification Example: Breast Cancer Wisconsin Data

This dataset contains 569 observations and has 30 input features from breast cancer screening tissue samples. The target feature has two classes: 212 malignant ("M") and 357 benign ("B").

We can load the data from sklearn (as shown further below), or we can read the data from the GitHub account given below. The reason we prefer the GitHub account here is that the version in sklearn does not have column names.

In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import io
import requests

import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and
    getattr(ssl, '_create_unverified_context', None)): 
    ssl._create_default_https_context = ssl._create_unverified_context

cancer_url = 'https://raw.githubusercontent.com/vaksakalli/datasets/master/breast_cancer_wisconsin.csv'
url_content = requests.get(cancer_url).content
cancer_df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))

Let's check the shape of this dataset to make sure it has been downloaded correctly.

In [2]:
cancer_df.shape
Out[2]:
(569, 31)

Let's have a look at 10 randomly selected rows in this raw dataset.

In [3]:
cancer_df.sample(n=10, random_state=8)
Out[3]:
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_texture worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension diagnosis
325 12.670 17.30 81.25 489.9 0.10280 0.07664 0.031930 0.021070 0.1707 0.05984 ... 21.10 88.70 574.4 0.13840 0.12120 0.10200 0.05602 0.2688 0.06888 B
557 9.423 27.88 59.26 271.3 0.08123 0.04971 0.000000 0.000000 0.1742 0.06059 ... 34.24 66.50 330.6 0.10730 0.07158 0.00000 0.00000 0.2475 0.06969 B
475 12.830 15.73 82.89 506.9 0.09040 0.08269 0.058350 0.030780 0.1705 0.05913 ... 19.35 93.22 605.8 0.13260 0.26100 0.34760 0.09783 0.3006 0.07802 B
308 13.500 12.71 85.69 566.2 0.07376 0.03614 0.002758 0.004419 0.1365 0.05335 ... 16.94 95.48 698.7 0.09023 0.05836 0.01379 0.02210 0.2267 0.06192 B
553 9.333 21.94 59.01 264.0 0.09240 0.05605 0.039960 0.012820 0.1692 0.06576 ... 25.05 62.86 295.8 0.11030 0.08298 0.07993 0.02564 0.2435 0.07393 B
159 10.900 12.96 68.69 366.8 0.07515 0.03718 0.003090 0.006588 0.1442 0.05743 ... 18.20 78.07 470.0 0.11710 0.08294 0.01854 0.03953 0.2738 0.07685 B
290 14.410 19.73 96.03 651.0 0.08757 0.16760 0.136200 0.066020 0.1714 0.07192 ... 22.13 101.70 767.3 0.09983 0.24720 0.22200 0.10210 0.2272 0.08799 B
146 11.800 16.58 78.99 432.0 0.10910 0.17000 0.165900 0.074150 0.2678 0.07371 ... 26.38 91.93 591.7 0.13850 0.40920 0.45040 0.18650 0.5774 0.10300 M
431 12.400 17.68 81.47 467.8 0.10540 0.13160 0.077410 0.027990 0.1811 0.07102 ... 22.91 89.61 515.8 0.14500 0.26290 0.24030 0.07370 0.2556 0.09359 B
527 12.340 12.27 78.94 468.5 0.09003 0.06307 0.029580 0.026470 0.1689 0.05808 ... 19.27 87.22 564.9 0.12920 0.20740 0.17910 0.10700 0.3110 0.07592 B

10 rows × 31 columns

First Steps

Before undertaking more involved data preparation steps, we need to perform some basic steps, which are described below.

ID Columns for Rows

Sometimes the dataset at hand will have a unique value for each row, such as Customer ID, Patient ID, Record ID, etc. Such features are irrelevant in machine learning. As an example, consider a new patient whose ID is 84537. How is this ID going to help us with predicting this patient's health status? For this reason, we need to remove any column whose unique value count is the same as the number of rows. Keep in mind that a continuous numeric feature can also happen to have a distinct value in every row, so it is worth checking which columns this rule would remove before discarding them. We can achieve this as follows for a dataset object df.

df = df.loc[:, df.nunique() != df.shape[0]]

Constant Features

Sometimes a dataset will have constant features (that have only one unique value). Such features are also irrelevant for machine learning, so they need to be removed as follows.

df = df.loc[:, df.nunique() != 1]
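
To illustrate the two filters above, here is a small sketch on a made-up data frame `toy_df` (the column names and values are hypothetical). It applies the ID-column rule first and the constant-feature rule second.

```python
import pandas as pd

# Hypothetical dataset with an ID column and a constant column
toy_df = pd.DataFrame({
    "patient_id": [84537, 84538, 84539, 84540],   # unique in every row
    "hospital": ["A", "A", "A", "A"],             # constant
    "age": [34, 51, 51, 62],
    "status": ["sick", "healthy", "healthy", "sick"],
})

# Remove columns whose unique value count equals the number of rows
toy_df = toy_df.loc[:, toy_df.nunique() != toy_df.shape[0]]

# Remove columns that have only one unique value
toy_df = toy_df.loc[:, toy_df.nunique() != 1]

print(toy_df.columns.tolist())
```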

Other Irrelevant Features

While being problem-specific, any feature that is not relevant for "learning" needs to be removed. For instance, suppose there is a column in your dataset that shows the particular database this row was extracted from. If this information is purely for record-keeping purposes, then you should remove this column before fitting any models.

Redundant Features

A descriptive feature is "redundant" if it conveys the same information as another feature. Suppose a customer's salary is stored in two columns: one in US Dollars and the other one in AU Dollars. You need only one of these columns.

Date and Time Features

Date and time features need careful attention. Date features such as birthdays cannot be used as they are and they must be transformed. In the birthday example, the logical approach would be to convert it to a new feature called "age". For the conversion, you need to subtract the birth date from the current date and cast the result to a year quantity. As for time features, they should be transformed to durations, such as number of hours since a particular reference point in time.
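
The conversions described above can be sketched as follows. The data frame `people`, the fixed `reference_date`, and the `event_time` value are assumptions made for illustration; a fixed reference date is used instead of the current date so that the result is reproducible. Dividing the elapsed days by 365.25 is an approximation of age in years.

```python
import pandas as pd

# Hypothetical birth dates stored as strings
people = pd.DataFrame({"birth_date": ["1990-06-15", "2001-01-31", "1975-12-01"]})
people["birth_date"] = pd.to_datetime(people["birth_date"])

# Fixed reference date (an assumption; in practice this could be the current date)
reference_date = pd.Timestamp("2024-01-01")

# Elapsed days divided by 365.25 approximates age; floor to completed years
elapsed_days = (reference_date - people["birth_date"]).dt.days
people["age"] = (elapsed_days // 365.25).astype(int)

# A time feature expressed as a duration: hours since the reference point
event_time = pd.Timestamp("2024-01-02 06:00")
hours_since_reference = (event_time - reference_date) / pd.Timedelta(hours=1)

print(people)
print(hours_since_reference)
```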

Checking for Missing Values

Models in Scikit-Learn do not work with data with missing values. Let's check to see which columns have missing values in our dataset. Missing values are a bit complicated in Python as they can be denoted by either "na" or "null" in Pandas (both mean the same thing). Furthermore, NumPy denotes missing values as "NaN" (that is, "not a number").

In [4]:
cancer_df.isna().sum()
Out[4]:
mean_radius                0
mean_texture               0
mean_perimeter             0
mean_area                  0
mean_smoothness            0
mean_compactness           0
mean_concavity             0
mean_concave_points        0
mean_symmetry              0
mean_fractal_dimension     0
radius_error               0
texture_error              0
perimeter_error            0
area_error                 0
smoothness_error           0
compactness_error          0
concavity_error            0
concave_points_error       0
symmetry_error             0
fractal_dimension_error    0
worst_radius               0
worst_texture              0
worst_perimeter            0
worst_area                 0
worst_smoothness           0
worst_compactness          0
worst_concavity            0
worst_concave_points       0
worst_symmetry             0
worst_fractal_dimension    0
diagnosis                  0
dtype: int64

We observe that there are no columns with a missing value, which is good. In case there were any missing values, we would have to either impute them or drop the corresponding rows. Dealing with missing values can get rather complicated and is beyond our scope in this tutorial (please see Chapter 3 in the textbook for more details). For simplicity, we advocate just dropping the rows where at least one element is missing. We can accomplish this using the dropna() method in Pandas:

In [5]:
cancer_df = cancer_df.dropna()
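
Since the cancer dataset has no missing values, the effect of dropna() is easier to see on a small made-up data frame. The sketch below uses a hypothetical `toy` frame and, purely for comparison, also shows a simple median imputation with fillna(); this is only one of many possible imputation strategies and is not the approach taken in this tutorial.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with two missing values
toy = pd.DataFrame({
    "height": [170.0, np.nan, 182.0, 165.0],
    "weight": [65.0, 80.0, np.nan, 58.0],
    "label": ["a", "b", "a", "b"],
})

# isna() and isnull() are two names for the same check
print(toy.isna().sum())
print(toy.isnull().sum())

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2 (for comparison only): fill numeric gaps with the column median
imputed = toy.fillna(toy.median(numeric_only=True))

print(dropped.shape)
print(imputed)
```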

Discretizing Numeric Features

Sometimes it's a good idea to discretize a numeric feature and make it a categorical feature. For instance, instead of treating an age variable as numeric, it might be a better option to discretize it as young, middle-aged, and old.

Let's briefly digress here and see how we can discretize the mean_area numeric feature in the cancer dataset as "small", "average", and "large" using equal-frequency binning. We can use the qcut method in the Pandas module, which performs quantile-based discretization. We can also supply names for the resulting discretization.

First, we will make a copy of the original dataset and give it a different name.

In [6]:
cancer_df_cat = cancer_df.copy()

cancer_df_cat['mean_area'] = pd.qcut(cancer_df_cat['mean_area'], q=3, 
                                     labels=['small', 'average', 'large'])

Let's make sure we performed the discretization correctly using the value_counts method in Pandas.

In [7]:
cancer_df_cat['mean_area'].value_counts()
Out[7]:
large      190
small      190
average    189
Name: mean_area, dtype: int64
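
To see what "equal-frequency" means here, the sketch below contrasts qcut (equal-frequency binning) with cut (equal-width binning) on a small made-up series that contains one very large value. The series `values` is hypothetical; both functions are part of Pandas.

```python
import pandas as pd

# Hypothetical values with one extreme observation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
labels = ["small", "average", "large"]

# Equal-frequency binning: each bin receives (roughly) the same number of rows
equal_freq = pd.qcut(values, q=3, labels=labels)

# Equal-width binning: each bin covers the same numeric range
equal_width = pd.cut(values, bins=3, labels=labels)

print(equal_freq.value_counts())
print(equal_width.value_counts())
```

With equal-width binning, the extreme value stretches the range so that almost all observations fall into the first bin, which is one reason equal-frequency binning can be preferable for skewed features.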

Let's have a look at the first 5 rows in the categorized dataset. You will notice that mean area is now categorical.

In [8]:
cancer_df_cat.head(5)
Out[8]:
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_texture worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension diagnosis
0 17.99 10.38 122.80 large 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 M
1 20.57 17.77 132.90 large 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 M
2 19.69 21.25 130.00 large 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 M
3 11.42 20.38 77.58 small 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 M
4 20.29 14.34 135.10 large 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 M

5 rows × 31 columns

Making Categorical Features Numeric

Keep in mind that Scikit-Learn requires all features to be numeric, so you need to encode your categorical features as numeric.

Nominal vs. Ordinal Categorical Features

There are two types of categorical features: nominal and ordinal. Levels of a nominal (categorical) feature do not have a natural ordering. Some examples of nominal categorical features are as follows:

  • Gender
  • Seasons
  • Days of a week
  • ID features (such as country IDs, device type IDs)
  • States in a country
  • Breed of a dog
  • Colors of a car brand

On the other hand, levels of an ordinal (categorical) feature have a natural ordering. Some examples are as follows:

  • The likert scale (strongly disagree, disagree, neutral, agree, strongly agree)
  • Performance levels (poor, satisfactory, good, excellent)
  • Education levels (none, elementary, secondary, bachelors, masters, doctorate)
  • Age of a customer (young, middle-aged, old)

Encoding Nominal Descriptive Features

Nominal descriptive features must be encoded using "one-hot-encoding", which is described below. This type of encoding creates a binary variable for each unique value of the nominal feature we are encoding.

We need to be careful here: Sometimes a feature that appears to be numeric is actually nominal! For instance, you might have a feature for country IDs that is numerical. However, these numbers indicate IDs, not numerical quantities! So, we will have to treat this feature as nominal. That is, whenever we have a numerical feature that actually represents categorical quantities (such as IDs), we must consider them as nominal and we must encode them using one-hot-encoding.

Encoding Ordinal Descriptive Features

For ordinal descriptive features, we have two options:

  • One-hot-encoding: We can encode ordinal features using one-hot-encoding as well. The benefit of this encoding is that we would not be assuming any arithmetic relationship between the levels. The downside is that we would be introducing $q$ additional binary variables.
  • Integer-encoding: We can perform integer-encoding (starting at 0) where the ordering is preserved. For instance, the likert scale above can be encoded as {0, 1, 2, 3, 4} where 0 corresponds to strongly disagree and 4 corresponds to strongly agree. The benefit of integer-encoding is that we would be replacing the original ordinal feature with just one feature that is numerical. The downside is that we would be introducing an arithmetic relationship between the levels, such as neutral (2) is "twice as much" as disagree (1).

Deciding which type of encoding is more appropriate for an ordinal feature is almost always problem-specific as we would be making a trade-off with either option with respect to its benefit and downside. However, if in doubt, we recommend choosing the one-hot-encoding option.
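
The two options can be compared side by side on a small example. In the sketch below, the `responses` data frame and its `answer` column are made up for illustration; integer-encoding is done with a dictionary and the Pandas map() method, and one-hot-encoding with get_dummies().

```python
import pandas as pd

# Likert levels listed in their natural order
likert_levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses
responses = pd.DataFrame({
    "answer": ["agree", "neutral", "strongly disagree",
               "disagree", "strongly agree", "agree"]
})

# Option 1: integer-encoding that preserves the ordering (0 to 4)
level_to_int = {level: i for i, level in enumerate(likert_levels)}
responses_int = responses["answer"].map(level_to_int)

# Option 2: one-hot-encoding, one binary column per level
responses_onehot = pd.get_dummies(responses["answer"], dtype=int)

print(responses_int.tolist())
print(responses_onehot)
```

The integer-encoded version is a single numeric column, whereas the one-hot version has one column per level and each row contains exactly one 1.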

One-Hot-Encoding

In the discussion below, we make no particular distinction between the terms "dummy variable" and "binary variable" and we use them interchangeably. We also make no distinction between Python "functions" and "methods" as they both refer to some piece of code.

Nominal descriptive features must always be encoded using one-hot-encoding. We can use the get_dummies() method in the Pandas module for one-hot-encoding. This is a very simple and effective method as you can do one-hot-encoding for multiple features at once and you can even supply a custom prefix for each categorical feature. If you omit the columns parameter, the get_dummies() method will intelligently do one-hot-encoding for all features that are not numeric. Thus, get_dummies() can be used in "automatic" mode.

As a side note, for the get_dummies() function to work correctly in the automatic mode, we would have to make sure that categorical features that look like numerical (such as country IDs) are set to the "string" type so that get_dummies() knows that it needs to one-hot-encode them too. You can achieve this as follows:

dataframe['cat_feature'] = dataframe['cat_feature'].astype(str)

Once you adapt the line above for your problem and then run it, the feature "cat_feature" becomes a string and therefore get_dummies() will one-hot-encode it correctly. While performing one-hot-encoding on a nominal descriptive feature with $q$ unique levels, we can actually define $q-1$ binary variables, but we would lose the base level as a feature and this would be problematic when performing feature selection. For instance, suppose we encode days of the week using just 6 binary variables by dropping Monday. Then, while doing feature selection for the most important days of the week, Monday will not even be one of the options. So, we will be using exactly $q$ binary variables in our tutorials.
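
A small sketch of this string-conversion step is shown below, using a made-up data frame with a numeric-looking `country_id` column (both the frame and its column names are hypothetical). It runs get_dummies() in automatic mode before and after the conversion.

```python
import pandas as pd

# Hypothetical dataset: country_id looks numeric but is actually nominal
df = pd.DataFrame({
    "country_id": [10, 20, 10, 30],
    "income": [50.0, 62.5, 48.0, 71.0],
})

# In automatic mode, numeric columns are left untouched,
# so country_id is NOT one-hot-encoded here
untouched = pd.get_dummies(df, dtype=int)

# Convert the ID column to strings so that it is treated as categorical
df["country_id"] = df["country_id"].astype(str)
encoded = pd.get_dummies(df, dtype=int)

print(untouched.columns.tolist())
print(encoded.columns.tolist())
```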

An exception here is the case of a binary nominal feature where $q=2$. In this case, we define only one binary variable by setting the drop_first option in get_dummies() to "True". As an example, we would encode the gender feature using only one binary variable. During feature selection, we would be deciding on whether the gender feature is important at all.

It should also be noted that if you are not planning to do any feature selection, then you should define $q-1$ dummy variables for a nominal feature, not the full set of $q$ dummy variables. The reason is that defining $q$ dummy variables results in multicollinearity and this becomes problematic while fitting models such as generalized linear models (this phenomenon is referred to as the "dummy variable trap").
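
The multicollinearity issue can be checked numerically. The sketch below is an illustration under simple assumptions: a made-up `days` series with three levels and a design matrix that includes an intercept column of ones, as a linear model typically would. It compares the matrix rank with the number of columns for the full and the reduced sets of dummy variables.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal feature with q = 3 levels
days = pd.Series(["Mon", "Tue", "Wed", "Mon", "Wed", "Tue"])

full = pd.get_dummies(days, dtype=int)                      # q columns
reduced = pd.get_dummies(days, drop_first=True, dtype=int)  # q - 1 columns

# Design matrices with an intercept column of ones
intercept = np.ones((len(days), 1))
X_full = np.hstack([intercept, full.to_numpy()])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# With q dummies, the dummy columns add up to the intercept column,
# so the rank is lower than the number of columns
print(np.linalg.matrix_rank(X_full), "of", X_full.shape[1], "columns")
print(np.linalg.matrix_rank(X_reduced), "of", X_reduced.shape[1], "columns")
```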

Sometimes a nominal descriptive feature will have dozens of levels (such as country IDs) and using one-hot-encoding will therefore result in dozens of new (binary) variables. Well, welcome to the curse of dimensionality! To mitigate this issue, you might want to try feature selection. A more elegant solution would be to cluster countries into similar-looking groups, say into 10 groups, and use this new group feature (which will be nominal) instead of the country ID feature, but we do not cover clustering here.

As a good practice, we should first take care of any integer-encoding procedures before any one-hot-encoding so that get_dummies() works correctly in the automatic mode.

Suppose we first performed any integer-encoding procedures (if required) and our dataset now has a mix of numerical and categorical descriptive features (represented by Data) where all categorical descriptive features are nominal. We can implement a proper one-hot-encoding logic as below in Python.

# get the list of categorical descriptive features
categorical_cols = Data.columns[Data.dtypes==object].tolist()

# if a categorical descriptive feature has only 2 levels,
# define only one binary variable
for col in categorical_cols:
    n = len(Data[col].unique())
    if (n == 2):
        Data[col] = pd.get_dummies(Data[col], drop_first=True)

# for other categorical features (with > 2 levels), 
# use regular one-hot-encoding 
# if a feature is numeric, it will be untouched
Data = pd.get_dummies(Data)
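
For a self-contained illustration, the same logic is sketched below on a made-up data frame `toy_data` (its columns are hypothetical). This is a variant of the snippet above rather than a copy: it selects non-numeric columns with select_dtypes(), takes the single dummy column explicitly with iloc, and passes dtype=int so that the dummy variables are 0/1 integers regardless of the Pandas version.

```python
import pandas as pd

# Hypothetical dataset with a binary and a three-level nominal feature
toy_data = pd.DataFrame({
    "gender": ["F", "M", "M", "F"],             # 2 levels
    "color": ["red", "blue", "green", "red"],   # 3 levels
    "income": [50.0, 62.5, 48.0, 71.0],         # numeric, left untouched
})

# Non-numeric columns are treated as categorical here
categorical_cols = toy_data.select_dtypes(exclude="number").columns.tolist()

# Binary categorical features: keep a single 0/1 column
for col in categorical_cols:
    if toy_data[col].nunique() == 2:
        dummies = pd.get_dummies(toy_data[col], drop_first=True, dtype=int)
        toy_data[col] = dummies.iloc[:, 0]

# Remaining categorical features: regular one-hot-encoding
toy_data = pd.get_dummies(toy_data, dtype=int)

print(toy_data)
```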

As a simple example, let's see how we can use one-hot-encoding for encoding the "mean_area" categorical descriptive feature.

In [9]:
cancer_df_cat_onehot = pd.get_dummies(cancer_df_cat, columns=['mean_area'])

cancer_df_cat_onehot.head(5)
Out[9]:
mean_radius mean_texture mean_perimeter mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension radius_error ... worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension diagnosis mean_area_small mean_area_average mean_area_large
0 17.99 10.38 122.80 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 ... 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 M 0 0 1
1 20.57 17.77 132.90 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 ... 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 M 0 0 1
2 19.69 21.25 130.00 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 ... 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 M 0 0 1
3 11.42 20.38 77.58 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 ... 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 M 1 0 0
4 20.29 14.34 135.10 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 ... 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 M 0 0 1

5 rows × 33 columns

We notice that the "mean_area" feature is now replaced with three binary features at the end of the dataset.

Integer-Encoding

A nominal descriptive feature always needs to be encoded using one-hot-encoding. However, an ordinal descriptive feature can be encoded via either one-hot-encoding or integer-encoding. For the latter, we can use the replace() function in Pandas. Let's do a simple example. The "mean_area" feature we defined above can in fact be considered to be "ordinal" as there is a natural ordering between the levels of small, average, and large. Let's encode this feature using integer-encoding where the levels correspond to the integers 0, 1, and 2 respectively. Before using the replace() function, we need to define a mapping between the levels and the integers using a dictionary as below.

In [10]:
level_mapping = {'small': 0, 'average': 1, 'large': 2}

Once we define the mapping, we can go ahead and perform the integer-encoding using the replace() function. After the encoding, we notice that the "mean_area" feature is now of integer data type.

In [11]:
cancer_df_cat_integer = cancer_df_cat.copy()

cancer_df_cat_integer['mean_area'] = cancer_df_cat_integer['mean_area'].replace(level_mapping)

cancer_df_cat_integer.head(5)
Out[11]:
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_texture worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension diagnosis
0 17.99 10.38 122.80 2 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 M
1 20.57 17.77 132.90 2 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 M
2 19.69 21.25 130.00 2 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 M
3 11.42 20.38 77.58 0 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 M
4 20.29 14.34 135.10 2 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 M

5 rows × 31 columns

Let's check to make sure the type of the integer-encoded "mean_area" feature is integer.

In [12]:
cancer_df_cat_integer['mean_area'].dtype
Out[12]:
dtype('int64')

Once we encode an ordinal descriptive feature as an integer, we will have to treat it just like another numerical feature. For instance, it's important to keep in mind that integer-encoded ordinal features need to be scaled like other numerical features.

Why Integer-Encoding a Nominal Descriptive Feature is a BAD Idea

Integer encoding inherently assumes an ordering. Suppose you encode "Monday" as 1 and "Tuesday" as 2. If you feed this into a linear regression, for instance, the model will assume that Tuesday is "twice as much" as Monday. Is that really the case? Absolutely not, of course. Monday and Tuesday are just two different days of the week and clearly they are not comparable. That is, there is no natural ordering between the days of the week. However, using integer-encoding on weekdays will introduce an ordering that is not there! Thus, any subsequent modeling you perform will be incorrect and your results will be fundamentally invalid.

Encoding The Target Feature

Interestingly, while we never use integer-encoding for nominal descriptive features, we need to do just that for a nominal target feature! That is, we encode a nominal target feature using integers starting with 0. When a nominal target feature has more than 2 levels, i.e., the multinomial (a.k.a. the multiclass) case, the ordering is unimportant. In fact, we use sklearn module's LabelEncoder() function for this type of integer-encoding (which is simply called "label-encoding") and this function performs the encoding based on the alphabetical order of the feature levels. However, when a nominal target feature has exactly 2 levels, i.e., the binary classification case, we need to make sure that the "positive" class is encoded as "1" regardless of the alphabetical order of the target feature levels, as described below.

To be clear, "label-encoding" is a Scikit-Learn terminology that refers to a particular form of integer-encoding where (1) encoding starts with 0, (2) numbering is done sequentially, and (3) ordering is done alphabetically.
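
As a brief sketch of these three properties in the multiclass case, the snippet below label-encodes a small made-up array of class names (the `labels` array is hypothetical). The fitted encoder's classes_ attribute lists the levels in the order in which they were numbered.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical multiclass target with three levels
labels = np.array(["setosa", "virginica", "versicolor", "setosa"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted levels; the position of a level is its code
print(encoder.classes_)
print(codes)
```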

Now, we split cancer_df into the set of descriptive features and the target respectively. Note that we are using the original dataset here and not the categorized one, which was only for demonstration purposes.

WARNING: Below, by using the values method of Pandas for "Data" and "target", we are converting them to NumPy arrays. Here, we make no distinction between a "vector" or a "matrix" as both are simply referred to as NumPy arrays with the latter being a two-dimensional array. NumPy arrays are usually harder to work with. For instance, when we convert "Data" from a data frame to a NumPy array, we lose column names! The reason we do this transformation is that Scikit-Learn only works with NumPy arrays and not Pandas variables (series or data frames). If you pass in a Pandas variable to a Scikit-Learn method, sometimes it will work! But sometimes it won't, yet the error message you get will confuse you even further! For this reason, in order to avoid unnecessary dramas, you should never pass in any Pandas variables into Scikit-Learn methods. In case you need to, you will first have to make sure that you use the values method of the Pandas variable in order to convert it to a NumPy array beforehand.

In [13]:
Data = cancer_df.drop(columns = 'diagnosis').values

target = cancer_df['diagnosis'].values

Remember, Scikit-Learn requires all data to be numeric, so the target feature in our example needs to be encoded as 0 and 1. Note that if we had more than two levels, their label encoding would be 0,1,2,3, etc.

So, how does Scikit-Learn know if 0, 1, 2, 3 etc. is actually a numeric target feature, or it's label-encoding of a categorical target feature? We tell this to Scikit-Learn with the type of machine learning algorithm we use. For predicting numerical target features, we fit a "regressor" whereas for predicting categorical target features, we fit a "classifier".

First, let's count how many instances each label has in the target feature in the cancer dataset.

In [14]:
np.unique(target, return_counts=True)
Out[14]:
(array(['B', 'M'], dtype=object), array([357, 212]))

As expected, "B" (Benign) and "M" (Malignant) have 357 and 212 observations respectively. Next, let's encode these as 0 and 1 using LabelEncoder from the sklearn preprocessing module.

In [15]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le_fit = le.fit(target)
target_encoded_le = le_fit.transform(target)

Note that the LabelEncoder assigns labels in alphabetical order. That is, "B" is labeled as 0 whereas "M" is labeled as 1.

Let's check how the encoding was done. Here, you need to be careful with respect to NumPy vs. Pandas variables. The return value of label encoding is a NumPy array, so you now need to use the np.unique method. The value_counts method belongs to the Pandas module and it will not work with a NumPy object.

In [16]:
import numpy as np

print("Target Type:", type(target))

print("Counts Using NumPy:")
print(np.unique(target_encoded_le, return_counts = True))

# this will not work:
# ### target_encoded_le.value_counts()

# but this works:
print("Counts Using Pandas:")
print(pd.Series(target_encoded_le).value_counts())
Target Type: <class 'numpy.ndarray'>
Counts Using NumPy:
(array([0, 1]), array([357, 212]))
Counts Using Pandas:
0    357
1    212
dtype: int64

NOTE: Keep in mind that you can always go back and forth between a NumPy array and a Pandas series or data frame as below where s denotes a series, df denotes a data frame, and a denotes an array:

  • NumPy to Pandas: pd.Series(a) (or pd.DataFrame(a))
  • Pandas to Numpy: s.values (or df.values)
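
A short round trip between the two types is sketched below with a made-up array `a`; the column names "x1" and "x2" are hypothetical.

```python
import numpy as np
import pandas as pd

a = np.array([[1.0, 2.0], [3.0, 4.0]])

# NumPy to Pandas: column names have to be supplied (or default to 0, 1, ...)
df = pd.DataFrame(a, columns=["x1", "x2"])
s = pd.Series(a[:, 0])

# Pandas to NumPy: the column names are not carried over
back_from_df = df.values
back_from_s = s.values

print(type(back_from_df), back_from_df.shape)
print(type(back_from_s), back_from_s.shape)
```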

So, once encoded, how do we recover the original labels? We can do that using the inverse_transform method of the LabelEncoder.

In [17]:
target_original_values = le_fit.inverse_transform(target_encoded_le)

# here is an example of NumPy array to pandas Series conversion:
pd.Series(target_original_values).value_counts()
Out[17]:
B    357
M    212
dtype: int64

In general, the positive class (the class that we are interested in) needs to be encoded as "1" and the negative class needs to be encoded as "0" so that the performance metrics we define work correctly (we will discuss these in more detail in SK 4: Evaluation). In this case, we got lucky because "M" comes after "B" in the alphabet! So, the label encoder correctly encoded the malignant class as "1". But we cannot rely on luck all the time, so in case the label encoding does not give us the result we want, we will have to manually define the labels ourselves. For this purpose, we can use the where() function in NumPy to perform a vectorized "if-else" operation as below. The syntax for this function is as follows:

np.where(condition, value when condition is true, value when condition is false)

In [18]:
target_encoded_where = np.where(target=='M', 1, 0)

np.unique(target_encoded_where, return_counts = True)
Out[18]:
(array([0, 1]), array([357, 212]))

We observe that manually encoding the target feature using the where() function gives us the encoding we want. As an alternative to the where() function, we can actually use the replace() function in Pandas for the same encoding. Keep in mind that the replace() function is quite handy for replacing or mapping values in a series.

In [19]:
# first convert "target" to a Series so that we can use the replace function
target_encoded_replace = pd.Series(target).replace({'B': 0, 'M': 1}).values

np.unique(target_encoded_replace, return_counts = True)
Out[19]:
(array([0, 1]), array([357, 212]))

In case you are wondering, there is really nothing special about "label-encoding" in Scikit-Learn: it simply applies the same "find-replace" logic that we used with the where() and replace() functions. To verify that both label-encoding and our manual "find-replace" logic using the where() function result in the same output, let's confirm that the arrays target_encoded_where and target_encoded_le are exactly the same!

In [20]:
np.array_equal(target_encoded_where, target_encoded_le)
Out[20]:
True
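
To see how relying on alphabetical order can go wrong, here is a small sketch with a made-up target. The labels 'disease' and 'healthy' are hypothetical and are not part of the breast cancer dataset. Since 'disease' sorts before 'healthy', the label encoder gives the positive class a 0, and we fix this with the where() function.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# hypothetical target where 'disease' is the positive class
y = np.array(['healthy', 'disease', 'healthy', 'disease', 'healthy'])

le = LabelEncoder().fit(y)
print(le.classes_)        # classes are stored in sorted order

# 'disease' is encoded as 0 and 'healthy' as 1, which is not what we want
y_le = le.transform(y)
print(y_le)

# manual encoding: the positive class becomes 1, everything else 0
y_where = np.where(y == 'disease', 1, 0)
print(y_where)
```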

Scaling Descriptive Features

Once all categorical descriptive features are encoded, all features in this transformed dataset will be numerical. It is always a good idea to scale these numerical descriptive features before fitting any models, as scaling is mandatory for some important classes of models such as nearest neighbors, SVMs, and deep learning.

Three popular types of scaling are as follows:

  1. Min-Max Scaling: Each descriptive feature is scaled to be between 0 and 1. Min-max scaling for a numerical feature is done as follows:

    $\mbox{scaled_value} = \frac{\mbox{value - min_value}}{\mbox{max_value - min_value}}$

  2. Standard Scaling: Scaling of each descriptive feature is done via standardization. That is, each value of the descriptive feature is scaled by subtracting the mean and dividing by the standard deviation of that feature. This ensures that, after scaling, each descriptive feature has a mean of 0 and a standard deviation of 1. Standard scaling is done as follows:

    $\mbox{scaled_value} = \frac{\mbox{value - mean}}{\mbox{std. dev.}}$

  3. Robust Scaling: Robust scaling is similar to standard scaling. However, this scaling type is robust to potential outliers in the feature as the median is used instead of the mean, and the IQR (interquartile range, that is, the difference between the 75th and 25th percentiles) is used instead of the standard deviation. This is the default behavior of RobustScaler in Scikit-Learn. If you suspect that there are outliers in your dataset, you may prefer to use robust scaling, which is done as follows:

    $\mbox{scaled_value} = \frac{\mbox{value - median}}{\mbox{IQR}}$

All of the above scaling types can be easily performed using the preprocessing module in sklearn. Let's have a look at a simple example.

In [21]:
from sklearn import preprocessing

x = np.arange(10).reshape(-1, 1)
x = np.vstack((x, 100.0))

min_max = preprocessing.MinMaxScaler().fit_transform(x).ravel()
standard = preprocessing.StandardScaler().fit_transform(x).ravel()
robust = preprocessing.RobustScaler().fit_transform(x).ravel()

x_scaled_df = pd.DataFrame({'x': x.ravel(), 'min_max': min_max, 'standard': standard, 'robust': robust})
x_scaled_df.round(2)
Out[21]:
        x  min_max  standard  robust
0     0.0     0.00     -0.48    -1.0
1     1.0     0.01     -0.44    -0.8
2     2.0     0.02     -0.41    -0.6
3     3.0     0.03     -0.37    -0.4
4     4.0     0.04     -0.33    -0.2
5     5.0     0.05     -0.30     0.0
6     6.0     0.06     -0.26     0.2
7     7.0     0.07     -0.22     0.4
8     8.0     0.08     -0.19     0.6
9     9.0     0.09     -0.15     0.8
10  100.0     1.00      3.15    19.0
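
The scaled values above can be reproduced by hand with plain NumPy, which is a useful sanity check on what each scaler computes. The sketch below assumes the default settings of the three scalers: StandardScaler divides by the population standard deviation (ddof=0, the NumPy default), and RobustScaler centers on the median and divides by the interquartile range. The variable names ending in _manual are ours.

```python
import numpy as np
from sklearn import preprocessing

# same toy data as above: 0, 1, ..., 9 plus one outlier at 100
x = np.append(np.arange(10), 100.0).reshape(-1, 1)

# min-max scaling: (value - min) / (max - min)
min_max_manual = (x - x.min()) / (x.max() - x.min())

# standard scaling: (value - mean) / std, with the population std (ddof=0)
standard_manual = (x - x.mean()) / x.std()

# robust scaling: (value - median) / IQR
q1, q3 = np.percentile(x, [25, 75])
robust_manual = (x - np.median(x)) / (q3 - q1)

# compare each manual result against the corresponding sklearn scaler
print(np.allclose(min_max_manual, preprocessing.MinMaxScaler().fit_transform(x)))
print(np.allclose(standard_manual, preprocessing.StandardScaler().fit_transform(x)))
print(np.allclose(robust_manual, preprocessing.RobustScaler().fit_transform(x)))
```

Notice how the single outlier squeezes the first ten min-max values into the range 0 to 0.09, whereas robust scaling keeps them spread between -1.0 and 0.8.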

Let's scale the descriptive features in our breast cancer dataset before fitting any classifiers. Here, we use the MinMaxScaler scaler as an illustration.

In [22]:
Data = preprocessing.MinMaxScaler().fit_transform(Data)
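
As a quick check, the sketch below confirms that after min-max scaling every column lies between 0 and 1. It is self-contained, so it assumes the copy of the breast cancer data that ships with sklearn rather than the Data array used above, and the names Data_raw, Data_scaled, and Data_back are ours. Keeping the fitted scaler object around, instead of calling fit_transform in one go, also lets us map scaled values back to the original units with inverse_transform.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import load_breast_cancer

# assumption: use the sklearn copy of the dataset so the example runs on its own
Data_raw = load_breast_cancer().data

# fit the scaler first so that we can reuse it later
scaler = preprocessing.MinMaxScaler().fit(Data_raw)
Data_scaled = scaler.transform(Data_raw)

# every column should now have a minimum of 0 and a maximum of 1
print(Data_scaled.min(axis=0).round(2))
print(Data_scaled.max(axis=0).round(2))

# the fitted scaler can undo the transformation
Data_back = scaler.inverse_transform(Data_scaled)
print(np.allclose(Data_back, Data_raw))
```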

Sampling Observations

Sometimes your dataset will just have too many rows for your computer to handle. In this case, the smart thing to do will be to select only a small random subset of the entire dataset during modeling.

We can use the sample function in Pandas for selecting a small random subset of the entire data. We can use this function with either one of these two options: "frac" for selecting a certain fraction of the entire data, say 0.2, or "n" for selecting a certain number of rows, say 100 rows.
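
The two options can be sketched on a made-up data frame with 1000 rows (the data frame and its column name are hypothetical). Only one of "frac" and "n" can be given in a single call.

```python
import pandas as pd

# hypothetical data frame with 1000 rows
df = pd.DataFrame({'x': range(1000)})

# select 20% of the rows
sample_frac = df.sample(frac=0.2, random_state=8)

# select exactly 100 rows
sample_n = df.sample(n=100, random_state=8)

print(sample_frac.shape)
print(sample_n.shape)
```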

In the example below, we load the same dataset (though with no column names) directly from sklearn to illustrate how we can still select a random subset even if the descriptive features and the target feature are in different arrays. In particular, since in this case data and target are in different NumPy arrays, we need to set a common random state so that we select exactly the same rows in both the data and the target. Of course, the Breast Cancer dataset is already quite small with only 569 observations, so we do this sampling only for illustration purposes.

In the code below, we use a simple trick to use the sample function: since this is a Pandas function, we first convert the data and the target to a data frame, use the sample function, and then convert them back to NumPy arrays using the values attribute.

In [23]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

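# despite its name, cancer_df is not a data frame: load_breast_cancer()
# returns a Bunch object that holds the data and the target as NumPy arrays
cancer_df = load_breast_cancer()

Data, target = cancer_df.data, cancer_df.target

# use the same random_state so that the same rows are selected in both
Data_sample = pd.DataFrame(Data).sample(n=100, random_state=8).values
target_sample = pd.DataFrame(target).sample(n=100, random_state=8).values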

Let's check that both the data and the target are still NumPy arrays and that they now have exactly 100 rows.

In [24]:
print(type(Data_sample))
print(type(target_sample))

print(Data_sample.shape)
print(target_sample.shape)
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
(100, 30)
(100, 1)
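
An alternative that avoids the round trip through Pandas is to draw the row positions once with NumPy and use them to index both arrays. This is only a sketch: the seed and the names rng and idx are ours, and the rows selected this way will differ from the ones selected by the sample function above. A side benefit is that the target stays one-dimensional, whereas target_sample above has shape (100, 1) because it came from a data frame.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

Data, target = load_breast_cancer(return_X_y=True)

# draw 100 distinct row positions (seed chosen arbitrarily)
rng = np.random.default_rng(8)
idx = rng.choice(Data.shape[0], size=100, replace=False)

# index both arrays with the same positions so that the rows stay aligned
Data_sample = Data[idx]
target_sample = target[idx]

print(Data_sample.shape)
print(target_sample.shape)
```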

www.featureranking.com