SK Part 4: Cross-Validation and Hyperparameter Tuning

In SK Part 1, we learned how to evaluate a machine learning model by using the train_test_split function to split the full dataset into disjoint training and test sets based on a specified test-size ratio. We then train the model (that is, "fit" it) on the training set and evaluate it on the test set. This approach is called "hold-out sampling". A more robust and methodical alternative to hold-out sampling is "cross-validation", which is the subject of this tutorial.

A critical step in machine learning is to identify the "optimal" hyperparameters of a learner (such as the number of neighbors in a KNN model). In this tutorial, we illustrate how optimal hyperparameter values can be identified using repeated cross-validation within a grid search framework.

Learning Objectives

  • Implement various cross-validation strategies.

  • Perform grid search to identify optimal hyperparameter values.

As in Part 1, we shall use the following datasets for binary classification, regression, and multiclass classification problems.

  1. Breast Cancer Wisconsin Data. The target feature is binary, i.e., whether a cancer diagnosis is "malignant" or "benign".
  2. Boston Housing Data. The target feature is continuous: house prices in Boston in the 1970s.
  3. Wine Data. The target feature is multiclass: three types of wine grown in Italy.

We use KNN, DT (decision tree), and NB (naive Bayes) models to illustrate how the hyperparameters of a machine learning algorithm are tuned via cross-validated grid search, working through the Breast Cancer Data and the Boston Housing Data. We leave the Wine Data and other machine learning models as exercises.

Binary Classification: Breast Cancer Wisconsin Data

Data Preparation

Let's prepare the dataset for modeling by performing the following:

  • load the dataset from sklearn (unlike the Cloud version, this version does not have column names),
  • scale the descriptive features to the [0, 1] range using min-max scaling (as done by MinMaxScaler in the code below), and
  • split the dataset into training and test sets.
In [1]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn import preprocessing

df = load_breast_cancer()
Data, target = df.data, df.target

# scale each descriptive feature to the [0, 1] range
Data = preprocessing.MinMaxScaler().fit_transform(Data)

# target is already encoded, but we need to reverse the labels
# so that malignant is the positive class
target = np.where(target==0, 1, 0)

D_train, D_test, t_train, t_test = train_test_split(Data, target, test_size = 0.3, random_state=999)

Nearest Neighbor Models

Let's fit a 1-nearest neighbor (1-NN) classifier (n_neighbors=1) using the Euclidean distance (p=2).

In [2]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=1, p=2)

knn_classifier.fit(D_train, t_train)
knn_classifier.score(D_test, t_test)
Out[2]:
0.9473684210526315

The 1-NN classifier yields an accuracy score of about 94.7%. So, how can we improve this score? One way is to search for the set of "hyperparameters" that produces the highest accuracy score. For a nearest neighbor model, the hyperparameters are as follows (a small example grid of candidate values is sketched right after the list):

  • Number of neighbors (K).
  • Distance metric: Manhattan (p=1), Euclidean (p=2), or Minkowski (any p larger than 2). Technically, p=1 and p=2 are also Minkowski metrics, but in this notebook we shall adopt the convention that the Minkowski metric corresponds to $p \geq 3$.
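
For instance, these two hyperparameters can be collected into a grid of candidate values, written in the dictionary format that scikit-learn's grid search tools expect. The particular values below are only an example search space, not a recommendation:

# an example hyperparameter grid for a KNN classifier
params = {
    'n_neighbors': [1, 2, 3, 4, 5, 10],  # candidate values for the number of neighbors (K)
    'p': [1, 2]                          # 1 = Manhattan, 2 = Euclidean
}

Each combination in this grid (12 in total here) corresponds to one candidate model.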

To search for the "best" set of hyperparameters, popular approaches are as follows:

  • Random search: As its name suggests, it randomly samples hyperparameter combinations to train models.
  • Bayesian search: This is beyond the scope of this course, so we shall not cover it here.
  • Grid search.

Grid search is the most common approach. It exhaustively searches through all possible combinations of hyperparameters during the training phase. For example, consider a KNN model. We can specify a grid with three values for the number of neighbors (K = 1, 2, 3) and two metrics (p = 1, 2). The grid search starts by training a model with K = 1 and p = 1 and calculating its accuracy score. It then trains models with (K = 2, p = 1), (K = 3, p = 1), (K = 1, p = 2), ..., and (K = 3, p = 2) and obtains their accuracy scores. Based on these scores, the grid search ranks the models and determines the set of hyperparameter values that gives the highest accuracy score.
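
To make the idea concrete, here is a minimal sketch of such an exhaustive search over the toy grid above (K = 1, 2, 3 and p = 1, 2), written as a plain double loop and scored against the hold-out test set created earlier. This is for illustration only; as discussed next, in practice the search is combined with cross-validation (for example via scikit-learn's GridSearchCV) rather than a single hold-out split.

# exhaustive search over a small grid of (K, p) combinations:
# each candidate model is trained on the training set and scored on the test set
best_score, best_params = 0.0, None

for K in [1, 2, 3]:
    for p in [1, 2]:
        model = KNeighborsClassifier(n_neighbors=K, p=p)
        model.fit(D_train, t_train)
        score = model.score(D_test, t_test)
        if score > best_score:
            best_score, best_params = score, {'n_neighbors': K, 'p': p}

print(best_params, best_score)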

Before we proceed further, we shall cover cross-validation (CV) methods, since tuning hyperparameters via grid search is usually cross-validated to avoid overfitting.

Cross-Validation

Two popular options for cross-validation are 5-fold and 10-fold. In 5-fold cross-validation, for instance, the entire dataset is partitioned into 5 equal-sized chunks (folds). In the first iteration, the first four chunks are used for training and the 5th chunk is used for testing. Next, all the chunks other than the 4th chunk are used for training and the 4th chunk is used for testing, and so on. In the last iteration, all the chunks other than the 1st chunk are used for training and the 1st chunk is used for testing. The final step is to take the average of these 5 test accuracies and report it as the overall cross-validation accuracy. Please see the figure below for an illustration of 10-fold cross-validation (source: karlrosaen.com). Please refer to Chapter 8 in the textbook for more information.
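
As a minimal sketch of how 5-fold cross-validation looks in scikit-learn, the snippet below scores the 1-NN classifier from above with cross_val_score on the training data. The use of StratifiedKFold (to preserve the class proportions in each fold) and the random_state value are our own illustrative choices, not something prescribed here.

from sklearn.model_selection import cross_val_score, StratifiedKFold

# 5-fold stratified cross-validation of the 1-NN classifier on the training data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)
scores = cross_val_score(knn_classifier, D_train, t_train, cv=cv, scoring='accuracy')

print(scores)         # accuracy on each of the 5 folds
print(scores.mean())  # overall cross-validation accuracy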