SK Part 5: Advanced Topics: Pipelines, Statistical Model Comparison, and Model Deployment


SK Part 5: Pipelines, Statistical Model Comparison, and Model Deployment

In this tutorial, we discuss several advanced topics as outlined in the learning objectives below. Here, we exclusively work with the Breast Cancer Wisconsin dataset.

Learning Objectives

  • How to work with large datasets
  • Utilize the machine learning pipeline
  • Use parallel processing for model evaluation
  • Save and load trained models
  • Conduct statistical tests for model performance comparison
  • Deploy a model in the real world for predicting new observations

Getting Started

Let's load the Breast Cancer Wisconsin dataset from sklearn. For scaling, we use the MinMaxScaler scaler. However, rather than fitting and transforming in one attempt, we shall perform the scaling in two steps. That is, we first fit the data to create a scaler that "knows" how to transform the input data. Next, we perform the actual transformation. The reason we specifically want a fitted scaler is that, once we figure out which model to use and we would like to make predictions for new observations, we will need to transform the new observations exactly as we transformed the descriptive features while training the model. This shall be demonstrated in the model deployment section below.

As for the target feature, as mentioned in the SK Part 3 tutorial, even though the target feature is already encoded as binary, we need to switch the target classes so that 0 is the negative class (benign) and 1 is the positive class (malignant). The reason for this switching is that we would like the malignant class to be designated as the "positive" class so that binary performance metrics such as recall and AUC work as intended within Scikit-Learn.

In [1]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn import preprocessing

df = load_breast_cancer()

# pay attention that we name the original data as "Data_unscaled"
Data_unscaled, target =,

# here we define a scaler that "knows" how to scale input data
# we will use this scaler when deploying a model later in this tutorial
data_scaler = preprocessing.MinMaxScaler().fit(Data_unscaled)
Data = data_scaler.transform(Data_unscaled)

# target is already encoded, but we need to reverse the labels
# so that malignant is the positive class
target = np.where(target==0, 1, 0)

Working with Large Datasets

Large datasets (i.e., datasets with lots of features and/ or observations) can pose as rather significant computational bottlenecks in terms of training, testing, and hyperparameter tuning. In this section, we provide some guidelines for working with large datasets in an efficient manner.

In regards to a large number of features, the best way to tackle this problem is feature selection, which is discussed in detail in SK Part 2. Specifically, you might want to try selecting only a handful of features, such as 10, 20, or 30 features only.

In regards to a large number of observations, we have a couple of options:

  • We can select a small random subset of all observations and work with this subset throughout the hyperparameter tuning. Once the best model is identified, we can go ahead and use the entire dataset for training and subsequent model deployment.
  • We can use less CPU-demanding evaluation methods, such as 3-fold cross-validation with no repetitions.
  • Instead of a full grid search for hyperparameter tuning, we can perform a limited random search over the hyperparameter space.

We now discuss these options in detail.

Performing a grid search to identify optimal algorithm hyperparameters can be quite taxing on your computer's hardware, especially if your dataset has lots of features and lots of observations. For this reason, before performing a grid search or utilizing a pipeline, you might want to select only a small subset of observations in your original data. This topic is explained in the tutorial Data Preparation. Of course, once you decide on which model to use (by training on this small subset), you can go ahead and use the entire data to train your final best model before you deploy your model in the real world.

In the pipelines discussion below, we use 2-repeated 5-fold cross-validation as our evaluation method. If you are working with a very large dataset and run time is a concern, in addition to selecting a small subset of observations, you might also want to try less CPU-demanding performance evaluation methods, such as 3-fold cross-validation with no repetitions (by using StratifiedKFold(n_splits=3)) in order to reduce the computational burden.

Furthermore, instead of performing a full grid search over the hyperparameter space, you might consider performing a small random search over this space using RandomizedSearchCV as explained below.

Pipelines for Stacking Processes

We can stack feature selection and grid search for hyperparameter tuning (via cross-validation) in a "pipeline".

In the following chunks of code, we show how to perform the following using the Pipeline function:

  1. Perform feature selection using SelectKBest. We shall use the f_classif and mutual_info_classif score functions with 10, 20, and full set of features.
  2. Train a KNN model with different k and p values using AUC as our performance metric.

The general format for specifying parameters of the pipe processes is <model_label_for_pipe>__<model_parameter>.

As an example, we label our learning model as knn. Thus, in params, we specify the parameter names starting with knn__, i.e., knn__n_neighbors and knn__p.

In [2]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

cv_method = RepeatedStratifiedKFold(n_splits=5, 

# define a pipeline with two processes
# if you like, you can put MinMaxScaler() in the pipeline as well
pipe_KNN = Pipeline([('fselector', SelectKBest()), 
                     ('knn', KNeighborsClassifier())])

params_pipe_KNN = {'fselector__score_func': [f_classif, mutual_info_classif],
                   'fselector__k': [10, 20, Data.shape[1]],
                   'knn__n_neighbors': [1, 2, 3, 4, 5],
                   'knn__p': [1, 2]}
gs_pipe_KNN = GridSearchCV(estimator=pipe_KNN, 

After defining the pipeline, we can go ahead and fit our data. Observe that our grid search involves 2 x 3 x 5 x 2 = 60 different parameter combinations.

In [3]:, target);
Fitting 10 folds for each of 60 candidates, totalling 600 fits
In [4]:
{'fselector__k': 20,
 'fselector__score_func': <function sklearn.feature_selection._mutual_info.mutual_info_classif(X, y, *, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)>,
 'knn__n_neighbors': 5,
 'knn__p': 2}
In [5]:

We observe that we get the best performance with the 20 features selected by the F-score with 5 nearest neighbors and with the Manhattan distance metric for an AUC score of 98.7%. Below is the best estimator as a Pipeline object.

In [6]:
                             score_func=<function mutual_info_classif at 0x00000151B510C940>)),
                ('knn', KNeighborsClassifier())])

Randomized Search: Searching across the entire set of hyperparameter combinations in a complete grid search can be rather time consuming. For this reason, if you define a very large search space, you might want to use RandomizedSearchCV rather than GridSearchCV which will only search a randomly selected small subset of the hyperparameter space. For this function, you specify an n_iter option which controls the total number of random hyperparameter combinations that you want to try all together. Per official documentation, "In contrast to GridSearchCV, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by n_iter". RandomizedSearchCV will certainly speed up the search, but the downside is that you might miss the truly optimal hyperparameter combination because this particular combination might not be among the ones that are selected by RandomizedSearchCV. Here is an example of randomized search with only 10 random combinations.

In [7]:
from sklearn.model_selection import RandomizedSearchCV

n_iter_search = 10
random_pipe_KNN = RandomizedSearchCV(estimator=pipe_KNN, 
                           verbose=1), target);
Fitting 10 folds for each of 10 candidates, totalling 100 fits
In [8]:
{'knn__p': 2,
 'knn__n_neighbors': 4,
 'fselector__score_func': <function sklearn.feature_selection._mutual_info.mutual_info_classif(X, y, *, discrete_features='auto', n_neighbors=3, copy=True, random_state=None)>,
 'fselector__k': 20}
In [9]:

Parallel Processing

Let's discuss how we can speed up execution of the pipeline. Some Scikit-Learn functions (such as GridSearchCV, RandomizedSearchCV, and cross_val_score) have an argument named n_jobs that specifies the number of cores that needs to be used in the process. By default, sklearn functions use n_jobs=1, which results in no parallel processing at all and therefore a slower traing process. In order to achieve parallel processing, we change this parameter's value. For instance, we can set n_jobs=2 to use 2 cores, or we can set n_jobs=-1 to fire up all cores in our computer. This parallel approach is known as "multiprocessing". However, using all the cores is not a good idea as this will probably make your computer almost unusable in the meantime. For this reason, we recommend setting n_jobs=-2 so that Scikit-Learn uses all the cores except one. Thus, we leave one core out of the parallel processing to give our computer some breathing space.

Notice that below we set n_jobs to -2 to use all the cores but one.

In [10]:
gs_pipe_KNN_all_cores = GridSearchCV(estimator=pipe_KNN, 
                           verbose=1), target);
Fitting 10 folds for each of 60 candidates, totalling 600 fits

Do not assume that using parallel processing will always result in shorter run times. Parallel processing has an overhead where the task at hand needs to be split into sub-tasks and this has a time cost. Whether parallel processing is faster than sequential processing depends on a lot of factors, such as your dataset's characteristics (too small or too big), your algorithm, your pipeline, and your computer's hardware. So, you might want to experiment with this for your particular problem.

Saving and Loading Models

Once we execute a pipeline and obtain the best estimator, we can retrain the model with these optimal parameters using the full dataset before we make a new prediction. Alternatively, we can save the best estimator from the pipeline and load it whenever we need it.

The best model from the pipeline is shown below.

In [11]:
                             score_func=<function mutual_info_classif at 0x00000151B510C940>)),
                ('knn', KNeighborsClassifier())])

Let's save the best model as "best_KNN.pkl" in a pickle format. This is a compressed format for Python.

In [12]:
import joblib

joblib.dump(gs_pipe_KNN.best_estimator_, 'best_KNN.pkl', compress = 1)

To retreive the saved model object, we load it using joblib.load function.

In [13]:
saved_knn = joblib.load('best_KNN.pkl')

Once we load the best model, we can use it to make predictions. For instance, let's predict the target label for the first 20 rows in the full dataset.

In [14]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

Comparing Performance of Classifiers Using Paired T-Tests

For performance assessment of classifiers and regressors, we use repeated cross-validation. However, cross-validation itself is a random process and we need statistical tests in order to determine if any difference between the performance of any two classification methods is statistically significant; or if it is within the sample variation and the difference is statistically insignificant.

Let's now do the following: select the best 3 features among the descriptive features using the F-Score metric and compare the best tree estimator to the best nearest neighbor estimator using these 3 features. We will continue using AUC as our performance metric.

First, we execute another KNN pipeline with the same parameters, but with only 3 features as selected by the F-score method.

In [15]:
params_pipe_KNN_fs = {'fselector__k': [3],
                      'knn__n_neighbors': [1, 2, 3, 4, 5],
                      'knn__p': [1, 2]}
pipe_KNN_fs = Pipeline([('fselector', SelectKBest(score_func=f_classif)), 
                        ('knn', KNeighborsClassifier())])

gs_pipe_KNN_fs = GridSearchCV(estimator=pipe_KNN_fs, 
                           verbose=1), target);
Fitting 10 folds for each of 10 candidates, totalling 100 fits
In [16]:
{'fselector__k': 3, 'knn__n_neighbors': 5, 'knn__p': 1}
In [17]:

Next, let's execute the DT pipeline with the same 3 features as selected by the F-score method.

In [18]:
from sklearn.tree import DecisionTreeClassifier

df_classifier = DecisionTreeClassifier(random_state=999)

params_pipe_DT_fs = {'fselector__k': [3],                  
                  'dt__criterion': ['gini', 'entropy'],
                  'dt__max_depth': [1, 2, 3, 4]}
pipe_DT_fs = Pipeline([('fselector', SelectKBest(score_func=f_classif)), 
                    ('dt', df_classifier)])

gs_pipe_DT_fs  = GridSearchCV(estimator=pipe_DT_fs, 
                           verbose=1), target);
Fitting 10 folds for each of 8 candidates, totalling 80 fits
In [19]:
{'dt__criterion': 'entropy', 'dt__max_depth': 3, 'fselector__k': 3}

We see that the best parameters are the entropy split criterion with a maximum depth of 3.

In [20]:

Now we are ready to compare the best DT estimator against the best KNN estimator. This time let's use a stratified 5-fold cross-validation with 5 repetitions. But we will set a different random seed this time. We will also use parallel processing to speed up the cross-validation process.

In [21]:
from sklearn.model_selection import cross_val_score

cv_method_ttest = RepeatedStratifiedKFold(n_splits=5, 

cv_results_KNN = cross_val_score(estimator=gs_pipe_KNN_fs.best_estimator_,
In [22]:
cv_results_DT = cross_val_score(estimator=gs_pipe_DT_fs.best_estimator_,

Since we fixed the random state to be same for both cross-validation procedures, both classifiers were fitted and then tested on exactly the same data partitions. This indicates that our experiments were actually paired. Conducting experiments in a paired fashion reduces the variability significantly compared to conducting experiments in an independent fashion.

Let's now conduct paired t-tests to see if the difference between the best tree and best nearest neighbor models is statistically significant.

For a paired t-test in Python, we use the stats.ttest_rel function inside the scipy module and look at the p-values. At a 95% significance level, if the p-value is smaller than 0.05, we conclude that the difference is statistically significant.

In [23]:
from scipy import stats

print(stats.ttest_rel(cv_results_DT, cv_results_KNN).pvalue.round(3))

We observe that the p-value of our paired t-test is below 0.05. Thus, we conclude that, at a 95% level, this difference is statistically significant. Another observation here is that performances of the classifiers with only 3 features is not bad at all; they are slightly lower compared to corresponding classifiers trained with the full set of 30 features.

As a side note, we do not recommend using more than 5 repetitions within a repeated cross-validation scheme. The reason is that, one can set an extremely high number of repetitions and, in this case even a very small difference becomes statistically significant. In addition, you need to think about the difference in practical terms. For instance, suppose the difference between two classifiers, say A and B, is 1.5%, and it is statistically significant. However, is it really the case that there is a practical difference and classifier A is always better than B? After all, the dataset you have is a small subset of the entire population in most cases, and the dataset itself is random. Another selection of a dataset from the same population might perhaps would have reversed the situation, this time making classifier B better than A (for example, consider another independent set of 569 patients for breast cancer screening). Thus, setting the repetitions too high ends up overfitting the dataset and it is generally not a good idea.

Model Deployment

Suppose we decided that the best estimator of the KNN pipeline (with the full set of features) is the preferred machine learning method for our problem. So, now we would like to go ahead and deploy this model in the real world for making a prediction on a new set of observations.

The business of train-test-split (or cross-validation) is for comparing models and making sure we are not overfitting. Once we figure out which model to use for deployment, we use the entire data for training, not just the train data. Think about this: why would you throw away a good chunk of your data (say, 30% that you used for testing) while training a model for deployment? That would be a waste of data, and data is always valuable.

Once we train our deployment model with the entire data, we are ready to make predictions for new observations. The trick here is that, before making a new prediction, we must scale the new observations exactly as we scale the original input data.

As an example, suppose the model is a main-effects logistic regression model. Consider an age feature with a minimum value of 20 and a maximum value of 60 that was normalized between 0 and 1 during training. Suppose the coefficient of the age feature in the logistic regression (after model fitting) is 1.6. For an observation with an age value of 40, its normalized value would be 0.5 and its contribution to the linear part in the logistic regression would be 1.6 x 0.5 = 0.8. In a set of new observations, suppose the minimum age is 30 and maximum age is 80. If normalized from scratch, an observation with an age value of 40 in the new set of observations would be (40-30)/(80-30) = 0.2, and its contribution to the linear part in the logistic regression would be 1.6 x 0.2 = 0.32, which will clearly result in different prediction probabilities. This example illustrates that new observations must be scaled exactly the same way as the original data.

For this purpose, we use the "data_scaler" object that we created earlier that "knows" how to transform input data. As an illustration, suppose we would like to find out the model's prediction for the first three rows in the input data. Of course, we already know the labels of these rows (which are all malignant), so this is just to illustrate how you would make a prediction for a new set of observations.

In [24]:
knn_classifier_deployment = gs_pipe_KNN.best_estimator_

# train deployment model on the entire (scaled) data, target)

# get the first three observations from the original (unscaled) input data
obs_for_prediction_unscaled = Data_unscaled[0:3, ]

# scale these observations using the min-max scaler that was fitted to the input data
obs_for_prediction = data_scaler.transform(obs_for_prediction_unscaled)

# use the model's predict function for making a prediction for these three observations
array([1, 1, 1])