SK Part 0: Introduction to Predictive Modeling with Python and Scikit-Learn


This is the first in a series of tutorials on supervised machine learning with Python and Scikit-Learn. It is a short introductory tutorial that provides a bird's-eye view using a binary classification problem as an example, and it is a simplified version of the tutorial SK Part 1. The reference textbook in these tutorials is below and it can be accessed here.

The classifiers illustrated are as follows:

  • Nearest neighbors (Chapter on Similarity-based learning)
  • Decision trees (Chapter on Information-based learning)
  • Random forests ensemble method (Chapter on Information-based learning)
  • Naive Bayes (Chapter on Probability-based learning)
  • Support vector machines (Chapter on Error-based learning)

As an overview, we shall cover various aspects of scikit-learn in the following tutorials:

  • SK Part 0 ("SK-Intro"): Introduction to machine learning with Python and scikit-learn (this tutorial)
  • SK Part 1 ("SK-Basics"): Basic model fitting
  • SK Part 2 ("SK-FS"): Feature selection and ranking
  • SK Part 3 ("SK-Eval"): Model evaluation (using performance metrics other than simple accuracy)
  • SK Part 4 ("SK-CV"): Cross-validation and hyper-parameter tuning
  • SK Part 5 ("SK-Pipes"): Machine learning pipeline, statistical model comparison, and model deployment

Binary Classification Example: Breast Cancer Wisconsin Data

This dataset is concerned with predicting whether a cell tissue is cancerous or not using the cell's measurement values. It contains 569 observations and 30 input features. The target feature, "diagnosis", has two classes: 212 "malignant" and 357 "benign", denoted by "M" and "B" respectively.

The dataset has no missing values and all features are numeric other than the target feature (which is binary).
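
The figures quoted above can be checked with a few lines of pandas. The sketch below is only illustrative: it assumes that the copy of the dataset bundled with scikit-learn (`load_breast_cancer`) matches the CSV file used in this tutorial, and the name `check_df` is introduced here for the example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the tutorial's CSV file.
cancer = load_breast_cancer()
check_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# In the bundled copy, target 0 means malignant and 1 means benign.
check_df['diagnosis'] = pd.Series(cancer.target).map({0: 'M', 1: 'B'})

print(check_df.shape)                        # rows and columns, target included
print(check_df['diagnosis'].value_counts())  # number of observations per class
print(check_df.isna().sum().sum())           # total number of missing values
print(check_df.drop(columns='diagnosis').dtypes.unique())  # dtypes of the inputs
```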

Reading Breast Cancer Dataset from the Cloud

We load the data directly from the following GitHub account.

In [1]:
import warnings

import numpy as np
import pandas as pd
import io
import requests

# so that we can see all the columns
pd.set_option('display.max_columns', None) 

# how to read a csv file from a github account
url_name = ''
url_content = requests.get(url_name, verify=False).content
df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))

Let's check the shape of this dataset to make sure it has been downloaded correctly.

In [2]:
df.shape
(569, 31)

Let's have a look at the first 5 rows in this raw dataset.

In [3]:
df.head()
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension radius_error texture_error perimeter_error area_error smoothness_error compactness_error concavity_error concave_points_error symmetry_error fractal_dimension_error worst_radius worst_texture worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension diagnosis
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 M
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 M
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 M
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 M
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 M

Partitioning Dataset into the Set of Descriptive Features and the Target Feature

Next, we partition the columns of df into the set of descriptive features and the target.

In [4]:
# The ".values" part below converts the data frame to a 2-dimensional numpy array
Data = df.drop(columns = 'diagnosis').values
target = df['diagnosis']

Encoding Target

Keep in mind that scikit-learn generally expects data to be numeric, so we encode the target as 0 and 1.

In [5]:
from sklearn import preprocessing

target = preprocessing.LabelEncoder().fit_transform(target)

Note that the LabelEncoder assigns labels in alphabetical order. That is, "B" is labeled as 0 whereas "M" is labeled as 1 (see the code below).

In [6]:
np.unique(target, return_counts = True)
(array([0, 1]), array([357, 212], dtype=int64))
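
To see which original label each integer stands for, one option is to keep a reference to the fitted encoder and inspect its `classes_` attribute. The sketch below uses a small hypothetical label array (`toy_labels`) rather than the tutorial's target.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels, used only to show the encoder's behaviour.
toy_labels = np.array(['M', 'B', 'B', 'M', 'B'])

encoder = LabelEncoder()
toy_encoded = encoder.fit_transform(toy_labels)

print(encoder.classes_)   # sorted labels; a label's position is its integer code
print(toy_encoded)        # the encoded labels
print(encoder.inverse_transform(toy_encoded))  # back to the original labels
```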

Scaling Descriptive Features

It is generally a good idea to scale your descriptive features before fitting models, particularly distance-based ones such as nearest neighbors. Here, we use "min-max scaling" so that each descriptive feature is scaled to be between 0 and 1. In the rest of this tutorial, we work with scaled data.

In [7]:
Data = preprocessing.MinMaxScaler().fit_transform(Data)
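
Min-max scaling maps each column x to (x - min) / (max - min), computed column by column. The sketch below checks this formula against `MinMaxScaler` on a small hypothetical array (`toy`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [4.0, 800.0]])

# Apply the min-max formula by hand, one column at a time.
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

scaled = MinMaxScaler().fit_transform(toy)

print(manual)
print(np.allclose(manual, scaled))  # the two results agree
```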

Splitting Data into Training and Test Sets

We split the descriptive features and the target feature into a training set and a test set by a ratio of 70:30. That is, we use 70% of the data to build our classifiers and evaluate their performance on the remaining 30% of the data. This ensures that we measure model performance on unseen data, so that an overfitted model does not appear better than it really is. We also set a random state value so that we can replicate our results later on.

In [8]:
from sklearn.model_selection import train_test_split

D_train, D_test, t_train, t_test = train_test_split(Data, target,
                                                    test_size = 0.3,
                                                    random_state = 999)  # assumed value; any fixed integer gives a reproducible split
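
As a rough illustration of what a 70:30 split produces for 569 rows, the sketch below uses hypothetical stand-in arrays (`X_demo`, `y_demo`) instead of the real data. The `random_state=0` value is arbitrary, and the `stratify` argument is an optional extra, not part of this tutorial's code, that keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with the same number of rows as the dataset.
X_demo = np.arange(569 * 2).reshape(569, 2)
y_demo = np.array([0] * 357 + [1] * 212)

# random_state=0 is arbitrary; stratify is an optional extra (an assumption here).
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3,
                                          random_state=0,
                                          stratify=y_demo)

print(X_tr.shape, X_te.shape)    # number of rows in each part
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```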

Fitting a Nearest Neighbor Classifier

Let's fit a nearest neighbor classifier with 5 neighbors using the Euclidean distance. We fit the model on the train data and evaluate its performance on the test data.

Below, the score method returns the accuracy of the classifier on the test data. Accuracy is defined as the ratio of correctly predicted observations to the total number of observations.

In [9]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier(n_neighbors=5, p=2).fit(D_train, t_train)
knn_classifier.score(D_test, t_test)
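
The accuracy definition above can be reproduced by hand from the predicted labels. The sketch below is self-contained and therefore makes some assumptions: it uses scikit-learn's bundled copy of the dataset (where the labels are coded the other way round, 0 for malignant) and an arbitrary `random_state=0`, so its numbers need not match this tutorial's.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled dataset stands in for the tutorial's CSV file.
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, p=2).fit(X_tr, y_tr)

# Accuracy = correctly predicted observations / total observations.
predictions = knn.predict(X_te)
manual_accuracy = np.mean(predictions == y_te)

print(manual_accuracy, knn.score(X_te, y_te))  # the two values agree
```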

If you would like to see which parameters are available for a classifier, just type the name of the classifier followed by a question mark, e.g., "KNeighborsClassifier?".

In [10]:
# KNeighborsClassifier?
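
The question-mark syntax works in IPython and Jupyter. In a plain Python script, one alternative is the `get_params` method, which returns the parameters of an estimator together with their current values.

```python
from sklearn.neighbors import KNeighborsClassifier

# get_params() returns a dictionary of parameter names and current values.
params = KNeighborsClassifier(n_neighbors=5, p=2).get_params()

for name, value in sorted(params.items()):
    print(f"{name} = {value}")
```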

Fitting a Decision Tree Classifier

Let's fit a decision tree classifier with the entropy split criterion and a maximum depth of 4 on the train data, and then evaluate its performance on the test data.

In [11]:
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(criterion='entropy', max_depth=4).fit(D_train, t_train)
dt_classifier.score(D_test, t_test)
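
With the entropy criterion, the tree prefers splits that reduce the entropy of the class distribution the most. As a small worked example, the sketch below computes the Shannon entropy of the full dataset's class counts (357 benign, 212 malignant), which comes out at roughly 0.95 bits. The helper name `entropy` is introduced here for illustration and is not part of scikit-learn.

```python
import numpy as np

def entropy(class_counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    counts = np.asarray(class_counts, dtype=float)
    probs = counts / counts.sum()
    probs = probs[probs > 0]  # empty classes contribute nothing
    return float(np.sum(probs * np.log2(1.0 / probs)))

root_entropy = entropy([357, 212])
print(round(root_entropy, 4))

# Two reference points: a pure node and a perfectly mixed node.
print(entropy([10, 0]), entropy([5, 5]))
```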

Fitting a Random Forest Classifier

An ensemble method is a collection of many sub-classifiers. The final outcome is determined by a majority vote of the sub-classifiers. The random forest classifier is a popular ensemble method based on the idea of "bagging" where the sub-classifiers are decision trees. Let's fit a random forest classifier with 100 decision trees.

In [12]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100).fit(D_train, t_train)
rf_classifier.score(D_test, t_test)
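
The idea of majority voting can be sketched with a few lines of NumPy. The vote matrix below is hypothetical and only illustrates the principle. Note that scikit-learn's own random forest combines its trees by averaging their predicted class probabilities rather than counting hard votes.

```python
import numpy as np

# Hypothetical 0/1 predictions: 5 sub-classifiers (rows), 4 observations (columns).
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
])

# Class 1 wins for an observation when more than half of the sub-classifiers vote for it.
majority = (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

print(majority)  # → [1 0 1 0]
```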

Fitting a Gaussian Naive Bayes Classifier

Another model we would like to fit to the breast cancer dataset is the Gaussian Naive Bayes classifier with a variance smoothing value of $10^{-3}$.

In [13]:
from sklearn.naive_bayes import GaussianNB

nb_classifier = GaussianNB(var_smoothing=10**(-3)).fit(D_train, t_train)
nb_classifier.score(D_test, t_test)

Fitting a Support Vector Machine

One last model we fit is the SVM with all the default values.

In [14]:
from sklearn.svm import SVC

svm_classifier = SVC().fit(D_train, t_train)
svm_classifier.score(D_test, t_test)

Making Predictions with a Fitted Model

Once a model is built, a prediction can be made using the predict method of the fitted classifier.

For example, suppose we would like to use the fitted nearest neighbor classifier as our model, and we would like to find out the model's prediction for the first three rows in the input data. Of course, we already know the labels of these rows (which are all malignant), so this is just to illustrate how you would make a prediction for a new observation.

In [15]:
new_obs = Data[0:3]
knn_classifier.predict(new_obs)
array([1, 1, 1])

The model's prediction for these three rows is that they are all "1", that is, they are all "malignant". Thus, in this particular case, we observe that the model correctly predicts the first three rows in the input data.
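
For a genuinely new observation, the raw measurements would first need the same scaling that was applied to the data used for fitting. The sketch below shows one way to do this by keeping the fitted scaler around. It makes several assumptions: scikit-learn's bundled copy of the dataset is used (where 0 denotes malignant), the model is fitted on all rows for brevity, and the first three raw rows merely play the role of new measurements.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Assumption: the bundled dataset stands in for the tutorial's CSV file.
X_raw, y = load_breast_cancer(return_X_y=True)

# Keep the fitted scaler so the same transformation can be reused later.
scaler = MinMaxScaler().fit(X_raw)
knn = KNeighborsClassifier(n_neighbors=5, p=2).fit(scaler.transform(X_raw), y)

# Pretend these unscaled rows are new measurements.
new_raw = X_raw[0:3]
new_scaled = scaler.transform(new_raw)

new_pred = knn.predict(new_scaled)         # predicted class labels
new_proba = knn.predict_proba(new_scaled)  # share of neighbors in each class

print(new_pred)
print(new_proba)
```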


This tutorial illustrates that Python and Scikit-Learn together provide a unified interface to model fitting and evaluation and they greatly simplify the machine learning workflow.
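
Because every classifier exposes the same `fit` and `score` methods, the five models in this tutorial can be fitted and compared in a single loop. The sketch below is one possible arrangement, not the tutorial's own code. It assumes scikit-learn's bundled copy of the dataset and arbitrary `random_state=0` values, so the accuracies it prints may differ from those obtained above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for the tutorial's CSV file.
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Same settings as in this tutorial; random_state values are arbitrary additions.
classifiers = {
    'nearest neighbors': KNeighborsClassifier(n_neighbors=5, p=2),
    'decision tree': DecisionTreeClassifier(criterion='entropy', max_depth=4,
                                            random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
    'naive Bayes': GaussianNB(var_smoothing=1e-3),
    'SVM': SVC(),
}

# Every classifier is fitted and scored through the same two methods.
test_accuracy = {}
for name, clf in classifiers.items():
    test_accuracy[name] = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name}: {test_accuracy[name]:.3f}")
```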

Of course, there is a whole lot more to supervised machine learning than what is shown here, such as

  1. Other classification algorithms
  2. Solving prediction problems where the target feature is numeric (a.k.a. regression problems)
  3. Using other model performance metrics (e.g., precision, recall, mean squared error for regression, etc.)
  4. More sophisticated model performance assessment methods (such as cross-validation)
  5. How model parameters can be optimized (also known as hyperparameter tuning)

We cover these topics in the subsequent tutorials.