Information Gain Computation in Python¶
This tutorial illustrates how impurity and information gain can be calculated in Python using the NumPy and Pandas modules for information-based machine learning. The impurity calculation methods described here are as follows:
- Entropy
- Gini index
We start off with a simple example, which is followed by the Vegetation example in the "Information-Based Learning" Chapter in the textbook.
A Simple Example¶
Suppose you are going out for a picnic and you are preparing a basket of some delicious fruits.
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
lst = ['apple']*3 + ['orange']*2 + ['banana']*2
fruits = pd.Series(lst)
print(fruits)
0     apple
1     apple
2     apple
3    orange
4    orange
5    banana
6    banana
dtype: object
Here is the relative frequency of each fruit in the basket, which can be considered as the probability distribution of the fruits.
probs = fruits.value_counts(normalize=True)
probs
apple     0.428571
orange    0.285714
banana    0.285714
dtype: float64
If you like, you can define the probability distribution yourself as below.
probs_by_hand = [3/7, 2/7, 2/7]
print(probs_by_hand)
[0.42857142857142855, 0.2857142857142857, 0.2857142857142857]
Recall that Shannon's model defines entropy as $$H(x) := - \sum_{i=1}^{\ell} P(t=i) \times \log_{2}(P(t=i))$$ The idea with entropy is that the more heterogeneous and impure a feature is, the higher the entropy. Conversely, the more homogeneous and pure a feature is, the lower the entropy.
The following calculation shows how impurity of this fruit basket can be computed using the entropy criterion.
entropy = -1 * np.sum(np.log2(probs) * probs)
entropy
1.5566567074628228
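As a quick sanity check, the same value can be obtained with SciPy's entropy function (this assumes SciPy is available in your environment; it is not used elsewhere in this tutorial).
# optional cross-check with SciPy; pass base=2 to get Shannon entropy in bits
from scipy.stats import entropy as scipy_entropy
scipy_entropy(probs, base=2)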
The Gini impurity index is defined as follows: $$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$ The idea with the Gini index is the same as with entropy in the sense that the more heterogeneous and impure a feature is, the higher the Gini index.
A nice property of the Gini index is that it is always between 0 and 1, and this may make it easier to compare Gini indices across different features.
The impurity of our fruit basket using the Gini index is calculated below.
gini_index = 1 - np.sum(np.square(probs))
gini_index
0.653061224489796
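To verify this value by hand, note that with class probabilities 3/7, 2/7, and 2/7 the Gini index is 1 - (3/7)^2 - (2/7)^2 - (2/7)^2 = 1 - 17/49 = 32/49, approximately 0.653. The fractions module (used here only as a quick check) confirms this exactly.
# exact by-hand check of the Gini index for the fruit basket
from fractions import Fraction
gini_exact = 1 - (Fraction(3, 7)**2 + Fraction(2, 7)**2 + Fraction(2, 7)**2)
print(gini_exact, float(gini_exact))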
In comparison, let's compute the impurity of another fruit basket containing 7 different fruits, each with equal frequency.
lst2 = ['apple', 'orange', 'banana', 'mango', 'blueberry', 'watermelon', 'pear']
fruits2 = pd.Series(lst2)
print(fruits2)
probs2 = fruits2.value_counts(normalize=True)
probs2
0         apple
1        orange
2        banana
3         mango
4     blueberry
5    watermelon
6          pear
dtype: object
apple         0.142857
orange        0.142857
banana        0.142857
mango         0.142857
blueberry     0.142857
watermelon    0.142857
pear          0.142857
dtype: float64
entropy = -1 * np.sum(np.log2(probs2) * probs2)
entropy
2.807354922057604
gini_index = 1 - np.sum(np.square(probs2))
gini_index
0.8571428571428572
As expected, both the entropy and the Gini index of the second fruit basket are higher than those of the first fruit basket.
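In fact, both values are the largest possible for 7 classes: with equal frequencies, entropy equals log2(7) and the Gini index equals 1 - 1/7. Here is a quick check (reusing the NumPy import from above).
# maximum impurity values for a uniform distribution over 7 classes
print(np.log2(7))   # maximum entropy with 7 equally likely classes
print(1 - 1/7)      # maximum Gini index with 7 equally likely classes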
The Vegetation Example¶
We now work out the details of the impurity calculations for the Vegetation dataset in Chapter 5 in the textbook.
Let's first import the dataset from the Cloud.
import pandas as pd
import io
import requests
url_name = 'https://raw.githubusercontent.com/akmand/datasets/master/FMLPDA_Table4_3.csv'
url_content = requests.get(url_name, verify=False).content
df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
df
| | stream | slope | elevation | vegetation |
|---|---|---|---|---|
| 0 | False | steep | high | chapparal |
| 1 | True | moderate | low | riparian |
| 2 | True | steep | medium | riparian |
| 3 | False | steep | medium | chapparal |
| 4 | False | flat | high | conifer |
| 5 | True | steep | highest | conifer |
| 6 | True | steep | high | chapparal |
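In case the URL cannot be reached (note that the request above disables certificate verification), the same seven-row dataset can be constructed directly in Pandas. The sketch below simply reproduces the table above; the resulting dtypes may differ slightly from those of the downloaded CSV.
# fallback: build the Vegetation dataset by hand instead of downloading it
df = pd.DataFrame({
    'stream':     [False, True, True, False, False, True, True],
    'slope':      ['steep', 'moderate', 'steep', 'steep', 'flat', 'steep', 'steep'],
    'elevation':  ['high', 'low', 'medium', 'medium', 'high', 'highest', 'high'],
    'vegetation': ['chapparal', 'riparian', 'riparian', 'chapparal',
                   'conifer', 'conifer', 'chapparal']
})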
For convenience, we define a function called compute_impurity() that calculates the impurity of a feature using either the entropy or the Gini index criterion.
def compute_impurity(feature, impurity_criterion):
"""
This function calculates impurity of a feature.
Supported impurity criteria: 'entropy', 'gini'
input: feature (this needs to be a Pandas series)
output: feature impurity
"""
probs = feature.value_counts(normalize=True)
if impurity_criterion == 'entropy':
impurity = -1 * np.sum(np.log2(probs) * probs)
elif impurity_criterion == 'gini':
impurity = 1 - np.sum(np.square(probs))
else:
raise ValueError('Unknown impurity criterion')
return(round(impurity, 3))
# let's do two quick examples.
print('impurity using entropy:', compute_impurity(fruits, 'entropy'))
print('impurity using gini index:', compute_impurity(fruits, 'gini'))
# how to test for an incorrect impurity_criterion value:
# print('impurity using gini index:', compute_impurity(df['stream'], 'foo'))
impurity using entropy: 1.557
impurity using gini index: 0.653
Let's calculate entropy of the target feature "vegetation" using our new function.
target_entropy = compute_impurity(df['vegetation'], 'entropy')
target_entropy
1.557
Let's compute the information gain for splitting based on a descriptive feature to figure out the best feature to split on. For this task, we do the following:
- Compute impurity of the target feature (using either entropy or gini index).
- Partition the dataset based on unique values of the descriptive feature.
- Compute impurity for each partition.
- Compute the remaining impurity as the weighted sum of impurity of each partition.
- Compute the information gain as the difference between the impurity of the target feature and the remaining impurity.
We will define another function to achieve this, called comp_feature_information_gain().
As an example, let's have a look at the levels of the "elevation" feature.
df['elevation'].value_counts()
high       3
medium     2
low        1
highest    1
Name: elevation, dtype: int64
Let's see what the partitions look like for this feature and what the corresponding calculations are using the entropy split criterion.
for level in df['elevation'].unique():
print('level name:', level)
df_feature_level = df[df['elevation'] == level]
print('corresponding data partition:')
print(df_feature_level)
print('partition target feature impurity:', compute_impurity(df_feature_level['vegetation'], 'entropy'))
print('partition weight:', str(len(df_feature_level)) + '/' + str(len(df)))
print('====================')
level name: high
corresponding data partition:
   stream  slope elevation vegetation
0   False  steep      high  chapparal
4   False   flat      high    conifer
6    True  steep      high  chapparal
partition target feature impurity: 0.918
partition weight: 3/7
====================
level name: low
corresponding data partition:
  stream     slope elevation vegetation
1   True  moderate       low   riparian
partition target feature impurity: -0.0
partition weight: 1/7
====================
level name: medium
corresponding data partition:
  stream  slope elevation vegetation
2   True  steep    medium   riparian
3  False  steep    medium  chapparal
partition target feature impurity: 1.0
partition weight: 2/7
====================
level name: highest
corresponding data partition:
  stream  slope elevation vegetation
5   True  steep   highest    conifer
partition target feature impurity: -0.0
partition weight: 1/7
====================
The idea here is that, for each one of the 4 data partitions above,
- We compute their impurity with respect to the target feature as a stand-alone dataset.
- We weigh these impurities with the relative number of observations in each partition. The relative number of observations is calculated as the number of observations in the partition divided by the total number of observations in the entire dataset. For instance, the weight of the first partition is 3/7.
- We add up these weighted impurities and call it the remaining impurity for this feature.
For instance, the remaining impurity as measured by entropy for the elevation feature is 0.918 x (3/7) + 1.0 x (2/7) = 0.679; the two single-observation partitions ("low" and "highest") have zero impurity and contribute nothing to the sum.
Information gain is then calculated as 1.557 - 0.679 = 0.878.
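The short check below reproduces this arithmetic.
# by-hand check of the remaining impurity and information gain for 'elevation'
remaining_elevation = 0.918 * (3/7) + 0.0 * (1/7) + 1.0 * (2/7) + 0.0 * (1/7)
print(round(remaining_elevation, 3))          # remaining impurity
print(round(1.557 - remaining_elevation, 3))  # information gain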
Now we are ready to define our function. There is a bit of code here, but we can assure you that working through how it all fits together will be rewarding and will sharpen your Python programming skills.
def comp_feature_information_gain(df, target, descriptive_feature, split_criterion):
"""
This function calculates information gain for splitting on
a particular descriptive feature for a given dataset
and a given impurity criterion.
Supported split criteria: 'entropy', 'gini'
"""
print('target feature:', target)
print('descriptive_feature:', descriptive_feature)
print('split criterion:', split_criterion)
target_entropy = compute_impurity(df[target], split_criterion)
# we define two lists below:
# entropy_list to store the entropy of each partition
# weight_list to store the relative number of observations in each partition
entropy_list = list()
weight_list = list()
# loop over each level of the descriptive feature
# to partition the dataset with respect to that level
# and compute the entropy and the weight of the level's partition
for level in df[descriptive_feature].unique():
df_feature_level = df[df[descriptive_feature] == level]
entropy_level = compute_impurity(df_feature_level[target], split_criterion)
entropy_list.append(round(entropy_level, 3))
weight_level = len(df_feature_level) / len(df)
weight_list.append(round(weight_level, 3))
print('impurity of partitions:', entropy_list)
print('weights of partitions:', weight_list)
feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
print('remaining impurity:', feature_remaining_impurity)
information_gain = target_entropy - feature_remaining_impurity
print('information gain:', information_gain)
print('====================')
return(information_gain)
Now that our function has been defined, we will call it for each descriptive feature in the dataset. First, let's call it using the entropy split criterion.
split_criterion = 'entropy'
for feature in df.drop(columns='vegetation').columns:
feature_info_gain = comp_feature_information_gain(df, 'vegetation', feature, split_criterion)
target feature: vegetation
descriptive_feature: stream
split criterion: entropy
impurity of partitions: [0.918, 1.5]
weights of partitions: [0.429, 0.571]
remaining impurity: 1.250322
information gain: 0.306678
====================
target feature: vegetation
descriptive_feature: slope
split criterion: entropy
impurity of partitions: [1.371, -0.0, -0.0]
weights of partitions: [0.714, 0.143, 0.143]
remaining impurity: 0.9788939999999999
information gain: 0.578106
====================
target feature: vegetation
descriptive_feature: elevation
split criterion: entropy
impurity of partitions: [0.918, -0.0, 1.0, -0.0]
weights of partitions: [0.429, 0.143, 0.286, 0.143]
remaining impurity: 0.6798219999999999
information gain: 0.877178
====================
Now let's call it using the Gini index split criterion.
split_criterion = 'gini'
for feature in df.drop(columns='vegetation').columns:
    feature_info_gain = comp_feature_information_gain(df, 'vegetation', feature, split_criterion)
target feature: vegetation
descriptive_feature: stream
split criterion: gini
impurity of partitions: [0.444, 0.625]
weights of partitions: [0.429, 0.571]
remaining impurity: 0.5473509999999999
information gain: 0.1056490000000001
====================
target feature: vegetation
descriptive_feature: slope
split criterion: gini
impurity of partitions: [0.56, 0.0, 0.0]
weights of partitions: [0.714, 0.143, 0.143]
remaining impurity: 0.39984000000000003
information gain: 0.25316
====================
target feature: vegetation
descriptive_feature: elevation
split criterion: gini
impurity of partitions: [0.444, 0.0, 0.5, 0.0]
weights of partitions: [0.429, 0.143, 0.286, 0.143]
remaining impurity: 0.333476
information gain: 0.31952400000000003
====================
We observe that, with both the entropy and Gini index split criteria, the highest information gain occurs with the "elevation" feature.
This is the split at the root node of the corresponding decision tree. For subsequent splits, the above procedure is repeated on the subset of the dataset in the current branch until a termination condition is reached.
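As an illustration of this recursion (a sketch, not part of the original tutorial code), the snippet below restricts the dataset to the "high" elevation branch of the root split and reruns the same information gain computation on the remaining descriptive features.
# sketch: repeat the procedure on one partition induced by the root split
df_high = df[df['elevation'] == 'high']
for feature in ['stream', 'slope']:
    comp_feature_information_gain(df_high, 'vegetation', feature, 'entropy')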