Information Gain Computation

InfoGain_Computation

Information Gain Computation in Python

This tutorial illustrates how impurity and information gain can be calculated in Python using the NumPy and pandas modules for information-based machine learning. The impurity calculation methods described in here are as follows:

  • Entropy
  • Gini index

We start off with a simple example, which is followed by the Vegetation example in Chapter 5 in the textbook.

A Simple Example

Suppose you are going out for a picnic and you are preparing a basket of some delicious fruits.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
In [2]:
lst = ['apple']*3 + ['orange']*2 + ['banana']*2
fruits = pd.Series(lst)
print(fruits)
0     apple
1     apple
2     apple
3    orange
4    orange
5    banana
6    banana
dtype: object

Here is the relative frequency of each fruit in the basket, which can be considered as the probability distribution of the fruits.

In [3]:
probs = fruits.value_counts(normalize=True)
probs
Out[3]:
apple     0.428571
banana    0.285714
orange    0.285714
dtype: float64

If you like, you can define the probability distribution yourself as below.

In [4]:
probs_by_hand = [3/7, 2/7, 2/7]
print(probs_by_hand)
[0.42857142857142855, 0.2857142857142857, 0.2857142857142857]

Recall that Shannon's model defines entropy as $$H(x) := - \sum_{i=1}^{\ell}(P(t=i) \times \log_{2}(P(t=i))$$ The idea with entropy is that the more heterogenous and impure a feature is, the higher the entropy. Conversely, the more homogenous and pure a feature is, the lower the entropy.

The following calculation shows how impurity of this fruit basket can be computed using the entropy criterion.

In [5]:
entropy = -1 * np.sum(np.log2(probs) * probs)
entropy
Out[5]:
1.5566567074628228

The gini impurity index is defined as follows: $$ \mbox{Gini}(x) := 1 - \sum_{i=1}^{\ell}P(t=i)^{2}$$ The idea with Gini index is the same as in entropy in the sense that the more heterogenous and impure a feature is, the higher the Gini index.

A nice property of the Gini index is that it is always between 0 and 1, and this may make it easier to compare Gini indices across different features.

The impurity of our fruit basket using Gini index is calculated as below.

In [6]:
gini_index = 1 - np.sum(np.square(probs))
gini_index
Out[6]:
0.653061224489796

In comparison, let's compute impurity of another fruit basket with 7 different fruits with equal frequency.

In [7]:
lst2 = ['apple', 'orange', 'banana', 'mango', 'blueberry', 'watermelon', 'pear']
fruits2 = pd.Series(lst2)
print(fruits2)
probs2 = fruits2.value_counts(normalize=True)
probs2
0         apple
1        orange
2        banana
3         mango
4     blueberry
5    watermelon
6          pear
dtype: object
Out[7]:
pear          0.142857
mango         0.142857
orange        0.142857
watermelon    0.142857
blueberry     0.142857
banana        0.142857
apple         0.142857
dtype: float64
In [8]:
entropy = -1 * np.sum(np.log2(probs2) * probs2)
entropy
Out[8]:
2.807354922057604
In [9]:
gini_index = 1 - np.sum(np.square(probs2))
gini_index
Out[9]:
0.8571428571428572

As expected, both entropy and Gini index of the second fruit basket is higher than those of the first fruit basket.

The Vegetation Example

We now work out the details of the impurity calculations for the Vegetation dataset in Chapter 5 in the textbook.

Let's first import the dataset from the Cloud.

In [10]:
import pandas as pd
import io
import requests

# if you run into any SSL certification issues, 
# you may need to run the following command for a Mac OS installation.
# $/Applications/Python\ 3.6/Install\ Certificates.command

# how to read a csv file from a github account
df_url = 'https://raw.githubusercontent.com/vaksakalli/ml_tutorials/master/FMLPDA_Table4_3.csv'
url_content = requests.get(df_url).content
# print(url_content)
df = pd.read_csv(io.StringIO(url_content.decode('utf-8')))
df
Out[10]:
stream slope elevation vegetation
0 False steep high chapparal
1 True moderate low riparian
2 True steep medium riparian
3 False steep medium chapparal
4 False flat high conifer
5 True steep highest conifer
6 True steep high chapparal

For convenience, we define a function called compute_impurity() that calculates impurity of a feature using either entropy or gini index.

In [11]:
def compute_impurity(feature, impurity_criterion):
    """
    This function calculates impurity of a feature.
    Supported impurity criteria: 'entropy', 'gini'
    input: feature (this needs to be a Pandas series)
    output: feature impurity
    """
    probs = feature.value_counts(normalize=True)
    
    if impurity_criterion == 'entropy':
        impurity = -1 * np.sum(np.log2(probs) * probs)
    elif impurity_criterion == 'gini':
        impurity = 1 - np.sum(np.square(probs))
    else:
        raise ValueError('Unknown impurity criterion')
        
    return(round(impurity, 3))


# let's do two quick examples.
print('impurity using entropy:', compute_impurity(fruits, 'entropy'))
print('impurity using gini index:', compute_impurity(fruits, 'gini'))
# how to test for an incorrect compute_impurity_criterion value:
# print('impurity using gini index:', compute_impurity(df['stream'], 'foo'))
impurity using entropy: 1.557
impurity using gini index: 0.653

Let's calculate entropy of the target feature "vegetation" using our new function.

In [12]:
target_entropy = compute_impurity(df['vegetation'], 'entropy')
target_entropy
Out[12]:
1.557

Let's compute the information gain for splitting based on a descriptive feature to figure out the best feature to split on. For this task, we do the following:

  1. Compute impurity of the target feature (using either entropy or gini index).
  2. Partition the dataset based on unique values of the descriptive feature.
  3. Compute impurity for each partition.
  4. Compute the remaining impurity as the weighted sum of impurity of each partition.
  5. Compute the information gain as the difference between the impurity of the target feature and the remaining impurity.

We will define another function to achieve this, called comp_feature_information_gain().

As an example, let's have a look at the levels of the "elevation" feature.

In [13]:
df['elevation'].value_counts()
Out[13]:
high       3
medium     2
highest    1
low        1
Name: elevation, dtype: int64

Let's see how the partitions look like for this feature and what the corresponding calculations are using the entropy split criterion.

In [14]:
for level in df['elevation'].unique():
    print('level name:', level)
    df_feature_level = df[df['elevation'] == level]
    print('corresponding data partition:')
    print(df_feature_level)
    print('partition target feature impurity:', compute_impurity(df_feature_level['vegetation'], 'entropy'))
    print('partition weight:', str(len(df_feature_level)) + '/' + str(len(df)))
    print('====================')
level name: high
corresponding data partition:
   stream  slope elevation vegetation
0   False  steep      high  chapparal
4   False   flat      high    conifer
6    True  steep      high  chapparal
partition target feature impurity: 0.918
partition weight: 3/7
====================
level name: low
corresponding data partition:
   stream     slope elevation vegetation
1    True  moderate       low   riparian
partition target feature impurity: -0.0
partition weight: 1/7
====================
level name: medium
corresponding data partition:
   stream  slope elevation vegetation
2    True  steep    medium   riparian
3   False  steep    medium  chapparal
partition target feature impurity: 1.0
partition weight: 2/7
====================
level name: highest
corresponding data partition:
   stream  slope elevation vegetation
5    True  steep   highest    conifer
partition target feature impurity: -0.0
partition weight: 1/7
====================

The idea here is that, for each one of the 4 data partitions above,

  1. We compute their impurity with respect to the target feature as a stand-alone dataset.
  2. We weigh these impurities with the relative number of observations in each partition. The relative number of observations is calculated as the number of observations in the partition divided by the total number of observations in the entire dataset. For instance, the weight of the first partition is 3/7.
  3. We add up these weighted impurities and call it the remaining impurity for this feature.

For instance, remaining impurity as measured by entropy for the elevation feature is 0.918 x (3/7) + 1.0 x (2/7) = 0.679.

Information gain is then calculated as 1.557 - 0.679 = 0.878.

Now we are ready to define our function. There is a bit of coding in here, but we can assure you that trying to figure out how things work in here will be rewarding to improve your Python programming skills.

In [15]:
def comp_feature_information_gain(df, target, descriptive_feature, split_criterion):
    """
    This function calculates information gain for splitting on 
    a particular descriptive feature for a given dataset
    and a given impurity criteria.
    Supported split criterion: 'entropy', 'gini'
    """
    
    print('target feature:', target)
    print('descriptive_feature:', descriptive_feature)
    print('split criterion:', split_criterion)
            
    target_entropy = compute_impurity(df[target], split_criterion)

    # we define two lists below:
    # entropy_list to store the entropy of each partition
    # weight_list to store the relative number of observations in each partition
    entropy_list = list()
    weight_list = list()
    
    # loop over each level of the descriptive feature
    # to partition the dataset with respect to that level
    # and compute the entropy and the weight of the level's partition
    for level in df[descriptive_feature].unique():
        df_feature_level = df[df[descriptive_feature] == level]
        entropy_level = compute_impurity(df_feature_level[target], split_criterion)
        entropy_list.append(round(entropy_level, 3))
        weight_level = len(df_feature_level) / len(df)
        weight_list.append(round(weight_level, 3))

    print('impurity of partitions:', entropy_list)
    print('weights of partitions:', weight_list)

    feature_remaining_impurity = np.sum(np.array(entropy_list) * np.array(weight_list))
    print('remaining impurity:', feature_remaining_impurity)
    
    information_gain = target_entropy - feature_remaining_impurity
    print('information gain:', information_gain)
    
    print('====================')

    return(information_gain)

Now that our function has been defined, we will call it for each descriptive feature in the dataset. First let's call it using the entropy split criteria.

In [16]:
split_criterion = 'entropy'
for feature in df.drop(columns='vegetation').columns:
    feature_info_gain = comp_feature_information_gain(df, 'vegetation', feature, split_criterion)
target feature: vegetation
descriptive_feature: stream
split criterion: entropy
impurity of partitions: [0.918, 1.5]
weights of partitions: [0.429, 0.571]
remaining impurity: 1.250322
information gain: 0.306678
====================
target feature: vegetation
descriptive_feature: slope
split criterion: entropy
impurity of partitions: [1.371, -0.0, -0.0]
weights of partitions: [0.714, 0.143, 0.143]
remaining impurity: 0.9788939999999999
information gain: 0.578106
====================
target feature: vegetation
descriptive_feature: elevation
split criterion: entropy
impurity of partitions: [0.918, -0.0, 1.0, -0.0]
weights of partitions: [0.429, 0.143, 0.286, 0.143]
remaining impurity: 0.6798219999999999
information gain: 0.877178
====================

Now let's call it using the gini index split criteria.

In [17]:
split_criteria = 'gini'
for feature in df.drop(columns='vegetation').columns:
    feature_info_gain = comp_feature_information_gain(df, 'vegetation', feature, split_criteria)
target feature: vegetation
descriptive_feature: stream
split criterion: gini
impurity of partitions: [0.444, 0.625]
weights of partitions: [0.429, 0.571]
remaining impurity: 0.5473509999999999
information gain: 0.1056490000000001
====================
target feature: vegetation
descriptive_feature: slope
split criterion: gini
impurity of partitions: [0.56, 0.0, 0.0]
weights of partitions: [0.714, 0.143, 0.143]
remaining impurity: 0.39984000000000003
information gain: 0.25316
====================
target feature: vegetation
descriptive_feature: elevation
split criterion: gini
impurity of partitions: [0.444, 0.0, 0.5, 0.0]
weights of partitions: [0.429, 0.143, 0.286, 0.143]
remaining impurity: 0.333476
information gain: 0.31952400000000003
====================

We observe that, with both the entropy and gini index split criteria, the highest information gain occurs with the "elevation" feature.

This is the for the split at the root node of the corresponding decision tree. In subsequent splits, the above procedure is repeated with the subset of the entire dataset in the current branch until the termination condition is reached.

Please refer to Chapter 4: Information-Based Learning for more details.


www.featureranking.com