Machine Learning Part 6: Entropy and Information Gain


In the previous post, we saw how Decision Trees split data based on “purity.” But how do we measure this purity mathematically? This is where Entropy and Information Gain come in.

What is Entropy? 🎗

In machine learning, Entropy is a measure of impurity or randomness in a dataset.

  • If a subset is perfectly pure (all items belong to one class), the entropy is 0.
  • If a subset is maximally impure (items are evenly split between the two classes), the entropy is 1.

The Entropy Formula

To calculate entropy ($H$) for a subset with two classes: $$H = -(P_+) \log_2(P_+) - (P_-) \log_2(P_-)$$

Where:

  • $P_+$ is the probability of the positive class.
  • $P_-$ is the probability of the negative class.
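To make this concrete, here is a minimal Python sketch of the two-class entropy formula above. The function name `entropy` and its single-probability signature are illustrative choices, not part of the series:

```python
import math

def entropy(p_pos: float) -> float:
    """Two-class entropy: H = -(p)log2(p) - (1 - p)log2(1 - p).

    p_pos is the proportion of positive-class items in the subset.
    A perfectly pure subset (p = 0 or p = 1) has entropy 0, since
    x * log2(x) tends to 0 as x tends to 0.
    """
    if p_pos in (0.0, 1.0):
        return 0.0
    p_neg = 1.0 - p_pos
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
```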

Example: Teacher’s Presence

Recalling our “Good Day” example from Part 5, let’s look at the “Teacher Present” split:

Teacher Absent: 5 Good (G), 1 Bad (B) $$H(\text{absent}) = -(5/6) \log_2(5/6) - (1/6) \log_2(1/6) \approx 0.65$$

Teacher Present: 2 Good (G), 1 Bad (B) $$H(\text{present}) = -(2/3) \log_2(2/3) - (1/3) \log_2(1/3) \approx 0.92$$

Since 0.65 is closer to 0 than 0.92, the “Absent” group is “purer” than the “Present” group.
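Continuing with the hypothetical `entropy` helper sketched above, the two subset entropies can be checked in a couple of lines:

```python
# Teacher Absent: 5 Good out of 6 items
print(round(entropy(5 / 6), 2))  # 0.65

# Teacher Present: 2 Good out of 3 items
print(round(entropy(2 / 3), 2))  # 0.92
```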


What is Information Gain? 🎗

Information Gain measures the reduction in entropy after a dataset is split on an attribute. We want to split on the attribute that gives us the highest Information Gain.

The Formula:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Basically: (Entropy before split) - (Weighted average of Entropy after split).

If we calculate the gain for “Teacher Presence” in our 9-row dataset:

  • Total entropy: $H(S) \approx 0.76$
  • $Gain = 0.76 - (6/9 \cdot 0.65) - (3/9 \cdot 0.92) \approx 0.02$

A gain of 0.02 is quite low, suggesting that teacher presence isn’t the most important factor in predicting a good day.
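Here is a sketch of the same calculation in code, reusing the `entropy` helper from earlier. The `information_gain` name and its (positive count, total count) convention are assumptions for illustration; the counts come from the 9-row example:

```python
def information_gain(parent_pos, parent_total, subsets):
    """Gain(S, A) = H(S) - sum of (|S_v|/|S|) * H(S_v) over the subsets.

    subsets is a list of (positive_count, total_count) pairs,
    one pair per value of the attribute A.
    """
    parent_h = entropy(parent_pos / parent_total)
    weighted = sum(
        (total / parent_total) * entropy(pos / total)
        for pos, total in subsets
    )
    return parent_h - weighted

# 9 rows overall: 7 Good, 2 Bad. Split on Teacher Presence:
# Absent -> 5 Good of 6 rows, Present -> 2 Good of 3 rows.
print(round(information_gain(7, 9, [(5, 6), (2, 3)]), 2))  # ~0.02
```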

Exercise

Calculate the Information Gain for “Parent Mood” from the dataset in Part 5. You’ll find it is much higher!

In the next part, we will see how combining many decision trees creates a powerful model called a Random Forest.

