Machine Learning Part 6: Entropy and Information Gain



In the previous post, we saw how Decision Trees split data based on “purity.” But how do we measure this purity mathematically? This is where Entropy and Information Gain come in.

What is Entropy? 🎗

In machine learning, Entropy is a measure of impurity or randomness in a dataset.

  • If a subset is perfectly pure (all items belong to one class), the entropy is 0.
  • If a subset is totally impure (items are split 50/50 between two classes), the entropy is 1.

The Entropy Formula

To calculate entropy ($H$) for a subset with two classes: $$H = -(P_+) \log_2(P_+) - (P_-) \log_2(P_-)$$

Where:

  • $P_+$ is the probability of the positive class.
  • $P_-$ is the probability of the negative class.
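
To make the formula concrete, here is a minimal Python sketch of binary entropy (the `entropy` helper and the example probabilities are illustrative, not from the post):

```python
from math import log2

def entropy(probs):
    """Entropy of a subset, given its class probabilities."""
    # 0 * log2(0) is treated as 0, so zero-probability classes are skipped.
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0]))   # 0.0 -> perfectly pure subset
print(entropy([0.5, 0.5]))   # 1.0 -> evenly split, maximally impure
```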

Example: Teacher’s Presence

Recalling our “Good Day” example from Part 5, let’s look at the “Teacher Present” split:

Teacher Absent: 5 Good (G), 1 Bad (B) $$H(\text{absent}) = -(5/6) \log_2(5/6) - (1/6) \log_2(1/6) \approx 0.65$$

Teacher Present: 2 Good (G), 1 Bad (B) $$H(\text{present}) = -(2/3) \log_2(2/3) - (1/3) \log_2(1/3) \approx 0.92$$

Since 0.65 is closer to 0 than 0.92, the “Absent” group is “purer” than the “Present” group.
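
As a quick sanity check on these two values, here is the same calculation written out in Python (a throwaway sketch, not code from the original post):

```python
from math import log2

# Teacher Absent: 5 Good, 1 Bad out of 6 rows
h_absent = -(5/6) * log2(5/6) - (1/6) * log2(1/6)
# Teacher Present: 2 Good, 1 Bad out of 3 rows
h_present = -(2/3) * log2(2/3) - (1/3) * log2(1/3)

print(round(h_absent, 2), round(h_present, 2))  # 0.65 0.92
```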


What is Information Gain? 🎗

Information Gain measures the reduction in entropy after a dataset is split on an attribute. We want to split on the attribute that gives us the highest Information Gain.

The Formula:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

Basically: (Entropy before split) - (Weighted average of Entropy after split).

If we calculate the gain for “Teacher Presence” in our 9-row dataset:

  • Total Entropy $H(S) \approx 0.76$
  • $Gain = 0.76 - (6/9 \cdot 0.65) - (3/9 \cdot 0.92) \approx 0.02$

A gain of 0.02 is quite low, suggesting that teacher presence isn’t the most important factor in predicting a good day.
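
Putting the pieces together, a small Python sketch of the gain calculation for this split might look like the following (the `entropy` helper is illustrative, and the class counts — 7 Good and 2 Bad overall — are inferred from the two subsets above):

```python
from math import log2

def entropy(counts):
    """Entropy of a subset, given its raw class counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

h_total   = entropy([7, 2])   # full dataset: 7 Good, 2 Bad -> ~0.76
h_absent  = entropy([5, 1])   # Teacher Absent, 6 rows      -> ~0.65
h_present = entropy([2, 1])   # Teacher Present, 3 rows     -> ~0.92

# Gain = entropy before split - weighted average entropy after split
gain = h_total - (6/9) * h_absent - (3/9) * h_present
print(round(gain, 2))  # ~0.02
```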

Exercise

Calculate the Information Gain for “Parent Mood” from the dataset in Part 5. You’ll find it is much higher!

In the next part, we will see how combining many decision trees creates a powerful model called a Random Forest.
