TF-IDF Explained: Boosting Signal in Text Vectorization


nlp

In our Bag of Words guide, we saw how to turn text into numbers by counting word frequencies. But there is a major flaw in that approach:

Not all words are created equal.

In a document about “Quantum Physics,” the word “the” might appear 100 times, while “quantum” appears only 5 times. A simple count makes “the” seem 20x more important than “quantum,” which is obviously wrong.

TF-IDF (Term Frequency - Inverse Document Frequency) is the industry-standard solution to this problem. It downweights words that appear everywhere (like “the”, “is”, “and”) and upweights words that are rare and unique to the current document.


The Intuition: Balancing Frequency and Rarity

TF-IDF is a score composed of two parts multiplied together:

1. TF (Term Frequency)

  • Question: How often does the word appear in this document?
  • Logic: The more it appears, the more important it likely is for this specific text.
  • Formula: $TF = \frac{\text{Count of word in doc}}{\text{Total words in doc}}$

2. IDF (Inverse Document Frequency)

  • Question: How rare is this word across all documents?
  • Logic: If a word appears in every single document (like “the”), it has zero information value. If it appears in only one document, it is highly discriminative.
  • Formula: $IDF = \log(\frac{\text{Total Documents}}{\text{Documents containing word}})$

The Result: $$TF\text{-}IDF = TF \times IDF$$

A high score means the word is frequent in this document but rare in others. This is the “signature” of the document.


Python Implementation: Scikit-Learn TfidfVectorizer

While the math is good to know, you should never implement this manually in production. Scikit-learn’s TfidfVectorizer is highly optimized.

Code Example: Finding Keywords

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# 1. The Corpus
corpus = [
    "The car is fast.",
    "The car is red.",
    "The car is broken."
]

# 2. Initialize
vectorizer = TfidfVectorizer()

# 3. Fit and Transform
tfidf_matrix = vectorizer.fit_transform(corpus)

# 4. View Results
df = pd.DataFrame(
    tfidf_matrix.toarray(), 
    columns=vectorizer.get_feature_names_out()
)

print(df)

Output Analysis: You will notice that the word “the” (if not removed as a stop word) will have a score of 0 (or very low) because it appears in all documents. The words “fast”, “red”, and “broken” will have the highest scores in their respective rows because they are the unique differentiators.


Practical Application: Keyword Extraction

One of the best uses of TF-IDF is extracting keywords from a large text automatically.

def extract_keywords(text, vectorizer, top_n=3):
    # Transform the single document
    vector = vectorizer.transform([text])

    # Get mapping from index to word
    feature_names = vectorizer.get_feature_names_out()

    # Sort and get top N indices
    sorted_indices = vector.toarray()[0].argsort()[::-1]

    return [feature_names[i] for i in sorted_indices[:top_n]]

new_doc = "The red car is very fast"
print(extract_keywords(new_doc, vectorizer))
# Likely Output: ['fast', 'red', 'car']

Limitations and Trade-offs

1. No Semantic Context

Like Bag of Words, TF-IDF effectively treats the document as a “bag.” It doesn’t know that “good” and “excellent” are similar. It just knows they are different strings.

2. Sparsity

You still end up with massive sparse matrices. If you have 100,000 documents and 50,000 unique words, your memory usage can explode.

3. Out of Vocabulary

If your model encounters a word it didn’t see during training (the fit step), it ignores it completely.


Summary

Use TF-IDF when: * You are building a search engine (to rank results). * You need to extract keywords or tags. * You are doing document clustering or classification on distinct topics.

It is the robust, “baseline” choice for most NLP tasks before you need the heavy artillery of Word Embeddings, which we will cover in the next guide on Word2Vec.

Written by

Parth Shukla

Ml wizard

Suggested Posts

Bag of Words (BoW) in NLP: A Deep Dive into Text Vectorization

If you are trying to build a spam filter, a sentiment analyzer, or any machine learning model that p...

Read article

Word2Vec: Capturing Meaning in Vector Space

Bag of Words and TF-IDF are great, but they suffer from a fatal flaw: they have no concept of meanin...

Read article

Chatterbot Google Colab Train English Corpus

Here’s how to solve the file not found for training chatterbot on Google collab: # Install !pip inst...

Read article
Free Flask Course