How to prevent OpenAI and similar crawlers from scraping your blog


security

If you want to prevent OpenAI or similar crawlers from harvesting your data, create a file called robots.txt at the root of your domain.

If your domain is health.com, make sure the file can be accessed at health.com/robots.txt.
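If your blog is served by a Flask app rather than a static file host, a minimal sketch along these lines (assuming robots.txt sits next to the application code; the route below is only illustrative) exposes the file at that path:

from flask import Flask, send_from_directory

app = Flask(__name__)

@app.route("/robots.txt")
def robots():
    # Serve the robots.txt stored alongside the application code
    return send_from_directory(app.root_path, "robots.txt")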

Here is the snippet I use (you can check this site's robots.txt file here):

# Disallow OpenAI and related bots
User-agent: OpenAI
Disallow: /

User-agent: GPTBot
Disallow: /

# Disallow Google AI crawlers
User-agent: Bard
Disallow: /

# Disallow Anthropic AI
User-agent: Claude
Disallow: /

# Disallow Microsoft AI crawlers
User-agent: BingAI
Disallow: /

# Disallow CommonCrawl (often used by AI models for datasets)
User-agent: CCBot
Disallow: /

# Disallow Neeva AI (deprecated but still included for completeness)
User-agent: NeevaBot
Disallow: /

# Disallow Baidu AI
User-agent: Baiduspider
Disallow: /

# Disallow Yandex AI
User-agent: YandexBot
Disallow: /

# Disallow other common web crawlers used for AI data collection
User-agent: DuckDuckGo-Bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: SemrushBot
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: PetalBot
Disallow: /
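Once the file is live, you can confirm the rules are actually being picked up with Python's built-in urllib.robotparser; a quick sketch, with health.com standing in for your own domain, looks like this:

from urllib.robotparser import RobotFileParser

# Point the parser at the live robots.txt (health.com is a placeholder domain)
rp = RobotFileParser()
rp.set_url("https://health.com/robots.txt")
rp.read()

# GPTBot and CCBot both appear in the snippet above, so both checks should print False
print(rp.can_fetch("GPTBot", "https://health.com/"))
print(rp.can_fetch("CCBot", "https://health.com/some-article"))

If either check prints True, the file is most likely not reachable at the root of the domain.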

Written by

Abdur-Rahmaan Janhangeer

Chef

Python author of 7+ years who has worked for Python companies around the world
