Block AI Crawlers with Robots.txt

· Hugo

As AI companies continue to scrape content from the open web, I wanted to take small steps to protect my own content against them. Since the mid-1990s, a simple robots.txt file in the root directory of a website has communicated to bots how they should, or shouldn’t, crawl its pages. The file has no legal or technical authority¹, and relying on it means trusting bots to respect rules with no mechanism to enforce them, but I decided it can’t hurt to try. And who knows, it may help prevent at least some crawlers from shamelessly scraping content. Let’s find out!
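For anyone who hasn’t looked at the format before, a robots.txt rule is just a user agent paired with the paths it’s asked to stay out of. A minimal example (using OpenAI’s GPTBot purely for illustration) looks like this:

User-agent: GPTBot
Disallow: /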



📝 Create a robots.txt file with Hugo

Hugo can generate a robots.txt file just like any other template. As a first step, I sought out a list of AI crawlers to block in that file. I came across the ai.robots.txt project which seemed like a good starting point. I simply copied the contents of their robots.txt file into a new local file in the /layouts/ directory of my local Hugo installation:

curl https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt -o layouts/robots.txt

I edited the newly created robots.txt to allow other bots access to the site with the following:

User-agent: *
Disallow:

Then I built the site locally with Hugo and checked that the new robots.txt file showed up at the site root, /robots.txt. Once I confirmed it worked as expected, I committed my changes and pushed to production. But then I got to thinking…
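For reference, the check itself is just a build plus a quick look at the generated file (this assumes Hugo’s default publish directory of public/):

hugo                    # build the site; output lands in public/ by default
cat public/robots.txt   # confirm the generated file contains the expected rules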


🔄 Automatically update robots.txt

The remote robots.txt file appears to be updated regularly as new crawlers are added. Instead of having to remember to check that list and manually add new entries to my local robots.txt, I decided to take things a step further and integrate the update into Hugo’s build process.

Create Hugo template file

The first step is to create a new template file in Hugo. This file will also live in /layouts/ and we can call it index.robots.txt. In that file, we can use Hugo’s resources.GetRemote to snag the list of crawlers from the ai.robots.txt GitHub repo and assign it to a variable. Then we can extract its content and pipe it through the safeHTML function so Hugo outputs it verbatim rather than escaping it. And finally we can output the fetched content in the file itself.

I included a couple other things such as a sitemap and allow rules for other bots to crawl the site. Putting it all together, my index.robots.txt looks something like this:

{{- $remoteRobots := resources.GetRemote "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt" -}}
{{- $robotsContent := $remoteRobots.Content | safeHTML -}}

Sitemap: {{ .Site.BaseURL }}sitemap.xml

User-agent: *
Disallow:

# Block AI Crawlers
# Ref: https://github.com/ai-robots-txt/ai.robots.txt/

{{ $robotsContent }}
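One caveat: as written, the template assumes the GitHub fetch always succeeds. If I wanted the build to complain when it doesn’t, the fetch portion could be wrapped in the error-handling pattern from Hugo’s resources.GetRemote documentation, roughly like this (a sketch, not what I’m currently running):

{{ $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt" }}
{{ with resources.GetRemote $url }}
  {{ with .Err }}
    {{/* fail the build if the fetch errored; swap errorf for warnf to only log it */}}
    {{ errorf "Unable to fetch %s: %s" $url . }}
  {{ else }}
{{ .Content | safeHTML }}
  {{ end }}
{{ else }}
  {{ errorf "Unable to fetch %s" $url }}
{{ end }}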

Modify Hugo config

With the new template file created, we just need to adjust Hugo’s configuration file to handle the rest. I’m using config.toml, so I opened it up and added the following:

[outputs]
home = [
  "HTML",
  "ROBOTS"
] # Specify two types of files to be output

[outputFormats]

[outputFormats.ROBOTS] # Handle the output of ROBOTS
mediaType = "text/plain" # Set MIME type to plaintext
baseName = "robots" # Set the base filename
isPlainText = true # Ensure Hugo treats as plaintext
notAlternative = true # Output is not an alternative main content
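If your site uses a YAML config instead of config.toml, the equivalent settings should look roughly like the following (untested on my end, shown for completeness):

outputs:
  home:
    - HTML
    - ROBOTS    # output both the HTML home page and robots.txt

outputFormats:
  ROBOTS:
    mediaType: text/plain
    baseName: robots
    isPlainText: true
    notAlternative: true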

Take it for a test drive

We can spin up a local development server to see how it works with hugo server² and take a look at /robots.txt. Sure enough, I see the following:

Sitemap: /sitemap.xml

User-agent: *
Disallow:

# Block AI Crawlers
# Ref: https://github.com/ai-robots-txt/ai.robots.txt/

User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: omgili
User-agent: omgilibot
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /

From here, we can push changes to production. Now each time Hugo builds and deploys, robots.txt will be updated with the latest version of the ai.robots.txt file.
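Once it’s live, a quick spot check against production (swap in your own domain for the placeholder) confirms the deployed file is the freshly generated one:

curl -s https://example.com/robots.txt | head -n 20   # example.com is a placeholder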


📚 Further reading

I don’t trust that AI crawlers will respect robots.txt, but it’s worth a shot. If you wanted to take this further, you could block crawlers at the server level. Here are some links I’ve found that may be helpful in pursuing that route.
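As a rough illustration of the server-level approach, a block like this in an nginx server configuration would refuse matching user agents outright (a sketch I haven’t deployed myself, with a deliberately abbreviated agent list that would need maintaining):

# return 403 to requests whose User-Agent matches any listed AI crawler
if ($http_user_agent ~* "(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot)") {
    return 403;
}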

As a next step, I may look into setting up the Dark Visitors Analytics Agent to see what sort of impact this does (or doesn’t) have on crawlers.

Footnotes

  1. “For three decades, a tiny text file has kept the internet from chaos. This text file has no particular legal or technical authority, and it’s not even particularly complicated. It represents a handshake deal between some of the earliest pioneers of the internet to respect each other’s wishes and build the internet in a way that benefitted everybody. It’s a mini constitution for the internet, written in code.” From The text file that runs the internet.

  2. Having first manually created robots.txt, I had an older version stuck in the cache. I included --ignoreCache to ignore the cache directory.