AI training bots from OpenAI, Anthropic, Amazon, and a dozen other firms are hitting production web servers with the same aggression as a DDoS attack, and robots.txt isn't stopping them. This guide walks through how InMotion's systems team uses ModSecurity to enforce per-bot rate limiting at the server level, without cutting off your site's…
The Problem: AI Bots That Don't Follow the Rules
robots.txt has been the de facto agreement between websites and web crawlers for decades. A directive like Crawl-delay: 10 tells compliant bots to wait 10 seconds between requests. Google gives you a way to configure crawl rate through Google Search Console. Traditional search crawlers have operated within these boundaries long enough that most sysadmins never had to think much about them.

LLM training crawlers are a different story.

Starting in 2024, InMotion's systems administration teams began seeing a pattern of unusually heavy traffic across shared and dedicated infrastructure. The source wasn't a single bot running wild. It was multiple bots, each operated by a different AI company, simultaneously crawling the same servers with no delay between requests and no respect for Crawl-delay directives. None of them coordinated with each other. None of them needed to. The combined load of GPTBot, ClaudeBot, Amazonbot, and their peers hitting the same server at once produces resource exhaustion that looks functionally identical to an unintentional distributed denial-of-service attack.

That surprises a lot of website owners who assume robots.txt is binding. It isn't. It's a convention, and these bots aren't observing it.
Two Options, One Clear Tradeoff
The blunt instrument is a full block via .htaccess. You can deny access by User-Agent and the bots stop hitting your server entirely. Problem solved, except it isn't: your site also disappears from AI-driven discovery systems. For businesses that want to appear in AI-generated answers or LLM-powered search features, blocking training crawlers entirely carries a real long-term cost.
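For reference, a full .htaccess block along those lines typically looks like the following. This is a minimal sketch using mod_rewrite; the bot list shown is illustrative, not exhaustive:

```apache
# .htaccess — deny AI training crawlers outright by User-Agent.
# Every matching request gets a 403 Forbidden; the bots stop crawling,
# but your content also drops out of AI-driven discovery.
<IfModule mod_rewrite.c>
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|Amazonbot) [NC]
  RewriteRule .* - [F,L]
</IfModule>
```

The `[NC]` flag makes the match case-insensitive, and `[F,L]` returns 403 and stops further rule processing.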
Rate limiting is the better path. You slow the bots down to a pace your server can absorb. They still index your content. You still retain visibility. And when a bot refuses to respect the rate limit you've set, you block that specific request rather than the bot permanently.
How ModSecurity Rate Limiting Works
ModSecurity is an open-source Web Application Firewall that operates inside Apache or Nginx, inspecting HTTP traffic in real time. It's the same tool that blocks SQL injection attempts and cross-site scripting attacks on properly hardened servers. What makes it useful here is its ability to track request frequency by User-Agent and deny requests that exceed a defined threshold.
The approach works in two steps:
- Identify the incoming request by User-Agent string and increment a per-host counter.
- If that counter exceeds the allowed limit before it expires, deny the request with a 429 Too Many Requests response and set a Retry-After header.
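The counter-and-expiry mechanics behind those two steps can be sketched in plain Python. This is an illustration of the logic only, not ModSecurity's implementation; the function and variable names here are hypothetical:

```python
import time

WINDOW = 3   # seconds before the counter expires (expirevar)
LIMIT = 1    # allowed requests per window

counters = {}  # (user_agent, host) -> (count, expiry_timestamp)

def allow_request(user_agent, host, now=None):
    """Return (status, retry_after): 200 when allowed, 429 when rate limited."""
    now = time.monotonic() if now is None else now
    key = (user_agent, host)
    count, expiry = counters.get(key, (0, 0.0))
    if now >= expiry:        # window elapsed: reset the counter
        count, expiry = 0, now + WINDOW
    count += 1               # setvar: increment the per-host counter
    counters[key] = (count, expiry)
    if count > LIMIT:        # @gt 1: deny with 429 and a Retry-After hint
        return 429, max(0.0, expiry - now)
    return 200, None
```

Two requests from the same bot against the same host inside the 3-second window get 200 then 429, which matches the behavior the rules below enforce.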
That Retry-After header matters. It explicitly tells the bot how long to wait before its next request. A well-behaved crawler will honor it. One that doesn't gets blocked on its next attempt.
The ModSecurity Rules
Below are the rate-limiting rules InMotion Hosting's systems team developed and currently deploys. Each rule set targets a specific bot by User-Agent and enforces a maximum of one request per 3 seconds per hostname.
GPTBot (OpenAI)
```apache
# Limit GPTBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot" \
  "id:13075,phase:2,nolog,pass,setuid:%{request_headers.host},\
  setvar:user.ratelimit_gptbot=+1,expirevar:user.ratelimit_gptbot=3"
SecRule USER:RATELIMIT_GPTBOT "@gt 1" \
  "chain,id:13076,phase:2,deny,status:429,setenv:RATELIMITED_GPTBOT,\
  log,msg:'RATELIMITED GPTBOT'"
  SecRule REQUEST_HEADERS:User-Agent "@pm GPTBot"
Header always set Retry-After "3" env=RATELIMITED_GPTBOT
ErrorDocument 429 "Too Many Requests"
```
ClaudeBot (Anthropic)
```apache
# Limit ClaudeBot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot" \
  "id:13077,phase:2,nolog,pass,setuid:%{request_headers.host},\
  setvar:user.ratelimit_claudebot=+1,expirevar:user.ratelimit_claudebot=3"
SecRule USER:RATELIMIT_CLAUDEBOT "@gt 1" \
  "chain,id:13078,phase:2,deny,status:429,setenv:RATELIMITED_CLAUDEBOT,\
  log,msg:'RATELIMITED CLAUDEBOT'"
  SecRule REQUEST_HEADERS:User-Agent "@pm ClaudeBot"
Header always set Retry-After "3" env=RATELIMITED_CLAUDEBOT
ErrorDocument 429 "Too Many Requests"
```
Amazonbot
```apache
# Limit Amazonbot hits by user agent to one hit per 3 seconds
SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot" \
  "id:13079,phase:2,nolog,pass,setuid:%{request_headers.host},\
  setvar:user.ratelimit_amazonbot=+1,expirevar:user.ratelimit_amazonbot=3"
SecRule USER:RATELIMIT_AMAZONBOT "@gt 1" \
  "chain,id:13080,phase:2,deny,status:429,setenv:RATELIMITED_AMAZONBOT,\
  log,msg:'RATELIMITED AMAZONBOT'"
  SecRule REQUEST_HEADERS:User-Agent "@pm Amazonbot"
Header always set Retry-After "3" env=RATELIMITED_AMAZONBOT
ErrorDocument 429 "Too Many Requests"
```
Adapting the Rules for Other Bots
The structure is the same for every bot. To add coverage for a new crawler, copy any rule set and make two changes:
- Replace the User-Agent string (e.g., GPTBot) with the new bot's identifier.
- Assign unique id values and unique env variable names to avoid conflicts with existing rules.
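For example, here is the same template adapted for PerplexityBot. The rule ids 13081 and 13082 are placeholders chosen for this illustration; check your own configuration for collisions before using them:

```apache
# Limit PerplexityBot hits by user agent to one hit per 3 seconds
# (ids 13081/13082 are assumptions — confirm they are unused first)
SecRule REQUEST_HEADERS:User-Agent "@pm PerplexityBot" \
  "id:13081,phase:2,nolog,pass,setuid:%{request_headers.host},\
  setvar:user.ratelimit_perplexitybot=+1,expirevar:user.ratelimit_perplexitybot=3"
SecRule USER:RATELIMIT_PERPLEXITYBOT "@gt 1" \
  "chain,id:13082,phase:2,deny,status:429,setenv:RATELIMITED_PERPLEXITYBOT,\
  log,msg:'RATELIMITED PERPLEXITYBOT'"
  SecRule REQUEST_HEADERS:User-Agent "@pm PerplexityBot"
Header always set Retry-After "3" env=RATELIMITED_PERPLEXITYBOT
ErrorDocument 429 "Too Many Requests"
```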
The id field must be unique across your entire ModSecurity configuration. If you're adding these to an existing ruleset, check which IDs are already in use before assigning new ones. Collisions cause rules to fail silently.

For reference, a growing list of known AI crawler User-Agent strings includes Bytespider, CCBot, Google-Extended, Meta-ExternalAgent, and PerplexityBot, among others. The Dark Visitors project maintains a fairly current catalogue of known AI agent identifiers.
What Happens After You Deploy
Once these rules are active, a bot that makes two requests to the same hostname within a 3-second window receives a 429 on the second request. The Retry-After: 3 header tells it to wait before trying again.
From there, behavior splits into two categories:

Bots that respect the header slow down automatically. They continue indexing your content at a pace your server can handle. Resources are conserved, and your site stays accessible to the crawlers worth caring about.

Bots that ignore the header keep hitting the deny rule on every subsequent request until their internal retry logic kicks in or they move on. Either way, they're consuming a fraction of the resources they would have without rate limiting in place.

You won't fix the underlying problem of AI companies deploying aggressive crawlers without consent. But you stop absorbing the cost of their indexing operations on your hardware.
Prerequisites and Where to Apply These Rules
These rules require ModSecurity to be installed and enabled on your server. On InMotion Hosting Dedicated Servers and VPS plans, ModSecurity is available through cPanel's WHM interface under Security Center > ModSecurity. The rules can be added as custom rules through WHM or directly in your server's ModSecurity configuration directory.

If you're on a managed dedicated server, InMotion Hosting's Advanced Product Support team can assist with custom ModSecurity rule deployment. Customers with Premier Care have access to InMotion Solutions for exactly this kind of custom server configuration work.

Shared hosting environments don't support custom ModSecurity rules at the account level. If aggressive bot traffic is a problem on shared hosting, the options are limited to .htaccess blocks or upgrading to a VPS or dedicated server where you have full WAF configurability.
A Note on robots.txt
None of this replaces a well-structured robots.txt file. Keeping crawl-delay directives in place for compliant bots remains worthwhile, and explicitly listing AI crawlers you want to restrict adds a documented signal of intent, even if some bots ignore it. The ModSecurity rules handle enforcement for the ones that won't self-regulate.

robots.txt for bots that respect conventions; ModSecurity rate limiting for the ones that don't. The two layers work together.
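A minimal robots.txt reflecting that intent might look like the following. The delay value and bot list are illustrative, and support for the non-standard Crawl-delay directive varies from crawler to crawler:

```
# Ask compliant crawlers to pace themselves
User-agent: GPTBot
Crawl-delay: 10

User-agent: ClaudeBot
Crawl-delay: 10

# Explicitly opt out of crawlers you don't want at all
User-agent: Bytespider
Disallow: /
```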
Summary
AI training crawlers don't follow robots.txt the way traditional search bots do, and the combined load from multiple simultaneous indexing operations can degrade server performance for legitimate traffic. ModSecurity's User-Agent-based rate limiting gives you server-side control over how frequently these bots can request resources, without requiring you to block them from indexing your site entirely.

The rules are straightforward to deploy, extend to any bot by copying the template, and provide explicit signaling via Retry-After headers for crawlers that are capable of honoring them.

If you're seeing unexplained spikes in server load or HTTP request volume that don't correlate with real user traffic, check your access logs for AI crawler User-Agents before assuming you're dealing with something more complex.
