On any given day, a publisher may receive thousands of visits from Google, Bing, social media platforms and direct readers. Increasingly, however, another type of visitor is appearing in server logs: AI crawlers.
These automated systems, operated by artificial intelligence companies, scan websites to collect information used to train models, build search indexes and generate answers within AI assistants.
Unlike traditional search engines, many AI platforms consume large volumes of content while returning relatively little traffic to the original source.
That shift has sparked one of the most significant debates about the future of the open web.
What Are AI Crawlers?
AI crawlers are automated bots that visit websites and collect publicly available content.
Their purposes include:
- Training large language models
- Updating retrieval systems
- Building AI search indexes
- Improving AI-generated responses
Examples include crawlers operated by OpenAI, Anthropic, Google, Meta and other AI developers.
Like traditional search engine bots, AI crawlers request pages from websites and process the information they discover. The key difference lies in how that information is ultimately used.
How AI Crawlers Differ From Traditional Search Engines
For decades, publishers accepted a relatively straightforward arrangement.
Search engines indexed content and, in return, directed visitors back to the originating website. While the system was far from perfect, it created a clear economic incentive to publish information online.
AI systems change that equation.
Rather than displaying a list of links, many AI tools generate complete answers directly within their interface. As a result, users can often obtain the information they need without visiting the source website.
This process is frequently referred to as answer extraction.
Why Publishers Are Concerned
Three issues dominate the discussion surrounding AI crawlers.
High Resource Consumption
AI crawlers can place substantial demands on web servers, particularly for smaller publishers with limited infrastructure.
Large-scale crawling operations may request thousands of pages within a short period, increasing hosting costs and server strain.
Reduced Referral Traffic
When AI systems answer questions directly, fewer users click through to source websites.
For publishers that rely on advertising, subscriptions or affiliate income, declining referral traffic can have a direct impact on revenue.
An Unclear Value Exchange
Many publishers argue that AI companies derive significant value from their content while providing little compensation in return.
This concern has contributed to licensing negotiations, crawler restrictions and legal disputes across multiple jurisdictions.
Which AI Companies Operate Crawlers?
Several major AI companies operate their own web-crawling systems.
These organisations include developers of:
- General-purpose language models
- AI-powered search engines
- Retrieval systems
- Research datasets
Most publish some information about crawler behaviour, identification methods and opt-out mechanisms, although the level of detail varies.
Website owners can often restrict access using robots.txt files, although implementation varies between providers.
Why Some Publishers Are Blocking AI Bots
A growing number of publishers have restricted access to AI crawlers.
Their reasons generally fall into three categories:
- Protecting intellectual property
- Preserving advertising revenue
- Strengthening their position in licensing negotiations
Some organisations have adopted a selective approach, allowing traditional search engines while blocking AI-specific crawlers.
Others have entered commercial licensing agreements with AI companies.
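The selective approach described above can be sketched with Python's standard urllib.robotparser, which lets a policy be checked before it is deployed. This is a minimal sketch, not a definitive configuration: the tokens GPTBot and ClaudeBot are examples of AI crawler identifiers, example.com is a placeholder, and the current token names and each provider's interpretation of the rules should be verified against that provider's documentation.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy: block two AI-specific crawlers, allow everyone else.
# The tokens below are examples; confirm current names with each provider.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "GPTBot", "ClaudeBot"):
        print(agent, is_allowed(agent, "https://example.com/articles/1"))
```

Under this policy the check passes for Googlebot and fails for the two AI tokens. It only describes what the file asks for; whether a given crawler honours it is a separate question.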
What Happens If Everyone Blocks AI Crawlers?
The long-term consequences remain uncertain, but several outcomes are possible.
Scenario 1: Licensing Becomes the Norm
AI companies pay publishers for access to content, creating a new commercial framework for information distribution.
Scenario 2: More Content Moves Behind Paywalls
Publishers place valuable content behind registration systems, subscriptions or paywalls to limit unrestricted access.
Scenario 3: Reduced Diversity Across the Web
If publishing becomes less economically sustainable, fewer independent websites may survive.
This could reduce the diversity of information available online and increase reliance on a smaller number of large publishers.
The Emerging AI Web Economy
The internet was built on a relatively simple principle: create content, attract visitors and monetise attention.
AI-generated answers challenge that model by separating information consumption from website visits.
The central question is no longer whether AI will change search.
It already has.
The more pressing question is whether the next generation of information systems can continue to support the publishers, researchers and organisations that create the knowledge they depend upon.
The answer may determine what the open web looks like over the next decade.
The Hidden Costs of AI Scrapers
Your hosting bill keeps creeping higher. Pages feel slower than they should. Real users occasionally wait a little too long for content to load.
Then you look at the logs.
The traffic isn’t coming from customers.
It’s bots.
More specifically, AI scrapers and aggressive crawlers are traversing your application end to end, generating database queries, bypassing caches, and quietly driving up infrastructure costs.
This has become a common complaint. Many teams don’t notice it until the invoice arrives.
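One way to see the pattern before the invoice does is to count requests per user agent in the access log. The sketch below assumes a combined-format log (request, status, bytes, referer, user agent in that order) and uses a hypothetical list of substrings to label bot traffic; both are assumptions to adapt to your own server.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the tail of a combined-format access log line:
#   "REQUEST" STATUS BYTES "REFERER" "USER-AGENT"
# The log format is an assumption; adjust the pattern to your server's.
LOG_TAIL = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')

# Hypothetical substrings used to label a request as bot traffic.
BOT_MARKERS = ("bot", "crawler", "spider")


def count_by_agent(lines: Iterable[str]) -> Counter:
    """Count requests per user-agent string, skipping unparseable lines."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_TAIL.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


def bot_share(counts: Counter) -> float:
    """Fraction of counted requests whose user agent contains a bot marker."""
    total = sum(counts.values())
    bots = sum(
        n for agent, n in counts.items()
        if any(marker in agent.lower() for marker in BOT_MARKERS)
    )
    return bots / total if total else 0.0
```

Calling count_by_agent on an open log file and printing counts.most_common(10) is usually enough to show whether a handful of crawlers dominate the traffic. User-agent strings can be spoofed, so treat the result as an estimate.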
When bots find an infinite hallway
One pattern appears again and again: pages with no natural endpoint.
Calendars are the classic example. A bot lands on a page, sees a “previous month” link, follows it, then finds another “previous month” link and follows that one too. The process repeats indefinitely.
One customer discovered their calendar links extended all the way back to the 1700s. The bots dutifully crawled every page.
The specific application doesn’t matter. Any navigation structure without a lower bound creates the same problem: archives, paginated listings, date hierarchies, category trees, and other forms of endless navigation.
The crawler isn’t misbehaving. The site never told it where to stop.
Faceted navigation creates an explosion of URLs
The second pattern is even more expensive.
Imagine a product catalogue with filters for brand, colour, memory size, screen size, price range, and availability. Every filter adds another parameter to the URL.
A crawler that follows every available link eventually explores every possible combination of those filters.
A handful of facets can generate millions of unique URLs. Each one reaches the application server. Each one may trigger database queries. Most bypass caching because every URL appears different.
No individual request is problematic.
Millions of them are.
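The arithmetic behind that explosion is a simple product. The sketch below uses hypothetical value counts for each facet (the numbers are illustrative, not taken from any real catalogue) and assumes at most one selected value per facet.

```python
from math import prod

# Hypothetical number of selectable values per facet (illustrative only).
FACET_VALUES = {
    "brand": 40,
    "colour": 12,
    "memory": 6,
    "screen": 8,
    "price": 10,
    "availability": 2,
}


def count_filter_urls(facet_values: dict[str, int]) -> int:
    """Count distinct filter combinations, one value (or none) per facet.

    Each facet contributes (values + 1) choices: any single value, or unset.
    Parameter order and multi-select filters are ignored, so a real site
    may expose more URLs than this.
    """
    return prod(n + 1 for n in facet_values.values())


print(count_filter_urls(FACET_VALUES))  # 1108107
```

Six modest facets already yield more than a million URLs, and that is before counting the same filters in a different parameter order.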
The hidden infrastructure tax
Bot traffic consumes the same resources as your customers. The same application servers. The same databases. The same caches.
As utilisation rises, you either add capacity or allow performance to degrade. Either way, there is a cost.
The frustrating part is that the bill rarely identifies the cause. Cloud providers charge for compute hours, database utilisation, storage, and bandwidth. They do not itemise “rendered the same page 80,000 times for crawlers.”
Instead, costs rise gradually and appear to be a natural consequence of growth.
Auto-scaling often masks the issue. The platform adds capacity, traffic continues flowing, dashboards remain green, and the invoice quietly gets larger.
The bots never notice.
Your users do.
The simplest fix: robots.txt
Most of the crawlers creating this load are well-behaved. They read robots.txt and generally respect what it says.
For faceted navigation, a simple rule can eliminate a significant amount of unnecessary crawling:
User-agent: *
Disallow: /*?*=*
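This asks crawlers to avoid any URL whose query string contains a key=value parameter. The * wildcard is part of the robots.txt standard (RFC 9309) and is honoured by the major search crawlers, but support among smaller bots varies.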
If the problem exists only within a specific section of the site, narrow the scope:
User-agent: *
Disallow: /products?
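To check which URLs the two rules above would cover, the wildcard matching can be approximated in a few lines. This is a simplified sketch of the matching described in RFC 9309 (prefix match, * as a wildcard, an optional trailing $ anchor); it ignores Allow rules and longest-match precedence, and the function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is literal, and patterns match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """Return True if any Disallow pattern matches the path and query string."""
    return any(
        robots_pattern_to_regex(p).match(path_and_query)
        for p in disallow_patterns
    )


if __name__ == "__main__":
    print(is_blocked("/products?brand=apple", ["/*?*=*"]))  # True
    print(is_blocked("/products/42", ["/products?"]))       # False
```

Running a sample of real URLs from the access log through a check like this is a cheap way to confirm a rule blocks the filter pages without catching the product pages themselves.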
There is a trade-off.
The same rule that prevents AI scrapers from exploring every filter combination also stops search engines from crawling those URLs, so their content cannot be indexed.
Historically that mostly affected Google search visibility. Today it also affects what AI systems such as ChatGPT, Claude, and Perplexity can discover and reference.
That doesn’t mean abandoning robots.txt. It means ensuring important content remains accessible through canonical URLs that aren’t blocked.
Link directly to products from category pages. Publish a sitemap.xml containing the URLs you want indexed. Search engines and AI crawlers will use those paths while respecting the URLs you’ve asked them to avoid.
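A sitemap can be generated from whatever list of canonical URLs the application already knows about. The sketch below builds a minimal sitemap.xml with Python's standard library; the function name and the example.com URLs are hypothetical, and optional fields such as lastmod are left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a sitemap.xml document listing the given canonical URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap([
        "https://example.com/products/1",
        "https://example.com/products/2",
    ]))
```

Only unfiltered, canonical URLs belong in the list; including filtered variants would reintroduce the problem the robots.txt rule was meant to solve.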
Robots.txt is not a security control. It’s a request.
For well-behaved crawlers, that’s usually enough.
Changes are not immediate. Most bots refresh robots.txt every day or two, so traffic reductions typically appear within a few days rather than a few minutes.
Improve cache efficiency
Even after crawler traffic declines, some requests will continue reaching pages with query parameters.
A common mistake is allowing cache keys to vary based on parameter order.
These URLs are identical:
/products?brand=apple&color=black
/products?color=black&brand=apple
Yet many caching systems treat them as separate objects.
Normalising URLs before they reach the cache solves the problem. Sort query parameters consistently and strip values that do not affect page content, such as tracking identifiers or session tokens.
Hundreds of cache entries can collapse into one.
Whether the cache lives in a CDN, reverse proxy, or application layer, the principle is the same: eliminate meaningless URL variation.
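A minimal sketch of that normalisation, using only Python's standard library. The set of ignored parameters is hypothetical; which parameters are safe to drop depends on the application, and sorting assumes the order of repeated parameters does not change the page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change page content.
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid"}


def normalise_cache_key(url: str) -> str:
    """Return url with ignored parameters removed and the rest sorted."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    query = urlencode(sorted(params))
    # Drop the fragment: browsers never send it to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(normalise_cache_key("/products?color=black&brand=apple&utm_source=news"))
    # /products?brand=apple&color=black
```

Both example URLs from this section map to the same key, so the cache stores one object instead of two.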
Fix the application, not just the crawler
The most durable solution is often inside the application itself.
The calendar that links endlessly into the 1700s is not fundamentally a robots.txt problem. It is a design problem.
Nobody is looking for events from three centuries ago. Once meaningful content ends, stop rendering the link, return a 404, or redirect users to the earliest available month.
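The lower bound can be enforced where the link is rendered. In the sketch below, earliest_event stands for the date of the oldest event in the calendar and the /calendar/YYYY-MM URL scheme is hypothetical; both are assumptions for illustration.

```python
from datetime import date
from typing import Optional


def previous_month(month: date) -> date:
    """Return the first day of the month before the given one."""
    if month.month == 1:
        return date(month.year - 1, 12, 1)
    return date(month.year, month.month - 1, 1)


def previous_month_link(current: date, earliest_event: date) -> Optional[str]:
    """Return the 'previous month' URL, or None once content runs out.

    earliest_event is the date of the oldest event in the calendar; the
    /calendar/YYYY-MM URL scheme is hypothetical.
    """
    target = previous_month(current)
    earliest_month = earliest_event.replace(day=1)
    if target < earliest_month:
        return None  # no older content: do not render the link
    return f"/calendar/{target:%Y-%m}"
```

A template that skips the link when the function returns None gives both users and crawlers a natural endpoint. The same comparison can drive a 404 or a redirect for requests that arrive below the bound anyway.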
Faceted navigation deserves the same treatment.
If a filter combination returns zero products, there is little value in generating a permanent URL for it. Display the empty result and avoid linking to further empty combinations.
Small changes like these remove the infinite hallways that crawlers would otherwise explore forever.
Adapting to a new reality
This is not a story about villains.
The bots are doing what they were designed to do. Websites, meanwhile, were built in an era when the dominant crawler was Googlebot and crawl budgets were rarely a concern.
Both sides are adapting.
The encouraging part is that most solutions are inexpensive and largely under your control.
A robots.txt file that reflects the structure of your site. Meta directives such as noindex and nofollow where appropriate. Sensible limits on navigation. A cache that distinguishes meaningful variation from noise.
The same crawlers capable of overwhelming an application are often the ones most willing to follow instructions.
Give them clear boundaries, and much of the pressure disappears on its own.
Frequently Asked Questions
Do AI crawlers send traffic back to websites?
Some do, particularly AI search products that include source links. However, referral volumes are generally much lower than those generated by traditional search engines.
Can website owners block AI crawlers?
In many cases, yes. Most major AI companies provide crawler identifiers and robots.txt instructions that allow website owners to manage access.
Are AI crawlers legal?
The legality of AI crawling and model training remains the subject of ongoing legal and regulatory debate in multiple jurisdictions.
Why does this matter?
The outcome of this debate affects publishers, researchers, businesses, AI companies and anyone who relies on an open internet for information.
