ADR-012 · Accepted

Deliberate AI Crawler Access Policy over Default Blocking

Context

The default posture for most websites toward AI crawlers is either passive (no explicit policy, relying on robots.txt default Allow) or hostile (blocking GPTBot, CCBot, and other AI-specific user agents to prevent training data extraction). Both approaches represent a failure to make a deliberate architectural decision. The passive approach leaves discoverability to chance — an agent may or may not find the content depending on its crawl strategy. The hostile approach treats all AI consumption as adversarial, which is a defensible position for content businesses but counterproductive for a personal portfolio whose entire purpose is to be found, read, and referenced. The portfolio's AX strategy (ADR-006) establishes that AI agents are a primary audience. The robots.txt and HTTP header configuration must operationalize that principle with explicit, granular access control — not defaults.

Decision

Implement an explicit, per-agent robots.txt policy that individually whitelists 17 known AI crawlers and user agents, each with identical access rules. The whitelisted agents include: GPTBot, ChatGPT-User, Claude-Web, Anthropic-AI, Google-Extended, Perplexity-User, PerplexityBot, Cohere-AI, Meta-ExternalAgent, Bytespider, CCBot, Applebot-Extended, Gemini, OAI-SearchBot, Amazonbot, YouBot, and DeepSeek-AI. Each agent receives an explicit Allow: / with targeted Disallow directives for /api/ (preventing direct API crawling that bypasses CORS) and /_next/ (preventing framework asset indexing). The AI discovery files (llms.txt, llms-full.txt, .well-known/*) are served with Access-Control-Allow-Origin: * and 24-hour cache headers via next.config.js. The /api/ai-profile endpoint uses stale-while-revalidate caching to serve AI agents without hitting the backend on every request. A comment block in robots.txt explicitly invites AI agents and points them to the llms.txt discovery files.
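For illustration, one per-agent block from the policy looks like the sketch below (excerpt only; the wording of the invitation comment and the choice of agents shown are illustrative, and the same Allow/Disallow rules repeat for all 17 listed agents):

    # AI agents are welcome. Machine-readable profiles: /llms.txt, /llms-full.txt
    User-agent: GPTBot
    Allow: /
    Disallow: /api/
    Disallow: /_next/

    User-agent: PerplexityBot
    Allow: /
    Disallow: /api/
    Disallow: /_next/

    # ...same block repeated for the remaining whitelisted agents...

    # Default rule for everything else
    User-agent: *
    Allow: /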

Consequences

Positive: Every major AI system currently in production — OpenAI, Anthropic, Google, Meta, Perplexity, Cohere, ByteDance, Amazon, Apple, and DeepSeek — has an explicit access grant. This eliminates ambiguity: the site's position is not 'we haven't decided' but 'we actively want AI agents to consume this content.' The per-agent listing serves as documentation of which AI systems were considered and when the policy was last reviewed. The CORS wildcard on AI files means agents can fetch llms.txt and ai-plugin.json from any origin without the browser's same-origin policy blocking the response — a critical detail for browser-based AI tools. The Disallow on /api/ prevents AI crawlers from hammering the backend directly, while the ai-profile endpoint (which is under /api/ but explicitly served with permissive headers) remains available to agents that fetch it on demand for programmatic access rather than by crawling.

Negative: The per-agent listing creates maintenance overhead — every new AI crawler requires a robots.txt update, deployment, and CDN cache invalidation. If a new major AI system launches with an unlisted user agent, it falls through to the default Allow: / rule and can still crawl, but without explicit acknowledgment. The wildcard CORS on static AI files means any origin can embed them — a non-issue for a public portfolio but a pattern that should not be cargo-culted to authenticated applications. The 17-agent listing will grow as the AI ecosystem fragments; collapsing it into a single catch-all block (the robots.txt standard does not support wildcards such as *AI* in User-agent names) would reduce maintenance but sacrifices the deliberate, documented nature of per-agent policies.
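A minimal sketch of the header configuration described above, using Next.js's headers() option in next.config.js; the source patterns and the s-maxage freshness window are assumptions, while the wildcard Access-Control-Allow-Origin, the 86400-second cache, and the stale-while-revalidate value come from the decision itself:

    // next.config.js (sketch, not the production file)
    module.exports = {
      async headers() {
        return [
          {
            // AI discovery file: wildcard CORS, 24-hour cache
            // (llms-full.txt and /.well-known/* would get the same treatment)
            source: '/llms.txt',
            headers: [
              { key: 'Access-Control-Allow-Origin', value: '*' },
              { key: 'Cache-Control', value: 'public, max-age=86400' },
            ],
          },
          {
            // ai-profile endpoint: serve a cached copy, revalidate in the background
            // s-maxage here is an assumed freshness window, not taken from the ADR
            source: '/api/ai-profile',
            headers: [
              { key: 'Access-Control-Allow-Origin', value: '*' },
              { key: 'Cache-Control', value: 'public, s-maxage=3600, stale-while-revalidate=86400' },
            ],
          },
        ];
      },
    };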

Calibrated Uncertainty

Predictions at Decision Time

Expected the explicit per-agent whitelisting to signal intentionality to AI systems that check robots.txt before crawling. Predicted the maintenance burden of per-agent entries would be low (estimated 1-2 new agents per quarter). Assumed the 17 agents listed would cover 95%+ of AI crawl traffic. Predicted the comment block inviting AI agents would be parsed by at least some LLM-based crawlers.

Measured Outcomes

Too early for definitive traffic analysis — the policy has been live for less than a week. The 17-agent list appears comprehensive based on current industry landscape research, but the AI crawler ecosystem is fragmenting rapidly. Three new AI-related user agents have been identified since the policy was written (xAI's Grok crawler, Mistral's web agent, and Brave's AI search bot) — suggesting the quarterly update estimate may be optimistic. The comment block in robots.txt is a human-readable signal; whether any AI system parses comments in robots.txt for discovery hints is unknown.

Unknowns at Decision Time

The fundamental unknown: how many AI systems actually check robots.txt before crawling. Empirical research suggests compliance varies significantly — some agents (GPTBot, for example) publicly document their robots.txt compliance, while others (various Chinese AI crawlers) may ignore it entirely. Also unknown: whether per-agent listings provide any functional benefit over the default Allow: / for compliant crawlers. The value may be entirely in the signal (intentionality) rather than the mechanism (access control). Another unknown: whether the /api/ Disallow is respected by AI agents that discover the OpenAPI spec — the spec explicitly documents the API endpoints, creating a tension between 'you can query these' (OpenAPI) and 'don't crawl these' (robots.txt).

Reversibility Classification

Two-Way Door

Modifying the robots.txt policy is a single-file edit with a deploy. Removing all per-agent entries and relying on the default Allow: / reduces the file from 114 lines to ~15 lines. Adding new agents is a copy-paste of an existing block. Switching to a hostile posture (Disallow: / for all AI agents) is equally trivial. The CORS headers in next.config.js can be removed independently. Estimated effort for any policy change: 15 minutes plus CDN cache invalidation.
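The hostile-posture reversal mentioned above is equally small; a sketch of that variant, with agent names taken from the current whitelist:

    # Reversed posture: block AI crawlers, keep the default open
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    # ...repeated for each listed AI agent...

    User-agent: *
    Allow: /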

Strongest Counter-Argument

The per-agent listing is security theater for a public portfolio — the default Allow: / already grants access to all crawlers, making the individual entries functionally redundant. A simpler robots.txt with just the default rules and the Disallow directives would achieve the same functional result with 80% less file size and zero maintenance overhead. The deliberate signal value exists only if someone reads the robots.txt manually, which is an unlikely scenario for AI systems. The counter-counter: the robots.txt is not just a crawl configuration file — it's a public document that communicates the site's posture toward AI consumption. The deliberate, per-agent listing is documentation of an architectural decision, not just access control.

Technical Context

Stack: robots.txt, next.config.js, HTTP Headers, CORS, Cache-Control
AI Agents Whitelisted: 17
Cache TTL: 86400s (AI discovery files)
Stale-While-Revalidate: 86400s (ai-profile endpoint)
robots.txt Size: 114 lines
Constraints
  • New AI agents require manual robots.txt update
  • CORS wildcard only on AI-specific files
  • Disallow /api/ to prevent direct backend crawling

Related Decisions