Sitemaps, robots.txt and agent access

robots.txt and the sitemap are among the oldest machine-readable conventions on the web, and they still decide whether a compliant agent is allowed in and what it can find. The rules are advisory, so they bind only agents that choose to honor them. A well-behaved agent reads robots.txt to learn the rules and the sitemap to learn the map before it reads any page.

robots.txt does two jobs for agents. It sets crawl rules, and it can name AI crawlers explicitly, so a site states whether it welcomes GPTBot and similar clients rather than leaving them to guess. A Content-Signal directive can go further and declare how content may be used, separating ordinary search from AI input and training, which gives a site granular control instead of an all-or-nothing block.
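As a rough illustration of how an agent might read these two kinds of signal, the sketch below parses a robots.txt with Python's standard urllib.robotparser and collects any Content-Signal lines separately. The sample file, the example.com URLs, the /private/ path, the signal values, and the helper names load_rules and read_content_signals are assumptions made for the example, not turva.dev's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration; paths and signal values are assumptions.
SAMPLE_ROBOTS = """\
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /

User-agent: GPTBot
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def read_content_signals(robots_txt):
    """Collect key=value pairs from Content-Signal lines.

    Simplification: signals are merged across the whole file, ignoring
    which User-agent group they appear in.
    """
    signals = {}
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "content-signal":
            continue
        for pair in value.split(","):
            key, eq, setting = pair.partition("=")
            if eq:
                signals[key.strip().lower()] = setting.strip().lower()
    return signals


rules = load_rules(SAMPLE_ROBOTS)
print(rules.can_fetch("GPTBot", "https://example.com/private/report"))
print(rules.can_fetch("GPTBot", "https://example.com/docs/"))
print(read_content_signals(SAMPLE_ROBOTS))
```

The standard-library parser recognizes only a fixed set of directives and skips Content-Signal lines, which is why the sketch reads them in a separate pass.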

The sitemap answers the other question, which is what exists. A complete sitemap lists every canonical URL with a last-modified date, so an agent can find the real pages without inferring them from navigation. A page that is not in the sitemap is a page an agent may never reach.
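To make the sitemap's role concrete, here is a minimal sketch of how an agent or an auditor might read one and spot pages that are missing from it. The sample XML, the example.com URLs, the dates, and the helper names read_sitemap and missing_from_sitemap are hypothetical; only the sitemap namespace comes from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap for illustration; URLs and dates are assumptions.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/docs/</loc>
    <lastmod>2024-04-12</lastmod>
  </url>
</urlset>
"""


def read_sitemap(xml_text):
    """Return a dict mapping each listed URL to its lastmod string (or None)."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries[loc] = lastmod.strip() if lastmod else None
    return entries


def missing_from_sitemap(known_urls, entries):
    """Return the known canonical URLs that the sitemap does not list."""
    return sorted(u for u in known_urls if u not in entries)


entries = read_sitemap(SAMPLE_SITEMAP)
known = ["https://example.com/", "https://example.com/docs/", "https://example.com/pricing/"]
print(entries)
print(missing_from_sitemap(known, entries))
```

The comparison assumes the site already has some list of its canonical URLs, for example from its router or CMS; the sitemap alone cannot reveal what it omits.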

Getting these wrong is quietly expensive. A robots.txt that blocks an AI crawler by accident can drop a site from that assistant's answers, and nothing on the site itself signals the loss. A stale sitemap hides new pages. The files are small and the fix is fast, which is why they are the first thing a readiness review checks.
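A readiness check for accidental blocks could look something like the sketch below, which tests a list of crawler names against the paths a site wants reachable. It is an assumption about how such a check might work, not a description of any particular review tool. The sample robots.txt, the example.com paths, and the helper name blocked_agents are hypothetical, and the crawler list is illustrative rather than complete.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a broad rule blocks one AI crawler entirely.
AUDIT_ROBOTS = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

# Illustrative crawler names and pages; adjust to the agents and URLs that matter.
AI_AGENTS = ["GPTBot", "ClaudeBot"]
KEY_URLS = ["https://example.com/", "https://example.com/docs/"]


def blocked_agents(robots_txt, agents, urls):
    """Map each agent to the URLs it may not fetch; fully allowed agents are omitted."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    report = {}
    for agent in agents:
        denied = [u for u in urls if not parser.can_fetch(agent, u)]
        if denied:
            report[agent] = denied
    return report


report = blocked_agents(AUDIT_ROBOTS, AI_AGENTS, KEY_URLS)
print(report)
```

An empty report means none of the listed agents is shut out of the listed pages. It says nothing about crawlers that were not on the list.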

turva.dev declares AI bot rules and Content Signals in robots.txt and keeps a complete sitemap. For an audit of a site's crawl and access surface, contact info@turva.dev.