IETF tackles AI scraping with new standards

The Internet Engineering Task Force has chartered a group it hopes will create a standard that lets content creators tell AI developers whether it’s OK to use their work.

Named the AI Preferences Working Group (AIPREF), the group has been asked to develop two things: a common vocabulary for authors and publishers to express their preferences about how their content is used for AI training and related tasks; and a means of attaching that vocabulary to content on the internet, either by embedding it in the content itself or via formats similar to robots.txt, together with a standard mechanism for reconciling multiple, possibly conflicting, expressions of preferences.
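
To make the reconciliation idea concrete, the Python sketch below shows one way a crawler might resolve conflicting signals. The preference name, the signal sources and the precedence order are illustrative assumptions only; AIPREF has not yet defined any of them.

    # Illustrative sketch only: the AIPREF vocabulary and precedence rules are still
    # being drafted, so the preference name, sources and ordering are assumptions.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Preference:
        source: str                # e.g. "metadata", "http-header", "site-default"
        train_ai: Optional[bool]   # None means no preference was expressed

    # Hypothetical precedence: a signal attached closer to the content wins.
    PRECEDENCE = ["metadata", "http-header", "site-default"]

    def reconcile(signals: list) -> Optional[bool]:
        """Return the effective training preference, or None if none was expressed."""
        for source in PRECEDENCE:
            for pref in signals:
                if pref.source == source and pref.train_ai is not None:
                    return pref.train_ai
        return None

    # A site-wide default allows training, but metadata embedded in the page opts out.
    signals = [Preference("site-default", True), Preference("metadata", False)]
    print(reconcile(signals))   # False: the more specific signal prevails in this sketch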

The AIPREF charter suggests ‘attaching preferences to content either by including preferences in content metadata or by signalling preferences using the protocol that delivers content’ as the main paths forward. Co-chair Mark Nottingham believes these new tools are needed because current systems are failing. In particular, the non-standard signals placed in robots.txt files – the decades-old standard meant to tell crawlers whether they may access certain web pages – are no longer functioning effectively.

The urgency for an update is clear. Several AI companies are increasingly bypassing the robots.txt protocol, a voluntary web standard that, for years, helped define boundaries between website owners and search engine crawlers. Traditional search engines like Google have generally respected robots.txt. However, many AI scrapers, used for training large language models, are now ignoring these directives and pulling content without permission. This widespread non-compliance has sparked disputes between AI firms and publishers, leading to accusations of unauthorised use and even plagiarism.

The limitations of robots.txt are at the heart of the problem. Compliance with the protocol is entirely voluntary. There is no legal enforcement mechanism, meaning that when bots ignore it, website owners have little recourse beyond costly and often ineffective methods like IP blocking. Moreover, robots.txt lacks granularity. It operates at the domain level and does not allow content owners to specify nuanced preferences, such as permitting their content for search indexing but prohibiting its use in AI model training.
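
Both limitations show up in the sketch below, which uses Python's standard-library robots.txt parser: the check runs entirely in the crawler's own code, and the only axes it understands are a crawler name and a URL path, with no way to express purpose. The bot name "ExampleAIBot" is hypothetical.

    # Minimal sketch of why compliance is voluntary: the robots.txt check happens
    # in the crawler's own code, and nothing stops a crawler that never asks.
    import urllib.robotparser

    rules = [
        "User-agent: ExampleAIBot",   # a hypothetical AI crawler is barred entirely
        "Disallow: /",
        "User-agent: *",              # everyone else may crawl everything
        "Allow: /",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    # A well-behaved crawler asks before fetching; a non-compliant one simply skips this.
    print(parser.can_fetch("ExampleAIBot", "https://example.com/article"))   # False
    print(parser.can_fetch("AnySearchBot", "https://example.com/article"))   # True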

Recognising these shortcomings, the Internet Engineering Task Force has taken steps to modernise the system through the work of AIPREF. The group is drafting new standards aimed at improving how content creators can communicate their preferences to AI systems. The proposed updates include enhancing robots.txt syntax to better distinguish AI-specific crawlers and activities, including separating tasks such as training and inference.
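
Until that syntax is finalised, some crawler operators already approximate the separation by publishing distinct tokens for different purposes (OpenAI's GPTBot and Google's Google-Extended are real-world examples), which lets site owners allow search indexing while refusing AI training. The Python sketch below follows that pattern with hypothetical tokens, routing each declared purpose through its own robots.txt check.

    # Hedged sketch: the AIPREF syntax is not final. One stop-gap is a distinct
    # crawler token per activity, consulted under the token matching the current
    # purpose. Both tokens below are hypothetical.
    import urllib.robotparser

    rules = [
        "User-agent: ExampleBot-Training",   # no crawling for model training
        "Disallow: /",
        "User-agent: ExampleBot-Search",     # search indexing is still welcome
        "Allow: /",
    ]

    parser = urllib.robotparser.RobotFileParser()
    parser.parse(rules)

    PURPOSE_TOKENS = {"training": "ExampleBot-Training", "search": "ExampleBot-Search"}

    def may_crawl(purpose: str, url: str) -> bool:
        """Consult robots.txt under the token that matches the declared purpose."""
        return parser.can_fetch(PURPOSE_TOKENS[purpose], url)

    print(may_crawl("training", "https://example.com/story"))   # False
    print(may_crawl("search", "https://example.com/story"))     # True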

New signalling mechanisms are also under discussion. Options include embedding preferences directly in content metadata, using HTTP headers, or introducing a new complementary file – tentatively called ai.txt – that would offer more detailed controls. Through these methods, publishers could express preferences about snippet length, attribution requirements, or explicit bans on content being used for AI training.
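
To illustrate the header-based option, the Python sketch below checks a per-response preference before a document is used for training. The header name "X-AI-Use" and its values are hypothetical placeholders; neither a real header name nor a value syntax has been standardised yet.

    # Illustrative only: "X-AI-Use" and its values stand in for whatever header
    # name and vocabulary AIPREF eventually adopts.
    def training_allowed(headers: dict) -> bool:
        """Decide whether a fetched document may be used for model training,
        based on a hypothetical per-response preference header."""
        value = headers.get("X-AI-Use", "").lower()
        if "no-train" in value:
            return False
        # Absent or unrecognised signals fall back to a conservative default here;
        # what the real default should be is one of the open questions.
        return "train" in value

    print(training_allowed({"X-AI-Use": "index, no-train"}))   # False
    print(training_allowed({"X-AI-Use": "train"}))             # True
    print(training_allowed({}))                                # False in this sketch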

The focus of these technical solutions is clear communication. However, just as with the original robots.txt, compliance would still be voluntary unless backed by legal or regulatory measures. While the AIPREF initiative significantly improves the technical capacity for signalling preferences, actual protection will depend on whether AI companies choose to honour these signals – or whether they are eventually required to do so by law.

A quick comparison of the current system and the proposed improvements underscores the evolution underway:

Feature                  | robots.txt (Current)      | Proposed IETF/AIPREF Updates
Compliance               | Voluntary                 | Voluntary (for now)
Granularity              | By domain/crawler         | By company, bot, and purpose
AI-specific Controls     | No                        | Yes (training vs. inference, etc.)
Enforcement              | None (technical only)     | None (policy/legal may follow)
Communication Mechanism  | Plain text at site root   | Metadata, HTTP headers, ai.txt, etc.

While the IETF’s work marks a meaningful step toward giving content owners more control, the path forward remains uncertain. Without enforceable regulations, the battle between content creators and AI companies continues across technical, legal, and policy fronts.
