OECD publishes policy paper on mapping data collection mechanisms for AI training
Approved by the OECD Digital Policy Committee in July 2025, the report examines the full range of mechanisms used to gather data for machine learning – from user-provided information and voluntary data donations to commercial data licensing, open datasets, and large-scale web scraping.

The Organisation for Economic Co-operation and Development (OECD) has released a new policy paper titled ‘Mapping Relevant Data Collection Mechanisms for AI Training’, offering one of the most detailed analyses to date of how data used in AI systems is sourced, shared, and governed.
The report highlights the implications of each method for privacy, data protection, and intellectual property, stressing that how data is collected can be as consequential as how algorithms are built.
The OECD situates these findings in a rapidly changing global policy environment. Recent laws, including the EU’s Artificial Intelligence Act, Korea’s AI Trust Act, and Japan’s AI Promotion Act, place data governance at the centre of responsible AI development. The paper also notes increasing legal disputes over the use of scraped online data to train large language models, as well as growing calls for transparency on the origins of training datasets.
A central feature of the publication is a taxonomy categorising data collection into two main sources – direct collection from individuals and organisations, and data obtained from third parties. Within these, it identifies distinct practices such as commercial data licensing, open data initiatives, and voluntary data donations. The OECD emphasises that while open and shared datasets are essential for innovation, they must be balanced with privacy safeguards and accountability measures.
The report further points to privacy-enhancing technologies and synthetic data as tools that could reduce risks in future data collection processes.
Why does it matter?
For civil society, this paper is a must-read. It clarifies how the data feeding AI systems is gathered and under what terms – an issue that directly affects privacy, consent, and fair access to information. Understanding these underlying data practices is essential to ensuring that future regulations protect individuals and communities, not just technology developers.