Global intelligence and cybersecurity agencies issue joint AI data security framework

The Cybersecurity Information Sheet (CSI) provides guidance on securing data used in artificial intelligence (AI) and machine learning (ML) systems. It highlights the role of data security in ensuring the accuracy and integrity of AI outcomes and outlines the risks that data integrity issues pose at each stage of AI development and deployment.

Leading cybersecurity and intelligence agencies—including the National Security Agency (NSA), the Cybersecurity and Infrastructure Security Agency (CISA), the Federal Bureau of Investigation (FBI), and national cyber defense organizations from Australia, New Zealand, and the United Kingdom—have jointly issued new, detailed guidelines on how to secure the data that fuels AI systems. As AI becomes more embedded in national infrastructure, defence systems, healthcare, and commercial platforms, these agencies underscore a critical reality: the reliability and safety of AI systems are fundamentally tied to the integrity of their data.

The document, titled AI Data Security: Best Practices for Securing Data Used to Train & Operate AI Systems, stresses that securing the data lifecycle is not a peripheral concern—it is core to building trustworthy and resilient AI. Malicious actors today are not only targeting AI algorithms or model outputs, but also exploiting weak points in the data pipelines that feed these systems. Corrupted, biased, or poorly sourced data can lead to systemic model failures, inaccurate outcomes, or adversarial behaviours that are difficult to detect and costly to fix.

The publication maps out key stages in the AI system lifecycle, based on the NIST AI Risk Management Framework: from the early stages of planning and design to data collection, model building, deployment, and operational monitoring. For each stage, the guidance identifies relevant risks and outlines actionable steps to mitigate them. For example, in the planning phase, organisations are urged to adopt privacy-by-design principles and conduct robust threat modelling. During data collection, the focus shifts to encryption, validation, secure transfer, and access controls. In the deployment and monitoring phases, techniques such as zero-trust architecture, anomaly detection, and continuous risk assessments become vital.
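
By way of illustration, the sketch below shows what two of the collection-stage controls might look like in practice: a checksum is recorded for later validation, and the data is encrypted at rest. The `cryptography` package, the key handling, and the file contents are illustrative assumptions, not prescriptions from the guidance.

```python
import hashlib
from cryptography.fernet import Fernet  # assumption: third-party 'cryptography' package

def ingest_dataset(raw_bytes: bytes, key: bytes) -> tuple[bytes, str]:
    """Record an integrity checksum, then encrypt the data at rest."""
    checksum = hashlib.sha256(raw_bytes).hexdigest()   # validation reference
    ciphertext = Fernet(key).encrypt(raw_bytes)        # encryption at rest
    return ciphertext, checksum

key = Fernet.generate_key()   # in practice the key would live in a managed key vault
ciphertext, checksum = ingest_dataset(b"example training records", key)

# Later, before training: decrypt and confirm the data has not been altered.
recovered = Fernet(key).decrypt(ciphertext)
assert hashlib.sha256(recovered).hexdigest() == checksum
```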

One of the central themes of the document is the risk associated with the AI data supply chain. AI models often rely on large-scale datasets that are either collected from public sources or curated by third parties. These datasets may be subject to intentional manipulation, including ‘data poisoning’ attacks where malicious actors inject misleading or harmful content. In particular, the report warns of split-view poisoning—where attackers take control of expired domains linked to dataset entries and alter the content—and frontrunning poisoning, which targets snapshot-based datasets like Wikipedia by inserting malicious edits shortly before data capture. These attacks are not hypothetical; some can be executed with as little as $60, according to the report, making them accessible even to low-resource threat actors.
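
A minimal sketch of how this class of poisoning can be caught, assuming each entry in a dataset index pins a content digest at publication time; the URL and contents below are hypothetical.

```python
import hashlib

def pin_entry(url: str, content: bytes) -> dict:
    """Record the SHA-256 digest of an item when the dataset index is published.
    If the linked domain later expires and an attacker re-registers it with
    altered content, the digest check on re-download fails."""
    return {"url": url, "sha256": hashlib.sha256(content).hexdigest()}

def verify_entry(entry: dict, downloaded: bytes) -> bool:
    """Compare freshly downloaded content against the pinned digest."""
    return hashlib.sha256(downloaded).hexdigest() == entry["sha256"]

original = b"<genuine labelled image bytes>"
entry = pin_entry("https://example.com/images/cat_001.jpg", original)

assert verify_entry(entry, original)                       # unchanged: accepted
assert not verify_entry(entry, b"<poisoned replacement>")  # altered: rejected
```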

To defend against these threats, the agencies recommend using cryptographic hashes to validate dataset integrity, implementing secure provenance tracking, and requiring content credentials and certifications from data and model providers. In addition, data used to train models should be stored in append-only, cryptographically signed databases, and every modification must be traceable and verifiable.
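
A minimal sketch of an append-only, tamper-evident provenance log is shown below. A hash chain with HMAC authentication stands in for a full digital signature scheme, and the key handling and record fields are illustrative assumptions.

```python
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"replace-with-a-managed-secret"  # assumption: key held in an HSM or vault

def append_record(log: list, dataset_id: str, action: str, actor: str) -> dict:
    """Append a tamper-evident record: each entry chains to the previous hash
    and carries an HMAC tag, so any later modification is detectable."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64
    body = {"dataset_id": dataset_id, "action": action, "actor": actor,
            "timestamp": time.time(), "prev_hash": prev_hash}
    payload = json.dumps(body, sort_keys=True).encode()
    entry = dict(body,
                 entry_hash=hashlib.sha256(payload).hexdigest(),
                 signature=hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest())
    log.append(entry)
    return entry

def verify_log(log: list) -> bool:
    """Recompute every hash and tag; any edit to a past record breaks the chain."""
    prev_hash = "0" * 64
    for entry in log:
        body = {k: entry[k] for k in
                ("dataset_id", "action", "actor", "timestamp", "prev_hash")}
        payload = json.dumps(body, sort_keys=True).encode()
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["entry_hash"] != hashlib.sha256(payload).hexdigest():
            return False
        expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(entry["signature"], expected):
            return False
        prev_hash = entry["entry_hash"]
    return True

log: list = []
append_record(log, "corpus-v1", "ingest", "data-team")
append_record(log, "corpus-v1", "dedupe", "pipeline")
assert verify_log(log)
```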

The guidance also addresses maliciously modified data, which can degrade model performance or expose sensitive information. Adversarial machine learning techniques, model inversion attacks, and insertion of biased or inaccurate content pose serious challenges to data integrity. The report urges organisations to use anomaly detection systems, data sanitisation protocols, and ensemble learning techniques that can improve resilience by cross-validating predictions across multiple models. It also stresses the importance of metadata management, pointing out that missing or falsified metadata can prevent models from understanding the true context of their training data.
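
The sketch below illustrates one way such ensemble cross-validation might work: predictions from several independently trained models are compared, and low agreement is flagged for review rather than silently trusted. The models and threshold are toy assumptions, not taken from the report.

```python
from collections import Counter

def ensemble_predict(models, x, agreement_threshold: float = 0.75):
    """Cross-validate a prediction across several independently trained models.
    Returns the majority label and whether agreement clears the trust threshold."""
    votes = [m(x) for m in models]                  # each model is a callable
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return label, agreement >= agreement_threshold

# Toy stand-ins for independently trained classifiers (illustrative assumptions):
# three agree and one dissents, so the prediction is accepted as trusted.
models = [lambda x: "benign", lambda x: "benign",
          lambda x: "benign", lambda x: "malicious"]
label, trusted = ensemble_predict(models, x={"bytes": 42})
print(label, trusted)   # benign True
```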

Beyond overt manipulation, the document highlights structural issues such as statistical bias, data duplication, and overfitting. These can arise from poorly designed or unrepresentative datasets and may cause AI systems to consistently underperform or favour certain outcomes. To counteract this, the report recommends regular audits of training datasets, creation of repositories to document model biases, and use of diverse, well-segmented training, development, and evaluation data.
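
A brief sketch of two such audit steps, assuming each record is a simple dictionary with `text` and `label` fields (an assumption for illustration): exact-duplicate removal by content hashing, followed by a stratified train/development/evaluation split that preserves per-label proportions.

```python
import hashlib
import random
from collections import defaultdict

def dedupe(records):
    """Drop exact duplicates by hashing each record's content."""
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(rec["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique

def stratified_split(records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split into train/dev/eval while preserving per-label proportions."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    rng = random.Random(seed)
    train, dev, evaluation = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train, n_dev = int(len(group) * ratios[0]), int(len(group) * ratios[1])
        train.extend(group[:n_train])
        dev.extend(group[n_train:n_train + n_dev])
        evaluation.extend(group[n_train + n_dev:])
    return train, dev, evaluation

records = ([{"text": f"benign sample {i}", "label": "benign"} for i in range(40)]
           + [{"text": f"malicious sample {i}", "label": "malicious"} for i in range(10)]
           + [{"text": "benign sample 0", "label": "benign"}])   # exact duplicate
train, dev, evaluation = stratified_split(dedupe(records))
print(len(train), len(dev), len(evaluation))   # 40 5 5
```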

The final section of the report focuses on data drift—gradual changes in input data distributions that occur after a model is deployed. These shifts can significantly impact performance, especially in high-stakes environments like healthcare, cybersecurity, or finance. The agencies recommend continuous monitoring of model outputs, retraining using updated data, and statistical methods for detecting drift. The guidance notes that data drift can be distinguished from poisoning attacks by its gradual nature, and emphasises that both types of degradation require systematic response mechanisms.
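
As an illustration of such statistical checks, drift in a single numeric feature can be flagged with a two-sample Kolmogorov-Smirnov test; the feature values, sample sizes, and significance threshold below are assumptions for the sketch, not figures from the report.

```python
import numpy as np
from scipy.stats import ks_2samp  # assumption: scipy is available

def drift_detected(train_feature: np.ndarray,
                   live_feature: np.ndarray,
                   alpha: float = 0.01) -> bool:
    """Two-sample Kolmogorov-Smirnov test: a small p-value means the live
    input distribution has shifted away from what the model was trained on."""
    result = ks_2samp(train_feature, live_feature)
    return result.pvalue < alpha

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5_000)   # training-time distribution
live = rng.normal(loc=0.6, scale=1.0, size=5_000)    # shifted production inputs
print(drift_detected(train, live))                   # True: retraining warranted
```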

Ultimately, the report serves as both a technical guide and a policy signal. It calls on organisations to treat data as a security-critical asset, not merely as an input to machine learning systems. AI data security, it argues, must be implemented with the same rigour as software supply chain management or network defence. This includes establishing internal policies for secure data handling, adopting international standards such as those from NIST, and working only with vendors and datasets that meet verifiable integrity criteria.
