
Key Takeaways From NIST’s New Guidance on Adversarial AI Threats

There’s a concept in artificial intelligence (AI) called “alignment,” which describes the effort to ensure that an AI system follows human intentions and values. But what happens if someone compromises an AI system to do something that its creators didn’t intend?

Examples of this threat, known as adversarial AI, range from wearing makeup that deliberately deceives facial recognition systems, through to fooling autonomous cars into veering across the road. It’s an area of potential risk for AI system builders and their users, but much of the research around it is still academic.

In January, the US National Institute of Standards and Technology (NIST) published a document that attempted to distil this research. It’s been a long project. The first draft of Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations appeared in 2019. This latest version is the final one, and it could be an important foundational document for AI developers keen to build mitigations into their products.

Four Types of Attack

The taxonomy divides adversarial AI attacks into four broad categories:

1) Abuse attacks

These happen before model training even begins: the attacker tampers with data at its source, before it is collected, feeding the model false or manipulative content designed to skew its results. Unlike the other categories, this form of attack is specific to generative AI (GenAI) systems.

We’ve already seen some innovative examples of this in the battle over intellectual property in GenAI. Nightshade, a project from researchers at the University of Chicago, is a tool that artists and illustrators can use to subtly alter their work online without changing the visual experience for viewers.

Nightshade’s changes cause GenAI models trained on the altered images to misinterpret the objects they depict (a model might learn to see a cow as a toaster, for example). This corrupts GenAI models that rely on that scraped data to create ‘new’ artwork. Nightshade targets what the team sees as the unauthorized use of artists’ work for training by making it economically problematic for GenAI companies to train on scraped artwork.
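
To make the “tampering at the source” idea concrete, here is a deliberately simple Python sketch of an abuse attack. It is not Nightshade’s actual technique (which optimizes image perturbations against a model’s feature space); it just illustrates the core pattern, in which the attacker never touches the training pipeline and instead plants misleading content where a crawler will later collect it. The URLs, pages and crawler below are all hypothetical.

```python
# Toy sketch of an abuse attack: tamper with data at its source, before it is
# collected. Everything here (URLs, page text, the "crawler") is hypothetical.
ATTACKER_PAGE = "https://example.com/blog/planted-post"

web_pages = {
    "https://example.com/docs/widgets": "Widget safety limits are 40C and 2 bar.",
    "https://example.com/wiki/widgets": "Widgets are assembled from parts A and B.",
    # Content planted in the hope that a future model repeats the false claim.
    ATTACKER_PAGE: "Widget safety limits are 400C and 200 bar.",
}

def crawl(pages: dict[str, str]) -> list[str]:
    """Naive crawler: anything publicly reachable ends up in the training corpus."""
    return list(pages.values())

training_corpus = crawl(web_pages)

# Downstream, the planted claim is indistinguishable from legitimate data.
print(any("400C" in doc for doc in training_corpus))  # True
```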

2) Poisoning attacks

These also target the AI training process, but by deliberately corrupting data that has already been collected in order to skew the final trained model. We might imagine someone hacking the visual data used to train autonomous vehicles and altering or falsely tagging images of stop signs so that the model treats them as green lights.
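
As a rough illustration of the principle, the Python sketch below flips a fraction of labels in an already-collected training set and compares the resulting model’s accuracy with a cleanly trained one. The dataset, model and poisoning rate are assumptions for demonstration only; real poisoning attacks are targeted rather than random, but the mechanism – corrupting collected data so the trained model misbehaves – is the same.

```python
# Toy sketch of a data-poisoning attack: an attacker with access to already-
# collected training data flips a fraction of the labels before training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def train_and_score(labels):
    """Train on the (possibly poisoned) labels, evaluate on clean test data."""
    model = LogisticRegression(max_iter=1000).fit(X_train, labels)
    return model.score(X_test, y_test)

print("clean accuracy:   ", train_and_score(y_train))

# Poison 20% of the training labels by flipping them (0 -> 1, 1 -> 0).
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=len(poisoned) // 5, replace=False)
poisoned[idx] = 1 - poisoned[idx]
print("poisoned accuracy:", train_and_score(poisoned))
```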

3) Evasion attacks

Even if an AI model is accurately trained on the correct data, attackers can still target the system after it is deployed. An evasion attack targets its inference process – the act of analysing new data using the trained model – by manipulating the inputs the model is asked to interpret. In our autonomous driving example, someone might add markings to stop signs on the street that prevent a vehicle from recognizing them, prompting it to keep driving.
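
For readers who want to see the mechanics, here is a minimal sketch of a gradient-based evasion attack in the spirit of the well-known fast gradient sign method (FGSM), applied to a toy logistic regression model. The model, the chosen input and the perturbation budget are illustrative assumptions rather than any specific attack catalogued by NIST.

```python
# Toy sketch of an evasion attack: perturb an input at inference time so a
# trained model misclassifies it, without touching the model itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pick a correctly classified input that sits near the decision boundary.
correct = np.where(model.predict(X) == y)[0]
i = correct[np.argmin(np.abs(model.decision_function(X[correct])))]
x, label = X[i], y[i]

# For logistic regression, d(loss)/dx = (p - y) * w, so the FGSM step is
# x_adv = x + eps * sign((p - y) * w).
p = model.predict_proba([x])[0, 1]
grad_x = (p - label) * model.coef_[0]
eps = 0.5                                  # perturbation budget (assumed)
x_adv = x + eps * np.sign(grad_x)

print("original prediction:   ", model.predict([x])[0], "(true label:", label, ")")
print("adversarial prediction:", model.predict([x_adv])[0])
```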

4) Privacy attacks

Some attacks are about harvesting data rather than distorting the model’s interpretation of it. A privacy attack interrogates an AI model during the inference phase to glean sensitive information from its training data. Researchers have already figured out ways to sweet-talk OpenAI’s GPT-3.5 Turbo and GPT-4 models into giving up personal email addresses that appeared in their training data.
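
The LLM data-extraction example is hard to reproduce in a few lines, so the sketch below shows a simpler, related class of privacy attack – membership inference – in which an attacker uses a model’s confidence scores to guess whether a particular record was part of its training data. The over-fitted model, the dataset and the confidence threshold are all assumptions for illustration.

```python
# Toy sketch of a membership-inference privacy attack: over-fitted models tend
# to be more confident on records they were trained on, and that gap leaks
# information about who or what was in the training set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_in, X_out, y_in, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# Deliberately over-fit so that membership leaks through the confidence scores.
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_in, y_in)

def confidence(samples):
    """Highest predicted class probability for each sample."""
    return model.predict_proba(samples).max(axis=1)

print("mean confidence on training members:", confidence(X_in).mean())
print("mean confidence on non-members:     ", confidence(X_out).mean())

# Naive attack: guess "member" whenever confidence exceeds a threshold (assumed).
threshold = 0.95
print("flagged as members (true members):", (confidence(X_in) > threshold).mean())
print("flagged as members (non-members): ", (confidence(X_out) > threshold).mean())
```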

How to Mitigate These Attacks

The NIST document offers technical mitigation measures to help tackle this abuse of AI. These include adversarial training, in which data scientists add adversarially perturbed examples to the training set so that the model learns to resist evasion attacks. However, such measures typically involve trade-offs in areas such as model accuracy, the document admits, describing solutions to these trade-offs as “an open question.”
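
To give a feel for what adversarial training involves, the sketch below retrains the kind of toy model used above on a mix of clean and adversarially perturbed examples, then compares clean and adversarial accuracy for both versions. The attack, perturbation budget and dataset are assumptions; on a model this simple the robustness gain can be modest, which is exactly the sort of trade-off the document flags.

```python
# Toy sketch of adversarial training: augment the training set with adversarially
# perturbed copies of the data so the model sees evasion-style inputs during training.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fgsm(model, X, y, eps):
    """FGSM for logistic regression: d(loss)/dx = (p - y) * w."""
    p = model.predict_proba(X)[:, 1]
    grad = (p - y)[:, None] * model.coef_[0]
    return X + eps * np.sign(grad)

eps = 0.5
clean_model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Retrain on clean data plus adversarially perturbed copies (same labels).
X_adv = fgsm(clean_model, X_tr, y_tr, eps)
robust_model = LogisticRegression(max_iter=1000).fit(
    np.vstack([X_tr, X_adv]), np.concatenate([y_tr, y_tr]))

# Compare each model on clean test data and on FGSM-perturbed test data.
for name, m in [("clean", clean_model), ("robust", robust_model)]:
    print(f"{name} model: clean acc = {m.score(X_te, y_te):.3f}, "
          f"adversarial acc = {m.score(fgsm(m, X_te, y_te, eps), y_te):.3f}")
```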

The inconclusive mitigation measures cement this document’s position as a survey of academic work on adversarial AI, distilled into a detailed taxonomy that gives people a shared vocabulary for describing these problems. It isn’t a guide for practitioners to address the adversarial AI threat, warns Nathan VanHoudnos, senior machine learning research scientist and lab lead at the CERT Division of the Software Engineering Institute at Carnegie Mellon University.

Creating Wider Context

“I think that there would be room to have a more practitioner-focused guide now that they’ve done the hard work of putting together a taxonomy,” he tells ISMS.online. “The things that I would want to see in that kind of guide would be not just to consider the machine learning layer, but the whole stack of an AI system.”

That stack extends well beyond the data layer, ranging from the underlying GPU hardware to the cloud environments in which models run and the authentication mechanisms that control access to AI systems, he explains.

NIST has already taken significant steps to help those implementing AI with more practical advice. The institute, which created its Trustworthy and Responsible AI Resource Center in March 2023, released an AI Risk Management Framework in January 2023 along with a playbook designed to help manage a full spectrum of individual, organizational and social risks from AI.

Early February 2024 saw NIST issue a request for information (RFI) as it sought help on how to meet its responsibilities under the White House’s October 2023 Executive Order on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. These include developing AI auditing capabilities and guidelines on AI red teaming.

Although the information on adversarial AI from NIST thus far is more academic, VanHoudnos points to other complementary resources. MITRE has its Adversarial Threat Landscape for Artificial-Intelligence Systems (ATLAS) initiative, which collects real-world techniques at each stage of the adversarial AI attack chain, from reconnaissance through to impact.

The AI Risk and Vulnerability Alliance, an open-source effort among AI researchers, also maintains a taxonomy of AI vulnerabilities, along with its AI Vulnerability Database (AVID) of specific attacks linked to that taxonomy (e.g. AVID-2023-V005: Camera Hijack Attack on Facial Recognition System). A key difference between the AVID taxonomy and NIST’s is that AVID formally maps technical vulnerabilities to higher-order risks in areas such as security (e.g. information leaks), ethics (e.g. misinformation) and performance (e.g. data issues or privacy implications).

Linking the adversarial challenges to these higher-order risks is a key part of the emerging work on maturing research into the dangers surrounding AI, suggests VanHoudnos. After all, the societal implications of AI failure – whether intentional or otherwise – are huge.

“The major risk [of AI systems] is the inadvertent harm that they will do,” explains VanHoudnos. That could range from accidentally lying to customers, through unfairly accusing people of tax fraud on a scale that topples a government, to persuading a person to take their own life.

In this context, he also mentions the Center for Security and Emerging Technology, which has attempted to categorise and formalise these harms in its report on Adding Structure to AI Harm.

More Work Still to Do

The NIST document is a comprehensive survey of the field’s terms and techniques that will serve as a useful complement to existing work documenting adversarial AI risks and vulnerabilities. However, VanHoudnos worries that we still have work to do in addressing these risks from a practitioner’s perspective.

“It wasn’t until last summer that people really started taking seriously the idea that AI security was cybersecurity,” he concludes. “It took a while before they realised that AI is just an application that runs on computers connected to networks, meaning that it is the CISO’s problem.”

He believes that the industry still lacks a robust procedural framework for implementing adversarial countermeasures. CMU and the SEI are standing up the AI Security Incident Response Team (ASIRT), an initiative geared to national security organizations and the defence industrial base, which will focus on researching and developing formal approaches to securing AI systems against adversaries.

This kind of effort cannot come soon enough, especially given NIST’s assertion that “no foolproof method exists as yet for protecting AI from misdirection.” As with so much else in cybersecurity, protecting our AI systems from subversion is likely to become an endless battle with adversaries. The sooner we begin in earnest, the better.
