Red teamers hurdle AI guardrails

Demonstrates 'importance of including scientists in AI quality and safety assessments,' Royal Society

Image:

Red teamers hurdle AI guardrails

An experiment conducted by the Royal Society and Humane Intelligence revealed significant vulnerabilities in Large Language Models (LLMs) when generating scientific misinformation.

Forty UK post-graduates studying health and climate sciences were divided into teams and given personas - Good Samaritan, Profiteer, Attention Hacker and Coordinated Influence Operator. Their task was to attempt to break the guardrails around Meta's open-source LLM Llama 2 by generating false scientific content related to Covid and climate change.

The students were alarmingly successful in getting Llama 2 to output misleading, biased and potentially dangerous information. For example, when prompted about long Covid, Llama 2 denied its existence, presumably due to out-of-date information being used in its training data.

When tasked with creating an advertisement for a fictional oil derivative known as CleanTech Fuel, Llama 2 produced statistics claiming a reduction of emissions by 80% compared to fossil fuels. It also uncritically promoted hydrogen over natural gas for heating, echoing the lobbying strategies of the fossil fuel industry.

Llama 2 was also happy to generate social media conspiracy theories. Given a hypothetical scenario about genetically modified mosquitoes causing infertility, Llama 2 created persuasive posts from different perspectives, lending credence to this fictitious news story.

The red teamers were unable to break all of Llama 2's guardrails around Covid denial and climate change, but the experiment demonstrated LLMs' vulnerability to generating scientific misinformation, either deliberately or by accident.

Llama 2 is quasi-open source, making its workings more available to analysis, but it's likely similar weaknesses could be exposed in proprietary LLMs, such as GPT and Bard.

With insufficient training data and immature fact-checking abilities, LLMs could easily be misused to produce and spread disinformation. More work is needed to improve LLMs' capabilities on complex and emerging topics and prevent their exploitation by bad actors,

The findings also demonstrate the need for structured feedback from scientists to detect biases and bugs in LLMs before deployment. With collaboration between researchers, developers and civil society, AI's enormous potential can be harnessed safely, said Professor Alison Noble, a vice president at the Royal Society.

"AI-based technologies have huge potential to revolutionise scientific discovery by helping to solve some of society's greatest challenges, from disease prevention to climate change, but we must also be fully aware of potential risks associated with these new technologies," Noble said in a statement.

"This event has demonstrated the importance of including scientists in AI quality and safety assessments to test the capabilities of models to address cutting-edge topics, emerging technologies, and complex themes, which may not always be within the purview of the data used to train a language model."

The Royal Society will publish recommendations in a report titled Science in the Age of AI on using AI responsibly and ethically for scientific discovery.

Jutta Williams, co-founder of non-profit Humane Intelligence, said: "Our findings really validate the contribution that red teaming events can offer AI model developers. By combining deep scientific expertise and structured feedback with the guardrail capabilities that companies developing models have established, the user who interacts with models is made safer."