UK Government report reveals AI systems' vulnerability

Systems were highly susceptible to jailbreaks in testing

Image: Shutterstock

The UK government's newly established AI Safety Institute (AISI) has published a report highlighting significant vulnerabilities in large language models (LLMs).

The findings indicate that these AI systems are highly susceptible to basic jailbreaks, with some models generating harmful outputs even without any attempts to bypass their safeguards.

Publicly available LLMs typically incorporate measures designed to prevent the generation of harmful or illegal responses. 'Jailbreaking' refers to the process of tricking the model into ignoring these safety mechanisms. According to the AISI, which used both standardised and in-house developed prompts, the tested models responded to harmful queries even without any jailbreak attempt. When subjected to 'relatively simple attacks', all the models answered between 98 and 100 per cent of harmful questions.

Eliciting harmful information

AISI's evaluation measured the success of these attacks in eliciting harmful information, focusing on two key metrics: compliance and correctness. Compliance indicates whether the model obeys or refuses a harmful request, while correctness assesses the accuracy of the model's responses post-attack.

The study involved two conditions: asking explicitly harmful questions directly ("No attack") and using attacks developed in-house to elicit information the model is trained to withhold ("AISI in-house attack"). These basic attacks either embedded the harmful question in a prompt template or used a simple multi-step procedure to generate prompts. Each model was subjected to a single distinct attack, optimised on a training set of questions and validated on a separate set.
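To make the two conditions concrete, here is a minimal sketch of how such an evaluation harness might be structured. It is not AISI's code: the template text, the function names and the generic `ask` callable are all assumptions of this example, and a real attack template would contain the adversarial wrapper text the report deliberately does not publish.

```python
from typing import Callable, List

# Hypothetical placeholder: AISI does not publish its prompt templates or attack procedure.
ATTACK_TEMPLATE = "[attack wrapper] {question} [attack wrapper]"


def run_condition(ask: Callable[[str], str],
                  questions: List[str],
                  attack: bool = False) -> List[str]:
    """Collect one response per question under the 'No attack' or 'in-house attack' condition.

    `ask` stands in for whatever function sends a prompt to the model under test
    and returns its reply; it is an assumption of this sketch, not a real API.
    """
    responses = []
    for q in questions:
        # "No attack": ask the harmful question directly.
        # "AISI in-house attack": embed the question in a prompt template first.
        prompt = ATTACK_TEMPLATE.format(question=q) if attack else q
        responses.append(ask(prompt))
    return responses
```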

Harmful questions were sourced from a public benchmark (HarmBench Standard Behaviours) and a privately developed set focused on specific capabilities of concern. Compliance was graded using an automated grader model alongside human expert assessment, and is reported both for a single attempt and for the most compliant of five attempts.
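As a rough illustration of how those two figures can be aggregated from graded responses, the following sketch computes a single-attempt compliance rate and a best-of-five rate. The data layout and boolean grading are assumptions of this example, not AISI's published methodology.

```python
from typing import Dict, List


def compliance_rates(grades: Dict[str, List[bool]]) -> Dict[str, float]:
    """Aggregate per-question compliance grades into the two reported figures.

    `grades` maps each harmful question to a list of boolean compliance
    judgements, one per attempt (the report uses five attempts per question).
    """
    n = len(grades)
    single_attempt = sum(attempts[0] for attempts in grades.values()) / n
    best_of_five = sum(any(attempts) for attempts in grades.values()) / n
    return {"single_attempt": single_attempt, "best_of_five": best_of_five}


# Toy example with two questions and five graded attempts each.
example = {
    "q1": [False, False, True, False, False],
    "q2": [False, False, False, False, False],
}
print(compliance_rates(example))  # {'single_attempt': 0.0, 'best_of_five': 0.5}
```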

The study also assessed whether the attacks affected the correctness of responses to benign questions. No significant decrease in correctness was observed after the attacks, suggesting that attacked models can produce harmful answers that are accurate as well as compliant.

AI models easy to manipulate

AISI said in its report: "The report's findings reveal that, while compliance rates for harmful questions were relatively low without attacks, they could reach up to 28% for some models (notably the Green model) on private harmful questions. Under AISI's in-house attacks, all models complied at least once out of five attempts for nearly every question.

"This vulnerability indicates that current AI models, despite their safeguards, can be easily manipulated to produce harmful outputs. Our continued testing and development of more robust evaluation metrics are crucial for improving the safety and reliability of AI systems."

The institute says it plans to extend its testing to other AI models and is developing more comprehensive evaluations and metrics to address various areas of concern.

Currently running with just over 30 staff in London, the institute says it will open offices in San Francisco over the summer to further cement its relationship with the US's own AI Safety Institute, as well as make further inroads with leading AI companies headquartered there, such as Anthropic and OpenAI.