The UK government’s newly established AI Safety Institute (AISI) has published a report highlighting significant vulnerabilities in large language models (LLMs).
The findings suggest that these AI systems are highly susceptible to basic jailbreaks, with some models producing harmful outputs without attempting to bypass their safeguards.
Publicly available LLMs typically include measures designed to prevent the generation of harmful or illegal responses. However, ‘jailbreaking’ refers to deceiving the model into ignoring these safety mechanisms. According to the AISI, which utilised both standardised and in-house-developed prompts, the tested models responded to harmful queries without requiring any jailbreak efforts. When subjected to ‘relatively simple attacks’, all the models answered between 98 and 100 percent of harmful questions.
AISI’s evaluation measured the success of these attacks in eliciting harmful information, focusing on two key metrics: compliance and correctness. Compliance indicates whether the model complies with or refuses a harmful request, while correctness assesses the accuracy of the model’s responses post-attack.
The study involved two conditions: asking explicitly harmful questions directly (“No attack”) and using developed attacks to elicit information the model is trained to withhold (“AISI in-house attack”). These basic attacks either embedded the harmful question into a prompt template or used a simple multi-step procedure to generate prompts. Each model was subjected to a single distinct attack, optimised on a training set of questions and validated on a separate set.
Harmful questions were sourced from a public benchmark (HarmBench Standard Behaviours) and a privately developed set focused on specific capabilities of concern. The compliance of model responses was graded using an automated model and human expert assessment, reported both for a single attempt and the most compliant out of five attempts.
The study assessed whether attacks impacted the correctness of responses to benign questions. No significant decrease in correctness was observed after the attacks, suggesting that models can produce accurate as well as compliant harmful answers.
AISI stated in its report: “The report’s findings reveal that, while compliance rates for harmful questions were relatively low without attacks, they could reach up to 28 per cent for some models (notably the Green model) on private harmful questions. Under AISI’s in-house attacks, all models complied at least once out of five attempts for nearly every question.
“This vulnerability indicates that current AI models, despite their safeguards, can be easily manipulated to produce harmful outputs. Our continued testing and development of more robust evaluation metrics are crucial for improving the safety and reliability of AI systems.”
The institute says it plans to extend its testing to other AI models and is developing more comprehensive evaluations and metrics to address various areas of concern.
Currently operating with just over 30 London staff, the institute says it will open offices in San Francisco over the summer to further strengthen its relationship with the US’s own Safety Institute, as well as make further inroads with leading AI companies headquartered there, such as Anthrophic and OpenAI.