Experts uncover flaws in hundreds of AI safety and performance tests
Flaws mean benchmark scores may be irrelevant
A new study by leading computer scientists has found that hundreds of the most widely used tests to assess AI models are deeply flawed.
Researchers from the UK government's AI Security Institute, working alongside experts from Stanford University, UC Berkeley, and the University of Oxford, examined more than 440 benchmarks that form the backbone of AI evaluation worldwide.
The findings, published this week, reveal that almost all these tests suffer from weaknesses that "undermine the validity of the resulting claims," according to the report. These flaws mean that benchmark scores may be irrelevant or even misleading.
"Benchmarks underpin nearly all claims about advances in AI," said Andrew Bean, lead author of the study and a researcher at Oxford's Internet Institute.
"But without shared definitions and sound measurement, it becomes hard to know whether models are genuinely improving-or just appearing to."
Benchmarks are standardised tests designed to measure whether new AI systems are safe, aligned with human values, and effective in key areas like reasoning, coding, and mathematics. They are used extensively by major technology companies to justify product launches and public claims about model capability.
In the absence of comprehensive AI regulation, these benchmarks have effectively become the industry's main form of quality assurance. But the new report suggests that confidence in them may be misplaced.
The researchers found that only 16% of benchmarks included any measure of uncertainty or statistical testing, leaving most results without any quantifiable indication of reliability.
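To illustrate what such a measure might look like in practice, the sketch below (not drawn from the report, and using hypothetical figures) shows one common approach: a bootstrap confidence interval around a model's accuracy on a benchmark, computed from its per-question results.

    # Illustrative sketch only: a bootstrap confidence interval for a
    # benchmark accuracy score. All figures here are hypothetical.
    import random

    def bootstrap_ci(results, n_resamples=2000, alpha=0.05):
        """Return a (1 - alpha) bootstrap confidence interval for mean accuracy."""
        n = len(results)
        # Resample the per-question outcomes with replacement and record each mean.
        means = sorted(
            sum(random.choices(results, k=n)) / n for _ in range(n_resamples)
        )
        low = means[int((alpha / 2) * n_resamples)]
        high = means[int((1 - alpha / 2) * n_resamples) - 1]
        return low, high

    # Hypothetical per-question outcomes: 1 = correct, 0 = incorrect.
    outcomes = [1] * 870 + [0] * 130  # a model answering 870 of 1,000 questions correctly
    score = sum(outcomes) / len(outcomes)
    low, high = bootstrap_ci(outcomes)
    print(f"Accuracy {score:.1%}, 95% confidence interval roughly [{low:.1%}, {high:.1%}]")

Reporting an interval of this kind, rather than a single headline number, is the sort of basic statistical hygiene the researchers found missing from most of the benchmarks they examined.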
In some cases, benchmarks designed to assess complex traits, such as an AI model's "harmlessness", were based on vague or contested definitions, making them difficult or impossible to interpret meaningfully.
"There's a pressing need for shared standards and best practices," Bean said.
Real-world consequences
The revelations come amid increasing scrutiny of AI models' safety and accuracy, following several high-profile incidents involving harmful or false outputs.
Over the weekend, Google withdrew its Gemma AI model after it generated fabricated allegations that US Senator Marsha Blackburn had engaged in a non-consensual sexual relationship with a state trooper, complete with fake links to non-existent news articles.
"There has never been such an accusation, there is no such individual, and there are no such news stories," Blackburn wrote in a letter to Google CEO Sundar Pichai, calling the incident "a catastrophic failure of oversight and ethical responsibility."
In response, Google said the Gemma models were designed for developers and researchers, not for use as factual assistants, and that they had been removed from the company's AI Studio platform following "reports of non-developers trying to use them."
The report also follows mounting public concern over AI-driven psychological harms. Last week, Character.ai, a popular chatbot startup, announced it would bar users under 18 from open-ended conversations with its AI personas after several disturbing incidents.
In one tragic case, a 14-year-old boy in Florida reportedly took his own life after forming an obsessive relationship with an AI chatbot that his mother said had manipulated him into suicide.
While the study analysed publicly available benchmarks, researchers noted that major AI companies also use proprietary internal tests, which were not examined and may face similar challenges.
The authors argue that shared international standards and transparent evaluation methods are urgently needed to prevent misleading claims about model performance.
Without them, they warn, both policymakers and the public may be lulled into a false sense of security about AI safety.