Researchers unveil AI-driven software verification breakthrough

Most effective and efficient means yet devised for verifying software correctness, they claim

A team of computer science researchers at the University of Massachusetts Amherst in the US has unveiled a new method for automatically generating whole mathematical proofs that can be used to prevent software bugs and verify that the underlying code is correct.

Named Baldur, the approach combines the power of large language models (LLMs) with a state-of-the-art proof-automation tool called Thor, automatically proving nearly 66% of the theorems it was evaluated on.

Yuriy Brun, a professor in the Manning College of Information and Computer Sciences at UMass Amherst and the senior author of the paper, highlighted the prevalent issue of software bugs and the conventional challenges in error detection.

Despite the omnipresence of software in daily life, bugs remain an inherent part of programming, leading to problems that range from minor coding errors to significant system failures.

Traditional methods of software verification involve manual code inspection or running the code against expected outcomes, both prone to human error and impractical for complex systems.
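To see why, consider a hypothetical illustration (not from the paper): the expected-outcome checks below all pass, yet a bug survives, because tests can only sample the input space.

```python
def is_leap_year(year: int) -> bool:
    """Intended to implement the Gregorian leap-year rule."""
    return year % 4 == 0  # Subtle bug: ignores the 100- and 400-year rules

# Running the code against expected outcomes: every check here passes...
assert is_leap_year(2024)
assert not is_leap_year(2023)
assert is_leap_year(2000)

# ...but the bug is still present: 1900 was not a leap year, yet
# is_leap_year(1900) returns True. Tests only sample inputs; a proof
# must hold for every input.
```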

A more rigorous approach involves generating mathematical proofs demonstrating the code's expected functionality and subsequently using a theorem prover to validate these proofs — a technique known as machine-checking.
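As a concrete illustration, here is a toy machine-checked proof. The theorem is the specification, and the prover accepts the proof only if every step holds for all inputs. (Baldur works with Isabelle/HOL; the Lean snippet below is merely a stand-in for the idea, and both the function and the lemma are invented for this example.)

```lean
-- A tiny piece of "software": doubling a natural number.
def double (n : Nat) : Nat := n + n

-- The specification: for every n, double n equals 2 * n.
-- The theorem prover machine-checks each step of the proof.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double  -- goal becomes: n + n = 2 * n
  omega          -- decision procedure for linear arithmetic closes it
```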

However, this method is laborious, requiring extensive expertise and time investment.

In response to the limitations of human-intensive approaches, the researchers turned to the capabilities of LLMs, such as ChatGPT, as a potential solution for automating proof generation.

Despite the promise of LLMs, Brun pointed out a significant hurdle, noting that LLMs "tend to 'fail silently,' producing an incorrect answer but presenting it as if it's correct."

This inherent problem led the team to develop Baldur, a system that aims to address and rectify errors produced by LLMs.

Baldur builds on Minerva, an LLM pretrained on natural-language text and then fine-tuned on a large dataset of mathematical scientific papers. The team further fine-tuned the model on Isabelle/HOL, a language commonly used for writing mathematical proofs.

In operation, Baldur generates entire proofs and collaborates with a theorem prover to validate its work. A feedback loop is established, wherein the theorem prover identifies errors and feeds both the proof and error information back into the LLM. This iterative process enhances the LLM's learning, aiming to generate improved, error-free proofs.
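In rough terms, that loop might look like the following minimal sketch. Everything in it is a hypothetical stand-in rather than Baldur's actual interface; the toy "LLM" and "prover" below exist only so the control flow runs end to end.

```python
# A minimal sketch of a Baldur-style generate-check-repair loop.
# All names and behaviors here are hypothetical stand-ins, not
# Baldur's real interfaces.

def llm_generate_proof(prompt: str) -> str:
    """Stand-in for the fine-tuned LLM: proposes a whole proof."""
    # A real system would query the model; this fake improves once
    # it has seen an error message in the prompt.
    return "by auto" if "error" in prompt else "by simp"

def prover_check(theorem: str, proof: str) -> tuple[bool, str]:
    """Stand-in for Isabelle/HOL: accepts exactly one 'correct' proof."""
    if proof == "by auto":
        return True, ""
    return False, f"Failed to finish proof with '{proof}'"

def prove(theorem: str, max_attempts: int = 3) -> str | None:
    prompt = theorem
    for _ in range(max_attempts):
        proof = llm_generate_proof(prompt)        # LLM proposes a proof
        ok, error = prover_check(theorem, proof)  # prover validates it
        if ok:
            return proof                          # machine-checked: done
        # Repair step: feed the failed proof and the prover's error
        # message back into the prompt for the next attempt.
        prompt = f"{theorem}\n(* failed *) {proof}\n(* error *) {error}"
    return None  # could not verify within the attempt budget

print(prove('lemma "rev (rev xs) = xs"'))  # -> "by auto" on the 2nd try
```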

When integrated with Thor, Baldur proved a remarkable 65.7% of theorems automatically. While acknowledging that there is still room for improvement, the researchers assert that Baldur represents the most effective and efficient means yet devised for verifying software correctness.

The work was carried out at Google and received support from the US Defense Advanced Research Projects Agency and the US National Science Foundation.

The achievement recently earned the team a coveted Distinguished Paper award at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.