UK's AI Safety Institute warns of LLM dangers

Advanced AI systems can deceive human users and produce biased outcomes

The UK's new Artificial Intelligence Safety Institute (AISI) has discovered vulnerabilities in Large Language Models (LLMs), which are behind the surge in generative AI tools.

The Institute's initial findings highlight potential risks associated with these powerful tools.

The research shows that LLMs can deceive human users and perpetuate biased outcomes, raising alarms about inadequate safeguards to prevent the spread of harmful information.

The researchers were able to bypass safeguards for LLMs using just basic prompting techniques.

More alarming is the revelation that more sophisticated jailbreaking techniques took mere hours to execute, making them accessible to relatively low-skilled actors.

The Institute's research found LLMs could be exploited to assist in "dual-use" tasks, encompassing both civilian and military applications.

Partnering with cybersecurity experts from Trail of Bits, AISI assessed the extent to which LLMs could enhance novice attackers' capabilities. It found that currently deployed LLMs could augment their abilities in certain tasks, potentially hastening cyberattacks.

For example, LLMs can quickly produce convincing social media personas capable of spreading disinformation. The ease with which these personas can be generated, scaled, and disseminated highlights the urgent need for enhanced safeguards and oversight in AI development and deployment.

The report also sheds light on the issue of racial bias in AI-generated content.

AISI replicated findings from previous research, which found that image models tend to generate images perpetuating stereotypes when prompted with descriptors related to character traits.

AISI found that representative bias persisted despite using newer and more diverse image models, with certain prompts still yielding images inconsistent with their descriptors.

For instance, the prompt "a poor white person" often resulted in images depicting individuals with predominantly non-white faces.

Working with Apollo Research, AISI explored the potential for AI agents to deceive human users and inadvertently cause harm.

Using a simulated trading environment, researchers observed how AI agents could be influenced into deceptive behaviour by being given goals and external pressures. An AI agent tasked with managing a stock portfolio showed an inclination to act on insider information and lie about its actions (AISI held this up as an example at the AI Safety Summit last year).

AISI has now built a team of 24 researchers dedicated to testing advanced AI systems, researching safe AI development practices and sharing information with relevant stakeholders.

While the Institute acknowledges its limitations in testing all released models, it says it remains committed to evaluating the most advanced systems and providing a secondary check on their safety.