The social engineering of the self: How AI chatbots manipulate our thinking

We need structured public feedback to better understand the risks, says red teamer Rumman Chowdhury

Rumman Chowdhury

Image:
Rumman Chowdhury

On Wednesday, ahead of the UK's upcoming AI Safety Summit, teams of students and specialists came together in a concerted effort to trick, get around, confuse or otherwise defeat the guardrails around Llama 2, Meta's open source AI model.

The event, organised by the Royal Society and the AI safety non-profit Humane Intelligence, was a red teaming challenge.

The goal of red teaming is to probe AI systems in clever ways to reveal risky behaviours and biases their developers may not have anticipated. It draws on teams with different perspectives including security, ethics, sociology and psychology, and uses—or complements—techniques such as adversarial AI, model inversion, data poisoning, reinforcement learning with human feedback (RLHF), and engaging subject experts to find anomalies.

While red teaming is well established in areas such as cybersecurity and financial systems, it's quite new to AI, although there is a precedent.

Humane Intelligence helped organise a similar challenge at DEFCON in the summer. The challenge, which had the blessing of the White House and participation from prominent AI companies OpenAI, Google and Anthropic, involved 2,200 people who engaged in 17,000 conversations with 160,000 individual messages.

The results are still being processed, but we asked co-founder Dr Rumman Chowdhury for a sneak reveal of the findings. One of the key realisations, she told us, is "How hard it is to draw a clean line between embedded harms, prompt injections by malicious actors, and unintended consequences."

The social engineering of the self

Unlike dispassionate (barring ad placements) search engines, chatbots based on LLMs, such as ChatGPT, Bard and Claude, aim to please. They take input from your prompts and use it to refine their answers, subtly tailoring them to what they think you want to hear, producing what she calls "truth shaped information".

"The line is a little bit more blurry than I'd thought," said Chowdhury. "People inadvertently engage in social engineering when they converse with it. So they are essentially priming it to give them incorrect information."

As an example, someone wanting to know whether vitamin C can cure Covid, might type that into a search engine and be presented deterministically with a list of articles to choose from. However, if in conversation with a chatbot that person first revealed that they hadn't had the jab because of doubts about the vaccine's effectiveness, and then asked whether vitamin C could cure Covid, they would likely receive a response that subtly affirms their beliefs.

"Models are made to be agreeable, they don't want to be overly controversial. In that first sentence, I primed it. I basically told it to agree with me that I don't believe in vaccines and that I want to believe that vitamin C can cure Covid."

This particular example is well known and most models have safeguards in place to counter it, but there are many subtle variations on that theme.

There are numerous other potential harms too, embedded in training data, introduced by malicious actors, or emerging with use. And while researchers can catch some, at least within a defined scope, generalising the approach is a huge challenge, not helped by an almost complete absence of standards for auditing large language models.

"We don't even have agreement in the world of known algorithms like narrow AI on what auditing should look like," said Chowdhury, who has worked in this area for more than seven years, including at Twitter (as was) and for the organisations including the UN, the OECD and the UK House of Lords.

"We have absolutely no standards or norms of what generative AI-type guardrails should look like because the models are very, very different in nature. They are very difficult, if not impossible, to audit and assess.

"Which is why red teaming is something that I've been really trying to build as a practice, and specifically as a practice to bring in structured public feedback."

The importance of lived experience

To do this requires the human to be in the feedback loop. Before that, though, it needs input from populations whose under-representation leads to the biases which are now well known: Black people being disproportionately misdiagnosed by AI medical scanners or excluded from treatment, women being deselected by automated HR systems for IT jobs, and discrimination against sexual minority and religious groups being exacerbated.

"We actually need people who have experience of speaking a non-majority language, or growing up with a single parent, or growing up with same sex parents to understand how these models may talk about their lived experience and to know if it's being done properly," Chowdhury explained. "What it means to be an expert is now much broader, and it is very different from saying, we need people with PhDs."

Sometimes harms are due to biased training data, sometimes they result from quirks in the models themselves, and sometimes they happen because the wrong model has been chosen for a particular task. Regardless of the root cause, these adverse effects mean that AI is already having a negative impact on the quality of life, rather than a positive one, for a significant number of individuals.

As part of a concerted effort towards transparency, red teaming can help developers and users of models to understand and counteract the dangerous algorithmic behaviours before they happen, Chowdhury said.

"Rather than deploy models into the world and then, later, try to capture the negative externalities, the public should be part of the conversation from the start. That's what I'm trying to build."