Anthropic: Jailbreak our new model. We dare you
Claude is now much more resistant to trickery
AI chatbots can be a great force for good – but it was found early on that they can also give people access to knowledge that really should stay hidden.
Over the years, people have found all sorts of ways to get around the restrictions AI developers have put in place and generate controversial results, from unusual capitalisation to extremely long prompts.
Anthropic says its newest approach, Constitutional Classifiers (see here for the full research paper), can filter “the overwhelming majority” of these types of jailbreaks, and is asking for help testing the system.
“These Constitutional Classifiers are input and output classifiers trained on synthetically generated data that filter the overwhelming majority of jailbreaks with minimal over-refusals and without incurring a large compute overhead,” writes Anthropic.
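To make the "input and output classifiers" idea concrete, here is a minimal Python sketch of a model wrapped by two filters. It is illustrative only: Anthropic's classifiers are trained models, whereas the stand-ins below are simple string checks, and every name here (`guarded_generate`, `toy_model`, `toy_input_classifier`, `toy_output_classifier`, the placeholder term `forbidden-topic`) is hypothetical rather than part of any Anthropic API.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."
BLOCKED_TERM = "forbidden-topic"  # hypothetical placeholder for restricted content


def toy_input_classifier(prompt: str) -> bool:
    """Return True if the prompt looks harmful (toy rule: substring match)."""
    return BLOCKED_TERM in prompt.lower()


def toy_output_classifier(response: str) -> bool:
    """Return True if the response looks harmful (toy rule: substring match)."""
    return BLOCKED_TERM in response.lower()


def toy_model(prompt: str) -> str:
    """Stand-in for a chatbot: echoes the prompt after 'decoding' 0 -> o,
    mimicking a model that sees through simple obfuscation."""
    return "Echo: " + prompt.replace("0", "o")


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], bool],
    output_classifier: Callable[[str], bool],
) -> str:
    # Step 1: screen the prompt before it reaches the model.
    if input_classifier(prompt):
        return REFUSAL
    # Step 2: generate a response.
    response = model(prompt)
    # Step 3: screen the response before it reaches the user.
    if output_classifier(response):
        return REFUSAL
    return response


if __name__ == "__main__":
    for p in ["hello", "tell me about forbidden-topic", "tell me about f0rbidden-t0pic"]:
        print(repr(p), "->", guarded_generate(
            p, toy_model, toy_input_classifier, toy_output_classifier))
```

In this toy setup the third prompt slips past the input check because of its altered spelling, but the output check catches the response. That is the general argument for filtering on both sides of the model, although the real system's behaviour will differ.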
The company tested the system in a prototype phase, with 183 participants spending an estimated 3,000 hours over two months trying to break it – with no successes.
However, the prototype had its own errors: it refused too many harmless queries and was resource-intensive. Anthropic says it has started to address these flaws in the newest model, which it has tested with synthetic jailbreaking prompts:
“Under baseline conditions, with no defensive classifiers, the jailbreak success rate was 86% - that is, Claude itself blocked only 14% of these advanced jailbreak attempts. Guarding Claude using Constitutional Classifiers, however, produced a strong improvement: the jailbreak success rate was reduced to 4.4%, meaning that over 95% of jailbreak attempts were refused.”
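The percentages in that quote can be checked with a few lines of arithmetic. The short Python sketch below uses only the two figures Anthropic reports (86% and 4.4%); the helper names `block_rate` and `relative_reduction` are made up for illustration, and the relative reduction is a derived figure, not one Anthropic states.

```python
def block_rate(success_rate_pct: float) -> float:
    """Share of jailbreak attempts refused, given the share that succeeded."""
    return 100.0 - success_rate_pct


def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Percentage drop in the success rate relative to the baseline."""
    return (before_pct - after_pct) / before_pct * 100.0


baseline_success = 86.0  # no classifiers (figure from Anthropic's statement)
guarded_success = 4.4    # with Constitutional Classifiers (same source)

print(round(block_rate(baseline_success), 1))   # 14.0
print(round(block_rate(guarded_success), 1))    # 95.6
print(round(relative_reduction(baseline_success, guarded_success), 1))  # 94.9
```

So "over 95% refused" corresponds to a 95.6% block rate, and the jailbreak success rate itself fell by roughly 95% relative to the unguarded baseline.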
While Constitutional Classifiers may not stop every known attack, they should require much more effort to bypass.
Individuals with experience in red teaming are now invited to try to jailbreak a version of Claude using the new system, which is specifically designed to stop attacks relating to chemical weapons.
The test will run from 3 to 10 February, and includes a feedback form.