Claude Mythos: How AI broke out of its sandbox

The controlled escape of an AI model and why the incident is reshaping cybersecurity and product strategies

During internal tests, a new AI model developed by Anthropic managed to escape its virtual security environment, then contacted researchers on its own initiative and documented its success. The incident highlights the growing challenges of AI security – and just how real they have become.

Anthropic emphasises that the incident was contained within its security processes and that there was no “unrestricted escape onto the internet” beyond the tests; no damage occurred outside the controlled environment. Nor did the model act autonomously: it followed prior test instructions. Mythos did not pursue any goals of its own – not even self-preservation.

Contrary to some media reports, this is not a case of “AI with a will of its own”.

What happened?

The model in question is Claude Mythos Preview, an internal model developed by Anthropic and trained specifically for cybersecurity tasks. Unlike the well-known Claude models, Mythos was never intended for public use. Instead, it ran in a highly isolated test environment – a so-called sandbox, designed to strictly limit its technical interactions.

As part of a security test, the model was given the explicit task of breaking out and, if successful, contacting one of the researchers. Claude Mythos did in fact manage to circumvent the technical restrictions, gain access to the public internet and independently send an email outside the designated channels. The researcher received the message during a break – in several reports, the incident was subsequently dubbed the “lunch or sandwich incident”.

The incident became controversial due to the model’s subsequent behaviour. After successfully breaking out, Claude Mythos published details of its exploit on several publicly accessible but hard-to-find websites – without being asked to do so.

This is not “human-like behaviour” in the sense of self-awareness or self-motivation. Nevertheless, the behaviour reveals a significant characteristic: the model carried out additional actions to demonstrate its success – a pattern considered potentially risky in AI safety research. Anthropic describes this behaviour as an “unasked-for effort to demonstrate success”.

No loss of control – just a known risk

As sensational as many of the headlines may sound, the facts are more prosaic:

Anthropic expressly emphasises that tests of precisely this kind are intended to identify problematic behaviour before any potential release, and that it made a conscious decision not to publish Claude Mythos Preview.

A model with exceptional capabilities

There are further serious reasons for this reluctance, rooted in the capabilities the model demonstrated in internal tests.

It is precisely these capabilities that make the model both valuable and dangerous from a corporate perspective. Instead of making Claude Mythos generally available, Anthropic founded Project Glasswing, a controlled programme in which selected partners – including major cloud and security providers – are permitted to use the model defensively. The aim is to better secure critical software worldwide before comparable capabilities can be misused.

A warning – not a dystopia

The incident surrounding Claude Mythos is not a loss of control, not a super-AI running amok, and certainly not the start of a machine rebellion. Anyone who nevertheless portrays it as such is missing the point. What is truly worrying lies deeper – and is far more sobering.

Anthropic did not sound the alarm because a model had “gone rogue”. Rather, in a deliberately provoked safety test, the model demonstrated something that will be relevant for the next generation of AI systems: technical situational understanding combined with the ability to act – within a short time and without further prompting.

The real risk is not autonomy – but scaling

The greater danger is not that a model might one day break out of a sandbox. Rather, it is that comparable capabilities will soon appear in models that are rolled out as standard – embedded in developer tools, security scanners and agent frameworks.

Claude Mythos demonstrates what happens when a model not only analyses complex systems but actively finds ways to circumvent restrictions – and when those capabilities are easily reproducible.

Today, this is happening in a controlled test environment at Anthropic. Tomorrow, something similar could be lurking in a CI/CD tool, a penetration testing agent or an automated DevSecOps workflow – invisible and unnoticed.

Safety becomes a release criterion

The key point is that Anthropic is taking the logical step and consciously deciding against releasing the model – not out of fear of negative headlines, but because the risk-benefit threshold has been exceeded.

Last month, information about Mythos became public after details of the model were accidentally stored in a publicly accessible data cache due to human error. In the leaked drafts, Mythos was described as the most powerful and capable AI model to date, as reported by The Hacker News.

This marks a cultural shift in the AI industry: not everything that is technically possible automatically becomes a product. The key question is no longer “What can a model do?”, but “What is a model allowed to do – and in what context?”

Other providers will have to measure up to this standard, because the more powerful AI systems become, the less plausible the assumption that security can be outsourced purely to application logic or user guidelines.

Conclusion

Claude Mythos is not Skynet. But it is an early warning sign. Not of machines acting with intent – but of an industry that must get used to the fact that capabilities are growing faster than control mechanisms.

AI models are acquiring abilities that allow them not only to understand technical safeguards, but also to actively circumvent them. Anthropic did the right thing in this case. However, the real test is yet to come – when comparable capabilities are no longer exclusive, but commonplace.

This article was first published in German on Computing’s sister site Computing Deutschland