AI models are hiding their reasoning on purpose

‘Reasoning models don’t always say what they think'

Technological Iceberg as the beginning of Artificial Intelligence and the new tech advent investment or hidden uncertainty of AI that lies beneath the surface with 3D illustration elements.

Researchers from Anthropic have found that some AI models hide their ‘thought’ processes, even when they are designed to show it in full.

Simulated reasoning (SR) models are AI models designed to use logic in their outputs – equivalent to showing your work in school. The idea is to bring more transparency and safety to AI use, but researchers from Anthropic have found these models often hide the fact that they’ve used external help or taken shortcuts, despite their programming.

Models like DeepSeek’s R1, Google’s Gemini Flash Thinking and Anthropic’s own Claude 3.7 Sonnet Extended Thinking (DeepSeek and Claude were used in this research) are all examples of reasoning models that rely on a process called chain-of-thought (CoT). This is intended to display each step an AI model has taken as it goes from prompt to output.

However, the researchers found that the explanations the models give often fail to accurately reflect their reasoning processes.

The team introduced helpful information into their prompts, like hints about the correct answer to a question or instructions on an unauthorised shortcut, for the AI to use. They found that the models “reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%."

Image
Description
An exmaple of an unfaithful CoT by Claude Sonnet 3.7. The model answers D to the original question (left), but changes its answer to C when given a metadata hint – but does not acknowledge the hint. Image: Anthropic

This is another form of AI hallucination: the model knows the answer but hides the truth about how it got there.

Claude only referenced the hints in its CoT 25% of the time, while DeepSeek was slightly better: it admitted it had used hints 39% of the time. Both, however, produced more answers that lied or hid the truth about how the model got there than not – especially when the questions were more difficult.

Fixing faithfulness

The team hypothesised that training models on more complex tasks, which require greater reasoning, could encourage them to more accurately reflect their working in the CoT. In testing this did increase faithfulness (by margins of 63% and 41% in two tests), but the improvements plateaued quickly. The researchers were unable to get faithfulness above 28%.

While CoTs offer a promising approach to increasing AI transparency and safety, the models themselves currently find it too easy to work around the process. That said, the Anthropic team admits that its research may have flaws (they only studied two models, for example).