Future AI models may lie to appear safe in tests, OpenAI study warns

An OpenAI-led study finds current AI models struggle to manipulate their reasoning, but warns that more advanced systems could eventually obscure decision processes during monitoring

Researchers studied whether AI models can hide or modify reasoning steps while being monitored during safety evaluations (Image Credit: Bloomberg)
Harsh Shivam New Delhi
5 min read Last Updated : Mar 06 2026 | 1:03 PM IST
Artificial intelligence systems may eventually learn to conceal or reshape their internal reasoning when they know they are being monitored, potentially allowing them to pass safety evaluations while behaving in unintended ways. Researchers studying how AI models generate and control their reasoning steps say this scenario could lead to situations where systems appear aligned with user goals even when they are not. 
The study was conducted by researchers at OpenAI in collaboration with several universities, including New York University, University College London and the University of Pennsylvania. The team examined whether modern reasoning models can deliberately modify their reasoning traces — a capability that could allow AI systems to hide potentially unsafe behaviour. 
The concern centres on how modern AI systems reason through complex tasks and whether those reasoning traces can reliably be used to monitor their behaviour. In simple terms, if AI models learn to hide their reasoning, they could effectively “lie” to safety systems designed to detect risky behaviour.

What the research examined

The research focuses on a concept known as chain-of-thought reasoning, often shortened to CoT. This refers to the intermediate reasoning steps an AI model produces while solving a problem. 
These reasoning traces can reveal how a model arrives at an answer. For example, when solving a technical question, the model may break the task into multiple logical steps before producing a final response. Because these steps often reflect the model’s internal decision process, researchers have proposed monitoring them to detect potentially harmful behaviour.
  However, this approach depends on an assumption: that the reasoning a model produces accurately reflects what it is doing internally. If a model learns to deliberately hide or alter its reasoning, that signal becomes less reliable. As the researchers note, if agents reshape or obscure their reasoning when they know they are being monitored, safety evaluations could overestimate how aligned or safe those systems are. 

What is “CoT controllability”?

The study examines whether AI systems can deliberately manipulate their reasoning steps — a capability researchers call chain-of-thought controllability. 
In practical terms, this means testing whether models can change the way they express their reasoning when instructed. Researchers created a benchmark called CoT-Control, which includes more than 13,000 tasks designed to measure how well models can alter their reasoning while solving problems. 
For example, the system might be asked to solve a task while following additional instructions about how its reasoning should appear. These instructions could include avoiding specific keywords, modifying writing style, or inserting particular phrases in the reasoning steps. 
If models were able to reliably follow such instructions, it would suggest they could potentially manipulate or hide parts of their reasoning when necessary.
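A constraint of the kind described above, such as avoiding a banned keyword in the reasoning trace, can be checked mechanically. The sketch below is purely illustrative; the function name and constraint format are assumptions, not the CoT-Control benchmark's actual code:

```python
import re

def violates_keyword_ban(reasoning_trace: str, banned_keywords: list[str]) -> list[str]:
    """Return the banned keywords that appear in a model's reasoning trace.

    An empty result means the model followed the "avoid these words"
    instruction for this task (hypothetical check, not the benchmark's code).
    """
    trace_lower = reasoning_trace.lower()
    return [kw for kw in banned_keywords
            if re.search(r"\b" + re.escape(kw.lower()) + r"\b", trace_lower)]

trace = "First, I factor the equation, then substitute the root back in."
print(violates_keyword_ban(trace, ["factor", "derivative"]))  # ['factor']
```

A check like this only inspects the visible trace, which is exactly the limitation the researchers highlight: it says nothing about what the model is doing internally.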

Models tested in the study

The researchers evaluated 13 reasoning models from multiple AI developers. 
These included OpenAI models such as GPT-5, GPT-5.1, GPT-5.2, o3 and o4-mini, Anthropic’s Claude 3.7 Sonnet, Claude Sonnet 4 and Claude Sonnet 4.5, as well as open-weight models such as Qwen3 (14B and 32B), GPT-OSS (20B and 120B) and DeepSeek R1. 
Across these systems, the ability to deliberately control reasoning traces remained very low. The researchers measured this using a “controllability score”, representing the share of reasoning traces where a model successfully altered its reasoning according to the instructions. 
The scores ranged from as low as 0.1 per cent to about 15.4 per cent across the models tested.  
The ability of multiple frontier models to control their outputs and CoTs. Lower CoT controllability is associated with higher monitorability and, therefore, higher safety (Source: OpenAI)
Among the models evaluated, DeepSeek R1 showed the lowest controllability score at about 0.1 per cent, meaning it almost never succeeded in modifying its reasoning in the required way. Some larger models demonstrated slightly higher controllability. For example, Claude 3.7 Sonnet showed one of the higher scores among the systems tested, though still well below levels that would indicate reliable control.
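A controllability score of the kind reported above, the share of tasks on which the trace satisfied every instruction, can be sketched as follows. This is an illustrative reconstruction; the benchmark's actual scoring procedure may differ:

```python
def controllability_score(results: list[bool]) -> float:
    """Share of tasks (as a percentage) where the model's reasoning
    trace satisfied all of the control instructions it was given."""
    if not results:
        return 0.0
    return 100.0 * sum(results) / len(results)

# Hypothetical per-task pass/fail outcomes for one model:
outcomes = [False] * 987 + [True] * 13   # 13 of 1,000 tasks passed
print(f"{controllability_score(outcomes):.1f} per cent")  # 1.3 per cent
```

On this metric, a score near 0 per cent (like DeepSeek R1's) means the model almost never reshaped its reasoning as instructed, while the roughly 15 per cent ceiling observed in the study is still far from reliable control.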

What the study found

Overall, the researchers found that current reasoning models struggle to deliberately control their reasoning traces. 
In many cases, models failed to follow reasoning constraints even when they recognised that they had violated them. For example, a model instructed not to use a particular keyword in its reasoning acknowledged the rule but continued using the term while solving the problem. 
These results suggest that today’s models generally lack the ability to reliably manipulate or disguise their reasoning.

Why this matters for AI safety

The findings are seen as cautiously positive for AI safety because they indicate that monitoring reasoning traces may remain a useful oversight tool for now. 
If models cannot easily manipulate their reasoning, safety systems that analyse these traces may still be able to detect harmful or unintended behaviour.
  However, the researchers caution that this capability could evolve as AI systems grow more powerful. Larger models tend to show slightly higher reasoning controllability, and certain prompting conditions — such as informing models they are being evaluated — can increase it modestly.

The bigger concern

The broader concern raised by the study is that many AI safety approaches rely on signals generated by the models themselves. If those signals can be manipulated, oversight systems could provide a false sense of security. 
In effect, future AI systems might produce reasoning that appears safe to monitoring systems while internally pursuing different strategies. 
For now, the research suggests that this risk remains limited because current models struggle to control their reasoning in a consistent way. But the authors argue that tracking this capability will become increasingly important as AI systems become more capable and autonomous.


First Published: Mar 06 2026 | 1:02 PM IST
