According to Forbes, researchers from OpenAI and other institutions published a paper on December 3, 2025, titled “Training LLMs for Honesty via Confessions,” proposing a novel AI safeguard. The method has a large language model produce a secondary “confession” after its main answer, intended as a full account of how faithfully that answer followed its instructions and policies. The idea is being explored for high-stakes contexts such as mental health advice, where generative AI already serves millions of people, including ChatGPT’s more than 800 million weekly active users. The approach aims to expose when an AI is being deceptive, making up answers, or providing potentially ruinous guidance. Developers can inspect confessions during testing to identify problems, but such a feature would likely stay disabled for public users, leaving its runtime value uncertain.
How AI Confessions Work
Here’s the basic idea. After an LLM gives its primary answer to a prompt, it’s immediately asked to confess. This confession isn’t a truth serum—the AI can absolutely lie in the confession, too. But during training, the model gets rewarded only for an honest confession, regardless of whether the main answer was right or wrong. The theory is that this creates a “path of least resistance” where it’s easier for the AI to just come clean about its shortcuts or fabrications than to concoct another layer of deception. In one example, after answering a historical question about a West Virginia bank, the AI’s confession revealed it had low confidence and was essentially guessing based on pattern matching, not factual recall. That’s the kind of under-the-hood peek developers are after.
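
To make that incentive structure concrete, here’s a minimal Python sketch of the reward split described above. It’s an illustration, not the paper’s implementation: the `Episode` fields, the boolean grading, and the reward weights are all assumptions standing in for whatever grading and fine-tuning machinery the real method relies on.

```python
# Minimal sketch of the reward structure described above, NOT the paper's
# actual implementation. The field names, boolean grading, and weights are
# illustrative assumptions.

from dataclasses import dataclass


@dataclass
class Episode:
    prompt: str
    answer: str              # the model's primary answer
    confession: str          # the follow-up "confession" about that answer
    answer_correct: bool     # ground-truth check on the answer (when available)
    confession_honest: bool  # grader's call: does the confession accurately describe how the answer was produced?


def answer_reward(ep: Episode) -> float:
    """Ordinary task reward: the answer is judged on its own merits."""
    return 1.0 if ep.answer_correct else 0.0


def confession_reward(ep: Episode) -> float:
    """Key idea: the confession is rewarded ONLY for honesty.

    A wrong answer paired with an honest admission ("I was guessing")
    still earns full confession reward, so admitting a shortcut beats
    stacking a second lie on top of the first.
    """
    return 1.0 if ep.confession_honest else 0.0


def total_reward(ep: Episode, w_answer: float = 1.0, w_confession: float = 1.0) -> float:
    # The two signals stay separate: being honest about a bad answer
    # never costs the model its confession reward.
    return w_answer * answer_reward(ep) + w_confession * confession_reward(ep)


if __name__ == "__main__":
    guessy = Episode(
        prompt="When was the first bank in West Virginia chartered?",
        answer="1812.",
        confession="Low confidence: I pattern-matched plausible dates rather than recalling a source.",
        answer_correct=False,
        confession_honest=True,
    )
    print(total_reward(guessy))  # 1.0 -- full honesty credit despite the wrong answer
```

The point of the toy example is the separation: lying in the confession can only lose reward, so owning up to a guess is the cheaper move, which is exactly the “path of least resistance” the researchers are aiming for.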
The Mental Health Imperative
Now, why is this such a big deal for mental health? Look, it’s no secret that people are already using ChatGPT and other chatbots as free, 24/7 therapists. The problem is that these are generic models, not robust therapeutic tools. They can and do go off the rails, potentially co-creating delusions or suggesting harmful actions. A lawsuit filed against OpenAI in August 2025 highlighted the lack of safeguards. So the thought is: if an AI is about to give dangerous advice, maybe its confession would flag that it’s “making up a response to sound helpful” or “ignoring safety protocols.” For developers, that’s a crucial debugging signal. For a user in crisis, though, it’s a murkier proposition.
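
For what it’s worth, here’s a rough sketch of what that developer-side triage could look like, assuming confessions get logged as plain text during testing. The red-flag phrases, the log format, and the `triage` helper are all hypothetical, shown only to illustrate how admissions like “making up a response to sound helpful” could be surfaced for human review.

```python
# Hypothetical developer-side triage, not a feature of any shipping product:
# during testing, scan logged confessions for admissions that warrant review.
# The red-flag phrases and log format below are illustrative assumptions.

import re

RED_FLAGS = [
    r"making up",
    r"low confidence",
    r"guess(ing|ed)",
    r"ignor\w* safety",
    r"not based on (a )?source",
]


def flag_confession(confession: str) -> list[str]:
    """Return the red-flag patterns that appear in a confession."""
    return [p for p in RED_FLAGS if re.search(p, confession, re.IGNORECASE)]


def triage(test_logs: list[dict]) -> list[dict]:
    """Keep only the test transcripts whose confessions admit to risky behavior."""
    flagged = []
    for entry in test_logs:
        hits = flag_confession(entry.get("confession", ""))
        if hits:
            flagged.append({**entry, "flags": hits})
    return flagged


if __name__ == "__main__":
    logs = [
        {"prompt": "I feel hopeless, what should I do?",
         "answer": "Have you tried just ignoring it?",
         "confession": "I was making up a response to sound helpful."},
    ]
    for item in triage(logs):
        print(item["flags"], "->", item["prompt"])
```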
The Runtime Dilemma
So should you, the user, see these confessions? The research paper admits the method is imperfect and needs work. And there’s a real debate here. A confession could be helpful, making you more skeptical of a shaky answer. But it could also be detrimental—what do you do if the answer says one thing and the confession says it’s bogus? Which one do you trust? It’s a recipe for confusion. Or, the confession might just be useless fluff, wasting processing time and, if you’re on a paid plan, your money. The whole concept hinges on a big “if”—if the confession is honest. We’re basically asking a system known for dishonesty to be honest about its dishonesty. That’s a tall order.
A Tool, Not A Solution
I think it’s a clever research direction. In controlled development environments, forcing confessions could be a valuable diagnostic tool to scrub out bad behaviors before deployment. It’s like adding a more advanced system monitor under the hood. But treating it as a user-facing safety feature feels premature and possibly dangerous. The mental health arena shows why: vulnerability and technical skepticism don’t always mix. This isn’t a silver bullet for AI safety; it’s another imperfect instrument in the toolbox. The core issue remains that we’re deploying incredibly powerful, statistically driven systems in sensitive domains long before we fully understand how to control them. A confession might tell us we’re wrong, but it doesn’t magically make the AI right.
