Back in 2020, OpenAI's GPT-3 paper unveiled a plot that deserves special mention: as language models grow bigger, they get better at zero-shot tasks. For the past five years, that graph has been the North Star for AI researchers, guiding everything from model design to hardware. Now, OpenAI's at it again with a new graph that's turning heads. This time, it's not just about size — it's about giving models time to think.
This latest graph shows that increasing inference-time compute — how long a model spends reasoning during inference — boosts performance on tough tasks. Picture a student rushing through a test versus one who takes their time to ponder each question. The latter's more likely to ace it, right? That's the gist of what's happening with reasoning models. In this article, we'll unpack this concept and investigate how these models work.
For years, we’ve scaled models by making them bigger and feeding them more data. It still works but it’s now harder to get enough data and compute to continue on that path. Now, OpenAI’s suggesting we scale something else: the time models spend thinking at test time. It’s less “build a smarter brain” and more “give the brain a longer coffee break to figure things out.”
Chain of Thought: AI Talking to Itself
So, how does it "think" longer? One way is Chain of Thought (CoT), where the model generates intermediate reasoning steps before giving an answer. OpenAI's o1 does exactly that, producing a stream of text that outlines its reasoning. You can see in a chat with o1 how it plans, backtracks, and even evaluates its options, all in plain English. It's like watching an AI detective narrate a crime scene investigation, minus the trench coat.
o1 learns this reasoning trick through reinforcement learning (RL), not by mimicking human examples. And it’s “data-efficient,” meaning it doesn’t need a library of human-written solutions — just a nudge from RL to induce this behaviour to figure things out itself.
But how does that work internally? It might seem obvious, but it is not: how do reasoning models learn to manage their internal dialogue so that it produces finite, structured chains of thought that lead to reasonable conclusions? That's no easy task. The model reasons in language, and many problems send even humans into endless loops of flawed reasoning, let alone an LLM. Let's look at the options.
Option 1: Guess and Check (The Spaghetti Toss)
The simplest suspect is “Guess and Check”. The model generates a bunch of answer attempts, checks which ones work using a verifier, and trains on the winners. It’s the AI equivalent of throwing spaghetti at the wall and seeing what sticks. Technically, this is rejection sampling: sample chains, keep the good ones, and learn from them.
Pros: It’s straightforward and scalable — basically a brute-force play.
Cons: For hard problems, guessing right is like winning the lottery. And if you observe o1's or DeepSeek's structured reasoning, it doesn't scream "random pasta toss." So unless we employ a quantum computer, this is not a practical solution.
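To make rejection sampling concrete, here's a minimal sketch. The "model" and the "verifier" are toy stand-ins (a random guesser for 13 * 7 and an exact-answer check), not any real API:

```python
import random

def generate_chain(rng):
    # Toy "model": propose an answer to 13 * 7 by guessing a nearby number.
    guess = rng.randint(80, 100)
    return f"13 * 7 = {guess}", guess

def verifier(answer):
    # Outcome-only check: looks at the final answer, not the reasoning.
    return answer == 91

def rejection_sample(num_samples=1000, seed=0):
    rng = random.Random(seed)
    kept = []
    for _ in range(num_samples):
        chain, answer = generate_chain(rng)
        if verifier(answer):
            kept.append(chain)  # the surviving chains become fine-tuning data
    return kept

kept = rejection_sample()
```

The weakness is visible in the numbers: with 21 possible guesses, only about 1 in 21 samples survives, and for genuinely hard problems that hit rate collapses toward zero.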
Option 2: Process Rewards (The Coach Approach)
Next up, Process Rewards. Here, the model gets feedback not just on its final answer but on every step of its reasoning. Think of it as a coach who doesn’t just say “you lost” but critiques your every move. Papers from Google and OpenAI call this a Process Reward Model (PRM), showing it beats plain guess-and-check by guiding the model’s thought process.
The verifier might use Chain of Thought too, merging generation and verification into one chatty model. Phrases like "is that a good explanation?" in reasoning models' output hint at this.
Pros: It's effective and aligns with how humans are trained and how they think.
Cons: Merging everything into one model is a logistical headache, so this is difficult to implement well.
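The difference from guess-and-check can be sketched with a toy process reward model. The scoring rule here is hypothetical (a step is "good" if the arithmetic it claims is actually true); a real PRM would be a learned model, but the shape of the feedback is the same — one score per step, not one score per answer:

```python
def process_reward(steps):
    """Toy PRM: score every reasoning step, not just the final answer.

    Each step is a pair (expression, claimed_value). A step earns 1.0 if the
    expression really evaluates to the claimed value, 0.0 otherwise.
    """
    scores = []
    for expression, claimed in steps:
        actual = eval(expression)  # fine for a toy; never eval untrusted input
        scores.append(1.0 if actual == claimed else 0.0)
    return scores

# A chain with a mistake in the middle: the PRM pinpoints the bad step,
# whereas an outcome-only verifier would only say "final answer wrong".
chain = [("2 + 2", 4), ("4 * 3", 13), ("13 - 1", 12)]
scores = process_reward(chain)
```

The per-step scores give the RL signal something to grab onto: the model is rewarded for the good steps it took even when the final answer came out wrong.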
Option 3: The Chess Master Search (AlphaZero Style)
Now we're cooking with gas. Search draws from AlphaZero, DeepMind's self-taught chess champion trained with RL. AlphaZero played games against itself, used search (specifically Monte Carlo Tree Search, though simpler strategies like beam search can also be used) to explore moves, and learned from the outcomes. For an LLM, this could mean sampling reasoning paths, evaluating them, and refining its strategy.
The problem is that even in Go, AlphaZero searches the tree only to a certain depth, because compute is limited. But Go is a constrained environment with fixed rules, so even under such limits, very accurate predictions can be made. That is not the case for general-purpose reasoning in natural language, where the number of possibilities is essentially limitless at each node.
Pros: It works on top of RL and allows for complex backtracking and planning.
Cons: It works within the limited scope of a game, but it's complex and compute-heavy, and open research hasn't cracked it for language models or general reasoning yet.
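A minimal beam-search sketch shows the mechanic. Everything here is a toy under stated assumptions: the "reasoning steps" are just digits 0-2, and the value function is a made-up heuristic (distance of the running sum to a target), standing in for a learned evaluator:

```python
import heapq

def expand(partial):
    # Toy expansion: from a partial chain, propose three possible next steps.
    return [partial + [step] for step in range(3)]

def value(chain, target):
    # Hypothetical value function: how close the running sum is to the target.
    return -abs(sum(chain) - target)

def beam_search(target, depth=4, beam_width=2):
    beams = [[]]  # start from an empty chain
    for _ in range(depth):
        candidates = [c for beam in beams for c in expand(beam)]
        # Keep only the beam_width most promising partial chains.
        beams = heapq.nlargest(beam_width, candidates,
                               key=lambda c: value(c, target))
    return beams[0]

best = beam_search(target=7)
```

The catch the article points out is right there in `expand`: in a game the branching factor is fixed and small, while in natural language it is effectively unbounded, so the candidate list explodes.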
Option 4: Learning to Correct (The Self-Therapist)
Imagine you’re navigating a maze. A traditional model might charge down one path and get stuck at a dead end. A self-correcting model, however, would notice the dead end, backtrack, and try a new route — learning from each misstep to improve its strategy. Over time, it internalizes these lessons, so it can navigate future mazes faster without hitting as many walls. For a reasoning model, the “maze” is a complex reasoning task, and self-correction is how it finds the exit.
The "learning to self-correct" aspect of reasoning models is a fascinating mechanism that helps explain how they leverage inference-time computation to excel at challenging tasks.
In the context of chain-of-thought reasoning, learning to self-correct refers to the model’s ability to generate an initial line of reasoning, spot errors in its own logic, and adjust its approach to arrive at the correct answer — all without human intervention. Think of it as the model acting as its own editor or debugger. For example, if a model is tackling a math problem and miscalculates a step (say, claiming 2 + 2 is 5), it doesn’t barrel ahead with the wrong result. Instead, it pauses, recognizes the mistake, and tries a different tack to fix it.
This is especially vital for language models because, unlike games like chess with a finite set of moves, language offers endless ways to express or reason about something — and just as many ways to go wrong. Self-correction allows the model to explore these possibilities, backtrack when needed, and refine its output, making it more robust for complex, open-ended problems.
Traditional language models often generate a single chain of thought and stick with it, right or wrong. In contrast, self-correction enables a model to adapt on the fly, mimicking how humans rethink flawed assumptions or fix mistakes during problem-solving. This flexibility could be a key factor in reasoning models’ improved performance on hard tasks, as it effectively embeds a form of trial-and-error learning into its reasoning process. It’s like giving the model an internal “therapist” that says, “Let’s reflect on why that didn’t work and try something else.”
So, how does a model learn to self-correct? The process likely involves a few steps:
- Generating a Random Chain: The model starts by producing an initial chain of thought. This might lead to an incorrect answer due to a logical misstep or factual error.
- Identifying the Mistake: Using some form of evaluation (perhaps a verifier or internal scoring mechanism), the model detects that the chain is off track.
- Searching for a Fix: It then explores alternative paths — essentially performing a mini-search — to find a corrected chain that resolves the error.
- Learning from the Process: Through reinforcement learning (RL), the model is trained not just on the correct answer but on the entire journey from mistake to correction. This teaches it patterns of error recognition and recovery.
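The four steps above can be sketched as a minimal generate-detect-retry loop. Here a hard-coded sequence of claims stands in for the model's sampled attempts, and a simple equality check stands in for the verifier (both are hypothetical stand-ins):

```python
def self_correct(claims, truth):
    """Toy self-correction loop: try claims in order until one verifies.

    The full trace, mistakes included, is kept: in RL training the model
    would learn from the whole journey, not just the final correct step.
    """
    trace = []
    for claim in claims:
        if claim == truth:  # stand-in for a verifier / internal scorer
            trace.append(f"claim {claim}: verified, stop")
            return claim, trace
        trace.append(f"claim {claim}: mistake detected, backtrack")
    return None, trace  # ran out of attempts without a verified answer

# The model first claims 2 + 2 is 5, then 3, then finally 4.
answer, trace = self_correct([5, 3, 4], truth=2 + 2)
```

The important detail is that `trace` (not just `answer`) becomes training data, which is what teaches the model patterns of error recognition and recovery rather than only final answers.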
A clever twist here is the Stream of Search concept. Instead of explicitly searching through a tree of possibilities at test time (which would be slow), the model might linearize this process into a single sequence during training. For instance, a stream could look like:
Initial thought: “2 + 2 is 5” → Mistake detected: “That’s not right” → Backtrack: “Let’s redo it” → Correction: “2 + 2 is 4” → Correct answer reached.
The idea is to take the whole tree (including correct and incorrect branches) and flatten it, that is, to create a single long chain of reasoning out of all the branches. By training on such sequences, a model learns to simulate search-like behavior within one fluid chain of thought, making it efficient at inference time. It's a hack, but a very good one for generating synthetic chains that include the whole error-correction process. That matters because using humans here would be far too expensive and unreliable. I highly recommend reading the Stream of Search paper.
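A rough illustration of the flattening idea, under toy assumptions: the search tree is just a Python dict mapping a node to its children, correctness is a simple predicate, and a depth-first walk emits the linear stream, with wrong branches and explicit backtracking kept in:

```python
def flatten(tree, is_correct):
    """Linearize a search tree into one 'stream of search' sequence.

    tree: {node: [children]}; nodes with no entry are terminal answers.
    The output keeps failed branches and backtrack markers, so a model
    trained on it sees the entire error-correction process in order.
    """
    stream = []

    def dfs(node):
        stream.append(f"try: {node}")
        children = tree.get(node, [])
        if not children:  # terminal node: check the candidate answer
            if is_correct(node):
                stream.append(f"answer: {node}")
                return True
            stream.append(f"wrong, backtrack from {node}")
            return False
        return any(dfs(child) for child in children)  # stop at first success

    dfs("start")
    return stream

tree = {"start": ["2+2=5", "2+2=4"]}
stream = flatten(tree, is_correct=lambda node: node.endswith("=4"))
```

The resulting `stream` reads exactly like the example above: a wrong attempt, a backtrack, then the correction, all in one flat sequence a language model can be trained on.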
Challenges and Solutions
Teaching a model to self-correct isn’t straightforward. Here are some hurdles and how they might be addressed:
– Ignoring Mistakes: if the model learns to skip straight to correct answers without fixing its errors, it misses the point of self-correction. To prevent this, training must emphasize the correction process itself, not just the outcome.
– Distribution Shift: if the corrections in training don’t match the model’s typical mistakes, it won’t generalize well. The solution is an on-policy approach — the model generates its own flawed chain, then learns corrections tailored to that specific output, keeping the process relevant.
– Computational Cost: that’s an important one. Searching for corrections can be resource-intensive. Techniques like Stream of Search help by approximating the search process, while efficient RL algorithms optimize the training.
In practice, this can be implemented through large-scale RL, where thousands of reasoning samples are generated, evaluated, and iteratively refined into a policy. A two-stage training process could be at play: first mastering the art of correction, then improving the initial generation so it makes fewer mistakes in the first place.
It’s a step toward AI that doesn’t just parrot answers but genuinely wrestles with problems. However, it’s computationally demanding, and without careful design, the model might fall into traps like overcorrecting or collapsing into rote patterns.
Implications
This isn't just academic trivia; it's a game-changer. Inference time matters: bigger models are great, but longer thinking time is the next frontier. Even more importantly, chains of thought let us peek into the model's mind, making AI less of a black box and adding much-sought-after explainability.
