Overview
On September 12, 2024, OpenAI released o1 (initially as o1-preview), a model that opened a second axis of AI capability: rather than relying only on more training compute, it also scaled inference-time compute. The longer o1 was allowed to “think” on a query, the better it performed on hard reasoning tasks.
This concept — variously called inference-time scaling, test-time compute scaling, or chain-of-thought scaling — became the dominant research paradigm of 2024–2025, producing a new class of “reasoning models” from every major AI lab.
The Old Paradigm: Training-Time Scaling
From 2017 to 2024, the dominant thesis of AI capability growth was training-time scaling: more data, more parameters, more compute at training equals better performance. This produced GPT-3, GPT-4, Claude, Gemini — increasingly capable models that stored ever more knowledge and pattern-recognition in their weights.
But by 2024, evidence was mounting that training-time returns were diminishing. The jump from GPT-3 (175B) to GPT-4 was enormous; the jump from GPT-4 to GPT-4o, while useful, was smaller. The field needed a new axis.
The New Paradigm: Test-Time Compute
The core insight was deceptively simple: allow the model to think before answering.
By using reinforcement learning to train models to generate an extended internal chain of thought, with rewards based on whether the final answer is correct, OpenAI reported that:
- Performance scales with reasoning length: on hard math and coding problems, longer reasoning chains reliably improve accuracy
- Emergent self-correction: models learn to check their work, catch errors, and backtrack — behaviors that appear spontaneously when incentivized by outcome rewards
- Compute flexibility: users can choose how much computation to spend (fast vs. careful thinking) based on task importance
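The outcome-reward signal behind these behaviors can be sketched as a function that scores only the final answer and never the reasoning text. This is a minimal illustration, not OpenAI's actual training setup: the `Answer:` marker convention and the names `extract_final_answer` and `outcome_reward` are assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical convention: the model ends its output with "Answer: <value>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(.+?)\s*$", re.MULTILINE)


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the final answer matches the reference, else 0.0.

    The chain of thought is never scored directly, so checking and
    backtracking are rewarded only through their effect on the answer.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer.strip() == reference.strip() else 0.0
```

Because only the last `Answer:` line counts, a completion that first writes a wrong answer and then corrects it still earns full reward, which is one plausible reason self-correction pays off under this kind of signal.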
A key piece of supporting evidence came from a Google DeepMind / UC Berkeley paper (August 2024), “Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters”, which showed that on problems where a smaller model has a non-zero success probability, compute-optimal use of test-time compute can outperform a model 14× larger in a FLOPs-matched comparison. OpenAI’s o1 was released in September 2024, the month after this paper appeared; whether the timing reflects independent parallel development or rapid incorporation remains a matter of debate.
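The condition about non-zero success probability matters because extra samples can only help if some of them are correct. A rough way to see this, assuming independent samples with a fixed per-sample success probability `p` and a perfect verifier (both simplifications introduced here, not taken from the paper), is:

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumes each sample succeeds independently with probability p and that
    a perfect verifier can pick out any correct sample.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n
```

Under these assumptions, `coverage(0.0, 1000)` stays at zero no matter how many samples are drawn, while `coverage(0.05, 64)` is roughly 0.96. Real verifiers are imperfect and samples are correlated, so this is an upper bound on what sampling alone can deliver.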
Two Mechanisms
Inference-time scaling operates through two complementary mechanisms:
Sequential Scaling (Think Longer)
- The model generates an extended reasoning chain before outputting an answer
- Each step builds on previous steps, enabling multi-hop reasoning, error correction, and hypothesis testing
- Accuracy improves approximately log-linearly, that is, roughly linearly in the logarithm of the number of reasoning tokens generated
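Read literally, a log-linear relationship means accuracy grows by a roughly fixed amount each time the reasoning budget is multiplied by a constant factor. As a sketch, with $a$ and $b$ as task-dependent fitted constants (assumed here, not taken from any reported result) and $T$ the number of reasoning tokens:

```latex
\mathrm{acc}(T) \approx a + b \log T
\quad\Longrightarrow\quad
\mathrm{acc}(2T) - \mathrm{acc}(T) \approx b \log 2
```

The approximation can only hold over a limited range, since accuracy is capped at 1 and the gain per doubling must eventually flatten.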
Parallel Scaling (Think Wider)
- Generate multiple independent reasoning chains (Best-of-N sampling)
- Use a separate verifier model to select the best candidate answer
- Particularly effective for problems with objectively checkable answers (math, code execution)
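The parallel mechanism can be sketched in a few lines. This is a minimal illustration under stated assumptions: `generate` and `score` are hypothetical callables standing in for a sampling model and a verifier, and `majority_vote` is included as a common verifier-free variant that the list above does not itself describe.

```python
from collections import Counter
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 8,
) -> str:
    """Sample n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda candidate: score(prompt, candidate))


def majority_vote(answers: Sequence[str]) -> str:
    """Verifier-free alternative: return the most frequent answer."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return Counter(answers).most_common(1)[0][0]


# Toy stand-ins (hypothetical): a "model" that replays fixed guesses and a
# "verifier" that knows the correct answer to this one question.
_guesses = iter(["5", "4", "3", "4"])


def toy_generate(prompt: str) -> str:
    return next(_guesses)


def toy_score(prompt: str, answer: str) -> float:
    return 1.0 if answer == "4" else 0.0


print(best_of_n("What is 2 + 2?", toy_generate, toy_score, n=4))  # 4
print(majority_vote(["5", "4", "3", "4"]))  # 4
```

The quality of `score` is the limiting factor: with an exact checker, such as running unit tests on generated code, selection is reliable, whereas a learned verifier can be fooled by confident but wrong candidates.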
The Reasoning Model Generation
The inference-time scaling paradigm produced an entire family of “thinking models” across labs:
| Model | Lab | Release |
|---|---|---|
| o1 | OpenAI | Sep 2024 |
| QwQ | Alibaba | Nov 2024 |
| DeepSeek R1 | DeepSeek | Jan 2025 |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | Feb 2025 |
| Gemini 2.5 Pro (thinking) | Google DeepMind | Mar 2025 |
| o3, o4-mini | OpenAI | Apr 2025 |
Why This Matters
Inference-time scaling represents a regime change in AI capability acquisition, with several downstream implications:
Democratization: DeepSeek R1 (January 2025) demonstrated that the inference-time paradigm could be implemented far more cheaply than previously assumed. Its R1-Zero variant was trained with pure reinforcement learning and no supervised fine-tuning, while the released R1 added a cold-start supervised stage to improve readability and stability.
New economics: AI cost now has two dimensions: training cost (fixed) and inference cost (per query, variable by task difficulty). Difficult tasks that require extended thinking become more expensive; simple tasks remain cheap.
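The per-query side of this cost model can be made concrete with simple arithmetic. The sketch below assumes that reasoning tokens are billed at the same rate as output tokens; the `Pricing` class, the `query_cost` function, and the prices used are hypothetical and do not reflect any vendor's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real vendor rates)."""

    input_per_million: float
    output_per_million: float


def query_cost(
    pricing: Pricing,
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
) -> float:
    """Cost of one query, assuming reasoning tokens are billed as output."""
    billed_output = reasoning_tokens + answer_tokens
    total = (
        input_tokens * pricing.input_per_million
        + billed_output * pricing.output_per_million
    )
    return total / 1_000_000


pricing = Pricing(input_per_million=1.0, output_per_million=4.0)
easy = query_cost(pricing, input_tokens=500, reasoning_tokens=0, answer_tokens=200)
hard = query_cost(pricing, input_tokens=500, reasoning_tokens=20_000, answer_tokens=200)
print(f"easy: ${easy:.4f}, hard: ${hard:.4f}")
```

With these made-up numbers the prompt and the visible answer are identical in both calls, yet the hard query costs dozens of times more, entirely because of the hidden reasoning tokens.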
Benchmark recalibration: Many benchmarks previously considered “solved” (like MATH) were re-evaluated, because the right test of reasoning is not whether a model knows the answer but whether it can derive it under constrained conditions.
The unsolved problem: Inference scaling works best when correct answers can be verified. For open-ended tasks (creative writing, strategic advice, value judgments), the right verifier remains an unsolved research problem — and may require human judgment at scale.