Overview
On January 27, 2022, OpenAI published the paper “Training language models to follow instructions with human feedback,” which introduced InstructGPT. ChatGPT would not launch until November 2022, but InstructGPT was the foundational technical work that made it possible. The paper did not invent RLHF (Reinforcement Learning from Human Feedback), which earlier work had already applied to tasks such as summarization, but it established RLHF as the primary method for aligning large language models with human intent.
The key insight: a language model trained purely on next-token prediction will optimize for “looking like text on the internet” rather than for “being useful to a human user.” RLHF corrects this by training a reward model from human preference data, then fine-tuning the language model with reinforcement learning to maximize that reward.
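Written out, the two objectives behind that description look roughly as follows. This is a simplified sketch: the symbols are not from the text above, with r_θ standing for the reward model, y_w and y_l for the preferred and rejected outputs for a prompt x, π_φ for the model being fine-tuned, π_ref for the model before RL fine-tuning, and β for a penalty weight. The paper's full RL objective also includes a pretraining-data term that is omitted here.

```latex
% Reward model: make the preferred output y_w score higher than the rejected y_l
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l) \sim D}
  \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
\]

% RL fine-tuning: maximize reward while penalizing drift from the reference model
\[
\max_{\phi}\; \mathbb{E}_{x \sim D,\; y \sim \pi_\phi(\cdot \mid x)}
  \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} \right]
\]
```

The log-ratio term matters because the reward model is only a learned proxy. Without the penalty, the policy can drift toward outputs that score well but no longer read like sensible text.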
Three-Step Process
- SFT (Supervised Fine-Tuning): Fine-tune GPT-3 on curated demonstration data
- Reward Model Training: Collect labeler rankings of several model outputs per prompt, then train a model to predict which of two outputs a human labeler would prefer
- RL Fine-Tuning: Fine-tune the SFT model with PPO against the reward model
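The reward-model and RL steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not the paper's implementation: the function names are hypothetical, the scores and log-probabilities are assumed to come from real models, and the default `beta` is only a placeholder value.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def pairwise_reward_loss(scores_ranked: Sequence[float]) -> float:
    """Average -log sigmoid(r_w - r_l) over every (preferred, rejected) pair.

    scores_ranked holds reward-model scores for one prompt's outputs,
    ordered from most to least preferred by the labeler (assumed input).
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(scores_ranked, 2))
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in pairs) / len(pairs)


def kl_penalized_reward(
    reward: float, logp_policy: float, logp_ref: float, beta: float = 0.02
) -> float:
    """Reward score minus beta * log(pi_policy / pi_ref) for one sampled output.

    This is the per-sample quantity an RL algorithm such as PPO would maximize.
    beta here is an illustrative placeholder, not a tuned value.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_reward_loss([2.0, 1.0, 0.0]))
    print(pairwise_reward_loss([0.0, 1.0, 2.0]))
    # A policy that assigns its output more probability than the reference
    # model does has its reward reduced by the penalty.
    print(kl_penalized_reward(reward=1.0, logp_policy=-4.0, logp_ref=-5.0))
```

When the reward model cannot tell two outputs apart (equal scores), the loss for that pair is log 2, about 0.693. Training pushes it below that by widening the score gap in the direction the labeler chose.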
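The result: in human preference evaluations, labelers preferred outputs from the 1.3B-parameter InstructGPT model over those from the 175B-parameter GPT-3, despite InstructGPT having over 100x fewer parameters. Scale alone wasn’t enough; alignment was the missing piece.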
Why It Was the Real Inflection Point
ChatGPT (November 2022) got the public’s attention. But InstructGPT was the technical inflection point:
- RLHF became a standard method for aligning subsequent large models (Claude, GPT-4, Gemini, Llama 2)
- Smaller aligned models outperforming larger unaligned models shifted how researchers thought about scaling
- The human feedback mechanism established the paradigm that Constitutional AI (Anthropic, 2022) later built on, replacing part of the human feedback with AI-generated feedback