InstructGPT: Language Models Can Learn to Follow Instructions

Overview

On January 27, 2022, OpenAI published the paper “Training language models to follow instructions with human feedback,” which introduced InstructGPT. ChatGPT would not launch until November 2022, but InstructGPT was the foundational technical work that made it possible. RLHF (Reinforcement Learning from Human Feedback) predates the paper, but InstructGPT established it as the primary method for aligning large language models with human intent.

The key insight: a language model trained purely on next-token prediction will optimize for “looking like text on the internet” — not for “being useful to a human user.” RLHF corrects this by training a reward model from human preference data, then fine-tuning with reinforcement learning to maximize that reward.
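
The reward model objective can be made concrete with a minimal sketch. The InstructGPT paper trains the reward model with a pairwise loss, -log(sigmoid(r_preferred - r_rejected)), averaged over labeler comparisons. The Python below computes that loss on plain scalar scores. The function names and example scores are hypothetical, and a real reward model would produce the scores with a neural network rather than take them as inputs.

```python
import math
from typing import Iterable, Tuple


def pairwise_preference_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_preferred - reward_rejected)).

    Computed as log(1 + exp(-margin)) in a numerically stable way.
    """
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, factor out exp(-margin) to avoid overflow.
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (preferred, rejected) score pairs."""
    losses = [pairwise_preference_loss(p, r) for p, r in pairs]
    if not losses:
        raise ValueError("at least one comparison pair is required")
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Hypothetical reward-model scores for three labeler comparisons.
    comparisons = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(mean_preference_loss(comparisons))
```

The loss is close to zero when the preferred output already scores well above the rejected one, and it grows roughly linearly when the ranking is reversed. That asymmetry is what pushes the reward model toward the labelers' ordering.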

Three-Step Process

1. SFT (Supervised Fine-Tuning): Fine-tune GPT-3 on curated demonstration data
2. Reward Model Training: Train a model to predict which of two outputs a human labeler would prefer
3. RL Fine-Tuning: Fine-tune the SFT model with PPO against the reward model
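
The third step can be sketched in a few lines. In the paper, the quantity PPO maximizes is not the raw reward model score: a KL penalty against the SFT model is subtracted, so the policy does not drift far from its starting point and over-optimize the reward model. The sketch below assumes per-token log-probabilities are already available as plain lists. The function name, argument names, and the default value of beta are illustrative assumptions, not the paper's code.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],
    sft_logprobs: Sequence[float],
    beta: float = 0.1,  # illustrative penalty weight, not a recommended setting
) -> float:
    """Reward model score minus beta times the policy/SFT log-probability ratio.

    policy_logprobs[i] and sft_logprobs[i] are the log-probabilities that the
    RL policy and the frozen SFT model assign to the i-th sampled token.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("both models must score the same tokens")
    # Summing per-token differences gives log(pi_RL(y|x) / pi_SFT(y|x)).
    log_ratio = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * log_ratio


if __name__ == "__main__":
    # Hypothetical two-token response that the policy finds more likely than SFT does.
    print(kl_penalized_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

When the policy assigns the sampled response the same probability as the SFT model, the penalty vanishes and the reward equals the reward model score. The further the policy moves toward responses the SFT model considered unlikely, the more reward it gives up.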

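The result: in human preference evaluations, labelers preferred outputs from the 1.3B-parameter InstructGPT model over outputs from the 175B-parameter GPT-3, despite InstructGPT having more than 100x fewer parameters. Scale alone was not what made a model useful; alignment was.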

Why It Was the Real Inflection Point

ChatGPT (November 2022) got the public’s attention. But InstructGPT was the technical inflection point:

RLHF and its variants became the standard approach for aligning subsequent large models (Claude, GPT-4, Gemini, Llama 2)
Smaller aligned models outperforming larger unaligned models shifted how researchers thought about scaling
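The human feedback mechanism established the paradigm that evolved into Constitutional AI (Anthropic, 2022), which replaces part of the human preference labeling with AI-generated feedback guided by a written set of principles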

References

Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.