Overview
In November 1997, Sepp Hochreiter and Jürgen Schmidhuber published a paper titled “Long Short-Term Memory” in the journal Neural Computation. The paper introduced LSTM (Long Short-Term Memory), a recurrent neural network architecture designed to learn long-range dependencies in sequential data. It addressed the vanishing gradient problem, which had made standard RNNs ineffective on anything beyond short sequences.
The paper was reportedly rejected by NIPS twice before the journal version was accepted. It has since become one of the most cited papers in computer science, with over 100,000 citations, and LSTM networks went on to be deployed in widely used products for speech recognition, translation, and text prediction.
The Problem It Solved
Standard RNNs trained with backpropagation through time (BPTT) suffer from the vanishing gradient problem: as the error signal is propagated back through more timesteps, gradients tend to shrink exponentially, so the network receives almost no learning signal for dependencies that span many timesteps.
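As a rough illustration of the shrinking gradient, the sketch below runs a small tanh RNN in NumPy and tracks the norm of the Jacobian of the final hidden state with respect to earlier hidden states. The function name bptt_gradient_norms, the layer size, the random inputs, and the choice to rescale the recurrent matrix to a spectral norm of 0.5 are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def bptt_gradient_norms(hidden_size=16, steps=50, spectral_norm=0.5, seed=0):
    """Norms of d h_T / d h_t in a toy tanh RNN, for t = T-1 down to 0."""
    rng = np.random.default_rng(seed)

    # Hypothetical recurrent weights, rescaled so the largest singular
    # value equals spectral_norm (an assumption chosen for illustration).
    W = rng.standard_normal((hidden_size, hidden_size))
    W *= spectral_norm / np.linalg.norm(W, 2)

    # Forward pass: h_t = tanh(W h_{t-1} + x_t), with random inputs x_t.
    h = np.zeros(hidden_size)
    jacobians = []
    for _ in range(steps):
        x = rng.standard_normal(hidden_size)
        h = np.tanh(W @ h + x)
        # d h_t / d h_{t-1} = diag(1 - h_t^2) @ W
        jacobians.append((1.0 - h ** 2)[:, None] * W)

    # Backward pass: multiply Jacobians from the last step to the first.
    grad = np.eye(hidden_size)
    norms = []
    for J in reversed(jacobians):
        grad = grad @ J
        norms.append(np.linalg.norm(grad, 2))
    return norms

if __name__ == "__main__":
    norms = bptt_gradient_norms()
    for k in (1, 10, 25, 50):
        print(f"{k:2d} steps back: gradient norm = {norms[k - 1]:.3e}")
```

Because each per-step Jacobian here has spectral norm at most 0.5, the norm after k steps back is bounded by 0.5 to the power k, which is the exponential decay described above. With a recurrent matrix whose spectral norm is well above 1, the same product can grow instead (the exploding gradient case).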
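LSTM addressed this with memory cells and gating mechanisms. A memory cell carries its state forward through an additive update (the “constant error carousel” of the original paper), so error signals can flow back across many timesteps without vanishing, while learned gates control what enters and leaves the cell:
- Forget gate: decides what information to discard (not part of the 1997 design; added in follow-up work by Gers, Schmidhuber, and Cummins in 2000)
- Input gate: decides what new information to store
- Output gate: decides how much of the cell state to expose as output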
These gates allow LSTMs to retain information across long time lags; the original paper reports bridging lags of more than 1,000 timesteps.
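The gating described above can be sketched as a single forward step of an LSTM cell in NumPy. This is a minimal sketch of the now-common variant with a forget gate, not the exact formulation of the 1997 paper. The function name lstm_step, the stacked weight layout (forget, input, output, candidate), and the bias values in the demo are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of an LSTM cell with forget, input and output gates.

    Assumed layout: W has shape (4 * hidden, n_in + hidden) and b has shape
    (4 * hidden,), stacked in the order forget, input, output, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b

    f = sigmoid(z[0 * hidden:1 * hidden])  # forget gate: what to discard
    i = sigmoid(z[1 * hidden:2 * hidden])  # input gate: what to store
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate: what to expose
    g = np.tanh(z[3 * hidden:4 * hidden])  # candidate values to write

    c = f * c_prev + i * g   # additive cell-state update
    h = o * np.tanh(c)       # gated output / new hidden state
    return h, c

if __name__ == "__main__":
    hidden, n_in = 4, 3
    W = np.zeros((4 * hidden, n_in + hidden))
    b = np.zeros(4 * hidden)
    b[0 * hidden:1 * hidden] = 10.0    # forget gate almost fully open (keep)
    b[1 * hidden:2 * hidden] = -10.0   # input gate almost closed (no writes)

    h, c = np.zeros(hidden), np.ones(hidden)
    for _ in range(100):
        h, c = lstm_step(np.ones(n_in), h, c, W, b)
    print("cell state after 100 steps:", c)
```

In the demo, the forget gate is biased open and the input gate biased shut, so the cell state is carried forward almost unchanged for 100 steps. The additive update of c is the part that lets gradients pass through many timesteps without the repeated squashing seen in a plain RNN.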
The Delayed Recognition
LSTM found early practical use in handwriting recognition, mobile keyboard prediction, and speech recognition. But its transformative potential was not fully realized until:
- 2013: Deep LSTM RNNs achieved state-of-the-art phoneme recognition on the TIMIT speech benchmark (Graves, Mohamed, and Hinton)
- 2015: Google’s voice recognition switched to LSTM-based models
- 2016: Google Translate switched to an LSTM-based neural machine translation system (GNMT), which Transformer-based models later displaced
Why It Matters
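LSTM is an important precursor of the attention mechanism: attention was first introduced (Bahdanau et al., 2014) as an addition to recurrent encoder-decoder models of the kind LSTM made practical. When Vaswani et al. (2017) introduced the Transformer, replacing recurrence with self-attention, the broader idea of learning what to retain and what to ignore was carried forward, though in the form of attention weights rather than gates. Without LSTM, the path from RNNs to modern transformers would have been far longer.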