Overview
GPT-4, released on March 14, 2023, represented OpenAI’s most significant leap in large language model capability since the debut of GPT-3 in 2020. Unlike its predecessors, GPT-4 was a multimodal system, capable of accepting both text and image inputs and generating text outputs across a broad range of tasks. OpenAI reported human-level performance on numerous professional and academic benchmarks: the model scored around the 90th percentile on a simulated Uniform Bar Examination and around the 88th percentile on the LSAT, and it performed strongly on the quantitative and verbal sections of the Graduate Record Examination (GRE).
The release marked a deliberate strategic shift by OpenAI. Where GPT-3 had surprised the world primarily through sheer scale—175 billion parameters and an unprecedented ability to generate coherent, extended text—GPT-4 distinguished itself through measurable reasoning capability, nuanced understanding, and the capacity to process visual information alongside textual prompts. The model could analyze a diagram, interpret a photograph, read a code snippet, and respond with structured reasoning that reflected an understanding of context and intent far beyond pattern matching.
OpenAI’s decision to release GPT-4 came after months of internal evaluation. The company had developed the model behind closed doors, subjecting it to extensive red-teaming, adversarial testing, and capability evaluation before making it available on March 14, 2023, both to ChatGPT Plus subscribers and, through a waitlist, to developers via the API. This measured rollout reflected OpenAI’s awareness that the model’s capabilities carried both promise and peril: a system capable of passing the bar exam could also generate persuasive misinformation, and a model with stronger reasoning ability required careful alignment work before confronting a global user base.
What GPT-4 Actually Was
At its core, GPT-4 was a large multimodal model based on a transformer architecture, but the specifics of that architecture (model size, layer depth, attention head counts) were deliberately withheld by OpenAI, a departure from the relative transparency surrounding GPT-3. The 2020 paper introducing GPT-3 had disclosed its 175-billion-parameter scale. For GPT-4, OpenAI cited the competitive landscape and the safety implications of large-scale models as justification for keeping architectural details proprietary. What was known, however, was that the model was pre-trained to predict the next token and then refined with supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), a pipeline that had been pioneered with InstructGPT and ChatGPT and was reportedly executed here at a considerably larger scale.
The training pipeline itself was a multi-month endeavor. Unlike the relatively straightforward next-token prediction training of early language models, GPT-4 underwent extensive post-training refinement. Human annotators generated demonstration data for supervised fine-tuning, teaching the model to perform tasks in ways that reflected human judgment and intent. Following this, the model was exposed to a large dataset of human preferences—comparisons between model outputs, ranked by human raters—and trained using RLHF to optimize for responses that humans found helpful, honest, and harmless.
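The preference-comparison step described above can be illustrated with the pairwise loss commonly used to train reward models in published RLHF work such as InstructGPT. This is a minimal sketch under that assumption; OpenAI has not disclosed the objective used for GPT-4, and the function names and reward values below are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one, and large when it does not.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mean_preference_loss(comparisons: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    if not comparisons:
        raise ValueError("at least one comparison is required")
    return sum(preference_loss(c, r) for c, r in comparisons) / len(comparisons)


# Hypothetical reward-model scores for three human-ranked comparisons.
example_pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.4)]
example_loss = mean_preference_loss(example_pairs)
```

In this formulation the reward model is trained to widen the margin between preferred and rejected outputs, and the language model is then optimized against that learned reward.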
Estimates of the compute and infrastructure cost of training GPT-4 have ranged from $40 million to over $100 million, with analysis by researchers suggesting the training run required thousands of high-performance GPUs running for several months on Microsoft’s Azure cloud infrastructure. This represented a substantial investment, reflecting both the computational intensity of training a model at GPT-4’s capability level and the expense of the human annotation effort required for RLHF. The decision to train on Azure was not incidental; it reflected the deepened partnership between Microsoft and OpenAI, which had committed billions in investment in exchange for preferential access to Azure compute and exclusive integration rights for OpenAI’s models in Microsoft products.
The Eight Months of Hidden Development
The gap between GPT-4’s internal completion and its public release spanned approximately eight months—a period during which OpenAI conducted extensive internal testing, external red-teaming with select partners, and capability evaluations designed to assess the model’s readiness for deployment at scale. This period of hidden development was not merely a quality assurance exercise; it reflected a strategic calculation about the optimal moment to reveal GPT-4’s capabilities to the world.
During these eight months, OpenAI engaged in what internal sources described as “secret capability testing,” evaluating the model’s performance on an extensive suite of benchmarks, adversarial prompts, and domain-specific tasks. The company specifically sought to determine whether GPT-4 had crossed threshold capabilities—reasoning levels, factual accuracy thresholds, safety metrics—that would make it genuinely useful for professional applications without posing unacceptable risks. This testing included evaluations on standardized exams, coding challenges, and scientific reasoning tasks, as well as extensive probing for harmful outputs.
The decision to wait was influenced by several factors. First, OpenAI recognized that releasing a model that could pass the bar exam or medical board questions carried implications for how the model would be used, and misused, in high-stakes domains. Second, the company was aware that GPT-4’s multimodal capabilities, particularly its ability to analyze images, introduced novel safety considerations that required careful evaluation. Third, the partnership with Microsoft demanded coordination; Microsoft’s new Bing chat experience, which launched in February 2023, was built on GPT-4, and the release timeline needed to align with Microsoft’s product roadmap.
When OpenAI ultimately released GPT-4 on March 14, 2023, the announcement came via a blog post and technical report, with API access offered to developers through a waitlist. The company said it had spent six months iteratively aligning the model before release, and the accompanying system card stated that GPT-4 had finished training in August 2022. That timeline is broadly consistent with the roughly eight-month hidden development period reported by external observers and corroborated by subsequent investigative reporting.
Benchmark Performance
GPT-4’s performance on standardized and professional benchmarks was, by any measure, extraordinary. The model performed at or near the level of strong human test-takers across a battery of evaluations:
On a simulated Uniform Bar Examination, GPT-4 scored around the 90th percentile among human test-takers according to OpenAI, a result that would have been inconceivable for a language model just a few years prior. This was not a narrow or cherry-picked success; GPT-4 demonstrated competence across the seven subject areas of the Multistate Bar Examination (MBE): Civil Procedure, Constitutional Law, Contracts, Criminal Law and Procedure, Evidence, Real Property, and Torts.
On the LSAT (Law School Admission Test), GPT-4 achieved a score in the 88th percentile. The test, which measures logical reasoning and analytical reading skills, had long been considered a benchmark of human cognitive aptitude. GPT-4’s performance placed it above the median score of students admitted to many accredited law schools.
On Graduate Record Examination (GRE) quantitative reasoning, GPT-4 scored around the 80th percentile. On the GRE verbal reasoning section, its performance was stronger still, at roughly the 99th percentile. The GRE analytical writing section, a task requiring the generation of sustained, structured arguments, was a relative weakness: OpenAI reported a score of 4 out of 6, around the 54th percentile of human test-takers.
On the United States Medical Licensing Examination (USMLE), independent evaluations using official practice materials found that GPT-4 exceeded the passing threshold on all three steps (Step 1, Step 2CK, and Step 3) by a wide margin. The model demonstrated not only factual recall but the ability to synthesize clinical information, interpret laboratory findings, and reason about patient care scenarios.
On coding benchmarks, GPT-4 was evaluated on HumanEval, a set of Python code generation problems introduced alongside OpenAI’s Codex model, and achieved a pass@1 rate of 67.0%, compared to 48.1% for GPT-3.5. In practice, GPT-4 could solve coding problems that required multi-step reasoning and the integration of multiple library functions.
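For readers unfamiliar with the pass@1 metric, the following sketch shows the unbiased pass@k estimator described in the paper that introduced HumanEval. It is an illustration only: the function names are hypothetical, the sample counts are made up, and this is not OpenAI’s evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: samples generated for a problem
    c: samples among them that passed the unit tests
    k: sample budget being scored (k = 1 for pass@1)
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per benchmark problem."""
    if not results:
        raise ValueError("at least one problem is required")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Made-up counts for three problems, 10 samples each.
example_results = [(10, 10), (10, 3), (10, 0)]
example_pass_at_1 = benchmark_score(example_results, k=1)
```

For k = 1 the estimator reduces to the fraction of samples that pass, so a reported pass@1 of 67% means roughly two in three single attempts produced code that passed the hidden tests.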
These benchmark results were not merely incremental improvements over GPT-3.5. The jump in performance—especially on reasoning-intensive tasks like bar exam essays, LSAT logical reasoning problems, and multi-step coding challenges—reflected a qualitative shift in capability. GPT-3.5 had been impressive as a text generator; GPT-4 demonstrated something closer to analytical reasoning.
GPT-4’s Leap Beyond GPT-3.5
The contrast between GPT-4 and GPT-3.5—the model powering the original ChatGPT, released in November 2022—was immediately apparent to anyone who interacted with both systems. While GPT-3.5 could generate plausible, fluent text and could follow simple instructions, it frequently failed on tasks requiring sustained logical reasoning, often contradicted itself within a single response, and was fundamentally limited to text inputs.
GPT-4’s improvements fell into several categories. First, and most visibly, was the multimodal capability. GPT-4 could accept images as inputs (a photograph of a chart, a screenshot of a web page, a diagram of an electrical circuit) and generate detailed textual descriptions, analyses, or responses based on that visual information. Because OpenAI did not disclose the architecture, how vision was integrated into the model is not publicly known, but the capability behaved as part of the model’s reasoning and not as a superficial add-on. At launch, image input was offered only as a limited research preview, with broader access following later in 2023.
Second, GPT-4 demonstrated substantially improved reasoning capabilities. On tasks requiring multi-step logical deduction, mathematical problem-solving with intermediate steps, or the synthesis of information from multiple sources, GPT-4’s performance was markedly superior to GPT-3.5. The model could work through a complex math problem, showing its reasoning step by step. It could analyze a legal contract and identify clauses that created potential conflicts or liabilities. It could review a piece of code and identify bugs, performance issues, or security vulnerabilities.
Third, GPT-4 introduced a significantly expanded context window. While GPT-3.5 supported 4,096 tokens (approximately 3,000 words), GPT-4 launched with an 8,192-token window and a 32,768-token variant; GPT-4 Turbo, announced in November 2023, extended this to 128,000 tokens. This expansion made it possible to provide the model with lengthy legal documents, long reports, or substantial portions of a codebase as input, and have it reason about them coherently. The practical implication was that GPT-4 could perform tasks that required understanding extended discourse: summarizing a long document, comparing clauses across a lengthy contract, or debugging a multi-file software project.
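The relationship between context windows and document length can be made concrete with some rough arithmetic. The sketch below assumes about 0.75 English words per token, a rule of thumb in line with the 4,096-token, 3,000-word figure above; real token counts depend on the tokenizer and the text, and the function names are hypothetical.

```python
def estimate_tokens(word_count: int) -> int:
    """Estimate tokens from words, assuming ~0.75 words per token."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    # Integer ceiling of word_count / 0.75, i.e. word_count * 4 / 3.
    return -(-word_count * 4 // 3)


def fits_in_context(word_count: int, window_tokens: int,
                    reserved_for_output: int = 0) -> bool:
    """Check whether a prompt of word_count words fits in the window.

    reserved_for_output sets aside tokens for the model's reply, since
    the prompt and the response share the same context window.
    """
    available = window_tokens - reserved_for_output
    return estimate_tokens(word_count) <= available


# A 3,000-word prompt against a 4,096-token window.
short_prompt_fits = fits_in_context(3_000, 4_096)
# The same prompt once 500 tokens are held back for the answer.
short_prompt_with_reply_fits = fits_in_context(3_000, 4_096, reserved_for_output=500)
# A 90,000-word manuscript against a 128,000-token window.
manuscript_fits = fits_in_context(90_000, 128_000)
```

Because the reply consumes part of the same window, the usable prompt length is always somewhat shorter than the headline figure.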
Fourth, GPT-4 exhibited markedly improved instruction following and alignment. The RLHF training process had been refined significantly since InstructGPT, and GPT-4 was substantially better at doing exactly what was asked, in the format requested, without generating extraneous or inappropriate content. This made the model more reliably useful for specific task completion, as opposed to open-ended text generation.
What GPT-4 Revealed
GPT-4 revealed both the extraordinary potential and the persistent limitations of large language models. The model’s capabilities were, by any reasonable measure, transformative. It could pass professional exams, write functional code, analyze complex legal and scientific documents, and engage in nuanced reasoning that resembled human expertise in ways that previous models had not. Yet it also revealed persistent failure modes that no amount of scale or training had fully eliminated.
Hallucination, the generation of plausible but factually incorrect or fabricated information, remained a significant limitation. GPT-4 could be confident and articulate in stating facts that were simply wrong. It could cite non-existent academic papers, invent legal precedents, and describe events that never occurred. The model’s tendency to hallucinate was most pronounced when asked about specialized domains, proper nouns, or specific dates, precisely the areas where factual accuracy mattered most. OpenAI acknowledged that GPT-4 was “not fully reliable,” noting that it could hallucinate facts and make reasoning errors, a candid admission that set a precedent for transparency about model limitations.
The vision capability—GPT-4’s ability to process image inputs—was both a breakthrough and a source of new failure modes. The model could describe the content of a photograph, identify objects in a diagram, and reason about visual relationships. But it could also fail in surprising ways: misidentifying objects, misinterpreting charts, or drawing incorrect conclusions from visual information. The combination of text and image inputs created new opportunities for confusion, as the model might confidently describe an image in ways that reflected incorrect interpretation.
Long context understanding, while dramatically improved, was not perfect. With the 128,000-token context window later offered by GPT-4 Turbo, the model could in principle reason about documents of enormous length, but in practice, performance degraded when the model was asked to retrieve specific details from the middle of a very long context, a phenomenon sometimes called the “lost in the middle” problem. Attention mechanisms, even in a model as capable as GPT-4, could struggle to give equal weight to all parts of an extremely long input.
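One common way to probe this behavior is to plant a single fact at different depths in a long filler context and ask the model to retrieve it. The sketch below only constructs such prompts; the helper names and the filler text are hypothetical, and no model is called.

```python
def build_probe(filler: list[str], needle: str, depth: float) -> str:
    """Insert `needle` into the filler sentences at a relative depth.

    depth = 0.0 places the needle first, 1.0 places it last, and 0.5
    places it in the middle, where retrieval is reported to be weakest.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(depth * len(filler))
    return " ".join(filler[:index] + [needle] + filler[index:])


def needle_offset(prompt: str, needle: str) -> float:
    """Return where the needle starts, as a fraction of the prompt."""
    return prompt.index(needle) / max(len(prompt) - len(needle), 1)


# Hypothetical filler and fact; a real probe would use far longer text.
filler_sentences = [f"Filler sentence number {i}." for i in range(10)]
secret_fact = "The access code is 7291."
probes = {
    depth: build_probe(filler_sentences, secret_fact, depth)
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

Each prompt would then be followed by a question such as "What is the access code?", and accuracy plotted against depth; a dip near 0.5 would indicate the pattern described above.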
Safety characteristics represented a significant area of improvement relative to GPT-3.5, but not a complete solution. GPT-4 was substantially less likely to generate harmful content, refuse legitimate requests inappropriately, or engage in toxic output. However, researchers found that GPT-4 could still be manipulated into generating disallowed content through adversarial prompting techniques, and the model’s alignment was not so robust as to eliminate the need for external safety guardrails in high-stakes deployments.
The Enterprise AI Race It Catalyzed
The release of GPT-4 triggered an enterprise AI race that reshaped the technology industry within months. On the day of GPT-4’s release, Microsoft confirmed that the new Bing, its search engine rebuilt around AI chat capabilities and launched in February 2023, had been running on GPT-4 all along. This integration was not coincidental; it reflected the deep strategic partnership between Microsoft and OpenAI, which began with a $1 billion investment in 2019 and deepened with a reported $10 billion commitment in January 2023. Microsoft positioned Bing AI as the first major search engine to offer AI-powered conversational search, a direct challenge to Google’s search dominance.
Google, which had long been the dominant player in search and had pioneered the transformer architecture underlying GPT-4, found itself in the unexpected position of racing to catch up. Google’s internal response to GPT-4’s release was described in subsequent reporting as urgent and, at times, frantic. The company had its own large language model efforts—including Bard, built on the LaMDA architecture—but had been cautious about releasing them publicly due to concerns about accuracy and safety. GPT-4’s launch forced a recalibration. Google announced Bard in February 2023, just weeks before GPT-4’s release, and the subsequent GPT-4 launch accelerated Google’s internal timeline for AI product releases.
Google moved in parallel. It previewed generative AI features for its Workspace productivity suite (Docs, Sheets, Slides, and Gmail) in March 2023 and significantly expanded them at its I/O conference in May 2023. Microsoft, for its part, announced Microsoft 365 Copilot, built on GPT-4, on March 16, 2023. The enterprise AI race had escalated from a competition between chatbots to a full-spectrum contest for dominance in AI-powered productivity software.
Microsoft’s deepened partnership with OpenAI was reflected in the March 2023 announcement that GPT-4 was available in preview through the Azure OpenAI Service, offering the model to enterprise customers via a pay-per-token API. This gave businesses around the world access to GPT-4’s capabilities without needing to build their own AI infrastructure, a significant commercial expansion that reinforced Microsoft’s position as a leading cloud provider for AI workloads.
The competitive pressure also extended to startups and other hyperscalers. In April 2023, Amazon announced Bedrock, a service offering access to foundation models from several AI companies through AWS. Meta, which had released its LLaMA models to researchers in February 2023, followed with the openly licensed Llama 2 in July 2023, offering an alternative in an environment where OpenAI had established a commanding lead. The release of GPT-4 had, in effect, reset the competitive landscape of the entire AI industry.
The Bridge to Later Models
GPT-4 was not an endpoint but a waypoint. Within months of its release, OpenAI had begun the process of building upon GPT-4’s architecture and training methodology to develop models with still greater capability and different operating characteristics. The most significant of these subsequent releases was GPT-4 Turbo, announced in November 2023, which offered a 128,000-token context window, knowledge cutoff of April 2023, and significantly lower API pricing—addressing two of GPT-4’s notable limitations: context length and cost.
But the more consequential successor was the “o1” series, released in September 2024. The o1 models, o1-preview and o1-mini, represented a fundamental departure from the GPT series’ approach to reasoning. Where GPT-4 excelled at generating fluent, coherent responses based on patterns learned during training, o1 was designed to engage in explicit, multi-step reasoning before producing an answer. The model used a “chain of thought” approach internally, generating intermediate reasoning steps that allowed it to solve problems that GPT-4 found intractable, particularly in mathematics, coding, and scientific domains. On a qualifying exam for the International Mathematical Olympiad, OpenAI reported that o1 solved approximately 83% of problems, compared to 13% for GPT-4o.
The o3 model, announced in December 2024 and released in 2025, extended this reasoning capability further, achieving new state-of-the-art results on multiple benchmarks. Notably, a high-compute configuration scored 87.5% on ARC-AGI, a benchmark designed to measure fluid reasoning as opposed to learned pattern matching, exceeding the 85% target associated with the ARC Prize, although the result fell outside the prize’s compute limits. This progression, from GPT-4’s “impressive pattern recognition” to o1’s “explicit reasoning” to o3’s “general fluid reasoning,” traced a path that GPT-4 had made possible but not guaranteed.
Looking further ahead, GPT-5 was reported to be in development as of 2024-2025, with expectations that it would incorporate the reasoning advances of the o-series while restoring the broad conversational capability and multimodal polish that had characterized GPT-4. The interplay between the GPT and o-series architectures suggested a future where reasoning and language generation were not separate concerns but complementary aspects of a unified capability.
GPT-4’s release, in retrospect, marked the moment when the AI research community and the broader public recognized that the capabilities of large language models were no longer merely impressive—they were professionally relevant. The bar had been raised for what an AI system could be expected to do, and the race to exceed that bar had only begun.