The Birth of Artificial Minds
1943 – 1969
The dream of thinking machines is older than computers themselves. In 1943, Warren McCulloch and Walter Pitts published “A Logical Calculus of Ideas Immanent in Nervous Activity” — the first mathematical model of a neuron. Their artificial neuron was binary: it could fire or not fire, nothing in between. It couldn't learn. But it proved something profound: neural computation was mathematically possible. A machine could, in principle, mimic the logic of a brain.
Fourteen years later, Frank Rosenblatt built the Mark I Perceptron at Cornell — the first machine that could learn from data. The U.S. Navy funded it. The New York Times ran a breathless headline: “the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.” The perceptron could classify simple visual patterns by adjusting its connection weights through training. It was primitive, but it was learning.
In 1966, Joseph Weizenbaum at MIT created ELIZA — the first chatbot. ELIZA mimicked a psychotherapist using simple pattern matching: when a user said “I feel sad,” ELIZA would respond “Tell me more about why you feel sad.” It had zero understanding of meaning. Yet people were shocked at how easily they were fooled — some insisted ELIZA truly understood them, even after Weizenbaum explained the trick. He was horrified. The phenomenon became known as the ELIZA effect: the human tendency to attribute intelligence to machines that mimic human patterns. It is alive and well today with modern chatbots.
Then came the crash. In 1969, Marvin Minsky and Seymour Papert published “Perceptrons” — a rigorous mathematical analysis proving that single-layer perceptrons cannot learn functions that are not linearly separable. The most famous example was the XOR problem: a perceptron simply cannot learn the exclusive-or function, because no single line can separate the two classes. The book was devastating. Funding dried up virtually overnight. Researchers abandoned neural networks. The first AI Winter began.
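The XOR limitation is easy to see in code. Below is a minimal sketch of Rosenblatt's perceptron learning rule (epoch count and learning rate are illustrative choices, not historical values): the same procedure that learns AND perfectly can never get all four XOR cases right, because no line separates XOR's classes.

```python
# A sketch of the perceptron learning rule: nudge the weights toward
# each misclassified example. Works for linearly separable data (AND),
# provably cannot reach 100% on XOR.

def train_perceptron(data, epochs=25, lr=0.1):
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for x1, x2, target in data:
            pred = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            err = target - pred
            w1 += lr * err * x1
            w2 += lr * err * x2
            b += lr * err
    return lambda x1, x2: 1 if w1 * x1 + w2 * x2 + b > 0 else 0

AND = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
XOR = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

f_and = train_perceptron(AND)
f_xor = train_perceptron(XOR)
print(sum(f_and(x1, x2) == t for x1, x2, t in AND))  # 4: AND is learned perfectly
print(sum(f_xor(x1, x2) == t for x1, x2, t in XOR))  # at most 3: XOR never is
```

No matter how long the XOR run continues, the weights keep cycling: the four constraints contradict each other for any single linear boundary.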
The lesson
AI progress has always been cyclical. Hype → disappointment → quiet progress → breakthrough. Understanding this cycle is key to understanding where we are today.
From Neuron to Perceptron
Explore the evolution of the first learning machine.
The McCulloch-Pitts neuron (1943): fixed weights, binary output. It could compute logic gates (AND, OR) but could NOT learn — weights had to be set by hand.
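The fixed-weight neuron described above fits in a few lines of Python. In this sketch the weights and thresholds are hand-chosen, exactly as McCulloch and Pitts intended: the neuron computes, but nothing in it learns.

```python
# A minimal McCulloch-Pitts neuron: binary inputs, fixed weights,
# a hard threshold. It fires (1) or it doesn't (0); nothing in between.

def mp_neuron(inputs, weights, threshold):
    """Fire if the weighted sum of binary inputs reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# AND gate: both inputs must be active to reach the threshold of 2.
AND = lambda x1, x2: mp_neuron([x1, x2], weights=[1, 1], threshold=2)
# OR gate: any single active input reaches the threshold of 1.
OR = lambda x1, x2: mp_neuron([x1, x2], weights=[1, 1], threshold=1)

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
print([OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0, 1, 1, 1]
```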
AI Winters and Hidden Progress
1970 – 2005
The First AI Winter (1974–1980) was brutal. In 1973, the Lighthill Report, commissioned by the British government, concluded that AI research was fundamentally flawed — promises had been wildly overstated, and the combinatorial explosion problem made general AI impossible with existing approaches. DARPA slashed funding. Universities closed AI labs. Researchers stopped using the term “artificial intelligence” in grant proposals to avoid instant rejection. Neural networks were considered a dead end.
But underground, the seeds of the future were being planted. In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published the modern formulation of backpropagation — an algorithm for training neural networks with multiple layers. By propagating error signals backwards through the network, each connection could learn how to adjust its weight. The XOR problem? Solved. Multi-layer networks could learn non-linear boundaries. This algorithm is still used today to train every single LLM.
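The core idea can be sketched with a tiny multi-layer network trained on XOR. This is a minimal illustration, not the 1986 paper's exact setup: layer sizes, learning rate, and iteration count below are arbitrary choices.

```python
import numpy as np

# Backpropagation on a 2-4-1 sigmoid network learning XOR, the function
# a single-layer perceptron cannot represent. The error signal is
# propagated backwards, layer by layer, to compute each weight's update.

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros((1, 4))   # hidden layer
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros((1, 1))   # output layer
sigmoid = lambda z: 1 / (1 + np.exp(-z))

losses, lr = [], 0.5
for _ in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: error flows from output back to the hidden layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The hidden layer gives the network the extra boundary a single line cannot provide; training drives the loss down where the perceptron could make no progress at all.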
Three years later, Yann LeCun applied backpropagation to convolutional neural networks (CNNs) for handwritten digit recognition. AT&T deployed his system to read ZIP codes on mail — the first real-world neural network application processing millions of envelopes. It worked, quietly, reliably, without fanfare.
The Second AI Winter (1987–1993) struck when the expert systems boom collapsed. Companies had invested millions in rule-based AI systems that turned out to be brittle and impossible to maintain. Japan's ambitious Fifth Generation Computer project, a $400 million bet on logic programming, failed to deliver on its promises. AI became a dirty word in funding proposals once again.
Yet the quiet work continued. In 1997, Sepp Hochreiter and Jürgen Schmidhuber invented Long Short-Term Memory (LSTM) networks, solving the vanishing gradient problem that had plagued recurrent networks. LSTMs could remember information across long sequences — a crucial capability for language processing. They would dominate NLP for the next twenty years, until Transformers replaced them.
Also in 1997, IBM's Deep Blue defeated world chess champion Garry Kasparov. The world was amazed. But this was brute-force search, not learning — Deep Blue evaluated 200 million positions per second using hand-crafted evaluation functions. It couldn't learn from its games, couldn't play checkers, and didn't advance general AI at all. It was a cultural milestone, not a scientific one.
Meanwhile, statistical methods shaped the NLP landscape. N-grams counted word sequences, TF-IDF weighted important terms, and Bag-of-Words threw away word order entirely. These approaches powered early search engines and spam filters — useful, but fundamentally limited. Computers processed language as symbols, not meaning. For 60 years, NLP was about engineering features by hand. Each new task required new rules. This approach hit a ceiling — language was too complex to capture with explicit rules.
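These statistical methods fit in a few lines. The sketch below builds Bag-of-Words counts and TF-IDF weights from scratch on a three-document toy corpus (the corpus and the unsmoothed IDF formula are illustrative choices; production libraries add smoothing and normalization).

```python
import math
from collections import Counter

# Bag-of-Words + TF-IDF on a toy corpus: word order is thrown away,
# and ubiquitous words ("the") are down-weighted relative to rare ones.

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
vocab = sorted(set(w for doc in tokenized for w in doc))

# Bag-of-Words: only counts remain.
bow = [Counter(doc) for doc in tokenized]

# IDF: words appearing in fewer documents score higher.
N = len(docs)
idf = {w: math.log(N / sum(w in doc for doc in tokenized)) for w in vocab}

# TF-IDF vector per document: term frequency times inverse document frequency.
tfidf = [{w: c[w] / len(doc) * idf[w] for w in doc}
         for c, doc in zip(bow, tokenized)]

print(tfidf[0]["cat"] > tfidf[0]["the"])  # True: "the" is nearly everywhere
```

Note what the representation loses: "cat" and "cats" are unrelated symbols here, and "dog bites man" would equal "man bites dog". That ceiling is exactly what the next era broke through.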
The hidden foundation
During the winters, the foundations of modern AI were quietly being built. Backpropagation, CNNs, and LSTMs — all invented during “AI winters” — are the direct ancestors of today's LLMs. The lesson: real progress often happens when nobody is watching.
AI Hype Cycles
Click on milestones to learn more. Glowing dots mark breakthroughs during AI winters.
Deep Learning Ignition
2006 – 2016
In 2006, Geoffrey Hinton showed that deep neural networks could be trained effectively using layer-by-layer pretraining with Deep Belief Networks. The term “deep learning” was born. After decades of winter, this was the first crack in the wall — proof that depth wasn't just possible, but beneficial. The research world took notice, cautiously.
Then came 2012 — THE turning point. Alex Krizhevsky's deep CNN, AlexNet, was trained on NVIDIA GPUs and entered the ImageNet Large Scale Visual Recognition Challenge. The result was unprecedented: 15.3% error rate versus 26.2% for the runner-up. That wasn't an incremental improvement — it was a chasm. The entire computer vision field pivoted to deep learning overnight. GPUs, once used only for video games, became the new oil of AI. Every major tech company scrambled to build GPU clusters.
In 2013, Tomas Mikolov at Google published Word2Vec — and everything changed for language. For the first time, words became vectors in a continuous space, where mathematical operations captured meaning: king - man + woman ≈ queen. Words with similar meanings clustered together automatically. This was the seed of modern language understanding — the idea that meaning could be geometry in a high-dimensional space.
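The famous vector arithmetic can be illustrated with hand-crafted toy vectors. The three dimensions below (roughly: royalty, gender, fruitiness) are invented for illustration; real Word2Vec learns hundreds of opaque dimensions from billions of words.

```python
import numpy as np

# Toy word vectors where meaning is geometry: the gender offset between
# "king" and "man" is the same as between "queen" and "woman".

vecs = {
    "king":  np.array([0.9,  0.9, 0.0]),
    "queen": np.array([0.9, -0.9, 0.0]),
    "man":   np.array([0.1,  0.9, 0.0]),
    "woman": np.array([0.1, -0.9, 0.0]),
    "apple": np.array([0.0,  0.0, 0.9]),
}

def nearest(v, exclude):
    """Most cosine-similar word to vector v, skipping the query words."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude), key=lambda w: cos(v, vecs[w]))

target = vecs["king"] - vecs["man"] + vecs["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```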
Recurrent Neural Networks (RNNs) became the architecture of choice for sequence tasks. They processed text token by token, maintaining a hidden state — a form of memory. But they had a fatal flaw: vanishing gradients. By the time an RNN reached the end of a long sentence, it had forgotten the beginning. LSTM and GRU networks added gate mechanisms — forget, input, and output gates that controlled information flow, dramatically improving memory for medium-length sequences.
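The vanishing gradient problem has a simple numerical core: backpropagating through T time steps multiplies the error signal by the recurrent Jacobian T times. The sketch below strips the RNN down to that repeated multiplication (the tanh derivative in a real RNN would shrink the signal even further).

```python
import numpy as np

# Why RNN gradients vanish: repeated multiplication by a matrix whose
# largest singular value is below 1 shrinks the signal geometrically.

rng = np.random.default_rng(42)
W = rng.normal(size=(8, 8))
W *= 0.9 / np.linalg.norm(W, 2)      # rescale: largest singular value = 0.9

grad = np.ones(8)                    # error signal at the final time step
norms = []
for t in range(50):                  # walk back 50 time steps
    grad = W.T @ grad
    norms.append(np.linalg.norm(grad))

print(f"after 1 step: {norms[0]:.3f}, after 50 steps: {norms[-1]:.2e}")
```

After 50 steps the signal is bounded by 0.9⁵⁰ ≈ 0.005 of its starting size, so the earliest tokens receive almost no learning signal. LSTM gates exist precisely to give the signal a protected path around this multiplication.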
In 2014, Ian Goodfellow invented Generative Adversarial Networks (GANs) — two neural networks competing against each other, one generating images and one trying to detect fakes. The generator kept improving until its output was indistinguishable from real data. GANs led directly to modern image generation: Stable Diffusion, DALL-E, and Midjourney all trace their lineage here. The same year, the Seq2Seq architecture introduced encoder-decoder models for machine translation.
2015 brought two critical ingredients. Bahdanau, Cho, and Bengio added an attention mechanism to sequence-to-sequence models. Instead of compressing an entire input sentence into a single fixed-size vector, the model could “look back” at any part of the input when generating each output word. This idea became the heart of the Transformer. In the same year, Kaiming He and colleagues introduced ResNet with residual (skip) connections, allowing gradients to flow through very deep networks — 152+ layers. Without skip connections, Transformers wouldn't work.
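Why skip connections keep deep networks trainable can be shown directly: the Jacobian of y = x + F(x) is I + J_F, so the backward pass always has an identity path. In this sketch F is simplified to a linear map with small spectral norm; real residual blocks wrap convolutions or attention, but the gradient argument is the same.

```python
import numpy as np

# Gradient flow through 50 layers, with and without skip connections.
# Plain stack: gradient is multiplied by W^T each layer and vanishes.
# Residual stack: gradient is multiplied by (I + W)^T; the identity
# path keeps it alive no matter how small W is.

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
W *= 0.3 / np.linalg.norm(W, 2)          # small spectral norm (0.3)

g_plain = np.ones(16)
g_res = np.ones(16)
for _ in range(50):                      # backprop through 50 layers
    g_plain = W.T @ g_plain              # shrinks by up to 0.3 per layer
    g_res = g_res + W.T @ g_res          # identity term survives every layer

print(f"plain: {np.linalg.norm(g_plain):.2e}, "
      f"residual: {np.linalg.norm(g_res):.2e}")
```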
The convergence
Every piece fell into place independently: GPUs (hardware), ImageNet (data), CNNs (vision), Word2Vec (language), RNNs/LSTMs (sequence modeling), attention (mechanism), ResNets (depth). The Transformer unified them all.
The Compute Explosion
Hover over data points to see what unlocked each breakthrough.
Compute grew exponentially → at the GPU threshold (2012), deep learning became viable → once viable, progress became exponential too. Each 10x in compute unlocked qualitatively new capabilities.
Architecture Evolution
RNN → LSTM → Transformer: how the architecture changed
Processes sequences one token at a time, passing hidden state forward. The foundation of neural NLP.
The Transformer Revolution
2017 – 2018
In June 2017, Vaswani et al. at Google published “Attention Is All You Need” — arguably the most influential paper in AI history. The key insight was radical: throw away recurrence entirely, use ONLY attention. Their Transformer architecture replaced recurrence with self-attention: every token can directly attend to every other token, enabling massive parallelism and eliminating the vanishing gradient problem. Instead of processing tokens one at a time, the Transformer could handle all tokens in parallel — a perfect fit for GPU hardware. This meant faster training, longer context windows, and better performance. This one paper changed everything.
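The self-attention computation at the heart of the paper can be sketched in a few lines of NumPy. The weight matrices below are random placeholders standing in for learned parameters, and this single head omits the multi-head split, masking, and positional encodings of the full architecture.

```python
import numpy as np

# Scaled dot-product self-attention (single head): every token queries
# every other token in one matrix multiply -- no recurrence, fully parallel.

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 5, 8                                          # 5 tokens, 8-dim embeddings
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

out, attn = self_attention(X, Wq, Wk, Wv)
print(out.shape, attn.shape)   # (5, 8) (5, 5)
```

Every output row is a weighted mix of all value vectors, and all T tokens are processed in the same matrix multiplications. That is the parallelism that made GPUs and Transformers such a good match.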
In 2018, two approaches emerged from the Transformer. OpenAI released GPT-1 (117M parameters): a decoder-only model that generates text left-to-right, trained to predict the next token. It was the first Generative Pre-trained Transformer, showing that a decoder-only Transformer trained on raw text could then be fine-tuned for downstream tasks. The generative pre-training paradigm was born.
Four months later, Google released BERT (340M parameters): an encoder-only model that understands text bidirectionally, trained by masking random words and predicting them. It dominated virtually every NLP benchmark, proving that pre-training on massive unlabeled text followed by fine-tuning was the way forward. GPT was more flexible for generation, but BERT initially seemed superior for understanding tasks. The debate would be settled by scale.
Why it matters
The Transformer didn't just improve performance — it changed the economics of AI. Parallel processing meant you could throw more GPUs at training. This opened the door to scaling — and scaling, as it turned out, was all you needed.
Architecture Evolution
Click each architecture to see what changed and why.
Interactive Timeline
Click a dot to see details. Filter by category.
The Scaling Era
2019 – 2021
OpenAI released GPT-2 in February 2019 with 1.5 billion parameters — and controversially withheld the full model, deeming it “too dangerous to release.” Its coherent multi-paragraph text generation startled the AI community.
Then came the bombshell: GPT-3 in June 2020, with 175 billion parameters. GPT-3 exhibited in-context learning — it could perform new tasks just from a few examples in the prompt, without any fine-tuning. Few-shot prompting was born, and the entire field shifted.
Researchers discovered scaling laws: model performance improved predictably with more parameters, more data, and more compute. Kaplan et al. (2020) showed smooth power-law relationships. Later, DeepMind's Chinchilla paper (2022) refined this — smaller models trained on more data could outperform larger ones, establishing compute-optimal training ratios.
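The Chinchilla result can be made concrete with the loss model the paper fit: a constant floor plus power laws in parameters N and training tokens D. The constants below are approximately the published fit from Hoffmann et al. (2022); treat them as illustrative rather than exact.

```python
# Chinchilla-style loss model: L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are approximately the fit reported by Hoffmann et al. (2022).

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# A 175B-parameter model on 300B tokens (roughly GPT-3's recipe) vs. a
# 70B model on 1.4T tokens (roughly Chinchilla's compute-optimal recipe):
big_undertrained = chinchilla_loss(175e9, 300e9)
small_welltrained = chinchilla_loss(70e9, 1.4e12)
print(f"175B/300B: {big_undertrained:.3f}  vs  70B/1.4T: {small_welltrained:.3f}")
```

The smaller, data-rich model comes out ahead, which is exactly the paper's point: at a fixed compute budget, parameters and tokens should be scaled together.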
Perhaps most striking were emergent abilities — capabilities that appeared suddenly at certain model sizes: arithmetic, multilingual translation, and code generation seemingly “switched on” beyond specific parameter thresholds.
Parameter Growth
Hover a dot for details. Click an organization in the legend to filter.
Training Data Growth
Bar heights use log scale (log₁₀ of tokens)
The ChatGPT Moment
2022 – 2023
In January 2022, OpenAI published InstructGPT — demonstrating how RLHF (Reinforcement Learning from Human Feedback) could align a language model with human preferences. The model went from “capable but unpredictable” to “helpful, harmless, and honest.”
On November 30, 2022, OpenAI launched ChatGPT — and shattered every growth record. 100 million users in 2 months. For the first time, non-technical people could converse naturally with AI. The world woke up to what large language models could do.
In March 2023, GPT-4 arrived — multimodal (text + images), scoring around the 90th percentile on the bar exam, excelling on the SAT, and writing production-quality code. It demonstrated that LLMs could be genuinely useful professional tools across many domains.
The open-source community responded. Meta released LLaMA in February 2023, sparking an explosion of open models. Mistral showed that small, well-trained models could punch far above their weight. An AI race began: Google launched Gemini, Anthropic released Claude, and dozens of startups entered the fray.
Benchmark Progress
The Modern Landscape
2024 – 2026
The frontier models race continues: GPT-4o brought native multimodality at lower cost, Claude 3.5 Sonnet surpassed larger models at half the price, and Gemini 2.0 pushed Google's native multimodal capabilities.
Open-source made massive strides. Llama 3 405B matched closed-source frontiers. DeepSeek V3 proved that efficient MoE training (trained for just $5.5M) could produce frontier-competitive models. Mistral and Qwen continued pushing the small-model frontier.
A new paradigm emerged: reasoning models. OpenAI's o1 introduced chain-of-thought at inference time, reaching graduate-level performance on math and science benchmarks. DeepSeek R1 brought open-weight reasoning. Claude gained extended thinking capabilities. These models “think” before answering, spending more compute at inference time to work through problems step by step. Chain-of-thought became a training technique, not just a prompting technique.
Context windows exploded: from 4K tokens (GPT-3) to 1M+ tokens (Gemini). The cost per token collapsed by 100x in two years. Multimodality — text, images, audio, video — became the standard, not the exception.
Anthropic's Claude took a distinctive approach with Constitutional AI — training the model to follow a set of principles rather than relying solely on human feedback. From Claude 2 through Claude 3 (Haiku, Sonnet, Opus) to Claude 4, the focus has been on safety, helpfulness, and honesty as core values, not afterthoughts.
AI Agents represent the latest frontier. LLMs evolved from chatbots into autonomous tools that use other tools, browse the web, write and execute code, and orchestrate complex multi-step workflows. Claude Code, Cursor, and Devin are early examples of a future where AI doesn't just answer questions — it completes tasks.
What's different about this AI wave? Unlike previous cycles, this one has a revenue model. ChatGPT, Claude, Gemini, and GitHub Copilot are products with millions of paying users. AI companies are generating billions in revenue. This means development won't stop even if hype cools — there is too much money flowing in. The winter, if it comes, will be milder than the ones that came before.
Where we stand
We are living in the most rapid period of AI advancement in history. The question is no longer “will AI be useful?” but “how do we build with it responsibly?”
Modern Model Landscape
Click on any model to see details. Filter by type.
What's Next
The near future
AI Agents are the next frontier. Models are evolving from chatbots into autonomous tools that can browse the web, write and execute code, manage files, and orchestrate complex workflows. Claude Code, Devin, and computer-use agents hint at a future where AI is a capable collaborator, not just a text generator.
Reasoning and planning are becoming first-class capabilities. Models that “think before answering” through extended chain-of-thought are tackling problems that require genuine multi-step logic.
Compact models are closing the gap with frontier giants through distillation, quantization, and architectural innovations. A 7B-parameter model today can match what 175B-parameter models did just two years ago.
The open vs. closed debate intensifies. Open-weight models (Llama, DeepSeek, Mistral) prove that access drives innovation, while closed-model providers argue that safety requires control. The balance between openness and safety will shape the field.
Why This Matters For You
Every era of AI and LLM evolution maps directly to the skills you'll learn on this platform:
Tokenization & Embeddings
How models turn text into numbers (Word2Vec legacy)
Attention & Transformers
The architecture that powers everything since 2017
Prompting Techniques
Leveraging in-context learning discovered in the scaling era
AI Agents
The next frontier: models that act, not just respond
Production & Safety
Deploying LLMs responsibly in the real world
Understanding the history gives you context for every technique, architecture, and practice you'll encounter. You'll know why things work the way they do, not just how to use them.
Ready to start learning?
Your next step is Neural Networks — understanding how the building blocks of modern AI actually work. It's the foundation for everything else.