Watch a Tiny Transformer Learn Shakespeare

A 112k-parameter language model, written from scratch in ~150 lines of PyTorch, trained on 1MB of Shakespeare. Every line is explained. It actually works.


The idea in one paragraph

A language model learns by playing a guessing game: given a chunk of text, predict the very next character. Each time it guesses wrong, we nudge its weights. Do this a few thousand times on Shakespeare and the model figures out — entirely on its own — that plays have character names in ALL CAPS, followed by colons, followed by lines of dialogue. No rules were programmed. It just saw the patterns in the data and compressed them into its weights.
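That guessing game can be shown in a dozen lines. The sketch below uses a deliberately crude "model" — a bigram lookup table instead of the full transformer — trained on one hypothetical line of text, just to make the predict/nudge loop concrete. The corpus, the table, and all names here are stand-ins, not the post's actual code:

```python
import torch
import torch.nn.functional as F

# Hypothetical tiny corpus; the real model trains on ~1MB of Shakespeare.
text = "ROMEO: But, soft! what light through yonder window breaks?"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
data = torch.tensor([stoi[c] for c in text])

# Simplest possible "model": one row of logits per character,
# predicting the character that follows it (a bigram table).
logits_table = torch.zeros(len(chars), len(chars), requires_grad=True)
opt = torch.optim.Adam([logits_table], lr=0.1)

x, y = data[:-1], data[1:]             # each char, paired with the char after it
for step in range(200):
    logits = logits_table[x]           # the model's guess for "what comes next"
    loss = F.cross_entropy(logits, y)  # how wrong the guess was
    opt.zero_grad()
    loss.backward()                    # compute a nudge for every weight
    opt.step()                         # apply it
final_loss = loss.item()
```

The transformer plays exactly the same game; it just brings a far richer function than a lookup table to each guess.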

The whole model, on one page

Two transformer blocks, 64-dimensional embeddings, 4 attention heads. That's it.
The data flows through it like this:

- Input: "ROMEO: " as a sequence of characters
- Embedding: tok_emb + pos_emb — each of the 65 possible characters becomes a 64-d vector, plus a learned vector for each position 0..63
- Block ×2: LayerNorm → multi-head attention (4 heads × 16 dim) + residual, then an MLP (64 → 256 → 64) + residual
- Linear head: 64 → 65 (one logit per character)
- softmax → probabilities → sample the next character from the distribution
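Here is one way to write that architecture down, using the dimensions above. The class names are mine, and I've used `nn.MultiheadAttention` and ReLU for brevity where the post's ~150-line version may hand-roll attention or use GELU — treat this as a sketch of the same shape, not the author's code. Satisfyingly, the parameter count lands right around the quoted 112k:

```python
import torch
import torch.nn as nn

# Dimensions from the diagram above; everything else is an assumption.
VOCAB, N_EMBD, N_HEAD, N_BLOCK, BLOCK_SIZE = 65, 64, 4, 2, 64

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.ln1 = nn.LayerNorm(N_EMBD)
        self.attn = nn.MultiheadAttention(N_EMBD, N_HEAD, batch_first=True)
        self.ln2 = nn.LayerNorm(N_EMBD)
        self.mlp = nn.Sequential(                      # 64 -> 256 -> 64
            nn.Linear(N_EMBD, 4 * N_EMBD), nn.ReLU(), nn.Linear(4 * N_EMBD, N_EMBD))

    def forward(self, x):
        T = x.shape[1]
        # True above the diagonal = "not allowed to attend" (causal mask).
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                                      # residual
        x = x + self.mlp(self.ln2(x))                  # residual
        return x

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, N_EMBD)     # one vector per character
        self.pos_emb = nn.Embedding(BLOCK_SIZE, N_EMBD)  # one vector per position
        self.blocks = nn.Sequential(*[Block() for _ in range(N_BLOCK)])
        self.head = nn.Linear(N_EMBD, VOCAB)           # 64 -> 65 logits

    def forward(self, idx):                            # idx: (batch, time) char ids
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T))
        return self.head(self.blocks(x))               # (batch, time, VOCAB)

model = TinyTransformer()
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.zeros(1, 8, dtype=torch.long))
```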

Drag the slider: watch the model learn

At step 0 the model has never seen a single character. By step 3000 it's writing something that looks unmistakably like Shakespeare. Same prompt ("ROMEO:") at every checkpoint.
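Generating those samples is a loop: feed the prompt in, take the logits for the last position, turn them into probabilities, sample one character, append it, repeat. A minimal sketch, with a random stand-in where the trained model (or a loaded checkpoint) would go:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
VOCAB, BLOCK_SIZE = 65, 64

def dummy_model(idx):                        # stand-in for a trained checkpoint
    return torch.randn(idx.shape[0], idx.shape[1], VOCAB)

def generate(model, idx, n_new):
    for _ in range(n_new):
        idx_cond = idx[:, -BLOCK_SIZE:]      # crop to the context window
        logits = model(idx_cond)[:, -1, :]   # logits for the very next character
        probs = F.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, 1)    # sample, don't argmax: keeps output varied
        idx = torch.cat([idx, nxt], dim=1)
    return idx

out = generate(dummy_model, torch.zeros(1, 6, dtype=torch.long), 20)
```

Run the same prompt through the step-0 checkpoint and the step-3000 checkpoint and you get the progression in the slider.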

Training curve

Cross-entropy loss on the training set (fine dots) and validation set (bold line).
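Producing the two curves is a matter of holding out part of the corpus and periodically measuring cross-entropy on both splits. A sketch of the bookkeeping — the 90/10 split, batch sizes, and iteration counts here are assumptions, and a random tensor stands in for the encoded corpus and the model:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
data = torch.randint(0, 65, (1000,))         # stand-in for the encoded corpus
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]    # assumed 90/10 split

def get_batch(split, block_size=64, batch_size=4):
    d = train_data if split == "train" else val_data
    ix = torch.randint(len(d) - block_size - 1, (batch_size,))
    x = torch.stack([d[i:i + block_size] for i in ix])
    y = torch.stack([d[i + 1:i + block_size + 1] for i in ix])  # targets: shifted by one
    return x, y

@torch.no_grad()
def estimate_loss(model, iters=10):
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(iters)
        for k in range(iters):
            x, y = get_batch(split)
            logits = model(x)                # (batch, time, vocab)
            losses[k] = F.cross_entropy(logits.flatten(0, 1), y.flatten())
        out[split] = losses.mean().item()
    return out

# An untrained model scores near ln(65) ≈ 4.17 nats: pure guessing.
losses = estimate_loss(lambda x: torch.randn(x.shape[0], x.shape[1], 65))
```

The gap between the two curves is the thing to watch: when validation loss stops falling while training loss keeps dropping, the model has started memorizing rather than learning.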

What the model attends to

Attention from block 1, averaged across heads. Each row = where that character "looked" when deciding what came next. Lower-triangular because the model can only see the past.
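The lower-triangular shape isn't learned; it's imposed by the causal mask before the softmax. A sketch of how a map like this is computed — here from random stand-in queries and keys, whereas the plotted one comes from the trained model's first block:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, D, H = 8, 64, 4                           # sequence length, embed dim, heads
q = torch.randn(H, T, D // H)                # per-head queries (stand-in)
k = torch.randn(H, T, D // H)                # per-head keys (stand-in)

scores = q @ k.transpose(-2, -1) / (D // H) ** 0.5
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # no peeking at the future
attn = F.softmax(scores, dim=-1)             # each row sums to 1
avg = attn.mean(dim=0)                       # average over the 4 heads, as plotted
```

Everything above the diagonal comes out exactly zero, which is why every row of the heatmap stops at its own position.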

What to take away