A 112k-parameter language model, written from scratch in ~150 lines of PyTorch,
trained on 1MB of Shakespeare. Every line is explained. It actually works.
A language model learns by playing a guessing game: given a chunk of text, predict the
very next character. Each time it guesses wrong, we nudge its weights. Do this a few
thousand times on Shakespeare and the model figures out — entirely on its own — that
plays have character names in ALL CAPS, followed by colons, followed by lines of dialogue.
No rules were programmed. It just saw the patterns in the data and compressed them into
its weights.
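The guessing game above can be sketched as a training loop. This is an illustrative toy, not the article's code: the "model" here is just an embedding table acting as a bigram predictor, and `data` stands in for the encoded Shakespeare text — but the shape of the loop (sample a chunk, predict the next character, measure with cross-entropy, nudge the weights) is the same.

```python
# Toy sketch of the guessing game: predict the next character,
# score the guess with cross-entropy, nudge the weights.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, block_size = 65, 32
data = torch.randint(0, vocab_size, (1000,))  # stand-in for encoded text

# A deliberately tiny stand-in "model": an embedding table used as a
# bigram predictor, just to show the training loop's structure.
model = torch.nn.Embedding(vocab_size, vocab_size)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

for step in range(200):
    ix = torch.randint(0, len(data) - block_size - 1, (16,))
    x = torch.stack([data[i:i + block_size] for i in ix.tolist()])      # context
    y = torch.stack([data[i + 1:i + block_size + 1] for i in ix.tolist()])  # next chars
    logits = model(x)                       # (batch, time, vocab) guesses
    loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Swap the embedding table for a transformer and `data` for real text, and this is the whole training procedure.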
The whole model, on one page
Two transformer blocks, 64-dimensional embeddings, 4 attention heads. That's it.
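As a rough sketch of what those numbers imply (layer names and the context length are assumptions, not the article's exact code), an architecture with two blocks, 64-dim embeddings, and 4 heads lands at about 112k parameters:

```python
# A minimal sketch matching the stated numbers: 2 blocks, d=64, 4 heads.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # hide the future
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                        # residual around attention
        x = x + self.mlp(self.ln2(x))    # residual around the MLP
        return x

class TinyGPT(nn.Module):
    def __init__(self, vocab=65, d=64, blocks=2, ctx=64):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)   # what each character "means"
        self.pos = nn.Embedding(ctx, d)     # where it sits in the context
        self.blocks = nn.Sequential(*[Block(d) for _ in range(blocks)])
        self.ln = nn.LayerNorm(d)
        self.head = nn.Linear(d, vocab)     # scores for the next character

    def forward(self, idx):
        T = idx.size(1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        return self.head(self.ln(self.blocks(x)))
```

Counting `p.numel()` over the parameters of this sketch gives ~112k, which is where the headline number comes from.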
Drag the slider: watch the model learn
At step 0 the model has never seen a single character. By step 3000 it's writing something
that looks unmistakably like Shakespeare. Same prompt ("ROMEO:") at every checkpoint.
Training curve
Cross-entropy loss on the training set (fine dots) and validation set (bold line).
What the model attends to
Attention from block 1, averaged across heads. Each row = where that character "looked" when deciding what came next. Lower-triangular because the model can only see the past.
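A head-averaged map like the one above can be pulled straight out of `nn.MultiheadAttention`. The module and sizes here are illustrative, not the article's exact code:

```python
# Sketch: extract an attention map averaged across heads,
# the kind of matrix the visualization above is built from.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, heads, T = 64, 4, 8
attn = nn.MultiheadAttention(d, heads, batch_first=True)
x = torch.randn(1, T, d)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# need_weights=True returns the attention matrix; average_attn_weights=True
# averages it over the 4 heads, giving one (T, T) map per batch item.
_, w = attn(x, x, x, attn_mask=causal,
            need_weights=True, average_attn_weights=True)
w = w[0]  # (T, T): row i = where position i looked
```

Because the causal mask zeroes out future positions, `w` is lower-triangular, and each row sums to 1 thanks to the softmax.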
What to take away
A Transformer is ~150 lines of PyTorch. There's no magic — it's embeddings, attention, an MLP, and a residual. That's the whole recipe.
Attention is just weighted averages. For each position, compute a similarity score to every past position, softmax it, and average their values.
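That takeaway fits in a dozen lines. A single-head sketch with made-up sizes:

```python
# Attention as a weighted average: similarity scores, softmax, average.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 64
q = torch.randn(T, d)   # queries: "what am I looking for?"
k = torch.randn(T, d)   # keys:    "what do I contain?"
v = torch.randn(T, d)   # values:  "what do I pass along?"

scores = q @ k.T / d ** 0.5                       # similarity, scaled by sqrt(d)
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))  # can't look at the future
weights = F.softmax(scores, dim=-1)               # each row sums to 1
out = weights @ v                                 # weighted average of the values
```

Each row of `weights` is exactly one row of the triangular map in the visualization above.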
Scale is everything. This 112k-param model writes gibberish-Shakespeare. GPT-4 has ~10⁹× more parameters and writes essays. Same architecture.
Loss is not accuracy. PyTorch's cross-entropy is measured in nats, so a loss of 1.7 on a 65-char vocab means the model's guess carries ~1.7 / ln 2 ≈ 2.5 bits of uncertainty per character — far less than the log₂ 65 ≈ 6 bits of a uniform random guess.
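The arithmetic in that last takeaway is a two-liner:

```python
# Convert a nat-valued cross-entropy loss to bits and compare it
# with a uniform random guess over a 65-character vocabulary.
import math

loss_nats = 1.7
bits_per_char = loss_nats / math.log(2)   # nats -> bits
uniform_bits = math.log2(65)              # entropy of a uniform guess
```

`bits_per_char` comes out near 2.5 and `uniform_bits` near 6, so even this tiny model has squeezed out more than half the uncertainty.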