A from-scratch, paper-faithful walkthrough of the 2017 transformer (Vaswani et al., arXiv:1706.03762), built as an explorable explanation. Both halves — the PyTorch model and the in-browser visualization — were written from first principles.
GPT, Claude, and Llama all descend from the decoder half of this paper. This project shows the parent architecture, end to end, with every choice made interactive.
```text
attention/
├── model/   Paper-faithful PyTorch implementation + training
└── web/     Svelte + D3 interactive explorable explanation
```
## model/ — the implementation

A clean, annotated encoder-decoder transformer matching the paper section by section:
| Module | Paper § | What it implements |
|---|---|---|
| `attention.py` | 3.2 | Scaled dot-product attention; multi-head wrapper |
| `positional.py` | 3.5 | Sinusoidal positional encoding |
| `embeddings.py` | 3.4 | Token embedding with √d_model scaling |
| `feedforward.py` | 3.3 | Position-wise FFN |
| `layers.py` | 3.1 | Encoder & decoder layers (residual + LayerNorm sublayers) |
| `model.py` | 3 | Full Transformer (encoder stack + decoder stack + greedy generation) |
| `masking.py` | 3.2.3 | Padding mask + causal mask |
| `optimizer.py` | 5.3 | Noam learning-rate schedule |
| `loss.py` | 5.4 | Label-smoothed KL loss |
| `data/dates.py` | — | Synthetic date-format-translation task |
| `data/tokenizer.py` | — | Character-level tokenizer |
| `train.py` | — | Training loop |
| `export_attention.py` | — | Dump attention weights as JSON for the frontend |
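For orientation, here is a minimal sketch of the computation at the heart of `attention.py` (paper §3.2). The function name and signature are illustrative, not the module's actual API:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V   (paper eq. 1)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., L_q, L_k)
    if mask is not None:
        # Padding / causal masks send blocked positions to -inf before softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    # Attention weights like these are what export_attention.py dumps as JSON
    return weights @ v, weights
```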
## web/ — the explorable explanation

Svelte 5 + D3 + Vite. A single-page narrative with five interactive widgets.

First, train a baseline model and export its attention weights for the frontend:
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install torch numpy
cd model
python train.py --steps 3000 --ablation baseline --out checkpoints/baseline.pt
python export_attention.py --checkpoints checkpoints/baseline.pt
cp exports/attention.json ../web/public/data/
```
A 1M-parameter model trains to 100% validation exact-match on the
date-format task in **2 minutes on Apple Silicon**.
To compare variants, train the full ablation sweep and bundle all five checkpoints into one export:

```bash
cd model
for a in baseline no_pe no_residual no_layernorm uniform_attn; do
  python train.py --steps 3000 --ablation $a --out checkpoints/$a.pt
done
python export_attention.py --checkpoints \
  checkpoints/baseline.pt checkpoints/no_pe.pt checkpoints/no_residual.pt \
  checkpoints/no_layernorm.pt checkpoints/uniform_attn.pt
cp exports/attention.json ../web/public/data/
```
The frontend detects the multi-checkpoint bundle and exposes an ablation picker that swaps the live model behind the attention heatmap and the loss curve. A side-by-side loss-curve comparison shows the gap between each variant and the baseline.
On the date task, after 3000 steps:
| Variant | Val exact-match | What happens |
|---|---|---|
| `baseline` | 99.4% | Clean diagonal cross-attention |
| `no_pe` | 22.5% | Bag-of-characters; can't recover format |
| `no_residual` | 0.0% | Gradient flow dies; training barely moves |
| `no_layernorm` | 100% | Post-norm is insurance at this scale, not load-bearing |
| `uniform_attn` | 91.2% | FFN partially compensates on this simple task |
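The `--ablation` flag suggests each variant is a switch on the model constructor. A minimal sketch of how that could be wired; field names like `use_positional_encoding` are hypothetical, not train.py's actual config:

```python
from dataclasses import dataclass, replace

@dataclass
class ModelConfig:
    # Hypothetical switches; the real train.py may gate components differently.
    use_positional_encoding: bool = True   # off => no_pe
    use_residual: bool = True              # off => no_residual
    use_layernorm: bool = True             # off => no_layernorm
    uniform_attention: bool = False        # on  => uniform_attn (softmax -> 1/n)

ABLATIONS = {
    "baseline":     {},
    "no_pe":        {"use_positional_encoding": False},
    "no_residual":  {"use_residual": False},
    "no_layernorm": {"use_layernorm": False},
    "uniform_attn": {"uniform_attention": True},
}

def config_for(ablation: str) -> ModelConfig:
    return replace(ModelConfig(), **ABLATIONS[ablation])
```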
Then run the frontend:

```bash
cd web
npm install
npm run dev   # http://localhost:5173
```
"March 14, 2026" → "2026-03-14"
"14 mar 2026" → "2026-03-14"
"14/03/2026" → "2026-03-14"
"03-14-2026" → "2026-03-14"
Six source formats, single ISO target. Trains fast, produces clean diagonal cross-attention patterns that are visually unambiguous — the heatmap actually tells the story of the model learning to align digits.
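A sketch of what a generator like `data/dates.py` could look like. The real module covers six source formats; only the four shown above are reproduced here, and all names are illustrative:

```python
import random
from datetime import date, timedelta

# Format strings for four of the six source formats shown above.
FORMATS = ["%B %d, %Y", "%d %b %Y", "%d/%m/%Y", "%m-%d-%Y"]

def random_pair(rng: random.Random) -> tuple[str, str]:
    # Sample a date, render it in a random source format,
    # and pair it with the single ISO target "YYYY-MM-DD".
    d = date(2000, 1, 1) + timedelta(days=rng.randrange(20000))
    return d.strftime(rng.choice(FORMATS)), d.isoformat()

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_pair(rng))
```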