
Attention Is All You Need — Interactively

A from-scratch, paper-faithful walkthrough of the 2017 transformer (Vaswani et al., arXiv:1706.03762), built as an explorable explanation. Both halves — the PyTorch model and the in-browser visualization — were written from first principles.

GPT, Claude, and Llama all descend from the decoder half of this paper. This project shows the parent architecture, end to end, with every choice made interactive.

What's in here

attention/
├── model/      Paper-faithful PyTorch implementation + training
└── web/        Svelte + D3 interactive explorable explanation

model/ — the implementation

A clean, annotated encoder-decoder transformer that matches the paper section by section:

Module               Paper §   What it implements
attention.py         3.2       Scaled dot-product attention; multi-head wrapper (sketched below)
positional.py        3.5       Sinusoidal positional encoding
embeddings.py        3.4       Token embedding with √d_model scaling
feedforward.py       3.3       Position-wise FFN
layers.py            3.1       Encoder & decoder layers (residual + LayerNorm sublayers)
model.py             3         Full Transformer (encoder stack + decoder stack + greedy generation)
masking.py           3.2.3     Padding mask + causal mask
optimizer.py         5.3       Noam learning-rate schedule
loss.py              5.4       Label-smoothed KL loss
data/dates.py                  Synthetic date-format-translation task
data/tokenizer.py              Character-level tokenizer
train.py                       Training loop
export_attention.py            Dump attention weights as JSON for the frontend
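
As a taste of the implementation style, here is a minimal sketch of the core computation in attention.py (paper §3.2). The function name, signature, and mask convention are illustrative assumptions, not necessarily the repo's exact API:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k: (..., seq, d_k); v: (..., seq, d_v) -> (output, weights)
    d_k = q.size(-1)
    # Divide by sqrt(d_k) so logit variance stays ~1 as d_k grows, keeping
    # the softmax from saturating (the slider widget below toggles exactly
    # this divisor).
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Padding / causal masks push forbidden positions to -inf pre-softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

The multi-head wrapper splits d_model across the 4 heads, runs this per head, and concatenates the results.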

web/ — the explorable explanation

Svelte 5 + D3 + Vite. Single-page narrative with five widgets:

  • Architecture diagram — annotated schematic of Figure 1
  • Scaled-dot-product slider — toggle the √d_k divisor live, watch the attention distribution collapse
  • Positional-encoding visualization — sinusoid grid + dot-product structure as a function of relative offset (see the numeric sketch after this list)
  • Attention heatmap — real exported weights from the trained model; switch examples, layers, heads, and encoder-self / decoder-self / cross attention
  • Training loss curve — the actual loss trajectory of the run that produced those weights
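
The relative-offset structure the positional-encoding widget plots is easy to verify numerically: because the encoding pairs sin and cos at shared frequencies, PE[i]·PE[j] = Σ_k cos(ω_k (i − j)) depends only on the offset i − j. A quick sketch, with my own variable names rather than positional.py's:

import numpy as np

def sinusoidal_pe(max_len, d_model):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(same)
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(64, 128)
# Same offset (4) at different absolute positions -> equal dot products,
# up to floating-point error:
print(pe[10] @ pe[14], pe[30] @ pe[34])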

Running it

Prerequisites

  • Python 3.11 or 3.12
  • Node 18+
  • macOS / Linux. Apple Silicon (MPS) is auto-detected.

Train the model

python3.12 -m venv .venv
source .venv/bin/activate
pip install torch numpy

cd model
python train.py --steps 3000 --ablation baseline --out checkpoints/baseline.pt
python export_attention.py --checkpoints checkpoints/baseline.pt
cp exports/attention.json ../web/public/data/

A 1M-parameter model trains to 100% validation exact-match on the date-format task in 2 minutes on Apple Silicon.

Train the full ablation suite (Phase 3)

cd model
for a in baseline no_pe no_residual no_layernorm uniform_attn; do
  python train.py --steps 3000 --ablation $a --out checkpoints/$a.pt
done

python export_attention.py --checkpoints \
  checkpoints/baseline.pt checkpoints/no_pe.pt checkpoints/no_residual.pt \
  checkpoints/no_layernorm.pt checkpoints/uniform_attn.pt
cp exports/attention.json ../web/public/data/

The frontend detects the bundle and exposes an ablation picker that swaps which variant's weights drive the attention heatmap and the loss curve. A side-by-side loss-curve comparison shows the gap between each variant and the baseline.

On the date task, after 3000 steps:

Variant        Val exact-match   What happens
baseline       99.4%             Clean diagonal cross-attention
no_pe          22.5%             Bag-of-characters; can't recover format
no_residual    0.0%              Gradient flow dies; training barely moves
no_layernorm   100%              Post-norm is insurance at this scale, not load-bearing
uniform_attn   91.2%             FFN partially compensates on this simple task

Run the explorable site

cd web
npm install
npm run dev    # http://localhost:5173

The demo task

"March 14, 2026" → "2026-03-14"
"14 mar 2026"    → "2026-03-14"
"14/03/2026"     → "2026-03-14"
"03-14-2026"     → "2026-03-14"

Six source formats, single ISO target. Trains fast, produces clean diagonal cross-attention patterns that are visually unambiguous — the heatmap actually tells the story of the model learning to align digits.
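
For concreteness, pairs like these take only a few lines to generate. This is a hedged sketch, not the repo's data/dates.py; that file covers six formats with its own sampling details, so treat the format list and names here as assumptions:

import random
from datetime import date, timedelta

# Four source formats as an illustration; dates.py defines six.
FORMATS = ["%B %d, %Y", "%d %b %Y", "%d/%m/%Y", "%m-%d-%Y"]

def random_pair(rng):
    d = date(2000, 1, 1) + timedelta(days=rng.randrange(20000))
    src = d.strftime(rng.choice(FORMATS))
    if rng.random() < 0.5:
        src = src.lower()          # yields variants like "14 mar 2026"
    return src, d.isoformat()      # target is always ISO "YYYY-MM-DD"

rng = random.Random(0)
print(random_pair(rng))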

Design notes

  • No HuggingFace anywhere. Every layer is hand-written. The point is to demonstrate understanding, not to assemble pre-built parts.
  • Paper-faithful defaults. Post-norm (not pre-norm), Xavier init, Adam β=(0.9, 0.98), ε=1e-9, Noam scheduler, label smoothing ε=0.1, optional weight tying between embeddings and the output projection (schedule and loss sketched after this list).
  • The model is tiny on purpose. d_model=128, 2 encoder + 2 decoder layers, 4 heads, ~1M params. The artifact's value is the explanation, not the capability — and a tiny model trains in minutes and ships as ~700 KB of attention weights.
  • The frontend reads pre-computed attention, not a live model. For v1, the trained model in PyTorch is the source of truth; weights and activations are exported to JSON. Phase 4 adds live in-browser training with onnxruntime-web or transformers.js.

Roadmap

  • Phase 1: From-scratch PyTorch implementation + trained model
  • Phase 2: Interactive site shell + scaled-dot-product / positional encoding / architecture / attention heatmap / loss curve widgets
  • Phase 3: Ablation toggles — independently-trained checkpoints with (no PE, no residual, no LayerNorm, uniform attention); viewer can swap them into the attention heatmap and the loss curve. A side-by-side loss-curve comparison shows each architecture's gap to the baseline.
  • Phase 4: Live in-browser training on a copy / reverse task using onnxruntime-web with WebGPU backend
  • Phase 5: "Paste your own input" demo using a small additional model trained on Urdu shers (couplets; Lyrist corpus) — a model the rest of the demo gallery doesn't have

References

  • Vaswani et al., Attention Is All You Need, 2017. arXiv:1706.03762
  • Rush et al., The Annotated Transformer. The reference I checked myself against on every implementation detail.
  • bbycroft's LLM Visualization. The bar this project is trying to clear, for the encoder-decoder side.
