A from-scratch, paper-faithful walkthrough of the 2017 transformer (Vaswani et al., arXiv:1706.03762), built as an explorable explanation. Both halves — the PyTorch model and the in-browser visualization — were written from first principles.
GPT, Claude, and Llama all descend from the decoder half of this paper. This project shows the parent architecture, end to end, with every choice made interactive.
```text
attention/
├── model/   Paper-faithful PyTorch implementation + training
└── web/     Svelte + D3 interactive explorable explanation
```
## model/ — the implementation

A clean, annotated encoder-decoder transformer matching the paper section by section:
| Module | Paper § | What it implements |
|---|---|---|
| `attention.py` | 3.2 | Scaled dot-product attention; multi-head wrapper |
| `positional.py` | 3.5 | Sinusoidal positional encoding |
| `embeddings.py` | 3.4 | Token embedding with √d_model scaling |
| `feedforward.py` | 3.3 | Position-wise FFN |
| `layers.py` | 3.1 | Encoder & decoder layers (residual + LayerNorm sublayers) |
| `model.py` | 3 | Full Transformer (encoder stack + decoder stack + greedy generation) |
| `masking.py` | 3.2.3 | Padding mask + causal mask |
| `optimizer.py` | 5.3 | Noam learning-rate schedule |
| `loss.py` | 5.4 | Label-smoothed KL loss |
| `data/dates.py` | — | Synthetic date-format-translation task |
| `data/tokenizer.py` | — | Character-level tokenizer |
| `train.py` | — | Training loop |
| `export_attention.py` | — | Dump attention weights as JSON for the frontend |
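For orientation, here is a minimal sketch of the computation at the heart of `attention.py` (paper §3.2). The function name and signature are illustrative, not the module's actual API:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V   (paper eq. 1)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (..., L_q, L_k)
    if mask is not None:
        # Padding / causal masks send blocked positions to -inf before softmax
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = scores.softmax(dim=-1)
    # Attention weights like these are what export_attention.py dumps as JSON
    return weights @ v, weights
```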
## web/ — the explorable explanation

Svelte 5 + D3 + Vite. A single-page narrative with five interactive widgets.

First, train a baseline model and export its attention weights for the frontend:
```bash
python3.12 -m venv .venv
source .venv/bin/activate
pip install torch numpy
cd model
python train.py --steps 3000 --ablation baseline --out checkpoints/baseline.pt
python export_attention.py --checkpoints checkpoints/baseline.pt
cp exports/attention.json ../web/public/data/
```
A 1M-parameter model trains to 100% validation exact-match on the
date-format task in **2 minutes on Apple Silicon**.
To compare variants, train the full ablation sweep and bundle all five checkpoints into one export:

```bash
cd model
for a in baseline no_pe no_residual no_layernorm uniform_attn; do
  python train.py --steps 3000 --ablation $a --out checkpoints/$a.pt
done
python export_attention.py --checkpoints \
  checkpoints/baseline.pt checkpoints/no_pe.pt checkpoints/no_residual.pt \
  checkpoints/no_layernorm.pt checkpoints/uniform_attn.pt
cp exports/attention.json ../web/public/data/
```
The frontend detects the multi-checkpoint bundle and exposes an ablation picker that swaps the live model behind the attention heatmap and the loss curve. A side-by-side loss-curve comparison shows the gap between each variant and the baseline.
On the date task, after 3000 steps:
| Variant | Val exact-match | What happens |
|---|---|---|
| `baseline` | 99.4% | Clean diagonal cross-attention |
| `no_pe` | 22.5% | Bag-of-characters; can't recover format |
| `no_residual` | 0.0% | Gradient flow dies; training barely moves |
| `no_layernorm` | 100% | Post-norm is insurance at this scale, not load-bearing |
| `uniform_attn` | 91.2% | FFN partially compensates on this simple task |
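The `--ablation` flag suggests each variant is a switch on the model constructor. A minimal sketch of how that could be wired; field names like `use_positional_encoding` are hypothetical, not train.py's actual config:

```python
from dataclasses import dataclass, replace

@dataclass
class ModelConfig:
    # Hypothetical switches; the real train.py may gate components differently.
    use_positional_encoding: bool = True   # off => no_pe
    use_residual: bool = True              # off => no_residual
    use_layernorm: bool = True             # off => no_layernorm
    uniform_attention: bool = False        # on  => uniform_attn (softmax -> 1/n)

ABLATIONS = {
    "baseline":     {},
    "no_pe":        {"use_positional_encoding": False},
    "no_residual":  {"use_residual": False},
    "no_layernorm": {"use_layernorm": False},
    "uniform_attn": {"uniform_attention": True},
}

def config_for(ablation: str) -> ModelConfig:
    return replace(ModelConfig(), **ABLATIONS[ablation])
```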
Then run the frontend:

```bash
cd web
npm install
npm run dev   # http://localhost:5173
```
"March 14, 2026" → "2026-03-14"
"14 mar 2026" → "2026-03-14"
"14/03/2026" → "2026-03-14"
"03-14-2026" → "2026-03-14"
Six source formats, single ISO target. Trains fast, produces clean diagonal cross-attention patterns that are visually unambiguous — the heatmap actually tells the story of the model learning to align digits.
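A sketch of what a generator like `data/dates.py` could look like. The real module covers six source formats; only the four shown above are reproduced here, and all names are illustrative:

```python
import random
from datetime import date, timedelta

# Format strings for four of the six source formats shown above.
FORMATS = ["%B %d, %Y", "%d %b %Y", "%d/%m/%Y", "%m-%d-%Y"]

def random_pair(rng: random.Random) -> tuple[str, str]:
    # Sample a date, render it in a random source format,
    # and pair it with the single ISO target "YYYY-MM-DD".
    d = date(2000, 1, 1) + timedelta(days=rng.randrange(20000))
    return d.strftime(rng.choice(FORMATS)), d.isoformat()

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(random_pair(rng))
```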