A visual, interactive, zero-to-hero textbook on how modern AI systems actually work — from the matrix multiplication that powers a single neuron, all the way to disaggregated inference clusters and multi-agent orchestration.
A single continuous climb: you start with y = Wx + b and you finish understanding why DeepSeek V4 splits prefill and decode onto separate GPU pools, why MLA compresses the KV cache into a latent space, and how a ReAct agent loop closes around a 1.6T-parameter Mixture-of-Experts model running behind a PagedAttention scheduler.
Every concept is paired with an interactive canvas visualization. You don't just read about matrix multiplication; you watch the dot products accumulate. You don't just read about continuous batching; you watch padding waste disappear as finished requests leave the batch and waiting ones are admitted at iteration boundaries. You don't just read about speculative decoding; you watch a draft model race ahead and a target model verify in parallel sweeps.
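To make the continuous-batching claim concrete, here is a toy sketch that counts decode iterations under static and continuous batching. It is an illustration under simplifying assumptions, not the textbook's visualization code: the function names and the example lengths are hypothetical, every request needs at least one token, and prefill cost is ignored.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue at every iteration boundary."""
    waiting = deque(lengths)
    running = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1  # one decode iteration emits one token per active request
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 3, 1]  # hypothetical output lengths, in tokens
    print("static:", static_batching_steps(lengths, batch_size=2))
    print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

In the static case, a short request sharing a batch with a long one holds its slot idle until the long one finishes. In the continuous case the slot is reused on the next iteration, which is where the saved steps come from.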
No build step. No package manager. No framework.
git clone https://github.com/salimhs/llm-systems-textbook.git
cd llm-systems-textbook
python -m http.server 8080
Open http://localhost:8080. That's it.
The chapter loader uses fetch() to pull HTML fragments on demand, which browsers block under the file:// protocol. That's why we run a local server. Any static server works (npx serve, php -S, caddy, etc.).
This is a pure static site — drop it on any static host:
| Host | One-time setup |
|---|---|
| GitHub Pages | Settings → Pages → Source: main branch, root / |
| Netlify / Vercel | Connect repo. No build command. Publish dir: / |
| Cloudflare Pages | Same — no build, root output |
| Self-hosted | Serve the directory with any static server. No runtime needed. |
Each visualization is a self-contained canvas animation registered with the engine. The grid below shows what each chapter's viz teaches:
| Part | Chapters | Visualization focus |
|---|---|---|
| 0 — Math from absolute zero | 7 | Notation palette · function families · vector dot-products · matrix transforms · gradient descent · entropy · optimization |
| 1 — Neural nets | 5 | Perceptron boundary · MLP activations · loss functions · backprop flow · training dynamics |
| 2 — Transformer | 7 | BPE merge · embedding space · RoPE heatmap · attention graph · multi-head split · transformer block · scaling laws |
| 3 — Post-training | 3 | SFT/LoRA/QLoRA · DPO vs RLHF · GRPO loop with advantages |
| 4 — GPUs and memory | 5 | GPU floorplan · memory hierarchy · roofline · quantization bits · FlashAttention tiles |
| 5 — Inference | 4 | Autoregressive decoding · KV cache growth · KV compression (MQA/GQA/MLA) · long-context strategies |
| 6 — Distributed training | 4 | Ring AllReduce · ZeRO stages · tensor/pipeline parallel · expert/sequence parallel |
| 7 — Serving | 7 | Continuous batching · PagedAttention · interference · disaggregation · KV networking · prefix caching · speculative decoding |
| 8 — Reasoning | 3 | Chain-of-Thought · Tree of Thoughts + MCTS · process reward models |
| 9 — Frontier | 2 | Architecture landscape map · model comparator |
| 10 — Beyond transformers | 2 | Mamba/SSM · hybrid (Jamba) |
| 11 — Agents | 4 | ReAct loop · long-term memory · multi-agent topologies · evaluation dashboard |
Three panes, one engine, many modes.
┌─────────────┬───────────────────────────┬──────────────────────────┐
│ │ │ Workshop pane │
│ Sidebar │ Library pane │ ┌────────────────────┐ │
│ (manifest) │ (chapter HTML fragment) │ │ canvas (active │ │
│ │ │ │ visualization) │ │
│ • Part 0 │ Loaded via fetch() on │ └────────────────────┘ │
│ • Part 1 │ chapter switch; MathJax │ Telemetry · controls │
│ ... │ re-typesets the new │ Parameter sliders │
│ │ content. │ Console log │
└─────────────┴───────────────────────────┴──────────────────────────┘
The shell (index.html) is fixed. The chapter loader swaps the library content. The engine swaps the visualization mode. Everything else is a 60 fps requestAnimationFrame loop.
See ARCHITECTURE.md for the design contract: engine lifecycle, mode interface, palette, telemetry, custom elements.
Vectors and matrices. GEMM. Loss functions and cross-entropy. Gradient descent and backpropagation. The artificial neuron. Activation functions (GELU, SwiGLU). MLPs and the Universal Approximation Theorem. RNNs and the sequential bottleneck. Tokenization (BPE) and embeddings. Q/K/V projections. Scaled dot-product attention. Multi-head self-attention.
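As a concrete anchor for the last topics in this list, here is a minimal sketch of scaled dot-product attention, softmax(QKᵀ / √d_k) V, in plain Python. The function names are illustrative and not taken from the textbook's code. The sketch assumes a single head with no causal mask, and uses nested lists in place of tensors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for one head, without masking.

    Q has shape [n_queries][d_k]; K has shape [n_keys][d_k];
    V has shape [n_keys][d_v]. Returns [n_queries][d_v].
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out


if __name__ == "__main__":
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    # A zero query scores every key equally, so the output is the mean of V.
    print(attention([[0.0, 0.0]], K, V))
```

A query of all zeros produces equal scores, so the result is the plain average of the value rows. A query strongly aligned with one key pushes nearly all the weight onto that key's value row.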
Autoregressive generation. The O(N²) explosion. The Key-Value cache and why it changes the bottleneck from compute to memory. KV compression: MQA, GQA, and DeepSeek's MLA. Long-context strategies. GPU internals: Tensor Cores and HBM. Arithmetic intensity. The Roofline Model. Prefill vs decode boundedness. Why lower-precision weights (FP8/MXFP4) speed up memory-bound decode, roughly in proportion to the bytes saved. Static vs continuous batching. PagedAttention.
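The KV-cache topics above come down to simple arithmetic, sketched here in Python. The model shape is a hypothetical example (32 layers, 32 query heads, head dimension 128, FP16 cache) and does not describe any model named in this README. The function name is also illustrative. MLA is not modelled, because it stores a compressed latent per token rather than per-head keys and values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token.

    The leading factor of 2 accounts for storing both K and V.
    bytes_per_elem=2 assumes an FP16/BF16 cache.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the numbers round.
    shape = dict(n_layers=32, head_dim=128, seq_len=4096)
    for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
        size = kv_cache_bytes(n_kv_heads=kv_heads, **shape)
        print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions, cutting the number of KV heads from 32 (MHA) to 8 (GQA) to 1 (MQA) shrinks the per-sequence cache proportionally. That is the lever the KV compression chapter explores.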
Co-location interference. TTFT vs ITL SLOs. Splitwise and DistServe disaggregation. KV transit over NVLink, InfiniBand, RoCE/RDMA. Speculative decoding. Mixture of Experts. Multi-LoRA serving (S-LoRA, Punica). The ReAct loop. Reasoning models and thinking budgets (GRPO). Memory architectures: short-term vs long-term vector DBs. Cosine similarity and RAG. Multi-agent topologies (AutoGen).
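For the retrieval topics in this list, here is a minimal sketch of cosine-similarity lookup over a tiny in-memory store. The two-dimensional vectors and the memory entries are made-up placeholders standing in for real embeddings, and the function names are illustrative. A real vector database would use an approximate nearest-neighbour index instead of this linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def top_k(query, memory, k=2):
    """Return the texts of the k stored entries most similar to the query.

    memory is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query, vec), text) for text, vec in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


if __name__ == "__main__":
    # Placeholder "embeddings"; a real system would get these from a model.
    memory = [
        ("note about GPUs", [1.0, 0.0]),
        ("note about cooking", [0.0, 1.0]),
        ("note about GPU recipes", [1.0, 1.0]),
    ]
    print(top_k([1.0, 0.1], memory, k=2))
```

Because cosine similarity ignores vector length, only the direction of the query matters. The retrieved texts would then be placed into the model's prompt, which is the retrieval step of RAG.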
Read this textbook front to back and the following sentence will make complete mechanical sense:
"Our serving cluster runs DeepSeek V4 with disaggregated prefill-decode, FP8 quantization on the decode pool, MLA-compressed KV cache transferred over RoCE, continuous batching with PagedAttention, and speculative decoding using a 2B draft model. The agents on top are ReAct loops with cosine-similarity retrieval over a vector database, and the reasoning calls go to a separate pool with a larger thinking budget."
No prerequisites. Each chapter builds strictly on the previous one. Read in order — the math compounds.
<vis-trigger> elements that switch the canvas to a sub-mode of the current visualization.
Add a chapter, refactor a visualization, fix a typo: all welcome. See CONTRIBUTING.md for details.
Written against the state of the art as of May 2026. Models referenced — DeepSeek V4 / V4 Pro / V4 Flash, Llama 4, Qwen 3.7, Claude Opus 4.7 / Sonnet 4.6, Gemini 3.1 Pro, Moonshot Kimi, Mixtral. Serving systems referenced — vLLM (PagedAttention), Orca (continuous batching), Splitwise and DistServe (disaggregation), SGLang, S-LoRA, Punica.
The field moves fast. Treat the textbook as a stable scaffold and the citations as the live edge.
MIT. Use it, fork it, teach with it. Cite the linked papers — they did the original work.