LLM Systems Textbook

A visual, interactive, zero-to-hero textbook on how modern AI systems actually work — from the matrix multiplication that powers a single neuron, all the way to disaggregated inference clusters and multi-agent orchestration.

A single continuous climb: you start with y = Wx + b and you finish understanding why DeepSeek V4 splits prefill and decode onto separate GPU pools, why MLA compresses the KV cache into a latent space, and how a ReAct agent loop closes around a 1.6T-parameter Mixture-of-Experts model running behind a PagedAttention scheduler.
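As a concrete starting point, the affine map y = Wx + b can be written out in a few lines of plain Python. This is only an illustrative sketch; the function name `affine` and the numbers are made up for the example and are not taken from the textbook.

```python
def affine(W, x, b):
    """Return y = Wx + b for a matrix W (list of rows), a vector x, and a bias b."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


# Hypothetical 3x2 weight matrix, 2-vector input, 3-vector bias.
W = [[1.0, 2.0],
     [0.0, -1.0],
     [3.0, 0.5]]
x = [2.0, 1.0]
b = [0.5, 0.0, -1.0]

y = affine(W, x, b)
print(y)  # [4.5, -1.0, 5.5]
```

Each output element is one dot product plus one bias term, which is the operation a single neuron performs before its activation function.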

Every concept is paired with an interactive canvas visualization. You don't just read about matrix multiplication; you watch the dot products accumulate. You don't just read about continuous batching; you watch padding waste disappear as finished requests leave the batch and queued requests take their slots at iteration boundaries. You don't just read about speculative decoding; you watch a draft model race ahead while a target model verifies its guesses in parallel sweeps.
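The padding-waste effect can also be counted directly. The sketch below is a toy simulation, not the textbook's visualization code: it assumes each request needs a fixed number of decode steps, that a batch has a fixed number of slots, and that one step produces one token per active slot. The function names and the request lengths are hypothetical.

```python
from collections import deque


def static_batching(lengths, batch_size):
    """Each batch runs until its longest request ends. Returns (steps, idle_slot_steps)."""
    steps = idle = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        longest = max(batch)
        steps += longest
        # Finished requests keep their slot until the whole batch is done.
        idle += batch_size * longest - sum(batch)
    return steps, idle


def continuous_batching(lengths, batch_size):
    """Free slots are refilled from the queue before every step. Returns (steps, idle_slot_steps)."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = idle = 0
    while queue or running:
        # Iteration boundary: admit queued requests into free slots.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        idle += batch_size - len(running)
        # Requests that just produced their last token leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps, idle


lengths = [2, 8, 3, 8, 1, 4, 2, 4]  # decode steps needed per request (made up)
print(static_batching(lengths, batch_size=4))      # (12, 16)
print(continuous_batching(lengths, batch_size=4))  # (9, 4)
```

With these numbers both schedulers produce the same 32 tokens, but the static scheduler spends 16 slot-steps on padding while the continuous one spends 4, and it finishes in 9 steps instead of 12.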


Quick start

No build step. No package manager. No framework.

git clone https://github.com/salimhs/llm-systems-textbook.git
cd llm-systems-textbook
python -m http.server 8080

Open http://localhost:8080. That's it.

The chapter loader uses fetch() to pull HTML fragments on demand, and browsers block fetch() for pages opened from file:// URLs, which is why a local server is needed. Any static server works (npx serve, php -S localhost:8080, caddy file-server, etc.).
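If you would rather start the server from a script than from the command line, Python's standard library is enough. The sketch below is an assumption-laden demo rather than part of the repository: it serves a temporary directory containing a made-up file, `chapter.html`, on a free port and fetches it back over HTTP, which is the kind of request the chapter loader makes.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request


def start_static_server(directory, port=0):
    """Serve `directory` on 127.0.0.1 in a background thread (port=0 picks a free port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Demo: serve one hypothetical fragment and read it back over http://.
with tempfile.TemporaryDirectory() as root:
    pathlib.Path(root, "chapter.html").write_text("<p>hello</p>", encoding="utf-8")
    server = start_static_server(root)
    port = server.server_address[1]
    with urllib.request.urlopen(f"http://127.0.0.1:{port}/chapter.html") as resp:
        status, body = resp.status, resp.read().decode("utf-8")
    server.shutdown()
    server.server_close()

print(status, body)  # 200 <p>hello</p>
```

For the textbook itself you would pass the cloned repository path and port 8080, and keep the process alive instead of shutting the server down.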


Deploy as a website

This is a pure static site — drop it on any static host:

| Host | One-time setup |
| --- | --- |
| GitHub Pages | Settings → Pages → Source: main branch, root / |
| Netlify / Vercel | Connect repo. No build command. Publish dir: / |
| Cloudflare Pages | Same as Netlify / Vercel: no build, root output |
| Self-hosted | Serve the directory with any static server. No runtime needed. |

What's inside

  • 61 chapters spanning 13 parts — math foundations → transformer internals → GPU memory → distributed training → serving → reasoning → frontier architectures → agents
  • 60 interactive canvas visualizations — each demonstrates the chapter's mechanism with labeled, animated diagrams
  • 6 appendices — worked exercise solutions, glossary, notation reference, model cards, paper library, cheatsheet
  • 484 worked exercise solutions across all chapters
  • 90+ glossary entries with first-introduction pointers

Visualizations

Each visualization is a self-contained canvas animation registered with the engine. The grid below shows what each chapter's viz teaches:

| Part | Chapters | Visualization focus |
| --- | --- | --- |
| 0: Math from absolute zero | 7 | Notation palette · function families · vector dot-products · matrix transforms · gradient descent · entropy · optimization |
| 1: Neural nets | 5 | Perceptron boundary · MLP activations · loss functions · backprop flow · training dynamics |
| 2: Transformer | 7 | BPE merge · embedding space · RoPE heatmap · attention graph · multi-head split · transformer block · scaling laws |
| 3: Post-training | 3 | SFT/LoRA/QLoRA · DPO vs RLHF · GRPO loop with advantages |
| 4: GPUs and memory | 5 | GPU floorplan · memory hierarchy · roofline · quantization bits · FlashAttention tiles |
| 5: Inference | 4 | Autoregressive decoding · KV cache growth · KV compression (MQA/GQA/MLA) · long-context strategies |
| 6: Distributed training | 4 | Ring AllReduce · ZeRO stages · tensor/pipeline parallel · expert/sequence parallel |
| 7: Serving | 7 | Continuous batching · PagedAttention · interference · disaggregation · KV networking · prefix caching · speculative decoding |
| 8: Reasoning | 3 | Chain-of-Thought · Tree of Thoughts + MCTS · process reward models |
| 9: Frontier | 2 | Architecture landscape map · model comparator |
| 10: Beyond transformers | 2 | Mamba/SSM · hybrid (Jamba) |
| 11: Agents | 4 | ReAct loop · long-term memory · multi-agent topologies · evaluation dashboard |

Architecture

Three panes, one engine, many modes.

┌─────────────┬───────────────────────────┬──────────────────────────┐
│             │                           │  Workshop pane           │
│  Sidebar    │  Library pane             │  ┌────────────────────┐  │
│  (manifest) │  (chapter HTML fragment)  │  │  canvas (active    │  │
│             │                           │  │   visualization)   │  │
│  • Part 0   │  Loaded via fetch() on    │  └────────────────────┘  │
│  • Part 1   │  chapter switch; MathJax  │  Telemetry · controls    │
│    ...      │  re-typesets the new      │  Parameter sliders       │
│             │  content.                 │  Console log             │
└─────────────┴───────────────────────────┴──────────────────────────┘

The shell (index.html) is fixed. The chapter loader swaps the library content. The engine swaps the visualization mode. Everything else is a requestAnimationFrame loop, which browsers typically run at 60 fps.

See ARCHITECTURE.md for the design contract: engine lifecycle, mode interface, palette, telemetry, custom elements.


Curriculum

Part I — Foundations: The Mathematics Underneath Everything

Vectors and matrices. GEMM. Loss functions and cross-entropy. Gradient descent and backpropagation. The artificial neuron. Activation functions (GELU, SwiGLU). MLPs and the Universal Approximation Theorem. RNNs and the sequential bottleneck. Tokenization (BPE) and embeddings. Q/K/V projections. Scaled dot-product attention. Multi-head self-attention.
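Part I ends at scaled dot-product attention, softmax(QKᵀ / √d_k) V. A minimal pure-Python sketch of that formula is shown here for orientation; it handles a single head with no causal mask and no batching, and the helper names are chosen for this example rather than taken from the chapters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Output row is the attention-weighted average of the value rows.
        out.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return out


# A zero query scores every key equally, so the output is the mean of the values.
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 4.0]]
print(attention([[0.0, 0.0]], K, V))  # [[1.0, 2.0]]
```

Multi-head attention runs this same computation once per head on separately projected Q, K, and V, then concatenates the results.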

Part II — Single-Node Hardware Limits

Autoregressive generation. The O(N²) explosion. The Key-Value cache and why it changes the bottleneck from compute to memory. KV compression: MQA, GQA, and DeepSeek's MLA. Long-context strategies. GPU internals: Tensor Cores and HBM. Arithmetic intensity. The Roofline Model. Prefill vs decode boundedness. Why lower-precision formats (FP8, MXFP4) speed up memory-bound decode, roughly 2× for FP8 over FP16. Static vs continuous batching. PagedAttention.
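The memory argument comes down to arithmetic. The sketch below estimates KV cache size as 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch × bytes per element. The model shape is a hypothetical one picked for round numbers, and MLA is left out because it caches a compressed latent rather than per-head keys and values.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 2 ** 30

# Hypothetical model: 32 layers, 32 query heads of dimension 128, 4096-token context.
shape = dict(layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)   # 4 query heads share each KV head
mqa = kv_cache_bytes(kv_heads=1, **shape)   # all query heads share one KV head
mha_fp8 = kv_cache_bytes(kv_heads=32, bytes_per_elem=1, **shape)

print(f"MHA FP16: {mha / GIB} GiB")      # MHA FP16: 2.0 GiB
print(f"GQA FP16: {gqa / GIB} GiB")      # GQA FP16: 0.5 GiB
print(f"MQA FP16: {mqa / GIB} GiB")      # MQA FP16: 0.0625 GiB
print(f"MHA FP8:  {mha_fp8 / GIB} GiB")  # MHA FP8:  1.0 GiB
```

Decode reads the whole cache for every generated token, so in the memory-bound regime halving the bytes (fewer KV heads or a narrower data type) roughly halves the time spent on that read.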

Part III — Clusters & Agents: The Frontier

Co-location interference. TTFT vs ITL SLOs. Splitwise and DistServe disaggregation. KV transit over NVLink, InfiniBand, RoCE/RDMA. Speculative decoding. Mixture of Experts. Multi-LoRA serving (S-LoRA, Punica). The ReAct loop. Reasoning models and thinking budgets (GRPO). Memory architectures: short-term vs long-term vector DBs. Cosine similarity and RAG. Multi-agent topologies (AutoGen).
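Of the topics above, speculative decoding is the easiest to misread, so here is a toy sketch of its greedy variant. Both "models" are stand-in functions invented for this example: `target_next` plays the large model and `draft_next` plays a small model that usually agrees with it. Production systems verify all drafted positions in one batched forward pass and, when sampling, use a probabilistic accept/reject rule instead of the exact-match test shown here.

```python
def target_next(tokens):
    """Toy stand-in for the large model: a deterministic next token for a prefix."""
    return (sum(tokens) * 31 + len(tokens)) % 10


def draft_next(tokens):
    """Toy stand-in for the draft model: wrong whenever the prefix length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 10 if len(tokens) % 4 == 0 else guess


def greedy_decode(prompt, n_new):
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(prompt, n_new, k=3):
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # One verification pass; a real target scores these positions in parallel.
        target_passes += 1
        accepted = []
        for d in draft:
            t = target_next(tokens + accepted)
            accepted.append(t)  # always keep the target's own token
            if t != d:
                break  # first mismatch: discard the rest of the draft
        else:
            # Every draft token matched, so the pass also yields a bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
spec_tokens, passes = speculative_decode(prompt, n_new=12)
print(spec_tokens == greedy_decode(prompt, 12), passes)  # True 4
```

The output is identical to plain greedy decoding because every kept token is the target's own choice; the gain is that 12 tokens took 4 target passes instead of 12.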

The promise

Read this textbook front to back and the following sentence will make complete mechanical sense:

"Our serving cluster runs DeepSeek V4 with disaggregated prefill-decode, FP8 quantization on the decode pool, MLA-compressed KV cache transferred over RoCE, continuous batching with PagedAttention, and speculative decoding using a 2B draft model. The agents on top are ReAct loops with cosine-similarity retrieval over a vector database, and the reasoning calls go to a separate pool with a larger thinking budget."


Who this is for

  • Engineers who can write Python but have never derived a gradient
  • ML practitioners who can train models but have never served them at scale
  • Systems engineers who can serve models but want to understand what they're actually serving
  • Anyone who has read "Attention Is All You Need" and felt like only 30% of it made sense

No prerequisites. Each chapter builds strictly on the previous one. Read in order — the math compounds.


How to use it

  1. Run the local server (or open the deployed site).
  2. Pick a chapter from the left sidebar. The library pane (middle) shows prose, exercises, and inline visual triggers; the workshop pane (right) loads the matching visualization.
  3. Adjust parameters at the bottom of the workshop pane. Telemetry (GPU util, VRAM, TTFT, etc.) updates live.
  4. Click highlighted phrases in the chapter text — these are <vis-trigger> elements that switch the canvas to a sub-mode of the current visualization.
  5. Watch the console at the bottom-right to correlate what you read with what the system is doing.

Contributing

Add a chapter, refactor a visualization, fix a typo — all welcome. See CONTRIBUTING.md for:

  • Running and testing locally
  • Adding a new chapter
  • Adding a new visualization mode
  • Pedagogy guidelines (the 5-axis rubric: concept clarity / label quality / meaningful motion / state legibility / param relevance)
  • Color palette and design tokens

Currency

Written against the state of the art as of May 2026. Models referenced — DeepSeek V4 / V4 Pro / V4 Flash, Llama 4, Qwen 3.7, Claude Opus 4.7 / Sonnet 4.6, Gemini 3.1 Pro, Moonshot Kimi, Mixtral. Serving systems referenced — vLLM (PagedAttention), Orca (continuous batching), Splitwise and DistServe (disaggregation), SGLang, S-LoRA, Punica.

The field moves fast. Treat the textbook as a stable scaffold and the citations as the live edge.


License

MIT. Use it, fork it, teach with it. Cite the linked papers — they did the original work.
