Marcella is a ~60M parameter decoder-only transformer language model trained from scratch using PyTorch. The architecture draws from modern LLM design: RoPE positional embeddings, RMSNorm, SwiGLU FFN, weight-tied embeddings, and a custom per-layer KV cache for inference. Everything — tokenizer, data pipeline, training loop, and model — is built from scratch.
Our reliance over Large Language Models and the ridiculous compute it requires to run such models is absurd. Maybe if we shift to smaller, specialized models - models that can be run locally, inexpensively and efficiently. Starting with small chatbots, basic tool calling, domain specific models, small scale decision making or any such tasks that doesn't require a Large Language Model but we call them regardless, those must be replaced.
| Component | Value |
|---|---|
| Parameters | ~60M |
| Embedding dim | 576 |
| Transformer layers | 12 |
| Attention heads | 12 |
| Head dim | 48 |
| FFN hidden dim | 1,536 (× 2/3 × 4 rule) |
| Context length | 1,024 |
| Vocabulary size | 32,000 |
| Positional encoding | RoPE |
| Normalization | RMSNorm (pre-norm) |
| FFN activation | SwiGLU |
| Embeddings | StableEmbedding (bitsandbytes) |
| Weight tying | Yes (embedding ↔ LM head) |
| Biases | None (all linear layers are bias-free) |
| dtype | bfloat16 |
The attention module is a standard multi-head self-attention with fused QKV projection and two distinct forward paths depending on whether a KV cache is active:
Training path (kv_cache=None):
RoPE is applied to Q and K over the full sequence, then scaled_dot_product_attention (PyTorch's Flash Attention kernel) is used with causal masking and dropout.
Inference path (prefill — cache_len == 0):
The full prompt is processed at once with Flash Attention (causal, no dropout). The resulting K and V tensors are written into the cache.
Inference path (decode — cache_len > 0):
Only the new token's Q, K, V are computed. RoPE is applied with a position offset equal to the current cache length so positional encodings stay consistent with the cached keys. The new K/V are appended, then attention is computed manually over the full cached context: (Q @ K_all.T) * scale → softmax → @ V_all.
This means Flash Attention is used where sequence length is long (prefill), and manual attention is used where it's short (single decode step), which is the correct tradeoff.
RoPE frequencies are precomputed once up to 2 × max_seq_len and registered as non-persistent buffers. This gives headroom for generation beyond the training context without recomputation. The implementation rotates even/odd dimension pairs separately and interleaves them back, matching the standard formulation.
rot_even = x_even * cos − x_odd * sin
rot_odd = x_even * sin + x_odd * cos
During decode, apply_rope_offset slices the correct position from the precomputed tables using the cache length as an offset, keeping Q and K in sync with their absolute positions.
Each transformer block gets its own KV_Cache instance. The cache pre-allocates fixed-size tensors of shape (B, H, max_seq_len, head_dim) in bfloat16 on GPU at init time — no dynamic allocation during generation. A cache_len pointer tracks how far the cache has been filled, and update() writes new K/V slices into the appropriate positions. get_kv() returns only the filled slice, so attention never attends over padding.
Marcella.init_kv_cache() constructs one cache per layer and returns a list that's passed through the forward call at each decode step.
The feedforward block uses SwiGLU with the standard hidden dim scaling of 4 × embed_dim × 2/3:
FFN(x) = down_proj(silu(gate_proj(x)) * up_proj(x))
All three projections are bias-free.
Pre-norm architecture: RMSNorm is applied before both the attention and FFN sublayers, with residual connections around each.
x = x + Attention(RMSNorm(x))
x = x + FFN(RMSNorm(x))
A custom SentencePiece tokenizer (Marcella_vocab_32K_v2) trained on FineWeb-Edu with a vocabulary of 32,000 tokens. The Tokenizer wrapper exposes encode and decode and surfaces the BOS, EOS, PAD, and UNK ids directly.
| Source | Weight |
|---|---|
| FineWeb-Edu | 60% |
| Wikipedia | 25% |
| SlimPajama | 15% |
The dataset is pre-tokenized and stored as binary shards (.bin + .idx pairs). Each .idx file holds byte offsets into the corresponding .bin so sequences can be sliced without loading the full shard. ShardedDataset is an IterableDataset that streams shards in sorted order and supports resuming from a specific (shard_id, seq_idx) — which is saved into checkpoints so training can be resumed exactly where it left off.
uv run python -m training.train
| Setting | Value |
|---|---|
| Batch size | 4 |
| Gradient accumulation | 8 steps (effective batch = 32) |
| Max LR | 2e-4 |
| Min LR | 2e-5 |
| Warmup steps | 610 |
| Total steps | 61,035 |
| LR schedule | Linear warmup → cosine decay |
| Optimizer | AdamW |
Checkpoints save model weights, optimizer state, scheduler state, PyTorch and CUDA RNG states, the current shard and sequence index, and the W&B run ID. This means training can be resumed with exact reproducibility — the dataloader will rewind to the same position in the same shard.
Instruction finetuning on top of the pretrained weights using a masked language modelling objective — loss is only computed over the response tokens, not the instruction.
uv run python finetune.py
| Setting | Value |
|---|---|
| Optimizer | AdamW8bit (bitsandbytes) |
| Max LR | per config |
| LR schedule | Linear warmup → cosine decay |
| Weight decay | 0.1 (non-bias/norm params) |
| Gradient clipping | 1.0 |
| dtype | bfloat16 |
Prompts are formatted as:
### Instruction:
{instruction}
### Response:
{response}
The instruction tokens are masked out (loss_mask = False) so the model only learns to generate responses. The API applies this same template automatically when the finetuned model is active — raw prompts sent to /generate are wrapped transparently.
Evaluated on a held-out tail split (shards 40+, 1,140 sequences):
| Metric | Value |
|---|---|
| Checkpoint | step 488,000 |
| Eval tokens | 1,164,288 |
| Loss | 3.4925 |
| Perplexity | 32.87 |
| Eval throughput | 278,943.9 tokens/sec |
| Generation speed | ~40 tokens/sec |
A lightweight Svelte frontend with a FastAPI streaming backend. Tokens stream into the chat bubble in real time as the model generates them.
Start the backend:
uv run uvicorn api:app --host 0.0.0.0 --port 8000
Start the frontend (in a separate terminal):
cd ui
npm install
npm run dev
Then open http://localhost:5173 in your browser.
The sidebar exposes temperature, top-k, and max tokens controls — all applied per request. You can switch between the pretrained and finetuned model at runtime without restarting the server. Enter sends a message.
MARCELLA-60M/
├── models/
│ ├── Marcella_vocab_32K_v2.model # SentencePiece model
│ ├── Marcella_vocab_32K_v2.vocab
│ ├── marcella_finetuned.pt
│ └── marcella_pretrained.pt
├── src/
│ ├── attention.py # RoPE, KV cache, multi-head attention
│ ├── marcella.py # TransformerBlock, FFN, Marcella model
│ └── tokenizer.py # SentencePiece wrapper
├── training/
│ ├── checkpoints/
│ ├── data/ # Pre-tokenized binary shards
│ ├── checkpoint.py # Save/load with full RNG state
│ ├── config.py # All hyperparameters
│ ├── dataloader.py
│ ├── finetune.py # Intruction FineTuning
│ ├── finetuning_dataloader.py # Instruction dataset + loss masking
│ ├── train.py # Training loop
│ └── validation.txt
├── ui/
│ ├── public/
│ │ └── light.png
│ ├── src/
│ │ ├── App.svelte
│ │ └── main.js
│ ├── index.html
│ ├── package.json
│ └── vite.config.js
├── api.py
├── dark.png
├── light.png
├── pyproject.toml
└── uv.lock
MIT