Evaluation framework for testing LLM code generation on Svelte 5.
LLMs often generate outdated Svelte code, mixing Svelte 4 reactive statements (`$:`, plain `let` for reactivity) with Svelte 5 runes (`$state`, `$derived`, `$effect`). Generic benchmarks like HumanEval don't catch this.
These are the patterns we check for:
| Old (Svelte 4) | New (Svelte 5) | Rule |
|---|---|---|
| `let count = 0` (reactive) | `let count = $state(0)` | RUNE_STATE |
| `$: doubled = count * 2` | `const doubled = $derived(count * 2)` | RUNE_DERIVED |
| `$: { console.log(x) }` | `$effect(() => { console.log(x) })` | RUNE_EFFECT |
| `export let prop` | `let { prop } = $props()` | RUNE_PROPS |
| `<slot />` | `{@render children()}` | SNIPPET_SLOT |
| `on:click={handler}` | `onclick={handler}` | EVENT_HANDLER |
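As a sketch of how these rules might be enforced, the legacy Svelte 4 patterns can be flagged with regexes over the generated component source. The function name, regexes, and scoring below are illustrative assumptions, not the actual contents of `evals/compliance.py`:

```python
# Illustrative sketch: flag legacy Svelte 4 patterns in generated output.
# Rule names match the table above; regexes and scoring are assumptions.
import re

LEGACY_PATTERNS = {
    "RUNE_DERIVED":  re.compile(r"\$:\s*\w+\s*="),     # $: doubled = count * 2
    "RUNE_EFFECT":   re.compile(r"\$:\s*\{"),          # $: { console.log(x) }
    "RUNE_PROPS":    re.compile(r"\bexport\s+let\b"),  # export let prop
    "SNIPPET_SLOT":  re.compile(r"<slot\b"),           # <slot />
    "EVENT_HANDLER": re.compile(r"\bon:\w+\s*="),      # on:click={handler}
    # RUNE_STATE (a plain reactive `let`) can't be caught reliably by a regex
    # alone; a real checker would inspect the compiled AST instead.
}

def check_compliance(source: str) -> dict:
    """Return the legacy patterns found and a simple per-rule compliance score."""
    violations = {
        rule: pattern.findall(source)
        for rule, pattern in LEGACY_PATTERNS.items()
        if pattern.search(source)
    }
    score = 1 - len(violations) / len(LEGACY_PATTERNS)
    return {"violations": violations, "compliance": score}

if __name__ == "__main__":
    bad = "<script>\n  export let name;\n  $: greeting = 'Hi ' + name;\n</script>"
    print(check_compliance(bad))  # flags RUNE_PROPS and RUNE_DERIVED -> 0.6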
svelte-evals/
├── collectors/
│ ├── docs_collector.py # Scrape Svelte docs examples
│ └── repo_collector.py # Clone and extract from repos
├── evals/
│ ├── compliance.py # Svelte 5 pattern checking
│ ├── syntax.py # Compilation/parsing checks
│ └── runner.py # Run evals against LLMs
├── data/
│ ├── docs/ # Raw docs examples
│ ├── repos/ # Raw repo code
│ └── processed/ # Eval tasks (prompt + solution)
└── results/ # Eval outputs per model
Data sources:
- Official Svelte documentation and examples
- Community repositories (Svelte 5 compatible)
# 1. Collect data
python3 collectors/docs_collector.py
python3 collectors/repo_collector.py
# 2. Generate eval tasks
python3 evals/generate_tasks.py
# 3. Run evals
python3 evals/runner.py --model ollama:llama3.2 --tasks data/processed/test_set.json
python3 evals/runner.py --model gpt-4 --tasks data/processed/all_tasks.json
python3 evals/runner.py --model claude-3-sonnet-20240229 --tasks data/processed/test_set.json
# 4. Compare
python3 evals/compare.py results/*.json
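For the comparison step, here is a minimal sketch of what `evals/compare.py` could do; the per-file fields `model`, `pass_rate`, and `avg_compliance` are assumptions about the results schema, not the real one:

```python
# Illustrative sketch of comparing result files (not the actual compare.py).
import json
import sys
from pathlib import Path

def compare(paths: list[str]) -> None:
    rows = []
    for path in paths:
        data = json.loads(Path(path).read_text())
        rows.append((
            data.get("model", Path(path).stem),   # fall back to the filename
            float(data.get("pass_rate", 0.0)),
            float(data.get("avg_compliance", 0.0)),
        ))
    rows.sort(key=lambda r: r[1], reverse=True)   # best pass rate first
    print(f"{'model':<32}{'pass rate':>10}{'compliance':>12}")
    for model, pass_rate, compliance in rows:
        print(f"{model:<32}{pass_rate:>9.0%}{compliance:>11.0%}")

if __name__ == "__main__":
    compare(sys.argv[1:])  # e.g. python3 evals/compare.py results/*.json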
We tested 6 different prompting strategies to find what works best for Svelte 5 code generation:
| Strategy | Pass Rate | Compliance | Description |
|---|---|---|---|
| `explicit` | 60% | 98% | Explicit rules with "Use X (NOT Y)" format |
| `few_shot` | 40% | 99% | Single correct example before the task |
| `negative` | 40% | 96% | List of patterns to avoid |
| `combined` | 40% | 83% | Rules + example together |
| `basic` | 0% | 73% | Simple system prompt |
| `minimal` | 0% | 20% | No guidance (baseline) |
Winner: `explicit`. Stating rules explicitly with contrasting examples works best.
This prompt format was most effective:
IMPORTANT: Use Svelte 5 syntax only. The rules are:
- Use `let x = $state(value)` for reactive state (NOT `let x = value`)
- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)
- Use `$effect(() => {...})` for side effects (NOT `$: {...}`)
- Use `let { prop } = $props()` for component props (NOT `export let prop`)
- Use `onclick={handler}` for events (NOT `on:click={handler}`)
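To illustrate how a `--strategy` value might turn into a system prompt, here is a hedged sketch; the strategy names match the table above, but the dictionary, the placeholder texts for `minimal` and `basic`, and the `build_messages` helper are hypothetical rather than the actual `evals/runner.py` code:

```python
# Illustrative sketch: mapping --strategy values to system prompts.
EXPLICIT_RULES = """IMPORTANT: Use Svelte 5 syntax only. The rules are:
- Use `let x = $state(value)` for reactive state (NOT `let x = value`)
- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)
- Use `$effect(() => {...})` for side effects (NOT `$: {...}`)
- Use `let { prop } = $props()` for component props (NOT `export let prop`)
- Use `onclick={handler}` for events (NOT `on:click={handler}`)"""

STRATEGIES = {
    "minimal":  "",                                     # baseline: no guidance
    "basic":    "You are an expert Svelte developer.",  # placeholder wording
    "explicit": EXPLICIT_RULES,
    # few_shot / negative / combined omitted for brevity
}

def build_messages(strategy: str, task_prompt: str) -> list[dict]:
    """Assemble the chat messages sent to the model for one eval task."""
    system = STRATEGIES[strategy]
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": task_prompt})
    return messages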
# Run with explicit strategy (recommended)
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit
# List all strategies
python3 evals/runner.py --list-strategies
# Compare strategies
python3 evals/runner.py --model ollama:llama3.2 --strategy minimal
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit
python3 evals/compare.py results/*.json
Without guidance (minimal):
Pass rate: 0.0% (0/10)
Avg compliance: 20%
With explicit rules:
Pass rate: 60.0% (6/10)
Avg compliance: 98%
The same model (llama3.2) goes from 0% to 60% pass rate just by adding explicit rules to the prompt.
{
"id": "state-counter-001",
"category": "reactivity",
"prompt": "Create a Svelte 5 component with a counter that increments on button click",
"reference": "<script>\n let count = $state(0);\n</script>\n<button onclick={() => count++}>{count}</button>",
"tests": [
{"type": "compiles", "expected": true},
{"type": "contains_pattern", "pattern": "\\$state\\(", "expected": true},
{"type": "not_contains", "pattern": "\\$:", "expected": true}
]
}
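A minimal sketch of how a task's `tests` array could be graded against model output follows; the test types mirror the example above, while `run_tests` and the stubbed compile check are assumptions (the actual compile check presumably sits behind `evals/syntax.py`):

```python
# Illustrative grader for the task format above; `run_tests` and the stubbed
# compile check are assumptions, not the project's actual code.
import re

def run_tests(generated: str, tests: list[dict]) -> bool:
    """A task passes only if every entry in its `tests` list passes."""
    for test in tests:
        if test["type"] == "compiles":
            result = svelte_compiles(generated)
        elif test["type"] == "contains_pattern":
            result = re.search(test["pattern"], generated) is not None
        elif test["type"] == "not_contains":
            result = re.search(test["pattern"], generated) is None
        else:
            result = False                      # unknown test type
        if result != test["expected"]:
            return False
    return True

def svelte_compiles(source: str) -> bool:
    # Placeholder: a real check would invoke the Svelte compiler via Node.
    return "<script>" in source

# Example: rune-based output for the counter task above passes all three tests.
good = "<script>\n  let count = $state(0);\n</script>\n<button onclick={() => count++}>{count}</button>"
tests = [
    {"type": "compiles", "expected": True},
    {"type": "contains_pattern", "pattern": r"\$state\(", "expected": True},
    {"type": "not_contains", "pattern": r"\$:", "expected": True},
]
print(run_tests(good, tests))  # True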