Evaluation framework for testing LLM code generation on Svelte 5.
LLMs often generate outdated Svelte code, mixing Svelte 4 reactive statements (`$:`, plain `let` for reactivity) with Svelte 5 runes (`$state`, `$derived`, `$effect`). Generic benchmarks like HumanEval don't catch this.
These are the patterns we check for:
| Old (Svelte 4) | New (Svelte 5) | Rule |
|---|---|---|
| `let count = 0` (reactive) | `let count = $state(0)` | RUNE_STATE |
| `$: doubled = count * 2` | `const doubled = $derived(count * 2)` | RUNE_DERIVED |
| `$: { console.log(x) }` | `$effect(() => { console.log(x) })` | RUNE_EFFECT |
| `export let prop` | `let { prop } = $props()` | RUNE_PROPS |
| `<slot />` | `{@render children()}` | SNIPPET_SLOT |
| `on:click={handler}` | `onclick={handler}` | EVENT_HANDLER |
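As a sketch of how these rules might be enforced, the legacy Svelte 4 patterns can be flagged with regexes over the generated component source. The function name, regexes, and scoring below are illustrative assumptions, not the actual contents of `evals/compliance.py`:

```python
# Illustrative sketch: flag legacy Svelte 4 patterns in generated output.
# Rule names match the table above; regexes and scoring are assumptions.
import re

LEGACY_PATTERNS = {
    "RUNE_DERIVED":  re.compile(r"\$:\s*\w+\s*="),     # $: doubled = count * 2
    "RUNE_EFFECT":   re.compile(r"\$:\s*\{"),          # $: { console.log(x) }
    "RUNE_PROPS":    re.compile(r"\bexport\s+let\b"),  # export let prop
    "SNIPPET_SLOT":  re.compile(r"<slot\b"),           # <slot />
    "EVENT_HANDLER": re.compile(r"\bon:\w+\s*="),      # on:click={handler}
    # RUNE_STATE (a plain reactive `let`) can't be caught reliably by a regex
    # alone; a real checker would inspect the compiled AST instead.
}

def check_compliance(source: str) -> dict:
    """Return the legacy patterns found and a simple per-rule compliance score."""
    violations = {
        rule: pattern.findall(source)
        for rule, pattern in LEGACY_PATTERNS.items()
        if pattern.search(source)
    }
    score = 1 - len(violations) / len(LEGACY_PATTERNS)
    return {"violations": violations, "compliance": score}

if __name__ == "__main__":
    bad = "<script>\n  export let name;\n  $: greeting = 'Hi ' + name;\n</script>"
    print(check_compliance(bad))  # flags RUNE_PROPS and RUNE_DERIVED -> 0.6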
svelte-evals/
├── collectors/
│ ├── docs_collector.py # Scrape Svelte docs examples
│ └── repo_collector.py # Clone and extract from repos
├── evals/
│ ├── compliance.py # Svelte 5 pattern checking
│ ├── syntax.py # Compilation/parsing checks
│ └── runner.py # Run evals against LLMs
├── data/
│ ├── docs/ # Raw docs examples
│ ├── repos/ # Raw repo code
│ └── processed/ # Eval tasks (prompt + solution)
└── results/ # Eval outputs per model
Data sources:
- Official Svelte documentation and examples
- Community repositories (Svelte 5 compatible)
# 1. Collect data
python3 collectors/docs_collector.py
python3 collectors/repo_collector.py
# 2. Generate eval tasks
python3 evals/generate_tasks.py
# 3. Run evals
python3 evals/runner.py --model ollama:llama3.2 --tasks data/processed/test_set.json
python3 evals/runner.py --model gpt-4 --tasks data/processed/all_tasks.json
python3 evals/runner.py --model claude-3-sonnet-20240229 --tasks data/processed/test_set.json
# 4. Compare
python3 evals/compare.py results/*.json
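For the comparison step, here is a minimal sketch of what `evals/compare.py` could do; the per-file fields `model`, `pass_rate`, and `avg_compliance` are assumptions about the results schema, not the real one:

```python
# Illustrative sketch of comparing result files (not the actual compare.py).
import json
import sys
from pathlib import Path

def compare(paths: list[str]) -> None:
    rows = []
    for path in paths:
        data = json.loads(Path(path).read_text())
        rows.append((
            data.get("model", Path(path).stem),   # fall back to the filename
            float(data.get("pass_rate", 0.0)),
            float(data.get("avg_compliance", 0.0)),
        ))
    rows.sort(key=lambda r: r[1], reverse=True)   # best pass rate first
    print(f"{'model':<32}{'pass rate':>10}{'compliance':>12}")
    for model, pass_rate, compliance in rows:
        print(f"{model:<32}{pass_rate:>9.0%}{compliance:>11.0%}")

if __name__ == "__main__":
    compare(sys.argv[1:])  # e.g. python3 evals/compare.py results/*.json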
We tested 6 different prompting strategies to find what works best for Svelte 5 code generation:
| Strategy | Pass Rate | Compliance | Description |
|---|---|---|---|
| `explicit` | 60% | 98% | Explicit rules with "Use X (NOT Y)" format |
| `few_shot` | 40% | 99% | Single correct example before the task |
| `negative` | 40% | 96% | List of patterns to avoid |
| `combined` | 40% | 83% | Rules + example together |
| `basic` | 0% | 73% | Simple system prompt |
| `minimal` | 0% | 20% | No guidance (baseline) |
Winner: `explicit`. Stating rules explicitly with contrasting examples works best.
This prompt format was most effective:
IMPORTANT: Use Svelte 5 syntax only. The rules are:
- Use `let x = $state(value)` for reactive state (NOT `let x = value`)
- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)
- Use `$effect(() => {...})` for side effects (NOT `$: {...}`)
- Use `let { prop } = $props()` for component props (NOT `export let prop`)
- Use `onclick={handler}` for events (NOT `on:click={handler}`)
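To illustrate how a `--strategy` value might turn into a system prompt, here is a hedged sketch; the strategy names match the table above, but the dictionary, the placeholder texts for `minimal` and `basic`, and the `build_messages` helper are hypothetical rather than the actual `evals/runner.py` code:

```python
# Illustrative sketch: mapping --strategy values to system prompts.
EXPLICIT_RULES = """IMPORTANT: Use Svelte 5 syntax only. The rules are:
- Use `let x = $state(value)` for reactive state (NOT `let x = value`)
- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)
- Use `$effect(() => {...})` for side effects (NOT `$: {...}`)
- Use `let { prop } = $props()` for component props (NOT `export let prop`)
- Use `onclick={handler}` for events (NOT `on:click={handler}`)"""

STRATEGIES = {
    "minimal":  "",                                     # baseline: no guidance
    "basic":    "You are an expert Svelte developer.",  # placeholder wording
    "explicit": EXPLICIT_RULES,
    # few_shot / negative / combined omitted for brevity
}

def build_messages(strategy: str, task_prompt: str) -> list[dict]:
    """Assemble the chat messages sent to the model for one eval task."""
    system = STRATEGIES[strategy]
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": task_prompt})
    return messages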
# Run with explicit strategy (recommended)
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit
# List all strategies
python3 evals/runner.py --list-strategies
# Compare strategies
python3 evals/runner.py --model ollama:llama3.2 --strategy minimal
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit
python3 evals/compare.py results/*.json
Without guidance (minimal):
Pass rate: 0.0% (0/10)
Avg compliance: 20%
With explicit rules:
Pass rate: 60.0% (6/10)
Avg compliance: 98%
The same model (llama3.2) goes from 0% to 60% pass rate just by adding explicit rules to the prompt.
{
"id": "state-counter-001",
"category": "reactivity",
"prompt": "Create a Svelte 5 component with a counter that increments on button click",
"reference": "<script>\n let count = $state(0);\n</script>\n<button onclick={() => count++}>{count}</button>",
"tests": [
{"type": "compiles", "expected": true},
{"type": "contains_pattern", "pattern": "\\$state\\(", "expected": true},
{"type": "not_contains", "pattern": "\\$:", "expected": true}
]
}
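A minimal sketch of how a task's `tests` array could be graded against model output follows; the test types mirror the example above, while `run_tests` and the stubbed compile check are assumptions (the actual compile check presumably sits behind `evals/syntax.py`):

```python
# Illustrative grader for the task format above; `run_tests` and the stubbed
# compile check are assumptions, not the project's actual code.
import re

def run_tests(generated: str, tests: list[dict]) -> bool:
    """A task passes only if every entry in its `tests` list passes."""
    for test in tests:
        if test["type"] == "compiles":
            result = svelte_compiles(generated)
        elif test["type"] == "contains_pattern":
            result = re.search(test["pattern"], generated) is not None
        elif test["type"] == "not_contains":
            result = re.search(test["pattern"], generated) is None
        else:
            result = False                      # unknown test type
        if result != test["expected"]:
            return False
    return True

def svelte_compiles(source: str) -> bool:
    # Placeholder: a real check would invoke the Svelte compiler via Node.
    return "<script>" in source

# Example: rune-based output for the counter task above passes all three tests.
good = "<script>\n  let count = $state(0);\n</script>\n<button onclick={() => count++}>{count}</button>"
tests = [
    {"type": "compiles", "expected": True},
    {"type": "contains_pattern", "pattern": r"\$state\(", "expected": True},
    {"type": "not_contains", "pattern": r"\$:", "expected": True},
]
print(run_tests(good, tests))  # True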