Svelte 5 Evals

Evaluation framework for testing LLM code generation on Svelte 5.

The Problem

LLMs generate outdated Svelte code—mixing Svelte 4 reactive statements ($:, let for reactivity) with Svelte 5 runes ($state, $derived, $effect). Generic benchmarks like HumanEval don't catch this.

What This Does

  1. Collects real Svelte 5 code from official docs and top repos
  2. Generates eval tasks (prompt + expected behavior)
  3. Tests LLM outputs for:
    • Syntax validity (does it compile?)
    • Svelte 5 compliance (uses runes correctly?)
    • Functional correctness (does it work?)
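
Step 3 boils down to scoring each generation on those three axes. A minimal sketch of that scoring loop, assuming hypothetical check functions (the real logic lives in evals/syntax.py, evals/compliance.py, and evals/runner.py, whose exact interfaces may differ):

import json
from pathlib import Path

def evaluate_generation(task: dict, code: str,
                        check_syntax, check_compliance, run_tests) -> dict:
    """Score one generated component against one eval task.
    check_syntax, check_compliance, and run_tests are placeholders for the
    real checks in evals/ (their names here are assumptions)."""
    syntax_ok = check_syntax(code)            # 3a: does it compile?
    compliance = check_compliance(code)       # 3b: Svelte 5 rune usage, 0.0-1.0
    functional = run_tests(task, code)        # 3c: task-specific behavior checks
    return {
        "id": task["id"],
        "syntax_valid": syntax_ok,
        "compliance_score": compliance,
        "functional_correct": functional,
        "pass": syntax_ok and functional,     # exact pass criteria are an assumption
    }

def run_eval(tasks_path: str, generate, checks) -> list[dict]:
    """Load eval tasks (step 2 output), query the model, score each generation."""
    tasks = json.loads(Path(tasks_path).read_text())
    return [evaluate_generation(t, generate(t["prompt"]), *checks) for t in tasks]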

Svelte 5 Compliance Rules

These are the patterns we check for:

| Old (Svelte 4)              | New (Svelte 5)                        | Rule          |
|-----------------------------|---------------------------------------|---------------|
| `let count = 0` (reactive)  | `let count = $state(0)`               | RUNE_STATE    |
| `$: doubled = count * 2`    | `const doubled = $derived(count * 2)` | RUNE_DERIVED  |
| `$: { console.log(x) }`     | `$effect(() => { console.log(x) })`   | RUNE_EFFECT   |
| `export let prop`           | `let { prop } = $props()`             | RUNE_PROPS    |
| `<slot />`                  | `{@render children()}`                | SNIPPET_SLOT  |
| `on:click={handler}`        | `onclick={handler}`                   | EVENT_HANDLER |
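
Most of these rules can be approximated with pattern matching over the generated component. The sketch below is illustrative only, not the implementation in evals/compliance.py; the regexes are intentionally rough, and RUNE_STATE is omitted because spotting a plain let used reactively takes more than a regex.

import re

# Rough Svelte 4 patterns that should NOT appear in compliant Svelte 5 code.
SVELTE4_PATTERNS = {
    "RUNE_DERIVED":  re.compile(r"^\s*\$:\s*\w+\s*=", re.MULTILINE),  # $: x = ...
    "RUNE_EFFECT":   re.compile(r"^\s*\$:\s*\{", re.MULTILINE),       # $: { ... }
    "RUNE_PROPS":    re.compile(r"\bexport\s+let\s+\w+"),             # export let prop
    "SNIPPET_SLOT":  re.compile(r"<slot\b"),                          # <slot />
    "EVENT_HANDLER": re.compile(r"\bon:\w+\s*="),                     # on:click={...}
}

def compliance_violations(code: str) -> list[str]:
    """Names of rules whose legacy pattern shows up in the generated code."""
    return [rule for rule, pattern in SVELTE4_PATTERNS.items() if pattern.search(code)]

def compliance_score(code: str) -> float:
    """Fraction of checked rules with no legacy pattern (1.0 = fully compliant)."""
    return 1 - len(compliance_violations(code)) / len(SVELTE4_PATTERNS)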

Project Structure

svelte-evals/
├── collectors/
│   ├── docs_collector.py     # Scrape Svelte docs examples
│   └── repo_collector.py     # Clone and extract from repos
├── evals/
│   ├── compliance.py         # Svelte 5 pattern checking
│   ├── syntax.py             # Compilation/parsing checks
│   └── runner.py             # Run evals against LLMs
├── data/
│   ├── docs/                  # Raw docs examples
│   ├── repos/                 # Raw repo code
│   └── processed/             # Eval tasks (prompt + solution)
└── results/                   # Eval outputs per model

Data Sources

Official:

  • svelte.dev/docs code examples
  • github.com/sveltejs/svelte/examples
  • github.com/sveltejs/kit/examples

Community (Svelte 5 compatible):

  • shadcn-svelte (UI components)
  • bits-ui (headless components)
  • melt-ui (headless primitives)
  • skeleton (UI toolkit)
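
A bare-bones collector along these lines could shallow-clone a source repo and pull out its .svelte files (assuming git is on PATH; the repo list and output layout here are illustrative, not necessarily what collectors/repo_collector.py does):

import subprocess
from pathlib import Path

# Illustrative subset of the community sources listed above.
REPOS = [
    "https://github.com/huntabyte/shadcn-svelte",
    "https://github.com/huntabyte/bits-ui",
]

def collect_repo(url: str, dest: Path = Path("data/repos")) -> list[Path]:
    """Shallow-clone a repo (if not already present) and list its .svelte files."""
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / url.rstrip("/").split("/")[-1]
    if not target.exists():
        subprocess.run(["git", "clone", "--depth", "1", url, str(target)], check=True)
    return sorted(target.rglob("*.svelte"))

if __name__ == "__main__":
    for repo in REPOS:
        files = collect_repo(repo)
        print(f"{repo}: {len(files)} .svelte files")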

Usage

# 1. Collect data
python3 collectors/docs_collector.py
python3 collectors/repo_collector.py

# 2. Generate eval tasks
python3 evals/generate_tasks.py

# 3. Run evals
python3 evals/runner.py --model ollama:llama3.2 --tasks data/processed/test_set.json
python3 evals/runner.py --model gpt-4 --tasks data/processed/all_tasks.json
python3 evals/runner.py --model claude-3-sonnet-20240229 --tasks data/processed/test_set.json

# 4. Compare
python3 evals/compare.py results/*.json
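
The comparison step just aggregates per-model result files. A sketch of what compare.py might do, assuming a hypothetical results schema with a model field and per-task pass / compliance_score entries:

import json
import sys
from pathlib import Path

# Assumed (hypothetical) results schema:
# {"model": "...", "results": [{"pass": true, "compliance_score": 0.98}, ...]}
def summarize(path: Path) -> tuple[str, float, float]:
    data = json.loads(path.read_text())
    results = data["results"]
    pass_rate = sum(r["pass"] for r in results) / len(results)
    compliance = sum(r["compliance_score"] for r in results) / len(results)
    return data["model"], pass_rate, compliance

if __name__ == "__main__":
    for result_file in sys.argv[1:]:
        model, pass_rate, compliance = summarize(Path(result_file))
        print(f"{model:<30} pass@1={pass_rate:.0%}  compliance={compliance:.0%}")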

Prompting Strategies

We tested 6 different prompting strategies to find what works best for Svelte 5 code generation:

| Strategy | Pass Rate | Compliance | Description                                |
|----------|-----------|------------|--------------------------------------------|
| explicit | 60%       | 98%        | Explicit rules with "Use X (NOT Y)" format |
| few_shot | 40%       | 99%        | Single correct example before the task     |
| negative | 40%       | 96%        | List of patterns to avoid                  |
| combined | 40%       | 83%        | Rules + example together                   |
| basic    | 0%        | 73%        | Simple system prompt                       |
| minimal  | 0%        | 20%        | No guidance (baseline)                     |

Winner: explicit - Stating rules explicitly with contrasting examples works best.

The Explicit Strategy

This prompt format was most effective:

IMPORTANT: Use Svelte 5 syntax only. The rules are:
- Use `let x = $state(value)` for reactive state (NOT `let x = value`)
- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)
- Use `$effect(() => {...})` for side effects (NOT `$: {...}`)
- Use `let { prop } = $props()` for component props (NOT `export let prop`)
- Use `onclick={handler}` for events (NOT `on:click={handler}`)
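
One straightforward way to wire strategies into the runner is to keep each one as a system prompt and prepend it to the task prompt. The registry and helper below are a sketch, not the runner's actual interface, and the non-explicit prompt texts are assumptions:

# Abbreviated for brevity; the full rule text is the explicit prompt shown above.
EXPLICIT_RULES = "\n".join([
    "IMPORTANT: Use Svelte 5 syntax only. The rules are:",
    "- Use `let x = $state(value)` for reactive state (NOT `let x = value`)",
    "- Use `const x = $derived(expr)` for computed values (NOT `$: x = expr`)",
    # ...plus the remaining three rules from the prompt above
])

# Illustrative registry; names mirror the strategy table, and the contents of
# the non-explicit entries are assumptions.
STRATEGIES = {
    "minimal": "",
    "basic": "You are an expert Svelte developer.",
    "explicit": EXPLICIT_RULES,
}

def build_messages(strategy: str, task_prompt: str) -> list[dict]:
    """Assemble a chat-style message list for an OpenAI/Ollama-compatible API."""
    system = STRATEGIES[strategy]
    messages = [{"role": "system", "content": system}] if system else []
    messages.append({"role": "user", "content": task_prompt})
    return messages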

Why This Works

  1. Explicit contrast - Showing what NOT to do helps override training data
  2. Format consistency - Same structure for each rule is easy to follow
  3. Inline alternatives - The correct syntax is right next to the wrong one
  4. No examples needed - Rules alone work better than few-shot for this task

Usage

# Run with explicit strategy (recommended)
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit

# List all strategies
python3 evals/runner.py --list-strategies

# Compare strategies
python3 evals/runner.py --model ollama:llama3.2 --strategy minimal
python3 evals/runner.py --model ollama:llama3.2 --strategy explicit
python3 evals/compare.py results/*.json

Example Results

Without guidance (minimal):

Pass rate: 0.0% (0/10)
Avg compliance: 20%

With explicit rules:

Pass rate: 60.0% (6/10)
Avg compliance: 98%

The same model (llama3.2) goes from 0% to 60% pass rate just by adding explicit rules to the prompt.

Eval Task Format

{
  "id": "state-counter-001",
  "category": "reactivity",
  "prompt": "Create a Svelte 5 component with a counter that increments on button click",
  "reference": "<script>\n  let count = $state(0);\n</script>\n<button onclick={() => count++}>{count}</button>",
  "tests": [
    {"type": "compiles", "expected": true},
    {"type": "contains_pattern", "pattern": "\\$state\\(", "expected": true},
    {"type": "not_contains", "pattern": "\\$:", "expected": true}
  ]
}
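
Each test type maps onto a small check. A sketch of how the test list could be evaluated (the compiles check is left as a callable because it needs the Svelte compiler, e.g. via a small Node helper, which is outside this sketch):

import re

def run_task_tests(task: dict, code: str, compiles) -> bool:
    """Evaluate a task's test list against generated code.
    `compiles` is a callable (str -> bool); hooking it up to the Svelte
    compiler is not shown here."""
    for test in task["tests"]:
        if test["type"] == "compiles":
            outcome = compiles(code)
        elif test["type"] == "contains_pattern":
            outcome = re.search(test["pattern"], code) is not None
        elif test["type"] == "not_contains":
            # expected: true means the pattern must be absent
            outcome = re.search(test["pattern"], code) is None
        else:
            raise ValueError(f"unknown test type: {test['type']}")
        if outcome != test["expected"]:
            return False
    return True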

Metrics

  • pass@1: Does the first generation pass all tests?
  • compliance_score: % of Svelte 5 patterns used correctly
  • syntax_valid: Does the code compile?
  • functional_correct: Does it match expected behavior?
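
Given per-task results like those sketched earlier, the aggregate metrics are simple averages (field names here are assumptions):

def aggregate_metrics(results: list[dict]) -> dict:
    """Average per-task results (one dict per eval task) into run-level metrics."""
    n = len(results)
    return {
        "pass@1": sum(r["pass"] for r in results) / n,
        "compliance_score": sum(r["compliance_score"] for r in results) / n,
        "syntax_valid": sum(r["syntax_valid"] for r in results) / n,
        "functional_correct": sum(r["functional_correct"] for r in results) / n,
    }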
