An LLM benchmark for Svelte 5 based on the HumanEval methodology from OpenAI's paper "Evaluating Large Language Models Trained on Code". This benchmark evaluates LLMs' ability to generate functional Svelte 5 components with proper use of runes and modern Svelte features.
SvelteBench evaluates LLM-generated Svelte components by testing them against predefined test suites. It works by sending prompts to LLMs, generating Svelte components, and verifying their functionality through automated tests. The benchmark calculates pass@k metrics (typically pass@1 and pass@10) to measure model performance.
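At a high level the run loop is: for each test, generate n component samples from the model, run each sample against the test suite, and derive pass@k from the pass count. A minimal TypeScript sketch of that loop (hypothetical function names, sketching the methodology rather than the project's actual code):

```ts
// Sketch of the evaluation loop: generate n samples per test, run each against
// its test suite, and compute pass@k from the number of passing samples.
// passAtK is the HumanEval estimator (see the pass@k section further down).
declare function passAtK(n: number, c: number, k: number): number;

async function evaluateTest(
  prompt: string,
  generate: (prompt: string) => Promise<string>, // stand-in for one LLM call
  runTests: (componentSource: string) => Promise<boolean>, // stand-in for a Vitest run
  n = 10,
): Promise<{ passAt1: number; passAt10: number }> {
  let passed = 0;
  for (let i = 0; i < n; i++) {
    const source = await generate(prompt); // one generated Svelte component
    if (await runTests(source)) passed++;
  }
  return { passAt1: passAtK(n, passed, 1), passAt10: passAtK(n, passed, 10) };
}
```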
SvelteBench supports multiple LLM providers: OpenAI, Anthropic, Google Gemini, OpenRouter, Ollama, and Z.ai.
To set up the project:
nvm use
pnpm install
# Create .env file from example
cp .env.example .env
Then edit the .env file and add your API keys:
# OpenAI (optional)
OPENAI_API_KEY=your_openai_api_key_here
# Anthropic (optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here
# Google Gemini (optional)
GEMINI_API_KEY=your_gemini_api_key_here
# OpenRouter (optional)
OPENROUTER_API_KEY=your_openrouter_api_key_here
OPENROUTER_SITE_URL=https://github.com/khromov/svelte-bench # Optional
OPENROUTER_SITE_NAME=SvelteBench # Optional
OPENROUTER_PROVIDER=deepseek # Optional - preferred provider routing
# Ollama (optional - defaults to http://127.0.0.1:11434)
OLLAMA_HOST=http://127.0.0.1:11434
# Z.ai (optional)
Z_AI_API_KEY=your_z_ai_api_key_here
You only need to configure the providers you want to test with.
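Conceptually, a provider is only exercised when its credentials are present. The check can be thought of as follows (illustrative only, not the project's actual code; Ollama needs no API key, just a reachable host):

```ts
// Illustrative: derive the set of providers to benchmark from the env vars above.
const providers = [
  { name: "openai", configured: Boolean(process.env.OPENAI_API_KEY) },
  { name: "anthropic", configured: Boolean(process.env.ANTHROPIC_API_KEY) },
  { name: "gemini", configured: Boolean(process.env.GEMINI_API_KEY) },
  { name: "openrouter", configured: Boolean(process.env.OPENROUTER_API_KEY) },
  { name: "zai", configured: Boolean(process.env.Z_AI_API_KEY) },
].filter((p) => p.configured);
```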
# Run the full benchmark (sequential execution)
pnpm start
# Run with parallel sample generation (faster)
PARALLEL_EXECUTION=true pnpm start
# Run tests only (without building visualization)
pnpm run run-tests
NOTE: This will run all providers and models that are available!
SvelteBench supports two execution modes: sequential (the default) and parallel sample generation, enabled by setting PARALLEL_EXECUTION=true.
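As a rough illustration of the difference between the two modes (hypothetical helper, not the project's actual code):

```ts
// Sketch: run n sample generations for one test either sequentially (default)
// or concurrently when PARALLEL_EXECUTION=true. generateSample stands in for
// one LLM call plus test run.
async function runSamples(
  n: number,
  generateSample: (index: number) => Promise<boolean>,
): Promise<boolean[]> {
  if (process.env.PARALLEL_EXECUTION === "true") {
    // All samples fired at once
    return Promise.all(Array.from({ length: n }, (_, i) => generateSample(i)));
  }
  // One sample at a time
  const results: boolean[] = [];
  for (let i = 0; i < n; i++) {
    results.push(await generateSample(i));
  }
  return results;
}
```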
For faster development, or to run just one provider/model, you can enable debug mode in your .env file:
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219
DEBUG_TEST=counter
Debug mode runs only one provider/model combination, making it much faster for testing during development.
You can now specify multiple models to test in debug mode by providing a comma-separated list:
DEBUG_MODE=true
DEBUG_PROVIDER=anthropic
DEBUG_MODEL=claude-3-7-sonnet-20250219,claude-opus-4-20250514,claude-sonnet-4-20250514
This will run tests with all three models sequentially while still staying within the same provider.
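Handling the comma-separated value only requires a split; for example (illustrative):

```ts
// Illustrative: parse a comma-separated DEBUG_MODEL value into a list of model names.
const debugModels = (process.env.DEBUG_MODEL ?? "")
  .split(",")
  .map((m) => m.trim())
  .filter(Boolean);
```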
You can provide a context file (like Svelte documentation) to help the LLM generate better components:
# Run with a context file
pnpm run run-tests -- --context ./context/svelte.dev/llms-small.txt && pnpm run build
The context file will be included in the prompt to the LLM, providing additional information for generating components.
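Conceptually, the context is prepended to the test's prompt before it is sent to the model. A sketch of that combination (the exact prompt layout SvelteBench uses may differ):

```ts
import { readFile } from "node:fs/promises";

// Sketch: combine an optional context file with a test's prompt.md.
async function buildPrompt(promptPath: string, contextPath?: string): Promise<string> {
  const prompt = await readFile(promptPath, "utf-8");
  if (!contextPath) return prompt;
  const context = await readFile(contextPath, "utf-8");
  return `${context}\n\n${prompt}`;
}
```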
After running the benchmark, you can visualize the results using the built-in visualization tool:
pnpm run build
You can now find the visualization in the dist directory.
To add a new test:
1. Create a new directory in src/tests/ with the name of your test
2. Add a prompt.md file with instructions for the LLM
3. Add a test.ts file with Vitest tests for the generated component (see the example after the structure below)
4. Add a Reference.svelte file with a reference implementation for validation

Example structure:
src/tests/your-test/
├── prompt.md # Instructions for the LLM
├── test.ts # Tests for the generated component
└── Reference.svelte # Reference implementation
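As a rough illustration, a test.ts for a hypothetical counter test might look like the sketch below. It assumes the harness places the generated component next to the test file as Component.svelte and that @testing-library/svelte plus a DOM environment (e.g. jsdom) are configured; the real test files in src/tests/ may differ:

```ts
// Hypothetical test.ts for a "counter" test. Component.svelte is assumed to be
// the generated component under test.
import { describe, it, expect } from "vitest";
import { render, screen, fireEvent } from "@testing-library/svelte";
import Counter from "./Component.svelte";

describe("Counter", () => {
  it("renders an initial count and increments on click", async () => {
    render(Counter);
    const button = screen.getByRole("button");
    expect(button.textContent ?? "").toContain("0");
    await fireEvent.click(button);
    expect(button.textContent ?? "").toContain("1");
  });
});
```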
After running the benchmark, results are saved in multiple formats:
- benchmarks/benchmark-results-{timestamp}.json - Machine-readable results with pass@k metrics
- benchmarks/benchmark-results-{timestamp}.html - Interactive visualization of results
- benchmarks/benchmark-results-{provider}-{model}-{timestamp}.json - Per-model results

When running with a context file, the results filename will include "with-context" in the name.
Current Results: All new benchmark runs produce results in the current format.
Legacy Results (v1): Historical results from the original test suite with known issues in the "inspect" test prompt (stored in benchmarks/v1/).
You can merge multiple benchmark results into a single file:
# Merge current results (recommended)
pnpm run merge
# Merge legacy results (if needed)
pnpm run merge-v1
# Build visualization from current results
pnpm run build
# Build visualization from legacy results
pnpm run build-v1
This creates merged JSON and HTML files:
- pnpm run merge → benchmarks/benchmark-results-merged.{json,html} (current results)
- pnpm run merge-v1 → benchmarks/v1/benchmark-results-merged.{json,html} (legacy results)

The standard build process uses current results by default.
SvelteBench automatically saves checkpoints at the sample level, allowing you to resume interrupted benchmark runs. Checkpoints are written to tmp/checkpoint/ after each sample completes.
API calls have configurable retry logic with exponential backoff. Configure it in your .env file:
RETRY_MAX_ATTEMPTS=3 # Maximum retry attempts (default: 3)
RETRY_INITIAL_DELAY_MS=1000 # Initial delay before retry (default: 1000ms)
RETRY_MAX_DELAY_MS=30000 # Maximum delay between retries (default: 30s)
RETRY_BACKOFF_FACTOR=2 # Exponential backoff factor (default: 2)
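A minimal sketch of a retry wrapper driven by these settings (illustrative; the actual implementation may differ):

```ts
// Sketch: retry an async call with exponential backoff using the RETRY_* settings above.
async function withRetry<T>(fn: () => Promise<T>): Promise<T> {
  const maxAttempts = Number(process.env.RETRY_MAX_ATTEMPTS ?? 3);
  const initialDelay = Number(process.env.RETRY_INITIAL_DELAY_MS ?? 1000);
  const maxDelay = Number(process.env.RETRY_MAX_DELAY_MS ?? 30000);
  const factor = Number(process.env.RETRY_BACKOFF_FACTOR ?? 2);

  let delay = initialDelay;
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up after the last attempt
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * factor, maxDelay); // exponential backoff, capped
    }
  }
}
```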
Before running benchmarks, models are automatically validated to ensure they're available and properly configured. Invalid models are skipped with appropriate warnings.
The benchmark calculates pass@k metrics based on the HumanEval methodology:
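For each test, if n samples are generated and c of them pass, the HumanEval unbiased estimator is pass@k = 1 − C(n−c, k) / C(n, k), averaged over all tests. A TypeScript sketch of that calculation (the project's actual implementation may differ in details):

```ts
// HumanEval pass@k estimator: 1 - C(n-c, k) / C(n, k),
// computed in the numerically stable product form used in the original paper.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1.0; // every size-k draw contains at least one passing sample
  let prod = 1.0;
  for (let i = n - c + 1; i <= n; i++) {
    prod *= 1 - k / i;
  }
  return 1 - prod;
}

// Example: 10 samples, 3 passing -> pass@1 = 0.3
console.log(passAtK(10, 3, 1));
```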
Verify that all tests have proper structure:
pnpm run verify
This checks that each test has required files (prompt.md, test.ts, Reference.svelte).
The benchmark includes tests for core Svelte 5 features:
- $state rune
- $derived rune
- $derived.by
- $effect rune
- $props rune
- {#each} blocks
- $inspect rune

If a run fails or produces no results, check that your .env file contains API keys for the providers you want to test.
Also make sure dependencies are installed (pnpm install) and, if runs are slow, try parallel execution (PARALLEL_EXECUTION=true). For closer debugging, examine the generated components in the tmp/samples/ directories and the test output in the console.
Contributions are welcome!
License: MIT