Which AI model is actually the best? We aggregate 20+ benchmarks so you don't have to.
Tired of cherry-picked benchmarks and marketing hype? Showdown provides transparent, community-maintained rankings of AI language models across real-world categories.
All data is open. All methodology is transparent. All contributions are welcome.
Visit showdown.best to explore the rankings.
Want to run it locally?
```bash
git clone https://github.com/verseles/showdown.git
cd showdown
npm install
npm run dev
```
We aggregate scores from 20+ industry benchmarks, weighted by practical importance:
| Category | Weight | What it measures |
|---|---|---|
| Coding | 25% | Real GitHub issues, live coding challenges |
| Reasoning | 25% | PhD science questions, novel problem solving |
| Agents & Tools | 18% | API usage, multi-step tasks, browser automation |
| Conversation | 12% | Creative writing, following complex instructions |
| Math | 10% | Competition math, word problems |
| Multimodal | 7% | Understanding images, charts, diagrams |
| Multilingual | 3% | Performance across languages |
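As a rough illustration of how these weights might combine into an overall score, here is a minimal TypeScript sketch. The names (`CATEGORY_WEIGHTS`, `overallScore`, `categoryScores`) are hypothetical and not the actual site code; the weights are the ones from the table above.

```typescript
// Hypothetical names for illustration; not the actual Showdown code.
const CATEGORY_WEIGHTS: Record<string, number> = {
  coding: 0.25,
  reasoning: 0.25,
  agentsAndTools: 0.18,
  conversation: 0.12,
  math: 0.1,
  multimodal: 0.07,
  multilingual: 0.03,
};

// categoryScores: average benchmark score per category, on a 0-100 scale.
function overallScore(categoryScores: Record<string, number>): number {
  let total = 0;
  for (const [category, weight] of Object.entries(CATEGORY_WEIGHTS)) {
    total += weight * (categoryScores[category] ?? 0);
  }
  return total;
}

// Example: a model that is strong at coding and reasoning, weaker elsewhere.
const score = overallScore({
  coding: 80,
  reasoning: 75,
  agentsAndTools: 60,
  conversation: 70,
  math: 65,
  multimodal: 50,
  multilingual: 55,
});
console.log(score.toFixed(1)); // "69.6"
```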
Scoring:
When benchmark data is missing, we use two estimation methods (illustrated in the sketch below):

- **Superior Model Imputation** (green *): for "thinking" variants, we compute the expected improvement over the base model from benchmarks where both have real data, then apply that ratio to the missing benchmarks. This is the more reliable method, since it is grounded in real performance differences.
- **Category Average** (yellow *): falls back to averaging the other benchmarks in the same category. Less reliable, but it ensures every model can be compared.
Note: Estimated values are clearly marked and should be replaced with real data when available. See UPDATE.md for details.
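A minimal sketch of how these two fallbacks might work. The type and function names (`Scores`, `imputeFromBaseModel`, `categoryAverage`) are hypothetical; the actual implementation lives in the repository and may differ.

```typescript
// Hypothetical types and helpers for illustration; not the real Showdown code.
type Scores = Record<string, number | undefined>; // benchmark name -> score (0-100)

// Superior Model Imputation (green *): estimate a "thinking" variant's missing score
// from its average ratio over the base model on benchmarks where both have real data.
function imputeFromBaseModel(
  variant: Scores,
  base: Scores,
  benchmark: string,
): number | undefined {
  const ratios: number[] = [];
  for (const name of Object.keys(base)) {
    const v = variant[name];
    const b = base[name];
    if (v !== undefined && b !== undefined && b > 0) ratios.push(v / b);
  }
  const baseScore = base[benchmark];
  if (ratios.length === 0 || baseScore === undefined) return undefined;
  const avgRatio = ratios.reduce((sum, r) => sum + r, 0) / ratios.length;
  return baseScore * avgRatio;
}

// Category Average (yellow *): fall back to the mean of the model's other
// benchmarks in the same category.
function categoryAverage(model: Scores, categoryBenchmarks: string[]): number | undefined {
  const known = categoryBenchmarks
    .map((name) => model[name])
    .filter((s): s is number => s !== undefined);
  if (known.length === 0) return undefined;
  return known.reduce((sum, s) => sum + s, 0) / known.length;
}
```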
- Have real data for an estimated or incorrect value? Open an issue with the correct value and source.
- Want a model added? Open an issue with available benchmark scores.
- Prefer to contribute directly? Edit `data/showdown.json`, then run `./precommit.sh` to validate your changes.

Rankings aggregate data from trusted sources.
AGPL-3.0 - Keep it open!
Built with Svelte. Hosted on Cloudflare. Made for the community.