
🔥 game-ui-refiner

Iteratively refine AI-generated game UIs into pixel-faithful Svelte components, using a vision critic + code generator loop.

A pure-browser tool that takes a screenshot of a game UI you want to replicate and runs an iterative loop where:

  1. A generator VLM writes a Svelte component + standalone HTML preview
  2. The browser renders the HTML in a sandboxed iframe and captures a screenshot
  3. A critic VLM compares (target, render) and emits a structured JSON critique
  4. The generator gets the critique and produces a refined version
  5. Repeat until you pause and apply your own manual feedback for the final touch

No Webpack, no Vite, no React, no Svelte (the irony). One HTML file + a tiny Python server + a handful of compiled TS modules. Open a browser tab and you're refining.
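
The five numbered steps above can be sketched as a single loop. This is illustrative TypeScript in the project's idiom; the function names and signatures are not the actual exports of src/main.ts:

```typescript
// Illustrative sketch of the critic+generator loop; not the real src/main.ts API.
interface Critique { scores: Record<string, number>; issues: string[]; }
interface DualOutput { svelte: string; html: string; }

async function refine(
  target: string,                                      // target image as a data URL
  epochs: number,
  generate: (target: string, critique?: Critique, prev?: string) => Promise<DualOutput>,
  render: (html: string) => Promise<string>,           // iframe + screenshot, returns a JPEG
  critic: (target: string, shot: string, code: string) => Promise<Critique>,
): Promise<{ svelte: string; history: Critique[] }> {
  const history: Critique[] = [];
  let out = await generate(target);                    // step 1: from the target alone
  for (let epoch = 1; epoch <= epochs; epoch++) {
    const shot = await render(out.html);               // step 2: render + screenshot
    const critique = await critic(target, shot, out.svelte); // step 3: structured critique
    history.push(critique);
    if (epoch < epochs) out = await generate(target, critique, out.svelte); // step 4
  }
  return { svelte: out.svelte, history };              // step 5 (pause + feedback) is manual
}
```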


The problem

You ask Claude/GPT/Gemini for a Svelte game UI component from a reference image and you get this:

target:           ┃ what the LLM gives you:
                  ┃
[ ⏸ ▶ ▶▶ ◆ ]      ┃  [pause][play][>>][x]
 ↑ bevels         ┃   ↑ flat
 ↑ gradients      ┃   ↑ no gradients
 ↑ pixel detail   ┃   ↑ generic Tailwind

The model knows what a button bar looks like, but it can't see its own output to verify that the bevels match, the gradient is right, or the spacing is exact. So you iterate manually for an hour.

This tool closes the loop: the model sees the target, generates code, the browser renders it, a different VLM compares the render to the target and emits structured feedback, and the generator iterates. After 4-5 epochs you get something that actually looks like the target.


How it works

flowchart TB
    T[🎯 Target<br/><b>image</b>]:::target --> G1[🧠 Generator VLM<br/><b>writes Svelte + HTML</b>]:::gen
    G1 --> R1[🖼️ Render<br/><b>in iframe</b>]:::render
    R1 --> S[📸 Screenshot<br/><b>html2canvas JPEG</b>]:::shot
    S --> C[🔍 Critic VLM<br/><b>compares target vs render</b>]:::critic
    T --> C
    C -->|📋 JSON scores + issues| G2[♻️ Generator refines<br/><b>using critique</b>]:::gen
    G2 --> R1
    C -->|⏸ pause / ⛰️ plateau| OUT[✅ Final .svelte<br/><b>auto run done</b>]:::out
    OUT --> H{✨ Final touch?}:::ask
    H -->|yes| HF[👤 Human feedback<br/><b>plain text</b>]:::human
    HF --> FG[👑 Forced Gemini 3.1 Pro<br/><b>top model, ignores preset</b>]:::premium
    FG --> OUT2[🏆 Polished .svelte<br/><b>deliverable</b>]:::out

    classDef target   fill:#1e293b,stroke:#fbbf24,stroke-width:2px,color:#fbbf24
    classDef gen      fill:#581c87,stroke:#c084fc,stroke-width:2px,color:#f5f3ff
    classDef render   fill:#0c4a6e,stroke:#38bdf8,stroke-width:2px,color:#e0f2fe
    classDef shot     fill:#134e4a,stroke:#2dd4bf,stroke-width:2px,color:#ccfbf1
    classDef critic   fill:#164e63,stroke:#22d3ee,stroke-width:2px,color:#cffafe
    classDef out      fill:#14532d,stroke:#4ade80,stroke-width:2px,color:#dcfce7
    classDef ask      fill:#422006,stroke:#fbbf24,stroke-width:2px,color:#fde68a
    classDef human    fill:#4c1d95,stroke:#a78bfa,stroke-width:2px,color:#ede9fe
    classDef premium  fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#fff7ed

The architecture is heavily inspired by GameUIAgent (the 6-stage pipeline + 5 quality dimensions for the critic), UI2Code^N (the iterative drafting/polishing paradigm), VisRefiner (the diff-aligned learning approach), and AutoGameUI (the separation of UI artist concerns from UX functional concerns). See Related work below for citations.

Per-epoch sequence

sequenceDiagram
    autonumber
    participant U as 👤 You
    participant L as 🔁 Loop
    participant G as 🧠 Generator<br/><b>Gemini 3.1 Pro</b>
    participant I as 🖼️ iframe
    participant C as 🔍 Critic<br/><b>Gemini Flash-Lite</b>
    U->>+L: ▶️ Click Start
    L->>+G: 🎯 target image + system prompt
    G-->>-L: 📦 svelte block<br/>📄 html block
    L->>+I: render via srcdoc
    I-->>-L: 📸 screenshot<br/><b>JPEG html2canvas</b>
    L->>+C: 🎯 target + 📸 render + code
    C-->>-L: 📋 critique JSON<br/><b>5 scores + issues</b>
    L->>L: 📈 scoreHistory.push<br/>🎨 drawChart
    Note over L: ↻ epoch N+1<br/>while scoreHistory.length &lt; epochs
    L->>+G: 🎯 + 📸 + 📋 critique + previous code
    G-->>-L: ♻️ refined svelte + html
    L->>I: re-render
    Note over U: ⏸️ Pause when satisfied
    U->>L: ✨ Type manual feedback
    L->>+G: 👑 forced Gemini 3.1 Pro<br/>🎯 + 📸 + 👤 human text
    G-->>-L: 🏆 final polished code
    L-->>-U: ✅ Done
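
The critique JSON returned by the critic scores five quality dimensions. The dimension names come from GameUIAgent (cited at the end of this README); the surrounding shape below is a sketch, not the exact wire format:

```typescript
// Sketch of the critic's JSON. The five dimension names come from GameUIAgent;
// the issue structure and the 0-10 scale are illustrative assumptions.
interface Critique {
  scores: {
    structural_fidelity: number;   // layout matches the target
    color_consistency: number;
    typography: number;
    spacing_alignment: number;
    visual_completeness: number;
  };
  issues: { element: string; problem: string; suggestion: string }[];
}

// Mean score, e.g. for the per-epoch chart and plateau detection.
function meanScore(c: Critique): number {
  const v = Object.values(c.scores);
  return v.reduce((a, b) => a + b, 0) / v.length;
}
```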

Provider abstraction

flowchart TB
    App[🔁 runRefinement loop<br/><b>main.ts</b>]:::app --> CM{🚦 callModel<br/><b>dispatcher</b>}:::dispatch
    CM -->|🟦 google| CG[📡 callGoogle<br/><b>v1beta/models/:generateContent</b>]:::google
    CM -->|🟧 openrouter| CO[📡 callOpenRouter<br/><b>api/v1/chat/completions</b>]:::or
    CG --> Gemini[💎 Google AI Studio<br/><b>Gemini 3.x / 2.5 / 2.0 / 1.5</b>]:::google
    CO --> OR[🌐 OpenRouter<br/><b>unified gateway</b>]:::or
    OR --> Claude[🤖 Claude 4.x<br/><b>Anthropic</b>]:::anth
    OR --> GPT[🤖 GPT-5.x<br/><b>OpenAI</b>]:::oai
    OR --> Grok[🤖 Grok 4.x<br/><b>xAI · 2M ctx</b>]:::xai
    OR --> Other[🤖 Qwen / Kimi / Nemotron / Llama<br/><b>open weights</b>]:::other

    classDef app      fill:#1e293b,stroke:#fbbf24,stroke-width:2px,color:#fde68a
    classDef dispatch fill:#422006,stroke:#f59e0b,stroke-width:2px,color:#fde68a
    classDef google   fill:#0c4a6e,stroke:#38bdf8,stroke-width:2px,color:#e0f2fe
    classDef or       fill:#7c2d12,stroke:#fb923c,stroke-width:2px,color:#fff7ed
    classDef anth     fill:#451a03,stroke:#d97706,stroke-width:1.5px,color:#fef3c7
    classDef oai      fill:#14532d,stroke:#22c55e,stroke-width:1.5px,color:#dcfce7
    classDef xai      fill:#0f172a,stroke:#64748b,stroke-width:1.5px,color:#e2e8f0
    classDef other    fill:#581c87,stroke:#a855f7,stroke-width:1.5px,color:#f5f3ff
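
The dispatch itself reduces to choosing an endpoint. The paths below follow the diagram (Google AI Studio's v1beta generateContent route, OpenRouter's OpenAI-compatible chat completions); the function is an illustrative sketch, the real clients live in src/api.ts:

```typescript
// Illustrative dispatcher sketch; the real provider clients live in src/api.ts.
type Provider = "google" | "openrouter";

function endpointFor(provider: Provider, model: string): string {
  switch (provider) {
    case "google":      // Google AI Studio, v1beta generateContent
      return `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`;
    case "openrouter":  // OpenAI-compatible chat completions gateway
      return "https://openrouter.ai/api/v1/chat/completions";
  }
}
```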

File save layout per session

flowchart TB
    R[📁 runs/]:::root --> S[🗂️ 20260407_152330_a3b8/<br/><b>session timestamp + random</b>]:::session

    S --> AUTO[🤖 Auto epochs<br/><b>generated by the loop</b>]:::auto
    S --> MAN[👤 Manual epochs<br/><b>human feedback touches</b>]:::manual

    AUTO --> A1[📸 epoch1_render.jpg<br/><b>screenshot for the critic</b>]:::img
    AUTO --> A2[📦 epoch1_component.svelte<br/><b>deliverable</b>]:::code
    AUTO --> A3[📄 epoch1_preview.html<br/><b>standalone preview</b>]:::code
    AUTO --> A4[📋 epoch1_critique.json<br/><b>scores + issues</b>]:::critique
    AUTO --> A5[💬 epoch1_gen_response.txt<br/><b>raw generator output</b>]:::raw
    AUTO --> A6[💬 epoch1_critic_response.txt<br/><b>raw critic output</b>]:::raw
    AUTO --> AN[➕ epochN_*<br/><b>same set per epoch</b>]:::more

    MAN --> M1[✍️ human1_feedback.txt<br/><b>your prompt</b>]:::human
    MAN --> M2[📦 human1_component.svelte<br/><b>polished deliverable</b>]:::code
    MAN --> M3[📄 human1_preview.html<br/><b>polished preview</b>]:::code
    MAN --> M4[📸 human1_render.jpg<br/><b>final screenshot</b>]:::img
    MAN --> MN[➕ humanN_*<br/><b>one set per feedback</b>]:::more

    classDef root     fill:#1e1b4b,stroke:#a78bfa,stroke-width:2px,color:#ede9fe
    classDef session  fill:#312e81,stroke:#818cf8,stroke-width:2px,color:#e0e7ff
    classDef auto     fill:#0c4a6e,stroke:#38bdf8,stroke-width:2px,color:#e0f2fe
    classDef manual   fill:#581c87,stroke:#c084fc,stroke-width:2px,color:#f5f3ff
    classDef img      fill:#134e4a,stroke:#2dd4bf,stroke-width:1.5px,color:#ccfbf1
    classDef code     fill:#14532d,stroke:#4ade80,stroke-width:1.5px,color:#dcfce7
    classDef critique fill:#422006,stroke:#fbbf24,stroke-width:1.5px,color:#fde68a
    classDef raw      fill:#1f2937,stroke:#9ca3af,stroke-width:1.5px,color:#e5e7eb
    classDef human    fill:#4c1d95,stroke:#a78bfa,stroke-width:1.5px,color:#ede9fe
    classDef more     fill:#374151,stroke:#6b7280,stroke-width:1px,color:#d1d5db,stroke-dasharray:4 2
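
The session folder name at the top of the tree is a timestamp plus four random hex characters. A hypothetical helper that produces that shape (not the app's actual code) could look like:

```typescript
// Hypothetical sketch of the YYYYMMDD_HHMMSS_xxxx session folder name.
function sessionId(d: Date = new Date()): string {
  const p = (n: number) => String(n).padStart(2, "0");
  const stamp =
    `${d.getFullYear()}${p(d.getMonth() + 1)}${p(d.getDate())}` +
    `_${p(d.getHours())}${p(d.getMinutes())}${p(d.getSeconds())}`;
  const rand = Math.floor(Math.random() * 0x10000).toString(16).padStart(4, "0");
  return `${stamp}_${rand}`;   // e.g. 20260407_152330_a3b8
}
```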

Quickstart

git clone https://github.com/GeraCollante/game-ui-refiner
cd game-ui-refiner

# Get a free Gemini API key at https://aistudio.google.com/app/apikey
cp .env.example .env
$EDITOR .env  # paste your key into GEMINI_API_KEY=AIza...

python3 serve.py
# → http://localhost:8000

That's it. No npm install for end users — the compiled JS is committed in js/.

In the browser:

  1. The default provider is 🟦 Google Direct, the default preset is 🥇 Smart (Gemini 3.1 Flash-Lite as critic + Gemini 3.1 Pro as generator)
  2. Drop, paste, or click to upload your target image
  3. Set epochs (default 4) and click ▶ Start
  4. Watch the score chart climb across epochs
  5. Click ⏸ Pause when it converges
  6. Optionally type a feedback line and click ✨ Apply feedback for the final touch with Gemini 3.1 Pro
  7. Copy the Svelte from the 📦 Svelte tab into your project

Models supported

| Family | Provider | Models | Vision |
|---|---|---|---|
| Gemini 3.x | Google direct + OpenRouter | 3.1 Pro, 3.1 Flash-Lite, 3 Pro, 3 Flash | ✅ |
| Gemini 2.5 | Google direct + OpenRouter | 2.5 Pro, 2.5 Flash, 2.5 Flash-Lite | ✅ |
| Gemini 1.5/2.0 | Google direct | 2.0 Flash, 1.5 Pro/Flash | ✅ |
| GPT-5.x | OpenRouter | 5.4, 5.4 mini, 5 | ✅ |
| Claude 4.x | OpenRouter | Sonnet 4.6, Opus 4.6, Haiku 4.5 | ✅ |
| Grok 4.x | OpenRouter | 4.20 (2M ctx), 4.20 multi-agent, 4 Fast, 4 | ✅ |
| Qwen 3.5 | OpenRouter | 397B A17B | ✅ |
| Kimi K2.5 | OpenRouter | K2.5 | ✅ |
| Nemotron | OpenRouter | Nano 12B VL (+ :free variant) | ✅ |
| Llama 4 | OpenRouter | Maverick | ✅ |

For up-to-date pricing and benchmarks, see Artificial Analysis.


Presets

| Preset | Critic | Generator | Cost / 4-epoch run | Notes |
|---|---|---|---|---|
| 🥇 Smart (default) | Gemini 3.1 Flash-Lite | Gemini 3.1 Pro | ~$0.04 | Best balance |
| 🥈 Smart-3 | Gemini 3 Flash | Gemini 3.1 Pro | ~$0.05 | If 3.1 Flash-Lite hallucinates |
| ⚡ Speed | Gemini 3.1 Flash-Lite | Gemini 3 Flash | ~$0.02 | Iterate prompts fast |
| 👑 Premium | Gemini 3.1 Pro | Gemini 3.1 Pro | ~$0.10 | Maximum quality |
| 🪨 Stable | Gemini 2.5 Flash | Gemini 2.5 Pro | ~$0.04 | Free tier friendly |
| 🅰️ Anthropic (OR only) | Gemini 3.1 Flash-Lite | Claude Sonnet 4.6 | ~$0.10 | A/B test Claude |
| 🤖 Grok (OR only) | Grok 4 Fast | Grok 4.20 | ~$0.04 | Top non-hallucination + IFBench |
| 🤖 Grok Full (OR only) | Grok 4.20 | Grok 4 | ~$0.12 | Premium Grok stack |
| 🆓 Free | Gemini 2.5 Flash | Gemini 2.5 Flash | $0 | Free tier (15 RPM) |

The Manual Feedback button always uses Gemini 3.1 Pro regardless of the active preset, because for final touch-ups you want the strongest model and cost is irrelevant.


Cost

A typical run with the Smart preset is ~$0.04 for 4 epochs (Google direct pricing). The Premium preset is ~$0.10. The Free preset is ~$0 if you stay within Gemini's 1000 requests/day free tier.

The header shows live cost and wall-clock time. Both reset on each run.


Architecture details for the curious

Why dual output (Svelte + HTML preview)?

Browsers can't render .svelte files natively without a compiler. So the generator emits two blocks per response:

  • A .svelte component (the deliverable)
  • A self-contained HTML preview (visually identical, used for the iframe rendering and the screenshot loop)

The two stay in sync because the same model emits both in one go.
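
Extracting the two blocks from one response can be as simple as the sketch below; the real parseDualOutput in src/parser.ts is more defensive about messy LLM output (truncation, stray prose, malformed fences):

```typescript
// Simplified sketch of pulling the svelte + html blocks out of one response.
// The real parseDualOutput in src/parser.ts handles much messier model output.
const FENCE = "`".repeat(3); // a markdown code fence

function parseDualOutput(text: string): { svelte: string; html: string } | null {
  const grab = (lang: string): string | null => {
    const m = text.match(new RegExp(FENCE + lang + "\\n([\\s\\S]*?)" + FENCE));
    return m ? m[1].trim() : null;
  };
  const svelte = grab("svelte");
  const html = grab("html");
  return svelte !== null && html !== null ? { svelte, html } : null;
}
```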

Why JPEG screenshots instead of PNG?

JPEG quality 0.85 is 5–10× smaller than PNG for typical UI screenshots and visually indistinguishable for fidelity comparison. The savings compound across epochs (each epoch sends 1–2 screenshots in API requests).
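
The capture itself is one call, and the quality argument is the size lever. html2canvas is assumed to be loaded globally as in the app; splitDataUrl is a hypothetical helper for building API payloads:

```typescript
// Sketch: JPEG capture at quality 0.85, then split the data URL for API payloads.
declare const html2canvas: (el: HTMLElement) => Promise<HTMLCanvasElement>;

async function screenshotJpeg(el: HTMLElement): Promise<string> {
  const canvas = await html2canvas(el);          // rasterize the rendered UI
  return canvas.toDataURL("image/jpeg", 0.85);   // 0.85 keeps UI detail, shrinks payload
}

// Hypothetical helper: providers want the raw base64, not the data: prefix.
function splitDataUrl(url: string): { mime: string; base64: string } {
  const m = url.match(/^data:([^;]+);base64,(.*)$/s);
  if (!m) throw new Error("not a base64 data URL");
  return { mime: m[1], base64: m[2] };
}
```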

Why is the Prompts panel "lazy"?

The Prompts panel renders inline base64 image thumbnails, which Firefox decodes and keeps in RAM. Refreshing it on every API call (instead of only when you click the tab) was a Firefox memory hog. Now it's marked dirty and only rebuilt when you actually view it.
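
The pattern is a plain dirty flag (names below are illustrative):

```typescript
// Dirty-flag sketch of the lazy Prompts panel; names are illustrative.
let promptsDirty = false;
let rebuilds = 0;                // stands in for the expensive thumbnail re-render

function markPromptsDirty(): void {
  promptsDirty = true;           // cheap: called on every API request
}

function showPromptsTab(): void {
  if (!promptsDirty) return;     // nothing changed since the last view
  rebuilds++;                    // real app: rebuild the base64 thumbnails here
  promptsDirty = false;
}
```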

Why a static analyzer?

Because we kept getting bitten by these specific bugs:

  1. Stray </script> in JS strings — kills the inline script tag (the HTML parser doesn't understand JS)
  2. Recursive function calls introduced by typos in dispatcher branches
  3. DOM ID references that drift from declared id="..." attributes after refactors

tools/check.mjs parses every compiled js/*.js with acorn, runs Tarjan SCC over the call graph for indirect cycles, and cross-checks DOM IDs declared in HTML against $('foo') references in JS. Run it with npm run check or as part of npm run lint.
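
The DOM-ID crosscheck, reduced to regex level for illustration (the real tools/check.mjs walks acorn ASTs; $('foo') is the app's own DOM accessor):

```typescript
// Regex-level sketch of the DOM-ID crosscheck: report $('id') references in JS
// that have no matching id="..." declaration in the HTML.
function missingDomIds(html: string, js: string): string[] {
  const declared = new Set(
    [...html.matchAll(/\bid="([^"]+)"/g)].map((m) => m[1]),
  );
  const referenced = [...js.matchAll(/\$\('([^']+)'\)/g)].map((m) => m[1]);
  return [...new Set(referenced)].filter((id) => !declared.has(id));
}
```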


Project layout

game-ui-refiner/
├── index.html                    # ~220 lines: just markup, loads ./js/main.js as type=module
├── serve.py                      # ~150 lines: tiny Python server, .env injection, /save endpoint
├── src/                          # TypeScript source (edit these)
│   ├── types.ts                  # interfaces shared across modules
│   ├── state.ts                  # global mutable state
│   ├── config.ts                 # model catalogs, presets, dim colors
│   ├── parser.ts                 # pure functions: parseDualOutput, extractJson, etc
│   ├── api.ts                    # provider clients + message builders + screenshot
│   ├── ui.ts                     # tabs, chart, history, ticker, save, log, dropdowns
│   └── main.ts                   # entry: runRefinement, runFeedbackEpoch, init
├── js/                           # compiled output (committed, no npm install needed)
├── tests/run.mjs                 # 38 plain-Node parser tests (no jest/vitest)
├── tools/
│   ├── check.mjs                 # static analyzer (acorn + Tarjan SCC + DOM crosscheck)
│   ├── lint.sh                   # full pipeline: tsc + check + tests + ruff
│   └── README.md
├── .github/workflows/check.yml   # CI: lint + tests + serve.py smoke test
├── .env.example
├── tsconfig.json
├── package.json
├── README.md
├── CONTRIBUTING.md
├── CHANGELOG.md
└── LICENSE

Development

# Install dev deps (TypeScript + acorn for the analyzer)
npm install
cd tools && npm install && cd ..

# Edit src/*.ts, then build
npm run build      # one-shot
npm run watch      # auto-recompile

# Run all checks
npm run lint

The full pipeline that CI runs is in tools/lint.sh:

  1. npx tsc --noEmit — type-check
  2. node tools/check.mjs — static analyzer
  3. node tests/run.mjs — parser unit tests
  4. ruff check serve.py — Python lint

Roadmap

  • Smoke tests with headless Chromium (Playwright) to detect runtime regressions
  • Multi-target batch mode: process N images in series, save to one session folder
  • Export entire run as a self-contained zip (target + all epochs + final code + critique JSONs)
  • Output target formats beyond Svelte: React component, Vue SFC, plain HTML+CSS
  • Visual diff tab between any two epochs
  • More providers: Anthropic Direct, AWS Bedrock, Azure OpenAI
  • Configurable critic schema (currently hardcoded to GameUIAgent's 5 dimensions)


Related work

This project would not exist without the following papers. None of their code is reused — the inspiration is conceptual, not literal.

  • GameUIAgent (arXiv:2603.14724) β€” An LLM-Powered Framework for Automated Game UI Design with Structured Intermediate Representation (2026). Source of the 6-stage pipeline pattern and the 5 quality dimensions used by the critic (structural_fidelity, color_consistency, typography, spacing_alignment, visual_completeness). Also the inspiration for the Reflection Controller pattern.
  • UI2Code^N (arXiv:2511.08195) β€” A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation (Tsinghua/Zhipu, 2025). Validates the iterative drafting β†’ polishing paradigm we apply across epochs. Their 9B model establishes that loop-based refinement matters more than raw model size.
  • VisRefiner (arXiv:2602.05998) β€” Learning from Visual Differences for Screenshot-to-Code Generation (CAS, 2026). The conceptual basis for the diff-driven critique format and the truncation detection in the parser.
  • AutoGameUI (arXiv:2411.03709) β€” Constructing High-Fidelity GameUI via Multimodal Correspondence Matching (Tencent TiMi, 2026). The clearest articulation of why UI design and UX function should be separated, which inspired the critic/generator split here.
  • Design2Code (arXiv:2403.03163) β€” Benchmarking Multimodal Code Generation for Automated Front-End Engineering (Stanford, NAACL 2025). The benchmark methodology for evaluating image-to-code fidelity and the source of our intuition that vision benchmarks lie about real performance.
  • Reflexion (arXiv:2303.11366) and Self-Refine (arXiv:2303.17651) β€” Foundational papers on LLM self-critique and iterative refinement loops.

License

MIT — see LICENSE.

Acknowledgments

Built in a single conversation with Claude Code (Opus 4.6 with 1M context). The TypeScript split, the analyzer, the test suite, the README, and most of the system prompt engineering were iterated end-to-end with the model in the loop. The architectural decisions (split critic/generator, lazy DOM, JPEG screenshots, Tarjan-based recursion detection, parser strategies for messy LLM output) emerged from real failures during development.

Thanks to the authors of all the papers cited above. Special mention to GameUIAgent as the conceptual seed.
