LLM Benchmark Platform
A deployed platform for running and inspecting role-based coding-agent benchmarks across OpenCode Go models, with results tracked over time.
Live
llm-benchmark.hectorsanchez.eu
What it does
The platform discovers benchmark task definitions from the repository, lists the configured OpenCode Go models, launches background benchmark jobs, and persists run state as JSON so progress and results can be inspected later.
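As a rough illustration of the discovery and persistence steps, here is a minimal sketch using only the standard library. The layout is an assumption, not the platform's documented format: each task is taken to be a directory holding a `task.json` file, and each run is written to `<run_id>.json`. The names `discover_tasks`, `save_run`, and `load_run` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def discover_tasks(tasks_root: Path) -> list[dict]:
    """Return every task definition found under tasks_root.

    Assumption: each task lives in its own directory with a `task.json` file.
    """
    tasks = []
    for definition in sorted(tasks_root.glob("*/task.json")):
        task = json.loads(definition.read_text(encoding="utf-8"))
        # Fall back to the directory name when the file carries no explicit id.
        task.setdefault("id", definition.parent.name)
        tasks.append(task)
    return tasks


def save_run(runs_dir: Path, run: dict) -> Path:
    """Persist one run's state as JSON, replacing any earlier snapshot."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    target = runs_dir / f"{run['run_id']}.json"
    tmp = target.with_name(target.name + ".tmp")
    tmp.write_text(json.dumps(run, indent=2), encoding="utf-8")
    # Rename over the old file so readers never see a half-written snapshot.
    tmp.replace(target)
    return target


def load_run(runs_dir: Path, run_id: str) -> dict:
    """Read a run's last saved state back from disk."""
    return json.loads((runs_dir / f"{run_id}.json").read_text(encoding="utf-8"))
```

A background job could call `save_run` after each state change (for example queued, running, finished), so that a later request only needs `load_run` to report progress.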
The first live version ships with one runnable backend task: a Flask URL shortener API evaluated with deterministic tests and linting.
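Deterministic evaluation of this kind can be sketched as a set of commands run inside the task workspace, where exit code 0 counts as a pass. The sketch below is an assumption about the general shape, not the platform's actual evaluator: `CheckResult`, `run_check`, and `evaluate` are hypothetical names, and `pytest` and `ruff` in `EXAMPLE_CHECKS` stand in for whatever test runner and linter the task really uses.

```python
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_check(name: str, command: list[str], workspace: Path, timeout: int = 300) -> CheckResult:
    """Run one command in the workspace; exit code 0 counts as a pass."""
    try:
        proc = subprocess.run(
            command, cwd=workspace, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        # A hung or missing tool is recorded as a failed check, not a crash.
        return CheckResult(name, False, str(exc))
    return CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr)


def evaluate(workspace: Path, checks: dict[str, list[str]]) -> dict:
    """Run every check and summarise the outcome as a JSON-friendly dict."""
    results = [run_check(name, cmd, workspace) for name, cmd in checks.items()]
    return {
        "passed": all(r.passed for r in results),
        "checks": {r.name: r.passed for r in results},
    }


# Hypothetical check set; the real commands are not documented here.
EXAMPLE_CHECKS = {
    "tests": [sys.executable, "-m", "pytest", "-q"],
    "lint": [sys.executable, "-m", "ruff", "check", "."],
}
```

Because every check is a plain command with a pass/fail exit code, the same workspace gives the same verdict on every run, which is what makes results comparable across models.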
Stack
- FastAPI for task/model/run APIs.
- Static HTML and vanilla JavaScript for the UI.
- Pi as the coding-agent harness.
- OpenCode Go as the model provider.
- systemd and nginx on the Hetzner VPS for runtime and routing.
Status
Alpha. The app is live and lists the available tasks and models. Run state for background jobs is stored as JSON files and will move to SQLite once the task catalog grows.
Roadmap
- Add authentication before exposing expensive benchmark execution broadly.
- Add more runnable task workspaces across roles.
- Add richer rankings, run comparisons, and trace viewers.
- Move run storage from JSON files to SQLite.
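The JSON-to-SQLite move in the last item could look roughly like the sketch below. It is only an illustration under assumptions: run files are taken to be named `<run_id>.json` with an optional `status` field, and the `runs` table schema and the `migrate_runs` function are hypothetical, not the project's planned design.

```python
import json
import sqlite3
from pathlib import Path


def migrate_runs(runs_dir: Path, db_path: Path) -> int:
    """Copy every `<run_id>.json` file into a `runs` table; return rows written."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS runs (
                run_id  TEXT PRIMARY KEY,
                status  TEXT,
                payload TEXT NOT NULL  -- full run document as JSON text
            )
            """
        )
        count = 0
        with conn:  # one transaction: commit on success, roll back on error
            for path in sorted(runs_dir.glob("*.json")):
                run = json.loads(path.read_text(encoding="utf-8"))
                conn.execute(
                    "INSERT OR REPLACE INTO runs (run_id, status, payload) "
                    "VALUES (?, ?, ?)",
                    (run.get("run_id", path.stem), run.get("status"), json.dumps(run)),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Keeping the full document in `payload` means no field is lost in the move, while `status` is pulled into its own column so runs can be filtered without parsing JSON. `INSERT OR REPLACE` makes the migration safe to run more than once.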
Auth
None. The deployment is currently open access.