alpha web-app · learning v0.1.0

LLM Benchmark Platform

A deployed benchmark platform for running and inspecting role-based coding-agent evaluations across OpenCode Go models and tracking results over time.

Live

llm-benchmark.hectorsanchez.eu

What it does

The platform discovers benchmark task definitions from the repository, lists the configured OpenCode Go models, launches background benchmark jobs, and persists run state as JSON so progress and results can be inspected later.
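As a rough illustration of JSON-backed run state, the sketch below writes one file per run and replaces it atomically, so a reader polling for progress never sees a half-written file. Everything in it is an assumption made for the example, not the platform's actual code: the `save_run`, `load_run`, and `new_run` helpers, the one-file-per-run layout, and the field names (`id`, `task`, `model`, `status`, `result`) are all hypothetical.

```python
import json
import os
import tempfile
import uuid
from pathlib import Path


def new_run(task: str, model: str) -> dict:
    """Build a fresh run record (hypothetical schema, for illustration only)."""
    return {
        "id": uuid.uuid4().hex,
        "task": task,
        "model": model,
        "status": "queued",
        "result": None,
    }


def save_run(runs_dir: Path, run: dict) -> Path:
    """Persist one run as <runs_dir>/<id>.json using write-then-rename."""
    runs_dir.mkdir(parents=True, exist_ok=True)
    target = runs_dir / f"{run['id']}.json"
    # Write to a temp file in the same directory, then swap it into place.
    # os.replace is atomic on the same filesystem, so readers see either the
    # old state or the new state, never a partial file.
    fd, tmp_name = tempfile.mkstemp(dir=runs_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(run, fh, indent=2)
    os.replace(tmp_name, target)
    return target


def load_run(runs_dir: Path, run_id: str) -> dict:
    """Read a run's last persisted state back from disk."""
    path = runs_dir / f"{run_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

A background job would call `save_run` after each status change, and an API handler would call `load_run` to report progress.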

The first live version ships with one runnable backend task: a Flask URL shortener API evaluated with deterministic tests and linting.

Stack

  • FastAPI for task/model/run APIs.
  • Static HTML and vanilla JavaScript for the UI.
  • Pi as the coding-agent harness.
  • OpenCode Go as the model provider.
  • systemd and nginx on the Hetzner VPS for runtime and routing.

Status

Alpha. The app is live and lists the available tasks and models. Background run state is currently stored as JSON files and will move to SQLite once the task catalog grows.

Roadmap

  • Add authentication before exposing expensive benchmark execution broadly.
  • Add more runnable task workspaces across roles.
  • Add richer rankings, run comparisons, and trace viewers.
  • Move run storage from JSON files to SQLite.
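The planned move from JSON files to SQLite could look roughly like the sketch below, which imports existing run files into a single table using only the standard library. It assumes one JSON file per run with an `id` field. The `import_runs` function, the `runs` table, and its columns are hypothetical names chosen for the example; the real schema may differ.

```python
import json
import sqlite3
from pathlib import Path


def import_runs(runs_dir: Path, db_path: str) -> int:
    """Copy every *.json run file into a SQLite table; return the number imported."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs ("
            "id TEXT PRIMARY KEY, task TEXT, model TEXT, status TEXT, "
            "payload TEXT NOT NULL)"
        )
        count = 0
        # The connection context manager commits on success and rolls back
        # if any file fails to parse, so the import is all-or-nothing.
        with conn:
            for path in sorted(runs_dir.glob("*.json")):
                run = json.loads(path.read_text(encoding="utf-8"))
                # Keep the full record in `payload` so no field is lost;
                # the other columns exist only to make filtering cheap.
                conn.execute(
                    "INSERT OR REPLACE INTO runs "
                    "(id, task, model, status, payload) VALUES (?, ?, ?, ?, ?)",
                    (
                        run["id"],
                        run.get("task"),
                        run.get("model"),
                        run.get("status"),
                        json.dumps(run),
                    ),
                )
                count += 1
        return count
    finally:
        conn.close()
```

Because the insert uses `INSERT OR REPLACE` keyed on `id`, running the import twice leaves one row per run rather than duplicates.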

Stack

python · fastapi · vanilla-js · pi · opencode-go · nginx · systemd · hetzner

Auth

No auth — open access.