WYBench by Lee Wyatt Corp
A real-world benchmark for long-term agentic coding. WYBench measures whether a model can work inside messy real projects, follow instructions over time, use tools honestly, and finish without drifting.
Most coding benchmarks reward short patches or isolated task completion. WYBench focuses on long-horizon agent behavior: reading the right files, preserving project rules, using terminal output honestly, avoiding fake success, and finishing the work without breaking unrelated systems.
WYBench is powered by WyCode, Lee Wyatt Corp's AI coding agent desktop app (coming soon). Instead of answering synthetic prompts, every model runs inside WyCode on real, long coding sessions and is scored silently in the background.
Multi-file work across frontend, backend, tests, docs, and project rules.
Finding current truth in large repos, docs, memory folders, logs, and old notes.
Running commands, reading errors, recovering from failure, and verifying honestly.
Following saved project decisions across sessions without reviving obsolete ideas.
Doing exactly what was asked without extra features, rewrites, or fake shortcuts.
Solving the task without damaging adjacent behavior, config, APIs, or user work.
WYBench separates objective task results from Lee Wyatt Corp's real-world trust assessment. A model can score well on a task set and still be marked harness-dependent, overrated by public benchmarks, or unsafe for major autonomous work if real use shows reliability problems.
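One way to picture that separation is a report that stores the task score and the trust flags as independent fields, so a flag never changes the score and a high score never clears a flag. The sketch below is illustrative only: `ModelReport`, `TrustFlag`, and every field name are hypothetical and not taken from WYBench or WyCode; only the three flag labels come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class TrustFlag(Enum):
    # Hypothetical enum; the three labels mirror the ones named in the text.
    HARNESS_DEPENDENT = "harness-dependent"
    OVERRATED_BY_PUBLIC_BENCHMARKS = "overrated by public benchmarks"
    UNSAFE_FOR_MAJOR_AUTONOMOUS_WORK = "unsafe for major autonomous work"


@dataclass
class ModelReport:
    """Hypothetical report shape: objective results and trust kept apart."""

    model: str
    tasks_passed: int
    tasks_total: int
    trust_flags: set[TrustFlag] = field(default_factory=set)

    @property
    def task_score(self) -> float:
        """Fraction of tasks passed; trust flags never enter this number."""
        if self.tasks_total == 0:
            return 0.0
        return self.tasks_passed / self.tasks_total

    @property
    def cleared_for_major_autonomous_work(self) -> bool:
        """Depends only on the trust flags, not on the task score."""
        return TrustFlag.UNSAFE_FOR_MAJOR_AUTONOMOUS_WORK not in self.trust_flags


# A model that scores well on the task set but is still flagged from real use.
report = ModelReport(
    model="example-model",
    tasks_passed=18,
    tasks_total=20,
    trust_flags={
        TrustFlag.HARNESS_DEPENDENT,
        TrustFlag.UNSAFE_FOR_MAJOR_AUTONOMOUS_WORK,
    },
)
print(f"{report.model}: score={report.task_score:.2f}, "
      f"cleared={report.cleared_for_major_autonomous_work}")
```

Keeping the two properties independent means a reader can always see both facts side by side instead of a single blended number.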
WYBench results should be backed by prompts, logs, diffs, tests, source links, reviewer notes, and verification status. Public benchmark data and expert claims should be cited, dated, and separated from Brandon's personal opinion so the platform stays honest and unbiased.
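As a rough illustration of that evidence requirement, a result could be held back until every kind of supporting material is attached and a reviewer has verified it. This is a minimal sketch under assumed names: `ResultRecord`, `REQUIRED_EVIDENCE`, `missing_evidence`, and `is_publishable` are hypothetical and do not describe WYBench's actual storage format or review process.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Evidence kinds listed in the text; the identifier spellings are assumptions.
REQUIRED_EVIDENCE = (
    "prompts",
    "logs",
    "diffs",
    "tests",
    "source_links",
    "reviewer_notes",
)


@dataclass
class ResultRecord:
    """Hypothetical record: one benchmark result plus its supporting material."""

    task_id: str
    # Maps an evidence kind to the attached items (file paths, URLs, notes).
    evidence: dict[str, list[str]] = field(default_factory=dict)
    verified: bool = False  # verification status set by a reviewer


def missing_evidence(record: ResultRecord) -> list[str]:
    """Return the required evidence kinds that are absent or empty."""
    return [kind for kind in REQUIRED_EVIDENCE if not record.evidence.get(kind)]


def is_publishable(record: ResultRecord) -> bool:
    """A result counts only when it is verified and fully backed by evidence."""
    return record.verified and not missing_evidence(record)


# A verified record that still lacks most of its evidence.
record = ResultRecord(
    task_id="example-task",
    evidence={"logs": ["session.log"], "diffs": ["change.diff"]},
    verified=True,
)
print(missing_evidence(record))
print(is_publishable(record))
```

Listing the missing kinds by name, rather than returning a bare pass or fail, makes it clear what a reviewer still has to attach.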