WYBench by Lee Wyatt Corp

A real-world benchmark for long-term agentic coding. WYBench measures whether a model can work inside messy real projects, follow instructions over time, use tools honestly, and finish without drifting.

Why WYBench Exists

Most coding benchmarks reward short patches or isolated task completion. WYBench focuses on long-horizon agent behavior: reading the right files, preserving project rules, reporting terminal output honestly, avoiding fake success, and finishing the work without breaking unrelated systems.

Top Models Snapshot

No verified WYBench model runs are published yet.
When Lee Wyatt Corp publishes verified runs, this section will show the top models by WYBench score, context score, tool score, hallucination resistance, cost, time, and Brandon Trust Score.

Scoring Engine Coming Soon

Real sessions. Automated scores.

WYBench is powered by WyCode, Lee Wyatt Corp's AI coding agent desktop app (coming soon). Instead of answering synthetic prompts, every model runs inside WyCode on real, long coding sessions and is scored silently in the background.

▸WyCode detects which model is active. Switch models mid-session and the scoring context switches with you; each model gets its own window.
▸Partial sessions score proportionally. A model used for one minute earns a partial score for that window. A full score requires completing the full session period.
▸Fully automated. The developer works naturally inside WyCode. At the end of the period, WYBench computes how each model performed, with no manual grading and no cherry-picked prompts.
▸Real-world, real coders. Scores come from actual long-session agentic coding, the kind of work that shows whether a model can stay on task, follow rules, and finish.
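
The proportional rule for partial sessions could be sketched as follows. This is a minimal illustration, not WyCode's scoring engine: the `ModelWindow` type, the `raw_score` field (assumed to lie in [0, 1]), and the linear time weighting are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ModelWindow:
    """One contiguous stretch of a session handled by a single model (hypothetical type)."""
    model: str
    start_s: float
    end_s: float
    raw_score: float  # assumed quality score in [0, 1] for this window


def proportional_scores(windows, session_length_s):
    """Weight each window's score by the fraction of the session it covered."""
    if session_length_s <= 0:
        raise ValueError("session_length_s must be positive")
    totals = {}
    for w in windows:
        duration = max(0.0, w.end_s - w.start_s)
        weight = min(duration / session_length_s, 1.0)
        # A model that returns later in the session accumulates across its windows.
        totals[w.model] = totals.get(w.model, 0.0) + w.raw_score * weight
    return totals


# Hypothetical one-hour session: model-a is active for one minute, model-b for the rest.
windows = [
    ModelWindow("model-a", 0, 60, 1.0),
    ModelWindow("model-b", 60, 3600, 0.5),
]
scores = proportional_scores(windows, session_length_s=3600)
```

Under these assumptions, a model used for one minute of a one-hour session can earn at most one sixtieth of a full score, however well it performed in that minute.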

Benchmark Categories

Long-Horizon Coding

Multi-file work across frontend, backend, tests, docs, and project rules.

Deep Context Navigation

Finding current truth in large repos, docs, memory folders, logs, and old notes.

Terminal Execution

Running commands, reading errors, recovering from failure, and verifying honestly.

Multi-Session Memory

Following saved project decisions across sessions without reviving obsolete ideas.

Instruction Fidelity

Doing exactly what was asked without extra features, rewrites, or fake shortcuts.

Regression Safety

Solving the task without damaging adjacent behavior, config, APIs, or user work.

Numbers vs Brandon

WYBench separates objective task results from Lee Wyatt Corp's real-world trust assessment. A model can score well on a task set and still be marked harness-dependent, overrated by public benchmarks, or unsafe for major autonomous work if real use shows reliability problems.

Evidence First

WYBench results should be backed by prompts, logs, diffs, tests, source links, reviewer notes, and verification status. Public benchmark data and expert claims should be cited, dated, and separated from Brandon's personal opinion so the platform stays honest and unbiased.
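
One way to picture the evidence requirement is a record that lists what a result still lacks before it can be treated as backed. The sketch below is hypothetical: `EvidenceRecord`, its field names, and `missing_evidence` are illustrative and do not describe a published WYBench schema.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceRecord:
    """Evidence attached to one benchmark result (hypothetical schema)."""
    prompts: list = field(default_factory=list)
    logs: list = field(default_factory=list)
    diffs: list = field(default_factory=list)
    tests: list = field(default_factory=list)
    source_links: list = field(default_factory=list)
    reviewer_notes: list = field(default_factory=list)
    verification_status: str = ""  # empty until a reviewer sets it


def missing_evidence(record):
    """Return the names of evidence fields that are still empty, in field order."""
    return [name for name, value in vars(record).items() if not value]


# Illustrative record with prompts, logs, and diffs attached but nothing else yet.
record = EvidenceRecord(
    prompts=["Fix the failing login test"],
    logs=["session.log"],
    diffs=["change.diff"],
)
missing = missing_evidence(record)
```

A result would count as fully backed in this sketch only when `missing_evidence` returns an empty list.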