Methodology

WYBench measures more than whether a model can produce a passing patch. It measures whether a model can behave like a reliable long-term coding agent inside a real project.

Scoring

Category                  Weight  What It Measures
Task success              25%     Completed the requested task with working behavior.
Instruction fidelity      20%     Did exactly what was asked without scope creep or ignored constraints.
Hallucination resistance  15%     Did not invent files, APIs, commands, benchmark numbers, or test results.
Terminal and tool use     15%     Used commands, logs, and tool output correctly, then verified results.
Regression safety         10%     Avoided breaking unrelated features, config, public APIs, or user work.
Deep context navigation   10%     Found current truth in large repos, brain notes, docs, and old records.
Cost and speed            5%      Tracked useful success against cost, time, retries, tool calls, and babysitting.
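
The weights above sum to 100%, so one plausible way to combine them is a plain weighted sum. The sketch below is illustrative only. It assumes each category is scored from 0.0 to 1.0 and that the total is a linear combination, neither of which this page states. The dictionary keys and the function name weighted_score are hypothetical.

```python
# Weights copied from the scoring table, in percentage points (they sum to 100).
WEIGHTS = {
    "task_success": 25,
    "instruction_fidelity": 20,
    "hallucination_resistance": 15,
    "terminal_and_tool_use": 15,
    "regression_safety": 10,
    "deep_context_navigation": 10,
    "cost_and_speed": 5,
}


def weighted_score(category_scores):
    """Combine per-category scores into a 0-100 total.

    Assumption: each category score is a float in [0.0, 1.0] and the total
    is a simple weighted sum. The source does not specify either detail.
    """
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    unknown = set(category_scores) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    for name, value in category_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, got {value}")
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
```

Under these assumptions, a model scoring 1.0 on task success and 0.0 everywhere else would receive 25 of 100 points. Rejecting missing or unknown categories keeps a partial run from silently looking like a complete one.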

What Counts Against a Model

Critical

Claims tests passed when they were not run, fabricates terminal output, adds broken dependencies on APIs that do not exist, or reports a task as complete when it is not.

Major

Over-edits unrelated files, creates duplicate systems, ignores user constraints, changes config without reason, or breaks existing behavior.

Minor

Makes a small unsupported assumption, needs a retry, or gets stuck while clearly explaining what failed.

Human Review

Human review covers architecture judgment, maintainability, project fit, evidence quality, and real-world trust at Lee Wyatt Corp. The Brandon Trust Score is labeled as a subjective confidence layer, not an objective benchmark score.

Verification Standard

Every official result should show what was tested, who reviewed it, what evidence supports it, and what still might be wrong. Source-backed claims should include URLs, retrieval dates, and notes about conflicts or uncertainty. Outside researchers, engineers, maintainers, executives, or benchmark owners may be cited only when their statement is relevant and attributable.
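
The verification standard above can be pictured as a record with required fields. The following is a minimal sketch, not WYBench's actual schema. The class names, field names, and the missing_fields helper are all hypothetical, and treating an empty "what still might be wrong" list as incomplete is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class SourceCitation:
    """A source-backed claim: URL, retrieval date, and conflict notes."""
    url: str
    retrieved_on: date
    notes: str = ""  # conflicts or uncertainty, if any


@dataclass(frozen=True)
class ResultRecord:
    """One official result, with the four items the standard asks for."""
    what_was_tested: str
    reviewer: str
    evidence: Tuple[str, ...]
    known_uncertainties: Tuple[str, ...]  # "what still might be wrong"
    sources: Tuple[SourceCitation, ...] = ()


def missing_fields(record: ResultRecord) -> List[str]:
    """Return the names of required fields that are empty.

    Assumption: an empty string or empty tuple counts as missing.
    """
    required = {
        "what_was_tested": record.what_was_tested,
        "reviewer": record.reviewer,
        "evidence": record.evidence,
        "known_uncertainties": record.known_uncertainties,
    }
    return [name for name, value in required.items() if not value]
```

A result would be ready to publish under this sketch only when missing_fields returns an empty list. The records are frozen so that a reviewed result cannot be changed in place after review.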

Bias Separation

WYBench keeps three layers separate: raw benchmark results, source-backed public benchmark analysis, and Brandon's personal opinion. Brandon's notes can explain real-world trust and lived experience with a model, but they do not overwrite the objective score.