External Benchmarks

WYBench will track public benchmark sources: what each one measures, what it misses, who supports or disputes it, and whether Lee Wyatt Corp considers it useful, incomplete, disputed, or not trusted.

No external benchmark source reviews are published yet.
Reviews will be added only with current URLs, dates, scope notes, limitations, and explicit trust levels. WYBench will not import leaderboard claims without source review.
Benchmark name | URL | Last updated | What it measures | What it misses | Trust level | Verifier / citation | Lee Wyatt Corp notes
Awaiting reviewed benchmark sources.
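
One way to picture a single row of this table is as a small record that cannot be published until the required fields are filled in. The sketch below is illustrative only: the names `SourceReview`, `TrustLevel`, `missing_fields`, and `is_publishable` are hypothetical and not part of WYBench, and the example URL and date are placeholders. It assumes the required fields are the five named above (URL, date, scope, limitations, trust level) and that the trust levels are the four listed in the introduction.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class TrustLevel(Enum):
    # Assumed to mirror the four outcomes named in the introduction.
    USEFUL = "useful"
    INCOMPLETE = "incomplete"
    DISPUTED = "disputed"
    NOT_TRUSTED = "not trusted"


@dataclass
class SourceReview:
    # Hypothetical record; one field per table column.
    benchmark_name: str
    url: str
    last_updated: Optional[date]
    what_it_measures: str
    what_it_misses: str
    trust_level: Optional[TrustLevel]
    verifier_or_citation: str = ""
    notes: str = ""

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        required = {
            "url": self.url.strip(),
            "last_updated": self.last_updated,
            "what_it_measures": self.what_it_measures.strip(),
            "what_it_misses": self.what_it_misses.strip(),
            "trust_level": self.trust_level,
        }
        return [name for name, value in required.items() if value in (None, "")]

    def is_publishable(self) -> bool:
        """A review is publishable only when no required field is missing."""
        return not self.missing_fields()


# Placeholder data, not a real benchmark review.
draft = SourceReview(
    benchmark_name="Example Benchmark",
    url="https://example.com/benchmark",
    last_updated=date(2025, 1, 1),
    what_it_measures="Placeholder scope note",
    what_it_misses="Placeholder limitation",
    trust_level=None,  # no explicit trust level assigned yet
)
print(draft.missing_fields())  # ['trust_level']
```

Keeping the trust level as an enum rather than free text means a row cannot carry a rating outside the four agreed values.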

Sources To Review

Agentic Coding

DeepSWE, Artificial Analysis coding-agent views, Terminal-Bench, mini-swe-agent-style harness results, and related tool-use benchmarks.

Long Context

Benchmarks that test deep context, repo navigation, memory retrieval, and document-grounded reasoning under realistic pressure.

Disputed Sources

SWE-bench variants can be reviewed with the concerns stated explicitly when saturation, leakage, harness gaming, or real-world mismatch makes them less useful.

Source Rules

External data should be traceable to the original benchmark, paper, maintainer, company, or public statement. If an expert or CEO vouches for a result, WYBench should show who said it, when they said it, and whether they have a conflict of interest.
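
The attribution rule above could be captured as a small record plus a display helper. This is a minimal sketch under stated assumptions: `Endorsement` and `attribution_line` are hypothetical names, the three-state conflict flag (yes, no, not yet assessed) is an assumption rather than a documented WYBench policy, and the speaker, date, and URL are placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Endorsement:
    # Hypothetical record of someone vouching for a result.
    speaker: str        # who said it
    said_on: date       # when they said it
    origin: str         # where the statement can be traced (URL or citation)
    conflict_of_interest: Optional[bool] = None  # None means not yet assessed
    conflict_note: str = ""


def attribution_line(e: Endorsement) -> str:
    """Format who said it, when, and the conflict-of-interest status."""
    if e.conflict_of_interest is None:
        conflict = "not yet assessed"
    elif e.conflict_of_interest:
        conflict = "yes" + (f" ({e.conflict_note})" if e.conflict_note else "")
    else:
        conflict = "none identified"
    return f"{e.speaker}, {e.said_on.isoformat()}, conflict of interest: {conflict}"


# Placeholder data, not a real statement.
example = Endorsement(
    speaker="Example CEO",
    said_on=date(2025, 1, 15),
    origin="https://example.com/statement",
    conflict_of_interest=True,
    conflict_note="leads the company that built the model",
)
print(attribution_line(example))
# Example CEO, 2025-01-15, conflict of interest: yes (leads the company that built the model)
```

Treating "not yet assessed" as its own state keeps an unchecked endorsement from being displayed as if it were conflict-free.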