Agentic Coding
Covers DeepSWE, Artificial Analysis coding-agent views, Terminal-Bench, mini-swe-agent-style harness results, and related tool-use benchmarks.
WYBench will track public benchmark sources, what they measure, what they miss, who supports or disputes them, and whether Lee Wyatt Corp considers them useful, incomplete, disputed, or not trusted.
| Benchmark name | URL | Last updated | What it measures | What it misses | Trusted level | Verifier / citation | Lee Wyatt Corp notes |
|---|---|---|---|---|---|---|---|
| Awaiting reviewed benchmark sources. | | | | | | | |
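
One way to keep future rows consistent with the table header is to model each row as a small record. The following is a minimal sketch, not WYBench's actual schema: the names `TrustLevel` and `BenchmarkEntry`, the field names, the date format, and the sample values are all hypothetical. Only the column headings and the four trust labels come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class TrustLevel(Enum):
    # The four labels named in the tracking description.
    USEFUL = "useful"
    INCOMPLETE = "incomplete"
    DISPUTED = "disputed"
    NOT_TRUSTED = "not trusted"


@dataclass
class BenchmarkEntry:
    # One field per table column (field names are illustrative).
    name: str
    url: str
    last_updated: str  # assumed to be an ISO date string such as "2025-01-31"
    measures: str
    misses: str
    trust: TrustLevel
    citation: str
    notes: str = ""

    def to_markdown_row(self) -> str:
        cells = [
            self.name,
            self.url,
            self.last_updated,
            self.measures,
            self.misses,
            self.trust.value,
            self.citation,
            self.notes,
        ]
        # Escape pipes so cell text cannot break the table layout.
        escaped = [cell.replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(escaped) + " |"


# Placeholder data only; this is not a reviewed benchmark source.
sample = BenchmarkEntry(
    name="Example-Bench",
    url="https://example.com/bench",
    last_updated="2025-01-31",
    measures="placeholder",
    misses="placeholder",
    trust=TrustLevel.INCOMPLETE,
    citation="placeholder",
    notes="placeholder",
)
print(sample.to_markdown_row())
```

Using an enum for the trust level restricts entries to the four stated labels, so a typo such as "untrusted" fails at construction time rather than appearing in the published table.
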
Benchmarks that test deep context, repo navigation, memory retrieval, and document-grounded reasoning under realistic pressure.
SWE-bench variants can still be reviewed, with concerns stated clearly, when saturation, leakage, harness gaming, or real-world mismatch makes them less useful.
External data should be traceable to the original benchmark, paper, maintainer, company, or public statement. If an expert or CEO vouches for a result, WYBench should show who said it, when they said it, and whether they have a conflict of interest.
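
The traceability rule above can be checked mechanically before a vouched result is displayed. This is a minimal sketch under stated assumptions: `Endorsement`, its field names, and `missing_provenance` are hypothetical and do not describe an existing WYBench API. The four checks correspond to the original source, who said it, when they said it, and whether a conflict of interest has been assessed.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Endorsement:
    # All names here are illustrative, not part of any real schema.
    speaker: Optional[str] = None        # who said it
    said_on: Optional[date] = None       # when they said it
    source_url: Optional[str] = None     # link to the original statement
    has_conflict: Optional[bool] = None  # None means "not yet assessed"


def missing_provenance(e: Endorsement) -> List[str]:
    """Return the names of provenance fields that are still unfilled."""
    missing = []
    if not e.speaker:
        missing.append("speaker")
    if e.said_on is None:
        missing.append("said_on")
    if not e.source_url:
        missing.append("source_url")
    # False is a valid answer ("no conflict"), so only None counts as missing.
    if e.has_conflict is None:
        missing.append("has_conflict")
    return missing


# Placeholder values only; no real person or statement is referenced.
partial = Endorsement(speaker="Example Speaker", said_on=date(2025, 1, 31))
print(missing_provenance(partial))
```

Keeping `has_conflict` as a three-state value separates "assessed, no conflict" from "nobody has checked yet", which is the distinction the rule asks the listing to make visible.
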