Tasks

WYBench task packs are built to expose agent reliability over long-running work, not just one-shot patching ability.

Deep Repo Navigation

Find the correct files in a large project and add a setting without creating a parallel system.

  • Fails for duplicate systems.
  • Fails for invented paths.
  • Fails for editing the wrong module.
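The invented-path failure is one that could be checked mechanically by comparing every path a patch refers to against the working tree. The sketch below is only an illustration: `find_invented_paths` and its `declared_new` parameter are hypothetical names, not part of WYBench.

```python
from pathlib import Path


def find_invented_paths(repo_root, referenced_paths, declared_new=()):
    """Return referenced paths that do not exist under repo_root.

    Paths listed in declared_new are skipped, since a task may legitimately
    create new files. All names here are assumptions made for this sketch.
    """
    root = Path(repo_root)
    allowed = set(declared_new)
    return [
        p for p in referenced_paths
        if p not in allowed and not (root / p).exists()
    ]
```

An empty result only rules out paths that were never there. It does not show that the edit landed in the correct module or that no parallel system was created.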

Obsidian Brain / Memory Following

Read current rules, old notes, deprecated plans, and active decisions, then implement from current truth.

  • Fails for using deprecated notes.
  • Fails for contradicting saved decisions.
  • Fails for ignoring the hierarchy among those sources.
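A minimal sketch of what "implement from current truth" could mean in code: drop deprecated notes, then let the higher-ranked source win for each topic. The `Note` fields, the `kind` values, and the ordering in `PRIORITY` are all assumptions; the task description says a hierarchy exists but does not define it.

```python
from dataclasses import dataclass

# Hypothetical ranking: a lower number outranks a higher one.
PRIORITY = {"rule": 0, "decision": 1, "plan": 2, "note": 3}


@dataclass(frozen=True)
class Note:
    topic: str
    kind: str  # assumed to be one of PRIORITY's keys
    text: str
    deprecated: bool = False


def current_truth(notes):
    """Pick, per topic, the highest-ranked note that is not deprecated."""
    best = {}
    for note in notes:
        if note.deprecated:
            continue  # deprecated material never counts as current
        held = best.get(note.topic)
        if held is None or PRIORITY[note.kind] < PRIORITY[held.kind]:
            best[note.topic] = note
    return best
```

With a deprecated plan, an old note, and an active decision on the same topic, only the decision survives, which mirrors the three failure conditions listed above.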

Terminal Recovery

Run tests, inspect failures, fix the issue, rerun verification, and explain what changed.

  • Fails for fake test claims.
  • Fails for guessing without logs.
  • Fails for unrelated edits.
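The run, fix, rerun loop can be made auditable by recording each command together with its exit code and captured output, so that a claimed pass is backed by a log. This is a sketch under assumptions: `run_and_log` and `verified` are hypothetical helpers, and the record layout is not a WYBench format.

```python
import subprocess


def run_and_log(command):
    """Run a verification command and keep its exit code and full output."""
    proc = subprocess.run(command, capture_output=True, text=True)
    return {
        "command": list(command),
        "exit_code": proc.returncode,
        "log": proc.stdout + proc.stderr,
    }


def verified(before, after):
    """Accept a fix only when a recorded failing run is followed by a
    recorded passing rerun of the same command."""
    return (
        before["command"] == after["command"]
        and before["exit_code"] != 0
        and after["exit_code"] == 0
    )
```

Keeping the failing log next to the passing one addresses the first two failure conditions: the test claim is backed by output, and the fix can be traced to an observed failure rather than a guess.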

Long Instruction Chain

Follow a dense requirement list without forgetting constraints in the middle or end.

  • Fails for partial implementation.
  • Fails for extra features.
  • Fails for breaking do-not-change rules.
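The three failure conditions above map onto simple set comparisons once requirements and protected paths are written down explicitly. The following is an illustrative sketch: `audit_change`, its parameters, and the glob-style protected patterns are assumptions, not a description of how WYBench scores this task.

```python
from fnmatch import fnmatch


def audit_change(required, delivered, changed_files, protected_patterns):
    """Compare delivered work against the requirement list.

    missing           -> partial implementation
    extra             -> features nobody asked for
    protected_touched -> files covered by a do-not-change rule
    """
    required, delivered = set(required), set(delivered)
    return {
        "missing": sorted(required - delivered),
        "extra": sorted(delivered - required),
        "protected_touched": sorted(
            path for path in changed_files
            if any(fnmatch(path, pattern) for pattern in protected_patterns)
        ),
    }
```

A run passes this check only when all three lists come back empty.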

Multi-Session Continuity

Plan, implement, fix, and audit across sessions while preserving the original project rules.

  • Fails for forgetting earlier rules.
  • Fails for reviving removed ideas.
  • Fails for a weak final audit.

UI / Website Integration

Add a public benchmark page that matches the existing Lee Wyatt Corp chrome, login state, and layout patterns.

  • Fails for a separate design language.
  • Fails for duplicate auth.
  • Fails for broken nav.

Evidence Required

Each official result must include the initial prompt, model output, tool logs, final files changed, test results, human review notes, pass/fail reason, and verification status.
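The evidence list above reads naturally as a record with eight required fields. A minimal completeness check might look like the sketch below; the field names are an assumed snake_case rendering of the items in the sentence above, and `missing_evidence` is a hypothetical helper rather than a WYBench API.

```python
# Assumed key names, one per evidence item listed above.
REQUIRED_FIELDS = (
    "initial_prompt",
    "model_output",
    "tool_logs",
    "files_changed",
    "test_results",
    "human_review_notes",
    "pass_fail_reason",
    "verification_status",
)


def missing_evidence(result):
    """Return the required fields that are absent or empty in a result dict."""
    empty_values = (None, "", [], {})
    return [
        field for field in REQUIRED_FIELDS
        if result.get(field) in empty_values
    ]
```

Under this sketch, a result would count as official only when `missing_evidence` returns an empty list.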