Tasks

WYBench task packs are built to expose long-term agent reliability, not just one-shot patch ability.

Deep Repo Navigation

Find the correct files in a large project and add a setting without creating a parallel system.

Fails for duplicate systems.
Fails for invented paths.
Fails for editing the wrong module.
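
The "invented paths" criterion could be checked mechanically. The sketch below is illustrative only: the function name invented_paths and the idea of comparing an agent's claimed paths against the repository on disk are assumptions, not part of WYBench itself.

```python
from pathlib import Path
from typing import Iterable, List


def invented_paths(repo_root: str, claimed_paths: Iterable[str]) -> List[str]:
    """Return the claimed paths that do not exist under repo_root.

    Hypothetical helper: assumes the agent reports repo-relative paths.
    """
    root = Path(repo_root).resolve()
    missing = []
    for rel in claimed_paths:
        candidate = (root / rel).resolve()
        # A path that escapes the repo, or is absent on disk, counts as invented.
        inside = candidate == root or root in candidate.parents
        if not inside or not candidate.exists():
            missing.append(rel)
    return missing
```

A reviewer could run this on the paths named in the model output before the edit is applied; any non-empty result points to a path that was referenced but never existed.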

Obsidian Brain / Memory Following

Read current rules, old notes, deprecated plans, and active decisions, then implement from current truth.

Fails for using deprecated notes.
Fails for contradicting saved decisions.
Fails for ignoring the hierarchy between current rules and older notes.

Terminal Recovery

Run tests, inspect failures, fix the issue, rerun verification, and explain what changed.

Fails for claiming test results that were never produced.
Fails for guessing without logs.
Fails for unrelated edits.
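
One way to keep test claims tied to real logs is to capture every verification run as data. This is a minimal sketch under assumptions: run_verification and its returned keys are hypothetical names, and the actual test command for a task pack is not specified here.

```python
import subprocess
from typing import Dict, List


def run_verification(command: List[str], timeout_s: float = 600) -> Dict[str, object]:
    """Run a test command and keep its exit code and output as evidence.

    Hypothetical helper: the command itself is supplied by the caller.
    """
    completed = subprocess.run(
        command,
        capture_output=True,  # keep stdout and stderr instead of printing them
        text=True,            # decode output to str
        timeout=timeout_s,
    )
    return {
        "command": command,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        "passed": completed.returncode == 0,
    }
```

Calling it once before the fix and once after, for example with a command such as ["python", "-m", "pytest", "-q"] if the project uses pytest, gives a before and after log pair that the explanation of the change can quote.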

Long Instruction Chain

Follow a dense requirement list without forgetting constraints in the middle or end.

Fails for partial implementation.
Fails for extra features.
Fails for breaking do-not-change rules.
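
Do-not-change rules lend themselves to an automatic check on the list of changed files. The sketch below assumes such rules can be written as glob patterns over repo-relative paths; the function name and the pattern format are illustrative, not a WYBench convention.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def protected_violations(
    changed_files: Iterable[str], protected_patterns: Iterable[str]
) -> List[str]:
    """Return changed files that match any do-not-change pattern.

    Assumption: patterns are shell-style globs, where "*" also matches "/".
    """
    patterns = list(protected_patterns)
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

An empty result does not prove the rules were followed, since not every constraint is a file boundary, but a non-empty result is a clear failure signal.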

Multi-Session Continuity

Plan, implement, fix, and audit across sessions while preserving the original project rules.

Fails for forgetting earlier rules.
Fails for reviving removed ideas.
Fails for a weak final audit.

UI / Website Integration

Add a public benchmark page matching existing Lee Wyatt Corp chrome, login state, and layout patterns.

Fails for a separate design language.
Fails for duplicate auth.
Fails for broken nav.

Evidence Required

Each official result must include the initial prompt, model output, tool logs, final files changed, test results, human review notes, pass/fail reason, and verification status.
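
The evidence list above can be modeled as a record with a completeness check. This is a sketch only: the class name, field names, and field types are assumptions chosen to mirror the listed items, and treating an empty value as missing is a simplification.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class EvidenceRecord:
    """One official result. Field names are hypothetical, one per required item."""
    initial_prompt: str = ""
    model_output: str = ""
    tool_logs: List[str] = field(default_factory=list)
    files_changed: List[str] = field(default_factory=list)
    test_results: str = ""
    human_review_notes: str = ""
    pass_fail_reason: str = ""
    verification_status: str = ""


def missing_evidence(record: EvidenceRecord) -> List[str]:
    """Return the names of fields that are empty, in declaration order."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A result would be accepted as official only when missing_evidence returns an empty list; anything else names exactly which pieces of evidence still need to be attached.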