Bellman 3a833c3395 feat(benchmarks): add per-agent prompt benchmark suite for all 4 consolidated agents (#1437)
Extend the existing harsh-critic benchmark framework with reusable
benchmarks for code-reviewer, debugger, and executor agents. Enables
measurable prompt tuning by comparing old (pre-consolidation) vs new
(merged) prompts with ground-truth scoring.

New infrastructure:
- benchmarks/shared/ — generalized scoring types, parser, reporter, runner
- benchmarks/code-reviewer/ — 3 fixtures (SQL injection, clean code, payment edge cases)
- benchmarks/debugger/ — 3 fixtures (React undefined, Redis intermittent, TS build errors)
- benchmarks/executor/ — 3 fixtures (trivial, scoped, complex tasks)
- benchmarks/run-all.ts — top-level orchestrator with --save-baseline and --compare modes
- npm scripts: bench:prompts, bench:prompts:save, bench:prompts:compare
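
As a rough illustration of what ground-truth scoring and a baseline comparison could look like, here is a minimal sketch. It is not the code in benchmarks/shared/: the names Fixture, Score, scoreFindings, and compareToBaseline, and the choice of precision/recall/F1 as the metric, are all assumptions made for this example.

```typescript
// Hypothetical types: the real definitions in benchmarks/shared/ may differ.
interface Fixture {
  id: string;
  expectedFindings: string[]; // ground-truth issues the agent should report
}

interface Score {
  precision: number;
  recall: number;
  f1: number;
}

// Score the findings an agent reported against a fixture's ground truth.
// Duplicate reported findings are counted once.
function scoreFindings(fixture: Fixture, reported: string[]): Score {
  const unique = reported.filter((f, i) => reported.indexOf(f) === i);
  const expected = fixture.expectedFindings;
  const truePositives = unique.filter((f) => expected.indexOf(f) !== -1).length;

  // A clean fixture (nothing expected) with nothing reported scores perfectly.
  const precision =
    unique.length === 0 ? (expected.length === 0 ? 1 : 0) : truePositives / unique.length;
  const recall = expected.length === 0 ? 1 : truePositives / expected.length;
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

// Positive result: the current prompt beats the saved baseline on F1.
function compareToBaseline(baseline: Score, current: Score): number {
  return current.f1 - baseline.f1;
}
```

Under these assumptions, a save-baseline run would persist the Score per fixture, and a compare run would call compareToBaseline for each fixture and report the deltas.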

Each benchmark includes archived pre-consolidation prompts for reproducible
comparison even after old agent files are deleted.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 15:13:08 +09:00