mirror of
https://fastgit.cc/github.com/Yeachan-Heo/oh-my-claudecode
synced 2026-04-20 21:00:50 +08:00
Extend the existing harsh-critic benchmark framework with reusable benchmarks for the code-reviewer, debugger, and executor agents. This enables measurable prompt tuning by comparing old (pre-consolidation) prompts against new (merged) prompts with ground-truth scoring.

New infrastructure:

- benchmarks/shared/ — generalized scoring types, parser, reporter, runner
- benchmarks/code-reviewer/ — 3 fixtures (SQL injection, clean code, payment edge cases)
- benchmarks/debugger/ — 3 fixtures (React undefined, Redis intermittent, TS build errors)
- benchmarks/executor/ — 3 fixtures (trivial, scoped, complex tasks)
- benchmarks/run-all.ts — top-level orchestrator with --save-baseline and --compare modes
- npm scripts: bench:prompts, bench:prompts:save, bench:prompts:compare

Each benchmark includes archived pre-consolidation prompts so comparisons stay reproducible even after the old agent files are deleted.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
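The --save-baseline / --compare flow described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation: the `BenchScore` and `Delta` types and the `compareToBaseline` function are hypothetical names, assuming the runner emits one ground-truth score per fixture.

```typescript
// Hypothetical sketch of the --compare mode: diff a fresh benchmark run
// against a previously saved baseline, reporting per-fixture score deltas.
// All names here are illustrative, not the repo's real API.

interface BenchScore {
  benchmark: string; // e.g. "code-reviewer/sql-injection"
  score: number;     // ground-truth score in [0, 1]
}

interface Delta {
  benchmark: string;
  baseline: number;
  current: number;
  delta: number;
}

function compareToBaseline(baseline: BenchScore[], current: BenchScore[]): Delta[] {
  // Index the saved baseline by fixture name for O(1) lookup.
  const base = new Map(baseline.map((b) => [b.benchmark, b.score]));
  // Only report fixtures that exist in both runs.
  return current
    .filter((c) => base.has(c.benchmark))
    .map((c) => ({
      benchmark: c.benchmark,
      baseline: base.get(c.benchmark)!,
      current: c.score,
      delta: c.score - base.get(c.benchmark)!,
    }));
}

// Example: an old prompt scoring 0.6 on a fixture vs a merged prompt scoring 0.8.
const deltas = compareToBaseline(
  [{ benchmark: "code-reviewer/sql-injection", score: 0.6 }],
  [{ benchmark: "code-reviewer/sql-injection", score: 0.8 }],
);
console.log(deltas[0].delta.toFixed(2)); // "0.20"
```

Keeping the comparison a pure function over two score lists is what makes archiving the pre-consolidation prompts sufficient for reproducible before/after comparisons.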