
# SWE-bench Verified Results

## Summary

| Mode | Pass Rate | Avg Tokens | Avg Time | Total Cost |
|------|-----------|------------|----------|------------|
| Vanilla | -% | - | -m | $- |
| OMC | -% | - | -m | $- |

Delta: - percentage points improvement

## Methodology

### Dataset

- Benchmark: SWE-bench Verified (500 instances)
- Source: `princeton-nlp/SWE-bench_Verified`
- Selection: Curated subset of real GitHub issues with verified solutions

### Evaluation Setup

- Model: Claude Sonnet 4.6 (`claude-sonnet-4-6-20260217`)
- Max Tokens: 16,384 output tokens per instance
- Timeout: 30 minutes per instance
- Workers: 4 parallel evaluations
- Hardware: [Specify machine type]
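The same settings, written out as a small configuration sketch. The key names below are illustrative only and are not necessarily the ones `run_benchmark.py` actually uses.

```python
# Illustrative constants mirroring the evaluation setup above.
# Key names are assumptions, not the actual run_benchmark.py schema.
EVAL_CONFIG = {
    "model": "claude-sonnet-4-6-20260217",  # Claude Sonnet 4.6
    "max_output_tokens": 16_384,            # per instance
    "timeout_seconds": 30 * 60,             # 30 minutes per instance
    "num_workers": 4,                       # parallel evaluations
}
```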

### Vanilla Configuration

Standard Claude Code with default settings:

- No OMC extensions loaded
- Default system prompt
- Single-agent execution

### OMC Configuration

Oh-My-ClaudeCode enhanced with:

- Multi-agent orchestration
- Specialist delegation (architect, executor, etc.)
- Ralph persistence loop for complex tasks
- Ultrawork parallel execution
- Automatic skill invocation

### Metrics Collected

1. Pass Rate: Percentage of instances where the generated patch passes all tests
2. Token Usage: Input + output tokens consumed per instance
3. Time: Wall-clock time from start to patch generation
4. Cost: Estimated API cost based on token usage
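As a rough illustration of how these four numbers roll up from per-instance records, here is a minimal aggregation sketch. The record field names and per-token prices are assumptions for illustration, not the repo's actual result schema or current Anthropic pricing.

```python
# Minimal aggregation sketch for the four metrics above.
# Field names and prices are placeholders, not the real schema/pricing.
PRICE_PER_MTOK_IN = 3.00    # USD per million input tokens (assumed)
PRICE_PER_MTOK_OUT = 15.00  # USD per million output tokens (assumed)

def summarize(instances: list[dict]) -> dict:
    """instances: [{"resolved": bool, "input_tokens": int,
                    "output_tokens": int, "wall_time_s": float}, ...]"""
    n = len(instances)
    total_cost = sum(
        inst["input_tokens"] / 1e6 * PRICE_PER_MTOK_IN
        + inst["output_tokens"] / 1e6 * PRICE_PER_MTOK_OUT
        for inst in instances
    )
    return {
        "pass_rate_pct": 100 * sum(inst["resolved"] for inst in instances) / n,
        "avg_tokens": sum(inst["input_tokens"] + inst["output_tokens"] for inst in instances) / n,
        "avg_time_min": sum(inst["wall_time_s"] for inst in instances) / (60 * n),
        "total_cost_usd": round(total_cost, 2),
    }
```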

## Results Breakdown

### By Repository

| Repository | Vanilla | OMC | Delta |
|------------|---------|-----|-------|
| django | -/- | -/- | - |
| flask | -/- | -/- | - |
| requests | -/- | -/- | - |
| ... | ... | ... | ... |

### By Difficulty

| Difficulty | Vanilla | OMC | Delta |
|------------|---------|-----|-------|
| Easy | -% | -% | - |
| Medium | -% | -% | - |
| Hard | -% | -% | - |

## Failure Analysis

Top failure categories for each mode:

**Vanilla:**

1. Category: N failures (N%)
2. ...

**OMC:**

1. Category: N failures (N%)
2. ...

## Improvements

Instances that OMC solved but vanilla failed:

| Instance ID | Category | Notes |
|-------------|----------|-------|
| ... | ... | ... |

## Regressions

Instances that vanilla solved but OMC failed:

| Instance ID | Category | Notes |
|-------------|----------|-------|
| ... | ... | ... |
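Both tables can be derived as set differences between the two runs' resolved instances. A minimal sketch is below; it assumes each `summary.json` lists resolved instance IDs under a `resolved_ids` key, which is an assumption and may not match the actual output schema.

```python
# Sketch of how the Improvements/Regressions sets can be derived.
# Assumes each summary.json lists resolved instance IDs under "resolved_ids".
import json

def resolved_ids(path: str) -> set[str]:
    with open(path) as f:
        return set(json.load(f)["resolved_ids"])

vanilla = resolved_ids("results/vanilla/summary.json")
omc = resolved_ids("results/omc/summary.json")

improvements = sorted(omc - vanilla)   # OMC solved, vanilla failed
regressions = sorted(vanilla - omc)    # vanilla solved, OMC failed
print(f"{len(improvements)} improvements, {len(regressions)} regressions")
```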

## Reproduction

### Prerequisites

```bash
# Install SWE-bench
pip install swebench

# Install oh-my-claudecode (if testing OMC)
# Follow the setup instructions in the main README
```

### Running Vanilla Baseline

```bash
# Generate predictions
python run_benchmark.py --mode vanilla --dataset swe-bench-verified --output results/vanilla/

# Evaluate
python evaluate.py --predictions results/vanilla/predictions.json --output results/vanilla/
```

### Running OMC

```bash
# Generate predictions with OMC
python run_benchmark.py --mode omc --dataset swe-bench-verified --output results/omc/

# Evaluate
python evaluate.py --predictions results/omc/predictions.json --output results/omc/
```

### Comparing Results

```bash
python compare_results.py --vanilla results/vanilla/ --omc results/omc/ --output comparison/
```
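If you only want the headline delta without the full comparison report, a small cross-check sketch is below. It assumes each `summary.json` exposes a fractional `pass_rate` field, which may not match the actual output schema.

```python
# Quick headline-delta check from the two summary files.
# Assumes a fractional "pass_rate" field; adjust to the real schema.
import json
from pathlib import Path

def pass_rate(results_dir: str) -> float:
    return json.loads((Path(results_dir) / "summary.json").read_text())["pass_rate"]

vanilla, omc = pass_rate("results/vanilla"), pass_rate("results/omc")
print(f"Vanilla: {vanilla:.1%}  OMC: {omc:.1%}  Delta: {(omc - vanilla) * 100:+.1f} pp")
```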

### Analyzing Failures

```bash
python analyze_failures.py --vanilla results/vanilla/ --omc results/omc/ --compare --output analysis/
```

## Files

```
results/
├── vanilla/
│   ├── predictions.json      # Generated patches
│   ├── summary.json          # Evaluation summary
│   ├── report.md             # Human-readable report
│   └── logs/                 # Per-instance logs
├── omc/
│   ├── predictions.json
│   ├── summary.json
│   ├── report.md
│   └── logs/
├── comparison/
│   ├── comparison_*.json     # Detailed comparison data
│   ├── comparison_*.md       # Comparison report
│   └── comparison_*.csv      # Per-instance CSV
└── analysis/
    ├── failure_analysis_*.json
    └── failure_analysis_*.md
```
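`predictions.json` is what the SWE-bench harness consumes, so a quick sanity check before evaluation can save a failed run. The sketch below assumes the common list-of-records layout with `instance_id`, `model_name_or_path`, and `model_patch` fields; adjust if the runner writes a different shape.

```python
# Sanity-check predictions.json before handing it to the evaluator.
# Assumes a JSON list of records with the fields named below.
import json

REQUIRED = {"instance_id", "model_name_or_path", "model_patch"}

with open("results/vanilla/predictions.json") as f:
    predictions = json.load(f)

bad = [p.get("instance_id", "<unknown>") for p in predictions if not REQUIRED <= set(p)]
print(f"{len(predictions)} predictions, {len(bad)} missing required fields: {bad[:5]}")
```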

## Notes

- Results may vary with the API model version and sampling temperature
- Some instances may have non-deterministic test outcomes
- Cost estimates are approximate and based on published pricing

## References


Last updated: [DATE]