mirror of
https://fastgit.cc/github.com/Yeachan-Heo/oh-my-claudecode
synced 2026-04-20 21:00:50 +08:00
# SWE-bench Verified Results

## Summary
| Mode | Pass Rate | Avg Tokens | Avg Time | Total Cost |
|---|---|---|---|---|
| Vanilla | -% | - | -m | $- |
| OMC | -% | - | -m | $- |
Delta: - percentage points improvement
## Methodology

### Dataset
- Benchmark: SWE-bench Verified (500 instances)
- Source: princeton-nlp/SWE-bench_Verified
- Selection: Curated subset of real GitHub issues with verified solutions
### Evaluation Setup
- Model: Claude Sonnet 4.6 (claude-sonnet-4-6-20260217)
- Max Tokens: 16,384 output tokens per instance
- Timeout: 30 minutes per instance
- Workers: 4 parallel evaluations
- Hardware: [Specify machine type]
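For reference, the settings above can be collected in one place. This is a sketch only: the field names and config shape are illustrative and are not taken from the actual `run_benchmark.py`.

```python
# Illustrative evaluation settings mirroring the list above; the actual
# benchmark runner may name these flags/fields differently.
EVAL_CONFIG = {
    "model": "claude-sonnet-4-6-20260217",
    "max_output_tokens": 16_384,   # output tokens per instance
    "timeout_seconds": 30 * 60,    # 30-minute wall-clock limit per instance
    "workers": 4,                  # parallel evaluations
    "dataset": "princeton-nlp/SWE-bench_Verified",
}
```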
### Vanilla Configuration
Standard Claude Code with default settings:
- No OMC extensions loaded
- Default system prompt
- Single-agent execution
### OMC Configuration
Oh-My-ClaudeCode enhanced with:
- Multi-agent orchestration
- Specialist delegation (architect, executor, etc.)
- Ralph persistence loop for complex tasks
- Ultrawork parallel execution
- Automatic skill invocation
### Metrics Collected
- Pass Rate: Percentage of instances where generated patch passes all tests
- Token Usage: Input + output tokens consumed per instance
- Time: Wall-clock time from start to patch generation
- Cost: Estimated API cost based on token usage
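The metrics above reduce to simple aggregates over per-instance records. A minimal sketch follows; the record fields (`passed`, `input_tokens`, and so on) are assumptions for illustration, not the actual predictions/summary schema.

```python
# Aggregate per-instance records into the summary metrics listed above.
# Field names are hypothetical; adapt them to the real results schema.
def summarize(records):
    n = len(records)
    passed = sum(1 for r in records if r["passed"])
    return {
        "pass_rate": 100.0 * passed / n,  # % of instances passing all tests
        "avg_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in records) / n,
        "avg_time_min": sum(r["wall_seconds"] for r in records) / n / 60,
    }

records = [
    {"passed": True,  "input_tokens": 9_000,  "output_tokens": 1_000, "wall_seconds": 600},
    {"passed": False, "input_tokens": 11_000, "output_tokens": 3_000, "wall_seconds": 1200},
]
print(summarize(records))
```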
## Results Breakdown

### By Repository
| Repository | Vanilla | OMC | Delta |
|---|---|---|---|
| django | -/- | -/- | - |
| flask | -/- | -/- | - |
| requests | -/- | -/- | - |
| ... | ... | ... | ... |
### By Difficulty
| Difficulty | Vanilla | OMC | Delta |
|---|---|---|---|
| Easy | -% | -% | - |
| Medium | -% | -% | - |
| Hard | -% | -% | - |
## Failure Analysis
Top failure categories for each mode:
Vanilla:
- Category: N failures (N%)
- ...
OMC:
- Category: N failures (N%)
- ...
### Improvements
Instances that OMC solved but vanilla failed:
| Instance ID | Category | Notes |
|---|---|---|
| ... | ... | ... |
### Regressions
Instances that vanilla solved but OMC failed:
| Instance ID | Category | Notes |
|---|---|---|
| ... | ... | ... |
## Reproduction

### Prerequisites

```bash
# Install SWE-bench
pip install swebench

# Install oh-my-claudecode (if testing OMC)
# Follow setup instructions in main README
```
### Running Vanilla Baseline

```bash
# Generate predictions
python run_benchmark.py --mode vanilla --dataset swe-bench-verified --output results/vanilla/

# Evaluate
python evaluate.py --predictions results/vanilla/predictions.json --output results/vanilla/
```
### Running OMC

```bash
# Generate predictions with OMC
python run_benchmark.py --mode omc --dataset swe-bench-verified --output results/omc/

# Evaluate
python evaluate.py --predictions results/omc/predictions.json --output results/omc/
```
### Comparing Results

```bash
python compare_results.py --vanilla results/vanilla/ --omc results/omc/ --output comparison/
```
### Analyzing Failures

```bash
python analyze_failures.py --vanilla results/vanilla/ --omc results/omc/ --compare --output analysis/
```
## Files

```
results/
├── vanilla/
│   ├── predictions.json      # Generated patches
│   ├── summary.json          # Evaluation summary
│   ├── report.md             # Human-readable report
│   └── logs/                 # Per-instance logs
├── omc/
│   ├── predictions.json
│   ├── summary.json
│   ├── report.md
│   └── logs/
├── comparison/
│   ├── comparison_*.json     # Detailed comparison data
│   ├── comparison_*.md       # Comparison report
│   └── comparison_*.csv      # Per-instance CSV
└── analysis/
    ├── failure_analysis_*.json
    └── failure_analysis_*.md
```
## Notes
- Results may vary based on API model version and temperature
- Some instances may have non-deterministic test outcomes
- Cost estimates are approximate based on published pricing
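As a concrete example of the cost approximation, the estimate is just token counts scaled by per-million-token rates. The prices below are placeholders, not actual Anthropic pricing; substitute the current published rates.

```python
# Rough cost estimate from token counts. Prices are HYPOTHETICAL
# placeholders in USD per million tokens, not real published rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return approximate USD cost for one instance's token usage."""
    return (input_tokens / 1e6) * PRICE_PER_MTOK["input"] + (
        output_tokens / 1e6
    ) * PRICE_PER_MTOK["output"]

print(f"${estimate_cost(1_000_000, 200_000):.2f}")
```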
## References
Last updated: [DATE]