mirror of https://fastgit.cc/github.com/Yeachan-Heo/oh-my-claudecode synced 2026-04-21 05:12:30 +08:00

Files

Yeachan-Heo df0f27071f Keep agent guidance and benchmark defaults aligned with OMC

The root AGENTS contract had drifted toward OMX/Codex wording and state
paths, which made the project-level guidance inconsistent with the actual
OMC runtime. The benchmark suite also carried split default model strings
across shell wrappers, the Python runner, and results docs, so this cleanup
re-aligned the suite on one current Sonnet 4.6 default and added a narrow
contract test to catch future regressions.

Constraint: Limit the cleanup to stale OMC-vs-OMX references and benchmark model strings
Rejected: Regenerate broader docs/templates wholesale | unnecessary scope for a targeted cleanup issue
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep root AGENTS branding/state paths in sync with OMC runtime contracts and update benchmark defaults in one place when the benchmark model changes
Tested: ./node_modules/.bin/vitest run src/__tests__/tier0-docs-consistency.test.ts src/__tests__/hooks.test.ts src/config/__tests__/loader.test.ts
Tested: python3 -m py_compile benchmark/run_benchmark.py
Tested: bash -n benchmark/quick_test.sh benchmark/run_vanilla.sh benchmark/run_omc.sh benchmark/run_full_comparison.sh
Not-tested: Full benchmark execution against live Anthropic/SWE-bench infrastructure

2026-03-21 01:10:33 +00:00

| Name | Last commit | Date |
| --- | --- | --- |
| predictions | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| results | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| .env.example | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| .gitignore | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| analyze_failures.py | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| compare_results.py | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| docker-compose.yml | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| Dockerfile | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| entrypoint.sh | refactor: remove remaining sisyphus references repository-wide | 2026-02-22 06:52:44 +00:00 |
| evaluate.py | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| quick_test.sh | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| README.md | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| requirements.txt | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |
| run_benchmark.py | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| run_full_comparison.sh | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| run_omc.sh | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| run_vanilla.sh | Keep agent guidance and benchmark defaults aligned with OMC | 2026-03-21 01:10:33 +00:00 |
| setup.sh | fix(permission-handler): remove dead code and add swarm marker support (#144) (#157) | 2026-01-27 23:24:02 +09:00 |

README.md

SWE-bench Benchmark Suite

Automated benchmark comparison between vanilla Claude Code and OMC-enhanced Claude Code.

Quick Start

# 1. One-time setup
./setup.sh

# 2. Quick sanity test (5 instances)
./quick_test.sh

# 3. Full comparison
./run_full_comparison.sh

Scripts

setup.sh

One-time setup and verification:

Installs Python dependencies
Builds Docker image for SWE-bench
Downloads and caches dataset
Verifies API key
Builds OMC project
Runs sanity checks

Usage:

./setup.sh

quick_test.sh

Quick sanity test with limited instances (default: 5):

Tests both vanilla and OMC modes
Fast verification before full runs
Recommended before production benchmarks

Usage:

./quick_test.sh [--limit N] [--model MODEL] [--timeout SECS]

Examples:

./quick_test.sh                    # Test 5 instances
./quick_test.sh --limit 10         # Test 10 instances
./quick_test.sh --timeout 300      # 5 minutes per instance

run_vanilla.sh

Run vanilla Claude Code benchmark:

Standard Claude Code without OMC
Saves predictions to predictions/vanilla/
Logs to logs/vanilla_*.log

Usage:

./run_vanilla.sh [OPTIONS]

Options:

--limit N - Limit to N instances (default: all)
--skip N - Skip first N instances (default: 0)
--model MODEL - Claude model to use (default: claude-sonnet-4-6-20260217)
--timeout SECS - Timeout per instance (default: 300)

Examples:

./run_vanilla.sh                           # Full benchmark
./run_vanilla.sh --limit 100               # First 100 instances
./run_vanilla.sh --skip 100 --limit 100    # Instances 101-200
./run_vanilla.sh --timeout 600             # 10 minutes per instance
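
The `--skip` and `--limit` options together select one contiguous window of the dataset. The arithmetic can be sketched in Python as follows, assuming instances are numbered from 1 in dataset order (as the `# Instances 101-200` example implies). `instance_window` is a hypothetical helper for illustration and is not taken from `run_benchmark.py`.

```python
from typing import Optional, Tuple


def instance_window(total: int, skip: int = 0, limit: Optional[int] = None) -> Tuple[int, int]:
    """Return the 1-based (first, last) instance numbers selected by --skip/--limit.

    Returns (0, 0) when the window is empty, e.g. when skip >= total.
    """
    if skip < 0 or (limit is not None and limit < 0):
        raise ValueError("skip and limit must be non-negative")
    first = skip + 1
    # Without a limit the window runs to the end of the dataset.
    last = total if limit is None else min(total, skip + limit)
    if first > last:
        return (0, 0)
    return (first, last)


# Hypothetical dataset of 300 instances, mirroring `--skip 100 --limit 100`.
print(instance_window(total=300, skip=100, limit=100))  # (101, 200)
```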

run_omc.sh

Run OMC-enhanced benchmark:

Claude Code with oh-my-claudecode orchestration
Saves predictions to predictions/omc/
Logs to logs/omc_*.log

Usage:

./run_omc.sh [OPTIONS]

Options: Same as run_vanilla.sh

Examples:

./run_omc.sh                    # Full benchmark
./run_omc.sh --limit 100        # First 100 instances

run_full_comparison.sh

Complete benchmark suite:

Runs vanilla benchmark
Runs OMC benchmark
Evaluates both runs
Generates comparison report

Usage:

./run_full_comparison.sh [OPTIONS]

Options:

--limit N - Limit to N instances
--skip N - Skip first N instances
--model MODEL - Claude model to use
--timeout SECS - Timeout per instance
--skip-vanilla - Skip vanilla benchmark run
--skip-omc - Skip OMC benchmark run
--skip-eval - Skip evaluation step

Examples:

./run_full_comparison.sh                    # Full comparison
./run_full_comparison.sh --limit 100        # Test 100 instances
./run_full_comparison.sh --skip-vanilla     # Only run OMC (reuse vanilla results)
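
The three `--skip-*` flags prune stages from the four-step pipeline listed above. A minimal sketch of that selection logic, not the script's actual implementation: `plan_stages` and the stage labels are hypothetical, and it assumes that `--skip-eval` also suppresses the report and that evaluation reuses whatever predictions are already on disk.

```python
from typing import List


def plan_stages(skip_vanilla: bool = False,
                skip_omc: bool = False,
                skip_eval: bool = False) -> List[str]:
    """List the pipeline stages that remain after applying the --skip-* flags."""
    stages: List[str] = []
    if not skip_vanilla:
        stages.append("run vanilla benchmark")
    if not skip_omc:
        stages.append("run OMC benchmark")
    if not skip_eval:
        # Assumption: evaluation and the report reuse predictions from earlier runs.
        stages.append("evaluate both runs")
        stages.append("generate comparison report")
    return stages


# Mirrors `./run_full_comparison.sh --skip-vanilla`.
for stage in plan_stages(skip_vanilla=True):
    print(stage)
```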

Directory Structure

benchmark/
├── setup.sh                  # One-time setup
├── quick_test.sh             # Quick sanity test
├── run_vanilla.sh            # Run vanilla benchmark
├── run_omc.sh                # Run OMC benchmark
├── run_full_comparison.sh    # Full comparison suite
├── run_benchmark.py          # Main Python benchmark runner
├── Dockerfile                # Docker image for SWE-bench
├── docker-compose.yml        # Docker compose config
├── requirements.txt          # Python dependencies
├── predictions/
│   ├── vanilla/              # Vanilla predictions
│   └── omc/                  # OMC predictions
├── logs/
│   ├── vanilla_*.log         # Vanilla run logs
│   └── omc_*.log             # OMC run logs
├── results/
│   ├── vanilla_results.json  # Vanilla evaluation
│   ├── omc_results.json      # OMC evaluation
│   └── comparison_report.md  # Comparison report
├── data/                     # Test data
└── cache/                    # Dataset cache

Prerequisites

Docker
Python 3.8+
Node.js and npm
ANTHROPIC_API_KEY environment variable

export ANTHROPIC_API_KEY=your_key_here
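
The prerequisites above can be checked up front with a short script. This is only a sketch: `check_prerequisites` is a hypothetical helper, not something `setup.sh` is known to run. It checks that the `docker`, `node`, and `npm` executables are on `PATH` and that the key is set, but not that the Docker daemon is running or that the key is valid.

```python
import os
import shutil
import sys
from typing import Dict, Mapping, Optional


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which prerequisites appear to be satisfied on this machine."""
    env = os.environ if env is None else env
    return {
        "docker": shutil.which("docker") is not None,
        "python>=3.8": sys.version_info >= (3, 8),
        "node": shutil.which("node") is not None,
        "npm": shutil.which("npm") is not None,
        # Only test for presence; never print the key itself.
        "ANTHROPIC_API_KEY": bool(env.get("ANTHROPIC_API_KEY")),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```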

Workflow

Setup (one-time):
```
./setup.sh
```
Quick Test (recommended):
```
./quick_test.sh
```

Full Benchmark:

# Option A: Run full comparison
./run_full_comparison.sh

# Option B: Run individually
./run_vanilla.sh
./run_omc.sh

Review Results:
- Check results/comparison_report.md
- Inspect predictions in predictions/vanilla/ and predictions/omc/
- Review logs in logs/

Troubleshooting

Setup Issues

./setup.sh
# Check output for specific errors

API Key Issues

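# Verify API key is set (without printing the key itself)
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is not set"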

# Export if missing
export ANTHROPIC_API_KEY=your_key_here

Docker Issues

# Check Docker is running
docker ps

# Rebuild image
docker build -t swe-bench-runner .

Python Dependencies

# Reinstall dependencies
pip install -r requirements.txt

Advanced Usage

Custom Model

./run_vanilla.sh --model claude-opus-4-6-20260205
./run_omc.sh --model claude-opus-4-6-20260205

Longer Timeout

# 15 minutes per instance
./run_full_comparison.sh --timeout 900

Subset Testing

# Test instances 51-150
./run_full_comparison.sh --skip 50 --limit 100

Resume Failed Run

# If vanilla failed at instance 42, --skip 42 skips instances 1-42 and resumes at instance 43.
# To retry instance 42 itself, use --skip 41 instead.
./run_vanilla.sh --skip 42

Performance Tips

Start Small: Use quick_test.sh to verify setup
Parallel Runs: Don't run vanilla and OMC in parallel; both runs share the same API rate limits
Monitor Logs: Use tail -f logs/vanilla_*.log to watch progress
Timeout Tuning: Increase timeout for complex instances
Disk Space: Ensure sufficient space for predictions and Docker containers

Interpreting Results

Metrics

Solve Rate: Percentage of instances successfully resolved
Token Usage: Average tokens per instance
Time: Average time per instance
Error Rate: Percentage of instances that errored
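
As a rough illustration of how these four metrics relate to per-instance outcomes, the sketch below aggregates a list of records. The `InstanceResult` fields and the sample values are made up for the example and do not reflect the actual schema of `vanilla_results.json` or `omc_results.json`. It also assumes every rate is computed over all attempted instances, including those that errored.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstanceResult:
    resolved: bool   # the instance was successfully resolved
    errored: bool    # the run hit an error
    tokens: int      # total tokens used for this instance
    seconds: float   # wall-clock time for this instance


def summarize(results: List[InstanceResult]) -> Dict[str, float]:
    """Compute solve rate, error rate, and per-instance averages."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    return {
        "solve_rate_pct": 100.0 * sum(r.resolved for r in results) / n,
        "error_rate_pct": 100.0 * sum(r.errored for r in results) / n,
        "avg_tokens": sum(r.tokens for r in results) / n,
        "avg_seconds": sum(r.seconds for r in results) / n,
    }


# Made-up sample values, for illustration only.
sample = [
    InstanceResult(True, False, 12000, 90.0),
    InstanceResult(False, False, 20000, 150.0),
    InstanceResult(False, True, 4000, 30.0),
    InstanceResult(True, False, 8000, 50.0),
]
print(summarize(sample))
```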

Comparison Report

The results/comparison_report.md includes:

Side-by-side metrics
Statistical significance tests
Instance-level comparisons
Qualitative analysis
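
This README does not say which significance tests the report uses. Because both modes run on the same instances, one common choice for such paired pass/fail outcomes is McNemar's exact test, sketched below as an assumption rather than a description of `compare_results.py`. The function name and sample outcomes are hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(vanilla: Sequence[bool], omc: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired resolved/unresolved outcomes."""
    if len(vanilla) != len(omc):
        raise ValueError("both runs must cover the same instances")
    # Only discordant pairs (solved by exactly one mode) carry information.
    only_vanilla = sum(v and not o for v, o in zip(vanilla, omc))
    only_omc = sum(o and not v for v, o in zip(vanilla, omc))
    n = only_vanilla + only_omc
    if n == 0:
        return 1.0
    k = min(only_vanilla, only_omc)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up outcomes: one instance solved only by vanilla, four only by OMC.
vanilla_outcomes = [True, False, False, False, False, True]
omc_outcomes = [True, True, True, True, True, False]
print(mcnemar_exact(vanilla_outcomes, omc_outcomes))  # 0.375
```

With only five discordant pairs, the test cannot separate the two modes, which is one reason to prefer larger `--limit` values before drawing conclusions.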

License

Same as parent project (MIT)