Keep agent guidance and benchmark defaults aligned with OMC

The root AGENTS contract had drifted toward OMX/Codex wording and state
paths, leaving the project-level guidance inconsistent with the actual
OMC runtime. The benchmark suite also carried divergent default model
strings across the shell wrappers, the Python runner, and the results docs.
This cleanup aligns the suite on a single current Sonnet 4.6 default and
adds a narrow contract test to catch future regressions.

Constraint: Limit the cleanup to stale OMC-vs-OMX references and benchmark model strings
Rejected: Regenerate broader docs/templates wholesale | unnecessary scope for a targeted cleanup issue
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep root AGENTS branding/state paths in sync with OMC runtime contracts and update benchmark defaults in one place when the benchmark model changes
Tested: ./node_modules/.bin/vitest run src/__tests__/tier0-docs-consistency.test.ts src/__tests__/hooks.test.ts src/config/__tests__/loader.test.ts
Tested: python3 -m py_compile benchmark/run_benchmark.py
Tested: bash -n benchmark/quick_test.sh benchmark/run_vanilla.sh benchmark/run_omc.sh benchmark/run_full_comparison.sh
Not-tested: Full benchmark execution against live Anthropic/SWE-bench infrastructure
Yeachan-Heo
2026-03-21 01:10:33 +00:00
parent 2fe1d7c59d
commit df0f27071f
11 changed files with 68 additions and 34 deletions

@@ -1,6 +1,6 @@
# oh-my-codex - Intelligent Multi-Agent Orchestration
# oh-my-claudecode - Intelligent Multi-Agent Orchestration
You are running with oh-my-codex (OMX), a multi-agent orchestration layer for Codex CLI.
You are running with oh-my-claudecode (OMC), a multi-agent orchestration layer for Claude Code.
Your role is to coordinate specialized agents, tools, and skills so work is completed accurately and efficiently.
<guidance_schema_contract>
@@ -60,7 +60,7 @@ For non-trivial SDK/API/framework usage, delegate to `dependency-expert` to chec
</delegation_rules>
<child_agent_protocol>
Codex CLI spawns child agents via the `spawn_agent` tool (requires `multi_agent = true`).
Claude Code spawns child agents via the `spawn_agent` tool (requires `multi_agent = true`).
To inject role-specific behavior, the parent MUST read the role prompt and pass it in the spawned agent message.
Delegation steps:
@@ -90,7 +90,7 @@ Key constraints:
</child_agent_protocol>
<invocation_conventions>
Codex CLI uses these prefixes for custom commands:
Claude Code uses these prefixes for custom commands:
- `/prompts:name` — invoke a custom prompt (e.g., `/prompts:architect "review auth module"`)
- `$name` — invoke a skill (e.g., `$ralph "fix all tests"`, `$autopilot "build REST API"`)
- `/skills` — browse available skills interactively
@@ -113,7 +113,7 @@ For workflow skills: `$name` (e.g., `$ralph "fix all tests"`)
---
<agent_catalog>
Use `/prompts:name` to invoke specialized agents (Codex CLI custom prompt syntax).
Use `/prompts:name` to invoke specialized agents (Claude Code custom prompt syntax).
Build/Analysis Lane:
- `/prompts:explore`: Fast codebase search, file/symbol mapping
@@ -181,7 +181,7 @@ Detection rules:
Ralph / Ralplan execution gate:
- Enforce **ralplan-first** when ralph is active and planning is not complete.
- Planning is complete only after both `.omx/plans/prd-*.md` and `.omx/plans/test-spec-*.md` exist.
- Planning is complete only after both `.omc/plans/prd-*.md` and `.omc/plans/test-spec-*.md` exist.
- Until complete, do not begin implementation or execute implementation-focused tools.
</keyword_detection>
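As an illustration of the ralplan-first gate in the hunk above, the completion check can be sketched as a pure function over the file names found in `.omc/plans/`. The name `isPlanningComplete` and the choice to pass a pre-listed array of base names are assumptions made for this sketch; the guidance itself only specifies the two glob patterns.

```typescript
// Hypothetical helper, not part of OMC. It only encodes the rule that planning
// is complete once both a `prd-*.md` and a `test-spec-*.md` file exist.
function isPlanningComplete(planFileNames: string[]): boolean {
  // `*` in the guidance globs may match any run of characters, including none.
  const prdPattern = /^prd-.*\.md$/;
  const testSpecPattern = /^test-spec-.*\.md$/;

  const hasPrd = planFileNames.some((name) => prdPattern.test(name));
  const hasTestSpec = planFileNames.some((name) => testSpecPattern.test(name));
  return hasPrd && hasTestSpec;
}
```

A caller would list `.omc/plans/` itself and pass in the base names, which keeps the rule testable without touching the file system.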
@@ -270,10 +270,12 @@ Resume: detect existing team state and resume from the last incomplete stage.
<team_model_resolution>
Team/Swarm worker startup currently uses one shared `agentType` and one shared launch-arg set for all workers in a team run.
For worker model selection, apply this precedence (highest to lowest):
1. Explicit model already present in `OMX_TEAM_WORKER_LAUNCH_ARGS`
2. Inherited leader `--model` (when inheritance is enabled)
3. Injected low-complexity default model: `gpt-5.3-codex-spark` (only when 1+2 are absent and team `agentType` is low-complexity)
For Claude worker model selection, apply this precedence (highest to lowest):
1. Explicit `--model` already present in worker launch args
2. Direct provider model env (`ANTHROPIC_MODEL` / `CLAUDE_MODEL`)
3. Provider tier envs (`CLAUDE_CODE_BEDROCK_SONNET_MODEL`, `ANTHROPIC_DEFAULT_SONNET_MODEL`)
4. OMC tier env (`OMC_MODEL_MEDIUM`)
5. Otherwise let Claude Code use its default model
Model flag normalization contract:
- Accept both `--model <value>` and `--model=<value>`
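A minimal sketch of the Claude worker precedence from the hunk above, together with the two accepted `--model` spellings. The function names, the `Env` type, and two tie-breaking choices are assumptions of this sketch rather than anything taken from the OMC source: within a tier, variables are checked in the order the guidance lists them, and when several `--model` flags appear the first one wins.

```typescript
// Hypothetical names throughout; only the env var names and the tier order
// come from the guidance text.
type Env = Record<string, string | undefined>;

// Accepts both `--model <value>` and `--model=<value>`.
// Assumption: the first usable occurrence wins.
function extractModelFlag(args: string[]): string | undefined {
  for (let i = 0; i < args.length; i++) {
    const arg = args[i];
    if (arg === undefined) continue;
    if (arg === "--model") {
      const next = args[i + 1];
      if (next !== undefined && !next.startsWith("--")) return next;
    } else if (arg.startsWith("--model=")) {
      const value = arg.slice("--model=".length);
      if (value.length > 0) return value;
    }
  }
  return undefined;
}

// Tiers 2 to 4 of the precedence list, flattened in listed order.
const ENV_PRECEDENCE: string[] = [
  "ANTHROPIC_MODEL",
  "CLAUDE_MODEL",
  "CLAUDE_CODE_BEDROCK_SONNET_MODEL",
  "ANTHROPIC_DEFAULT_SONNET_MODEL",
  "OMC_MODEL_MEDIUM",
];

// Returns the model to pass to a worker, or undefined to let Claude Code
// fall back to its own default (tier 5).
function resolveWorkerModel(launchArgs: string[], env: Env): string | undefined {
  const explicit = extractModelFlag(launchArgs);
  if (explicit !== undefined) return explicit;
  for (const name of ENV_PRECEDENCE) {
    const value = env[name]?.trim();
    if (value) return value;
  }
  return undefined;
}
```

Passing `env` as a parameter instead of reading `process.env` directly keeps the function pure, so each tier can be exercised in a unit test.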
@@ -315,7 +317,7 @@ Anti-slop workflow:
Visual iteration gate:
- For visual tasks (reference image(s) + generated screenshot), run `$visual-verdict` every iteration before the next edit.
- Persist visual verdict JSON in `.omx/state/{scope}/ralph-progress.json` with both numeric (`score`, threshold pass/fail) and qualitative (`reasoning`, `differences`, `suggestions`, `next_actions`) feedback.
- Persist visual verdict JSON in `.omc/state/{scope}/ralph-progress.json` with both numeric (`score`, threshold pass/fail) and qualitative (`reasoning`, `differences`, `suggestions`, `next_actions`) feedback.
Continuation:
Before concluding, confirm: zero pending tasks, all features working, tests passing, zero errors, verification evidence collected. If any item is unchecked, continue working.
@@ -340,14 +342,14 @@ When not to cancel:
---
<state_management>
oh-my-codex uses the `.omx/` directory for persistent state:
- `.omx/state/` -- Mode state files (JSON)
- `.omx/notepad.md` -- Session-persistent notes
- `.omx/project-memory.json` -- Cross-session project knowledge
- `.omx/plans/` -- Planning documents
- `.omx/logs/` -- Audit logs
oh-my-claudecode uses the `.omc/` directory for persistent state:
- `.omc/state/` -- Mode state files (JSON)
- `.omc/notepad.md` -- Session-persistent notes
- `.omc/project-memory.json` -- Cross-session project knowledge
- `.omc/plans/` -- Planning documents
- `.omc/logs/` -- Audit logs
Tools are available via MCP when configured (`omx setup` registers all servers):
Tools are available via MCP when configured (`omc setup` registers all servers):
State & Memory:
- `state_read`, `state_write`, `state_clear`, `state_list_active`, `state_get_status`
@@ -388,7 +390,7 @@ Recommended mode fields:
## Setup
Run `omx setup` to install all components. Run `omx doctor` to verify installation.
Run `omc setup` to install all components. Run `omc doctor` to verify installation.
---

@@ -63,7 +63,7 @@ Run vanilla Claude Code benchmark:
**Options:**
- `--limit N` - Limit to N instances (default: all)
- `--skip N` - Skip first N instances (default: 0)
- `--model MODEL` - Claude model to use (default: claude-sonnet-4.5-20250929)
- `--model MODEL` - Claude model to use (default: claude-sonnet-4-6-20260217)
- `--timeout SECS` - Timeout per instance (default: 300)
**Examples:**

@@ -36,7 +36,7 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Quick test configuration
TEST_LIMIT=5
MODEL="claude-sonnet-4.5-20250929"
MODEL="claude-sonnet-4-6-20260217"
TIMEOUT="180" # 3 minutes per instance for quick test
# Parse arguments
@@ -61,7 +61,7 @@ while [[ $# -gt 0 ]]; do
echo ""
echo "Options:"
echo " --limit N Number of instances to test (default: 5)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4.5-20250929)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4-6-20260217)"
echo " --timeout SECS Timeout per instance (default: 180)"
echo " -h, --help Show this help message"
exit 0

@@ -19,7 +19,7 @@
### Evaluation Setup
- **Model:** Claude Sonnet 4 (claude-sonnet-4-20250514)
- **Model:** Claude Sonnet 4.6 (claude-sonnet-4-6-20260217)
- **Max Tokens:** 16,384 output tokens per instance
- **Timeout:** 30 minutes per instance
- **Workers:** 4 parallel evaluations

@@ -58,7 +58,7 @@ class BenchmarkConfig:
limit: Optional[int] = None
retries: int = 3
retry_delay: int = 30
model: str = "claude-sonnet-4-20250514"
model: str = "claude-sonnet-4-6-20260217"
skip: int = 0
@@ -581,8 +581,8 @@ Examples:
)
parser.add_argument(
"--model",
default="claude-sonnet-4-20250514",
help="Claude model to use (default: claude-sonnet-4-20250514)",
default="claude-sonnet-4-6-20260217",
help="Claude model to use (default: claude-sonnet-4-6-20260217)",
)
parser.add_argument(
"--skip",

@@ -37,7 +37,7 @@ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Parse arguments
LIMIT=""
SKIP=""
MODEL="claude-sonnet-4.5-20250929"
MODEL="claude-sonnet-4-6-20260217"
TIMEOUT="300"
SKIP_VANILLA=false
SKIP_OMC=false
@@ -81,7 +81,7 @@ while [[ $# -gt 0 ]]; do
echo "Options:"
echo " --limit N Limit to N instances (default: all)"
echo " --skip N Skip first N instances (default: 0)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4.5-20250929)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4-6-20260217)"
echo " --timeout SECS Timeout per instance (default: 300)"
echo " --skip-vanilla Skip vanilla benchmark run"
echo " --skip-omc Skip OMC benchmark run"

@@ -38,7 +38,7 @@ LOG_FILE="$LOGS_DIR/${RUN_MODE}_${TIMESTAMP}.log"
# Parse arguments
LIMIT=""
SKIP=""
MODEL="claude-sonnet-4.5-20250929"
MODEL="claude-sonnet-4-6-20260217"
TIMEOUT="300"
while [[ $# -gt 0 ]]; do
@@ -65,7 +65,7 @@ while [[ $# -gt 0 ]]; do
echo "Options:"
echo " --limit N Limit to N instances (default: all)"
echo " --skip N Skip first N instances (default: 0)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4.5-20250929)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4-6-20260217)"
echo " --timeout SECS Timeout per instance (default: 300)"
echo " -h, --help Show this help message"
exit 0

@@ -38,7 +38,7 @@ LOG_FILE="$LOGS_DIR/${RUN_MODE}_${TIMESTAMP}.log"
# Parse arguments
LIMIT=""
SKIP=""
MODEL="claude-sonnet-4.5-20250929"
MODEL="claude-sonnet-4-6-20260217"
TIMEOUT="300"
while [[ $# -gt 0 ]]; do
@@ -65,7 +65,7 @@ while [[ $# -gt 0 ]]; do
echo "Options:"
echo " --limit N Limit to N instances (default: all)"
echo " --skip N Skip first N instances (default: 0)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4.5-20250929)"
echo " --model MODEL Claude model to use (default: claude-sonnet-4-6-20260217)"
echo " --timeout SECS Timeout per instance (default: 300)"
echo " -h, --help Show this help message"
exit 0

@@ -577,7 +577,7 @@ describe('Team staged workflow integration', () => {
});
it('compacts OMC-style root AGENTS guidance on session-start without dropping key sections', async () => {
const agentsContent = `# oh-my-codex - Intelligent Multi-Agent Orchestration
const agentsContent = `# oh-my-claudecode - Intelligent Multi-Agent Orchestration
<guidance_schema_contract>
schema

@@ -62,4 +62,36 @@ describe('Tier-0 contract docs consistency', () => {
expect(claudeDoc).toContain('Team orchestration is explicit via `/team`.');
expect(referenceDoc).not.toContain('| `team`, `coordinated team`');
});
it('keeps root AGENTS.md aligned with OMC branding and state paths', () => {
const agentsDoc = readProjectFile('AGENTS.md');
expect(agentsDoc).toContain('# oh-my-claudecode - Intelligent Multi-Agent Orchestration');
expect(agentsDoc).toContain('You are running with oh-my-claudecode (OMC), a multi-agent orchestration layer for Claude Code.');
expect(agentsDoc).toContain('`.omc/state/`');
expect(agentsDoc).toContain('Run `omc setup` to install all components. Run `omc doctor` to verify installation.');
expect(agentsDoc).not.toContain('oh-my-codex');
expect(agentsDoc).not.toContain('OMX_TEAM_WORKER_LAUNCH_ARGS');
expect(agentsDoc).not.toContain('gpt-5.3-codex-spark');
});
it('keeps benchmark default model references aligned across docs and scripts', () => {
const benchmarkReadme = readProjectFile('benchmark', 'README.md');
const benchmarkRunner = readProjectFile('benchmark', 'run_benchmark.py');
const quickTest = readProjectFile('benchmark', 'quick_test.sh');
const vanilla = readProjectFile('benchmark', 'run_vanilla.sh');
const omc = readProjectFile('benchmark', 'run_omc.sh');
const fullComparison = readProjectFile('benchmark', 'run_full_comparison.sh');
const resultsReadme = readProjectFile('benchmark', 'results', 'README.md');
const expectedModel = 'claude-sonnet-4-6-20260217';
for (const content of [benchmarkReadme, benchmarkRunner, quickTest, vanilla, omc, fullComparison, resultsReadme]) {
expect(content).toContain(expectedModel);
}
expect(benchmarkReadme).not.toContain('claude-sonnet-4.5-20250929');
expect(benchmarkRunner).not.toContain('claude-sonnet-4-20250514');
expect(resultsReadme).toContain('Claude Sonnet 4.6');
});
});
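The benchmark test above rejects two specific stale strings. A broader variant could flag any Sonnet model ID that differs from the expected default. The helper name `findStaleSonnetIds` and the assumed ID shape (`claude-sonnet-` followed by digits, dots, and hyphens, ending in a digit) are illustrative assumptions for this sketch, not part of the commit.

```typescript
// Hypothetical helper: lists Sonnet model IDs in `content` that differ from
// the expected default. Each stale ID is reported once.
function findStaleSonnetIds(content: string, expectedModel: string): string[] {
  // Assumed ID shape; trailing punctuation such as ")" or "." is not captured.
  const idPattern = /claude-sonnet-[0-9][0-9.\-]*[0-9]/g;
  const found: string[] = content.match(idPattern) ?? [];
  const stale = found.filter((id) => id !== expectedModel);
  return Array.from(new Set(stale));
}
```

Inside the existing loop over benchmark files, an assertion that this returns an empty array would then cover stale IDs that the current `not.toContain` checks do not name.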

@@ -141,7 +141,7 @@ describe("startup context compaction", () => {
try {
const omcAgentsPath = join(tempDir, "AGENTS.md");
const omcGuidance = `# oh-my-codex - Intelligent Multi-Agent Orchestration
const omcGuidance = `# oh-my-claudecode - Intelligent Multi-Agent Orchestration
<guidance_schema_contract>
schema