Commit Graph

198 Commits

Author SHA1 Message Date
Gustavo Madeira Santana
fb92ca1a4d QA: genericize mock streaming fixtures 2026-04-14 23:44:41 -04:00
Gustavo Madeira Santana
85eac42d34 QA: remove runner install fallback catalog
Drop the generated qa-runner catalog and the missing/install placeholder
path for repo-private QA runners. The host should discover bundled QA
commands from manifest plus runtime surface only.

Also trim stale qa-matrix install docs and package metadata so the
source-only QA policy stays consistent.
2026-04-14 17:37:18 -04:00
Gustavo Madeira Santana
653100488d QA: fix matrix runner staging and host registration 2026-04-14 17:18:25 -04:00
Gustavo Madeira Santana
82a2db71e8 refactor(qa): split Matrix QA into optional plugin (#66723)
Merged via squash.

Prepared head SHA: 27241bd089
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com>
Reviewed-by: @gumadeiras
2026-04-14 16:28:57 -04:00
Peter Steinberger
d2240a9476 test: harden qa-lab concurrent web scenarios 2026-04-14 13:42:02 +01:00
Vincent Koc
33a698fe10 fix(qa-lab): correct scenario catalog type 2026-04-14 09:39:20 +01:00
Vincent Koc
900681751d test(qa-lab): seed broken-turn recovery scenarios (#66416) 2026-04-14 09:03:49 +01:00
Vincent Koc
e63cbe831b test(qa-lab): cover GPT-style broken turns 2026-04-14 01:39:49 +01:00
pashpashpash
83f6a26d77 qa: keep OpenAI live defaults when auth exists 2026-04-12 23:34:54 -07:00
pashpashpash
ae4b997a00 qa: prefer codex auth for live defaults 2026-04-12 23:34:54 -07:00
pashpashpash
eede525970 qa: relax repo-contract artifact matcher 2026-04-12 22:43:22 -07:00
pashpashpash
b13844732e qa: salvage GPT-5.4 parity proof slice (#65664)
* test(qa): gate parity prose scenarios on real tool calls

Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no
fake progress / fake tool completion') for the two first/second-wave
parity scenarios that can currently pass with a prose-only reply.

Background: the scenario framework already exposes tool-call assertions
via /debug/requests on the mock server (see approval-turn-tool-followthrough
for the pattern). Most parity scenarios use this seam to require a specific
plannedToolName, but source-docs-discovery-report and subagent-handoff
only checked the assistant's prose text, which means a model could fabricate:

- a Worked / Failed / Blocked / Follow-up report without ever calling
  the read tool on the docs / source files the prompt named
- three labeled 'Delegated task', 'Result', 'Evidence' sections without
  ever calling sessions_spawn to delegate

Both gaps are fake-progress loopholes for the parity gate.

Changes:

- source-docs-discovery-report: require at least one read tool call tied
  to the 'worked, failed, blocked' prompt in /debug/requests. Failure
  message dumps the observed plannedToolName list for debugging.
- subagent-handoff: require at least one sessions_spawn tool call tied
  to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same
  debug-friendly failure message.

Both assertions are gated behind !env.mock so they no-op in live-frontier
mode where the real provider exposes plannedToolName through a different
channel (or not at all).

Not touched: memory-recall is also in the parity pack but its pass path
is legitimately 'read the fact from prior-turn context'. That is a valid
recall strategy, not fake progress, so it is out of scope for this PR.
memory-recall's fake-progress story (no real memory_search call) would
require bigger mock-server changes and belongs in a follow-up that
extends the mock memory pipeline.

Validation:

- pnpm test extensions/qa-lab/src/scenario-catalog.test.ts

Refs #64227

* test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch

Addresses loop-6 review feedback on PR #64681:

1. Copilot / Greptile / codex-connector all flagged that the discovery
   scenario's .includes('worked, failed, blocked') assertion is
   case-sensitive but the real prompt says 'Worked, Failed, Blocked...',
   so the mock-mode assertion never matches. Fix: lowercase-normalize
   allInputText before the contains check.
2. Greptile P2: the expr and message.expr each called fetchJson
   separately, incurring two round-trips to /debug/requests. Fix: hoist
   the fetch to a set step (discoveryDebugRequests / subagentDebugRequests)
   and reuse the snapshot.
3. Copilot: the subagent-handoff assertion scanned the entire request
   log and matched the first request with 'delegate' in its input text,
   which could false-pass on a stale prior scenario. Fix: reverse the
   array and take the most recent matching request instead.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests

Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I
used on pass 1 usually lands on the FOLLOW-UP request after the mock
runs sessions_spawn, not the pre-tool planning request that actually
has plannedToolName === 'sessions_spawn'. The mock only plans that tool
on requests with !toolOutput (mock-openai-server.ts:662), so the
post-tool request has plannedToolName unset and the assertion fails
even when the handoff succeeded.

Fix: switch the assertion back to a forward .some() match but add a
!request.toolOutput filter so the match is pinned to the pre-tool
planning phase. The case-insensitive regex, the fetchJson dedupe, and
the failure-message diagnostic from pass 1 are unchanged.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): pin subagent-handoff tool-call assertion to scenario prompt

Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix
filtered to pre-tool requests but still used a broad
`/delegate|subagent handoff/i` regex. The `subagent-fanout-synthesis`
scenario runs BEFORE `subagent-handoff` in catalog order (scenarios
are sorted by path), and the fanout prompt reads
'Subagent fanout synthesis check: delegate exactly two bounded
subagents sequentially' — which contains 'delegate' and also plans
sessions_spawn pre-tool. That produces a cross-scenario false pass
where the fanout's earlier sessions_spawn request satisfies the
handoff assertion even when the handoff run never delegates.

Fix: tighten the input-text match from `/delegate|subagent handoff/i`
to `/delegate one bounded qa task/i`, which is the exact scenario-
unique substring from the `subagent-handoff` config.prompt. That
pins the assertion to this scenario's request window and closes the
cross-scenario false positive.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): align parity assertion comments with actual filter logic

Addresses two loop-7 Copilot findings on PR #64681:

1. source-docs-discovery-report.md: the explanatory comment said the
   debug request log was 'lowercased for case-insensitive matching',
   but the code actually lowercases each request's allInputText inline
   inside the .some() predicate, not the discoveryDebugRequests
   snapshot. Rewrite the comment to describe the inline-lowercase
   pattern so a future reader matches the code they see.

2. subagent-handoff.md: the comment said the assertion 'must be
   pinned to THIS scenario's request window' but the implementation
   actually relies on matching a scenario-unique prompt substring
   (/delegate one bounded qa task/i), not a request-window. Rewrite
   the comment to describe the substring pinning and keep the
   pre-tool filter rationale intact.

No runtime change; comment-only fix to keep reviewer expectations
aligned with the actual assertion shape.

Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts
(4/4 pass).

Refs #64227

* test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios

* Guard mock-only image parity assertions

* Expand agentic parity second wave

* test(qa): pad parity suspicious-pass isolation to second wave

* qa-lab: parametrize parity report title and drop stale first-wave comment

Addresses two loop-7 Copilot findings on PR #64662:

1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a
   template string that interpolates candidateLabel and baselineLabel, so
   any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate
   title in saved reports. Default CLI flags still produce
   openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair.

2. Stale 'declared first-wave parity scenarios' comment in
   scopeSummaryToParityPack: the parity pack is now the ten-scenario
   first-wave+second-wave set (PR D + PR E). Comment updated to drop
   the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS
   constant the scope is filtering against.

New regression: 'parametrizes the markdown header from the comparison
labels' — asserts that non-default labels (openai/gpt-5.4-alt vs
openai/gpt-5.4) render in the H1.

Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts
(13/13 pass).

Refs #64227

* qa-lab: fail parity gate on required scenario failures regardless of baseline parity

* test(qa): update readable-report test to cover all 10 parity scenarios

* qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels

* Tighten parity label and scenario checks

* fix: tighten parity label provenance checks

* fix: scope parity tool-call metrics to tool lanes

* Fix parity report label and fake-success checks

* fix(qa): tighten parity report edge cases

* qa-lab: add Anthropic /v1/messages mock route for parity baseline

Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity
completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats
Opus 4.6 on the agreed metrics').

Background: the parity gate needs two comparable scenario runs - one
against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the
aggregate metrics and verdict in PR D (#64441) can be computed. Today the
qa-lab mock server only implements /v1/responses, so the baseline run
against Claude Opus 4.6 requires a real Anthropic API key. That makes the
gate impossible to prove end-to-end from a local worktree and means the
CI story is always 'two real providers + quota + keys'.

This PR adds a /v1/messages Anthropic-compatible route to the existing
mock OpenAI server. The route is a thin adapter that:

- Parses Anthropic Messages API request shapes (system as string or
  [{type:text,text}], messages with string or block content, text and
  tool_result and tool_use and image blocks)
- Translates them into the ResponsesInputItem[] shape the existing shared
  scenario dispatcher (buildResponsesPayload) already understands
- Calls the shared dispatcher so both the OpenAI and Anthropic lanes run
  through the exact same scenario prompt-matching logic (same subagent
  fanout state machine, same extractRememberedFact helper, same
  '/debug/requests' telemetry)
- Converts the resulting OpenAI-format events back into an Anthropic
  message response with text and tool_use content blocks and a correct
  stop_reason (tool_use vs end_turn)

Non-streaming only: the QA suite runner falls back to non-streaming mock
mode so real Anthropic SSE isn't necessary for the parity baseline.

Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline
model-list probes from the suite runner resolve without extra config.

Tests added:

- advertises Anthropic claude-opus-4-6 baseline model on /v1/models
- dispatches an Anthropic /v1/messages read tool call for source discovery
  prompts (tool_use stop_reason, correct input path, /debug/requests
  records plannedToolName=read)
- dispatches Anthropic /v1/messages tool_result follow-ups through the
  shared scenario logic (subagent-handoff two-stage flow: tool_use -
  tool_result - 'Delegated task / Evidence' prose summary)

Local validation:

- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass)
- pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass)

Refs #64227
Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper
by giving the baseline lane a local-only mock path.

* qa-lab: fix Anthropic tool_result ordering in messages adapter

Addresses the loop-6 Copilot / Greptile finding on PR #64685: in
`convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were
pushed to `items` inside the per-block loop while the surrounding
user/assistant message was only pushed after the loop finished. That
reordered the function_call_output BEFORE its parent user message
whenever a user turn mixed `tool_result` with fresh text/image blocks,
which broke `extractToolOutput` (it scans AFTER the last user-role
index; function_call_output placed BEFORE that index is invisible to it)
and made the downstream scenario dispatcher behave as if no tool output
had been returned on mixed-content turns.

Fix: buffer `tool_result` and `tool_use` blocks in local arrays during
the per-block loop, push the parent role message first (when it has any
text/image pieces), then push the accumulated function_call /
function_call_output items in original order. tool_result-only user
turns still omit the parent message as before, so the non-mixed
subagent-fanout-synthesis two-stage flow that already worked keeps
working.

Regression added:

- `places tool_result after the parent user message even in mixed-content
  turns` — sends a user turn that mixes a `tool_result` block with a
  trailing fresh text block, then inspects `/debug/last-request` to
  assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found
  the function_call_output AFTER the last user index) and
  `prompt === 'Keep going with the fanout.'` (extractLastUserText picked
  up the trailing fresh text).

Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts
(19/19 pass).

Refs #64227

* qa-lab: reject Anthropic streaming and empty model in messages mock

* qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider

* Handle invalid Anthropic mock JSON

* fix: wire mock parity providers by model ref

* fix(qa): support Anthropic message streaming in mock parity lane

* qa-lab: record provider/model/mode in qa-suite-summary.json

Closes the 'summary cannot be label-verified' half of criterion 5 on the
GPT-5.4 parity completion gate in #64227.

Background: the parity gate in #64441 compares two qa-suite-summary.json
files and trusts whatever candidateLabel / baselineLabel the caller
passes. Today the summary JSON only contains { scenarios, counts }, so
nothing in the summary records which provider/model the run actually
used. If a maintainer swaps candidate and baseline summary paths in a
parity-report call, the verdict is silently mislabeled and nobody can
retroactively verify which run produced which summary.

Changes:

- Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt,
  providerMode, primaryModel (+ provider and model splits),
  alternateModel (+ provider and model splits), fastMode, concurrency,
  scenarioIds (when explicitly filtered).
- Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary
  JSON shape is unit-testable and the parity gate (and any future parity
  wrapper) can import the exact same type rather than reverse-engineering
  the JSON shape at runtime.
- Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so
  --scenario-ids flags are recorded in the summary.

Unit tests added (src/suite.summary-json.test.ts, 5 cases):

- records provider/model/mode so parity gates can verify labels
- includes scenarioIds in run metadata when provided
- records an Anthropic baseline lane cleanly for parity runs
- leaves split fields null when a model ref is malformed
- keeps scenarios and counts alongside the run metadata

This is additive: existing callers of qa-suite-summary.json continue to
see the same { scenarios, counts } shape, just with an extra run field.
No existing consumers of the JSON need to change.

The follow-up 'qa parity run' CLI wrapper (run the parity pack twice
against candidate + baseline, emit two labeled summaries in one command)
stacks cleanly on top of this change and will land as a separate PR
once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand
directly.

Local validation:

- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass)
- pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass)

Refs #64227
Unblocks the final parity run for #64441 / #64662 by making summaries
self-describing.

* qa-lab: strengthen qa-suite-summary builder types and empty-array semantics

Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689
(re-opened as #64789):

1. P2 codex + Copilot: empty `scenarioIds` array was serialized as
   `[]` because of a truthiness check. The CLI passes an empty array
   when --scenario is omitted, so full-suite runs would incorrectly
   record an explicit empty selection. Fix: switch to a
   `length > 0` check so '[] or undefined' both encode as `null`
   in the summary run metadata.

2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate
   consumers but its return type was `Record<string, unknown>`, which
   defeated the point of exporting it. Fix: introduce a concrete
   `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and
   make the builder return it. Downstream code (parity gate, parity
   run wrapper) can now import the type and keep consumers
   type-checked.

3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the
   `'mock-openai' | 'live-frontier'` string union even though
   `QaProviderMode` is already imported from model-selection.ts. Fix:
   reuse `QaProviderMode` so provider-mode additions flow through
   both types at once.

4. Copilot: test fixtures omitted `steps` from the fake scenario
   results, creating shape drift with the real suite scenario-result
   shape. Fix: pad the test fixtures with `steps: []` and tighten the
   scenarioIds assertion to read `json.run.scenarioIds` directly (the
   new concrete return type makes the type-cast unnecessary).

New regression: `treats an empty scenarioIds array as unspecified
(no filter)` — passes `scenarioIds: []` and asserts the summary
records `scenarioIds: null`.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: record executed scenarioIds in summary run metadata

Addresses the pass-3 codex-connector P2 on #64789 (repl of #64689):
`run.scenarioIds` was copied from the raw `params.scenarioIds`
caller input, but `runQaSuite` normalizes that input through
`selectQaSuiteScenarios` which dedupes via `Set` and reorders the
selection to catalog order. When callers repeat --scenario ids or
pass them in non-catalog order, the summary metadata drifted from
the scenarios actually executed, which can make parity/report
tooling treat equivalent runs as different or trust inaccurate
provenance.

Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass
`selectedCatalogScenarios.map(scenario => scenario.id)` instead of
`params?.scenarioIds`, so the summary records the post-selection
executed list. This also covers the full-suite case automatically
(the executed list is the full lane-filtered catalog), giving parity
consumers a stable record of exactly which scenarios landed in the
run regardless of how the caller phrased the request.

buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2
semantics are preserved so the public helper still treats an empty
array as 'unspecified' for any future caller that legitimately passes
one.

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: preserve null scenarioIds for unfiltered suite runs

Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix
always passed `selectedCatalogScenarios.map(...)` to
writeQaSuiteArtifacts, which made unfiltered full-suite runs
indistinguishable from an explicit all-scenarios selection in the
summary metadata. The 'unfiltered → null' semantic (documented in
the buildQaSuiteSummaryJson JSDoc and exercised by the
"treats an empty scenarioIds array as unspecified" regression) was
lost.

Fix: both writeQaSuiteArtifacts call sites now condition on the
caller's original `params.scenarioIds`. When the caller passed an
explicit non-empty filter, record the post-selection executed list
(pass-3 behavior, preserving Set-dedupe + catalog-order
normalization). When the caller passed undefined or an empty array,
pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's
length-check serializes null (pass-2 behavior, preserving unfiltered
semantics).

This keeps both codex-connector findings satisfied simultaneously:
- explicit --scenario filter reorders/dedupes through the executed
  list, not the raw caller input
- unfiltered full-suite run records null, not a full catalog dump
  that would shadow "explicit all-scenarios" selections

Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts
(6/6 pass).

Refs #64227

* qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type

* qa-lab: stage mock auth profiles so the parity gate runs without real credentials

* fix(qa): clean up mock auth staging follow-ups

* ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock

* ci: use supported parity gate runner label

* ci: watch gateway changes in parity gate

* docs: pin parity runbook alternate models

* fix(ci): watch qa-channel parity inputs

* qa: roll up parity proof closeout

* qa: harden mock parity review fixes

* qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection

* qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch

* qa-lab: clean up positive-tone comment + fix stale test expectations

* qa: pin workflow Node version to 22.14.0 + fix stale label-match wording

* qa-lab: refresh mock provider routing expectation

* docs: drop stale parity rollup rewrite from proof slice

* qa: run parity gate against mock lane

* deps: sync qa-lab lockfile

* build: refresh a2ui bundle hash

* ci: widen parity gate triggers

---------

Co-authored-by: Eva <eva@100yen.org>
2026-04-13 13:01:54 +09:00
Josh Avant
3d07dfbb65 feat(qa-lab): add Convex credential broker and admin CLI (#65596)
* QA Lab: add Convex credential source for Telegram lane

* QA Lab: scaffold Convex credential broker

* QA Lab: add Convex credential admin CLI

* QA Lab: harden Convex credential security paths

* QA Broker: validate Telegram payloads on admin add

* fix: note QA Convex credential broker in changelog (#65596) (thanks @joshavant)
2026-04-12 22:03:42 -05:00
Peter Steinberger
20266c14cb feat(qa-lab): add control ui qa-channel roundtrip scenario 2026-04-12 19:41:06 -07:00
Peter Steinberger
1a47660518 feat(browser): add qa web runtime support 2026-04-12 19:41:06 -07:00
Peter Steinberger
fcee268373 feat(qa-lab): support scenario-defined plugin runs 2026-04-12 11:59:50 -07:00
Peter Steinberger
e4841d767d test: stabilize loaded full-suite checks 2026-04-12 11:52:56 -07:00
Marcus Castro
000fc7f233 refactor(qa): add shared QA channel contract and harden worker startup (#64562)
* refactor(qa): add shared transport contract and suite migration

* refactor(qa): harden worker gateway startup

* fix(qa): scope waits and sanitize shutdown artifacts

* fix(qa): confine artifacts and redact preserved logs

* fix(qa): block symlink escapes in artifact paths

* fix(gateway): clear shutdown race timers

* fix(qa): harden shutdown cleanup paths

* fix(qa): sanitize gateway logs in thrown errors

* fix(qa): harden suite startup and artifact paths

* fix(qa): stage bundled plugins from mutated config

* fix(qa): broaden gateway log bearer redaction

* fix(qa-channel): restore runtime export

* fix(qa): stop failed gateway startups as a process tree

* fix(qa-channel): load runtime hook from api surface
2026-04-12 15:02:57 -03:00
Peter Steinberger
a8e140e395 chore: bump version to 2026.4.12 2026-04-12 10:37:18 -07:00
Edder Talmor
5f92094d51 fix: gracefully handle missing QA scenario pack in npm distributions (closes #65082) (#65118)
* fix: allow built-in chat commands to bypass plugins.allow check (closes #65083)

The 'commands' CLI command is a built-in chat command registered in the
chat commands registry, not a plugin-backed command. When plugins.allow
is configured, the error message incorrectly suggests adding 'commands'
to plugins.allow, which produces a second error because no 'commands'
plugin exists.

Check if the command has a plugin entry or manifest alias before
suggesting plugins.allow. Built-in commands without plugin entries
now proceed normally instead of showing misleading errors.

* fix: gracefully handle missing QA scenario pack in npm distributions (closes #65082)

The completion cache update fails with a fatal error when the
qa/scenarios/index.md file is not present in the installed npm package,
even though the directory is listed in package.json "files".

Instead of throwing an error, return an empty QA scenario pack with
default agent identity. This allows completion cache updates to succeed
while QA scenarios remain unavailable in the npm distribution.

The QA scenario pack is primarily used for internal testing and QA
automation — it is not critical for end-user functionality.

* revert: remove unintended run-main.ts changes from PR #65118

The scenario-catalog.ts fix is the correct change for this PR.
The run-main.ts changes were accidentally included and cause a
regression in plugins.allow error handling.

* fix(qa): tolerate missing packaged scenario config

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-04-12 16:50:58 +01:00
Vincent Koc
cded4fc5db test(qa-lab): share mock openai response helpers 2026-04-12 05:34:58 +01:00
Peter Steinberger
0e3f9657da fix(plugins): preserve bundled host compatibility floor 2026-04-12 00:22:32 +01:00
Peter Steinberger
057fe786bd style: apply formatter drift 2026-04-11 21:25:24 +01:00
HDYA
26f633b604 feat(msteams): add federated credential support (certificate + managed identity) (#53615)
* feat(msteams): add federated authentication support (certificate + managed identity + workload identity)

* msteams: fix vitest 4.1.2 compat, type errors, and regenerate config baseline

* msteams: fix lint errors, update fetch allowlist, regenerate protocol Swift

* fix(msteams): gate secret-only delegated auth flows

* fix(ci): unblock gateway watch and install smoke

* fix(ci): restore mergeability for pr 53615

* fix(ci): restore channel registry helper typing

* fix(ci): refresh raw fetch guard allowlist

---------

Co-authored-by: Chudi Huang <Chudi.Huang@microsoft.com>
Co-authored-by: Brad Groux <3053586+BradGroux@users.noreply.github.com>
2026-04-11 13:29:22 -05:00
Tak Hoffman
958c34e82c feat(qa-lab): Add proxy capture stack and QA Lab inspector (#64895)
* Add proxy capture core and CLI

* Expand transport capture coverage

* Add QA Lab capture backend

* Refine QA Lab capture UI

* Fix proxy capture review feedback

* Fix proxy run cleanup and TTS capture

* Fix proxy capture transport follow-ups

* Fix debug proxy CONNECT target parsing

* Harden QA Lab asset path containment
2026-04-11 12:34:57 -05:00
Eva
108e5c89de qa-lab: scope parity metrics and harden fake-success detector
- scope computeQaAgenticParityMetrics to QA_AGENTIC_PARITY_SCENARIO_TITLES
  in buildQaAgenticParityComparison so extra non-parity lanes in a full
  qa-suite-summary.json cannot influence completion / unintended-stop /
  valid-tool / fake-success rates
- filter coverageMismatch by !parityTitleSet.has(name) so each required
  parity scenario fails the gate exactly once (from requiredScenarioCoverage)
  instead of being double-reported as a coverage mismatch too
- drop the bare /\\berror\\b/i rule from SUSPICIOUS_PASS_PATTERNS — it was
  false-flagging legitimate passes that narrate "Error budget: 0" or
  "no errors found" — and replace it with targeted /error occurred/i and
  /an error was/i phrases that indicate a real mid-turn error
- add regressions: error-budget/no-errors-observed passes yield
  fakeSuccessCount === 0, genuine error-occurred narration still flags,
  each missing required scenario fires exactly one failure line, and
  non-parity lanes do not perturb scoped metrics
- isolate the baseline suspicious-pass test by padding it to the full
  first-wave scenario set so it asserts the isolated fake-success path
  via toEqual([...]) rather than toContain
2026-04-11 14:22:48 +01:00
Eva
95f8ad215f Treat skipped parity scenarios as uncovered 2026-04-11 14:22:48 +01:00
Eva
17252df122 Tighten parity proof heuristics 2026-04-11 14:22:48 +01:00
Eva
fd45ea2bf1 test(qa): add compaction retry parity scenario 2026-04-11 14:22:48 +01:00
Eva
3211aa2540 fix(qa): surface missing required scenarios in parity report 2026-04-11 14:22:48 +01:00
Eva
55df6f11a4 fix: harden parity gate review findings 2026-04-11 14:22:48 +01:00
Eva
db09edacfc qa-lab: gate parity on shared scenario coverage 2026-04-11 14:22:48 +01:00
Eva
67fdd3b4df benchmarks: add agentic parity report gate 2026-04-11 14:22:48 +01:00
Eva
79f539d9ce docs: clarify GPT-5.4 parity harness and review flow 2026-04-11 14:22:48 +01:00
Eva
d9c7ddb099 test: add agentic parity scenario pack 2026-04-11 14:22:48 +01:00
Vincent Koc
1167093773 test(qa): drop rebase conflict marker 2026-04-11 13:24:45 +01:00
Vincent Koc
d21573d3a1 fix(qa): catch leaked harness meta replies 2026-04-11 13:23:26 +01:00
Peter Steinberger
d72fb7efb9 fix: harden QA scenario matcher validation 2026-04-11 13:19:13 +01:00
Peter Steinberger
cd89892b1f fix(release): keep private QA bundles out of npm pack 2026-04-11 13:13:11 +01:00
Ayaan Zaidi
478a2e15c5 fix: narrow qa cli facade startup path 2026-04-11 10:41:19 +05:30
Peter Steinberger
1ab6e5dbf0 chore(release): bump version to 2026.4.11 2026-04-11 04:51:17 +01:00
Ayaan Zaidi
959b1472dc test(qa-lab): include telegram mentioned-message scenario 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
b0b0fb308d feat(qa-lab): add telegram mentioned-message scenario 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
a0b5c7b0c4 test(qa-lab): cover telegram command demo scenarios 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
7c14d8b0f4 feat(qa-lab): add telegram command demo scenarios 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
f9a03f0f4b test(qa-lab): cover telegram mention-gating 2026-04-11 08:48:42 +05:30
Ayaan Zaidi
355690a72c feat(qa-lab): add telegram mention-gating scenario 2026-04-11 08:48:42 +05:30
Vincent Koc
350299401f fix(cycles): continue shared seam extraction 2026-04-11 02:46:41 +01:00
Peter Steinberger
39d1a817fa lint: enable small oxlint rules 2026-04-11 02:15:21 +01:00
Peter Steinberger
55578a5c40 fix: stabilize Codex runtime truthfulness (#64439) (thanks @100yenadmin) 2026-04-11 01:19:32 +01:00