openclaw

mirror of https://fastgit.cc/github.com/openclaw/openclaw synced 2026-04-30 22:12:32 +08:00

Author	SHA1	Message	Date
Gustavo Madeira Santana	fb92ca1a4d	QA: genericize mock streaming fixtures	2026-04-14 23:44:41 -04:00
Gustavo Madeira Santana	85eac42d34	QA: remove runner install fallback catalog Drop the generated qa-runner catalog and the missing/install placeholder path for repo-private QA runners. The host should discover bundled QA commands from manifest plus runtime surface only. Also trim stale qa-matrix install docs and package metadata so the source-only QA policy stays consistent.	2026-04-14 17:37:18 -04:00
Gustavo Madeira Santana	653100488d	QA: fix matrix runner staging and host registration	2026-04-14 17:18:25 -04:00
Gustavo Madeira Santana	82a2db71e8	refactor(qa): split Matrix QA into optional plugin (#66723) Merged via squash. Prepared head SHA: `27241bd089` Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Co-authored-by: gumadeiras <5599352+gumadeiras@users.noreply.github.com> Reviewed-by: @gumadeiras	2026-04-14 16:28:57 -04:00
Peter Steinberger	d2240a9476	test: harden qa-lab concurrent web scenarios	2026-04-14 13:42:02 +01:00
Vincent Koc	33a698fe10	fix(qa-lab): correct scenario catalog type	2026-04-14 09:39:20 +01:00
Vincent Koc	900681751d	test(qa-lab): seed broken-turn recovery scenarios (#66416)	2026-04-14 09:03:49 +01:00
Vincent Koc	e63cbe831b	test(qa-lab): cover GPT-style broken turns	2026-04-14 01:39:49 +01:00
pashpashpash	83f6a26d77	qa: keep OpenAI live defaults when auth exists	2026-04-12 23:34:54 -07:00
pashpashpash	ae4b997a00	qa: prefer codex auth for live defaults	2026-04-12 23:34:54 -07:00
pashpashpash	eede525970	qa: relax repo-contract artifact matcher	2026-04-12 22:43:22 -07:00
pashpashpash	b13844732e	qa: salvage GPT-5.4 parity proof slice (#65664 ) * test(qa): gate parity prose scenarios on real tool calls Closes criterion 2 of the GPT-5.4 parity completion gate in #64227 ('no fake progress / fake tool completion') for the two first/second-wave parity scenarios that can currently pass with a prose-only reply. Background: the scenario framework already exposes tool-call assertions via /debug/requests on the mock server (see approval-turn-tool-followthrough for the pattern). Most parity scenarios use this seam to require a specific plannedToolName, but source-docs-discovery-report and subagent-handoff only checked the assistant's prose text, which means a model could fabricate: - a Worked / Failed / Blocked / Follow-up report without ever calling the read tool on the docs / source files the prompt named - three labeled 'Delegated task', 'Result', 'Evidence' sections without ever calling sessions_spawn to delegate Both gaps are fake-progress loopholes for the parity gate. Changes: - source-docs-discovery-report: require at least one read tool call tied to the 'worked, failed, blocked' prompt in /debug/requests. Failure message dumps the observed plannedToolName list for debugging. - subagent-handoff: require at least one sessions_spawn tool call tied to the 'delegate' / 'subagent handoff' prompt in /debug/requests. Same debug-friendly failure message. Both assertions are gated behind !env.mock so they no-op in live-frontier mode where the real provider exposes plannedToolName through a different channel (or not at all). Not touched: memory-recall is also in the parity pack but its pass path is legitimately 'read the fact from prior-turn context'. That is a valid recall strategy, not fake progress, so it is out of scope for this PR. memory-recall's fake-progress story (no real memory_search call) would require bigger mock-server changes and belongs in a follow-up that extends the mock memory pipeline. 
Validation: - pnpm test extensions/qa-lab/src/scenario-catalog.test.ts Refs #64227 * test(qa): fix case-sensitive tool-call assertions and dedupe debug fetch Addresses loop-6 review feedback on PR #64681: 1. Copilot / Greptile / codex-connector all flagged that the discovery scenario's .includes('worked, failed, blocked') assertion is case-sensitive but the real prompt says 'Worked, Failed, Blocked...', so the mock-mode assertion never matches. Fix: lowercase-normalize allInputText before the contains check. 2. Greptile P2: the expr and message.expr each called fetchJson separately, incurring two round-trips to /debug/requests. Fix: hoist the fetch to a set step (discoveryDebugRequests / subagentDebugRequests) and reuse the snapshot. 3. Copilot: the subagent-handoff assertion scanned the entire request log and matched the first request with 'delegate' in its input text, which could false-pass on a stale prior scenario. Fix: reverse the array and take the most recent matching request instead. Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227 * test(qa): narrow subagent-handoff tool-call assertion to pre-tool requests Pass-2 codex-connector P1 finding on #64681: the reverse-find pattern I used on pass 1 usually lands on the FOLLOW-UP request after the mock runs sessions_spawn, not the pre-tool planning request that actually has plannedToolName === 'sessions_spawn'. The mock only plans that tool on requests with !toolOutput (mock-openai-server.ts:662), so the post-tool request has plannedToolName unset and the assertion fails even when the handoff succeeded. Fix: switch the assertion back to a forward .some() match but add a !request.toolOutput filter so the match is pinned to the pre-tool planning phase. The case-insensitive regex, the fetchJson dedupe, and the failure-message diagnostic from pass 1 are unchanged. Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). 
Refs #64227 * test(qa): pin subagent-handoff tool-call assertion to scenario prompt Addresses the pass-3 codex-connector P1 on #64681: the pass-2 fix filtered to pre-tool requests but still used a broad `/delegate\|subagent handoff/i` regex. The `subagent-fanout-synthesis` scenario runs BEFORE `subagent-handoff` in catalog order (scenarios are sorted by path), and the fanout prompt reads 'Subagent fanout synthesis check: delegate exactly two bounded subagents sequentially' — which contains 'delegate' and also plans sessions_spawn pre-tool. That produces a cross-scenario false pass where the fanout's earlier sessions_spawn request satisfies the handoff assertion even when the handoff run never delegates. Fix: tighten the input-text match from `/delegate\|subagent handoff/i` to `/delegate one bounded qa task/i`, which is the exact scenario- unique substring from the `subagent-handoff` config.prompt. That pins the assertion to this scenario's request window and closes the cross-scenario false positive. Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227 * test(qa): align parity assertion comments with actual filter logic Addresses two loop-7 Copilot findings on PR #64681: 1. source-docs-discovery-report.md: the explanatory comment said the debug request log was 'lowercased for case-insensitive matching', but the code actually lowercases each request's allInputText inline inside the .some() predicate, not the discoveryDebugRequests snapshot. Rewrite the comment to describe the inline-lowercase pattern so a future reader matches the code they see. 2. subagent-handoff.md: the comment said the assertion 'must be pinned to THIS scenario's request window' but the implementation actually relies on matching a scenario-unique prompt substring (/delegate one bounded qa task/i), not a request-window. Rewrite the comment to describe the substring pinning and keep the pre-tool filter rationale intact. 
No runtime change; comment-only fix to keep reviewer expectations aligned with the actual assertion shape. Validation: pnpm test extensions/qa-lab/src/scenario-catalog.test.ts (4/4 pass). Refs #64227 * test(qa): extend tool-call assertions to image-understanding, subagent-fanout, and capability-flip scenarios * Guard mock-only image parity assertions * Expand agentic parity second wave * test(qa): pad parity suspicious-pass isolation to second wave * qa-lab: parametrize parity report title and drop stale first-wave comment Addresses two loop-7 Copilot findings on PR #64662: 1. Hard-coded 'GPT-5.4 / Opus 4.6' markdown H1: the renderer now uses a template string that interpolates candidateLabel and baselineLabel, so any parity run (not only gpt-5.4 vs opus 4.6) renders an accurate title in saved reports. Default CLI flags still produce openai/gpt-5.4 vs anthropic/claude-opus-4-6 as the baseline pair. 2. Stale 'declared first-wave parity scenarios' comment in scopeSummaryToParityPack: the parity pack is now the ten-scenario first-wave+second-wave set (PR D + PR E). Comment updated to drop the first-wave qualifier and name the full QA_AGENTIC_PARITY_SCENARIOS constant the scope is filtering against. New regression: 'parametrizes the markdown header from the comparison labels' — asserts that non-default labels (openai/gpt-5.4-alt vs openai/gpt-5.4) render in the H1. Validation: pnpm test extensions/qa-lab/src/agentic-parity-report.test.ts (13/13 pass). 
Refs #64227 * qa-lab: fail parity gate on required scenario failures regardless of baseline parity * test(qa): update readable-report test to cover all 10 parity scenarios * qa-lab: strengthen parity-report fake-success detector and verify run.primaryProvider labels * Tighten parity label and scenario checks * fix: tighten parity label provenance checks * fix: scope parity tool-call metrics to tool lanes * Fix parity report label and fake-success checks * fix(qa): tighten parity report edge cases * qa-lab: add Anthropic /v1/messages mock route for parity baseline Closes the last local-runnability gap on criterion 5 of the GPT-5.4 parity completion gate in #64227 ('the parity gate shows GPT-5.4 matches or beats Opus 4.6 on the agreed metrics'). Background: the parity gate needs two comparable scenario runs - one against openai/gpt-5.4 and one against anthropic/claude-opus-4-6 - so the aggregate metrics and verdict in PR D (#64441) can be computed. Today the qa-lab mock server only implements /v1/responses, so the baseline run against Claude Opus 4.6 requires a real Anthropic API key. That makes the gate impossible to prove end-to-end from a local worktree and means the CI story is always 'two real providers + quota + keys'. This PR adds a /v1/messages Anthropic-compatible route to the existing mock OpenAI server. 
The route is a thin adapter that: - Parses Anthropic Messages API request shapes (system as string or [{type:text,text}], messages with string or block content, text and tool_result and tool_use and image blocks) - Translates them into the ResponsesInputItem[] shape the existing shared scenario dispatcher (buildResponsesPayload) already understands - Calls the shared dispatcher so both the OpenAI and Anthropic lanes run through the exact same scenario prompt-matching logic (same subagent fanout state machine, same extractRememberedFact helper, same '/debug/requests' telemetry) - Converts the resulting OpenAI-format events back into an Anthropic message response with text and tool_use content blocks and a correct stop_reason (tool_use vs end_turn) Non-streaming only: the QA suite runner falls back to non-streaming mock mode so real Anthropic SSE isn't necessary for the parity baseline. Also adds claude-opus-4-6 and claude-sonnet-4-6 to /v1/models so baseline model-list probes from the suite runner resolve without extra config. Tests added: - advertises Anthropic claude-opus-4-6 baseline model on /v1/models - dispatches an Anthropic /v1/messages read tool call for source discovery prompts (tool_use stop_reason, correct input path, /debug/requests records plannedToolName=read) - dispatches Anthropic /v1/messages tool_result follow-ups through the shared scenario logic (subagent-handoff two-stage flow: tool_use - tool_result - 'Delegated task / Evidence' prose summary) Local validation: - pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (18/18 pass) - pnpm test extensions/qa-lab/src/mock-openai-server.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (47/47 pass) Refs #64227 Unblocks #64441 (parity harness) and the forthcoming qa parity run wrapper by giving the baseline lane a local-only mock path. 
* qa-lab: fix Anthropic tool_result ordering in messages adapter Addresses the loop-6 Copilot / Greptile finding on PR #64685: in `convertAnthropicMessagesToResponsesInput`, `tool_result` blocks were pushed to `items` inside the per-block loop while the surrounding user/assistant message was only pushed after the loop finished. That reordered the function_call_output BEFORE its parent user message whenever a user turn mixed `tool_result` with fresh text/image blocks, which broke `extractToolOutput` (it scans AFTER the last user-role index; function_call_output placed BEFORE that index is invisible to it) and made the downstream scenario dispatcher behave as if no tool output had been returned on mixed-content turns. Fix: buffer `tool_result` and `tool_use` blocks in local arrays during the per-block loop, push the parent role message first (when it has any text/image pieces), then push the accumulated function_call / function_call_output items in original order. tool_result-only user turns still omit the parent message as before, so the non-mixed subagent-fanout-synthesis two-stage flow that already worked keeps working. Regression added: - `places tool_result after the parent user message even in mixed-content turns` — sends a user turn that mixes a `tool_result` block with a trailing fresh text block, then inspects `/debug/last-request` to assert that `toolOutput === 'SUBAGENT-OK'` (extractToolOutput found the function_call_output AFTER the last user index) and `prompt === 'Keep going with the fanout.'` (extractLastUserText picked up the trailing fresh text). Local validation: pnpm test extensions/qa-lab/src/mock-openai-server.test.ts (19/19 pass). 
Refs #64227 * qa-lab: reject Anthropic streaming and empty model in messages mock * qa-lab: tag mock request snapshots with a provider variant so parity runs can diff per provider * Handle invalid Anthropic mock JSON * fix: wire mock parity providers by model ref * fix(qa): support Anthropic message streaming in mock parity lane * qa-lab: record provider/model/mode in qa-suite-summary.json Closes the 'summary cannot be label-verified' half of criterion 5 on the GPT-5.4 parity completion gate in #64227. Background: the parity gate in #64441 compares two qa-suite-summary.json files and trusts whatever candidateLabel / baselineLabel the caller passes. Today the summary JSON only contains { scenarios, counts }, so nothing in the summary records which provider/model the run actually used. If a maintainer swaps candidate and baseline summary paths in a parity-report call, the verdict is silently mislabeled and nobody can retroactively verify which run produced which summary. Changes: - Add a 'run' block to qa-suite-summary.json with startedAt, finishedAt, providerMode, primaryModel (+ provider and model splits), alternateModel (+ provider and model splits), fastMode, concurrency, scenarioIds (when explicitly filtered). - Extract a pure 'buildQaSuiteSummaryJson(params)' helper so the summary JSON shape is unit-testable and the parity gate (and any future parity wrapper) can import the exact same type rather than reverse-engineering the JSON shape at runtime. - Thread 'scenarioIds' from 'runQaSuite' into writeQaSuiteArtifacts so --scenario-ids flags are recorded in the summary. 
Unit tests added (src/suite.summary-json.test.ts, 5 cases): - records provider/model/mode so parity gates can verify labels - includes scenarioIds in run metadata when provided - records an Anthropic baseline lane cleanly for parity runs - leaves split fields null when a model ref is malformed - keeps scenarios and counts alongside the run metadata This is additive: existing callers of qa-suite-summary.json continue to see the same { scenarios, counts } shape, just with an extra run field. No existing consumers of the JSON need to change. The follow-up 'qa parity run' CLI wrapper (run the parity pack twice against candidate + baseline, emit two labeled summaries in one command) stacks cleanly on top of this change and will land as a separate PR once #64441 and #64662 merge so the wrapper can call runQaParityReportCommand directly. Local validation: - pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (5/5 pass) - pnpm test extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/scenario-catalog.test.ts (34/34 pass) Refs #64227 Unblocks the final parity run for #64441 / #64662 by making summaries self-describing. * qa-lab: strengthen qa-suite-summary builder types and empty-array semantics Addresses 4 loop-6 Copilot / codex-connector findings on PR #64689 (re-opened as #64789): 1. P2 codex + Copilot: empty `scenarioIds` array was serialized as `[]` because of a truthiness check. The CLI passes an empty array when --scenario is omitted, so full-suite runs would incorrectly record an explicit empty selection. Fix: switch to a `length > 0` check so '[] or undefined' both encode as `null` in the summary run metadata. 2. Copilot: `buildQaSuiteSummaryJson` was exported for parity-gate consumers but its return type was `Record<string, unknown>`, which defeated the point of exporting it. Fix: introduce a concrete `QaSuiteSummaryJson` type that matches the JSON shape 1-for-1 and make the builder return it. 
Downstream code (parity gate, parity run wrapper) can now import the type and keep consumers type-checked. 3. Copilot: `QaSuiteSummaryJsonParams.providerMode` re-declared the `'mock-openai' \| 'live-frontier'` string union even though `QaProviderMode` is already imported from model-selection.ts. Fix: reuse `QaProviderMode` so provider-mode additions flow through both types at once. 4. Copilot: test fixtures omitted `steps` from the fake scenario results, creating shape drift with the real suite scenario-result shape. Fix: pad the test fixtures with `steps: []` and tighten the scenarioIds assertion to read `json.run.scenarioIds` directly (the new concrete return type makes the type-cast unnecessary). New regression: `treats an empty scenarioIds array as unspecified (no filter)` — passes `scenarioIds: []` and asserts the summary records `scenarioIds: null`. Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass). Refs #64227 * qa-lab: record executed scenarioIds in summary run metadata Addresses the pass-3 codex-connector P2 on #64789 (repl of #64689): `run.scenarioIds` was copied from the raw `params.scenarioIds` caller input, but `runQaSuite` normalizes that input through `selectQaSuiteScenarios` which dedupes via `Set` and reorders the selection to catalog order. When callers repeat --scenario ids or pass them in non-catalog order, the summary metadata drifted from the scenarios actually executed, which can make parity/report tooling treat equivalent runs as different or trust inaccurate provenance. Fix: both writeQaSuiteArtifacts call sites in runQaSuite now pass `selectedCatalogScenarios.map(scenario => scenario.id)` instead of `params?.scenarioIds`, so the summary records the post-selection executed list. 
This also covers the full-suite case automatically (the executed list is the full lane-filtered catalog), giving parity consumers a stable record of exactly which scenarios landed in the run regardless of how the caller phrased the request. buildQaSuiteSummaryJson's `length > 0 ? [...] : null` pass-2 semantics are preserved so the public helper still treats an empty array as 'unspecified' for any future caller that legitimately passes one. Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass). Refs #64227 * qa-lab: preserve null scenarioIds for unfiltered suite runs Addresses the pass-4 codex-connector P2 on #64789: the pass-3 fix always passed `selectedCatalogScenarios.map(...)` to writeQaSuiteArtifacts, which made unfiltered full-suite runs indistinguishable from an explicit all-scenarios selection in the summary metadata. The 'unfiltered → null' semantic (documented in the buildQaSuiteSummaryJson JSDoc and exercised by the "treats an empty scenarioIds array as unspecified" regression) was lost. Fix: both writeQaSuiteArtifacts call sites now condition on the caller's original `params.scenarioIds`. When the caller passed an explicit non-empty filter, record the post-selection executed list (pass-3 behavior, preserving Set-dedupe + catalog-order normalization). When the caller passed undefined or an empty array, pass undefined to writeQaSuiteArtifacts so buildQaSuiteSummaryJson's length-check serializes null (pass-2 behavior, preserving unfiltered semantics). This keeps both codex-connector findings satisfied simultaneously: - explicit --scenario filter reorders/dedupes through the executed list, not the raw caller input - unfiltered full-suite run records null, not a full catalog dump that would shadow "explicit all-scenarios" selections Validation: pnpm test extensions/qa-lab/src/suite.summary-json.test.ts (6/6 pass). 
Refs #64227 * qa-lab: reuse QaProviderMode in writeQaSuiteArtifacts param type * qa-lab: stage mock auth profiles so the parity gate runs without real credentials * fix(qa): clean up mock auth staging follow-ups * ci: add parity-gate workflow that runs the GPT-5.4 vs Opus 4.6 gate end-to-end against the qa-lab mock * ci: use supported parity gate runner label * ci: watch gateway changes in parity gate * docs: pin parity runbook alternate models * fix(ci): watch qa-channel parity inputs * qa: roll up parity proof closeout * qa: harden mock parity review fixes * qa-lab: fix review findings — comment wording, placeholder key, exported type, ordering assertion, remove false-positive positive-tone detection * qa: fix memory-recall scenario count, update criterion 2 comment, cache fetchJson in model-switch * qa-lab: clean up positive-tone comment + fix stale test expectations * qa: pin workflow Node version to 22.14.0 + fix stale label-match wording * qa-lab: refresh mock provider routing expectation * docs: drop stale parity rollup rewrite from proof slice * qa: run parity gate against mock lane * deps: sync qa-lab lockfile * build: refresh a2ui bundle hash * ci: widen parity gate triggers --------- Co-authored-by: Eva <eva@100yen.org>	2026-04-13 13:01:54 +09:00
Josh Avant	3d07dfbb65	feat(qa-lab): add Convex credential broker and admin CLI (#65596) * QA Lab: add Convex credential source for Telegram lane * QA Lab: scaffold Convex credential broker * QA Lab: add Convex credential admin CLI * QA Lab: harden Convex credential security paths * QA Broker: validate Telegram payloads on admin add * fix: note QA Convex credential broker in changelog (#65596) (thanks @joshavant)	2026-04-12 22:03:42 -05:00
Peter Steinberger	20266c14cb	feat(qa-lab): add control ui qa-channel roundtrip scenario	2026-04-12 19:41:06 -07:00
Peter Steinberger	1a47660518	feat(browser): add qa web runtime support	2026-04-12 19:41:06 -07:00
Peter Steinberger	fcee268373	feat(qa-lab): support scenario-defined plugin runs	2026-04-12 11:59:50 -07:00
Peter Steinberger	e4841d767d	test: stabilize loaded full-suite checks	2026-04-12 11:52:56 -07:00
Marcus Castro	000fc7f233	refactor(qa): add shared QA channel contract and harden worker startup (#64562) * refactor(qa): add shared transport contract and suite migration * refactor(qa): harden worker gateway startup * fix(qa): scope waits and sanitize shutdown artifacts * fix(qa): confine artifacts and redact preserved logs * fix(qa): block symlink escapes in artifact paths * fix(gateway): clear shutdown race timers * fix(qa): harden shutdown cleanup paths * fix(qa): sanitize gateway logs in thrown errors * fix(qa): harden suite startup and artifact paths * fix(qa): stage bundled plugins from mutated config * fix(qa): broaden gateway log bearer redaction * fix(qa-channel): restore runtime export * fix(qa): stop failed gateway startups as a process tree * fix(qa-channel): load runtime hook from api surface	2026-04-12 15:02:57 -03:00
Peter Steinberger	a8e140e395	chore: bump version to 2026.4.12	2026-04-12 10:37:18 -07:00
Edder Talmor	5f92094d51	fix: gracefully handle missing QA scenario pack in npm distributions (closes #65082) (#65118) * fix: allow built-in chat commands to bypass plugins.allow check (closes #65083) The 'commands' CLI command is a built-in chat command registered in the chat commands registry, not a plugin-backed command. When plugins.allow is configured, the error message incorrectly suggests adding 'commands' to plugins.allow, which produces a second error because no 'commands' plugin exists. Check if the command has a plugin entry or manifest alias before suggesting plugins.allow. Built-in commands without plugin entries now proceed normally instead of showing misleading errors. * fix: gracefully handle missing QA scenario pack in npm distributions (closes #65082) The completion cache update fails with a fatal error when the qa/scenarios/index.md file is not present in the installed npm package, even though the directory is listed in package.json "files". Instead of throwing an error, return an empty QA scenario pack with default agent identity. This allows completion cache updates to succeed while QA scenarios remain unavailable in the npm distribution. The QA scenario pack is primarily used for internal testing and QA automation — it is not critical for end-user functionality. * revert: remove unintended run-main.ts changes from PR #65118 The scenario-catalog.ts fix is the correct change for this PR. The run-main.ts changes were accidentally included and cause a regression in plugins.allow error handling. * fix(qa): tolerate missing packaged scenario config --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-12 16:50:58 +01:00
Vincent Koc	cded4fc5db	test(qa-lab): share mock openai response helpers	2026-04-12 05:34:58 +01:00
Peter Steinberger	0e3f9657da	fix(plugins): preserve bundled host compatibility floor	2026-04-12 00:22:32 +01:00
Peter Steinberger	057fe786bd	style: apply formatter drift	2026-04-11 21:25:24 +01:00
HDYA	26f633b604	feat(msteams): add federated credential support (certificate + managed identity) (#53615) * feat(msteams): add federated authentication support (certificate + managed identity + workload identity) * msteams: fix vitest 4.1.2 compat, type errors, and regenerate config baseline * msteams: fix lint errors, update fetch allowlist, regenerate protocol Swift * fix(msteams): gate secret-only delegated auth flows * fix(ci): unblock gateway watch and install smoke * fix(ci): restore mergeability for pr 53615 * fix(ci): restore channel registry helper typing * fix(ci): refresh raw fetch guard allowlist --------- Co-authored-by: Chudi Huang <Chudi.Huang@microsoft.com> Co-authored-by: Brad Groux <3053586+BradGroux@users.noreply.github.com>	2026-04-11 13:29:22 -05:00
Tak Hoffman	958c34e82c	feat(qa-lab): Add proxy capture stack and QA Lab inspector (#64895) * Add proxy capture core and CLI * Expand transport capture coverage * Add QA Lab capture backend * Refine QA Lab capture UI * Fix proxy capture review feedback * Fix proxy run cleanup and TTS capture * Fix proxy capture transport follow-ups * Fix debug proxy CONNECT target parsing * Harden QA Lab asset path containment	2026-04-11 12:34:57 -05:00
Eva	108e5c89de	qa-lab: scope parity metrics and harden fake-success detector - scope computeQaAgenticParityMetrics to QA_AGENTIC_PARITY_SCENARIO_TITLES in buildQaAgenticParityComparison so extra non-parity lanes in a full qa-suite-summary.json cannot influence completion / unintended-stop / valid-tool / fake-success rates - filter coverageMismatch by !parityTitleSet.has(name) so each required parity scenario fails the gate exactly once (from requiredScenarioCoverage) instead of being double-reported as a coverage mismatch too - drop the bare /\\berror\\b/i rule from SUSPICIOUS_PASS_PATTERNS — it was false-flagging legitimate passes that narrate "Error budget: 0" or "no errors found" — and replace it with targeted /error occurred/i and /an error was/i phrases that indicate a real mid-turn error - add regressions: error-budget/no-errors-observed passes yield fakeSuccessCount === 0, genuine error-occurred narration still flags, each missing required scenario fires exactly one failure line, and non-parity lanes do not perturb scoped metrics - isolate the baseline suspicious-pass test by padding it to the full first-wave scenario set so it asserts the isolated fake-success path via toEqual([...]) rather than toContain	2026-04-11 14:22:48 +01:00
Eva	95f8ad215f	Treat skipped parity scenarios as uncovered	2026-04-11 14:22:48 +01:00
Eva	17252df122	Tighten parity proof heuristics	2026-04-11 14:22:48 +01:00
Eva	fd45ea2bf1	test(qa): add compaction retry parity scenario	2026-04-11 14:22:48 +01:00
Eva	3211aa2540	fix(qa): surface missing required scenarios in parity report	2026-04-11 14:22:48 +01:00
Eva	55df6f11a4	fix: harden parity gate review findings	2026-04-11 14:22:48 +01:00
Eva	db09edacfc	qa-lab: gate parity on shared scenario coverage	2026-04-11 14:22:48 +01:00
Eva	67fdd3b4df	benchmarks: add agentic parity report gate	2026-04-11 14:22:48 +01:00
Eva	79f539d9ce	docs: clarify GPT-5.4 parity harness and review flow	2026-04-11 14:22:48 +01:00
Eva	d9c7ddb099	test: add agentic parity scenario pack	2026-04-11 14:22:48 +01:00
Vincent Koc	1167093773	test(qa): drop rebase conflict marker	2026-04-11 13:24:45 +01:00
Vincent Koc	d21573d3a1	fix(qa): catch leaked harness meta replies	2026-04-11 13:23:26 +01:00
Peter Steinberger	d72fb7efb9	fix: harden QA scenario matcher validation	2026-04-11 13:19:13 +01:00
Peter Steinberger	cd89892b1f	fix(release): keep private QA bundles out of npm pack	2026-04-11 13:13:11 +01:00
Ayaan Zaidi	478a2e15c5	fix: narrow qa cli facade startup path	2026-04-11 10:41:19 +05:30
Peter Steinberger	1ab6e5dbf0	chore(release): bump version to 2026.4.11	2026-04-11 04:51:17 +01:00
Ayaan Zaidi	959b1472dc	test(qa-lab): include telegram mentioned-message scenario	2026-04-11 08:48:42 +05:30
Ayaan Zaidi	b0b0fb308d	feat(qa-lab): add telegram mentioned-message scenario	2026-04-11 08:48:42 +05:30
Ayaan Zaidi	a0b5c7b0c4	test(qa-lab): cover telegram command demo scenarios	2026-04-11 08:48:42 +05:30
Ayaan Zaidi	7c14d8b0f4	feat(qa-lab): add telegram command demo scenarios	2026-04-11 08:48:42 +05:30
Ayaan Zaidi	f9a03f0f4b	test(qa-lab): cover telegram mention-gating	2026-04-11 08:48:42 +05:30
Ayaan Zaidi	355690a72c	feat(qa-lab): add telegram mention-gating scenario	2026-04-11 08:48:42 +05:30
Vincent Koc	350299401f	fix(cycles): continue shared seam extraction	2026-04-11 02:46:41 +01:00
Peter Steinberger	39d1a817fa	lint: enable small oxlint rules	2026-04-11 02:15:21 +01:00
Peter Steinberger	55578a5c40	fix: stabilize Codex runtime truthfulness (#64439) (thanks @100yenadmin)	2026-04-11 01:19:32 +01:00
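
The `scenarioIds` handling described in commit b13844732e (an explicit filter records the deduped, catalog-ordered executed list, while an unfiltered run records `null`) can be sketched roughly as below. This is a minimal illustration only: `normalizeScenarioIds` and `selectExecutedIds` are hypothetical names, not the repository's helpers, and the real `buildQaSuiteSummaryJson` and `selectQaSuiteScenarios` take more parameters than shown here.

```typescript
// Hypothetical, simplified helpers; names and shapes are assumptions for illustration.

// Empty array and undefined both mean "no filter" and serialize as null
// (the `length > 0 ? [...] : null` rule from the commit message).
function normalizeScenarioIds(scenarioIds?: readonly string[]): string[] | null {
  return scenarioIds && scenarioIds.length > 0 ? [...scenarioIds] : null;
}

// With an explicit non-empty filter, return the executed list:
// deduped via Set and reordered to catalog order. Otherwise return undefined
// so the summary later records null for the unfiltered run.
function selectExecutedIds(
  catalog: readonly string[],
  requested?: readonly string[],
): string[] | undefined {
  if (!requested || requested.length === 0) {
    return undefined;
  }
  const wanted = new Set(requested);
  return catalog.filter((id) => wanted.has(id));
}

const exampleCatalog = ["memory-recall", "source-docs-discovery-report", "subagent-handoff"];

// Repeated, out-of-order ids collapse to the catalog-ordered executed list.
const filtered = normalizeScenarioIds(
  selectExecutedIds(exampleCatalog, ["subagent-handoff", "memory-recall", "subagent-handoff"]),
);

// An unfiltered run stays distinguishable from an explicit all-scenarios selection.
const unfiltered = normalizeScenarioIds(selectExecutedIds(exampleCatalog, []));

console.log(JSON.stringify({ filtered, unfiltered }));
```

Keeping `null` for the unfiltered case is what lets a parity consumer tell "ran everything by default" apart from "explicitly selected every scenario", which is the distinction the pass-4 fix in that commit restores.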

198 Commits