[ci-scan] Improve ci-failure-scan emission rules and add daily feedback loop with KPI tracker#128440
[ci-scan] Improve ci-failure-scan emission rules and add daily feedback loop with KPI tracker#128440kotlarmilos wants to merge 11 commits into
Conversation
…itted templates - Tighter stable-signature requirements and clearer skip cases to reduce false-positive KBEs. - Footer added to all three emitted templates (KBE literal, KBE regex, test-disable PR) pointing maintainers at the new ci-failure-scan-feedback workflow. Placed below the JSON fence so Build Analysis's single-fenced-block check still passes. - Intro pointer in the workflow header. - Fork guard so the scanner only runs on dotnet/runtime. - Lock file: re-applies the manual pat_pool patch on the detection job. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sister workflow to ci-failure-scan that closes the loop: - Runs every 3 days; fork-guarded. - Reads the last 10 scanner runs (deep-dives the latest two) and gathers in-scope feedback: items labelled agentic-workflows OR with [ci-scan] in the title, restricted to min-integrity: approved. - Scores recent emissions against a rubric (title scoping, classification, JSON validity, signature specificity, log cross-check, maintainer feedback). - Translates findings into proposed prompt edits scoped to ci-failure-scan.md only. - Single open draft PR at a time: pushes to the existing [ci-scan-feedback] PR when one exists, otherwise opens a new one. - Maintains a pinned [ci-scan-feedback] KPI Tracker issue regenerated each tick. Running window starts at the scanner workflow's created_at. Metrics: issues filed/open/closed, closed last 7d/30d, median + p90 time-to-close, % closed as not_planned (false-positive proxy), top 3 pipelines, PR filed/merged/closed-unmerged. Charts: weekly filed-vs-closed and weekly median time-to-close as mermaid xychart-beta over the last 12 weeks, plus raw weekly buckets in a collapsed details block. - All writes via gh-aw safe-outputs; no issues: write / contents: write. - Lock file: pat_pool patched onto the detection job. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @dotnet/runtime-infrastructure |
There was a problem hiding this comment.
Pull request overview
This PR updates the gh-aw driven ci-failure-scan workflow prompt to reduce false-positive emissions (tighter stability rules, explicit skip cases, same-run dedup, follow-up-build gate) and adds a companion ci-failure-scan-feedback workflow that periodically aggregates maintainer feedback and proposes prompt edits via a single draft PR and a regenerated KPI tracker issue.
Changes:
- Add repo guard conditions and additional emission/skip gates to reduce
ci-failure-scannoise (follow-up build presence, same-run dedup cache, expanded KBE/tracker/fix PR search heuristics). - Append a standard “how to provide feedback” footer to emitted KBE / PR templates and add match-verification guidance tied to a persisted failure log.
- Introduce a new scheduled feedback workflow (and compiled lock file) to harvest in-scope feedback and generate prompt-edit PRs plus a KPI tracker snapshot.
Show a summary per file
| File | Description |
|---|---|
| .github/workflows/ci-failure-scan.md | Tightens scan/emit rules, adds follow-up/dedup gates, and adds feedback footers to templates. |
| .github/workflows/ci-failure-scan.lock.yml | Regenerated compiled workflow reflecting the new prompt and repo guard conditions. |
| .github/workflows/ci-failure-scan-feedback.md | New gh-aw workflow prompt for periodic feedback ingestion, KPI tracking, and prompt-edit PR proposals. |
| .github/workflows/ci-failure-scan-feedback.lock.yml | Compiled lock workflow for the new feedback workflow. |
Copilot's findings
- Files reviewed: 3/4 changed files
- Comments generated: 6
This comment has been minimized.
This comment has been minimized.
Tighter feedback loop. Schedule changes from every 3d to fuzzy 'daily' (compiled to 03:39 UTC, stable via --schedule-seed dotnet/runtime). Lock files: re-applies the manual pat_pool patch on both detection jobs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
Two fixes driven by the first scanner run on this branch: - Step 4.4 (existing test-disable PR dedup): also search the test-name stem after stripping verb prefixes and platform suffixes. PR titles often abbreviate (e.g. DnsGetHostEntry_X -> X). Without the fallback, the scanner filed #128442 as a duplicate of #128425. - KBE check #6 and bad-vs-good table: reject array-form signatures whose second element is a generic xunit assertion stem ('Assert.Equal() Failure: Values differ', 'Assert.True() Failure', etc). Require the unique 'Expected:'/'Actual:' value lines or the actual exception type + message. #128444 (now closed) emitted such a weak signature; Build Analysis would have mismatched it against unrelated failures. Lock file: re-applies the manual pat_pool patch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
…elines 72h was blanket-suppressing actionable failures on the 12 JIT-stress family pipelines (defs 109, 110, 111, 112, 115, 116, 118, 138, 140, 150, 153, 155, 159, 235) which run on a weekly-or-longer cadence by design. Iteration 2 of the scanner on this branch surfaced this: 12 pipelines skipped with 'stale build window (>72h)' on their normal cadence. Extending to 14d accommodates the cadence; the 7-day 'no qualifying build' rule still catches genuine inactivity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
The previous guard 'event_name == workflow_dispatch || repository == dotnet/runtime' had a flaw: workflow_dispatch from a fork bypasses the repo check because the OR short-circuits as soon as the event matches. David Hartglass observed the scanner running on dhartglassMSFT/runtime twice a day (schedule trigger). pre_activation runs there with the current guard because the fork inherits the old version of the condition. Even with the corrected guard, the issue would persist on forks via the UI 'Run workflow' button which dispatches with the same event_name. Replace with the single condition 'repository == dotnet/runtime'. Both schedule and workflow_dispatch on a fork now skip every job. The activation job's compound condition (activated && (repository == dotnet/runtime)) gets simplified by the compiler to the same effective guard, plus the pre_activation skip propagates to every downstream job via needs. Both lock files: re-applies the manual pat_pool patch on the detection job (gh aw compile strips it on every recompile). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
…orce KBE match verification Two failures observed on #128446 (Interpreter assert in CompressDebugInfo): 1. The agent never found in-flight fix PR #128428. Step 4.5 only searched by test-name / test-file-path / assembly. A JIT assert has none of those, so the search returned nothing and the agent filed a sibling KBE instead of skipping. Step 4.5 now adds three queries for failures without a test workitem: by C/C++ source-symbol (CompressDebugInfo, iOffsetMapping), by 6-12 word literal slice of the assertion text, and by 'Fixes #<tracker>' once Step 4.3 linked a tracker. 2. The KBE matcher came up 0/0/0 in Build Analysis. Per the prompt the agent should grep-verify each signature element against KBE has that marker; the verification is being skipped uniformly. Step 7 is now a hard gate: emit the KBE only if the match-count marker is present and positive count >= 1. Records 'skipped: signature did not match failure.log' otherwise. Adds explicit guidance for JIT/runtime/build-level asserts: prefer Template C array form pairing source-symbol with assertion text (two anchors double match probability), and skip emission if neither greps positive (native asserts often don't appear in the xunit log Build Analysis indexes — rely on tracker + test-disable PR in that case). Both fixes preserve the existing model that a plain area-team tracker is NOT a KBE substitute (sibling KBE still required when no fix PR is in flight). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
Resolves the actionable inline comments on #128440: ci-failure-scan.md - Footer at the bottom of all three emitted templates (KBE literal, KBE regex, test-disable PR) now uses absolute https://github.com/... URLs. Relative .github/workflows/... links 404 when clicked from a filed issue or PR. - Step 2 says 'fetch at least 25 builds' but the Data Sources example showed %24top=20. Updated the example to %24top=25 for consistency. - Step 3's note 'Step 4.0 and KBE check 7 read it back' was inaccurate (Step 4.0 does not reference failure.log). Reworded to 'KBE check 7 greps it for the verbatim signature'. ci-failure-scan-feedback.md - Hard-rule clarification: 'min-integrity: approved' applies to reads of user-supplied content (issue/PR bodies and comments) which must go through the github MCP tool. gh is allowed for run-metadata and for enumerating the workflow's own [ci-scan-feedback] artifacts, but explicitly forbidden for substituting integrity-gated reads. Step 3's commands rewritten to use github MCP search_issues / search_pull_requests with updated:>=<today-30d> queries. - Step 3 now covers PRs as well as issues (four search queries) and applies the documented 30-day window via updated:>= in the query. - Step 4 rubric bullet escaped the literal '```json' fence with double backticks so it no longer breaks GFM list rendering. - Step 7 dropped the 'pinned' adjective. The workflow cannot pin issues; reworded to note maintainers may pin manually if desired. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
… comments
ci-failure-scan.md
- Step 6 'Recognized values' list now includes the two new skip
reasons introduced by KBE check 7 ('signature did not match
failure.log (N=<count>)', 'native assert not in xunit log'). Added
a clarifying sentence noting the list is non-exhaustive but new
reasons should reuse one of these phrasings so the feedback
workflow's tally aggregation stays stable.
- 'Emit each template verbatim except for <placeholder> slots' rule
marker. Added an explicit placeholder line on all three KBE
templates (literal, regex, array) so the agent doesn't have to
choose between violating the verbatim rule and omitting the gate
marker.
ci-failure-scan-feedback.md
- KPI window start source: previously used the workflow file's
'.created_at' (file creation), which can predate any run. Changed
to derive from the first recorded run via
'gh api .../workflows/X.lock.yml/runs?per_page=1&order=asc'.
Persists the resulting timestamp as
and prefers that cached value on subsequent ticks (read via the
github MCP tool, per the existing hard rule).
- top_pipelines metric previously said 'top 3 definition_id mentions
in issue bodies' which contradicts the hard rule that body reads
go through the github MCP tool. Reworded the metric to require
per-item body fetches via 'issue_read get'/'pull_request_read get',
count [Filtered] separately as integrity_filtered_pipelines, and
exclude filtered items from the top 3.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This comment has been minimized.
This comment has been minimized.
…add unknown-skip-reason rubric Addresses three findings from PR #128440's latest code review: - Step 2 'gh run view --log | grep -A 200' could surface arbitrary trailing agent log content that quotes maintainer-supplied comments, an indirect integrity-gate bypass. Replaced with an awk block that extracts only the tally table (header + body rows, terminated by the first non-pipe line). - Step 4 rubric now scores tally rows whose 'skipped:' reason isn't in the Step 6 'Recognized values' list as 'unknown-skip-reason: <verbatim string>'. The recognized-values list is the source of truth; the feedback PR should propose adding new reasons there before they appear in tallies. Closes the loop on the advisory- only nature of the skip-reason catalogue. - KPI tracker 'Raw weekly buckets' details block previously had no row cap and would grow unbounded as the scanner ran. Constrained to the same 12-week window the mermaid charts use; older buckets are dropped on each regeneration (tracker body is a current snapshot, not a permanent ledger). The fourth finding ('schedule: daily syntax') was a false alarm — gh-aw v0.71.5 compiles it correctly to cron: '39 3 * * *'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Note This review was generated by GitHub Copilot. 🤖 Copilot Code Review — PR #128440Holistic AssessmentWhat this does: Adds a follow-up build gate (Step 3.5), same-run dedup cache (Step 4.0), mandatory Motivation: Justified — the scanner has been filing noisy/false-positive KBEs. Structural tightening + a self-improving feedback loop is the right approach. Approach: Sound — incremental prompt refinements rather than a rewrite, constrained companion workflow with strict safe-outputs caps. Verdict: ✅ Approve (workflow prompt changes only — no runtime code affected) Findings
|
…nk tracker workflow Addresses three correct review findings: - awk filter printed the first non-pipe line (the terminator) before exiting, defeating the purpose of restricting tally extraction to the pipe-table block. Restructured to only print lines starting with '|' and exit on the first non-pipe line after flag is set. - The GitHub REST /runs endpoint does not accept order=asc; the previous command silently returned the most recent run, not the first. Switched to gh api --paginate (per_page=100) + jq sort_by + min(created_at), and added an explicit warning so future edits don't reintroduce the bug. - '[ci-failure-scan-feedback.lock.yml]' in the tracker body template was reference-style with no defined target, so it rendered as literal text. Made it an explicit URL to the workflow file on main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Two changes to the
ci-failure-scanagentic workflow that converts dnceng-public outer-loop pipeline failures intoKnown Build Errorissues and test-disable PRs:Impact
not_planned, median and p90 time-to-close, and weekly trend charts over a running window since the scanner was establishedsafe-outputs, noissues: writeorcontents: writepermissions