[ci-scan] Improve ci-failure-scan emission rules and add daily feedback loop with KPI tracker by kotlarmilos · Pull Request #128440 · dotnet/runtime

kotlarmilos · 2026-05-21T10:22:10Z

Description

Two changes to the ci-failure-scan agentic workflow that converts dnceng-public outer-loop pipeline failures into Known Build Error issues and test-disable PRs:

Tightens emission rules in the existing scanner: stricter stable-signature requirement and clearer skip cases so issues are only filed for failures actually reproducible in the scanned window.
Adds a feedback footer to every emitted KBE issue and test-disable PR pointing readers at the workflow file and the new feedback channel. Maintainers can comment on any noisy issue/PR to push back inline; the next feedback tick reads those comments and proposes prompt edits.

Impact

Fewer false-positive KBEs from the scanner: stricter emission rules
Maintainers have a clear inline feedback channel from inside every emitted artifact
The feedback loop is self-improving: daily tick reads recent issues/PRs and pushes prompt edits to a single draft PR
Noise reduction is measurable: KPI tracker reports issues filed vs closed, % closed as not_planned, median and p90 time-to-close, and weekly trend charts over a running window since the scanner was established
All writes via gh-aw safe-outputs, no issues: write or contents: write permissions
No runs on personal forks

…itted templates - Tighter stable-signature requirements and clearer skip cases to reduce false-positive KBEs. - Footer added to all three emitted templates (KBE literal, KBE regex, test-disable PR) pointing maintainers at the new ci-failure-scan-feedback workflow. Placed below the JSON fence so Build Analysis's single-fenced-block check still passes. - Intro pointer in the workflow header. - Fork guard so the scanner only runs on dotnet/runtime. - Lock file: re-applies the manual pat_pool patch on the detection job. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Sister workflow to ci-failure-scan that closes the loop: - Runs every 3 days; fork-guarded. - Reads the last 10 scanner runs (deep-dives the latest two) and gathers in-scope feedback: items labelled agentic-workflows OR with [ci-scan] in the title, restricted to min-integrity: approved. - Scores recent emissions against a rubric (title scoping, classification, JSON validity, signature specificity, log cross-check, maintainer feedback). - Translates findings into proposed prompt edits scoped to ci-failure-scan.md only. - Single open draft PR at a time: pushes to the existing [ci-scan-feedback] PR when one exists, otherwise opens a new one. - Maintains a pinned [ci-scan-feedback] KPI Tracker issue regenerated each tick. Running window starts at the scanner workflow's created_at. Metrics: issues filed/open/closed, closed last 7d/30d, median + p90 time-to-close, % closed as not_planned (false-positive proxy), top 3 pipelines, PR filed/merged/closed-unmerged. Charts: weekly filed-vs-closed and weekly median time-to-close as mermaid xychart-beta over the last 12 weeks, plus raw weekly buckets in a collapsed details block. - All writes via gh-aw safe-outputs; no issues: write / contents: write. - Lock file: pat_pool patched onto the detection job. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-05-21T10:23:32Z

Tagging subscribers to this area: @dotnet/runtime-infrastructure
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

This PR updates the gh-aw driven ci-failure-scan workflow prompt to reduce false-positive emissions (tighter stability rules, explicit skip cases, same-run dedup, follow-up-build gate) and adds a companion ci-failure-scan-feedback workflow that periodically aggregates maintainer feedback and proposes prompt edits via a single draft PR and a regenerated KPI tracker issue.

Changes:

Add repo guard conditions and additional emission/skip gates to reduce ci-failure-scan noise (follow-up build presence, same-run dedup cache, expanded KBE/tracker/fix PR search heuristics).
Append a standard “how to provide feedback” footer to emitted KBE / PR templates and add match-verification guidance tied to a persisted failure log.
Introduce a new scheduled feedback workflow (and compiled lock file) to harvest in-scope feedback and generate prompt-edit PRs plus a KPI tracker snapshot.

Show a summary per file

File	Description
.github/workflows/ci-failure-scan.md	Tightens scan/emit rules, adds follow-up/dedup gates, and adds feedback footers to templates.
.github/workflows/ci-failure-scan.lock.yml	Regenerated compiled workflow reflecting the new prompt and repo guard conditions.
.github/workflows/ci-failure-scan-feedback.md	New gh-aw workflow prompt for periodic feedback ingestion, KPI tracking, and prompt-edit PR proposals.
.github/workflows/ci-failure-scan-feedback.lock.yml	Compiled lock workflow for the new feedback workflow.

Copilot's findings

Files reviewed: 3/4 changed files
Comments generated: 6

Tighter feedback loop. Schedule changes from every 3d to fuzzy 'daily' (compiled to 03:39 UTC, stable via --schedule-seed dotnet/runtime). Lock files: re-applies the manual pat_pool patch on both detection jobs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Two fixes driven by the first scanner run on this branch: - Step 4.4 (existing test-disable PR dedup): also search the test-name stem after stripping verb prefixes and platform suffixes. PR titles often abbreviate (e.g. DnsGetHostEntry_X -> X). Without the fallback, the scanner filed #128442 as a duplicate of #128425. - KBE check #6 and bad-vs-good table: reject array-form signatures whose second element is a generic xunit assertion stem ('Assert.Equal() Failure: Values differ', 'Assert.True() Failure', etc). Require the unique 'Expected:'/'Actual:' value lines or the actual exception type + message. #128444 (now closed) emitted such a weak signature; Build Analysis would have mismatched it against unrelated failures. Lock file: re-applies the manual pat_pool patch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 3/4 changed files
Comments generated: 3

…elines 72h was blanket-suppressing actionable failures on the 12 JIT-stress family pipelines (defs 109, 110, 111, 112, 115, 116, 118, 138, 140, 150, 153, 155, 159, 235) which run on a weekly-or-longer cadence by design. Iteration 2 of the scanner on this branch surfaced this: 12 pipelines skipped with 'stale build window (>72h)' on their normal cadence. Extending to 14d accommodates the cadence; the 7-day 'no qualifying build' rule still catches genuine inactivity. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

The previous guard 'event_name == workflow_dispatch || repository == dotnet/runtime' had a flaw: workflow_dispatch from a fork bypasses the repo check because the OR short-circuits as soon as the event matches. David Hartglass observed the scanner running on dhartglassMSFT/runtime twice a day (schedule trigger). pre_activation runs there with the current guard because the fork inherits the old version of the condition. Even with the corrected guard, the issue would persist on forks via the UI 'Run workflow' button which dispatches with the same event_name. Replace with the single condition 'repository == dotnet/runtime'. Both schedule and workflow_dispatch on a fork now skip every job. The activation job's compound condition (activated && (repository == dotnet/runtime)) gets simplified by the compiler to the same effective guard, plus the pre_activation skip propagates to every downstream job via needs. Both lock files: re-applies the manual pat_pool patch on the detection job (gh aw compile strips it on every recompile). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 3/4 changed files
Comments generated: 2

…orce KBE match verification Two failures observed on #128446 (Interpreter assert in CompressDebugInfo): 1. The agent never found in-flight fix PR #128428. Step 4.5 only searched by test-name / test-file-path / assembly. A JIT assert has none of those, so the search returned nothing and the agent filed a sibling KBE instead of skipping. Step 4.5 now adds three queries for failures without a test workitem: by C/C++ source-symbol (CompressDebugInfo, iOffsetMapping), by 6-12 word literal slice of the assertion text, and by 'Fixes #<tracker>' once Step 4.3 linked a tracker. 2. The KBE matcher came up 0/0/0 in Build Analysis. Per the prompt the agent should grep-verify each signature element against KBE has that marker; the verification is being skipped uniformly. Step 7 is now a hard gate: emit the KBE only if the match-count marker is present and positive count >= 1. Records 'skipped: signature did not match failure.log' otherwise. Adds explicit guidance for JIT/runtime/build-level asserts: prefer Template C array form pairing source-symbol with assertion text (two anchors double match probability), and skip emission if neither greps positive (native asserts often don't appear in the xunit log Build Analysis indexes — rely on tracker + test-disable PR in that case). Both fixes preserve the existing model that a plain area-team tracker is NOT a KBE substitute (sibling KBE still required when no fix PR is in flight). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Resolves the actionable inline comments on #128440: ci-failure-scan.md - Footer at the bottom of all three emitted templates (KBE literal, KBE regex, test-disable PR) now uses absolute https://github.com/... URLs. Relative .github/workflows/... links 404 when clicked from a filed issue or PR. - Step 2 says 'fetch at least 25 builds' but the Data Sources example showed %24top=20. Updated the example to %24top=25 for consistency. - Step 3's note 'Step 4.0 and KBE check 7 read it back' was inaccurate (Step 4.0 does not reference failure.log). Reworded to 'KBE check 7 greps it for the verbatim signature'. ci-failure-scan-feedback.md - Hard-rule clarification: 'min-integrity: approved' applies to reads of user-supplied content (issue/PR bodies and comments) which must go through the github MCP tool. gh is allowed for run-metadata and for enumerating the workflow's own [ci-scan-feedback] artifacts, but explicitly forbidden for substituting integrity-gated reads. Step 3's commands rewritten to use github MCP search_issues / search_pull_requests with updated:>=<today-30d> queries. - Step 3 now covers PRs as well as issues (four search queries) and applies the documented 30-day window via updated:>= in the query. - Step 4 rubric bullet escaped the literal '```json' fence with double backticks so it no longer breaks GFM list rendering. - Step 7 dropped the 'pinned' adjective. The workflow cannot pin issues; reworded to note maintainers may pin manually if desired. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Copilot's findings

Files reviewed: 3/4 changed files
Comments generated: 4

… comments ci-failure-scan.md - Step 6 'Recognized values' list now includes the two new skip reasons introduced by KBE check 7 ('signature did not match failure.log (N=<count>)', 'native assert not in xunit log'). Added a clarifying sentence noting the list is non-exhaustive but new reasons should reuse one of these phrasings so the feedback workflow's tally aggregation stays stable. - 'Emit each template verbatim except for <placeholder> slots' rule marker. Added an explicit placeholder line on all three KBE templates (literal, regex, array) so the agent doesn't have to choose between violating the verbatim rule and omitting the gate marker. ci-failure-scan-feedback.md - KPI window start source: previously used the workflow file's '.created_at' (file creation), which can predate any run. Changed to derive from the first recorded run via 'gh api .../workflows/X.lock.yml/runs?per_page=1&order=asc'. Persists the resulting timestamp as and prefers that cached value on subsequent ticks (read via the github MCP tool, per the existing hard rule). - top_pipelines metric previously said 'top 3 definition_id mentions in issue bodies' which contradicts the hard rule that body reads go through the github MCP tool. Reworded the metric to require per-item body fetches via 'issue_read get'/'pull_request_read get', count [Filtered] separately as integrity_filtered_pipelines, and exclude filtered items from the top 3. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…add unknown-skip-reason rubric Addresses three findings from PR #128440's latest code review: - Step 2 'gh run view --log | grep -A 200' could surface arbitrary trailing agent log content that quotes maintainer-supplied comments, an indirect integrity-gate bypass. Replaced with an awk block that extracts only the tally table (header + body rows, terminated by the first non-pipe line). - Step 4 rubric now scores tally rows whose 'skipped:' reason isn't in the Step 6 'Recognized values' list as 'unknown-skip-reason: <verbatim string>'. The recognized-values list is the source of truth; the feedback PR should propose adding new reasons there before they appear in tallies. Closes the loop on the advisory- only nature of the skip-reason catalogue. - KPI tracker 'Raw weekly buckets' details block previously had no row cap and would grow unbounded as the scanner ran. Constrained to the same 12-week window the mermaid charts use; older buckets are dropped on each regeneration (tracker body is a current snapshot, not a permanent ledger). The fourth finding ('schedule: daily syntax') was a false alarm — gh-aw v0.71.5 compiles it correctly to cron: '39 3 * * *'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions · 2026-05-21T14:28:45Z

Note

This review was generated by GitHub Copilot.

🤖 Copilot Code Review — PR #128440

Holistic Assessment

What this does: Adds a follow-up build gate (Step 3.5), same-run dedup cache (Step 4.0), mandatory failure.log grep verification (hardened KBE check 7), expanded skip-reason vocabulary, and a feedback footer on all emitted artifacts. A new companion workflow (ci-failure-scan-feedback.md) reads maintainer feedback daily and proposes prompt edits via draft PR, plus maintains a KPI tracker issue.

Motivation: Justified — the scanner has been filing noisy/false-positive KBEs. Structural tightening + a self-improving feedback loop is the right approach.

Approach: Sound — incremental prompt refinements rather than a rewrite, constrained companion workflow with strict safe-outputs caps.

Verdict: ✅ Approve (workflow prompt changes only — no runtime code affected)

Findings

⚠️ Warning: Integrity boundary gap in feedback workflow Step 2

ci-failure-scan-feedback.md Step 2 runs gh run view <run-id> --log and pipes through awk. The hard-rules section explicitly prohibits reading maintainer-supplied content via gh, yet the scanner agent's log may quote issue bodies/comments it read during its own run. The awk filter (/^\| pipeline \|/) is narrow, but if the tally table itself embeds user-supplied content (e.g., quoted signatures from issue titles), it leaks through the integrity gate. Consider adding a hard-rule clarification that the extracted tally lines must not be parsed for content beyond their structured columns.

⚠️ Warning: Step 3.5 follow-up gate may over-suppress on slow-cadence pipelines

Step 2 skip reason says: "No follow_up (source is the absolute latest) → pipeline-skipped: no follow-up build yet — defer to next run". For weekly JIT-stress pipelines (defs 109–160), if a failure appears in the latest build, there may not be a newer build for 7+ days. This means the scanner will defer for an entire week (or more) before filing. The PR description mentions the 14-day window accommodates these, but the follow-up gate (not the staleness window) is the bottleneck. Consider an escape hatch: if source is older than N days with no follow-up, treat as stable.

💡 Suggestion: `ci-scan-match-count` marker is outside the JSON fence

The  marker is placed after the closing ``` of the JSON block. Build Analysis presumably ignores HTML comments, but this is implicit. If Build Analysis ever changes its body parsing to be stricter, the marker could interfere. Document explicitly that the marker is a sentinel for the feedback workflow's rubric and is invisible to Build Analysis.

💡 Suggestion: Feedback workflow KPI tracker — `gh search issues` vs MCP tool consistency

Step 7 uses gh search issues to discover the tracker and collect the universe of [ci-scan] items, then switches to the github MCP tool for body reads. The hard-rules section says gh is allowed for "enumerating this workflow's own artifacts" but the broader search for ALL [ci-scan] items in Step 7 goes beyond that narrow scope. This is probably fine since it only reads metadata (numbers, URLs), but the distinction could confuse future editors. Consider a one-line clarification in the hard-rules.

💡 Suggestion: Dedup cache race with parallel signature processing

Step 4.0's dedup cache uses a plain TSV file with grep -Fxq for lookup and tee -a for append. If the agent processes signatures sequentially (as the prompt implies), this works. But if a future edit allows parallel processing (or the agent batches API calls), the file-based cache has a TOCTOU race. Not a current bug, but worth a one-line comment: "Sequential processing assumed; do not parallelize signature walks without replacing the TSV cache."

Summary

No blocking issues. The emission tightening (follow-up gate, mandatory failure.log verification, expanded dedup) should meaningfully reduce false positives. The feedback workflow is well-scoped with appropriate safe-outputs caps. The warnings above are edge cases worth addressing but none are merge-blocking for a draft PR targeting workflow prompts.

Generated by Code Review for issue #128440 · ● 3.4M · ◷

Copilot

Copilot's findings

Files reviewed: 3/4 changed files
Comments generated: 3

…nk tracker workflow Addresses three correct review findings: - awk filter printed the first non-pipe line (the terminator) before exiting, defeating the purpose of restricting tally extraction to the pipe-table block. Restructured to only print lines starting with '|' and exit on the first non-pipe line after flag is set. - The GitHub REST /runs endpoint does not accept order=asc; the previous command silently returned the most recent run, not the first. Switched to gh api --paginate (per_page=100) + jq sort_by + min(created_at), and added an explicit warning so future edits don't reintroduce the bug. - '[ci-failure-scan-feedback.lock.yml]' in the tracker body template was reference-style with no defined target, so it rendered as literal text. Made it an explicit URL to the workflow file on main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

kotlarmilos and others added 2 commits May 21, 2026 12:19

Copilot AI review requested due to automatic review settings May 21, 2026 10:22

Copilot started reviewing on behalf of kotlarmilos May 21, 2026 10:22 View session

github-actions Bot added the area-Infrastructure label May 21, 2026

github-project-automation Bot added this to Runtime Infra May 21, 2026

dotnet-policy-service Bot assigned kotlarmilos May 21, 2026

kotlarmilos changed the title ~~Reduce ci-failure-scan noise and add feedback loop~~ [ci-scan] Reduce ci-failure-scan noise and add feedback loop May 21, 2026

Copilot AI reviewed May 21, 2026

View reviewed changes