metrics.md: add lean_aggregator_skipped_total + update aggregated_signatures_building_time status#36

Open
ch4r10t33r wants to merge 1 commit into leanEthereum:main from ch4r10t33r:add-aggregator-skipped-metric

@ch4r10t33r
Summary

Adds a new cross-client metric `lean_aggregator_skipped_total{reason=...}` so operators can answer "how many slots did aggregation actually run for, and how many were skipped and why?" with a single counter rather than deriving the answer from coverage gauges or grepping logs.

Also updates the existing `lean_pq_sig_aggregated_signatures_building_time_seconds` row to reflect what's actually exposed on the live devnet (Grandine and Zeam).

Motivation

Today no client exposes a standard skip counter. Operators surveying the live devnet found:

  • zeam exposes `zeam_aggregate_skip_total{reason=not_aggregator|not_synced|missing_state|spawn_failed}` (client-namespaced)
  • ream / grandine / ethlambda / lantern expose nothing equivalent

That leaves "did this aggregator drop a slot?" as a derived question. The two indirect proxies, per-slot subnet coverage (`lean_attestation_aggregate_coverage_subnets`) and the rate of `lean_pq_sig_aggregated_signatures_total`, both miss the "had-duty-and-silently-dropped" case, and the coverage gauge isn't even exposed by every client.

A first-class counter for skips ends the guesswork and makes "missed aggregations" comparable across the fleet.

Proposed label values

| reason | When it fires |
| --- | --- |
| `not_aggregator` | This slot's aggregation duty wasn't ours. Bookkeeping: lets you separate "no duty" from "had duty but skipped" |
| `not_synced` | Wall-lag or sync-status gate prevented aggregation (e.g. node is in `behind_peers` and aggregation is gated) |
| `missing_state` | Pre-state for the `att_data` target couldn't be resolved when the aggregator ran |
| `spawn_failed` | Aggregation worker queue was full / spawn error |
| `other` | Catch-all so clients can adopt incrementally without enumerating every internal failure mode |

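Summing across all labels gives the total number of cycles in which aggregation did not run, including no-duty cycles. `sum by (reason) (rate(lean_aggregator_skipped_total[5m]))` then gives both the duty distribution and, once `not_aggregator` is excluded, the genuine-miss rate.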
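To make the per-reason breakdown concrete, the following sketch approximates a per-reason rate from two scrapes of the counter and then separates the no-duty share from the genuine misses. The function names and any sample values are hypothetical. Unlike PromQL's `rate()`, it does not handle counter resets or extrapolate to window boundaries.

```python
def per_reason_rate(earlier: dict, later: dict, window_seconds: float) -> dict:
    """Per-second increase of each reason's counter between two scrapes.

    Simplification: assumes no counter reset happened inside the window.
    """
    return {
        reason: (later[reason] - earlier.get(reason, 0)) / window_seconds
        for reason in later
    }


def genuine_miss_rate(rates: dict) -> float:
    """Sum every reason except the `not_aggregator` bookkeeping label."""
    return sum(rate for reason, rate in rates.items() if reason != "not_aggregator")
```

With this split, `rates["not_aggregator"]` describes how often the node simply had no duty, while `genuine_miss_rate(rates)` covers cycles where it had duty and still skipped.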

Status table

| Client | Status | Notes |
| --- | --- | --- |
| Zeam | 📝 | Has equivalent counter under `zeam_aggregate_skip_total`; rename to `lean_aggregator_skipped_total` on upstream adoption |
| Others | | Not yet implemented |

Drive-by updates

lean_pq_sig_aggregated_signatures_building_time_seconds:

  • Grandine: □ → ✅ — verified exposed on the live devnet (~600 observations on grandine_0 over a ~16 min run, p50≈1.19s)
  • Zeam: 📝 → ✅ — implemented in zeam #941, exposed on devnet image sha256:bb801c18…, ~500 observations per aggregator (p50≈0.38s)

Test plan

  • Reviewed metrics.md rendering locally
  • Reviewers confirm naming + label set is acceptable
  • Reviewers from each client team confirm/correct the status table
…natures_building_time status

`lean_aggregator_skipped_total` (Validator Metrics) gives cross-client
visibility into missed aggregations. Today no standard skip counter
exists: zeam exposes a client-namespaced `zeam_aggregate_skip_total`,
and no other client has anything equivalent. Deriving "missed
aggregation" from coverage gauges is best-effort and silently misses
the case where every aggregation is dropped.

Proposed labels:
  not_aggregator  — slot in which the node had no aggregation duty
                    (bookkeeping; lets you separate "no duty" from
                     "had duty but skipped")
  not_synced      — wall-lag or sync-status gate prevented aggregation
  missing_state   — pre-state for the att_data target was unavailable
  spawn_failed    — aggregation worker queue was full / spawn error
  other           — catch-all so clients can adopt incrementally

`sum by (reason) (rate(lean_aggregator_skipped_total[5m]))` then tells
you both the duty distribution and the genuine-miss rate.

Zeam status set to 📝 (in-progress): currently has a semantically-
equivalent counter under a `zeam_*` prefix that will be renamed once
adopted upstream.

Also updates `lean_pq_sig_aggregated_signatures_building_time_seconds`:
  - Grandine: □ → ✅ (verified exposed on the live devnet, ~600 obs)
  - Zeam:     📝 → ✅ (implemented in zeam PR #941, exposed in
                       devnet image sha256:bb801c18..., ~500 obs per
                       aggregator)