pageserver: attach failure leaves tenant broken and unavailable #12074

Open
@erikgrinaker

If a tenant attach fails, the tenant is left in a broken, unavailable state until an operator intervenes. This can happen, for example, when S3 is temporarily unavailable during the preload or manifest upload.

When a tenant is attached to a Pageserver, tenant activation happens asynchronously from the location_config attach call, in a spawned task (see TenantShard::spawn). The storage controller therefore receives a successful response regardless of the attach outcome.

If the attachment fails (e.g. due to temporary S3 unavailability), the tenant is set to TenantState::Broken, but nothing ever retries the attachment, so the tenant remains unavailable.
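To make the failure mode concrete, here is a minimal sketch using plain threads rather than the Pageserver's actual async runtime. The `TenantState` enum, `attach_tenant`, and the `preload` closure are hypothetical stand-ins, not the real Pageserver types or API. The point is only that the call returns before the outcome is known, and that a failed attempt parks the tenant in a terminal state.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Simplified, hypothetical stand-in for the Pageserver's tenant state.
#[derive(Debug, Clone, PartialEq)]
enum TenantState {
    Attaching,
    Active,
    Broken { reason: String },
}

/// Hypothetical stand-in for the attach call. It returns immediately, so the
/// caller cannot observe whether the attach succeeded. The join handle is
/// returned only so that this example can wait for the background task.
fn attach_tenant<F>(preload: F) -> (Arc<Mutex<TenantState>>, thread::JoinHandle<()>)
where
    F: FnOnce() -> Result<(), String> + Send + 'static,
{
    let state = Arc::new(Mutex::new(TenantState::Attaching));
    let task_state = Arc::clone(&state);
    let handle = thread::spawn(move || {
        // Activation runs in the background. A single failure is final:
        // nothing in this task (or anywhere else) retries it.
        let next = match preload() {
            Ok(()) => TenantState::Active,
            Err(reason) => TenantState::Broken { reason },
        };
        *task_state.lock().unwrap() = next;
    });
    (state, handle)
}
```

Calling `attach_tenant(|| Err("S3 unavailable".to_string()))` and joining the handle leaves the shared state at `Broken`, even though the caller was never told the attach failed.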

We should either:

  1. Internally retry the attachment in the Pageserver.
  2. Have the storage controller be responsive to broken tenants and retry them, either on the same or a different Pageserver.

A broken tenant should also respond to subsequent location_config and tenant_reset calls.
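As a rough illustration of option 1, the attach attempt could be wrapped in a bounded retry loop with exponential backoff. This is a synchronous sketch under stated assumptions: `retry_with_backoff` and its parameters are hypothetical names, not existing Pageserver code, and a real implementation would presumably be async and cancellable.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `op` until it succeeds or `max_attempts` is
/// reached, doubling the sleep between attempts up to `max_delay`.
/// `op` is always run at least once, even if `max_attempts` is 0.
fn retry_with_backoff<T, E, F>(
    mut op: F,
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
) -> Result<T, E>
where
    F: FnMut() -> Result<T, E>,
{
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller, which
            // could then mark the tenant as broken.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                thread::sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

With this shape, a transient failure such as brief S3 unavailability is absorbed by the retries, and the tenant only becomes broken once the attempt budget is exhausted. Whether the budget should be bounded at all, or whether the storage controller should take over after some number of failures (option 2), is left open here.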

Labels

c/storage/pageserver (Component: storage: pageserver)
m/good_first_issue (Moment: when doing your first Neon contributions)
t/bug (Issue Type: Bug)
triaged (bugs that were already triaged)
