Open
Description
If a tenant attach fails, it leaves the tenant in a broken state (unavailable) until an operator intervenes. This can happen e.g. with temporary S3 unavailability during the preload or manifest upload.
When a tenant is attached to a Pageserver, tenant activation happens asynchronously from the location_config attach call, in a spawned task (see TenantShard::spawn). The storage controller receives a successful response regardless of the attach outcome.
If the attachment fails (e.g. due to temporary S3 unavailability), the tenant is set to TenantState::Broken, but nothing will ever retry the attachment so it remains unavailable.
We should either:
- Internally retry the attachment in the Pageserver.
- Have the storage controller be responsive to broken tenants and retry them, either on the same or a different Pageserver.
It should also be responsive to subsequent location_config and tenant_reset calls.
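
To make the first option more concrete, here is a minimal sketch of an internal retry loop with exponential backoff. It is not the Pageserver's actual code: `AttachError`, `RetryPolicy` and `attach_with_retry` are hypothetical names, and the `TenantState` enum is a simplified stand-in for the real one. The real attach path is async; this sketch uses a blocking sleep only to stay self-contained. It assumes attach failures can be classified as transient (e.g. S3 temporarily unavailable) or permanent.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Simplified, hypothetical stand-in for the tenant states discussed above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TenantState {
    Active,
    Broken { reason: String },
}

/// Hypothetical classification of attach failures (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AttachError {
    /// Worth retrying, e.g. temporary S3 unavailability.
    Transient(String),
    /// Retrying will not help.
    Permanent(String),
}

/// Hypothetical retry settings.
pub struct RetryPolicy {
    pub max_attempts: u32,
    pub base_delay: Duration,
    pub max_delay: Duration,
}

/// Calls `attempt_attach` until it succeeds, fails permanently, or the
/// attempt budget is exhausted. Only then is the tenant marked Broken.
pub fn attach_with_retry<F>(mut attempt_attach: F, policy: &RetryPolicy) -> TenantState
where
    F: FnMut(u32) -> Result<(), AttachError>,
{
    let mut delay = policy.base_delay;
    for attempt in 1..=policy.max_attempts {
        match attempt_attach(attempt) {
            Ok(()) => return TenantState::Active,
            Err(AttachError::Permanent(reason)) => return TenantState::Broken { reason },
            Err(AttachError::Transient(reason)) => {
                if attempt == policy.max_attempts {
                    // Budget exhausted: give up and surface the last error.
                    return TenantState::Broken { reason };
                }
                sleep(delay);
                // Double the delay for the next attempt, capped at max_delay.
                delay = delay.saturating_mul(2).min(policy.max_delay);
            }
        }
    }
    // Only reached when max_attempts is 0.
    TenantState::Broken {
        reason: "no attach attempts were made".to_string(),
    }
}
```

With this shape, a tenant only ends up Broken after a permanent error or after the retry budget runs out. A bounded budget is a deliberate choice in the sketch: it leaves room for the second option, where the storage controller notices the broken tenant and retries it on the same or a different Pageserver.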