# Memo: Endpoint Persistent Unlogged Files Storage
Created on 2024-11-05
Implemented on N/A

## Summary
A design for a storage system that stores the files Neon's Endpoints need to
provide a better experience at or after a reboot.

## Motivation
Several systems inside PostgreSQL (and Neon) need persistent storage to work
optimally across reboots and restarts, but can still function without it.
Examples are the query-level statistics files of `pg_stat_statements` in
`pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s `autoprewarm.blocks`.
We need a storage system that can store and manage these files for each
Endpoint, without necessarily granting users access to an unlimited storage
device.

## Goals
- Store known files for Endpoints with reasonable persistence.
  _Data loss in this service, while annoying and bad for UX, won't lose any
  customer's data._

## Non Goals (if relevant)
- This storage system does not need branching, file versioning, or other such
  features. The files are as ephemeral to the timeline of the data as the
  Endpoints that host the data.
- This storage system does not need to store _all_ user files, only 'known'
  user files.
- This storage system does not need to be hosted fully inside Computes.
  _Instead, this will be a separate component similar to Pageserver,
  SafeKeeper, the S3 proxy used for dynamically loaded extensions, etc._

## Impacted components
- Compute needs new code to load and store these files during its lifetime.
- Control Plane needs to consider this new storage system when signalling
  the deletion of an Endpoint, Timeline, or Tenant.
- Control Plane needs to consider this new storage system when it resets
  or re-assigns an endpoint's timeline/branch state.

A new service is created: the Endpoint Persistent Unlogged Files Storage
(EPUFS) service. This could be integrated into e.g. Pageserver or Control
Plane, or hosted as a separate service.

## Proposed implementation
Endpoint-related data files are managed by a newly designed service
(optionally integrated into an existing service like Pageserver or Control
Plane), which stores the data directly in S3 or any blob storage of choice.

Upon deletion of the Endpoint, or reassignment of the Endpoint to a different
branch, this ephemeral data is dropped: after reassignment the stored data may
no longer match the state of the branch's data, and after Endpoint deletion
the data is of no further use to the user.

Compute gets credentials (a JWT with Tenant, Timeline & Endpoint claims) which
it can use to authenticate to this new service, and to retrieve and store data
associated with its Endpoint. This limited scope reduces the risk of data
leaking across endpoints and timeline resets, and limits the ability of
endpoints to interfere with other endpoints' data.
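
As a sketch of what such a scoped token could carry (the claim names and types
below are illustrative assumptions, not a finalized schema):

```rust
use serde::{Deserialize, Serialize};

/// Illustrative claim set for an EPUFS-scoped token. The exact claim names,
/// ID encodings, and expiry policy are assumptions, not a final design.
#[derive(Debug, Serialize, Deserialize)]
struct EpufsClaims {
    /// Hex-encoded tenant ID the endpoint belongs to.
    tenant_id: String,
    /// Hex-encoded timeline ID the endpoint is currently bound to.
    timeline_id: String,
    /// Endpoint ID, e.g. "ep-...".
    endpoint_id: String,
    /// Standard JWT expiry, in seconds since the Unix epoch.
    exp: u64,
}
```

EPUFS would then reject any request whose path does not match the token's
claims; a matching check is sketched in the Security section below.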

The path of this endpoint data in S3 is initially as follows:

    s3://<regional-epufs-bucket>/
        tenants/
            <hex-tenant-id>/
                timelines/
                    <hex-timeline-id>/
                        endpoints/
                            <endpoint-id>/
                                pgdata/
                                    <file_path_in_pgdatadir>

For other blob storages an equivalent or similar path can be constructed.
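
As a minimal sketch of how a request could be mapped onto this layout (the
function and parameter names are illustrative, not part of the design):

```rust
/// Build the object key for an endpoint-scoped file, mirroring the layout
/// above. `pgdata_path` is the file's path relative to the PostgreSQL data
/// directory, e.g. "pg_stat/pg_stat_statements.stat".
fn epufs_object_key(
    tenant_id: &str,   // hex tenant ID
    timeline_id: &str, // hex timeline ID
    endpoint_id: &str, // e.g. "ep-..."
    pgdata_path: &str,
) -> String {
    format!(
        "tenants/{tenant_id}/timelines/{timeline_id}/endpoints/{endpoint_id}/pgdata/{pgdata_path}"
    )
}

fn main() {
    // The full object URL is then s3://<regional-epufs-bucket>/<key>.
    let key = epufs_object_key(
        "a1b2c3d4",          // illustrative tenant ID
        "e5f6a7b8",          // illustrative timeline ID
        "ep-example-123456", // illustrative endpoint ID
        "pg_stat/pg_stat_statements.stat",
    );
    println!("{key}");
}
```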

### Reliability, failure modes and corner cases (if relevant)
Reliability is important, but not critical to the workings of Neon. The data
stored in this service will, when lost, reduce performance, but won't cause
permanent data loss - only operational metadata is stored.

Most, if not all, blob storage services have sufficiently high persistence
guarantees to cater to our needs for persistence and uptime. The only concern
with blob storage is that access latency is generally higher than local disk,
but for the object types stored (cache state, ...) I don't think this will be
much of an issue.

### Interaction/Sequence diagram (if relevant)

In these diagrams you can replace S3 with any persistent storage of choice;
S3 is used as the representative name because it is the well-known, short
name of AWS' blob storage. Azure Blob Storage would work too, but its much
longer name makes it less practical in the diagrams.

Write data:

```http
POST /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "version": "<opaque>", # opaque file version token, changes when the file contents change
  "size": <bytes>,
}
```
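
A compute-side sketch of this store call, assuming the API shape and host name
from the example above; `reqwest` is only one possible HTTP client:

```rust
use std::error::Error;

/// Upload one unlogged persistent file to EPUFS. Sketch only: the URL shape
/// and bearer-token authentication follow the example request above and are
/// not a finalized API.
fn store_upf(
    client: &reqwest::blocking::Client,
    jwt: &str,
    tenant: &str,
    timeline: &str,
    endpoint: &str,
    pgdata_path: &str,
    contents: Vec<u8>,
) -> Result<(), Box<dyn Error>> {
    let url = format!(
        "https://epufs.svc.neon.local/tenants/{tenant}/timelines/{timeline}/endpoints/{endpoint}/pgdata/{pgdata_path}"
    );
    let response = client
        .post(url)
        .bearer_auth(jwt) // JWT with tenant/timeline/endpoint claims
        .body(contents)
        .send()?;
    response.error_for_status()?; // treat any non-2xx answer as a failure
    Ok(())
}
```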

```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage

    co-->ep: Connect with credentials
    co->>+ep: Store Unlogged Persistent File
    opt is authenticated
        ep->>s3: Write UPF to S3
    end
    ep->>-co: OK / Failure / Auth Failure
    co-->ep: Cancel connection
```

Read data (optionally with cache-relevant request parameters, e.g.
If-Modified-Since):
```http
GET /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local

<<<

200 OK

<file data>
```
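
For example, a conditional read using such a cache-validation header could
look like this (illustrative only; the exact set of supported headers is not
decided here):

```http
GET /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local
If-Modified-Since: <timestamp of the cached copy>

<<<

304 Not Modified
```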

```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage

    co->>+ep: Read Unlogged Persistent File
    opt is authenticated
        ep->>+s3: Request UPF from storage
        s3->>-ep: Receive UPF from storage
    end
    ep->>-co: OK(response) / Failure(storage, auth, ...)
```

Compute Startup:
```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ps as Pageserver
    participant ep as EPUFS
    participant es as Extension server

    note over co: Bind endpoint ep-xxx
    par Get basebackup
        co->>+ps: Request basebackup @ LSN
        ps-)ps: Construct basebackup
        ps->>-co: Receive basebackup TAR @ LSN
    and Get startup-critical Unlogged Persistent Files
        co->>+ep: Get all UPFs of endpoint ep-xxx
        ep-)ep: Retrieve and gather all UPFs
        ep->>-co: TAR of UPFs
    and Get startup-critical extensions
        loop For every startup-critical extension
            co->>es: Get critical extension
            es->>co: Receive critical extension
        end
    end
    note over co: Start compute
```
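
A sketch of the parallel fetch shown above, assuming an async runtime such as
tokio; the three fetch functions are hypothetical stand-ins for the real
basebackup, EPUFS, and extension-server clients:

```rust
// Hypothetical placeholders for the three startup fetches in the diagram.
async fn get_basebackup() -> Result<Vec<u8>, String> {
    Ok(Vec::new())
}
async fn get_unlogged_persistent_files() -> Result<Vec<u8>, String> {
    Ok(Vec::new())
}
async fn get_critical_extensions() -> Result<(), String> {
    Ok(())
}

#[tokio::main]
async fn main() -> Result<(), String> {
    // Fetch all startup inputs concurrently, as in the `par` block above.
    let (basebackup, upfs, extensions) = tokio::join!(
        get_basebackup(),
        get_unlogged_persistent_files(),
        get_critical_extensions()
    );

    let _pgdata_tar = basebackup?; // the basebackup is required to start
    if let Ok(_upf_tar) = upfs {
        // Unpack the UPF tarball into pgdata. A failure here only costs
        // performance, so it must not block startup.
    }
    extensions?; // startup-critical extensions are required as well

    // ... start PostgreSQL ...
    Ok(())
}
```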

CPlane ops:
```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>/endpoints/<endpoint-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "timeline": "<timeline-id>",
  "endpoint": "<endpoint-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```

```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "timeline": "<timeline-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```

```http
DELETE /tenants/<tenant-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```

```mermaid
sequenceDiagram
    autonumber
    participant cp as Control Plane
    participant ep as EPUFS
    participant s3 as Blob Storage

    alt Tenant deleted
        cp-)ep: Tenant deleted
        loop For every object associated with removed tenant
            ep->>s3: Remove data of deleted tenant from Storage
        end
        opt
            ep-)cp: Tenant cleanup complete
        end
    else Timeline deleted
        cp-)ep: Timeline deleted
        loop For every object associated with removed timeline
            ep->>s3: Remove data of deleted timeline from Storage
        end
        opt
            ep-)cp: Timeline cleanup complete
        end
    else Endpoint reassigned or removed
        cp->>+ep: Endpoint reassigned
        loop For every object associated with reassigned/removed endpoint
            ep->>s3: Remove data from Storage
        end
        ep->>-cp: Cleanup complete
    end
```
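
A sketch of the cleanup loop against a generic blob store; the
`list_keys`/`delete_key` helpers are hypothetical stand-ins for the SDK of the
chosen storage (e.g. paginated ListObjectsV2 plus DeleteObjects on S3):

```rust
use std::error::Error;

/// Hypothetical minimal interface over the blob store.
trait BlobStore {
    fn list_keys(&self, prefix: &str) -> Result<Vec<String>, Box<dyn Error>>;
    fn delete_key(&self, key: &str) -> Result<(), Box<dyn Error>>;
}

/// Delete everything stored for one endpoint. Deleting a whole timeline or
/// tenant is the same loop with a shorter prefix.
fn delete_endpoint_data(
    store: &dyn BlobStore,
    tenant_id: &str,
    timeline_id: &str,
    endpoint_id: &str,
) -> Result<usize, Box<dyn Error>> {
    let prefix = format!(
        "tenants/{tenant_id}/timelines/{timeline_id}/endpoints/{endpoint_id}/"
    );
    let mut files = 0;
    for key in store.list_keys(&prefix)? {
        store.delete_key(&key)?;
        files += 1;
    }
    // The file count can be reported back in the DELETE response; byte
    // counting is omitted in this sketch.
    Ok(files)
}
```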

### Scalability (if relevant)

Provisionally: as this service is going to be part of compute startup, it
should be able to respond quickly to all requests. Therefore the service is
deployed to every AZ we host Computes in, and Computes (generally) communicate
only with the EPUFS endpoint of the AZ they're hosted in.

Local caching of frequently restarted endpoints' data or metadata may be
needed for best performance. However, because the stored data is regional
while the service deployment is zonal, we should be careful when implementing
any local caching: computes in AZ 1 may update data originally written and
thus cached by AZ 2. Cache version checks and invalidation are therefore
required if we want to roll out caching to this service, which is too broad a
scope for an MVP. This is why caching is left out of scope for this RFC, and
should be considered separately after this RFC is implemented.
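
If caching is added later, the opaque version token returned by the write API
could serve as the validator: a zonal cache would have to re-check that token
against the authoritative store before serving a hit. A sketch, where
`current_version` is assumed to come from a cheap metadata lookup:

```rust
/// Hypothetical entry in a zonal EPUFS cache.
struct CachedFile {
    version: String, // opaque version token returned by the store
    data: Vec<u8>,
}

/// Serve from cache only if the authoritative version still matches;
/// otherwise the caller must re-fetch from the regional blob store.
fn cache_lookup<'a>(entry: &'a CachedFile, current_version: &str) -> Option<&'a [u8]> {
    if entry.version == current_version {
        Some(entry.data.as_slice())
    } else {
        None // stale: rewritten by a compute in another AZ since we cached it
    }
}
```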

### Security implications (if relevant)
This service must be able to authenticate users at least by Tenant ID,
Timeline ID and Endpoint ID. This will use the existing JWT infrastructure of
Compute, which will be upgraded to the extent needed to support Timeline- and
Endpoint-based claims.
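
A sketch of the authorization check this implies, reusing the illustrative
claim shape from earlier; signature verification and expiry checks are assumed
to have happened before this point:

```rust
/// Claims carried by the compute's JWT (same illustrative shape as above).
struct EpufsClaims {
    tenant_id: String,
    timeline_id: String,
    endpoint_id: String,
}

/// Allow a request only when every ID in the request path matches the
/// token's claims, so a token only grants access to its own
/// tenant/timeline/endpoint prefix.
fn authorize(claims: &EpufsClaims, tenant: &str, timeline: &str, endpoint: &str) -> bool {
    claims.tenant_id == tenant
        && claims.timeline_id == timeline
        && claims.endpoint_id == endpoint
}
```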

The service requires unlimited access to (a prefix of) a blob storage bucket,
and thus must be hosted outside the Compute VM sandbox.

Having the service generate pre-signed request URLs for Compute to read or
write the data directly is likely problematic, too: Compute would be able to
write unlimited data to the bucket, or exfiltrate the signed URL to get
read/write access to specific objects in the bucket, which would still
effectively give users access to the S3 bucket (though with improved access
logging).

There may be a use case for transferring data associated with one endpoint to
another endpoint (e.g. to make one endpoint warm its caches with the state of
another endpoint), but that's not currently in scope, and specific needs may
be solved through out-of-band communication of data or pre-signed URLs.

### Unresolved questions (if relevant)
Caching of files is not in the implementation scope of this document, but
should be considered at some future point to maximize performance.

## Alternative implementation (if relevant)
Several ideas have come up to solve this issue:

### Use AUXfile
One prevalent idea was to WAL-log the files using our AUXfile mechanism.

Benefits:

+ We already have this storage mechanism

Demerits:

- It isn't available on read replicas
- Additional WAL will be consumed during shutdown and after the shutdown
  checkpoint, which needs PG modifications to work without panics.
- It increases the data we need to manage in our versioned storage, thus
  causing higher storage costs with higher retention due to duplication at
  the storage layer.

### Sign URLs for read/write operations, instead of proxying them

Benefits:

+ The service can be implemented with a much reduced IO budget

Demerits:

- Users could get access to these signed credentials
- Not all blob storage services may implement URL signing

### Give endpoints each their own directly accessed block volume

Benefits:

+ Easier to integrate for PostgreSQL

Demerits:

- Little control over data size and contents
- Potentially problematic as we'd need to store data all across the pgdata
  directory.
- EBS is not a good candidate
  - Attaches in 10s of seconds, if not more; i.e. too cold to start
  - Shared EBS volumes are a no-go, as you'd have to schedule the endpoint
    with users of the same EBS volumes, which can't work with VM migration
  - EBS storage costs are very high (>$80 per thousand tenants when using one
    volume per tenant)
  - EBS volumes can't be mounted across AZ boundaries
- A bucket per endpoint is infeasible
  - S3 buckets are priced at $20/month per 1,000 buckets, which we could
    better spend on developers.
  - Allocating service accounts takes time (100s of ms), and service accounts
    are a limited resource, too; so they're not a good candidate to allocate
    on a per-endpoint basis.
  - Giving out credentials limited to a prefix has similar issues to the
    pre-signed URL approach.
  - Bucket DNS lookups would fill DNS caches and put much more pressure on
    DNS resolution than our current systems do.
- Volumes bound by the hypervisor are unlikely
  - This requires significant investment and additional software on the
    hypervisor.
  - It is unclear if we can attach volumes after boot, i.e. for pooled
    instances.

### Put the files into a table

Benefits:

+ Mostly already available in PostgreSQL

Demerits:

- Uses WAL
- Can't be used after the shutdown checkpoint
- Needs a RW endpoint, and table & catalog access to write this data
- Gets hit with DB size limitations
- Depending on user access:
  - Inaccessible:
    The user doesn't have control over database size caused by
    these systems.
  - Accessible:
    The user can corrupt these files and cause the system to crash while
    user-corrupted files are present, thus increasing on-call overhead.

## Definition of Done (if relevant)

This project is done if we have:

- One S3 bucket (or equivalent) per region, which stores this per-endpoint
  data.
- A new service endpoint in at least every AZ, which indirectly grants
  endpoints access to the data stored for these endpoints in these buckets.
- Compute writes & reads temp-data at shutdown and startup, respectively, for
  at least the pg_prewarm or lfc_prewarm state files.
- Cleanup of endpoint data is triggered when the endpoint is deleted or is
  detached from its current timeline.