# Memo: Endpoint Persistent Unlogged Files Storage
Created on 2024-11-05
Implemented on N/A

## Summary
A design for a storage system that allows storage of files required to make
Neon's Endpoints have a better experience at or after a reboot.

## Motivation
Several systems inside PostgreSQL (and Neon) need some persistent storage to
work optimally across reboots and restarts, but can still function without it.
Examples are the query-level statistics file of `pg_stat_statements` in
`pg_stat/pg_stat_statements.stat`, and `pg_prewarm`'s `autoprewarm.blocks`.
We need a storage system that can store and manage these files for each
Endpoint, without necessarily granting users access to an unlimited storage
device.

## Goals
- Store known files for Endpoints with reasonable persistence.
  _Data loss in this service, while annoying and bad for UX, won't lose any
  customer's data._

## Non Goals (if relevant)
- This storage system does not need branching, file versioning, or other such
  features. The files are as ephemeral to the timeline of the data as the
  Endpoints that host the data.
- This storage system does not need to store _all_ user files, only 'known'
  user files.
- This storage system does not need to be hosted fully inside Computes.
  _Instead, this will be a separate component similar to Pageserver,
  SafeKeeper, the S3 proxy used for dynamically loaded extensions, etc._

## Impacted components
- Compute needs new code to load and store these files during its lifetime.
- Control Plane needs to consider this new storage system when signalling
  the deletion of an Endpoint, Timeline, or Tenant.
- Control Plane needs to consider this new storage system when it resets
  or re-assigns an endpoint's timeline/branch state.

A new service is created: the Endpoint Persistent Unlogged Files Storage
(EPUFS) service. This could be integrated into e.g. Pageserver or Control
Plane, or hosted as a separate service.

## Proposed implementation
Endpoint-related data files are managed by a newly designed service (which
may optionally be integrated into an existing service such as Pageserver or
Control Plane) that stores data directly in S3 or any blob storage of choice.

Upon deletion of the Endpoint, or reassignment of the endpoint to a different
branch, this ephemeral data is dropped: the stored data may not match the
state of the branch's data after reassignment, and on endpoint deletion the
data won't be of any use to the user.

Compute gets credentials (a JWT with Tenant, Timeline & Endpoint claims)
which it can use to authenticate to this new service and retrieve and store
data associated with this endpoint. This limited scope reduces leaks of data
across endpoints and timeline resets, and limits the ability of endpoints to
mess with other endpoints' data.
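
For illustration, such a scoped token could carry claims along the following
lines; the claim names, audience, signing algorithm, and expiry are
placeholders for this sketch, not a finalized token format:

```python
# Hypothetical claim set for an endpoint-scoped EPUFS token.
# Claim names, audience, algorithm and key handling are illustrative only.
import time

import jwt  # PyJWT

claims = {
    "aud": "epufs",                                     # assumed audience
    "tenant_id": "105eb2e173b577625ae07d0d3b9556e1",    # example hex tenant ID
    "timeline_id": "5c4c5f4cbf2c50a32ab10f63d5474c27",  # example hex timeline ID
    "endpoint_id": "ep-example-123456",                 # example endpoint ID
    "exp": int(time.time()) + 3600,                     # short-lived token
}

# Whichever component issues compute credentials would sign this with a real
# key; HS256 with a dummy secret is used here only to keep the sketch runnable.
token = jwt.encode(claims, "dummy-secret", algorithm="HS256")
```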

The path of this endpoint data in S3 is initially as follows:

    s3://<regional-epufs-bucket>/
        tenants/
            <hex-tenant-id>/
                timelines/
                    <hex-timeline-id>/
                        endpoints/
                            <endpoint-id>/
                                pgdata/
                                    <file_path_in_pgdatadir>

For other blob storages an equivalent or similar path can be constructed.
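
A minimal sketch of how the service could map the identifiers onto an object
key under this layout; the helper name is illustrative, and the IDs are
assumed to already be validated against the caller's claims:

```python
# Illustrative key construction for the layout above; names are not final.
def epufs_object_key(tenant_id: str, timeline_id: str,
                     endpoint_id: str, pgdata_path: str) -> str:
    """Build the object key under which a file from the endpoint's
    PGDATA directory is stored, following the layout sketched above."""
    # pgdata_path is relative to PGDATA, e.g. "pg_stat/pg_stat_statements.stat"
    return (
        f"tenants/{tenant_id}/timelines/{timeline_id}/"
        f"endpoints/{endpoint_id}/pgdata/{pgdata_path}"
    )
```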

### Reliability, failure modes and corner cases (if relevant)
Reliability is important, but not critical to the workings of Neon. The data
stored in this service will, when lost, reduce performance, but won't be a
cause of permanent data loss - only operational metadata is stored.

Most, if not all, blob storage services have sufficiently high persistence
guarantees to cater to our needs for persistence and uptime. The only concern
with blob storage is that access latency is generally higher than that of
local disk, but for the object types stored (cache state, ...) I don't think
this will be much of an issue.

### Interaction/Sequence diagram (if relevant)

In these diagrams you can replace S3 with any persistent storage of choice;
S3 is chosen as a representative name: the well-known, short name of AWS'
blob storage. Azure Blob Storage should work too, but its much longer name
makes it less practical for the diagrams.

Write data:

```http
POST /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "version": "<opaque>", # opaque file version token, changes when the file contents change
  "size": <bytes>,
}
```

```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage

    co-->ep: Connect with credentials
    co->>+ep: Store Unlogged Persistent File
    opt is authenticated
        ep->>s3: Write UPF to S3
    end
    ep->>-co: OK / Failure / Auth Failure
    co-->ep: Cancel connection
```
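
A hedged compute-side sketch of the write call above, assuming bearer-token
authentication and a raw request body; the host name, helper name, and PGDATA
location are illustrative:

```python
# Illustrative "store one UPF" call against the POST endpoint shown above.
# The base URL, auth scheme, and response fields are assumptions of this sketch.
from pathlib import Path

import requests

EPUFS_BASE = "http://epufs.svc.neon.local"  # assumed service address


def store_upf(token: str, tenant: str, timeline: str, endpoint: str,
              pgdata_dir: str, pgdata_path: str) -> dict:
    """Upload one file from PGDATA, e.g. "pg_stat/pg_stat_statements.stat"."""
    url = (f"{EPUFS_BASE}/tenants/{tenant}/timelines/{timeline}"
           f"/endpoints/{endpoint}/pgdata/{pgdata_path}")
    body = Path(pgdata_dir, pgdata_path).read_bytes()
    resp = requests.post(url, data=body,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=30)
    resp.raise_for_status()
    return resp.json()  # e.g. {"version": "<opaque>", "size": <bytes>}
```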

Read data (optionally with cache-relevant request headers, e.g. `If-Modified-Since`):
```http
GET /tenants/<tenant-id>/timelines/<tl-id>/endpoints/<endpoint-id>/pgdata/<the-pgdata-path>
Host: epufs.svc.neon.local

<<<

200 OK

<file data>
```

```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ep as EPUFS
    participant s3 as Blob Storage

    co->>+ep: Read Unlogged Persistent File
    opt is authenticated
        ep->>+s3: Request UPF from storage
        s3->>-ep: Receive UPF from storage
    end
    ep->>-co: OK(response) / Failure(storage, auth, ...)
```
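
The corresponding compute-side read, with an optional conditional request;
treating 304 (cached copy still current) and 404 (file never stored) as
"nothing to restore" is an assumption of this sketch:

```python
# Illustrative "read one UPF" call against the GET endpoint shown above.
from typing import Optional

import requests


def fetch_upf(token: str, url: str,
              if_modified_since: Optional[str] = None) -> Optional[bytes]:
    """Fetch one UPF; returns None if there is nothing (new) to restore."""
    headers = {"Authorization": f"Bearer {token}"}
    if if_modified_since is not None:
        headers["If-Modified-Since"] = if_modified_since  # HTTP-date string
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code in (304, 404):
        return None
    resp.raise_for_status()
    return resp.content
```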

Compute Startup:
```mermaid
sequenceDiagram
    autonumber
    participant co as Compute
    participant ps as Pageserver
    participant ep as EPUFS
    participant es as Extension server

    note over co: Bind endpoint ep-xxx
    par Get basebackup
        co->>+ps: Request basebackup @ LSN
        ps-)ps: Construct basebackup
        ps->>-co: Receive basebackup TAR @ LSN
    and Get startup-critical Unlogged Persistent Files
        co->>+ep: Get all UPFs of endpoint ep-xxx
        ep-)ep: Retrieve and gather all UPFs
        ep->>-co: TAR of UPFs
    and Get startup-critical extensions
        loop For every startup-critical extension
            co->>es: Get critical extension
            es->>co: Receive critical extension
        end
    end
    note over co: Start compute
```
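
The "Get startup-critical Unlogged Persistent Files" branch above bundles all
of an endpoint's UPFs into a single TAR. A hedged sketch of the compute side,
assuming a bulk `.../pgdata` download URL that returns such a TAR (the exact
bulk-download API is not fixed by this RFC):

```python
# Illustrative bulk restore of UPFs into PGDATA during compute startup.
# The bulk-download URL and tar response format are assumptions of this sketch.
import io
import tarfile

import requests


def restore_upfs(token: str, endpoint_base_url: str, pgdata_dir: str) -> None:
    resp = requests.get(f"{endpoint_base_url}/pgdata",
                        headers={"Authorization": f"Bearer {token}"},
                        timeout=60)
    resp.raise_for_status()
    with tarfile.open(fileobj=io.BytesIO(resp.content), mode="r:*") as tar:
        # Member names are assumed to be relative to the pgdata/ prefix.
        tar.extractall(path=pgdata_dir)
```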

CPlane ops:
```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>/endpoints/<endpoint-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "timeline": "<timeline-id>",
  "endpoint": "<endpoint-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```

```http
DELETE /tenants/<tenant-id>/timelines/<timeline-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "timeline": "<timeline-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```

```http
DELETE /tenants/<tenant-id>
Host: epufs.svc.neon.local

<<<

200 OK
{
  "tenant": "<tenant-id>",
  "deleted": {
    "files": <count>,
    "bytes": <count>,
  },
}
```
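
A brief sketch of how Control Plane might invoke the endpoint-level cleanup
and record the reported counts; the service host and internal authentication
are assumptions of this sketch:

```python
# Illustrative Control Plane-side call to the endpoint DELETE API above.
import requests

EPUFS_BASE = "http://epufs.svc.neon.local"  # assumed service address


def delete_endpoint_upfs(tenant: str, timeline: str, endpoint: str) -> dict:
    url = (f"{EPUFS_BASE}/tenants/{tenant}/timelines/{timeline}"
           f"/endpoints/{endpoint}")
    resp = requests.delete(url, timeout=60)  # internal auth omitted here
    resp.raise_for_status()
    return resp.json()  # contains "deleted": {"files": ..., "bytes": ...}
```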

```mermaid
sequenceDiagram
    autonumber
    participant cp as Control Plane
    participant ep as EPUFS
    participant s3 as Blob Storage

    alt Tenant deleted
        cp-)ep: Tenant deleted
        loop For every object associated with removed tenant
            ep->>s3: Remove data of deleted tenant from Storage
        end
        opt
            ep-)cp: Tenant cleanup complete
        end
    else Timeline deleted
        cp-)ep: Timeline deleted
        loop For every object associated with removed timeline
            ep->>s3: Remove data of deleted timeline from Storage
        end
        opt
            ep-)cp: Timeline cleanup complete
        end
    else Endpoint reassigned or removed
        cp->>+ep: Endpoint reassigned
        loop For every object associated with reassigned/removed endpoint
            ep->>s3: Remove data from Storage
        end
        ep->>-cp: Cleanup complete
    end
```
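
Inside EPUFS, each of the deletion loops above boils down to removing every
object under a tenant, timeline, or endpoint prefix. A sketch of what that
could look like against S3, with the bucket name and prefix layout following
the path scheme described earlier:

```python
# Illustrative prefix deletion for the cleanup loops in the diagram above.
import boto3


def delete_prefix(bucket: str, prefix: str) -> tuple[int, int]:
    """Delete all objects under `prefix`; returns (files, bytes) removed."""
    s3 = boto3.client("s3")
    files = total_bytes = 0
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket=bucket, Prefix=prefix):
        objects = page.get("Contents", [])
        if not objects:
            continue
        files += len(objects)
        total_bytes += sum(obj["Size"] for obj in objects)
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": obj["Key"]} for obj in objects]},
        )
    return files, total_bytes


# e.g. delete_prefix("<regional-epufs-bucket>", f"tenants/{tenant_id}/")
```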

### Scalability (if relevant)

Provisionally: as this service is going to be part of compute startup, it
should be able to respond quickly to all requests. Therefore this service is
deployed to every AZ we host Computes in, and Computes (generally) communicate
only with the EPUFS endpoint of the AZ they're hosted in.

Local caching of frequently restarted endpoints' data or metadata may be
needed for best performance. However, due to the regional nature of the stored
data but the zonal nature of the service deployment, we should be careful when
we implement any local caching: it is possible that computes in AZ 1 will
update data originally written, and thus cached, by AZ 2. Cache version checks
and invalidation are therefore required if we want to roll out caching to this
service, which is too broad a scope for an MVP. This is why caching is left
out of scope for this RFC, and should be considered separately after this RFC
is implemented.

### Security implications (if relevant)
This service must be able to authenticate users at least by Tenant ID,
Timeline ID and Endpoint ID. This will use the existing JWT infrastructure of
Compute, which will be upgraded to the extent needed to support Timeline- and
Endpoint-based claims.
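
As an illustration of that scoping, the service could require that the IDs in
the request path match the (signature-verified) claims; the claim names here
are the same placeholders used in the earlier token sketch:

```python
# Illustrative authorization check: path components must match token claims.
def authorize_pgdata_request(path: str, claims: dict) -> bool:
    # path e.g. "/tenants/<tenant>/timelines/<tl>/endpoints/<ep>/pgdata/<file>"
    parts = path.lstrip("/").split("/")
    if (len(parts) < 7 or parts[0] != "tenants" or parts[2] != "timelines"
            or parts[4] != "endpoints" or parts[6] != "pgdata"):
        return False
    return (parts[1] == claims.get("tenant_id")
            and parts[3] == claims.get("timeline_id")
            and parts[5] == claims.get("endpoint_id"))
```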

The service requires unlimited access to (a prefix of) a blob storage bucket,
and thus must be hosted outside the Compute VM sandbox.

Having the service generate pre-signed request URLs for Compute to read or
write the data directly is likely problematic, too: Compute would be able to
write unlimited data to the bucket, or exfiltrate the signed URL to get
read/write access to specific objects in this bucket, which would still
effectively give users access to the S3 bucket (but with improved access
logging).

There may be a use case for transferring data associated with one endpoint to
another endpoint (e.g. to make one endpoint warm its caches with the state of
another endpoint), but that's not currently in scope, and specific needs may
be solved through out-of-band communication of data or pre-signed URLs.

### Unresolved questions (if relevant)
Caching of files is not in the implementation scope of this document, but
should at some future point be considered to maximize performance.

## Alternative implementation (if relevant)
Several ideas have come up to solve this issue:

### Use AUXfile
One prevalent idea was to WAL-log the files using our AUXfile mechanism.

Benefits:

+ We already have this storage mechanism

Demerits:

- It isn't available on read replicas
- Additional WAL will be consumed during shutdown and after the shutdown
  checkpoint, which needs PG modifications to work without panics.
- It increases the data we need to manage in our versioned storage, thus
  causing higher storage costs with higher retention due to duplication at
  the storage layer.

### Sign URLs for read/write operations, instead of proxying them

Benefits:

+ The service can be implemented with a much reduced IO budget

Demerits:

- Users could get access to these signed credentials
- Not all blob storage services may implement URL signing

### Give endpoints each their own directly accessed block volume

Benefits:

+ Easier to integrate for PostgreSQL

Demerits:

- Little control over data size and contents
- Potentially problematic as we'd need to store data all across the pgdata
  directory.
- EBS is not a good candidate
  - Attaches in 10s of seconds, if not more; i.e. too cold to start
  - Shared EBS volumes are a no-go, as you'd have to schedule the endpoint
    with users of the same EBS volumes, which can't work with VM migration
  - EBS storage costs are very high (> $80 per 1,000 tenants when using one
    volume per tenant)
  - EBS volumes can't be mounted across AZ boundaries
- Bucket per endpoint is infeasible
  - S3 buckets are priced at $20/month per 1,000 buckets, which we could
    better spend on developers.
  - Allocating service accounts takes time (100s of ms), and service accounts
    are a limited resource, too; so they're not a good candidate to allocate
    on a per-endpoint basis.
  - Giving out credentials limited to a prefix has similar issues to the
    pre-signed URL approach.
  - Bucket DNS lookups will fill DNS caches and put much more pressure on DNS
    resolution than our current systems do.
- Volumes bound by the hypervisor are unlikely
  - This requires significant investment and additional software on the
    hypervisor.
  - It is unclear if we can attach volumes after boot, i.e. for pooled
    instances.

### Put the files into a table

Benefits:

+ Mostly already available in PostgreSQL

Demerits:

- Uses WAL
- Can't be used after the shutdown checkpoint
- Needs a RW endpoint, and table & catalog access to write this data
- Gets hit with DB size limitations
- Depending on user access:
  - Inaccessible:
    The user doesn't have control over database size caused by
    these systems.
  - Accessible:
    The user can corrupt these files and cause the system to crash while
    user-corrupted files are present, thus increasing on-call overhead.

## Definition of Done (if relevant)

This project is done if we have:

- One S3 bucket equivalent per region, which stores this per-endpoint data.
- A new service endpoint in at least every AZ, which indirectly grants
  endpoints access to the data stored for these endpoints in these buckets.
- Compute writes & reads temp-data at shutdown and startup, respectively, for
  at least the `pg_prewarm` or `lfc_prewarm` state files.
- Cleanup of endpoint data is triggered when the endpoint is deleted or is
  detached from its current timeline.
