Storage release 2025-06-06 10:55 UTC #12163
Merged
…fig. (#12087) Also moves it from `String` to `Url`.
libs/pqproto is designed for the safekeeper/pageserver, where maximum throughput matters. Proxy only needs it for handshakes/authentication, where throughput is not a concern but memory efficiency is. For this reason, we switch to using read_exact and allocate only as much memory as we need. All reads return a `&'a [u8]` instead of a `Bytes`, because accidental sharing of bytes can cause fragmentation. Returning a reference forces callers to hold onto only the bytes they absolutely need. For example, before this change, `pqproto` allocated 8KiB for the initial read `BytesMut`, and proxy held the `Bytes` in `StartupMessageParams` for the entire connection through to passthrough.
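As a minimal sketch of the borrowed-slice pattern (not the actual `pqproto` API; the function and buffer handling here are illustrative):

```rust
use std::io::{self, Read};

/// Reads one length-prefixed message into `buf` and returns a borrowed slice.
/// The caller owns `buf`, so nothing outlives the handshake, and we allocate
/// only as many bytes as the message actually needs.
fn read_message<'a, R: Read>(stream: &mut R, buf: &'a mut Vec<u8>) -> io::Result<&'a [u8]> {
    let mut len_bytes = [0u8; 4];
    stream.read_exact(&mut len_bytes)?;
    // Postgres-style lengths include the 4 length bytes themselves.
    let len = u32::from_be_bytes(len_bytes) as usize;
    let body_len = len.checked_sub(4).ok_or(io::ErrorKind::InvalidData)?;

    buf.clear();
    buf.resize(body_len, 0);
    stream.read_exact(buf)?;
    Ok(&buf[..])
}
```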
…12048) Precursor to neondatabase/cloud#28333. We want per-endpoint configuration for rate limits, which will be distributed via the `GetEndpointAccessControl` API. This lays some of the groundwork. 1. Allow the endpoint rate limiter to accept a custom leaky bucket config on check. 2. Remove the unused auth rate limiter, as I don't want to think about how it fits into this. 3. Refactor the caching of `GetEndpointAccessControl`, as it adds friction for adding new cached data to the API. That third one was rather large; I couldn't find any way to split it up. The core idea is that there are now only two cache APIs: `get_endpoint_access_controls` and `get_role_access_controls`. I'm pretty sure the behaviour is unchanged, except for a drive-by change to fix #8989, which felt harmless. The change in question is that when a password validation fails, we eagerly expire the role cache entry if the role was cached for 5 minutes. This allows for edge cases where a user tries to connect with a reset password but the cache never expires the entry due to some Redis-related quirk (lag, misconfiguration, or a cplane error).
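As a rough illustration of point 1 (accepting a config at check time), here is a minimal leaky-bucket sketch; the type and field names are illustrative, not the proxy's actual ones:

```rust
use std::time::Instant;

/// Illustrative per-endpoint parameters; in the real system these would come
/// from the `GetEndpointAccessControl` response.
#[derive(Clone, Copy)]
struct LeakyBucketConfig {
    rps: f64,       // steady-state drain rate (requests per second)
    max_burst: f64, // bucket capacity
}

struct LeakyBucket {
    level: f64,
    last: Instant,
}

impl LeakyBucket {
    fn new() -> Self {
        Self { level: 0.0, last: Instant::now() }
    }

    /// Checks one request against the supplied (possibly per-endpoint) config.
    fn check(&mut self, config: LeakyBucketConfig, now: Instant) -> bool {
        // Drain the bucket for the time that has passed since the last check.
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.level = (self.level - elapsed * config.rps).max(0.0);
        self.last = now;

        if self.level + 1.0 > config.max_burst {
            return false; // over the limit
        }
        self.level += 1.0;
        true
    }
}
```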
## Problem If the wal receiver is cancelled, there's a 50% chance that it will ingest yet more WAL. ## Summary of Changes Always check cancellation first.
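The general pattern, as a sketch (the real walreceiver loop is not shown; `CancellationToken` is from `tokio_util`):

```rust
use tokio_util::sync::CancellationToken;

/// With `biased`, the cancellation branch is polled first, so a cancelled task
/// never ingests more work even when both futures are ready.
/// `next_wal_chunk` is a stand-in for the real ingest source.
async fn ingest_loop(
    cancel: CancellationToken,
    mut next_wal_chunk: tokio::sync::mpsc::Receiver<Vec<u8>>,
) {
    loop {
        tokio::select! {
            biased;
            _ = cancel.cancelled() => break,
            chunk = next_wal_chunk.recv() => {
                let Some(chunk) = chunk else { break };
                // ... ingest `chunk` ...
                let _ = chunk;
            }
        }
    }
}
```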
## Problem Imports can end up allocating too much. ## Summary of Changes Nerf them a bunch and add some logs.
## Problem The page service logic asserts that a tracing span is present with tenant/timeline/shard IDs. An initial gRPC page service implementation thus requires a tracing span. Touches #11728. ## Summary of changes Adds an `ObservabilityLayer` middleware that generates a tracing span and decorates it with IDs from the gRPC metadata. This is a minimal implementation to address the tracing span assertion. It will be extended with additional observability in later PRs.
## Problem We need gRPC support in Pagebench to benchmark the new gRPC Pageserver implementation. Touches #11728. ## Summary of changes Adds a `Client` trait to make the client transport swappable, and a gRPC client via a `--protocol grpc` parameter. This must also specify the connstring with the gRPC port: ``` pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051 ``` The client is implemented using the raw Tonic-generated gRPC client, to minimize client overhead.
A smaller version of #12066 that is somewhat easier to review. Now that I've been using https://crates.io/crates/top-type-sizes, I've found a lot more low-hanging fruit that can be tweaked to reduce memory usage. Some context for the optimisations: Rust's stack allocation in futures is quite naive. Stack variables, even if moved, often still end up taking space in the future. Rearranging the order in which variables are defined, and properly scoping them, can go a long way. `async fn` and `async move {}` always duplicate the "upvars" (aka captures): all captures are permanently allocated in the future, even if moved, so we should be mindful to capture as little as possible. TlsStream is massive and needs boxing so it doesn't contribute to the above issue.

## Measurements from `top-type-sizes`

### Before

```
10328 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}} align=8
6120 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```

### After

```
4040 {async block@proxy::proxy::task_main::{closure#0}::{closure#0}}
4704 {async fn body of proxy::proxy::handle_client<proxy::protocol2::ChainRW<tokio::net::TcpStream>>()} align=8
```
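For illustration, a self-contained example of why scoping matters for future size (not proxy code):

```rust
async fn tick() {}

// The 4 KiB scratch array is used after the await, so it must be stored in
// the future's state across that await point.
async fn unscoped() -> u64 {
    let scratch = [0u8; 4096];
    tick().await;
    scratch.iter().map(|&b| b as u64).sum()
}

// Scoping the scratch array so it is consumed before the await keeps it off
// the future's state; only the small result survives the await.
async fn scoped() -> u64 {
    let sum: u64 = {
        let scratch = [0u8; 4096];
        scratch.iter().map(|&b| b as u64).sum()
    };
    tick().await;
    sum
}

fn main() {
    // The first future carries the array (~4 KiB); the second carries a u64.
    println!("unscoped: {} bytes", std::mem::size_of_val(&unscoped()));
    println!("scoped:   {} bytes", std::mem::size_of_val(&scoped()));
}
```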
## Problem We should expose the page service over gRPC. Requires #12093. Touches #11728. ## Summary of changes This patch adds an initial page service implementation over gRPC. It ties in with the existing `PageServerHandler` request logic, to avoid the implementations drifting apart for the core read path. This is just a bare-bones functional implementation. Several important aspects have been omitted, and will be addressed in follow-up PRs: * Limited observability: minimal tracing, no logging, limited metrics and timing, etc. * Rate limiting will currently block. * No performance optimization. * No cancellation handling. * No tests. I've only done rudimentary testing of this, but Pagebench passes at least.
…12102) ## Problem fix #12101; this is a quick hack and we need a better API in the future. In `get_db_size`, we call `get_reldir_size` for every relation, but we also deserialize the reldir directory for every relation, which creates huge CPU overhead. ## Summary of changes Get and deserialize the reldir v1 key once and use it across all get_rel_size requests. --------- Signed-off-by: Alex Chi Z <[email protected]>
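A sketch of the hoisting pattern with stand-in types (not the pageserver's actual key/value API):

```rust
use std::collections::HashMap;

// Stand-ins for the pageserver's real types; the point is only that the
// expensive deserialization happens once, outside the per-relation loop.
type RelTag = u32;

struct RelDirectory {
    rels: Vec<RelTag>,
}

fn deserialize_reldir(_bytes: &[u8]) -> RelDirectory {
    // Stand-in for the real (expensive) deserialization step.
    RelDirectory { rels: Vec::new() }
}

// Before: each relation re-fetched and re-deserialized the reldir key.
// After: fetch and deserialize it once, then reuse it for every size lookup.
fn get_db_size(reldir_bytes: &[u8], rel_sizes: &HashMap<RelTag, u64>) -> u64 {
    let reldir = deserialize_reldir(reldir_bytes); // done once, not per relation
    reldir
        .rels
        .iter()
        .map(|rel| rel_sizes.get(rel).copied().unwrap_or(0))
        .sum()
}
```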
## Problem According to neondatabase/cloud#26005, we should use the full host name for computes, but currently a truncated host name is used. ## Summary of changes The URL for REMOTE_ONNX is rewritten using the FQDN.
## Problem We print a backtrace in an info level log every 10 seconds while waiting for the import data to land in the bucket. ## Summary of changes The backtrace is not useful. Remove it.
## Problem The gRPC `page_api` domain types used smallvecs to avoid heap allocations in the common case where a single page is requested. However, this is pointless: the Protobuf types use a normal vec, and converting a smallvec into a vec always causes a heap allocation anyway. ## Summary of changes Use a normal `Vec` instead of a `SmallVec` in `page_api` domain types.
## Problem Setting `max_batch_size` to anything higher than `Timeline::MAX_GET_VECTORED_KEYS` will cause a runtime error. We should instead fail fast at startup if this is the case. ## Summary of changes * Create `max_get_vectored_keys` as a new configuration option (defaulting to 32); * Validate `max_batch_size` against `max_get_vectored_keys` at config parsing and validation time. Closes #11994
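A minimal sketch of fail-fast validation at config load, assuming a serde/toml-based config (field names mirror the description; the real structs differ):

```rust
use serde::Deserialize;

#[derive(Deserialize)]
struct PageServiceConfig {
    #[serde(default = "default_max_get_vectored_keys")]
    max_get_vectored_keys: usize,
    max_batch_size: usize,
}

fn default_max_get_vectored_keys() -> usize {
    32
}

impl PageServiceConfig {
    fn validate(&self) -> Result<(), String> {
        if self.max_batch_size > self.max_get_vectored_keys {
            return Err(format!(
                "max_batch_size ({}) must not exceed max_get_vectored_keys ({})",
                self.max_batch_size, self.max_get_vectored_keys
            ));
        }
        Ok(())
    }
}

fn load(toml_str: &str) -> Result<PageServiceConfig, String> {
    let config: PageServiceConfig = toml::from_str(toml_str).map_err(|e| e.to_string())?;
    config.validate()?; // fail at startup instead of at request time
    Ok(config)
}
```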
## Problem Currently, `page_api` domain types validate message invariants both when converting Protobuf → domain and domain → Protobuf. This is annoying for clients, because they can't use stream combinators to convert streamed requests (needed for hot path performance), and also performs the validation twice in the common case. Blocks #12099. ## Summary of changes Only validate the Protobuf → domain type conversion, i.e. on the receiver side, and make domain → Protobuf infallible. This is where it matters -- the Protobuf types are less strict than the domain types, and receivers should expect all sorts of junk from senders (they're not required to validate anyway, and can just construct an invalid message manually). Also adds a missing `impl From<CheckRelExistsRequest> for proto::CheckRelExistsRequest`.
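A simplified sketch of the one-directional pattern, with hypothetical, heavily trimmed-down message fields:

```rust
mod proto {
    // Stand-in for the Protobuf-generated type, which is less strict.
    pub struct CheckRelExistsRequest {
        pub rel_kind: u32,
    }
}

// Domain type with a stricter invariant.
pub enum RelKind {
    Table,
    Index,
}

pub struct CheckRelExistsRequest {
    pub rel_kind: RelKind,
}

// Protobuf -> domain: fallible, validated on the receiver side.
impl TryFrom<proto::CheckRelExistsRequest> for CheckRelExistsRequest {
    type Error = String;

    fn try_from(p: proto::CheckRelExistsRequest) -> Result<Self, Self::Error> {
        let rel_kind = match p.rel_kind {
            0 => RelKind::Table,
            1 => RelKind::Index,
            other => return Err(format!("invalid rel_kind: {other}")),
        };
        Ok(Self { rel_kind })
    }
}

// Domain -> Protobuf: infallible, so senders can use plain stream combinators.
impl From<CheckRelExistsRequest> for proto::CheckRelExistsRequest {
    fn from(d: CheckRelExistsRequest) -> Self {
        Self {
            rel_kind: match d.rel_kind {
                RelKind::Table => 0,
                RelKind::Index => 1,
            },
        }
    }
}
```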
## Problem This is already the default in production and in our test suite. ## Summary of changes Set the default proto to interpreted to reduce friction when spinning up new regions or cells.
JsonResponse::error() properly logs an error message which can be read in the compute logs. invalid_status() was not going through that helper function, thus not logging anything. Signed-off-by: Tristan Partin <[email protected]>
## Problem Part of #11813 ## Summary of changes Add a counter for the feature evaluation outcome; we will set up alerts for too many failed evaluations in the future. Signed-off-by: Alex Chi Z <[email protected]>
…#11820) Adds a `manifest.yml` file that contains the default settings for compute. Currently, it comes from cplane code [here](https://github.com/neondatabase/cloud/blob/0cda3d4b014765c4e2e551a1d56efc095b004e77/goapp/controlplane/internal/pkg/compute/computespec/pg_settings.go#L110). Related RFC: https://github.com/neondatabase/neon/blob/main/docs/rfcs/038-independent-compute-release.md Related Issue: neondatabase/cloud#11698
#12109) ## Problem Between adding the TLS config for compute-ctl and adding the TLS config in the control plane, we switched from using a provision flag to a bind flag. This happened to work in all of my testing in preview regions, as they have no VM pool, so each bind was also a provision. However, in staging I found that the TLS config is still only processed during provision, even though it is only sent on bind. ## Summary of changes * Add a new feature flag value, `tls_experimental`, which tells postgres/pgbouncer/local_proxy to use the TLS certificates on bind. * compute_ctl is told where the certificates are on provision, instead of being told on bind.
`safekeepers_cmp` was added by #8840 to make changes of the safekeeper set order-independent: a `sk1,sk2,sk3` specifier changed to `sk3,sk1,sk2` should not cause a walproposer restart. However, this check didn't support generations: it would see the `g#123:` prefix as part of the first safekeeper in the list, so if the first safekeeper changed, it would also restart the walproposer. Therefore, parse the generation properly so that it is no longer treated as part of the first safekeeper. This PR doesn't add a specific test, but I have confirmed locally that `test_safekeepers_reconfigure_reorder` is fixed with the changes of PR #11712 applied, thanks to this PR. Part of #11670
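A sketch of the intended parsing, with the `g#<n>:` prefix format inferred from the description (not the walproposer's actual code):

```rust
/// Splits an optional `g#<n>:` generation prefix off a safekeeper list
/// specifier, so the member comparison is not polluted by it.
fn parse_safekeeper_list(spec: &str) -> (Option<u32>, Vec<&str>) {
    let (generation, members) = match spec.strip_prefix("g#") {
        Some(rest) => match rest.split_once(':') {
            Some((gen, members)) => (gen.parse::<u32>().ok(), members),
            None => (None, spec),
        },
        None => (None, spec),
    };
    (generation, members.split(',').collect())
}

fn same_safekeeper_set(a: &str, b: &str) -> bool {
    let (gen_a, mut sks_a) = parse_safekeeper_list(a);
    let (gen_b, mut sks_b) = parse_safekeeper_list(b);
    sks_a.sort_unstable();
    sks_b.sort_unstable();
    // Order-independent comparison of members; generations compared separately.
    gen_a == gen_b && sks_a == sks_b
}

fn main() {
    // Reordering the members alone should not look like a change.
    assert!(same_safekeeper_set("g#123:sk1,sk2,sk3", "g#123:sk3,sk1,sk2"));
}
```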
## Problem Investigate crash of online_advisor in image check ## Summary of changes --------- Co-authored-by: Konstantin Knizhnik <[email protected]>
## Problem I believe all environments now specify proxy-protocol V2 as either required or rejected; we no longer rely on the "supported" flow. This means we no longer need to keep around the bytes we read in case they are not part of a header. While I designed ChainRW to be fast (the hot path with an empty buffer is very easy to branch-predict), it's still unnecessary. ## Summary of changes * Remove the ChainRW wrapper. * Refactor how we read the proxy-protocol header using read_exact. Slightly worse perf, but it's hardly significant. * Don't try to parse the header if it's rejected.
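A sketch of the read_exact-based header read (simplified error handling; not the proxy's actual code). The layout follows the PROXY protocol v2 spec: a 12-byte signature, version/command and family bytes, then a 2-byte big-endian length:

```rust
use std::io::{self, Read};

const SIGNATURE: [u8; 12] = [
    0x0D, 0x0A, 0x0D, 0x0A, 0x00, 0x0D, 0x0A, 0x51, 0x55, 0x49, 0x54, 0x0A,
];

fn read_proxy_header<R: Read>(stream: &mut R) -> io::Result<Vec<u8>> {
    // Fixed prefix: 12-byte signature + ver/cmd + family + 2-byte length.
    let mut fixed = [0u8; 16];
    stream.read_exact(&mut fixed)?;
    if fixed[..12] != SIGNATURE {
        // With "required", a missing header is simply an error; we no longer
        // need to hand the already-read bytes back to a later parser.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "missing PROXY v2 header"));
    }
    let len = u16::from_be_bytes([fixed[14], fixed[15]]) as usize;
    let mut addresses = vec![0u8; len];
    stream.read_exact(&mut addresses)?;
    Ok(addresses)
}
```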
## Problem Part of #11813 In PostHog UI, we need to create the properties before using them as a filter. We report all variants automatically when we start the pageserver. In the future, we can report all real tenants instead of fake tenants (we do that now to save money + we don't need real tenants in the UI). ## Summary of changes * Collect `region`, `availability_zone`, `pageserver_id` properties and use them in the feature evaluation. * Report 10 fake tenants on each pageserver startup. --------- Signed-off-by: Alex Chi Z <[email protected]>
Split the modules responsible for passing data and connecting to compute from auth and waking for PGLB. This PR just moves files. The waking is going to get removed from pglb after this.
## Problem It will be useful to understand what kind of queries our clients execute. One of the most important characteristics of a query is its execution time; at the very least, it allows distinguishing OLAP from OLTP queries. Monitoring query execution time can also help detect performance problems (assuming the workload is more or less stable). ## Summary of changes Add a query execution time histogram. --------- Co-authored-by: Konstantin Knizhnik <[email protected]>
## Problem This PR is part of the larger computes support activity: https://www.notion.so/neondatabase/Larger-computes-114f189e00478080ba01e8651ab7da90 Epic: neondatabase/cloud#19010 In case of a planned node restart, we are going to 1. create a new read-only replica 2. capture the LFC state at the primary 3. use this state to prewarm the replica 4. stop the old primary 5. promote the replica to primary Steps 1-3 are currently implemented and supported on the compute side. This PR provides the compute-level implementation of replica promotion. ## Summary of changes Right now replica promotion is done in three steps: 1. Set the safekeepers list (currently empty for a replica) 2. Call `pg_promote()` to promote the replica 3. Update the endpoint settings so that it is no longer treated as a replica Perhaps all three steps should be done by a single function in compute_ctl, but right now this logic is only implemented in the test. Postgres submodule PRs: neondatabase/postgres#648 neondatabase/postgres#649 neondatabase/postgres#650 neondatabase/postgres#651 --------- Co-authored-by: Matthias van de Meent <[email protected]> Co-authored-by: Konstantin Knizhnik <[email protected]>
## Problem We support two ingest protocols on the pageserver: vanilla and interpreted. Interpreted has been the only protocol in use for a long time. ## Summary of changes * Remove the ingest handling of the vanilla protocol * Remove tenant and pageserver configuration for it * Update all tests that tweaked the ingest protocol ## Compatibility Backward compatibility: * The new pageserver version can read the existing pageserver configuration and will ignore the unknown field. * When the tenant config is read from the storcon db or from the pageserver disk, the extra field will be ignored. Forward compatibility: * Both the pageserver config and the tenant config map missing fields to their default values. I'm not aware of any tenant-level override that was made for this knob.
## Problem The new gRPC page service protocol supports client-side batches. The current libpq protocol only does best-effort server-side batching. To compare these approaches, Pagebench should support submitting contiguous page batches, similar to how Postgres will submit them (e.g. with prefetches or vectored reads). ## Summary of changes Add a `--batch-size` parameter specifying the size of contiguous page batches. One batch counts as 1 RPS and 1 queue depth. For the libpq protocol, a batch is submitted as individual requests and we rely on the server to batch them for us. This will give a realistic comparison of how these would be processed in the wild (e.g. when Postgres sends 100 prefetch requests). This patch also adds some basic validation of responses.
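Combined with the gRPC flags shown earlier, an invocation might look like this (the flag combination and batch size are assumptions, not taken from the PR):

```
pagebench get-page-latest-lsn --protocol grpc --page-service-connstring grpc://localhost:51051 --batch-size 32
```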
## Problem The same problem, fixed in #11857, but for the image `neon-extesions-test` ## Summary of changes The config file was added to use our library
…-safekeepers (#12143) ## Problem - `test_basebackup_cache` fails in #11712 because after the timelines on safekeepers are managed by storage controller, they do contain proper start_lsn and the compute_ctl tool sends the first basebackup request with this LSN. - `Failed to prepare basebackup` log messages during timeline initialization, because the timeline is not yet in the global timeline map. - Relates to neondatabase/cloud#29353 ## Summary of changes - Account for `timeline_onto_safekeepers` storcon's option in the test. - Do not trigger basebackup prepare during the timeline initialization.
## Problem After introducing a naive downtime calculation for the Postgres process inside compute in #11346, I noticed that some computes regularly report short downtime. After checking some particular cases, it looks like all of them report downtime close to the end of the compute's life, i.e., when the control plane calls `/terminate` and we are waiting for Postgres to exit. The compute monitor also produces a lot of error logs because Postgres stops accepting connections, but that's expected during the termination process. ## Summary of changes Regularly check the compute status inside the main compute monitor loop and exit gracefully when the compute is in some terminal or soon-to-be-terminal state. --------- Co-authored-by: Tristan Partin <[email protected]>
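A sketch of the described loop shape (the status values and helper names are assumptions, not compute_ctl's actual types):

```rust
use std::time::Duration;

#[derive(Clone, Copy, PartialEq)]
enum ComputeStatus {
    Running,
    TerminationPending,
    Terminated,
    Failed,
}

fn is_terminal_or_terminating(status: ComputeStatus) -> bool {
    matches!(
        status,
        ComputeStatus::TerminationPending | ComputeStatus::Terminated | ComputeStatus::Failed
    )
}

fn monitor_loop(
    get_status: impl Fn() -> ComputeStatus,
    check_postgres: impl Fn() -> Result<(), String>,
) {
    loop {
        // Check the compute status first: once termination has started,
        // connection errors from Postgres are expected, not downtime.
        if is_terminal_or_terminating(get_status()) {
            return; // exit the monitor gracefully
        }
        if let Err(e) = check_postgres() {
            eprintln!("compute monitor: postgres check failed: {e}");
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```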
neon_local's timeline import subcommand creates timelines manually, but doesn't create them on the safekeepers. If a test then tries to open an endpoint to read from the timeline, it will error in the new world with `--timelines-onto-safekeepers`. Therefore, if that flag is enabled, create the timelines on the safekeepers. Note that this import functionality is different from the fast import feature (#10188, #11801). Part of #11670 As well as part of #11712
Add a test to ensure timeline metrics are fully cleaned up after offloading. Signed-off-by: Alex Chi Z <[email protected]>
## Problem Makes it easier to debug AWS permission issues (i.e., storage scrubber) ## Summary of changes Install awscliv2 into the docker image. Signed-off-by: Alex Chi Z <[email protected]>
…#12096) ## Problem Removed nodes can re-add themselves on restart if not properly tombstoned. We need a mechanism (e.g. a soft-delete flag) to prevent this, especially in cases where the node is unreachable. More details in #12036. ## Summary of changes - Introduced a `NodeLifecycle` enum to represent node lifecycle states. - Added a string representation of `NodeLifecycle` to the `nodes` table. - Implemented node removal using a tombstone mechanism. - Introduced `/debug/v1/tombstone*` handlers to manage the tombstone state.
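A sketch of what such a lifecycle enum and its string mapping could look like (variant and string names are assumptions):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum NodeLifecycle {
    Active,
    Deleted, // tombstone: the node may not re-register itself
}

impl NodeLifecycle {
    // String form suitable for persisting in the `nodes` table.
    fn as_str(&self) -> &'static str {
        match self {
            NodeLifecycle::Active => "active",
            NodeLifecycle::Deleted => "deleted",
        }
    }

    fn from_str(s: &str) -> Option<Self> {
        match s {
            "active" => Some(NodeLifecycle::Active),
            "deleted" => Some(NodeLifecycle::Deleted),
            _ => None,
        }
    }
}

// On node re-registration, a tombstoned row blocks the re-add.
fn may_register(lifecycle: NodeLifecycle) -> bool {
    lifecycle != NodeLifecycle::Deleted
}
```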
## Problem PGLB/Neonkeeper needs to separate the concerns of connecting to compute and authenticating to compute. Additionally, the code within `connect_to_compute` is rather messy, spending effort on recovering the authentication info after wake_compute. ## Summary of changes Split `ConnCfg` into `ConnectInfo` and `AuthInfo`. `wake_compute` only returns `ConnectInfo`, and `AuthInfo` is determined separately from the `handshake`/`authenticate` process. Additionally, `ConnectInfo::connect_raw` is in charge of establishing the TLS connection, and `postgres_client::Config::connect_raw` is configured to use `NoTls`, which forces it to skip TLS negotiation. This should just work.
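A structural sketch of the split (fields and stub bodies are illustrative only):

```rust
// Connection concerns: where to connect, no credentials involved.
struct ConnectInfo {
    host: String,
    port: u16,
}

// Authentication concerns: credentials applied after the connection exists.
struct AuthInfo {
    user: String,
    password: Option<String>,
    db_name: String,
}

struct ComputeConnection; // stand-in for the established raw connection

impl ConnectInfo {
    // Establishes the TCP/TLS connection.
    fn connect_raw(&self) -> Result<ComputeConnection, String> {
        let _addr = format!("{}:{}", self.host, self.port);
        Ok(ComputeConnection)
    }
}

impl AuthInfo {
    // Authenticates over an already-established connection, so wake_compute
    // only needs to produce a ConnectInfo.
    fn authenticate(&self, _conn: &mut ComputeConnection) -> Result<(), String> {
        Ok(())
    }
}
```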
skyzh approved these changes on Jun 6, 2025.
This release will not go into prod; we are still creating a release so that we can deploy it to preprod and fix any surprises before the real release next week.