Skip to content

Download SLRU segments on demand #6151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Jan 31, 2024
Merged

Download SLRU segments on demand #6151

merged 33 commits into from
Jan 31, 2024

Conversation

knizhnik
Copy link
Contributor

Problem

See https://github.com/neondatabase/cloud/issues/8673

Summary of changes

Download missed SLRU segments from page server

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist
@knizhnik knizhnik requested review from a team as code owners December 15, 2023 14:53
@knizhnik knizhnik requested review from lubennikovaav and hlinnaka and removed request for a team December 15, 2023 14:53
Copy link

github-actions bot commented Dec 15, 2023

2376 tests run: 2263 passed, 0 failed, 113 skipped (full report)


Flaky tests (10)

Postgres 16

  • test_ts_of_lsn_api: debug
  • test_gc_cutoff: release
  • test_empty_branch_remote_storage_upload: debug

Postgres 15

  • test_crafted_wal_end[last_wal_record_crossing_segment]: release
  • test_statvfs_pressure_usage: debug
  • test_gc_cutoff: release

Postgres 14

  • test_statvfs_pressure_min_avail_bytes: debug
  • test_lfc_resize: debug
  • test_emergency_mode: debug
  • test_wal_restore_initdb: release

Code coverage (full report)

  • functions: 54.4% (11203 of 20582 functions)
  • lines: 81.5% (63000 of 77310 lines)

The comment gets automatically updated with the latest test results
4e0f67a at 2024-01-31T18:59:01.514Z :recycle:
@knizhnik knizhnik force-pushed the ondemand_slru_download branch from daa8847 to d958870 Compare December 16, 2023 09:04
@knizhnik
Copy link
Contributor Author

Open questions:

  1. Should we always download SLRU segments on demand or only if it is large enough? Will on-demand downloading of SLRU segments have some negative impact on performance and PS load? I suspect long startup time with huge CLOG/multixacts is caused not by sending/extracting basebackup, but by creating this basebackup at PS. I do not know some fast way to calculate SLRU size except iterating all its segments (which is the same as extracting them). For CLOG it may be possible to remember last truncate LSN and compare it with last record LSN. But it is less clear what to do with multixacts (which according to reports in pgsql conference also can be large enough).
  2. How to determine request LSN for SLRU segment? Right now last flush LSN is used. But actually we need LSN of basebackup. But do not know how to get it at compute. Requesting last flush LSN can cause unnecessary waits fort LSN at PS.
@knizhnik knizhnik force-pushed the ondemand_slru_download branch from bde5da5 to fbbf134 Compare December 16, 2023 20:23
@knizhnik
Copy link
Contributor Author

I used the following script to generate a lot of transactions and tuple versions (with different xmins):

CREATE TABLE t (x integer PRIMARY KEY, x integer);
INSERT INTO t VALUES (1, 0);
CREATE PROCEDURE updating() $$
DECLARE
  i integer;
BEGIN
FOR i IN 1..100000000 LOOP
  UPDATE t SET x = x + 1 WHERE pk=1;
  COMMIT;
END LOOP;
END
$$ LANGUAGE plpgsql;

Size of pg_xact directory is 25Mb.
Restart of compute takes 0.367 seconds.
Subsequent iteration through the whole table:

select sum(x) from t;
Time: 279.548 ms

So it seems to be fast enough, although subsequent execution takes just 0.666 ms.
I do not think that it is because of caching data in shared buffers, because of ring buffer policy.

@knizhnik
Copy link
Contributor Author

Many tests are failed in this PR because of two reasons:

  1. check_restored_datadir_content found differences in SLRUs. It obviously explained by on-demand download. Should we just filter this SLRU files?
  2. Looks like at read-only replicas GetFlushRecPtr() returns 0/0 LSN. So the question how to calculate LSN for retrieving SLRUs becomes even more critical.
@knizhnik knizhnik force-pushed the ondemand_slru_download branch 4 times, most recently from e476e3c to a5daf3a Compare December 22, 2023 06:43
@knizhnik knizhnik force-pushed the ondemand_slru_download branch 2 times, most recently from ba1b96a to de822ea Compare December 26, 2023 12:34
Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The patch to postgres can be made a little smaller, if you remove the new 'SlruCtlData.kind' field. You can pass the path to smgr_read_slru_segment() instead, and have the extension figure out which SLRU the path corresponds to.

@knizhnik
Copy link
Contributor Author

Please do that as a separate PR

It is already merged.

@knizhnik knizhnik force-pushed the ondemand_slru_download branch from 98c3ce8 to de7a71a Compare December 27, 2023 17:03
@knizhnik
Copy link
Contributor Author

The patch to postgres can be made a little smaller, if you remove the new 'SlruCtlData.kind' field. You can pass the path to smgr_read_slru_segment() instead, and have the extension figure out which SLRU the path corresponds to.

Done

Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on the postgres parts at neondatabase/postgres#331 (review)

@knizhnik knizhnik force-pushed the ondemand_slru_download branch 2 times, most recently from d79abb4 to f4a1c9f Compare January 17, 2024 12:39
@knizhnik knizhnik force-pushed the ondemand_slru_download branch from 2670eaf to 6ae587e Compare January 29, 2024 08:08
@knizhnik
Copy link
Contributor Author

To be able to merge this PR I also need correspondent Postgres PR's too be approved and merged:
neondatabase/postgres#331
neondatabase/postgres#332
neondatabase/postgres#333

@knizhnik knizhnik merged commit 9a9d9be into main Jan 31, 2024
@knizhnik knizhnik deleted the ondemand_slru_download branch January 31, 2024 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants