PostgreSQL not keeping most things cached between queries

Question

I have a PostgreSQL table with millions of rows. The table is around 130GB.

I'm filtering this table on a tsvector column. The tsvector column has a GIN index. The index is around 6GB.

Indexes:
    "packages_text_search_vector_idx" gin (text_search_vector)

The following query is very slow.

SELECT
 name
FROM
  packages
WHERE
  text_search_vector @@ to_tsquery('value');

After using EXPLAIN (ANALYZE, BUFFERS), it appears that this query is slow due to the amount of data that is being retrieved from disk, rather than from the cache.

                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=19651.30..6514076.08 rows=2297168 width=504) (actual time=443.159..8388.212 rows=2293550 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=348 read=2094387
   I/O Timings: read=4294.699
   ->  Parallel Bitmap Heap Scan on packages  (cost=18651.30..6283359.28 rows=957153 width=504) (actual time=441.121..7950.227 rows=764517 loops=3)
         Recheck Cond: (text_search_vector @@ to_tsquery('value'::text))
         Rows Removed by Index Recheck: 2009576
         Heap Blocks: exact=5371 lossy=676087
         Buffers: shared hit=348 read=2094387
         I/O Timings: read=4294.699
         ->  Bitmap Index Scan on packages_text_search_vector_idx  (cost=0.00..18077.01 rows=2297168 width=0) (actual time=432.139..432.140 rows=2293550 loops=1)
               Index Cond: (text_search_vector @@ to_tsquery('value'::text))
               Buffers: shared hit=4 read=617
               I/O Timings: read=2.083
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.147 ms
 Execution Time: 8530.372 ms
(19 rows)

I would expect Postgres to leave this data buffered/cached after the query is finished, so that subsequent executions of the same query would be much faster. However, this doesn't seem to be the case. The query takes about the same amount of time each time I run it, and the Buffers: line in the EXPLAIN output still says it is performing around 2 millions shared buffer reads each time.

Am I misunderstanding something? Or are there some parameters that I should tune to get better performance? This is running on an AWS RDS instance with 64GB of RAM.

work_mem: 4MB
shared_buffers: 16284208kB (~16GB)
effective_cache_size: 32568416kB (~32GB)

jjanes · Accepted Answer · 2023-10-11 02:38:51Z

The total number of buffers accessed in the query is so close to the setting of shared_buffers, that I suspect this was maliciously constructed.

When running highly stylized queries like this which are designed to almost exactly fill shared_buffers, it is pretty easy to get into a situation where a buffer is evicted immediately before it would have been accessed again, leading to "pessimal" performance of the buffer cache. I suspect this is what is happening here.

Usually the OS cache would prevent this from being an actual performance problem as the shared_buffers misses would still be hits at the OS level, and why it doesn't do that here is an RDS specific question rather than a community PostgreSQL question.

Note that only half the time is actually spent in IO. I suspect that quite a bit of the rest of going to rechecking tuples, which is needed because your work_mem is much too low for this type of query causing the bitmap to overflow and go lossy.

Stack Exchange Network

PostgreSQL not keeping most things cached between queries

1 Answer 1

Hot Network Questions

PostgreSQL not keeping most things cached between queries

1 Answer 1

Related

Hot Network Questions