I have a PostgreSQL table with millions of rows. The table is around 130GB.
I'm filtering this table on a tsvector column. The tsvector column has a GIN index. The index is around 6GB.
Indexes:
"packages_text_search_vector_idx" gin (text_search_vector)
The following query is very slow.
SELECT
name
FROM
packages
WHERE
text_search_vector @@ to_tsquery('value');
After using EXPLAIN (ANALYZE, BUFFERS), it appears that this query is slow due to the amount of data that is being retrieved from disk, rather than from the cache.
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=19651.30..6514076.08 rows=2297168 width=504) (actual time=443.159..8388.212 rows=2293550 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=348 read=2094387
I/O Timings: read=4294.699
-> Parallel Bitmap Heap Scan on packages (cost=18651.30..6283359.28 rows=957153 width=504) (actual time=441.121..7950.227 rows=764517 loops=3)
Recheck Cond: (text_search_vector @@ to_tsquery('value'::text))
Rows Removed by Index Recheck: 2009576
Heap Blocks: exact=5371 lossy=676087
Buffers: shared hit=348 read=2094387
I/O Timings: read=4294.699
-> Bitmap Index Scan on packages_text_search_vector_idx (cost=0.00..18077.01 rows=2297168 width=0) (actual time=432.139..432.140 rows=2293550 loops=1)
Index Cond: (text_search_vector @@ to_tsquery('value'::text))
Buffers: shared hit=4 read=617
I/O Timings: read=2.083
Planning:
Buffers: shared hit=1
Planning Time: 0.147 ms
Execution Time: 8530.372 ms
(19 rows)
I would expect Postgres to leave this data buffered/cached after the query is finished, so that subsequent executions of the same query would be much faster. However, this doesn't seem to be the case. The query takes about the same amount of time each time I run it, and the Buffers: line in the EXPLAIN output still says it is performing around 2 millions shared buffer reads each time.
Am I misunderstanding something? Or are there some parameters that I should tune to get better performance? This is running on an AWS RDS instance with 64GB of RAM.
work_mem: 4MB
shared_buffers: 16284208kB (~16GB)
effective_cache_size: 32568416kB (~32GB)