Join performance in postgres

Question

I have 2 tables scheduling_flownode and xact_message with weak relation between them. I am trying to execute the following query

set search_path='ad_96d5be';
explain analyze 
SELECT f.id, f.target_object_id 
FROM "scheduling_flownode" f, 
     "xact_message" m 
where f.target_object_id = m.id 
and f.root_node=True 
AND f.state=1 
and m.state=4 
and m.templatelanguage_id IN (17, 18, 19, 20, 21, 22, 23, 24);

On execution, I get the following query plan

  Gather  (cost=252701.26..1711972.04 rows=374109 width=8) (actual time=17737.908..164181.063 rows=441130 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=35705 read=1346425, temp read=18190 written=18148
   ->  Hash Join  (cost=251701.26..1673561.14 rows=155879 width=8) (actual time=18805.587..163991.468 rows=147043 loops=3)
         Hash Cond: (f.target_object_id = m.id)
         Buffers: shared hit=35705 read=1346425, temp read=18190 written=18148
         ->  Parallel Bitmap Heap Scan on scheduling_flownode f  (cost=124367.21..1523127.76 rows=2061083 width=8) (actual time=963.910..155466.840 rows=1642157 loops=3)
               Recheck Cond: (state = 1)
               Rows Removed by Index Recheck: 44
               Filter: root_node
               Rows Removed by Filter: 12406874
               Heap Blocks: exact=10570 lossy=427078
               Buffers: shared read=1328631
               ->  Bitmap Index Scan on "root-node-and-state"  (cost=0.00..123130.57 rows=4946600 width=0) (actual time=955.044..955.045 rows=4926472 loops=1)
                     Index Cond: ((root_node = true) AND (state = 1))
                     Buffers: shared read=13464
         ->  Hash  (cost=120677.64..120677.64 rows=405712 width=4) (actual time=7124.131..7124.131 rows=441128 loops=3)
               Buckets: 131072  Batches: 8  Memory Usage: 2966kB
               Buffers: shared hit=35591 read=17793, temp written=3384
               ->  Bitmap Heap Scan on xact_message m  (cost=7893.56..120677.64 rows=405712 width=4) (actual time=61.307..6925.456 rows=441128 loops=3)
                     Recheck Cond: (state = 4)
                     Filter: (templatelanguage_id = ANY ('{17,18,19,20,21,22,23,24}'::integer[]))
                     Rows Removed by Filter: 4
                     Heap Blocks: exact=16585
                     Buffers: shared hit=35591 read=17793
                     ->  Bitmap Index Scan on "state-index"  (cost=0.00..7792.13 rows=421826 width=0) (actual time=58.781..58.781 rows=441132 loops=3)
                           Index Cond: (state = 4)
                           Buffers: shared hit=2420 read=1209
 Planning time: 1.382 ms
 Execution time: 164289.481 ms
(31 rows)

scheduling_flownode here has over 400,00,000 entries and xact_message has about 50,00,000 rows. Working on postgres 10, am I wrong to believe that this much load should be handled easily by postgres ? If it can, am I doing something wrong with the query here ?

The statistics are way off. Does running analyze scheduling_flownode; change anything? — user330315
– user330315, Commented Jul 4, 2019 at 14:51
It depends on your hardware, your other load, and your indexing. — Joel Coehoorn
– Joel Coehoorn, Commented Jul 4, 2019 at 15:04
Hardware includes AWS RDS server 8GB RAM, 200 GB Memory and indexes are as per the query plan. There's absolutely no other load on the system. Single query running — Varsha Teckchandani
– Varsha Teckchandani, Commented Jul 4, 2019 at 15:06
\d scheduling_flownode, \d xact_message are needed. I would say that target_object_id should be indexed, also your execution plan looks strange - what is (root_node AND (date_part('month'::text, created) = '6'::double precision))??? Please call explain once again, with provided query. — Michał Zaborowski
– Michał Zaborowski, Commented Jul 4, 2019 at 15:16
Updated the comment to remove datepart. Sorry for the confusion. Just been trying too many things. Also, there's an index on target_object_id already but postgres isnt using it — Varsha Teckchandani
– Varsha Teckchandani, Commented Jul 4, 2019 at 15:25

isapir · Accepted Answer · 2019-07-05 03:01:29Z

2

You didn't show what indexes you have, but I strongly recommend that your indexes will cover all of the columns that you are filtering on.

In Postgres 11 that can be done by using COVERING indexes, so for example on table scheduling_flownode you will have an index like:

CREATE INDEX ix_scheduling_flownode_target_object_id 
  ON scheduling_flownode(target_object_id) 
    INCLUDE (state, root_node);

In Postgres 10 simply include the columns in the index:

CREATE INDEX ix_scheduling_flownode_target_object_id 
  ON scheduling_flownode(target_object_id, state, root_node);

Do the same for table xact_message with templatelanguage_id and state.

answered Jul 5, 2019 at 3:01

isapir

24k16 gold badges127 silver badges123 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Varsha Teckchandani Over a year ago

We added this query but it seems that postgres doesnt use it while querying

Collectives™ on Stack Overflow

Join performance in postgres

1 Answer 1

1 Comment

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Related