
So I have a query, and when I run it on the worst-case scenario it takes 10-12 minutes. If I remove the time check from the WHERE clause it goes down to 20-30 seconds, so I was wondering how to optimize it.

I tried adding an index on the conversion of the timestamp to time, but it didn't really help. The rs table (register_status) has over 70 million rows, register_date around 280k, and the cp one less than 1k.
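
The expression index I tried looked roughly like this (the index name is just illustrative):

-- expression index on the time-of-day part of the timestamp (illustrative name)
CREATE INDEX idx_register_date_time ON register_date ((date::time));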

The idea of the query is to get the counts for each CP (charge point), grouped by status, over a range of dates and within a time-of-day range. The case below is the worst-case scenario: the date range starts at the first date in the database and the user selects the whole day as the time range. The query is the following:

explain analyze SELECT
    COUNT(rs.status) filter (where rs.status = 'Occ') as total_occ,
    COUNT(rs.status) filter (where rs.status = 'Part') as total_part,
    COUNT(rs.status) filter (where rs.status = 'OOS') as total_oos,
    COUNT(rs.status) filter (where rs.status = 'OOC') as total_ooc,
    cp.id as charge_point_id,
    cp.address,
    cp.type as charge_point_type,
    cp.latitude,
    cp.longitude 
FROM register_date rd 
    inner join register_status rs on rs.fk_register_date = rd.id
    inner join charge_point cp on cp.id = rs.fk_charge_point
WHERE 
 rd.date::date >= '2016-11-01' and rd.date::date <= '2019-08-01'
 AND
 rd.date::time >= time '00:00' AND rd.date::time <= time '23:59'
group by cp.id

And the EXPLAIN ANALYZE result is the following; I can see a lot of disk usage in the sort...

"Finalize GroupAggregate  (cost=34412.78..34536.10 rows=780 width=124) (actual time=689440.380..699740.172 rows=813 loops=1)"
"  Group Key: cp.id"
"  ->  Gather Merge  (cost=34412.78..34519.27 rows=722 width=124) (actual time=689421.445..699736.996 rows=1579 loops=1)"
"        Workers Planned: 1"
"        Workers Launched: 1"
"        ->  Partial GroupAggregate  (cost=33412.77..33438.04 rows=722 width=124) (actual time=649515.576..659674.461 rows=790 loops=2)"
"              Group Key: cp.id"
"              ->  Sort  (cost=33412.77..33414.57 rows=722 width=96) (actual time=649496.720..654001.697 rows=24509314 loops=2)"
"                    Sort Key: cp.id"
"                    Sort Method: external merge  Disk: 2649104kB"
"                    Worker 0:  Sort Method: external merge  Disk: 2652840kB"
"                    ->  Nested Loop  (cost=0.56..33378.49 rows=722 width=96) (actual time=1.343..504948.423 rows=24509314 loops=2)"
"                          ->  Parallel Seq Scan on register_date rd  (cost=0.00..6443.69 rows=4 width=4) (actual time=0.021..294.724 rows=139760 loops=2)"
"                                Filter: (((date)::date >= '2016-11-01'::date) AND ((date)::date <= '2019-08-01'::date) AND ((date)::time without time zone >= '00:00:00'::time without time zone) AND ((date)::time without time zone <= '23:59:00'::time without time zone))"
"                          ->  Nested Loop  (cost=0.56..6725.90 rows=780 width=100) (actual time=0.077..3.574 rows=175 loops=279519)"
"                                ->  Seq Scan on charge_point cp  (cost=0.00..21.80 rows=780 width=92) (actual time=0.002..0.077 rows=813 loops=279519)"
"                                ->  Index Only Scan using register_status_fk_charge_point_fk_register_date_status_key on register_status rs  (cost=0.56..8.58 rows=1 width=12) (actual time=0.004..0.004 rows=0 loops=227248947)"
"                                      Index Cond: ((fk_charge_point = cp.id) AND (fk_register_date = rd.id))"
"                                      Heap Fetches: 49018627"
"Planning Time: 0.506 ms"
"Execution Time: 700065.010 ms"
1 Comment
  • So you have a timestamp column called date. Commented Dec 4, 2019 at 22:40

2 Answers


This might be faster with a lateral join:

SELECT cp.*, rd.*
FROM charge_point cp CROSS JOIN LATERAL
     (SELECT COUNT(*) filter (where rs.status = 'Occ') as total_occ,
             COUNT(*) filter (where rs.status = 'Part') as total_part,
             COUNT(*) filter (where rs.status = 'OOS') as total_oos,
             COUNT(*) filter (where rs.status = 'OOC') as total_ooc
      FROM register_date rd JOIN
           register_status rs 
           ON rs.fk_register_date = rd.id
      WHERE cp.id = rs.fk_charge_point AND
            rd.date >= '2016-11-01' and
             rd.date < date '2019-08-01' + interval '1 day'
     ) rd;

Indexes on register_status(fk_charge_point, fk_register_date, status) and register_date(id, date) are recommended.
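
In CREATE INDEX form, that is roughly the following (index names are illustrative; per the execution plan above, register_status already has a suitable composite key on those columns):

CREATE INDEX idx_rs_cp_rd_status ON register_status (fk_charge_point, fk_register_date, status);
CREATE INDEX idx_rd_id_date ON register_date (id, date);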

Note that I changed the date comparisons so they are more index-friendly. I see no reason to filter by time, so I removed those conditions.
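
For comparison, here is the cast-based predicate from the question versus the rewritten one (only the date part shown):

-- original form: the cast prevents a plain index on register_date(date) from being used
WHERE rd.date::date >= '2016-11-01' AND rd.date::date <= '2019-08-01'

-- rewritten form: compares the raw column, so a btree index on (date) can be used
WHERE rd.date >= date '2016-11-01' AND rd.date < date '2019-08-01' + interval '1 day'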


3 Comments

Note: ) cp; [plus: I don't think you need the LATERAL here; just not splitting the timestamps would suffice, IMO]
I do need the time filter, because the user might want only the results where the time of day is between 5am and 6pm, for example. As I said, I'm running the query on the worst-case scenario, where the user selects from 00:00 to 23:59, which wouldn't make much sense but is still possible.
@user3107720 . . . Then add the condition back in. The lateral join should still use the index for the charge point and date.
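
For example, keeping the structure of the answer above, the time condition (using the 5am-6pm range from the comment) could go back into the lateral subquery roughly like this (a sketch, not tested):

SELECT cp.*, rd.*
FROM charge_point cp CROSS JOIN LATERAL
     (SELECT COUNT(*) filter (where rs.status = 'Occ') as total_occ,
             COUNT(*) filter (where rs.status = 'Part') as total_part,
             COUNT(*) filter (where rs.status = 'OOS') as total_oos,
             COUNT(*) filter (where rs.status = 'OOC') as total_ooc
      FROM register_date rd JOIN
           register_status rs
           ON rs.fk_register_date = rd.id
      WHERE cp.id = rs.fk_charge_point AND
            rd.date >= date '2016-11-01' AND
            rd.date < date '2019-08-01' + interval '1 day' AND
            rd.date::time >= time '05:00' AND
            rd.date::time <= time '18:00'
     ) rd;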

I used Gordon's approach to develop a new query and it ended up being way faster, from 10-12 min to 20-40 seconds:

SELECT cp.*, rd.* from charge_point cp cross join lateral
    (select
        COUNT(rs.status) filter (where rs.status = 'Occ') as total_occ,
        COUNT(rs.status) filter (where rs.status = 'Part') as total_part,
        COUNT(rs.status) filter (where rs.status = 'OOS') as total_oos,
        COUNT(rs.status) filter (where rs.status = 'OOC') as total_ooc,
        rs.fk_charge_point as cpid
    FROM register_date rd 
        inner join register_status rs on rs.fk_register_date = rd.id
    WHERE 
        rd.date::date >= '2019-02-01' and rd.date::date <= '2019-08-01'
        AND
        rd.date::time >= time '00:00' AND rd.date::time <= time '23:59'
    group by rs.fk_charge_point) rd
where cp.id = rd.cpid

I still need to check whether adding any indexes will make it even faster, but it's looking good so far.
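
The candidate index I'm considering looks roughly like this (name is illustrative; the composite index on register_status(fk_charge_point, fk_register_date, status) already exists according to the original plan):

-- expression index matching the ::date cast used in the WHERE clause
CREATE INDEX idx_register_date_date ON register_date ((date::date));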

