Index on JSONB not improving query speed

Question

table structure

   Column   |              Type              | Collation | Nullable |              Default
------------+--------------------------------+-----------+----------+------------------------------------
 id         | integer                        |           | not null | nextval('events_id_seq'::regclass)
 event_name | character varying(255)         |           | not null |
 data       | jsonb                          |           | not null |

Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_expr_idx" btree ((data -> 'program_id'::text))
    "idxginp" gin (data)

what 'data' might look like

{  
   "label":"Take Survey Button",
   "sent_at":"2018-07-25",
   "user_id":123456,
   "order_id":7654321,
   "event_time":"2018-07-25 00:09:41",
   "program_id":4,
   "user_agent":"Mozilla/5.0 ...",
   "destination":"http://www.google.com",
   "communication_kind":"email",
   "communication_name":"welcome"
}

Query that I need to run:

select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
      from "events"
      where "event_name" = 'communication_link_clicked'
        and data @> '{"communication_kind": "email"}'
        and data @> '{"communication_name": "program_welcome"}'
        and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
        and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";

This query on my small VM takes around 1 minute.

Records on this table: ~4 million. Records where data ->> program_id = 4 (~1 million.)

I've tried creating individual indexes for the columns included in the where condition but I din't see major improvements.

Should I convert the json attributes used in the WHERE clause to recular columns? or Am I just missing more indexes and tune my query?

EXPLAIN:

EXPLAIN (ANALYZE, BUFFERS) select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
      from "events"
      where "event_name" = 'communication_link_clicked'
        and data @> '{"communication_kind": "email"}'
        and data @> '{"communication_name": "program_welcome"}'
        and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
        and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";




GroupAggregate  (cost=104.24..104.28 rows=1 width=40) (actual time=43735.844..43735.844 rows=1 loops=1)
  Group Key: ((events.data ->> 'label'::text))
  Buffers: shared hit=839 read=143489, temp read=210 written=210
  ->  Unique  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.789..43731.616 rows=46772 loops=1)
        Buffers: shared hit=839 read=143489, temp read=210 written=210
        ->  Sort  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.787..43724.865 rows=55219 loops=1)
              Sort Key: ((events.data ->> 'label'::text)), ((events.data ->> 'user_id'::text))
              Sort Method: external merge  Disk: 1680kB
              Buffers: shared hit=839 read=143489, temp read=210 written=210
              ->  Bitmap Heap Scan on events  (cost=84.05..104.23 rows=1 width=64) (actual time=448.903..43328.010 rows=55219 loops=1)
                    Recheck Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                    Rows Removed by Index Recheck: 1434250
                    Filter: (((event_name)::text = 'communication_link_clicked'::text) AND (((data ->> 'sent_at'::text))::date >= '2018-07-25'::date) AND (((data ->> 'sent_at'::text))::date > 'program_id'::text))::integer = 4))
                    Rows Removed by Filter: 264849
                    Heap Blocks: exact=42839 lossy=99688
                    Buffers: shared hit=839 read=143489
                    ->  Bitmap Index Scan on idxginp  (cost=0.00..84.05 rows=5 width=0) (actual time=437.235..437.235 rows=320068 loops=1)
                          Index Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                          Buffers: shared hit=839 read=962
Planning time: 0.890 ms
Execution time: 43736.182 ms

@jjanes added explain

Guillermo Mansilla
– Guillermo Mansilla

2019-05-22 17:28:33 +00:00
Commented May 22, 2019 at 17:28 — Guillermo Mansilla
– Guillermo Mansilla, Commented May 22, 2019 at 17:28

Community · Accepted Answer · 2020-06-15 09:05:18Z

So it is using the jsonb index. But your work_mem is not large enough to hold the full bitmap, so it goes "lossy" resulting in extra work, see:

Rows Removed by Index Recheck: 1434250

Heap Blocks: exact=42839 lossy=99688

Increasing the size of your work_mem setting should improve this.

Your functional index cannot be used, because of a mismatch between how you define it and how you try to use it.

((data -> 'program_id'::text))

CAST ((data ->> 'program_id') as int) = 4

One uses "->>", one uses "->". And one casts its output to int, and the other leaves it as jsonb. You should define the index over the same expression as you use in the query, if you want it to be usable. But since 1/4 of the table meets that criterion, the index is probably not going to be helpful in the first place.

To maximize the efficiency here, you might want a compound index over event_name and program_id:

create index ON events (event_name, CAST ((data ->> 'program_id') as int));

That way you get the 1/4 selectivity for free if you are already using the index on event_name. It is hard to say if that is worthwhile or not, as we don't know how selective the event_name = 'communication_link_clicked' is. Your current query plan is not using that index, but maybe improving the selectivity by 4 fold would make it worthwhile.

But the best solution is probably not to use JSONB in the first place. Your JSON structure appears to be flat (based on one example), couldn't you just map those fields into real PostgreSQL columns? You kind of suggest that, but what is the cost in doing so in terms of application complexity? Then the planner would have better statistics to work with. Top level JSON attributes do not get the same kind of statistics on that as real columns do.

Yeah, I'm thinking converting this to regular columns, but wouldn't I have the same problem? (program_id = 4 representing a quarter of the total data) — Guillermo Mansilla
– Guillermo Mansilla, Commented May 22, 2019 at 19:08
Update: after adding a compound index with all columns (except sent_at) in the where condition and removing sent_at from the WHERE I saw some improvement, it went from 40 seconds to 4 seconds. — Guillermo Mansilla
– Guillermo Mansilla, Commented May 22, 2019 at 19:28
Update 2: After adding a new index for (data ->> 'sent_at') and using the new compound index for all columns in the WHERE clause the query went from ~40 seconds to 653ms — Guillermo Mansilla
– Guillermo Mansilla, Commented May 22, 2019 at 20:09
Regarding having the same problem, the planner can see that the problem exists and so make an informed decision. That is usually going to better than having to make decisions in the dark (although sometimes a blind pig finds an acorn). — jjanes
– jjanes, Commented May 23, 2019 at 12:54

Stack Exchange Network

Index on JSONB not improving query speed

1 Answer 1

Hot Network Questions

Index on JSONB not improving query speed

1 Answer 1

Related

Hot Network Questions