No Index for LIKE on JSONB with Array in the Path (GIN limitation)

Here is an example where using PostgreSQL as a document database quickly falls short: indexing. Either use an RDBMS as a relational database, or use a document database. In the previous post, we learned that a B-Tree index can be created when there is no array in the JSON path, but this approach may reduce performance since it relies on an expression-based index. When you use a document model, however, you embed arrays, and the indexing possibilities become even more limited when emulating a document database with SQL and JSONB.

Here is an example where users may have more than one e-mail address:

create table users (
  id bigserial primary key,
  data jsonb not null
);

insert into users (data) values (
 jsonb_build_object(  
    'name', 'Homer Simpson',  
    'email', jsonb_build_array(  
      '[email protected]',  
      '[email protected]',  
      '[email protected]'  
    )  
  )  
 );

INSERT INTO users (data)   
SELECT   
  jsonb_build_object(  
    'name', 'u' || n::text,  
    'email', jsonb_build_array(  
      'u' || n::text || '@compuserve.com'  
    )  
  )  
FROM generate_series(1, 1000000) n;  

PostgreSQL has a JSON containment operator, @>, to find documents where an array contains a given value:

SELECT *   
FROM users  
WHERE data->'email' @> '"[email protected]"'
;  

An expression index on (data->'email') cannot be used for such a query, as the indexed value would be the whole array: a B-Tree compares whole values, and its operator class does not include @>.
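A minimal sketch to verify it (the B-Tree index name is hypothetical):

CREATE INDEX idx_users_email_btree ON users ((data->'email'));

-- Even with enable_seqscan off, the planner cannot use this B-Tree index
-- for the @> containment predicate and resorts to scanning the whole table:
EXPLAIN SELECT * FROM users WHERE data->'email' @> '"[email protected]"';

In order to index each item, I need to create an inverted index: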

CREATE INDEX idx_users_data_email ON users USING GIN (
 (data->'email') jsonb_path_ops
);  

I check that the index can be used for a query by e-mail:

set enable_seqscan to off;

explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *   
FROM users  
WHERE data->'email' @> '"[email protected]"'
;  

                                          QUERY PLAN                                            
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.014..0.015 rows=1 loops=1)
   Output: id, data
   Recheck Cond: ((users.data -> 'email'::text) @> '"[email protected]"'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=3
   ->  Bitmap Index Scan on idx_users_data_email (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: ((users.data -> 'email'::text) @> '"[email protected]"'::jsonb)
         Buffers: shared hit=2
 Planning:
   Buffers: shared hit=27
 Planning Time: 0.169 ms
 Serialization: time=0.005 ms  output=1kB  format=text
 Execution Time: 0.034 ms
(13 rows)

This is fast because I was looking for an exact match. However, indexes are supposed to optimize more than exact matches. A query for a partial match, where I know only the prefix, is already complex to write, and it cannot use the index:

set enable_seqscan to off;

explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *   
FROM users  
WHERE EXISTS (  
  SELECT 1  
  FROM jsonb_array_elements_text(data->'email') AS email  
  WHERE email LIKE 'donutlover@%'  
);

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.users (actual time=45.032..844.109 rows=1 loops=1)
   Output: users.id, users.data
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 1000000
   Buffers: shared hit=12346
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.001..0.001 rows=0 loops=1000001)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
           Rows Removed by Filter: 1
 Planning Time: 0.071 ms
 Serialization: time=0.007 ms  output=1kB  format=text
 Execution Time: 844.523 ms
(13 rows)

If you want to index such a query in PostgreSQL, you need to forget about the JSONB structure and process it as plain text, with a text search index using trigrams:

CREATE EXTENSION IF NOT EXISTS pg_trgm;  
CREATE INDEX idx_users_data_email_trgm ON users USING GIN (
 (data->>'email') gin_trgm_ops
); 

The JSON operator ->> returns text, and gin_trgm_ops extracts trigrams from it into an inverted index. I can pre-filter on the array serialized as text, with data->>'email', and search for the '%"donutlover@%' pattern with LIKE. This uses the index but may return false positives, so I add my EXISTS with jsonb_array_elements_text to re-check candidates before returning the result. Finally, here is the query that finds a user with an e-mail starting with "donutlover", and its execution plan:

explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *  
FROM users  
WHERE data->>'email' LIKE '%"donutlover@%'
AND EXISTS (  
  SELECT 1  
  FROM jsonb_array_elements_text(data->'email') AS email  
  WHERE email LIKE 'donutlover@%'  
); 

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.035..0.036 rows=1 loops=1)
   Output: users.id, users.data
   Recheck Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
   Filter: EXISTS(SubPlan 1)
   Heap Blocks: exact=1
   Buffers: shared hit=13
   ->  Bitmap Index Scan on idx_users_data_email_trgm (actual time=0.015..0.015 rows=1 loops=1)
         Index Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
         Buffers: shared hit=12
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.007..0.007 rows=1 loops=1)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.102 ms
 Serialization: time=0.003 ms  output=1kB  format=text
 Execution Time: 0.060 ms
(18 rows)


PostgreSQL can logically function as a document database, but nobody wants to get there given the complexity of indexing and querying it. If you use a SQL database, such data should be normalized: two tables for the One-to-Many relationship between users and their e-mails. Regular B-Tree indexes then handle LIKE predicates effectively. The query uses a join, with the query planner ideally starting with the appropriate table when filters apply to both.
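Here is a sketch of that normalized alternative (table and index names are hypothetical):

CREATE TABLE users_n (
  id   bigserial PRIMARY KEY,
  name text NOT NULL
);

CREATE TABLE user_emails (
  user_id bigint NOT NULL REFERENCES users_n (id),
  email   text   NOT NULL,
  PRIMARY KEY (user_id, email)
);

-- text_pattern_ops lets a regular B-Tree index serve prefix LIKE
-- predicates regardless of the database collation:
CREATE INDEX user_emails_email_idx ON user_emails (email text_pattern_ops);

-- The prefix search becomes a plain index range scan plus a join:
SELECT u.*
FROM users_n u
JOIN user_emails e ON e.user_id = u.id
WHERE e.email LIKE 'donutlover@%';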

There are valid reasons to prefer a document model: its simplicity, alignment with business entities and application objects, and the ease of querying a flexible structure. The solution above is the opposite of that expected simplicity. In MongoDB, you don't even need to know whether the "email" field is a single value or an array:

db.users.insertMany([
{  
  "_id": 1,  
  "data": {  
    "name": "Homer Simpson",  
    "email": [  
      "[email protected]",  
      "[email protected]",  
      "[email protected]"  
    ]  
  }  
},  
{  
  "_id": 2,  
  "data": {  
    "name": "Marge Simpson",  
    "email": "[email protected]"
  }  
}
]);

// Insert one million
const bulkUsers = [];  
for (let n = 3; n <= 1000002; n++) {  
  bulkUsers.push({  
    _id: n,  
    data: {  
      name: "u" + n,  
      email: [  
        "u" + n + "@compuserve.com"  
      ]  
    }  
  });  
  // Insert in batches of 10,000 for efficiency  
  if (bulkUsers.length === 10000) {  
    db.users.insertMany(bulkUsers);  
    bulkUsers.length = 0; // Clear the array  
  }  
}  
// Insert any remaining documents  
if (bulkUsers.length > 0) {  
  db.users.insertMany(bulkUsers);  
}  


I create a simple index:


db.users.createIndex({ "data.email": 1 });  


The query is straightforward, whether it involves an exact search or a regular expression.

db.users.find({  
  "data.email": "[email protected]" 
});

[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        '[email protected]',
        '[email protected]',
        '[email protected]'
      ]
    }
  }
]

db.users.find({  
  "data.email": {  
    $regex: "^donutlover@"  
  }  
});

[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        '[email protected]',
        '[email protected]',
        '[email protected]'
      ]
    }
  }
]

The execution plan for the regular expression query shows the most efficient access to the document:

...
 executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 1,
  totalKeysExamined: 2,
  totalDocsExamined: 1,
...
      stage: 'IXSCAN',
      nReturned: 1,
      executionTimeMillisEstimate: 0,
      works: 3,
      keyPattern: { 'data.email': 1 },
      indexName: 'data.email_1',
      isMultiKey: true,
      multiKeyPaths: { 'data.email': [ 'data.email' ] },
      direction: 'forward',
      indexBounds: {
        'data.email': [
          '["donutlover@", "donutloverA")',
          '[/^donutlover@/, /^donutlover@/]'
        ]
      },
      keysExamined: 2,
      seeks: 2,
...

With MongoDB, you don't have to choose between regular and inverted indexes or deal with their limitations. A single index on { 'data.email': 1 } handles both scalar values and arrays. For arrays, MongoDB flags the index as multi-key (isMultiKey: true) and retrieves documents containing values that match the filter. The index serves equality and range queries, and regular expressions anchored on a known prefix are automatically optimized by the query planner into index bounds.
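For example (a sketch; the range bounds are arbitrary), the same index serves Marge's scalar field, Homer's array, and a range predicate:

// scalar "data.email" (Marge) and array "data.email" (Homer) hit the same index
db.users.find({ "data.email": "[email protected]" });

// a range predicate is also resolved with multi-key index bounds
db.users.find({ "data.email": { $gte: "u1000@", $lt: "u1001@" } });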

When you hear that JSONB transforms PostgreSQL into a document database, consider trying simple queries beyond just equality predicates. Adding a MongoDB API on top of an SQL database addresses syntax complexity, but it does not resolve the limitations of the underlying indexes.
