Here is an example where using PostgreSQL as a document database quickly fails: indexing. Either use an RDBMS as a relational database, or use a document database. In the previous post, we learned that B-Tree indexes can be created when there is no array in the JSON path, although this relies on an expression-based index and may reduce performance. A document model, however, embeds arrays, and the indexing possibilities become even more limited when emulating a document database with SQL and JSONB.
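As a reminder, such an expression index on a scalar path looks like the following sketch (the index name is mine, and it assumes the users table with a jsonb column named data that is created just below):
-- Hypothetical B-Tree expression index on a scalar JSON path (no array involved)
CREATE INDEX idx_users_data_name ON users ((data->>'name'));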
Here is an example where users may have more than one e-mail address:
create table users (
  id bigserial primary key,
  data jsonb not null
);
insert into users (data) values (
  jsonb_build_object(
    'name', 'Homer Simpson',
    'email', jsonb_build_array(
      '[email protected]',
      '[email protected]',
      '[email protected]'
    )
  )
);
INSERT INTO users (data)
SELECT
  jsonb_build_object(
    'name', 'u' || n::text,
    'email', jsonb_build_array(
      'u' || n::text || '@compuserve.com'
    )
  )
FROM generate_series(1, 1000000) n;
PostgreSQL has a JSON operator @> to find the documents that contain a value in an array:
SELECT *
FROM users
WHERE data->'email' @> '"[email protected]"'
;
An expression index on (data->'email') cannot be used for such a query, as the indexed value would be the whole array. In order to index each item, I need to create an inverted index:
CREATE INDEX idx_users_data_email ON users USING GIN (
  (data->'email') jsonb_path_ops
);
I check that the index can be used for a query by e-mail:
set enable_seqscan to off;
explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *
FROM users
WHERE data->'email' @> '"[email protected]"'
;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.014..0.015 rows=1 loops=1)
   Output: id, data
   Recheck Cond: ((users.data -> 'email'::text) @> '"[email protected]"'::jsonb)
   Heap Blocks: exact=1
   Buffers: shared hit=3
   ->  Bitmap Index Scan on idx_users_data_email (actual time=0.006..0.006 rows=1 loops=1)
         Index Cond: ((users.data -> 'email'::text) @> '"[email protected]"'::jsonb)
         Buffers: shared hit=2
 Planning:
   Buffers: shared hit=27
 Planning Time: 0.169 ms
 Serialization: time=0.005 ms output=1kB format=text
 Execution Time: 0.034 ms
(13 rows)
This is fast because I was looking for an exact match. However, indexes are supposed to optimize more than that. A partial-match query, where I know only the prefix, is extremely complex to write and cannot use the index:
set enable_seqscan to off;
explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *
FROM users
WHERE EXISTS (
  SELECT 1
  FROM jsonb_array_elements_text(data->'email') AS email
  WHERE email LIKE 'donutlover@%'
);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on public.users (actual time=45.032..844.109 rows=1 loops=1)
   Output: users.id, users.data
   Filter: EXISTS(SubPlan 1)
   Rows Removed by Filter: 1000000
   Buffers: shared hit=12346
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.001..0.001 rows=0 loops=1000001)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
           Rows Removed by Filter: 1
 Planning Time: 0.071 ms
 Serialization: time=0.007 ms output=1kB format=text
 Execution Time: 844.523 ms
(13 rows)
If you want to index such a query in PostgreSQL, you need to forget about the JSONB structure and process it as simple text, with a text-search index using trigrams:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX idx_users_data_email_trgm ON users USING GIN (
  (data->>'email') gin_trgm_ops
);
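As a side note, the pg_trgm extension ships a show_trgm() function that displays the trigrams extracted from a given text, which is a quick way to see what such an index stores:
-- Show the trigrams that pg_trgm extracts from a sample value
SELECT show_trgm('[email protected]');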
The JSON operator ->> returns text, and the gin_trgm_ops operator class extracts trigrams from it to index with an inverted index. I can pre-filter on the array expanded as text, with data->>'email', and search for the '%"donutlover@%' pattern with a LIKE. This uses the index but may return false positives, so I need to add my EXISTS with JSONB_ARRAY_ELEMENTS_TEXT to re-check before returning the result. Finally, here is the query that can find a user with an e-mail starting with "donutlover", and its execution plan:
explain (analyze, verbose, buffers, costs off, serialize text)
SELECT *
FROM users
WHERE data->>'email' LIKE '%"donutlover@%'
AND EXISTS (
  SELECT 1
  FROM jsonb_array_elements_text(data->'email') AS email
  WHERE email LIKE 'donutlover@%'
);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.users (actual time=0.035..0.036 rows=1 loops=1)
   Output: users.id, users.data
   Recheck Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
   Filter: EXISTS(SubPlan 1)
   Heap Blocks: exact=1
   Buffers: shared hit=13
   ->  Bitmap Index Scan on idx_users_data_email_trgm (actual time=0.015..0.015 rows=1 loops=1)
         Index Cond: ((users.data ->> 'email'::text) ~~ '%"donutlover@%'::text)
         Buffers: shared hit=12
   SubPlan 1
     ->  Function Scan on pg_catalog.jsonb_array_elements_text email (actual time=0.007..0.007 rows=1 loops=1)
           Function Call: jsonb_array_elements_text((users.data -> 'email'::text))
           Filter: (email.value ~~ 'donutlover@%'::text)
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.102 ms
 Serialization: time=0.003 ms output=1kB format=text
 Execution Time: 0.060 ms
(18 rows)
PostgreSQL can logically function as a document database, but nobody wants to go there given the complexity of indexing and querying it. If you use a SQL database, such data must be normalized, requiring two tables for the One-to-Many relationship between users and their e-mails. Regular indexes are then sufficient for effective LIKE predicates. The query uses a join, with the query planner ideally starting with the appropriate table when filters apply to both.
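For comparison, here is a minimal sketch of that normalized design (the table, column, and index names are my own assumptions, not part of the original schema):
-- Hypothetical normalized One-to-Many schema: users and their e-mails
CREATE TABLE users_n (
  id   bigserial PRIMARY KEY,
  name text NOT NULL
);
CREATE TABLE user_emails (
  user_id bigint NOT NULL REFERENCES users_n (id),
  email   text   NOT NULL,
  PRIMARY KEY (user_id, email)
);
-- A regular B-Tree index supports prefix LIKE searches
-- (text_pattern_ops makes it usable regardless of collation)
CREATE INDEX idx_user_emails_email ON user_emails (email text_pattern_ops);
-- The prefix search becomes a plain indexed join
SELECT u.*
FROM users_n u
JOIN user_emails e ON e.user_id = u.id
WHERE e.email LIKE 'donutlover@%';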
There are valid reasons to prefer a document model: its simplicity, alignment with business entities and application objects, and the ease of querying a flexible structure. The previous solution is the opposite of the expected simplicity. In MongoDB, you don't even need to know whether the "email" field is a single value or an array:
db.users.insertMany([
  {
    "_id": 1,
    "data": {
      "name": "Homer Simpson",
      "email": [
        "[email protected]",
        "[email protected]",
        "[email protected]"
      ]
    }
  },
  {
    "_id": 2,
    "data": {
      "name": "Marge Simpson",
      "email": "[email protected]"
    }
  }
]);
// Insert one million
const bulkUsers = [];
for (let n = 3; n <= 1000002; n++) {
  bulkUsers.push({
    _id: n,
    data: {
      name: "u" + n,
      email: [
        "u" + n + "@compuserve.com"
      ]
    }
  });
  // Insert in batches of 10,000 for efficiency
  if (bulkUsers.length === 10000) {
    db.users.insertMany(bulkUsers);
    bulkUsers.length = 0; // Clear the array
  }
}
// Insert any remaining documents
if (bulkUsers.length > 0) {
  db.users.insertMany(bulkUsers);
}
I create a simple index:
db.users.createIndex({ "data.email": 1 });
The query is straightforward, whether it involves an exact search or a regular expression:
db.users.find({
  "data.email": "[email protected]"
});
[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        '[email protected]',
        '[email protected]',
        '[email protected]'
      ]
    }
  }
]
db.users.find({
  "data.email": {
    $regex: "^donutlover@"
  }
});
[
  {
    _id: 1,
    data: {
      name: 'Homer Simpson',
      email: [
        '[email protected]',
        '[email protected]',
        '[email protected]'
      ]
    }
  }
]
The execution plan shows the fastest possible access to the document:
...
  executionSuccess: true,
  nReturned: 1,
  executionTimeMillis: 1,
  totalKeysExamined: 2,
  totalDocsExamined: 1,
...
  stage: 'IXSCAN',
  nReturned: 1,
  executionTimeMillisEstimate: 0,
  works: 3,
  keyPattern: { 'data.email': 1 },
  indexName: 'data.email_1',
  isMultiKey: true,
  multiKeyPaths: { 'data.email': [ 'data.email' ] },
  direction: 'forward',
  indexBounds: {
    'data.email': [
      '["donutlover@", "donutloverA")',
      '[/^donutlover@/, /^donutlover@/]'
    ]
  },
  keysExamined: 2,
  seeks: 2,
...
With MongoDB, you don't have to choose between regular and inverted indexes or deal with their limitations. A single index on { 'data.email': 1 } can handle both scalar values and arrays. For arrays, MongoDB recognizes this as a multi-key index (isMultiKey: true) and retrieves documents containing values that meet the filter criteria. This index can be used for equality and range queries, and regular expressions with a known prefix are automatically optimized by the query planner into index bounds.
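For example, the same index also serves an exact match on Marge's document, where "email" was inserted earlier as a scalar rather than an array:
db.users.find({
  "data.email": "[email protected]"
});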
When you hear that JSONB transforms PostgreSQL into a document database, consider trying simple queries beyond just equality predicates. Adding a MongoDB API on top of an SQL database addresses syntax complexity, but it does not resolve the limitations of the underlying indexes.