In a blog post titled "New Benchmarks Show Postgres Dominating MongoDB in Varied Workloads", EDB claims that Postgres outperforms MongoDB in document-based data tests. While I generally find such marketing benchmarks useless, they highlight the mistakes made by vendors who compare a database where they are experts with one they don't know and have no desire to learn. This provides an opportunity to review some basics of efficient index and query design.
There are four queries in this benchmark, with data loaded from the GitHub Archive for 2015. This data is in JSON, which makes it suitable for testing queries on documents, and the queries are OLAP-style, using aggregation pipelines in MongoDB.
Load documents into a collection
I used the same method to load data in a small lab to reproduce the queries:
# Download each hourly GitHub Archive file for 2015 and import it into MongoDB
for file in http://data.gharchive.org/2015-{01..12}-{01..31}-{0..23}.json.gz
do
  wget -q -O - "$file" |                  # fetch the gzipped archive quietly
  gunzip -c |                             # decompress to stdout
  mongoimport --collection="github2015"   # import one document per JSON line
done
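After the load, a quick sanity check on the collection (a sketch; the exact count depends on which archive files were still available for download):

db.github2015.countDocuments()    // total number of loaded events
db.github2015.stats().avgObjSize  // average document size in bytes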
Here is an example of one document:
db.github2015.find().limit(1);
[
{
_id: ObjectId('684ee281d8d90a383a078112'),
id: '2489368070',
type: 'PushEvent',
actor: {
id: 9152315,
login: 'davidjhulse',
gravatar_id: '',
url: 'https://api.github.com/users/davidjhulse',
avatar_url: 'https://avatars.githubusercontent.com/u/9152315?'
},
repo: {
id: 28635890,
name: 'davidjhulse/davesbingrewardsbot',
url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot'
},
payload: {
push_id: 536740396,
size: 1,
distinct_size: 1,
ref: 'refs/heads/master',
head: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
before: '86ffa724b4d70fce46e760f8cc080f5ec3d7d85f',
commits: [
{
sha: 'a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81',
author: {
email: '[email protected]',
name: 'davidjhulse'
},
message: 'Altered BingBot.jar\n\nFixed issue with multiple account support',
distinct: true,
url: 'https://api.github.com/repos/davidjhulse/davesbingrewardsbot/commits/a9b22a6d80c1e0bb49c1cf75a3c075b642c28f81'
}
]
},
public: true,
created_at: '2015-01-01T00:00:00Z'
}
]
This dataset is ideal for testing a document database because:
- documents have an average size of three kilobytes, with some up to two megabytes.
- it features a structured format with sub-documents like "actor" and "payload."
- it contains arrays, such as "payload.commits", necessitating multi-key or inverted indexes.
I will test the four queries used by the benchmark and include an additional one that queries the array "payload.commits", which the benchmark overlooked. MongoDB's multi-key indexes significantly outperform all SQL databases in this regard (see the Multi-key Indexes series). While many vendor benchmarks limit their test coverage to create a favorable impression, my objective is to transparently showcase indexing best practices.
Single-field indexes
Here are the indexes that were created for the vendor benchmark:
db.github2015.createIndex( {"type":1} )
db.github2015.createIndex( {"repo.name":1} )
db.github2015.createIndex( {"payload.action":1} )
db.github2015.createIndex( {"actor.login":1} )
db.github2015.createIndex( {"payload.issue.comments":1} )
Seeing the index definitions, I already know why they got better results on PostgreSQL for some queries. Complex queries rarely filter or sort on a single field, and a well-designed database should have compound indexes. Without the right compound indexes, PostgreSQL can combine multiple single-field indexes with bitmap scans, which is not ideal, as a bitmap scan doesn't preserve key ordering and therefore cannot cover a sort. It can, however, reduce the number of indexes to create, which matters because additional indexes negatively impact the vacuum process.
The document model in MongoDB offers the advantage of having all important fields consolidated within a single document. This allows for the use of a compound index that can effectively handle equality, range, and sort order. These indexes can be applied to both scalar values and arrays.
Benchmark queries
I ran the benchmark queries in my lab to illustrate good indexing practices. I created the appropriate indexes, and the best starting point in the documentation is The ESR (Equality, Sort, Range) Guideline.
Query a) Repositories order by most open issues quantity
The query is:
db.github2015.aggregate([
{ $match: { $and: [ { type: "IssuesEvent"} , { "payload.action" : "opened" } ] }},
{ $group: { _id: "$repo.name", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
]).explain("executionStats")
Index according to the ESR guideline:
- Equality: "type" and "payload.action", with the less selective field first (better compression)
- Sort: "repo.name", as grouping is faster on sorted keys
db.github2015.createIndex(
{ "payload.action": 1, "type": 1, "repo.name": 1 }
)
In my test, the query scanned nearly three hundred thousand index keys and returned in under 800 milliseconds:
executionStats: {
executionSuccess: true,
nReturned: 72361,
executionTimeMillis: 777,
totalKeysExamined: 278489,
totalDocsExamined: 0,
In the vendor's benchmark results, MongoDB outperformed PostgreSQL on this query even though it had to fetch half of the documents, because the index served only a single filter. This suggests how much faster MongoDB can be than PostgreSQL with the appropriate indexing.
Query b) Return git event type order by quantity from major to minor
The query is:
db.github2015.aggregate([
{ $group: { _id: "$type", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
])
Index according to the ESR guideline:
- Sort: "type", as grouping is faster on sorted keys
db.github2015.createIndex({ "type": 1 })
This index was created by the benchmark; however, it is important to know that the query planner will not consider the index without a filter on its key prefix (this may be improved in the future if SERVER-13197 is implemented).
The query simply needs an unbounded range predicate on the indexed key:
db.github2015.aggregate([
{ $match: { "type": { $gt: MinKey} }},
{ $group: { _id: "$type", total: { $sum: 1 } } },
{ $sort: { total: -1 } }
])
In my test, the query aggregated nearly ten million keys in about 4.6 seconds:
executionStats: {
executionSuccess: true,
nReturned: 14,
executionTimeMillis: 4585,
totalKeysExamined: 9480600,
totalDocsExamined: 0,
I enhanced performance significantly by optimizing the query without creating a new index. MongoDB provides users with greater control over data access than PostgreSQL, even allowing query planner hints to improve efficiency. In this case, it is sufficient to add a { $gt: MinKey } or { $lt: MaxKey } predicate on the index key.
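For illustration, here is how a hint could be passed directly to the aggregation (a sketch, assuming the benchmark's single-field { "type": 1 } index is still in place):

db.github2015.aggregate(
  [
    { $match: { "type": { $gt: MinKey } } },           // unbounded range on the index prefix
    { $group: { _id: "$type", total: { $sum: 1 } } },
    { $sort: { total: -1 } }
  ],
  { hint: { "type": 1 } }                              // force the { "type": 1 } index
)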
Query c) Return the top 10 most active actors
The query is:
db.github2015.aggregate([
{ $group: { _id: "$actor.login", events: { $sum: 1 } } },
{ $sort: { events: -1 } },
{ $limit: 10 }
])
Index according to the ESR guideline:
- Sort: "actor.login", as grouping is faster on sorted keys
db.github2015.createIndex({ "actor.login": 1 })
As seen above, we need to add an unbounded range filter to get the index considered by the query planner:
db.github2015.aggregate([
{ $match: { "actor.login": { $gt: MinKey} }},
{ $group: { _id: "$actor.login", events: { $sum: 1 } } },
{ $sort: { events: -1 } },
{ $limit: 10 }
])
In my test, the query aggregated nearly ten million keys in 13.5 seconds:
executionStats: {
executionSuccess: true,
nReturned: 727312,
executionTimeMillis: 13498,
totalKeysExamined: 9480600,
totalDocsExamined: 0,
If you need to optimize this access pattern further, use the Computed design pattern and increment a login counter in an actors collection, leveraging the atomic $inc operator, as sketched below.
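Here is a minimal sketch of that pattern, assuming a hypothetical "actors" collection keyed by login and a counter maintained on each ingested event:

// On each event ingestion, increment the actor's counter atomically
db.actors.updateOne(
  { _id: "davidjhulse" },       // one document per actor login
  { $inc: { events: 1 } },      // atomic increment, safe under concurrency
  { upsert: true }              // create the counter document on first event
)
// The top 10 most active actors then becomes a simple indexed query:
db.actors.createIndex({ events: -1 })
db.actors.find().sort({ events: -1 }).limit(10)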
Query d) Return repositories that have more than two comments and a specific event type, order by average comments from major to minor
The query is:
db.github2015.aggregate([
{ $match: { "type": "PushEvent", "payload.issue.comments": { $gt : 2 } } },
{ $group: { _id: "$repo.name", avg: { $avg: "$payload.issue.comments" } } },
{ $sort: { avg: -1 } }
])
Index according to the ESR guideline:
- Equality: "type"
- Sort: "repo.name", as grouping is faster on sorted keys
- Range: "payload.issue.comments"
db.github2015.createIndex(
{ "type": 1, "repo.name": 1, "payload.issue.comments": 1 }
)
I don't know if it was done on purpose for the benchmark, but the dataset has no "PushEvent" with a "payload.issue", so the query returns an empty result.
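This is easy to verify with a simple count (a quick check in mongosh):

db.github2015.countDocuments({
  type: "PushEvent",
  "payload.issue": { $exists: true }
})  // returns 0 on this dataset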
For this particular case, as there are no keys to sort, it is better to place the range key before the sort key:
db.github2015.createIndex(
{ "type": 1, "payload.issue.comments": 1, "repo.name": 1 }
)
The index immediately finds that there are no keys for those bounds:
executionStats: {
executionSuccess: true,
nReturned: 0,
executionTimeMillis: 1,
totalKeysExamined: 0,
totalDocsExamined: 0,
The query may mistakenly expect a "PushEvent" to be an "IssuesEvent", as it looks for "payload.issue" comments. To eliminate this ambiguity, MongoDB offers schema validation (see "Schema Later" considered harmful).
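As an illustration of what such a guardrail could look like (the rule itself is my assumption, not something from the benchmark), a validator can reject documents where "payload.issue" appears on anything other than an "IssuesEvent":

db.runCommand({
  collMod: "github2015",
  validator: {
    $or: [
      { type: "IssuesEvent" },                  // issue payloads belong to IssuesEvent...
      { "payload.issue": { $exists: false } }   // ...other event types must not carry one
    ]
  },
  validationLevel: "moderate"  // don't fail updates to pre-existing invalid documents
})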
Ad-Hoc queries for OLAP
While acceptable performance can be achieved with the right index, it may not be the optimal solution for OLAP use cases. Instead, consider creating a single Atlas Search index to handle all related queries. For further guidance, refer to my previous post: Search Index for Reporting.
Atlas Search Indexes are maintained asynchronously on a dedicated node, ensuring no impact on the operational database. They provide a near-real-time state without the complexity of streaming changes to another database.
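For example, a single dynamic search index can serve all of these ad-hoc aggregations (a sketch, assuming a deployment where search indexes are available; the index name is arbitrary):

db.github2015.createSearchIndex(
  "github2015_search",              // arbitrary index name
  { mappings: { dynamic: true } }   // index all fields dynamically for ad-hoc queries
)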
Another Query: Recent push events by user's commit
Retrieving the most recent push events that contain a given user's commits is a relevant use case for this dataset. However, the benchmark did not run any queries on "commits", since it is an array. While PostgreSQL supports JSON, it cannot be directly compared to a document database like MongoDB, which handles non-scalar fields natively. If you run the benchmark on PostgreSQL, try this query:
SELECT
data->'repo'->>'name' as repo_name,
data->'payload'->'commits' as commits_info
FROM "github2015"
WHERE
data->>'type' = 'PushEvent'
AND data->'payload'->'commits' @> '[{"author": {"name": "[email protected]"}}]'
ORDER BY
data->>'created_at' DESC
LIMIT 5;
You can explore various indexes, such as a GIN index, but you will never find one that directly retrieves the five documents needed for the result. If you do, please correct me and show the execution plan in a comment.
On MongoDB, the same index guideline applies:
- Equality: "type" (low selectivity) and "payload.commits.author.name" (will be multi-key)
- Sort: "created_at" (a must for pagination queries)
db.github2015.createIndex({
"type": 1,
"payload.commits.author.name": 1,
"created_at": -1
})
The query is simple and doesn't even need an aggregation pipeline:
db.github2015.find({
"type": "PushEvent",
"payload.commits.author.name": "[email protected]"
}, {
"repo.name": 1,
"payload.commits.author.name": 1,
"payload.commits.message": 1,
_id: 0
}).sort({
"created_at": -1
}).limit(5)
The execution statistics indicate that only 5 documents have been read, which is the minimum required for the results:
executionStats: {
executionSuccess: true,
nReturned: 5,
executionTimeMillis: 0,
totalKeysExamined: 5,
totalDocsExamined: 5,
executionStages: {
The execution plan illustrates how a single seek has read five index entries by utilizing a multi-key index. It applies the index bounds for both the equality filters and the sort order, ensuring an efficient retrieval process:
stage: 'IXSCAN',
nReturned: 5,
executionTimeMillisEstimate: 0,
...
isMultiKey: true,
multiKeyPaths: {
type: [],
'payload.commits.author.name': [ 'payload.commits' ],
created_at: []
},
...
direction: 'forward',
indexBounds: {
type: [ '["PushEvent", "PushEvent"]' ],
'payload.commits.author.name': [ '["[email protected]", "[email protected]"]' ],
created_at: [ '[MaxKey, MinKey]' ]
},
keysExamined: 5,
seeks: 1,
dupsTested: 5,
Conclusion
Benchmarks are often biased because vendors optimize tests for their own databases and neglect the others. While their performance claims and timing comparisons have little value, looking at the queries and execution plans can serve an educational role by highlighting the common design mistakes vendors make on the databases they compare themselves to.
In this post, I applied the ESR (Equality, Sort, Range) Guideline, examined the execution plans, and tested queries relevant to a document model to demonstrate the power of MongoDB for querying JSON data. After replacing two indexes and improving two queries, you should find that MongoDB performs significantly faster than PostgreSQL on all four queries.