Sort on Array with Multi-Key Index

#mongodb #document #database #index

In the previous post, we discussed how MongoDB indexes retrieve documents ordered by a specific field, using published dates from a collection of YouTube video statistics as an example. Unlike many databases that allow only a single value per document, MongoDB's flexible schema also supports indexing within nested arrays.

The dataset I imported was not designed for searching on the video views, only to store them per video. It contains two arrays: one for days ("day.data') and another for corresponding daily view counts ("views.daily.data"). This schema was intended to retrieve all stats for a specific video at once. However, this blog series aims to demonstrate how a MongoDB document model can support a variety of additional use cases through secondary indexes, without modifying the collection schema.

Access pattern: videos with the highest daily views

This is straightforward, just create an index on the array of daily views data, using dot notation to define the path:

db.youstats.createIndex({ 
 "views.daily.data": -1      , // for Sort on maximum daily view
 commentsNumber: 1           , // for additional filter on comments
});

I created a descending index since most of my queries focus on the highest number of views. MongoDB indexes can be scanned both forward and backward.
When the indexed field is an array, it has multiple keys per document, but sorting must use one key. The semantic is easy:

A descending sort orders documents by the highest value in the array.
An ascending sort orders documents by the lowest value in the array.

Another way to think about it is that the index scan uses the first key encountered for each document and skips the others. On a forward scan with a descending index, it is the greatest value.

To find the Top-3 documents by daily views, here is a simple query:

db.youstats.find({},{
  author: 1               ,
  title: 1                ,
  duration: 1             ,
  type: 1                 ,
  publishedDate: 1        ,
  commentsNumber: 1       ,
  // "views.daily": 1        ,
  // "day": 1                ,
}).sort({
  "views.daily.data": -1 
}).limit(3)

[
  {
    _id: 'ASO_zypdnsQ',
    commentsNumber: 1122436,
    author: 'officialpsy',
    publishedDate: '2013-04-13T11:59:04.000Z',
    title: 'PSY - GENTLEMAN M/V',
    duration: 234,
    type: 'video/3gpp'
  },
  {
    _id: 'My2FRPA3Gf8',
    commentsNumber: 1125301,
    author: 'MileyCyrusVEVO',
    publishedDate: '2013-09-09T16:00:38.000Z',
    title: 'Miley Cyrus - Wrecking Ball',
    duration: 222,
    type: 'application/x-shockwave-flash'
  },
  {
    _id: 'YoB8t0B4jx4',
    commentsNumber: 39370,
    author: 'SA Wardega',
    publishedDate: '2014-09-04T14:28:22.000Z',
    title: 'Mutant Giant Spider Dog (SA Wardega)',
    duration: 239,
    type: 'video/3gpp'
  }
]

The documents show-up when one of the daily view data in the array has the highest value. For example, the 3rd result was:

  {
    _id: 'YoB8t0B4jx4',
    commentsNumber: 39370,
    author: 'SA Wardega',
    publishedDate: '2014-09-04T14:28:22.000Z',
    title: 'Mutant Giant Spider Dog (SA Wardega)',
    day: {
      data: [
        Long('1409788800000'), Long('1409875200000'), Long('1409961600000'), Long('1410048000000'), Long('1410134400000'), Long('1410220800000'), Long('1410307200000'), Long('1410393600000'), Long('1410480000000'), Long('1410566400000'), Long('1410652800000'), Long('1410739200000'), Long('1410825600000'), Long('1410912000000'), Long('1410998400000'), Long('1411084800000'), Long('1411171200000'), Long('1411257600000'), Long('1411344000000'), Long('1411430400000'), Long('1411516800000'), Long('1411603200000'), Long('1411689600000'), Long('1411776000000'), Long('1411862400000'), Long('1411948800000'), Long('1412035200000'), Long('1412121600000'), Long('1412208000000'), Long('1412294400000'), Long('1412380800000'), Long('1412467200000'), Long('1412553600000'), Long('1412640000000'), Long('1412726400000'), Long('1412812800000'), Long('1412899200000'), Long('1412985600000'), Long('1413072000000'), Long('1413158400000'), Long('1413244800000'), Long('1413331200000'), Long('1413417600000'), Long('1413504000000'), Long('1413590400000'), Long('1413676800000'), Long('1413763200000'), Long('1413849600000'), Long('1413936000000'), Long('1414022400000'), Long('1414108800000'), Long('1414195200000'), Long('1414281600000'), Long('1414368000000')
      ]
    },
    duration: 239,
    type: 'video/3gpp',
    views: {
      daily: {
        data: [
          2964062, 17799094, 19526335, 14604160, 9606241, 5959851,  4090643,  3419126,  2907521, 2626169, 2195691,  1518943,  1251086,  1128994,  958318, 861349,   785797,   628364,   506154,  564079, 445417,   474349,   498093,   589038,  444256, 363379,   329318,   313375,   333627,  335226, 354050,   322087,   239715,   228562,  213420, 201771,   213078,   247715,   228587,  183759, 168511,   169992,   199282,   326091,  347602, 335237,   290271,   242939,   223959,  219971, 249009,   277773,   279301,   220609
        ]
      }
    }
  }

As this ran in a couple of milliseconds, it is obvious, and it didn't scan millions of documents and used the index efficiently. This is easy to check with explain():

db.youstats.find({},{
  author: 1               ,
  title: 1                ,
  duration: 1             ,
  type: 1                 ,
  publishedDate: 1        ,
  commentsNumber: 1       ,
  "views.daily": 1        ,
  "day": 1                ,
}).sort({
  "views.daily.data": -1 
}).limit(3).explain("executionStats").executionStats

...
  nReturned: 3,
  executionTimeMillis: 1,
  totalKeysExamined: 7,
  totalDocsExamined: 3,
...
    stage: 'LIMIT',
    nReturned: 3,
    limitAmount: 3,
...
      stage: 'PROJECTION_SIMPLE',
      nReturned: 3,
...
        stage: 'FETCH',
        nReturned: 3,
...
          stage: 'IXSCAN',
          nReturned: 3,
          keyPattern: { 'views.daily.data': -1, commentsNumber: 1 },
          isMultiKey: true,
          multiKeyPaths: {
            'views.daily.data': [ 'views.daily.data' ],
            commentsNumber: []
          },
          direction: 'forward',
          indexBounds: {
            'views.daily.data': [ '[MaxKey, MinKey]' ],
            commentsNumber: [ '[MinKey, MaxKey]' ]
          },
          keysExamined: 7,
          seeks: 1,
          dupsTested: 7,
          dupsDropped: 4
        }
      }
    }
  }
}

The index contains an entry for each "views.daily.data" of all documents, starting with the highest daily view. The first key examined returns the recordid of the document with the highest daily view. The next key may correspond to the same video, which may have a high view count on another day, or another one. To return 3 distinct documents, the scan must eliminate duplicate recordid. Here 4 duplicates were dropped so that 3+4=7 keys have been read in total.

MongoDB utilizes WiredTiger B-Tree indexes not only to locate specific values, but also to navigate efficiently through the ranges of index bounds. Even if we were less lucky and had to examine 365 keys per document, reading one thousand index entries is still fast, as only 3 documents have to be fetched.

Access pattern: videos with the lowest daily views and no comments

The same index can also identify videos with the lowest daily views. To exclude those with no data, I only include entries that are available:

db.youstats.find({
  "views.daily.data": { $exists: true } ,
  "commentsNumber": { $lt: 1 } ,
},{
  author: 1               ,
  title: 1                ,
  duration: 1             ,
  type: 1                 ,
  publishedDate: 1        ,
  commentsNumber: 1       ,
  "views.daily": 1        ,
  //"day": 1                ,
}).sort({
  "views.daily.data": 1 
}).limit(3)

This query may not yield useful results, as it returns videos with no views on a given day, even if they had many views previously. Remember, ascending sort retrieves the lowest field value in the array, while descending sort retrieves the highest.

In this series on indexing MongoDB collections, I aim to demonstrate that a document model can support more use cases than initially anticipated during schema design. Although I imported a dataset not optimized for specific queries, introducing a couple of secondary indexes can enhance performance for various access patterns.
What's crucial during schema design to ensure that all fields used for filtering are contained within the same document, allowing the index to retrieve the minimum set of documents efficiently.

For information on how the sort order behaves with different datatypes, please refer to the documentation: Comparison/Sort Order. This applies only to MongoDB, not to its emulations that claim compatibility. I tested Amazon DocumentDB, Azure CosmosDB, Oracle Database, and FerretDB, but none could effectively cover the sort operation with an index, and they all ended up scanning the entire collection for the queries presented, which is slower and cannot scale.

DEV Community

Sort on Array with Multi-Key Index

Access pattern: videos with the highest daily views

Access pattern: videos with the lowest daily views and no comments

Top comments (0)