Indexing for New Use Cases Within the MongoDB Document Model (tutorial)

#mongodb #database #index #performance

When designing a schema for MongoDB, it’s crucial to understand your domain's access patterns. The document model shows its simplicity and efficiency over the relational model when you build a database for a bounded context where the main business objects and the microservice access patterns are defined. Knowing which documents you manipulate allows you to determine what information to embed and what to reference by identifier.

The question arises: does this mean that integrating a new use case with the same database is difficult? Not at all. Just as relational databases require new secondary indexes for new use cases, MongoDB provides numerous indexing options on its document model. In this series, I will demonstrate how a collection designed for OLAP reporting, specifically YouTube video statistics, can efficiently support OLTP queries with a small set of indexes, without needing to alter the document model.

I selected a random dataset by searching for "mongoimport" on Kaggle, without a specific access pattern in mind. My goal is to show the diverse use cases it can serve with a couple of indexes, without changing the schema. Let's go ahead and initialize the lab.

I start a local Atlas cluster with the Atlas CLI:

curl https://fastdl.mongodb.org/mongocli/mongodb-atlas-cli_1.41.1_linux_arm64.tar.gz | 
 tar -xzvf - &&
 alias atlas=$PWD/mongodb-atlas-cli_1.41.1_linux_arm64/bin/atlas

atlas deployments setup  atlas --type local --port 27017 --force

I import the YouTube video statistics for 1 million videos, made available on Kaggle by Mattia Zeni, Daniele Miorandi, and Francesco De Pellegrini ("YOUStatAnalyzer: a Tool for Analysing the Dynamics of YouTube Content Popularity", 7th International Conference on Performance Evaluation Methodologies and Tools, Valuetools, Torino, Italy, December 2013).

The following commands download, unzip, and clean the dataset (converting the duration and the number of comments from text to integers), then import the million videos:

curl -L -o youstatanalyzer1000k.zip\
  https://www.kaggle.com/api/v1/datasets/download/mattiazeni/youtube-video-statistics-1million-videos && 
unzip youstatanalyzer1000k.zip &&
rm -f youstatanalyzer1000k.zip &&
sed -E -i.bak 's/("duration" : |"commentsNumber" : )"([0-9]+)"(,)/\1\2\3/g' youstatanalyzer1000k.json &&
mongoimport --db yt --collection youstats --file youstatanalyzer1000k.json -j 10 --drop

Output:

% mongoimport --db yt --collection youstats --file youstatanalyzer1000k.json -j 10 --drop
2025-05-21T16:10:31.136+0200    connected to: mongodb://localhost/
2025-05-21T16:10:31.137+0200    dropping: yt.youstats
2025-05-21T16:10:34.136+0200    [........................] yt.youstats     301MB/29.3GB (1.0%)
2025-05-21T16:10:37.139+0200    [........................] yt.youstats     572MB/29.3GB (1.9%)
2025-05-21T16:10:40.145+0200    [........................] yt.youstats     713MB/29.3GB (2.4%)
2025-05-21T16:10:43.136+0200    [........................] yt.youstats     924MB/29.3GB (3.1%)
2025-05-21T16:10:46.136+0200    [........................] yt.youstats     1.18GB/29.3GB (4.0%)
2025-05-21T16:10:49.137+0200    [#.......................] yt.youstats     1.33GB/29.3GB (4.5%)
2025-05-21T16:10:52.136+0200    [#.......................] yt.youstats     1.51GB/29.3GB (5.1%)
2025-05-21T16:10:55.137+0200    [#.......................] yt.youstats     1.77GB/29.3GB (6.0%)
2025-05-21T16:10:58.136+0200    [#.......................] yt.youstats     2.01GB/29.3GB (6.8%)
...
2025-05-21T16:17:46.134+0200    [#######################.] yt.youstats     29.1GB/29.3GB (99.1%)
2025-05-21T16:17:49.135+0200    [#######################.] yt.youstats     29.2GB/29.3GB (99.6%)
2025-05-21T16:17:51.755+0200    [########################] yt.youstats     29.3GB/29.3GB (100.0%)
2025-05-21T16:17:51.755+0200    1006469 document(s) imported successfully. 0 document(s) failed to import.
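
To make the cleaning step in the import command easier to follow, here is a minimal sketch of the same substitution written as a JavaScript regular expression. The field names and the pattern come from the sed command; the sample line, its values, and the "otherField" name are made up for illustration and are not taken from the dataset.

```javascript
// Same pattern as the sed expression, written as a JavaScript regex.
// Group 1: field name and separator, group 2: the digits, group 3: the trailing comma.
const quotedNumber = /("duration" : |"commentsNumber" : )"([0-9]+)"(,)/g;

function unquoteNumbers(line) {
  // Keep the three groups and drop the double quotes around the digits.
  return line.replace(quotedNumber, "$1$2$3");
}

// Hypothetical sample fragment: values and "otherField" are invented for this sketch.
const sample = '"duration" : "215", "commentsNumber" : "42", "otherField" : "7",';

console.log(unquoteNumbers(sample));
// prints: "duration" : 215, "commentsNumber" : 42, "otherField" : "7",
```

Only the two named fields are rewritten, and only when the quoted value is made of digits and is followed by a comma. Anything else is left as text.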

Without any index, a query must scan the collection:

db.youstats.find().explain("executionStats").executionStats

 executionStats: {
    executionSuccess: true,
    nReturned: 1006469,
    executionTimeMillis: 75494,
    totalKeysExamined: 0,
    totalDocsExamined: 1006469,
    executionStages: {
      isCached: false,
      stage: 'COLLSCAN',
      nReturned: 1006469,
      executionTimeMillisEstimate: 74846,
      works: 1006470,
      advanced: 1006469,
      needTime: 0,
      needYield: 0,
      saveState: 4874,
      restoreState: 4874,
      isEOF: 1,
      direction: 'forward',
      docsExamined: 1006469
    }
  },

This query returns about one million documents (nReturned: 1006469), so it is expected to examine the same number of documents (totalDocsExamined: 1006469), and it takes over a minute in my lab (executionTimeMillis: 75494). However, if I wanted to filter for a small subset, such as ten documents, the query would still have to read the entire collection because I have no index.
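
The cost of a selective filter without an index can be pictured with a small in-memory sketch. This is only an analogy in plain JavaScript, not MongoDB's implementation: the array, its size, the values, and the collectionScan function are assumptions made for illustration, and the counter names simply mirror the explain output above.

```javascript
// In-memory stand-in for a collection: 100,000 made-up documents.
const docs = Array.from({ length: 100000 }, (_, i) => ({ _id: i, duration: i }));

// Emulates a collection scan: every document is read and tested against the predicate.
function collectionScan(collection, predicate) {
  let docsExamined = 0;
  let nReturned = 0;
  for (const doc of collection) {
    docsExamined++;
    if (predicate(doc)) nReturned++;
  }
  return { nReturned, docsExamined };
}

// A selective filter that matches only ten documents.
const stats = collectionScan(docs, (doc) => doc.duration >= 99990);

console.log(stats);
// prints: { nReturned: 10, docsExamined: 100000 }
```

Ten documents are returned, but all of them had to be examined to find those ten. That ratio is what an index is meant to reduce.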

Without a secondary index, the collection serves only two access patterns efficiently:

read all documents
find one document by "_id"
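
The difference between these two patterns and any other lookup can be sketched in plain JavaScript. This is a rough analogy under stated assumptions: a Map stands in for the index on "_id" (MongoDB uses a B-tree index, not a hash map), and the videos array, the id format, and both functions are hypothetical.

```javascript
// Made-up collection: ids and durations are invented for this sketch.
const videos = Array.from({ length: 100000 }, (_, i) => ({ _id: `video${i}`, duration: i }));

// Stand-in for the index on _id: maps each _id to the position of its document.
const idIndex = new Map(videos.map((doc, pos) => [doc._id, pos]));

// Lookup by _id: one index probe, then one document read.
function findById(id) {
  const pos = idIndex.get(id);
  return pos === undefined ? null : videos[pos];
}

// Lookup by a field with no index: read documents one by one until a match is found.
function findByDuration(duration) {
  let docsExamined = 0;
  for (const doc of videos) {
    docsExamined++;
    if (doc.duration === duration) return { doc, docsExamined };
  }
  return { doc: null, docsExamined };
}

console.log(findById("video99999").duration);        // prints: 99999
console.log(findByDuration(99999).docsExamined);     // prints: 100000
```

The lookup by "_id" reaches its document directly, while the lookup on the other field has to walk the collection, in the worst case to the very last document.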

This is the last query I run that takes more than a second on this dataset. The next posts in this series will introduce ways to find documents without knowing their "_id" and without scanning the whole collection.
