When designing a schema for MongoDB, it’s crucial to understand your domain access patterns. The document modeling approach shows its simplicity and efficiency over the relational model when building a database for a bounded context where main business objects and microservice access patterns are defined. Knowing which documents to manipulate allows you to determine what information to embed and what to reference by identifier.
The question arises: does this mean that integrating a new use case with the same database is difficult? Not at all. Just as relational databases require new secondary indexes for new use cases, MongoDB provides numerous indexing options on its document model. In this series, I will demonstrate how a collection designed for OLAP reporting, specifically YouTube video statistics, can efficiently support OLTP queries with a small set of indexes—without needing to alter the document model.
I selected a random dataset by searching for "mongoimport" on Kaggle, without a specific access pattern in mind. My goal is to show the diverse use cases it can serve with a couple of indexes, without changing the schema. Let's go ahead and initialize the lab.
I start a local atlas cluster with Atlas CLI
curl https://fastdl.mongodb.org/mongocli/mongodb-atlas-cli_1.41.1_linux_arm64.tar.gz |
tar -xzvf - &&
alias atlas=$PWD/mongodb-atlas-cli_1.41.1_linux_arm64/bin/atlas
atlas deployments setup atlas --type local --port 27017 --force
I import the Youtube video statistics for 1 million videos made available on Kaggle by Mattia Zeni and Daniele Miorandi and Francesco De Pellegrini YOUStatAnalyzer: a Tool for Analysing the Dynamics of YouTube Content Popularity - 7th International Conference on Performance Evaluation Methodologies and Tools (Valuetools, Torino, Italy, December 2013)
The following downloads, unzips, cleans (duration and number of comments in integer rather than text), and imports the million videos:
curl -L -o youstatanalyzer1000k.zip\
https://www.kaggle.com/api/v1/datasets/download/mattiazeni/youtube-video-statistics-1million-videos &&
unzip youstatanalyzer1000k.zip &&
rm -f youstatanalyzer1000k.zip &&
sed -E -i.bak 's/("duration" : |"commentsNumber" : )"([0-9]+)"(,)/\1\2\3/g' youstatanalyzer1000k.json &&
mongoimport --db yt --collection youstats --file youstatanalyzer1000k.json -j 10 --drop
Output:
% mongoimport --db yt --collection youstats --file youstatanalyzer1000k.json -j 10 --drop
2025-05-21T16:10:31.136+0200 connected to: mongodb://localhost/
2025-05-21T16:10:31.137+0200 dropping: yt.youstats
2025-05-21T16:10:34.136+0200 [........................] yt.youstats 301MB/29.3GB (1.0%)
2025-05-21T16:10:37.139+0200 [........................] yt.youstats 572MB/29.3GB (1.9%)
2025-05-21T16:10:40.145+0200 [........................] yt.youstats 713MB/29.3GB (2.4%)
2025-05-21T16:10:43.136+0200 [........................] yt.youstats 924MB/29.3GB (3.1%)
2025-05-21T16:10:46.136+0200 [........................] yt.youstats 1.18GB/29.3GB (4.0%)
2025-05-21T16:10:49.137+0200 [#.......................] yt.youstats 1.33GB/29.3GB (4.5%)
2025-05-21T16:10:52.136+0200 [#.......................] yt.youstats 1.51GB/29.3GB (5.1%)
2025-05-21T16:10:55.137+0200 [#.......................] yt.youstats 1.77GB/29.3GB (6.0%)
2025-05-21T16:10:58.136+0200 [#.......................] yt.youstats 2.01GB/29.3GB (6.8%)
...
2025-05-21T16:17:46.134+0200 [#######################.] yt.youstats 29.1GB/29.3GB (99.1%)
2025-05-21T16:17:49.135+0200 [#######################.] yt.youstats 29.2GB/29.3GB (99.6%)
2025-05-21T16:17:51.755+0200 [########################] yt.youstats 29.3GB/29.3GB (100.0%)
2025-05-21T16:17:51.755+0200 1006469 document(s) imported successfully. 0 document(s) failed to import.
Without any index, a query must scan the collection:
db.youstats.find().explain("executionStats").executionStats
executionStats: {
executionSuccess: true,
nReturned: 1006469,
executionTimeMillis: 75494,
totalKeysExamined: 0,
totalDocsExamined: 1006469,
executionStages: {
isCached: false,
stage: 'COLLSCAN',
nReturned: 1006469,
executionTimeMillisEstimate: 74846,
works: 1006470,
advanced: 1006469,
needTime: 0,
needYield: 0,
saveState: 4874,
restoreState: 4874,
isEOF: 1,
direction: 'forward',
docsExamined: 1006469
}
},
This query returns one million of documents (nReturned
: 1006469), so it is expected to examine the same number of documents (totalDocsExamined
: 1006469), and it takes one minute on my lab (executionTimeMillis
: 75494). However, if I wanted to filter only a subset to return ten documents, it would still have to read the same because I have no index.
Without an index, the collection serves only two access patterns efficiently:
- read all documents
- find one document by "_id"
This was the last query I run that takes more than a second on this dataset. The next posts of this series will introduce ways to find documents without their "_id" and without scanning the whole collection.
Top comments (0)