In a previous post, One million $lookup challenge, I said that $lookup is efficient at the end of an aggregation pipeline, not before the aggregation. I was referencing another post, Comparison of JOINS 👉🏻 aggregation pipeline and CTEs, where I simply inverted the join ($lookup) and the aggregation ($group) to fix a bad query. But what if you have to look up from millions of documents without an aggregation, as in the One million $lookup challenge?
I created a dimension table with one thousand documents and a fact table with one million. The fact table has a "ref" field that references the "_id" in the dimension table:
db.dim.drop();
db.fact.drop();
db.dim.insertMany(
  Array.from({ length: 1000 }, (_, i) => ({
    _id: i + 1,
    value: Math.random()
  }))
);
db.fact.insertMany(
  Array.from({ length: 1000000 }, () => ({
    ref: Math.ceil(Math.random() * 1000),
    value: Math.random()
  }))
);
In a document model, a many-to-one relationship would normally embed the dimension values directly, eliminating the need for lookups. However, I diverged from this recommendation to build a demo illustrating that lookups are acceptable when dealing with a limited number of documents, such as after an aggregation.
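For contrast, here is a minimal sketch of what the embedded model would look like. This is plain JavaScript showing hypothetical document shapes, not the collections above; the field names are assumptions for illustration:

```javascript
// Hypothetical embedded model: the dimension's value is copied into
// the fact document when it is written, so reads need no $lookup.
const dim = { _id: 42, value: 0.17 };

const factEmbedded = {
  ref: dim._id,              // keep the reference if updates are needed
  value: Math.random(),
  dim: { value: dim.value }  // embedded copy of the dimension value
};

console.log(factEmbedded.dim.value); // same value as dim.value
```

The trade-off is the usual one: reads are a single document fetch, but a change to the dimension value must be propagated to every fact that embeds it.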
Lookup (IndexedLoopJoin): 10 seconds
Here is an aggregation pipeline with a lookup:
x = db.fact.aggregate([
  {
    $lookup: {
      from: "dim",
      localField: "ref",
      foreignField: "_id",
      as: "dim"
    }
  }
]).explain("executionStats");
print(x["executionStats"]["executionTimeMillis"]/1000 + " seconds")
On this data, the operation completes in ten seconds. An index nested loop join over one million documents with a scalar "ref" takes time because it probes the index once per document, so performance can be improved by avoiding repeated lookups of the same value.
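To see why, here is a plain JavaScript sketch (no MongoDB involved; the sizes mirror the data above) counting how many index probes each strategy performs:

```javascript
// Simulate the fact table: one million documents, "ref" in 1..1000.
const facts = Array.from({ length: 1000000 }, () => ({
  ref: Math.ceil(Math.random() * 1000)
}));

// Per-document lookup: one index probe for every fact document.
const probesPerDocument = facts.length;

// Lookup after $group: one probe per distinct "ref" value.
const probesAfterGroup = new Set(facts.map(f => f.ref)).size;

console.log(probesPerDocument); // 1000000
console.log(probesAfterGroup);  // at most 1000
```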
$group, $lookup, $unwind: 3 seconds
I group by the "ref" values prior to the join to minimize lookup operations, ensuring each value is processed only once:
x = db.fact.aggregate([
  { // one group per "ref" with an array of "facts"
    $group: {
      _id: "$ref",
      facts: { $push: "$$ROOT" }
    }
  },
  { // join from the group
    $lookup: {
      from: "dim",
      localField: "_id",
      foreignField: "_id",
      as: "dim"
    }
  },
  { // un-group to get back one document per fact
    $unwind: "$facts"
  },
  { // project to the original fields
    $project: {
      _id: "$facts._id",
      ref: "$facts.ref",
      value: "$facts.value",
      dim: "$dim"
    }
  }
]).explain("executionStats");
print(x["stages"][0]["$cursor"]
["executionStats"]["executionTimeMillis"]/1000+" seconds")
This takes three seconds to run. With only one index probe per distinct "ref" value, about one thousand instead of one million, the lookup stage becomes much faster. Grouping and unwinding add some time, but if a grouping stage is already present, as is typical in an aggregation pipeline, performing the lookup after the $group is simply more efficient than doing it beforehand.
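As a sanity check, here is a plain JavaScript sketch on toy data (not the collections above, and simplified: "dim" is attached as a single object rather than the array $lookup produces) showing that group-then-lookup-then-unwind yields the same joined rows as the per-document lookup:

```javascript
// Toy dimension and fact data.
const dims = new Map([[1, { value: 0.1 }], [2, { value: 0.2 }]]);
const facts = [
  { _id: "a", ref: 1 },
  { _id: "b", ref: 1 },
  { _id: "c", ref: 2 }
];

// Strategy 1: look up the dimension once per fact document.
const direct = facts.map(f => ({ ...f, dim: dims.get(f.ref) }));

// Strategy 2: group by "ref", look up once per group, then un-group.
const groups = new Map();
for (const f of facts) {
  if (!groups.has(f.ref)) groups.set(f.ref, []);
  groups.get(f.ref).push(f);
}
const grouped = [];
for (const [ref, members] of groups) {
  const dim = dims.get(ref); // one lookup per distinct "ref"
  for (const f of members) grouped.push({ ...f, dim });
}

const sortById = (a, b) => a._id.localeCompare(b._id);
console.log(JSON.stringify(direct.sort(sortById)) ===
            JSON.stringify(grouped.sort(sortById))); // true
```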