Comparison of JOINS 👉🏻 aggregation pipeline and CTEs

Franck Pachot for MongoDB

In a blog post titled Comparison of JOINS: MongoDB vs. PostgreSQL on EDB's site, Michael Stonebraker runs a join with $lookup in an aggregation pipeline to compare with a PostgreSQL join. I take this opportunity to discuss some effective design practices for working with a document database.

A common mistake vendors make is publishing comparisons between one database, where the author is an expert, and another database they are unfamiliar with and unwilling to learn about. This leads to biased conclusions, as they contrast best practices on the database they know with an incorrect application design on the other. In the EDB blog post, the following aggregation pipeline is used to claim that "JOINS are Brittle in MongoDB" - do you spot the problem?

db.employee.aggregate([           // this is a bad practice example
  {
    $lookup: {
      from: "department",
      localField: "department",
      foreignField: "_id",
      as: "dept"
    }
  },
  {
    $unwind: "$dept"
  },
  {
    $group: {
      "_id": "$dept.dname",
      "salary": { "$sum": "$salary" },
    }
  }
]);

This query reads the "employee" collection, where a "department" field references the "_id" field in the "department" collection. It performs a lookup into the "department" collection to retrieve additional information about each employee's department, specifically the department name ("dname"). After unwinding the "dept" array returned by the lookup (a single value here, because the relationship is many-to-one, though MongoDB can handle many-to-many the same way), it groups the documents by department name ("dept.dname") and calculates the total salary per department by summing the "salary" field.

In a document model, the department name should be included as a field within the employee document instead of using a reference. While one might argue that normalizing it to another collection simplifies updates, this operation is infrequent. Additionally, any department renaming is likely part of a broader enterprise reorganization, which would prompt updates to the employee collection regardless.
The model does not account for departments without employees, as it is inherently tied to a specific business domain, HR in this case. It focuses on employees and does not share sensitive information like salary details with other domains. In this bounded context, the department information is an employee attribute.
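
As an illustration (a sketch with hypothetical field values, not from the EDB article), the extended reference pattern copies the department name into each employee document, so the aggregation needs no lookup at all:

// Sketch: extended reference pattern (hypothetical values)
// The department name is copied into each employee document:
db.employee.insertOne({
  ename: "Alice",
  salary: 42000,
  department: { _id: 7, dname: "Sales" }   // reference plus copied name
});

// Total salary per department, with no $lookup needed:
db.employee.aggregate([
  { $group: { _id: "$department.dname", salary: { $sum: "$salary" } } }
]);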

But let's say that we need to avoid duplication and normalize into a department collection that implements the dependency between the department surrogate key "_id" and the department name. The department name must still be unique, as it is the natural key, and this can be enforced with a unique index ({ dname: 1 }, { unique: true }). Knowing this, grouping by "_id" and grouping by "dname" are equivalent, so there is no need to look up the name for each employee. In an aggregation pipeline, it is better to aggregate first and look up afterwards: once per department rather than once per employee.

Here is the correct aggregation pipeline:

db.employee.aggregate([           // this is a good practice example
  {    
    $group: {  
      _id: "$department",  
      salary: { $sum: "$salary" }  
    }  
  },  
  {  
    $lookup: {  
      from: "department",  
      localField: "_id",  
      foreignField: "_id",  
      as: "dept"  
    }  
  },  
  {  
    $unwind: "$dept"  
  },  
  {  
    $project: {  
      _id: "$dept.dname",  
      salary: 1  
    }  
  }  
]);  


For an expert in relational theory, hard-coding the order of execution might seem surprising, as an RDBMS is built to optimize the access path from a declarative query on a logical model. However, similar considerations apply to SQL databases; they are just deferred until production, when data grows and you need to analyze the execution plan and "tune" the query.

For example, the EDB article compares with the following PostgreSQL schema and query:

create table department (
    dname varchar    primary key
,   floor            integer
,   budget           integer
);

create table employee (
    ename varchar
,   age              integer
,   salary           integer
,   department       varchar references department(dname)
);

select dname, sum(salary)
from employee as e
            inner join
            department as d
            on e.department = d.dname
group by dname
;

There are two main differences between the queries used for MongoDB and PostgreSQL. First, the PostgreSQL schema uses a natural key instead of a surrogate key, which simplifies the join but does not resolve the issue of cascading updates to department names. It is the equivalent of embedding, or the extended reference pattern, in MongoDB. While the author may have ignored it, MongoDB, like SQL, can reference and join on fields other than the generated "_id", and secondary indexes make this fast.
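
For illustration, here is a sketch (assuming, as in the PostgreSQL schema, that the employee document stores the department name rather than an "_id") of a lookup on the natural key, backed by a secondary index:

// Unique secondary index on the natural key:
db.department.createIndex({ dname: 1 }, { unique: true });

// $lookup can target "dname" rather than the generated "_id":
db.employee.aggregate([
  { $group: { _id: "$department", salary: { $sum: "$salary" } } },
  {
    $lookup: {
      from: "department",
      localField: "_id",       // the department name, from the group key
      foreignField: "dname",   // natural key, served by the index above
      as: "dept"
    }
  }
]);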
Second, MongoDB performs a left outer join for lookups, because they are lookups, not relational joins, while the author used an inner join in the PostgreSQL example. Given the data, the outer join makes sense: you don't want to lose an employee's salary just because the referenced department is missing. PostgreSQL does not optimize this either and executes the join prior to the aggregation:

explain (costs off)
select dname, sum(salary)
from employee as e
            left outer join
            department as d
            on e.department = d.dname
group by dname
;

                         QUERY PLAN                          
-------------------------------------------------------------
 HashAggregate
   Group Key: d.dname
   ->  Hash Left Join
         Hash Cond: ((e.department)::text = (d.dname)::text)
         ->  Seq Scan on employee e
         ->  Hash
               ->  Seq Scan on department d

In PostgreSQL, you also need to look at the execution plan and change the query. For example, using the department name from the employee table instead of the one in the department table eliminates the join:

explain (costs off)
select
 department,
 sum(salary)
from employee as e
      left outer join
      department as d
      on e.department = d.dname
group by
 department
;

          QUERY PLAN          
------------------------------
 HashAggregate
   Group Key: e.department
   ->  Seq Scan on employee e

This is not different from MongoDB. In PostgreSQL as well:

  • You must understand the access path.
  • You accept some update complexity to eliminate joins.

The join was eliminated because no column is read from the inner table, which is possible only because the natural key, the department name, was chosen. If you query an additional column from department, like "floor", the query becomes more complex: this column must be added to the GROUP BY clause even though the normalized model allows only one floor per department, and the join again happens before the aggregation:

explain (costs off)
select 
 department,
 floor,
 sum(salary)
from employee as e
      left outer join
      department as d
      on e.department = d.dname
group by 
 department,
 floor
;
                         QUERY PLAN                          
-------------------------------------------------------------
 HashAggregate
   Group Key: e.department, d.floor
   ->  Hash Left Join
         Hash Cond: ((e.department)::text = (d.dname)::text)
         ->  Seq Scan on employee e
         ->  Hash
               ->  Seq Scan on department d


In PostgreSQL, you can adopt a style similar to the MongoDB aggregation pipeline by declaring stages within a Common Table Expression (CTE) using a WITH clause. This approach executes the GROUP BY before the JOIN, making the code's intent clearer:

explain (costs off)
with employee_agg as (  
  select 
   department,
   sum(salary) as total_salary  
  from employee  
  group by department  
)  
select 
  d.dname as department_name, 
  d.floor as department_floor,
  ea.total_salary as total_salary  
from employee_agg as ea  
left outer join department as d  
on ea.department = d.dname; 

                       QUERY PLAN                       
--------------------------------------------------------
 Hash Right Join
   Hash Cond: ((d.dname)::text = (ea.department)::text)
   ->  Seq Scan on department d
   ->  Hash
         ->  Subquery Scan on ea
               ->  HashAggregate
                     Group Key: employee.department
                     ->  Seq Scan on employee


This method is more efficient, as it aggregates before the join. Common Table Expressions imitate the MongoDB aggregation pipeline and give greater control over data access. Both are high-level languages that let developers decompose queries into logical steps.

When writing an SQL query, I prefer to start with aggregations and projections in Common Table Expressions (CTEs) before performing natural joins. This method is valid as long as all projections are clearly defined, ensuring an organized and efficient query structure:

explain (costs off)
with "EmployeesPerDepartment" as (
  -- aggregate employee salary per department  
  select 
   department     as "DepartmentName",
   sum(salary)    as "TotalSalary"  
  from employee  
  group by department  
),  "Departments" as (
  -- project department to join to the aggregate
 select 
  dname         as "DepartmentName", 
  floor         as "DepartmentFloor"
 from department
)  
-- when all columns are defined, natural join is simple to write
 select 
  "DepartmentName", 
  "DepartmentFloor",
  "TotalSalary"  
from "EmployeesPerDepartment"
natural left join "Departments"
;
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Hash Right Join
   Hash Cond: ((department.dname)::text = (employee.department)::text)
   ->  Seq Scan on department
   ->  Hash
         ->  HashAggregate
               Group Key: employee.department
               ->  Seq Scan on employee

Because a SQL query returns a single tabular result, it is possible to declare the projection (column aliases) with the final column names first. This eliminates the need for table aliases and complex join clauses. It also makes debugging easier, as each intermediate step can be run on its own.

To conclude, it is true that joins in PostgreSQL are generally faster than lookups in MongoDB. This is because PostgreSQL is designed for normalized schemas, where a single business query retrieves data from multiple tables, while MongoDB is optimized for document models and flexible schemas that align with business domain entities.
MongoDB's $lookup operation goes beyond a simple equi-join. Since both local and foreign fields can contain arrays, it is more like a set intersection semi-join that matches documents when their respective field values have any overlapping elements.
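
To illustrate with a small sketch (hypothetical collections, not from the article): when both sides hold arrays, a document matches as soon as the arrays share one element:

// Hypothetical data to show array-to-array matching:
db.people.insertOne({ name: "Kim", skills: [ "python", "go" ] });
db.projects.insertOne({ name: "ETL", skills: [ "sql", "python" ] });

// Matches because the two "skills" arrays overlap on "python":
db.people.aggregate([
  {
    $lookup: {
      from: "projects",
      localField: "skills",
      foreignField: "skills",
      as: "matching_projects"
    }
  }
]);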

In SQL databases, the challenge lies not in executing joins but in the complexity developers face when crafting optimal queries. To achieve acceptable response times, SQL databases must choose among multiple join algorithms, which requires the query planner to perform cost-based optimization that relies heavily on accurate cardinality estimations. As the number of tables to join increases, so does the risk of a poor execution plan. This complexity, affecting both the developer and the optimizer, can create the perception that joins are slow.

If MongoDB lookups are causing slow queries, consider improving your data model and aggregation pipelines first. Embed the one-to-one and one-to-many relationships that belong to the same business object, filter and aggregate before joining, and use multikey indexes on well-designed document schemas.
Additionally, be aware that MongoDB's aggregation pipeline includes an optimization phase that reshapes the pipeline for better performance.
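
For example, you can ask the server to show the reshaped pipeline with explain (a sketch; the pipeline is illustrative):

// Inspect how the server reorders stages before execution:
db.employee.explain().aggregate([
  { $project: { department: 1, salary: 1 } },
  { $match: { salary: { $gt: 0 } } },  // the optimizer can move this before $project
  { $group: { _id: "$department", salary: { $sum: "$salary" } } }
]);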

If you find the aggregation pipeline syntax complex, know that it is easy to learn and developers like it (other databases, such as Google BigQuery and DuckDB, have adopted a similar pipeline approach). It resembles Common Table Expressions in SQL, making it straightforward to test each stage, and you can use the Atlas UI or Compass to build a pipeline with a wizard and view the output of each stage. But do not abuse it: the document model should avoid lookups across many documents, and aggregation pipelines should filter (on indexes) and aggregate first.
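
Since a pipeline is just an array of stages, a simple way to test it stage by stage (a quick sketch, reusing the pipeline from earlier in this post) is to run a growing prefix of it:

// A pipeline is a plain array, so stages can be tested incrementally:
const pipeline = [
  { $group: { _id: "$department", salary: { $sum: "$salary" } } },
  { $lookup: { from: "department", localField: "_id", foreignField: "_id", as: "dept" } },
  { $unwind: "$dept" },
  { $project: { _id: "$dept.dname", salary: 1 } }
];

// Inspect the output after the first two stages only:
db.employee.aggregate(pipeline.slice(0, 2));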
