DuckDB is a high-performance analytical database, but it runs in-process on a single node, so it cannot replace an operational database. MongoDB is a general-purpose database for operational data with a flexible document model. DuckDB lacks a native connector to MongoDB, but you can combine pg_duckdb and mongo_fdw, two PostgreSQL extensions, to query MongoDB with DuckDB.
- A PostgreSQL Docker container with DuckDB and Mongo_FDW
- Start the databases
- Create MongoDB collections
- Create the Foreign Tables
- Query MongoDB from PostgreSQL with join
- Query MongoDB collection but join in PostgreSQL
- Query planner cardinalities
- Invoke DuckDB to query the foreign tables
- Cache static tables as temporary tables
- Map the whole document
A PostgreSQL Docker container with DuckDB and Mongo_FDW
Here is an example. I built a pg_duckdb image, which is PostgreSQL with DuckDB as an extension, and installed mongo_fdw, the Foreign Data Wrapper that gives PostgreSQL access to MongoDB. Here is my Dockerfile:
ARG pg_version=17
# build stage to compile the MongoDB Foreign Data Wrapper
FROM pgduckdb/pgduckdb:${pg_version}-main as build
ARG pg_version
ARG MONGO_FDW_VERSION=5_5_2
ARG MONGO_FDW_URL=https://github.com/EnterpriseDB/mongo_fdw/archive/REL-${MONGO_FDW_VERSION}.tar.gz
ARG SOURCE_FILES=/tmp/mongo_fdw
ENV PKG_CONFIG_PATH=${SOURCE_FILES}/mongo-c-driver/src/libmongoc/src:${SOURCE_FILES}/mongo-c-driver/src/libbson/src
ENV LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu
ENV MONGOC_INSTALL_DIR=${LD_LIBRARY_PATH}
ENV JSONC_INSTALL_DIR=${LD_LIBRARY_PATH}
USER root
# dependencies (compilation and runtime)
RUN apt-get update && apt-get install -y --no-install-recommends wget ca-certificates make gcc cmake pkg-config postgresql-server-dev-${pg_version} libssl-dev libzstd-dev libmongoc-dev libjson-c-dev libsnappy1v5 libmongocrypt0
# compile mongo_fdw
WORKDIR /var/tmp
ADD ${MONGO_FDW_URL} /var/tmp
WORKDIR /var/tmp
RUN tar -zxvf $(basename ${MONGO_FDW_URL})
# install mongo_fdw
WORKDIR /var/tmp/mongo_fdw-REL-${MONGO_FDW_VERSION}
RUN ./autogen.sh && make USE_PGXS=1 && make USE_PGXS=1 install;
# add mongosh because it can be useful
WORKDIR /var/tmp
ADD https://downloads.mongodb.com/compass/mongosh-2.5.2-linux-x64.tgz /tmp
RUN tar -xvf /tmp/mongosh-2.5.2-linux-x64.tgz
RUN cp ./mongosh-2.5.2-linux-x64/bin/mongosh /usr/local/bin
# final stage to add mongo_fdw to pgduckdb
FROM pgduckdb/pgduckdb:${pg_version}-main
USER root
ARG pg_version
ARG extdir=/usr/share/postgresql/${pg_version}/extension
ARG extlibdir=/usr/lib/postgresql/${pg_version}/lib
ARG libdir=/usr/lib/x86_64-linux-gnu
COPY --from=build ${extdir}/mongo_fdw* ${extdir}/
COPY --from=build ${extlibdir}/mongo_fdw.so ${extlibdir}/
COPY --from=build ${libdir}/libmongoc-1.0.so.0.0.0 \
${libdir}/libbson-1.0.so.0.0.0 \
${libdir}/libmongocrypt.so.0.0.0 \
${libdir}/libsnappy.so.1.1.9 \
${libdir}/
RUN cd ${libdir} && \
ln -sf libmongoc-1.0.so.0.0.0 libmongoc-1.0.so.0 && \
ln -sf libmongoc-1.0.so.0 libmongoc-1.0.so && \
ln -sf libbson-1.0.so.0.0.0 libbson-1.0.so.0 && \
ln -sf libbson-1.0.so.0 libbson-1.0.so && \
ln -sf libmongocrypt.so.0.0.0 libmongocrypt.so.0 && \
ln -sf libmongocrypt.so.0 libmongocrypt.so && \
ln -sf libsnappy.so.1.1.9 libsnappy.so.1 && \
ln -sf libsnappy.so.1 libsnappy.so;
COPY --from=build /usr/local/bin /usr/local/bin
USER postgres
Start the databases
I started a MongoDB container:
docker run --name mongodb -d mongodb/mongodb-community-server:latest
I built the image and started a container from my image, with network link to mongodb:
docker build -t duckduckmongo .
docker run --name duckduckmongo -d --link mongodb:mongodb -e POSTGRES_PASSWORD=postgres4mongo duckduckmongo
Create MongoDB collections
In this container, where I installed the MongoDB Shell, I connected to the MongoDB database to verify the connection:
docker exec -it duckduckmongo mongosh mongodb:27017/test
While there, I created a user:
db.createUser( {
user: "duckduckmongo",
pwd: "mongo4postgres",
roles: [ { role: "readWrite", db: "test" } ]
} );
I created two collections with some data:
db.dim.insertMany( Array.from({ length: 42 }, (_, i) => (
{
dimid: i + 1,
dimvalue: Math.random(),
tags: [ "fr" , "de" , "it" ],
coord: { x:1, y:-11 }
})) );
db.fact.insertMany( Array.from({ length: 100000 }, () => (
{
ref: Math.ceil(Math.random() * 42),
value: Math.random(),
}
)) );
From the field names, you can guess that I wanted to test some joins, with one collection referencing the other.
Create the Foreign Tables
I connected to PostgreSQL:
docker exec -it duckduckmongo psql -U postgres
I enabled the extensions:
create extension if not exists mongo_fdw;
create extension if not exists pg_duckdb;
I declared the Foreign Data Wrapper to connect to MongoDB:
CREATE SERVER "MongoDB server" FOREIGN DATA WRAPPER mongo_fdw
OPTIONS ( address 'mongodb', port '27017' )
;
GRANT USAGE ON FOREIGN SERVER "MongoDB server" TO postgres;
CREATE USER MAPPING FOR postgres SERVER "MongoDB server"
OPTIONS ( username 'duckduckmongo', password 'mongo4postgres' )
;
I declared the foreign tables that map to the MongoDB collections:
-- Note: the first column must be "_id" of type "name", according to the mongo_fdw documentation
CREATE FOREIGN TABLE dim (
_id name,
dimid int,
dimvalue float,
"coord.x" int, -- mapping nested object fields to column
"coord.y" int,
tags text[] -- mapping arrays as arrays
) SERVER "MongoDB server"
OPTIONS ( database 'test', collection 'dim' )
;
CREATE FOREIGN TABLE fact (
_id name,
ref int,
value float
) SERVER "MongoDB server"
OPTIONS ( database 'test', collection 'fact' )
;
Those foreign tables are like views and can be queried from PostgreSQL, and the Foreign Data Wrapper will query the MongoDB database server.
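For example, a quick sanity check that the wrapper reaches MongoDB and sees the documents inserted above (expected: 42 and 100000):
select count(*) from dim;
select count(*) from fact;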
Query MongoDB from PostgreSQL with join
I ran the following query, with a filter and a join:
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
It took several seconds, so I checked the execution plan:
explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Foreign Scan (actual time=16.379..7545.432 rows=49889 loops=1)
Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Foreign Namespace: (test.fact) INNER JOIN (test.dim)
Planning Time: 2.226 ms
Execution Time: 7548.817 ms
The slowness occurs because the join in MongoDB uses a $lookup within an aggregation pipeline, which isn't optimized for fifty thousand documents.
In MongoDB, the $lookup operation can be slower than joins in an RDBMS due to its document model, which is designed for embedding related data and reducing the need for joins. The flexible schema allows fields to be scalars, arrays, or nested documents, making $lookup operations more complex than traditional equi-joins. A lookup is a key-intersection join, which requires implicit array traversal and matching across two sets of keys, resembling a semijoin but including all matching elements from the inner collection, typically as an array that can also be unwound to multiple documents.
When querying tens of thousands of documents, it is more effective to run two separate queries instead of one $lookup. This can be done transparently through the Foreign Data Wrapper, which can avoid pushing down the join, as shown in the next section. A hash join is preferable for my query, as it needs most rows from the "dim" collection. Note that MongoDB can also use a hash join for lookups, but only when the inner collection has fewer than ten thousand documents and temporary disk usage is allowed.
Query MongoDB collection but join in PostgreSQL
To execute the join in PostgreSQL, I disabled the join pushdown:
set mongo_fdw.enable_join_pushdown to false;
explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Hash Join (actual time=1.524..102.797 rows=49889 loops=1)
Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Hash Cond: (fact.ref = dim.dimid)
-> Foreign Scan on public.fact (actual time=0.923..90.954 rows=49889 loops=1)
Output: fact._id, fact.ref, fact.value
Foreign Namespace: test.fact
-> Hash (actual time=0.575..0.575 rows=42 loops=1)
Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Buckets: 1024 Batches: 1 Memory Usage: 15kB
-> Foreign Scan on public.dim (actual time=0.488..0.552 rows=42 loops=1)
Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Foreign Namespace: test.dim
Planning Time: 2.341 ms
Execution Time: 105.252 ms
This pushed the filter down to MongoDB but not the join, so it read all 42 documents from "dim" in 0.6 milliseconds to build the hash table, and about fifty thousand documents from "fact" in 91 milliseconds to serve as the probe side of the hash join.
It is important to understand that the faster join in PostgreSQL compared to MongoDB is not due to the database engine itself, but to reading a flexible document model through the constraints of a relational view. The foreign table definition declares the fields as scalar integers, so the join is a simple equi-join, whereas the document model permits arrays, and a $lookup is an intersection between two sets of keys rather than an equi-join.
I disabled join pushdown for my session, but it can be an option of the foreign table declaration.
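For instance, the setting can be made persistent on the foreign table itself (a sketch, assuming mongo_fdw accepts enable_join_pushdown as a table-level option; check the mongo_fdw documentation for your version):
ALTER FOREIGN TABLE fact
OPTIONS (ADD enable_join_pushdown 'false')
;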
Query planner cardinalities
When the join is not pushed down, good cardinality estimates matter because the join order and method are chosen by the PostgreSQL cost-based optimizer. By default, the MongoDB Foreign Data Wrapper uses the same constant estimate for all tables:
explain
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
QUERY PLAN
----------------------------------------------------------------------------
Hash Join (cost=1062.50..2240.00 rows=5000 width=192)
Hash Cond: (fact.ref = dim.dimid)
-> Foreign Scan on fact (cost=25.00..1025.00 rows=1000 width=76)
Foreign Namespace: test.fact
-> Hash (cost=1025.00..1025.00 rows=1000 width=116)
-> Foreign Scan on dim (cost=25.00..1025.00 rows=1000 width=116)
Foreign Namespace: test.dim
I was fortunate that the small table was hashed, but this is not guaranteed, as both options have the same cost. To get more accurate cardinality estimates, I enabled the remote estimate feature, which requests a count estimation from the MongoDB database:
ALTER SERVER "MongoDB server"
OPTIONS (ADD use_remote_estimate 'true')
;
explain
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
QUERY PLAN
----------------------------------------------------------------------
Hash Join (cost=14.04..3887.03 rows=7000 width=192)
Hash Cond: (fact.ref = dim.dimid)
-> Foreign Scan on fact (cost=5.00..3682.98 rows=33333 width=76)
Foreign Namespace: test.fact
-> Hash (cost=8.52..8.52 rows=42 width=116)
-> Foreign Scan on dim (cost=5.00..8.52 rows=42 width=116)
Foreign Namespace: test.dim
To determine the correct join order, even a rough selectivity estimate is sufficient: in a hash join, the smaller table should be the build side, while the larger one serves as the probe side.
Invoke DuckDB to query the foreign tables
This did not go through DuckDB: here, DuckDB is an extension within PostgreSQL, which executed the query itself. To run the same query through DuckDB, I forced its execution:
set duckdb.force_execution to true;
explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
WARNING: Unsupported Postgres type: No conversion to DuckDB available for type with oid=19)
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Hash Join (actual time=1.598..102.117 rows=49889 loops=1)
Output: fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Hash Cond: (fact.ref = dim.dimid)
-> Foreign Scan on public.fact (actual time=1.018..90.344 rows=49889 loops=1)
Output: fact._id, fact.ref, fact.value
Foreign Namespace: test.fact
-> Hash (actual time=0.558..0.558 rows=42 loops=1)
Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Buckets: 1024 Batches: 1 Memory Usage: 15kB
-> Foreign Scan on public.dim (actual time=0.464..0.527 rows=42 loops=1)
Output: dim._id, dim.dimid, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags
Foreign Namespace: test.dim
Planning Time: 25.667 ms
Execution Time: 104.390 ms
The execution looks the same because it has actually been run by PostgreSQL - it is a PostgreSQL execution plan. Note that I got the following warning:
WARNING: Unsupported Postgres type: No conversion to DuckDB available for type with oid=19)
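The oid 19 is the PostgreSQL "name" type used for the "_id" columns, which you can confirm from the catalog:
select typname from pg_type where oid = 19;
-- returns: name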
To analyze the query prepared by the DuckDB extension, I enabled DuckDB debug messages:
set client_min_messages to debug;
set duckdb.log_pg_explain to true;
explain (analyze, costs off, verbose)
select * from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
DEBUG: (PGDuckDB/DuckdbPrepare) Preparing:
SELECT fact._id, fact.ref, fact.value, dim._id, dim.dimid, dim.dimvalue , dim."coord.x", dim."coord.y", dim.tags
FROM (pgduckdb.public.fact
JOIN pgduckdb.public.dim ON ((fact.ref = dim.dimid)))
WHERE (fact.value > (0.5)::double precision)
DEBUG: (DuckDB/SetTableInfo) Column name: '_id', Type: "No conversion to DuckDB available for type with oid=19)"
DEBUG: (DuckDB/SetTableInfo) Column name: 'ref', Type: INTEGER
DEBUG: (DuckDB/SetTableInfo) Column name: 'value', Type: DOUBLE
DEBUG: (DuckDB/SetTableInfo) Column name: '_id', Type: "No conversion to DuckDB available for type with oid=19)"
DEBUG: (DuckDB/SetTableInfo) Column name: 'dimid', Type: INTEGER
DEBUG: (DuckDB/SetTableInfo) Column name: 'dimvalue', Type: DOUBLE
DEBUG: (DuckDB/SetTableInfo) Column name: 'coord.x', Type: INTEGER
DEBUG: (DuckDB/SetTableInfo) Column name: 'coord.y', Type: INTEGER
DEBUG: (DuckDB/SetTableInfo) Column name: 'tags', Type: VARCHAR[]
I ran the same query without the "_id" columns, which I do not need, and got a DuckDB execution plan:
explain (analyze, costs off, verbose)
select fact.ref, fact.value, dim.dimid, dim.dimvalue, dim.dimvalue
, dim."coord.x", dim."coord.y", dim.tags
from fact join dim on fact.ref=dim.dimid
where fact.value>0.5
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Custom Scan (DuckDBScan) (actual time=0.001..0.002 rows=0 loops=1)
Output: duckdb_scan.ref, duckdb_scan.value, duckdb_scan.dimid, duckdb_scan.dimvalue, duckdb_scan.dimvalue_1, duckdb_scan."coord.x", duckdb_scan."coord.y", duckdb_scan.tags
DuckDB Execution Plan:
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││ Query Profiling Information ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT fact.ref, fact.value, dim.dimid, dim.dimvalue, dim.dimvalue, dim."coord.x", dim."coord.y", dim.tags FROM (pgduckdb.public.fact JOIN pgduckdb.public.dim ON ((fact.ref = dim.dimid))) WHERE (fact.value > (0.5)::double precision)
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││ Total Time: 0.131s ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│ QUERY │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ EXPLAIN_ANALYZE │
│ ──────────────────── │
│ 0 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ ref │
│ value │
│ dimid │
│ dimvalue │
│ dimvalue │
│ coord.x │
│ coord.y │
│ tags │
│ │
│ 49889 Rows │
│ (0.00s) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: ├──────────────┐
│ ref = dimid │ │
│ │ │
│ 49889 Rows │ │
│ (0.01s) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ TABLE_SCAN ││ TABLE_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: fact ││ Table: dim │
│ ││ │
│ Projections: ││ Projections: │
│ ref ││ dimid │
│ value ││ dimvalue │
│ ││ coord.x │
│ Filters: value>0.5 ││ coord.y │
│ ││ tags │
│ ││ │
│ 49889 Rows ││ 42 Rows │
│ (0.12s) ││ (0.00s) │
└───────────────────────────┘└───────────────────────────┘
Planning Time: 4.371 ms
Execution Time: 1.219 ms
(69 rows)
This is a DuckDB execution plan. It is not visible in the plan, but the TABLE_SCAN operators are foreign scans querying the MongoDB database.
Cache static tables as temporary tables
DuckDB does not cache data from its sources, so that queries reflect the consistent state of the remote database. If you need to query the same data multiple times, you can create a local temporary table for this purpose, which uses the DuckDB columnar format:
postgres=# \timing on
Timing is on.
postgres=# \dconfig duckdb.force*
List of configuration parameters
Parameter | Value
------------------------+-------
duckdb.force_execution | off
postgres=# create temporary table "fact >0.5" using duckdb as
select fact.ref, fact.value
from fact
where fact.value>0.5
;
CREATE TABLE AS
Time: 179.794 ms
postgres=# create temporary table "dim full" using duckdb as
select dim.dimid, dim.dimvalue
from dim
;
CREATE TABLE AS
Time: 6.831 ms
postgres=# select fact.ref, fact.value, dim.dimid, dim.dimvalue
from "fact >0.5" fact join "dim full" dim on fact.ref=dim.dimid
where fact.value>0.5
;
ref | value | dimid | dimvalue
-----+--------------------+-------+----------------------
40 | 0.6335388780874929 | 40 | 0.13435523642571567
31 | 0.760540408920245 | 31 | 0.43496303194768715
40 | 0.7096153630036202 | 40 | 0.13435523642571567
38 | 0.6904659480270525 | 38 | 0.9497622200643621
14 | 0.8475573339736822 | 14 | 0.42215182933670214
28 | 0.5476860990695016 | 28 | 0.5159051376729527
35 | 0.9532386222160323 | 35 | 0.6724020854302464
37 | 0.8393590904668813 | 37 | 0.8415113287385758
...
(49889 rows)
Time: 38.975 ms
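To double-check that these temporary tables really use the DuckDB table access method (and hence its columnar format), you can look up their access method in the catalog (a small check, using the table names created above):
select c.relname, a.amname
from pg_class c
join pg_am a on a.oid = c.relam
where c.relname in ('fact >0.5', 'dim full')
;
-- amname is expected to be duckdb for tables created with USING duckdb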
Map the whole document
It is possible to map the whole document as a JSONB:
CREATE FOREIGN TABLE dim_json (
"__doc" JSONB -- you must use __ for the name
) SERVER "MongoDB server"
OPTIONS ( database 'test', collection 'dim' )
;
However, if you go down that route, you probably want to query MongoDB directly rather than work through JSON operators. Here is how to join on a field extracted from the document:
select fact.ref, fact.value, dim."__doc"
from fact
join dim_json dim
on fact.ref::int = (dim."__doc"->>'dimid')::int
where fact.value>0.5
;
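For completeness, the usual JSONB operators also work on the mapped document, for example to project scalar and nested fields (a sketch based on the collections created above):
select (dim."__doc"->>'dimid')::int as dimid,
(dim."__doc"->>'dimvalue')::float as dimvalue,
(dim."__doc"->'coord'->>'x')::int as coord_x,
dim."__doc"->'tags' as tags
from dim_json dim
limit 5
;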
In conclusion, although there is no native connector between DuckDB and MongoDB (#17925), you can query MongoDB from DuckDB using the PostgreSQL Foreign Data Wrapper. You can optimize data access by pushing down filters (WHERE), aggregates (GROUP BY), pagination (ORDER BY and LIMIT), and joins. Depending on the data size, it may be preferable to join locally and store some data in columnar temporary tables.
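As an illustration of pushdown beyond a simple filter, here is the kind of aggregate-plus-pagination query that can benefit from it. Whether the GROUP BY, ORDER BY, and LIMIT are actually pushed to MongoDB depends on your mongo_fdw version and its options (such as enable_aggregate_pushdown and enable_order_by_pushdown, if available), and can be verified with EXPLAIN VERBOSE:
explain (verbose, costs off)
select ref, count(*) as cnt, avg(value) as avg_value
from fact
where value > 0.5
group by ref
order by ref
limit 10
;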
Of course, there are other solutions if you want to query MongoDB with SQL, such as the Atlas SQL Interface.