
Natural Language to MongoDB Queries

This page provides guidance on how to generate MongoDB queries for your data from natural language using a large language model (LLM).

For example, the following shows a natural language query and the mongosh query generated from it for the Atlas sample_mflix database:

Given the following natural language query:

Show me the genres and runtime of 10 movies from 2015 that have the most comments

This generates the following mongosh code:

db.movies.aggregate([
  {
    $match: {
      year: 2015,
    },
  },
  {
    $sort: {
      num_mflix_comments: -1,
    },
  },
  {
    $limit: 10,
  },
  {
    $project: {
      _id: 0,
      genres: 1,
      runtime: 1,
    },
  },
]);

In addition to using LLMs out of the box, you can use the following tools built by MongoDB to generate MongoDB queries from natural language:

Models that perform well on general tasks typically also perform well at MongoDB query generation. When selecting an LLM to generate MongoDB queries, refer to popular benchmarks like MMLU-Pro and Chatbot Arena Elo ratings to compare model performance.

This section outlines effective strategies for prompting an LLM to generate MongoDB queries.

Note

The following prompting strategies are based on benchmarks created by MongoDB. To learn more, see our public benchmark of natural language to mongosh code on Hugging Face.

Your base prompt, also called the system prompt, should provide a clear overview of your task, including:

  • The type of query to generate.

  • Information about the expected output structure, such as the driver language or tool that executes the query.

The following base prompt example demonstrates how to generate a MongoDB read operation or aggregation for mongosh:

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate([/* pipeline stages */])`

To improve query quality, add the following guidance to your base prompt to provide the model with common tips for generating effective MongoDB queries:

Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()), when appropriate, for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.
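
Tips 5, 7, and 9 are easier to see with concrete filters. The following sketch writes a few mongosh-style filter documents as plain TypeScript objects. The field names (`rated`, `genres`, `released`) are assumed from the sample_mflix.movies collection and are only illustrative.

```typescript
// Filter documents that could be passed to db.movies.find(...) in mongosh.
// Field names are assumptions based on sample_mflix.movies.

// Tip 5: distinguish missing fields, explicit nulls, and empty arrays.
const fieldMissing = { rated: { $exists: false } }; // field is absent
const fieldIsNull = { rated: { $type: "null" } };   // field is present and null
const missingOrNull = { rated: null };              // matches both cases above
const emptyArray = { genres: { $size: 0 } };        // array with no elements

// Tip 7: use an explicit date instead of an empty `new Date()`.
const releasedSince2015 = {
  released: { $gte: new Date("2015-01-01T00:00:00Z") },
};

// Tip 9: $all requires every listed element to be present in the array.
const comedyAndDrama = { genres: { $all: ["Comedy", "Drama"] } };
```

Note that `{ rated: null }` matches documents where the field is either missing or null, which is why the tips ask for `$exists` and `$type` when the distinction matters.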

You can prompt the model to "think out loud" before generating the response to improve response quality. This technique, called chain-of-thought prompting, improves performance but increases generation time and cost.

To encourage the model to think step-by-step before generating the query, add the following text to your base prompt:

Think step by step about the code in the answer before providing it. In your thoughts, consider:
1. Which collections are relevant to the query.
2. Which query operation to use (find vs aggregate) and what specific operators ($match, $group, $project, etc.) are needed.
3. What fields are relevant to the query.
4. Which indexes you can use to improve performance.
5. What specific transformations or projections are required.
6. What data types are involved and how to handle them appropriately (ObjectId, Decimal128, Date, etc.).
7. What edge cases to consider (empty results, null values, missing fields).
8. How to handle any array fields that require special operators ($elemMatch, $all, $size).
9. Any other relevant considerations.
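
Because chain-of-thought prompting trades cost for quality, it can help to keep it as an optional section when assembling the system prompt. The following is a minimal sketch of that idea. `SystemPromptParts` and `buildSystemPrompt` are hypothetical names, and the strings passed in are shortened placeholders for the full prompt texts on this page.

```typescript
// Hypothetical helper: joins the prompt sections described on this page
// into a single system prompt, skipping any section that is not provided.
interface SystemPromptParts {
  basePrompt: string;      // task overview and output format
  queryTips?: string;      // general query-authoring tips
  chainOfThought?: string; // step-by-step thinking instructions
}

function buildSystemPrompt(parts: SystemPromptParts): string {
  const sections: string[] = [];
  for (const section of [parts.basePrompt, parts.queryTips, parts.chainOfThought]) {
    // Skip undefined, empty, or whitespace-only sections.
    if (section && section.trim()) {
      sections.push(section.trim());
    }
  }
  // Separate sections with a blank line.
  return sections.join("\n\n");
}

// Shortened placeholder strings; use the full prompt texts in practice.
const withReasoning = buildSystemPrompt({
  basePrompt: "You are an expert data analyst experienced at using MongoDB.",
  queryTips: "Some general query-authoring tips:",
  chainOfThought: "Think step by step about the code in the answer before providing it.",
});
```

Omitting `chainOfThought` yields the cheaper prompt variant without changing the other sections.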

To significantly improve query quality, include a few representative sample documents from your collection. Two to three representative documents typically provide the model with sufficient context about the data structure.

When providing sample documents, follow these guidelines:

  • Use the BSON.EJSON.serialize() function to convert BSON documents to Extended JSON (EJSON) objects, or BSON.EJSON.stringify() to convert them directly to EJSON strings, for the prompt.

  • Truncate long fields or deeply nested objects.

  • Exclude long string values.

  • For large arrays, like vector embeddings, include only a few elements.
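
The truncation guidelines above can be sketched as a small recursive function. This is one possible approach, assuming the documents have already been converted to plain Extended JSON objects. `truncateForPrompt`, `TruncateOptions`, and the default limits are hypothetical and should be tuned for your data and token budget.

```typescript
// Hypothetical limits for shrinking a sample document before prompting.
interface TruncateOptions {
  maxStringLength: number; // longer strings are cut and suffixed with "..."
  maxArrayLength: number;  // only the first N array elements are kept
  maxDepth: number;        // plain objects at this depth or deeper are replaced
}

const DEFAULT_OPTIONS: TruncateOptions = {
  maxStringLength: 100,
  maxArrayLength: 3,
  maxDepth: 3,
};

function truncateForPrompt(
  value: unknown,
  options: TruncateOptions = DEFAULT_OPTIONS,
  depth: number = 0
): unknown {
  if (typeof value === "string") {
    return value.length > options.maxStringLength
      ? value.slice(0, options.maxStringLength) + "..."
      : value;
  }
  if (Array.isArray(value)) {
    return value
      .slice(0, options.maxArrayLength)
      .map((item: unknown) => truncateForPrompt(item, options, depth + 1));
  }
  const isPlainObject =
    typeof value === "object" &&
    value !== null &&
    Object.getPrototypeOf(value) === Object.prototype;
  if (isPlainObject) {
    if (depth >= options.maxDepth) return "[truncated object]";
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    for (const key of Object.keys(source)) {
      result[key] = truncateForPrompt(source[key], options, depth + 1);
    }
    return result;
  }
  // Numbers, booleans, null, and non-plain objects such as Date pass through.
  return value;
}
```

With the defaults, a 1,000-element embedding array shrinks to its first three elements, which is usually enough for the model to infer the field's type.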

Apply the following prompting best practices for specific use cases when generating MongoDB queries from natural language.

Include collection indexes in your prompt to encourage the LLM to generate more performant queries. MongoDB drivers and mongosh provide methods to get index information. For example, the Node.js driver provides the listIndexes() method to get indexes for your prompt.
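
As a rough sketch, index information can be reduced to one line per index before it goes into the prompt. The `IndexDescription` shape below lists only some of the fields found in the documents that the Node.js driver's `collection.listIndexes().toArray()` returns. `formatIndexesForPrompt` and the example index list are hypothetical, hard-coded so the sketch runs without a database.

```typescript
// Subset of the fields in an index description document.
interface IndexDescription {
  name: string;
  key: Record<string, number | string>;
  unique?: boolean;
}

// Renders each index as "<name>: <key pattern>", marking unique indexes.
function formatIndexesForPrompt(indexes: IndexDescription[]): string {
  return indexes
    .map((index) => {
      const flags = index.unique ? " (unique)" : "";
      return index.name + ": " + JSON.stringify(index.key) + flags;
    })
    .join("\n");
}

// Hypothetical index list for illustration only.
const exampleIndexes: IndexDescription[] = [
  { name: "_id_", key: { _id: 1 } },
  { name: "year_1_num_mflix_comments_-1", key: { year: 1, num_mflix_comments: -1 } },
];

const indexSection = formatIndexesForPrompt(exampleIndexes);
```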

Most LLM tools include the date in their system prompt. However, if you use an LLM out of the box, the model does not know the current date or time. Therefore, when working with base models or building your own natural language to MongoDB tools, include the current date in your prompt. To get the current date as a string, use the method for your programming language, such as JavaScript's new Date().toString() or Python's str(datetime.now()).
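
A minimal sketch of producing the date line for the prompt follows. `latestDateLine` is a hypothetical helper. It uses `toISOString()` rather than `toString()` as a design choice, because the ISO 8601 output is always in UTC and does not depend on the machine's locale.

```typescript
// Hypothetical helper: builds the "Latest Date" line for the user message.
// Accepts an optional Date so the output can be fixed in tests.
function latestDateLine(now: Date = new Date()): string {
  return "Latest Date: " + now.toISOString() + " (use this to inform dates in queries)";
}

// Fixed input so the result is reproducible.
const exampleDateLine = latestDateLine(new Date("2024-10-24T12:00:00Z"));
// exampleDateLine === "Latest Date: 2024-10-24T12:00:00.000Z (use this to inform dates in queries)"
```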

Include annotated schemas of relevant database collections in your prompt. While no single representation method works best for all LLMs, some approaches are more effective than others.

We recommend representing collections using programming language-native types that describe data shape, such as TypeScript types, Python Pydantic models, or Go structs. If you use MongoDB from these languages, you likely have the data shape defined already. To guide the LLM and reduce ambiguity, add comments to your prompt to describe each field.

The following example shows a TypeScript type for the sample_mflix.movies collection:
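
A minimal sketch of such a type, covering only a subset of fields. The field list, optionality, and comments are assumptions based on the public sample dataset, so verify them against your own data. `ObjectId` is declared as a stand-in so the snippet is self-contained.

```typescript
// Stand-in so the snippet is self-contained; in an application, import
// ObjectId from the "mongodb" driver package instead.
type ObjectId = { toHexString(): string };

/** A movie in the sample_mflix.movies collection (abbreviated, assumed fields). */
interface Movie {
  /** Unique identifier. */
  _id: ObjectId;
  /** Movie title. */
  title: string;
  /** Release year, for example 2015. */
  year: number;
  /** Runtime in minutes. */
  runtime?: number;
  /** Genres, for example ["Comedy", "Drama"]. */
  genres?: string[];
  /** Names of the main cast members. */
  cast?: string[];
  /** Release date. */
  released?: Date;
  /** IMDb rating information. */
  imdb?: { rating: number; votes: number; id: number };
  /** Number of comments associated with this movie. */
  num_mflix_comments?: number;
}
```

The per-field comments carry most of the value here, since they tell the model what a field means and not only its type.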

The following example demonstrates a complete prompt using the strategies described on this page for generating mongosh code from natural language.

Use the following system prompt example as a template for your MongoDB query generation tasks. The sample prompt includes the following components:

  • Task overview and expected output format

  • General MongoDB query authoring guidance

You are an expert data analyst experienced at using MongoDB.
Your job is to take information about a MongoDB database plus a natural language query and generate a MongoDB shell (mongosh) query to execute to retrieve the information needed to answer the natural language query.
Format the mongosh query in the following structure:
`db.<collection name>.find({/* query */})` or `db.<collection name>.aggregate([/* pipeline stages */])`
Some general query-authoring tips:
1. Ensure proper use of MongoDB operators ($eq, $gt, $lt, etc.) and data types (ObjectId, ISODate).
2. For complex queries, use aggregation pipeline with proper stages ($match, $group, $lookup, etc.).
3. Consider performance by utilizing available indexes, avoiding $where and full collection scans, and using covered queries where possible.
4. Include sorting (.sort()) and limiting (.limit()) when appropriate for result set management.
5. Handle null values and existence checks explicitly with $exists and $type operators to differentiate between missing fields, null values, and empty arrays.
6. Do not include `null` in results objects in aggregation, e.g. do not include _id: null.
7. For date operations, NEVER use an empty new date object (e.g. `new Date()`). ALWAYS specify the date, such as `new Date("2024-10-24")`. Use the provided 'Latest Date' field to inform dates in queries.
8. For Decimal128 operations, prefer range queries over exact equality.
9. When querying arrays, use appropriate operators like $elemMatch for complex matching, $all to match multiple elements, or $size for array length checks.

Note

You might also add the chain-of-thought prompt to encourage step-by-step thinking before code generation.

Then, use the following user message template to provide the model with the necessary context about your database and your desired query:

Generate MongoDB Shell (mongosh) queries for the following database and natural language query:
## Database Information
Name: {{Database name}}
Description: {{database description}}
Latest Date: {{latest date}} (use this to inform dates in queries)
### Collections
#### Collection `{{collection name. Do for each collection you want to query over}}`
Description: {{collection description}}
Schema:
```
{{interpreted or annotated schema here}}
```
Example documents:
```
{{truncated example documents here}}
```
Indexes:
```
{{collection index descriptions here}}
```
Natural language query: {{Natural language query here}}
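
Filling in the template can be sketched as a function that takes the database context and returns the user message. The interfaces and the `buildUserMessage` name are hypothetical. The schema, example documents, and index descriptions are assumed to be prepared as strings beforehand.

```typescript
// Hypothetical input shapes mirroring the placeholders in the template.
interface CollectionInfo {
  name: string;
  description: string;
  schema: string;           // interpreted or annotated schema
  exampleDocuments: string; // truncated example documents
  indexes: string;          // collection index descriptions
}

interface DatabaseInfo {
  name: string;
  description: string;
  latestDate: string;
  collections: CollectionInfo[];
}

// Three backticks, written as escapes so this sketch can itself be fenced.
const FENCE = "\u0060\u0060\u0060";

function fenced(content: string): string {
  return [FENCE, content, FENCE].join("\n");
}

function buildUserMessage(db: DatabaseInfo, naturalLanguageQuery: string): string {
  const lines: string[] = [
    "Generate MongoDB Shell (mongosh) queries for the following database and natural language query:",
    "## Database Information",
    "Name: " + db.name,
    "Description: " + db.description,
    "Latest Date: " + db.latestDate + " (use this to inform dates in queries)",
    "### Collections",
  ];
  // Repeat the collection block for each collection to query over.
  for (const collection of db.collections) {
    lines.push(
      "#### Collection `" + collection.name + "`",
      "Description: " + collection.description,
      "Schema:",
      fenced(collection.schema),
      "Example documents:",
      fenced(collection.exampleDocuments),
      "Indexes:",
      fenced(collection.indexes)
    );
  }
  lines.push("Natural language query: " + naturalLanguageQuery);
  return lines.join("\n");
}
```

The returned string is sent as the user message, alongside the system prompt shown earlier on this page.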
