Transform Documents into Data with Snowflake Document AI
Unstructured data is everywhere — PDFs, emails, reports — and turning it into actionable insights has traditionally meant slow, manual work or complex scripts. In contrast, structured data is easy to analyze, automate, and use for decision-making.
That’s where Snowflake Document AI comes in. It transforms messy, unstructured documents into clean, structured data using a mix of machine learning, LLMs, and smart AI-powered processing.
In this post, we’ll walk through how Snowflake Document AI works, highlight its standout features, explore real-world use cases, and guide you through setup — so you can start extracting value, not just data.
Snowflake Document AI is an intelligent tool that turns unstructured documents — like PDFs, forms, and handwritten notes — into structured, usable data. Powered by Arctic-TILT, a document-focused large language model developed by Snowflake, it understands both text and layout, making it ideal for complex document processing.
With zero-shot extraction, it can pull information from unfamiliar document types right away. For more specific needs, fine-tuning lets you customize the model for your business, improving accuracy over time.
The process is simple: upload sample documents, define what to extract using natural language, fine-tune if needed, and scale up with automated workflows using Snowflake’s PREDICT function.
No ML background? No problem. Document AI is built for ease of use — just point, click, and extract.
Real-World Power: How Snowflake Document AI Adds Value
Snowflake Document AI simplifies the way organizations handle document-based data by bridging the gap between raw content and actionable insights. Its key strengths lie in:
- Smart Data Structuring: It intelligently extracts text and layout-based information from PDFs, images, and forms using built-in OCR and LLM-powered models — no manual tagging needed.
- End-to-End Automation: Once trained, the tool processes documents in bulk via pipelines, slashing the need for manual input and speeding up operations.
- Easy for All: Business users configure with plain language; engineers integrate using SQL or Snowpark.
Whether it’s invoices, contracts, or handwritten notes, Document AI makes extracting meaning — and value — fast and scalable.
Document Processing in Snowflake
Quick Start: Setting Up Snowflake Document AI
Ready to put Snowflake Document AI to work? Here’s a streamlined step-by-step guide to get you started.
Prerequisites:
- A Snowflake account.
- Sufficient access: the ACCOUNTADMIN role, or an administrator who can grant the roles and privileges below.
- Sample documents (PDFs, forms, or scans) to train and test the model.
Below are the steps:
1. Sign in to Snowflake/Snowsight
2. Environment Preparation
CREATE WAREHOUSE documentai_wh WITH WAREHOUSE_SIZE = 'Small';
CREATE DATABASE documentai_db;
CREATE SCHEMA documentai_db.documentai_schema;
3. Granting Required Roles and Privileges
First, create a custom role, documentai_role, that will own the Snowflake Document AI model build. Make sure to use the ACCOUNTADMIN role for this step.
USE ROLE ACCOUNTADMIN;
CREATE ROLE documentai_role;
Note: The ACCOUNTADMIN role alone does not give access to Snowflake Document AI. You must grant the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role and the required privileges to your account role.
GRANT DATABASE ROLE SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR TO ROLE documentai_role;
GRANT USAGE, OPERATE ON WAREHOUSE documentai_wh TO ROLE documentai_role;
GRANT USAGE ON DATABASE documentai_db TO ROLE documentai_role;
GRANT USAGE ON SCHEMA documentai_db.documentai_schema TO ROLE documentai_role;
GRANT CREATE STAGE ON SCHEMA documentai_db.documentai_schema TO ROLE documentai_role;
GRANT CREATE SNOWFLAKE.ML.DOCUMENT_INTELLIGENCE ON SCHEMA documentai_db.documentai_schema TO ROLE documentai_role;
GRANT CREATE STREAM, CREATE TABLE, CREATE TASK, CREATE VIEW ON SCHEMA documentai_db.documentai_schema TO ROLE documentai_role;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE documentai_role;
GRANT ROLE documentai_role TO USER <UserName>;
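With the grants in place, switch into the new role and context for the rest of the setup:
USE ROLE documentai_role;
USE WAREHOUSE documentai_wh;
USE SCHEMA documentai_db.documentai_schema;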
4. Upload Documents
Once you have granted the roles and privileges, get the sample documents you want to train on ready; you will upload them to the model build in Snowsight in step 6.
5. Building Snowflake Document AI Model
- Sign in to Snowsight using an account role that has been granted the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role.
- Navigate to AI & ML > Document AI.
- Select a warehouse and click Build.
- Enter a name for your model, select the location (database and schema), and click Create.
6. Uploading Documents to the Model
In the model build view, upload the sample documents you prepared earlier. Document AI uses these files to learn the layout of your documents and to test extraction.
7. Identifying Key Data Points for Extraction
In this step, you’ll specify the data values or entities you want Snowflake Document AI to extract from your documents.
In the model build view, navigate to the “Build Details” tab and click on “Define values”.
For every piece of information you want to extract, provide a descriptive field name along with a natural language question that reflects what you’re looking for. Snowflake Document AI will use your questions to locate and extract the relevant data from your uploaded documents.
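For example, with the patent documents used later in this post, the value definitions might look like this (the field names match the extraction query in step 14; the question wording is illustrative):
- Patent_No: “What is the patent number?”
- Patent_Name: “What is the title of the patent?”
- Inventors_Name: “Who are the inventors?”
- Patent_Date: “On what date was the patent granted?”
- Patent_File_Date: “On what date was the patent application filed?”
- Applicant_No: “What is the application number?”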
8. Validating and Assessing Model Output
Once Snowflake Document AI extracts data, review the results and take action:
- Accept correct answers by selecting “Accept all and close.”
- Manually correct any wrong answers.
- Edit multiple answers by removing errors or adding the right ones.
- If no answer appears, manually enter the correct value.
9. Train for Accuracy: Snowflake Document AI
If the results from Snowflake Document AI aren’t accurate enough, you can fine-tune the model using your own documents. This helps improve how well it understands and extracts the right information. To do that:
- In the model build view, go to the “Build Details” tab and select “Train model” under “Model accuracy”.
- Confirm the training process by selecting “Start training” in the dialog.
- The training process may take some time, depending on the number of documents and values being trained.
Once the training is complete, you can re-evaluate the model’s performance by reviewing the results on a separate set of documents.
10. Publish the Model Build
Once you’re happy with the model’s accuracy and results, publish it for production use:
- In the model build view, go to the “Build Details” tab.
- Under “Model accuracy”, click “Publish version”.
- Confirm by selecting “Publish” in the dialog box.
11. Extracting Information Using Document AI
With your model build published, you’re ready to start extracting data by calling the <model_build_name>!PREDICT method on documents in a stage.
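Here’s a minimal sketch of such a call, assuming a published model build named DEMO_DOCUMENT_REPORT (the name used in the rest of this post) and a stage named docai_stage with its directory table enabled (we create it in step 12):
-- Run extraction on every file in the stage; the second argument is the model build version.
SELECT relative_path,
       DEMO_DOCUMENT_REPORT!PREDICT(GET_PRESIGNED_URL(@docai_stage, relative_path), 2) AS extracted
FROM DIRECTORY(@docai_stage);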
Congrats! If you’ve followed along, you now have a working Snowflake Document AI setup, ready to handle document processing efficiently. Next up, we’ll automate the pipeline end to end, so keep reading!
12. Automating Document Extraction in Snowflake
With your model build published, you can now set up an automated processing pipeline using Snowflake’s streaming and task features.
Start by creating a Snowflake stage to store the documents. You can choose between an external or internal stage — here, we’ll use an internal stage.
Use the CREATE STAGE command to set it up:
CREATE OR REPLACE STAGE docai_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
Now that the stage is created, let’s create a stream on it to monitor for new documents. To do this, use the CREATE STREAM command.
CREATE STREAM docai_stream ON STAGE docai_stage;
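Before wiring up the task, you can sanity-check the stream by querying it directly. Reading a stream does not consume its rows; they are only consumed when a DML statement (such as the task’s INSERT below) commits:
-- Peek at pending files without advancing the stream.
SELECT relative_path, size, METADATA$ACTION
FROM docai_stream;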
Create a table named documentai_stage_table to hold details about the documents, with columns for file_name, file_size, last_modified, snowflake_file_url, json_content, and any other fields you want to capture from the PDF files.
CREATE OR REPLACE TABLE documentai_stage_table (
    file_name VARCHAR,
    file_size NUMBER,
    last_modified TIMESTAMP_LTZ,
    snowflake_file_url VARCHAR,
    json_content VARIANT
);
The json_content column will hold the extracted information in JSON form, which is why it is typed as VARIANT.
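For reference, the JSON that PREDICT returns generally has this shape (the field names match the values you defined in step 7; the concrete values below are purely illustrative):
{
  "__documentMetadata": { "ocrScore": 0.98 },
  "Patent_No": [ { "score": 0.95, "value": "US1234567" } ],
  "Inventors_Name": [ { "score": 0.91, "value": "Jane Doe" } ]
}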
To process new documents in the stage, create a task named stream_new_document_task:
CREATE OR REPLACE TASK stream_new_document_task
  WAREHOUSE = documentai_wh
  SCHEDULE = '1 minute'
  COMMENT = 'Continuously load new docs on stage.'
WHEN SYSTEM$STREAM_HAS_DATA('docai_stream')
AS
INSERT INTO documentai_stage_table
SELECT relative_path AS file_name,
       size AS file_size,
       last_modified,
       file_url AS snowflake_file_url,
       DEMO_DOCUMENT_REPORT!PREDICT(
         GET_PRESIGNED_URL(@docai_stage, relative_path),
         2
       ) AS json_content
FROM docai_stream
WHERE METADATA$ACTION = 'INSERT';
Newly created tasks are suspended by default, so make sure to resume the task:
ALTER TASK stream_new_document_task RESUME;
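To confirm the task is running and catch any errors, check its recent run history; a quick sketch using the TASK_HISTORY table function:
-- Inspect the last few runs of the task, most recent first.
SELECT name, state, scheduled_time, error_message
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(TASK_NAME => 'STREAM_NEW_DOCUMENT_TASK'))
ORDER BY scheduled_time DESC
LIMIT 10;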
13. Upload New Documents to Snowflake Internal Stage
It is now time to upload the new documents you want to process to the Snowflake internal stage you created. You can do this through Snowsight.
To do so, first head over to the homepage of Snowsight, navigate to Data, and then to Databases.
Then, select the documentai_db database, the documentai_schema, and the docai_stage stage.
After that, click on the + Files button. In the “Upload Your Files” dialog that appears, select the PDF documents and click Upload.
Verify the upload by listing the files in the stage:
ls @DOCAI_STAGE;
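If you prefer scripting the upload over clicking through Snowsight, here is a sketch using the PUT command from SnowSQL (the local path is hypothetical). Keep AUTO_COMPRESS off, since Document AI reads the raw PDFs, and refresh the directory table so the stream sees the new files:
-- Stage local PDFs, then refresh the directory table the stream watches.
PUT file:///tmp/patents/*.pdf @docai_stage AUTO_COMPRESS = FALSE;
ALTER STAGE docai_stage REFRESH;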
14. View the Extracted Data in a Structured Format
Selecting the data straight from the model returns it as raw, semi-structured JSON:
SELECT DEMO_DOCUMENT_REPORT!PREDICT(GET_PRESIGNED_URL(@docai_stage, relative_path), 2) AS json
FROM DIRECTORY(@docai_stage);
Once the task executes, the extracted data lands in the stage table automatically:
SELECT * FROM documentai_stage_table;
To organize the extracted data into a structured format, run the following SQL to create a table named documentai_gold_table, which stores the extracted values along with the document-level OCR score. Running this query may take a few minutes.
CREATE OR REPLACE TABLE documentai_gold_table AS (
WITH temp AS (
SELECT
Relative_path AS file_name,
size AS file_size,
last_modified,
file_url AS snowflake_file_url,
DEMO_DOCUMENT_REPORT!PREDICT(GET_PRESIGNED_URL(@docai_stage, RELATIVE_PATH), 2) AS json_content
FROM directory(@docai_stage)
)
SELECT
file_name,
file_size,
last_modified,
snowflake_file_url,
json_content:__documentMetadata.ocrScore::FLOAT AS ocrScore,
c.value:value::STRING AS Applicant_No,
r.value:value::STRING AS Inventors_Name,
g.value:value::STRING AS Patent_Date,
p.value:value::STRING AS Patent_File_Date,
y.value:value::STRING AS Patent_Name,
z.value:value::STRING AS Patent_No
FROM temp,
LATERAL FLATTEN(INPUT => json_content:Applicant_No) c,
LATERAL FLATTEN(INPUT => json_content:Inventors_Name) r,
LATERAL FLATTEN(INPUT => json_content:Patent_Date) g,
LATERAL FLATTEN(INPUT => json_content:Patent_File_Date) p,
LATERAL FLATTEN(INPUT => json_content:Patent_Name) y,
LATERAL FLATTEN(INPUT => json_content:Patent_No) z
GROUP BY ALL
);
The query first gathers the raw PREDICT output, then pulls the relevant fields and the overall OCR score out of the JSON into individual columns. The FLATTEN function unpacks each array in json_content into rows, from which the value entries are cast into their own columns for easier viewing.
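As a lighter-weight alternative when each field holds a single value, you can read the first array element directly instead of flattening, which also surfaces the per-field confidence score. A sketch against the stage table populated by the task:
-- Pull one field and its confidence score without LATERAL FLATTEN.
SELECT file_name,
       json_content:Patent_No[0]:value::STRING AS patent_no,
       json_content:Patent_No[0]:score::FLOAT AS patent_no_score
FROM documentai_stage_table;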
To verify the results, run the following SQL:
select * from documentai_gold_table;
And You’re All Set!
By following these steps, you’ll have a fully automated document processing pipeline powered by Snowflake Document AI, Streams, and Tasks. The system will continuously watch for new files, extract key data using your trained model, and load results into a table — ready for analysis or integration with other datasets.
This hands-free approach simplifies your workflow, eliminates manual effort, and unlocks insights from unstructured documents with speed and precision.