Mastering XML Shredding in Snowflake (Part 1)

Config Driven XML Shredding in Snowflake

4 min readMay 20, 2025

Modern enterprises often deal with semi-structured data such as XML, especially in domains like finance, healthcare, insurance and manufacturing. While Snowflake now natively supports XML through its VARIANT and XML data types, building a scalable, maintainable, and extensible shredding framework still requires careful planning.

In Part 1 of this series, we’ll design a robust, config-driven XML shredding pipeline that handles nested structures, supports multiple schemas and uses stored procedures for dynamic execution.

Step 1: Load XML into RAW Layer as Variant

To preserve the original structure and simplify ingestion, we load raw XML as-is into a base table.

CREATE OR REPLACE TABLE RAW_XML (
  file_name STRING,
  load_time TIMESTAMP,
  xml_content VARIANT
);

INSERT INTO
  RAW_XML (file_name, load_time, xml_content)
SELECT
  'sample_dept.xml' AS file_name,
  CURRENT_TIMESTAMP() AS load_time,
  TO_VARIANT(PARSE_XML (
    '<dept>
  <dept_id>1</dept_id>
  <dept_name>Marketing</dept_name>
  <employee>
    <emp_id>101</emp_id>
    <emp_fname>Chandler</emp_fname>
    <emp_lname>Bing</emp_lname>
    <address>
      <street_1>123 Main St</street_1>
      <city>NY</city>
      <state>NY</state>
      <zipcode>10001</zipcode>
    </address>
  </employee>
</dept>'
  )) AS xml_content;

Pros:

Schema agnostic ingestion
No upfront parsing logic needed
Simple S3 to table flow

Cons:

Raw data is unreadable without further transformation
Potential for bloated storage if XML is verbose

Alternative: Use Snowpipe or COPY INTO XML support for high-volume automated ingestion.

We will save the final shredding result into the FINAL_TABLE (for now):

CREATE OR REPLACE TABLE FINAL_TABLE (
  dept_id INTEGER,
  dept_name STRING,
  emp_id STRING,
  emp_fname STRING,
  emp_lname STRING,
  street_1 STRING,
  city STRING,
  state STRING,
  zipcode STRING
);

Step 2: Handle Nested Structures via XMLGET

XML’s nested nature often maps poorly to flat relational models. We rely on recursive calls to XMLGET() for safe traversal.

SELECT
XMLGET(xml_content, 'order_id'):"$"::STRING AS order_id,
XMLGET(XMLGET(xml_content, 'customer'), 'customer_id'):"$"::STRING AS customer_id
FROM RAW_XML;

Tip: Always include :”$” to access the leaf node content.

Step 3: Define an XML_CONFIG Table

Instead of hardcoding XML paths into queries, we define a central configuration table:

CREATE OR REPLACE TABLE xml_config (
  id INT,
  xml_type STRING,
  tag_path STRING,      -- e.g. 'dept_id', 'employee.emp_fname'
  column_name STRING,   -- e.g. 'dept_id', 'emp_fname'
  data_type STRING      -- e.g. 'STRING', 'INTEGER'
);

INSERT INTO xml_config VALUES
(1, 'DEPT', 'dept_id', 'dept_id', 'INTEGER'),
(2, 'DEPT', 'dept_name', 'dept_name', 'STRING'),
(3, 'DEPT', 'employee.emp_id', 'emp_id', 'STRING'),
(4, 'DEPT', 'employee.emp_fname', 'emp_fname', 'STRING'),
(5, 'DEPT', 'employee.emp_lname', 'emp_lname', 'STRING'),
(6, 'DEPT', 'employee.address.street_1', 'street_1', 'STRING'),
(7, 'DEPT', 'employee.address.city', 'city', 'STRING'),
(8, 'DEPT', 'employee.address.state', 'state', 'STRING'),
(9, 'DEPT', 'employee.address.zipcode', 'zipcode', 'STRING');

Pros:

Externalizes logic from code
Simplifies maintenance and schema extension
Enables reuse across multiple files and sources

Cons:

Requires strong data stewardship to keep config synced
Doesn’t validate if paths are correct until runtime

Alternative: Use a metadata catalog or schema registry like AWS Glue or Collibra.

Step 4: Generate Dynamic SQL via Stored Procedure

We use a Python based stored procedure to read from xml_config, build XMLGET chains dynamically and execute the appropriate INSERT statement.

Key logic:

Parse tag_path into XMLGET() chains
Type-cast the result to data_type
Insert the result into FINAL_<xml_type>

Pros:

Flexible and dynamic
Supports many XML types with one codebase

Cons:

Harder to debug than static SQL

Alternative: Use Jinja templates to pre-generate and version control procedures per xml_type.

CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
    # Fetch config from xml_config
    config_df = session.sql(f"""
        SELECT
            tag_path,
            column_name,
            data_type
        FROM xml_config
        WHERE xml_type = '{xml_type_input}'
        ORDER BY id
    """).to_pandas()

    if config_df.empty:
        return f"No config found for xml_type: {xml_type_input}"

    def build_expr(row):
        parts = row["TAG_PATH"].split(".")
        expr = "xml_content"
        for part in parts:
            expr = f"XMLGET({expr}, '{part}')"
        return f"{expr}:\"$\"::{row['DATA_TYPE']} AS {row['COLUMN_NAME']}"

    select_clauses = [build_expr(row) for _, row in config_df.iterrows()]

    final_query = f"""
        INSERT INTO FINAL_TABLE
        SELECT {', '.join(select_clauses)}
        FROM RAW_XML
    """

    print("Executing query:\n" + final_query)
    session.sql(final_query).collect()

    return f"XML shredding complete for xml_type: {xml_type_input}"
$$;

CALL shred_xml('DEPT');

SELECT * FROM FINAL_TABLE;

Step 5: Support for Multiple XML Types

Since xml_config includes an xml_type, we reuse the same stored procedure across multiple schemas.

Each schema gets its own:

FINAL_<xml_type> table
Set of xml_config rows

This provides a scalable way to handle large multi-team XML ingestion.

INSERT INTO RAW_XML (file_name, load_time, xml_content)
SELECT
  'order_sample.xml' AS file_name,
  CURRENT_TIMESTAMP() AS load_time,
  TO_VARIANT(PARSE_XML (
    '<order>
      <order_id>2001</order_id>
      <order_date>2024-01-01</order_date>
      <customer>
        <customer_id>C123</customer_id>
        <customer_name>Monica Geller</customer_name>
      </customer>
      <total>459.99</total>
    </order>'
  )) AS xml_content;

INSERT INTO xml_config VALUES
(100, 'ORDER', 'order_id', 'order_id', 'INTEGER'),
(101, 'ORDER', 'order_date', 'order_date', 'STRING'),
(102, 'ORDER', 'customer.customer_id', 'customer_id', 'STRING'),
(103, 'ORDER', 'customer.customer_name', 'customer_name', 'STRING'),
(104, 'ORDER', 'total', 'total', 'FLOAT');

CREATE OR REPLACE TABLE FINAL_ORDER (
  order_id INTEGER,
  order_date STRING,
  customer_id STRING,
  customer_name STRING,
  total FLOAT
);

CREATE OR REPLACE TABLE FINAL_DEPT (
  dept_id INTEGER,
  dept_name STRING,
  emp_id STRING,
  emp_fname STRING,
  emp_lname STRING,
  street_1 STRING,
  city STRING,
  state STRING,
  zipcode STRING
);

--Modified stored proc to handle multiple types
CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
    config_df = session.sql(f"""
        SELECT
            tag_path,
            column_name,
            data_type
        FROM xml_config
        WHERE xml_type = '{xml_type_input}'
        ORDER BY id
    """).to_pandas()

    if config_df.empty:
        return f"No config found for xml_type: {xml_type_input}"

    # Build column list and select expressions
    column_list = []
    select_clauses = []

    for _, row in config_df.iterrows():
        col_name = row["COLUMN_NAME"]
        column_list.append(col_name)

        # Build XMLGET chain
        parts = row["TAG_PATH"].split(".")
        expr = "xml_content"
        for part in parts:
            expr = f"XMLGET({expr}, '{part}')"
        expr = f"{expr}:\"$\"::{row['DATA_TYPE']} AS {col_name}"

        select_clauses.append(expr)

    target_table = f'FINAL_{xml_type_input.upper()}'
    column_str = ", ".join(column_list)
    select_str = ", ".join(select_clauses)

    sql_query = f"""
        INSERT INTO {target_table} ({column_str})
        SELECT {select_str}
        FROM RAW_XML
        WHERE file_name ILIKE '%{xml_type_input.lower()}%'
    """

    print("Executing query:\n" + sql_query)
    session.sql(sql_query).collect()

    return f"XML shredding completed for type '{xml_type_input}' into '{target_table}'"
$$;

CALL shred_xml('DEPT');
CALL shred_xml('ORDER');

SELECT * FROM FINAL_DEPT;
SELECT * FROM FINAL_ORDER;

In part 2, we are going add audit logging to this process.

Mastering XML Shredding in Snowflake (Part 1)

Config Driven XML Shredding in Snowflake

Step 1: Load XML into RAW Layer as Variant

Step 2: Handle Nested Structures via XMLGET

Step 3: Define an XML_CONFIG Table

Step 4: Generate Dynamic SQL via Stored Procedure

Step 5: Support for Multiple XML Types

Written by Marcin Kulakowski

No responses yet