Sitemap

Mastering XML Shredding in Snowflake (Part 1)

Config Driven XML Shredding in Snowflake

4 min readMay 20, 2025

Modern enterprises often deal with semi-structured data such as XML, especially in domains like finance, healthcare, insurance and manufacturing. While Snowflake now natively supports XML through its VARIANT and XML data types, building a scalable, maintainable, and extensible shredding framework still requires careful planning.

In Part 1 of this series, we’ll design a robust, config-driven XML shredding pipeline that handles nested structures, supports multiple schemas and uses stored procedures for dynamic execution.

See red font for this end-to-end flow

Step 1: Load XML into RAW Layer as Variant

To preserve the original structure and simplify ingestion, we load raw XML as-is into a base table.

CREATE OR REPLACE TABLE RAW_XML (
file_name STRING,
load_time TIMESTAMP,
xml_content VARIANT
);
INSERT INTO
RAW_XML (file_name, load_time, xml_content)
SELECT
'sample_dept.xml' AS file_name,
CURRENT_TIMESTAMP() AS load_time,
TO_VARIANT(PARSE_XML (
'<dept>
<dept_id>1</dept_id>
<dept_name>Marketing</dept_name>
<employee>
<emp_id>101</emp_id>
<emp_fname>Chandler</emp_fname>
<emp_lname>Bing</emp_lname>
<address>
<street_1>123 Main St</street_1>
<city>NY</city>
<state>NY</state>
<zipcode>10001</zipcode>
</address>
</employee>
</dept>'
)) AS xml_content;

Pros:

  • Schema agnostic ingestion
  • No upfront parsing logic needed
  • Simple S3 to table flow

Cons:

  • Raw data is unreadable without further transformation
  • Potential for bloated storage if XML is verbose

Alternative: Use Snowpipe or COPY INTO XML support for high-volume automated ingestion.

We will save the final shredding result into the FINAL_TABLE (for now):

CREATE OR REPLACE TABLE FINAL_TABLE (
dept_id INTEGER,
dept_name STRING,
emp_id STRING,
emp_fname STRING,
emp_lname STRING,
street_1 STRING,
city STRING,
state STRING,
zipcode STRING
);

Step 2: Handle Nested Structures via XMLGET

XML’s nested nature often maps poorly to flat relational models. We rely on recursive calls to XMLGET() for safe traversal.

SELECT
XMLGET(xml_content, 'order_id'):"$"::STRING AS order_id,
XMLGET(XMLGET(xml_content, 'customer'), 'customer_id'):"$"::STRING AS customer_id
FROM RAW_XML;

Tip: Always include :”$” to access the leaf node content.

Step 3: Define an XML_CONFIG Table

Instead of hardcoding XML paths into queries, we define a central configuration table:

CREATE OR REPLACE TABLE xml_config (
id INT,
xml_type STRING,
tag_path STRING, -- e.g. 'dept_id', 'employee.emp_fname'
column_name STRING, -- e.g. 'dept_id', 'emp_fname'
data_type STRING -- e.g. 'STRING', 'INTEGER'
);
INSERT INTO xml_config VALUES
(1, 'DEPT', 'dept_id', 'dept_id', 'INTEGER'),
(2, 'DEPT', 'dept_name', 'dept_name', 'STRING'),
(3, 'DEPT', 'employee.emp_id', 'emp_id', 'STRING'),
(4, 'DEPT', 'employee.emp_fname', 'emp_fname', 'STRING'),
(5, 'DEPT', 'employee.emp_lname', 'emp_lname', 'STRING'),
(6, 'DEPT', 'employee.address.street_1', 'street_1', 'STRING'),
(7, 'DEPT', 'employee.address.city', 'city', 'STRING'),
(8, 'DEPT', 'employee.address.state', 'state', 'STRING'),
(9, 'DEPT', 'employee.address.zipcode', 'zipcode', 'STRING');

Pros:

  • Externalizes logic from code
  • Simplifies maintenance and schema extension
  • Enables reuse across multiple files and sources

Cons:

  • Requires strong data stewardship to keep config synced
  • Doesn’t validate if paths are correct until runtime

Alternative: Use a metadata catalog or schema registry like AWS Glue or Collibra.

Step 4: Generate Dynamic SQL via Stored Procedure

We use a Python based stored procedure to read from xml_config, build XMLGET chains dynamically and execute the appropriate INSERT statement.

Key logic:

  • Parse tag_path into XMLGET() chains
  • Type-cast the result to data_type
  • Insert the result into FINAL_<xml_type>

Pros:

  • Flexible and dynamic
  • Supports many XML types with one codebase

Cons:

  • Harder to debug than static SQL

Alternative: Use Jinja templates to pre-generate and version control procedures per xml_type.

CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
# Fetch config from xml_config
config_df = session.sql(f"""
SELECT
tag_path,
column_name,
data_type
FROM xml_config
WHERE xml_type = '{xml_type_input}'
ORDER BY id
""").to_pandas()

if config_df.empty:
return f"No config found for xml_type: {xml_type_input}"

def build_expr(row):
parts = row["TAG_PATH"].split(".")
expr = "xml_content"
for part in parts:
expr = f"XMLGET({expr}, '{part}')"
return f"{expr}:\"$\"::{row['DATA_TYPE']} AS {row['COLUMN_NAME']}"

select_clauses = [build_expr(row) for _, row in config_df.iterrows()]

final_query = f"""
INSERT INTO FINAL_TABLE
SELECT {', '.join(select_clauses)}
FROM RAW_XML
"""

print("Executing query:\n" + final_query)
session.sql(final_query).collect()

return f"XML shredding complete for xml_type: {xml_type_input}"
$$;
CALL shred_xml('DEPT');

SELECT * FROM FINAL_TABLE;

Step 5: Support for Multiple XML Types

Since xml_config includes an xml_type, we reuse the same stored procedure across multiple schemas.

Each schema gets its own:

  • FINAL_<xml_type> table
  • Set of xml_config rows

This provides a scalable way to handle large multi-team XML ingestion.

INSERT INTO RAW_XML (file_name, load_time, xml_content)
SELECT
'order_sample.xml' AS file_name,
CURRENT_TIMESTAMP() AS load_time,
TO_VARIANT(PARSE_XML (
'<order>
<order_id>2001</order_id>
<order_date>2024-01-01</order_date>
<customer>
<customer_id>C123</customer_id>
<customer_name>Monica Geller</customer_name>
</customer>
<total>459.99</total>
</order>'
)) AS xml_content;
INSERT INTO xml_config VALUES
(100, 'ORDER', 'order_id', 'order_id', 'INTEGER'),
(101, 'ORDER', 'order_date', 'order_date', 'STRING'),
(102, 'ORDER', 'customer.customer_id', 'customer_id', 'STRING'),
(103, 'ORDER', 'customer.customer_name', 'customer_name', 'STRING'),
(104, 'ORDER', 'total', 'total', 'FLOAT');
CREATE OR REPLACE TABLE FINAL_ORDER (
order_id INTEGER,
order_date STRING,
customer_id STRING,
customer_name STRING,
total FLOAT
);

CREATE OR REPLACE TABLE FINAL_DEPT (
dept_id INTEGER,
dept_name STRING,
emp_id STRING,
emp_fname STRING,
emp_lname STRING,
street_1 STRING,
city STRING,
state STRING,
zipcode STRING
);
--Modified stored proc to handle multiple types
CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
config_df = session.sql(f"""
SELECT
tag_path,
column_name,
data_type
FROM xml_config
WHERE xml_type = '{xml_type_input}'
ORDER BY id
""").to_pandas()

if config_df.empty:
return f"No config found for xml_type: {xml_type_input}"

# Build column list and select expressions
column_list = []
select_clauses = []

for _, row in config_df.iterrows():
col_name = row["COLUMN_NAME"]
column_list.append(col_name)

# Build XMLGET chain
parts = row["TAG_PATH"].split(".")
expr = "xml_content"
for part in parts:
expr = f"XMLGET({expr}, '{part}')"
expr = f"{expr}:\"$\"::{row['DATA_TYPE']} AS {col_name}"

select_clauses.append(expr)

target_table = f'FINAL_{xml_type_input.upper()}'
column_str = ", ".join(column_list)
select_str = ", ".join(select_clauses)

sql_query = f"""
INSERT INTO {target_table} ({column_str})
SELECT {select_str}
FROM RAW_XML
WHERE file_name ILIKE '%{xml_type_input.lower()}%'
"""

print("Executing query:\n" + sql_query)
session.sql(sql_query).collect()

return f"XML shredding completed for type '{xml_type_input}' into '{target_table}'"
$$;
CALL shred_xml('DEPT');
CALL shred_xml('ORDER');

SELECT * FROM FINAL_DEPT;
SELECT * FROM FINAL_ORDER;

In part 2, we are going add audit logging to this process.

--

--

Marcin Kulakowski
Marcin Kulakowski

Written by Marcin Kulakowski

Don't solve a problem, offer a better solution and show the art of the possible. Currently @ Snowflake.

No responses yet