Mastering XML Shredding in Snowflake (Part 1)
Config Driven XML Shredding in Snowflake
Modern enterprises often deal with semi-structured data such as XML, especially in domains like finance, healthcare, insurance and manufacturing. While Snowflake now natively supports XML through its VARIANT and XML data types, building a scalable, maintainable, and extensible shredding framework still requires careful planning.
In Part 1 of this series, we’ll design a robust, config-driven XML shredding pipeline that handles nested structures, supports multiple schemas and uses stored procedures for dynamic execution.
Step 1: Load XML into RAW Layer as Variant
To preserve the original structure and simplify ingestion, we load raw XML as-is into a base table.
CREATE OR REPLACE TABLE RAW_XML (
file_name STRING,
load_time TIMESTAMP,
xml_content VARIANT
);
INSERT INTO
RAW_XML (file_name, load_time, xml_content)
SELECT
'sample_dept.xml' AS file_name,
CURRENT_TIMESTAMP() AS load_time,
TO_VARIANT(PARSE_XML (
'<dept>
<dept_id>1</dept_id>
<dept_name>Marketing</dept_name>
<employee>
<emp_id>101</emp_id>
<emp_fname>Chandler</emp_fname>
<emp_lname>Bing</emp_lname>
<address>
<street_1>123 Main St</street_1>
<city>NY</city>
<state>NY</state>
<zipcode>10001</zipcode>
</address>
</employee>
</dept>'
)) AS xml_content;
Pros:
- Schema agnostic ingestion
- No upfront parsing logic needed
- Simple S3 to table flow
Cons:
- Raw data is unreadable without further transformation
- Potential for bloated storage if XML is verbose
Alternative: Use Snowpipe or COPY INTO XML support for high-volume automated ingestion.
We will save the final shredding result into the FINAL_TABLE (for now):
CREATE OR REPLACE TABLE FINAL_TABLE (
dept_id INTEGER,
dept_name STRING,
emp_id STRING,
emp_fname STRING,
emp_lname STRING,
street_1 STRING,
city STRING,
state STRING,
zipcode STRING
);
Step 2: Handle Nested Structures via XMLGET
XML’s nested nature often maps poorly to flat relational models. We rely on recursive calls to XMLGET() for safe traversal.
SELECT
XMLGET(xml_content, 'order_id'):"$"::STRING AS order_id,
XMLGET(XMLGET(xml_content, 'customer'), 'customer_id'):"$"::STRING AS customer_id
FROM RAW_XML;
Tip: Always include :”$” to access the leaf node content.
Step 3: Define an XML_CONFIG Table
Instead of hardcoding XML paths into queries, we define a central configuration table:
CREATE OR REPLACE TABLE xml_config (
id INT,
xml_type STRING,
tag_path STRING, -- e.g. 'dept_id', 'employee.emp_fname'
column_name STRING, -- e.g. 'dept_id', 'emp_fname'
data_type STRING -- e.g. 'STRING', 'INTEGER'
);
INSERT INTO xml_config VALUES
(1, 'DEPT', 'dept_id', 'dept_id', 'INTEGER'),
(2, 'DEPT', 'dept_name', 'dept_name', 'STRING'),
(3, 'DEPT', 'employee.emp_id', 'emp_id', 'STRING'),
(4, 'DEPT', 'employee.emp_fname', 'emp_fname', 'STRING'),
(5, 'DEPT', 'employee.emp_lname', 'emp_lname', 'STRING'),
(6, 'DEPT', 'employee.address.street_1', 'street_1', 'STRING'),
(7, 'DEPT', 'employee.address.city', 'city', 'STRING'),
(8, 'DEPT', 'employee.address.state', 'state', 'STRING'),
(9, 'DEPT', 'employee.address.zipcode', 'zipcode', 'STRING');
Pros:
- Externalizes logic from code
- Simplifies maintenance and schema extension
- Enables reuse across multiple files and sources
Cons:
- Requires strong data stewardship to keep config synced
- Doesn’t validate if paths are correct until runtime
Alternative: Use a metadata catalog or schema registry like AWS Glue or Collibra.
Step 4: Generate Dynamic SQL via Stored Procedure
We use a Python based stored procedure to read from xml_config, build XMLGET chains dynamically and execute the appropriate INSERT statement.
Key logic:
- Parse tag_path into XMLGET() chains
- Type-cast the result to data_type
- Insert the result into FINAL_<xml_type>
Pros:
- Flexible and dynamic
- Supports many XML types with one codebase
Cons:
- Harder to debug than static SQL
Alternative: Use Jinja templates to pre-generate and version control procedures per xml_type.
CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
# Fetch config from xml_config
config_df = session.sql(f"""
SELECT
tag_path,
column_name,
data_type
FROM xml_config
WHERE xml_type = '{xml_type_input}'
ORDER BY id
""").to_pandas()
if config_df.empty:
return f"No config found for xml_type: {xml_type_input}"
def build_expr(row):
parts = row["TAG_PATH"].split(".")
expr = "xml_content"
for part in parts:
expr = f"XMLGET({expr}, '{part}')"
return f"{expr}:\"$\"::{row['DATA_TYPE']} AS {row['COLUMN_NAME']}"
select_clauses = [build_expr(row) for _, row in config_df.iterrows()]
final_query = f"""
INSERT INTO FINAL_TABLE
SELECT {', '.join(select_clauses)}
FROM RAW_XML
"""
print("Executing query:\n" + final_query)
session.sql(final_query).collect()
return f"XML shredding complete for xml_type: {xml_type_input}"
$$;
CALL shred_xml('DEPT');
SELECT * FROM FINAL_TABLE;
Step 5: Support for Multiple XML Types
Since xml_config includes an xml_type, we reuse the same stored procedure across multiple schemas.
Each schema gets its own:
- FINAL_<xml_type> table
- Set of xml_config rows
This provides a scalable way to handle large multi-team XML ingestion.
INSERT INTO RAW_XML (file_name, load_time, xml_content)
SELECT
'order_sample.xml' AS file_name,
CURRENT_TIMESTAMP() AS load_time,
TO_VARIANT(PARSE_XML (
'<order>
<order_id>2001</order_id>
<order_date>2024-01-01</order_date>
<customer>
<customer_id>C123</customer_id>
<customer_name>Monica Geller</customer_name>
</customer>
<total>459.99</total>
</order>'
)) AS xml_content;
INSERT INTO xml_config VALUES
(100, 'ORDER', 'order_id', 'order_id', 'INTEGER'),
(101, 'ORDER', 'order_date', 'order_date', 'STRING'),
(102, 'ORDER', 'customer.customer_id', 'customer_id', 'STRING'),
(103, 'ORDER', 'customer.customer_name', 'customer_name', 'STRING'),
(104, 'ORDER', 'total', 'total', 'FLOAT');
CREATE OR REPLACE TABLE FINAL_ORDER (
order_id INTEGER,
order_date STRING,
customer_id STRING,
customer_name STRING,
total FLOAT
);
CREATE OR REPLACE TABLE FINAL_DEPT (
dept_id INTEGER,
dept_name STRING,
emp_id STRING,
emp_fname STRING,
emp_lname STRING,
street_1 STRING,
city STRING,
state STRING,
zipcode STRING
);
--Modified stored proc to handle multiple types
CREATE OR REPLACE PROCEDURE shred_xml(xml_type_input STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.9'
PACKAGES = ('snowflake-snowpark-python', 'pandas')
HANDLER = 'shred'
AS
$$
def shred(session, xml_type_input):
config_df = session.sql(f"""
SELECT
tag_path,
column_name,
data_type
FROM xml_config
WHERE xml_type = '{xml_type_input}'
ORDER BY id
""").to_pandas()
if config_df.empty:
return f"No config found for xml_type: {xml_type_input}"
# Build column list and select expressions
column_list = []
select_clauses = []
for _, row in config_df.iterrows():
col_name = row["COLUMN_NAME"]
column_list.append(col_name)
# Build XMLGET chain
parts = row["TAG_PATH"].split(".")
expr = "xml_content"
for part in parts:
expr = f"XMLGET({expr}, '{part}')"
expr = f"{expr}:\"$\"::{row['DATA_TYPE']} AS {col_name}"
select_clauses.append(expr)
target_table = f'FINAL_{xml_type_input.upper()}'
column_str = ", ".join(column_list)
select_str = ", ".join(select_clauses)
sql_query = f"""
INSERT INTO {target_table} ({column_str})
SELECT {select_str}
FROM RAW_XML
WHERE file_name ILIKE '%{xml_type_input.lower()}%'
"""
print("Executing query:\n" + sql_query)
session.sql(sql_query).collect()
return f"XML shredding completed for type '{xml_type_input}' into '{target_table}'"
$$;
CALL shred_xml('DEPT');
CALL shred_xml('ORDER');
SELECT * FROM FINAL_DEPT;
SELECT * FROM FINAL_ORDER;
In part 2, we are going add audit logging to this process.