Parse nested XML in PySpark

Question

I am having an issue parsing the XML data below in PySpark.

    <item name="Cake" ppu="0.55">
        <venue place="Bangalore" day="Friday">
            <batters>
                <batter name=Regular/>
                <batter name=Chocolate/>
                <batter name=Blueberry/>
            </batters>
            <topping id="5001">None</topping>
            <topping id="5002">Glazed</topping>
            <topping id="5005">Sugar</topping>
            <topping id="5006">Sprinkles</topping>
            <topping id="5003">Chocolate</topping>
            <topping id="5004">Maple</topping>
        </venue>
    </item>
    <item name="pizza" ppu="0.56"/>
        <batters>
            <batter place="Bangalore" name="Regular"/>
        <batters>
</items>

I am able to parse the first item tag, but I am unable to parse the second one. Any suggestion would be helpful.

So far I have tried the following:

df = spark.read\
     .format("com.databricks.spark.xml")\
     .option("rowTag", "item")\
     .option("valueTag", True)\
     .load("File.xml")

This gives me only the schema of the first tag. I am also unable to define a nested schema.

vladsiv · Accepted Answer · 2021-11-26 13:08:33Z

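Your XML example is not well-formed: the name attribute values on the first batter elements are unquoted, the second item is self-closed (/>) even though it has children, and its batters element is never closed.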

It should look like this:

<item name="Cake" ppu="0.55">
    <venue place="Bangalore" day="Friday">
        <batters>
            <batter name="Regular"/>
            <batter name="Chocolate"/>
            <batter name="Blueberry"/>
        </batters>
        <topping id="5001">None</topping>
        <topping id="5002">Glazed</topping>
        <topping id="5005">Sugar</topping>
        <topping id="5006">Sprinkles</topping>
        <topping id="5003">Chocolate</topping>
        <topping id="5004">Maple</topping>
    </venue>
</item>
<item name="pizza" ppu="0.56">
    <batters>
        <batter place="Bangalore" name="Regular"/>
    </batters>
</item>
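One quick way to confirm that a file is well-formed before handing it to Spark is to run it through Python's standard-library parser. This is a minimal sketch, not part of the original answer: is_well_formed is a hypothetical helper name, the sample strings are shortened stand-ins for the XML above, and the <items> root wrapper is an assumption (the question's snippet ends with </items>).

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Unquoted attribute value, as on the first <batter> elements in the question.
unquoted = '<items><item name="Cake"><batter name=Regular/></item></items>'

# Self-closed <item/> followed by a <batters> element that is never closed,
# as on the second item in the question.
unclosed = (
    '<items><item name="pizza"/>'
    '<batters><batter name="Regular"/><batters></items>'
)

# Quoted attributes and matching close tags, as in the corrected XML.
fixed = (
    '<items><item name="pizza">'
    '<batters><batter name="Regular"/></batters>'
    '</item></items>'
)

for label, text in [("unquoted", unquoted), ("unclosed", unclosed), ("fixed", fixed)]:
    print(label, is_well_formed(text))
```

Only the last sample should report True. For a real file, ET.parse("test.xml") inside the same try/except would play the same role, and the ParseError message includes the line and column of the first problem.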

Then, to read it with one row per item and explode the batters nested under venue:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.config(
    "spark.jars.packages", "com.databricks:spark-xml_2.12:0.13.0"
).getOrCreate()

df = (
    spark.read.format("xml")
    .option("rowTag", "item")
    .option("valueTag", True)
    .load("test.xml")
)
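df.printSchema()

# One row per <batter> under <venue>; explode_outer keeps items that have
# no venue (such as pizza) as a single row with null in the new column.
df = df.withColumn(
    "venue_batters",
    F.explode_outer(F.col("venue.batters.batter")),
)
df.show()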

Result:

+-----+----+--------------------+--------------------+-----------------+
|_name|_ppu|             batters|               venue|    venue_batters|
+-----+----+--------------------+--------------------+-----------------+
| Cake|0.55|                null|{Friday, Bangalor...|  {Regular, null}|
| Cake|0.55|                null|{Friday, Bangalor...|{Chocolate, null}|
| Cake|0.55|                null|{Friday, Bangalor...|{Blueberry, null}|
|pizza|0.56|{{Regular, Bangal...|                null|             null|
+-----+----+--------------------+--------------------+-----------------+

Thank you, Vlad. How do I flatten the batters column into multiple rows? That is where I am stuck.
@JimMacaulay You're welcome. I've updated the answer with an example of how to explode batters in venue; it's the same for batters in item (without venue). I hope this helps.
