
I am trying to select a nested ArrayType column from a PySpark DataFrame.

I want to select only the items column out of this DataFrame. I don't know what I am doing wrong here.

XML:

<?xml version="1.0" encoding="utf-8"?>
<shiporder orderid="str1234">
  <orderperson>ABC</orderperson>
  <shipto>
    <name>XYZ</name>
    <address>305, Ram CHowk</address>
    <city>Pune</city>
    <country>IN</country>
  </shipto>
  <items>
  <item>
    <title>Clothing</title>
    <notes>
        <note>Brand:CK</note>
        <note>Size:L</note>
    </notes>
    <quantity>6</quantity>
    <price>208</price>
  </item>
  </items>
</shiporder>

Schema of the DataFrame:

root
 |-- _orderid: string (nullable = true)
 |-- items: struct (nullable = true)
 |    |-- item: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- notes: struct (nullable = true)
 |    |    |    |    |-- note: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- quantity: long (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |-- orderperson: string (nullable = true)
 |-- shipto: struct (nullable = true)
 |    |-- address: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- name: string (nullable = true)

df.show(truncate=False)
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|_orderid|items                                                                                        |orderperson  |shipto                         |
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|str1234 |[[[[[color:Brown, Size:12]], 82.0, 1, Footwear], [[[Brand:CK, Size:L]], 208.0, 6, Clothing]]]|Vikrant Chand|[305, Giotto, Irvine, US, Amit]|
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+

When I select the items column, it returns null.

df.select([ 'items']).show()
+-----+
|items|
+-----+
| null|
+-----+

While selecting the same column together with shipto (another nested column) returns the values:

df.select([ 'items','shipto']).show()
+--------------------+--------------------+
|               items|              shipto|
+--------------------+--------------------+
|[[[[[color:Brown,...|[305, Giotto, Irv...|
+--------------------+--------------------+
  • Just use df.select('items').show() without the square brackets. Commented Jun 15, 2018 at 10:20
  • @RameshMaharjan I have tried. No luck. Commented Jun 15, 2018 at 16:41
  • I tried both ways and it works for me, so I can't say what's wrong. Commented Jun 15, 2018 at 16:54
  • I have updated the question with the XML dataset; can you please try with the same? I am not able to see the column value using PySpark or Scala Spark. Commented Jun 15, 2018 at 21:53
  • Fixed it by upgrading the spark-xml version to 0.4.1. Commented Jun 16, 2018 at 0:28

1 Answer

This was a bug in spark-xml, which was fixed in version 0.4.1.
