
I am trying to select a nested ArrayType column from a PySpark DataFrame.

I want to select only the items column out of this DataFrame. I don't know what I am doing wrong here.

XML:

<?xml version="1.0" encoding="utf-8"?>
<shiporder orderid="str1234">
  <orderperson>ABC</orderperson>
  <shipto>
    <name>XYZ</name>
    <address>305, Ram CHowk</address>
    <city>Pune</city>
    <country>IN</country>
  </shipto>
  <items>
  <item>
    <title>Clothing</title>
    <notes>
        <note>Brand:CK</note>
        <note>Size:L</note>
    </notes>
    <quantity>6</quantity>
    <price>208</price>
  </item>
  </items>
</shiporder>

Schema of the DataFrame:

root
 |-- _orderid: string (nullable = true)
 |-- items: struct (nullable = true)
 |    |-- item: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- notes: struct (nullable = true)
 |    |    |    |    |-- note: array (nullable = true)
 |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |-- price: double (nullable = true)
 |    |    |    |-- quantity: long (nullable = true)
 |    |    |    |-- title: string (nullable = true)
 |-- orderperson: string (nullable = true)
 |-- shipto: struct (nullable = true)
 |    |-- address: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- country: string (nullable = true)
 |    |-- name: string (nullable = true)

df.show(truncate=False)
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|_orderid|items                                                                                        |orderperson  |shipto                         |
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+
|str1234 |[[[[[color:Brown, Size:12]], 82.0, 1, Footwear], [[[Brand:CK, Size:L]], 208.0, 6, Clothing]]]|Vikrant Chand|[305, Giotto, Irvine, US, Amit]|
+--------+---------------------------------------------------------------------------------------------+-------------+-------------------------------+

When I select the items column, it returns null.

df.select([ 'items']).show()
+-----+
|items|
+-----+
| null|
+-----+

While selecting the same column together with shipto (another nested column) returns the values:

df.select([ 'items','shipto']).show()
+--------------------+--------------------+
|               items|              shipto|
+--------------------+--------------------+
|[[[[[color:Brown,...|[305, Giotto, Irv...|
+--------------------+--------------------+
  • Just use df.select('items').show() without the square brackets. Commented Jun 15, 2018 at 10:20
  • @RameshMaharjan I have tried. No luck. Commented Jun 15, 2018 at 16:41
  • I tried both ways and it works for me, so I can't say what's wrong. Commented Jun 15, 2018 at 16:54
  • I have updated the question with the XML dataset; can you please try with the same? I am not able to see the column value using PySpark or Scala Spark. Commented Jun 15, 2018 at 21:53
  • Fixed it by upgrading the spark-xml version to 0.4.1. Commented Jun 16, 2018 at 0:28

1 Answer

This was a bug in spark-xml, which was fixed in version 0.4.1.
