Accessing nested columns in a PySpark DataFrame

Question

I have an XML document that looks like this:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Position>
    <Search>
        <Location>
            <Region>OH</Region>
            <Country>us</Country>
            <Longitude>-816071</Longitude>
            <Latitude>415051</Latitude>
        </Location>
    </Search>
</Position>

I read it into a dataframe:

df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='Position').load('1.xml')

I can see 1 column:

df.columns
['Search']

print(df.select("Search"))
DataFrame[Search: struct<Location:struct<Country:string,Latitude:bigint,Longitude:bigint,Region:string>>]

How do I access the nested columns, e.g. Location.Region?

Can you post a sample row of the dataframe that you get?

– Gaurav Dhama, Commented Feb 15, 2017 at 4:22
This was very helpful, thank you.

– lakshmi, Commented Feb 8, 2018 at 20:42

Prasad Khode · Accepted Answer · 2017-02-15 05:59:32Z

14

You can reach into the struct with a dotted path; a trailing * expands every field of Location into its own column:

df.select("Search.Location.*").show()

Output:

+-------+--------+---------+------+
|Country|Latitude|Longitude|Region|
+-------+--------+---------+------+
|     us|  415051|  -816071|    OH|
+-------+--------+---------+------+

edited Feb 15, 2017 at 5:59

answered Feb 15, 2017 at 5:51

