
I have a DataFrame with two columns, and calling df.collect() on it returns the array below.

array = [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]

Now I want to get an output array like the one below.

new_array = ['Alice', 'Bob']

Could anyone please tell me how to extract the above output using PySpark? Any help would be appreciated.

Thanks

2 Answers

# Create the base DataFrame.
# Assumes sqlContext is an existing SQLContext (the PySpark shell provides one).
values = [('Alice', 10), ('Bob', 15)]
df = sqlContext.createDataFrame(values, ['name', 'age'])
df.show()
    +-----+---+
    | name|age|
    +-----+---+
    |Alice| 10|
    |  Bob| 15|
    +-----+---+

df.collect()
    [Row(name='Alice', age=10), Row(name='Bob', age=15)]

# Use a list comprehension to pull the name field out of each Row.
new_list = [row.name for row in df.collect()]
print(new_list)
    ['Alice', 'Bob']

3 Comments

Thanks for your reply. When I call df.collect(), I get an array like [Row(name=u'Alice', age=10), Row(name=u'Bob', age=15)]. So when I use row.name, I get [u'Alice', u'Bob'] instead of ['Alice', 'Bob'].
It's the same thing. Don't worry, all good. The u prefix has no effect on the data; it is just the explicit representation of a unicode object (as opposed to a byte string).
Oh, OK. Thank you.
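
The point made in the comments above, that the u prefix does not change the data, can be checked in plain Python without Spark. This is only a small sketch; the variable names are made up for illustration.

```python
# In Python 3 every str is Unicode, so the u prefix is accepted but redundant.
names = [u'Alice', u'Bob']

# The prefix only marks the literal as Unicode; the values compare equal
# to the same strings written without it.
same_values = names == ['Alice', 'Bob']
print(same_values)

# On Python 2 the list repr shows the prefix, but printing each element
# shows only the text itself.
for name in names:
    print(name)
```

The prefix is therefore a display detail of how Python 2 prints a list of unicode strings, not something that needs to be stripped from the values.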

I see two columns, name and age, in the df. You want only the name column to be displayed.

You can select it like:

df.select("name").show()

This will show you only the names.

Tip: You can also use df.show() instead of df.collect(). That displays the data in tabular form instead of as Row(...) objects.
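
Note that show() only prints the column; it does not return a Python list. To get the list the question asks for, one possible approach is to select the column first and then collect it. The helper name column_values below is hypothetical, not part of PySpark, and the sketch assumes df is the DataFrame from the question.

```python
def column_values(df, column):
    """Return the values of one DataFrame column as a plain Python list."""
    # select() keeps only the requested column, so collect() brings back
    # single-field Rows instead of every column.
    rows = df.select(column).collect()
    # A Row supports lookup by field name, so row[column] yields the value.
    return [row[column] for row in rows]

# Usage, assuming df is the DataFrame from the question:
# new_array = column_values(df, "name")
```

Selecting before collecting means only the name column is transferred to the driver, which may matter when the DataFrame has many columns. collect() still pulls every row into driver memory, so this suits small results.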
