PySpark create new column from existing column with a list of values

Question

I've got a DataFrame like this:

from pyspark.sql import SparkSession
from pyspark import Row

spark = SparkSession.builder \
    .appName('DataFrame') \
    .master('local[*]') \
    .getOrCreate()

df = spark.createDataFrame([Row(a=1, b='', c=['0', '1'], d='foo'),
                            Row(a=2, b='', c=['0', '1'], d='bar'),
                            Row(a=3, b='', c=['0', '1'], d='foo')])

+---+---+------+---+
|  a|  b|     c|  d|
+---+---+------+---+
|  1|   |[0, 1]|foo|
|  2|   |[0, 1]|bar|
|  3|   |[0, 1]|foo|
+---+---+------+---+

I would like to create a column "e" holding the first element of column "c" and a column "f" holding the second element of column "c", so that the result looks like this:

|a  |b  |c     |d  |e  |f  |
+---+---+------+---+---+---+
|1  |   |[0, 1]|foo|0  |1  |
|2  |   |[0, 1]|bar|0  |1  |
|3  |   |[0, 1]|foo|0  |1  |
+---+---+------+---+---+---+

Possible duplicate of How to extract an element from a array in pyspark
– pault, Commented Aug 22, 2019 at 14:22

Pierre Gourseaud · Accepted Answer · 2019-08-22 08:48:16Z


df = spark.createDataFrame([Row(a=1, b='', c=['0', '1'], d='foo'),
                            Row(a=2, b='', c=['0', '1'], d='bar'),
                            Row(a=3, b='', c=['0', '1'], d='foo')])

df2 = df.withColumn('e', df['c'][0]).withColumn('f', df['c'][1])
df2.show()

+---+---+------+---+---+---+
|a  |b  |c     |d  |e  |f  |
+---+---+------+---+---+---+
|1  |   |[0, 1]|foo|0  |1  |
|2  |   |[0, 1]|bar|0  |1  |
|3  |   |[0, 1]|foo|0  |1  |
+---+---+------+---+---+---+

edited Aug 22, 2019 at 8:48

answered Aug 22, 2019 at 8:39

