PySpark - Adding a Column from a list of values

Question

I have to add a column to a PySpark dataframe based on a list of values.

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")], ["Animal", "Enemy"])

I have a list called rating, which is a rating of each pet.

rating = [5,4,1]

I need to append the dataframe with a column called Rating, such that

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+

I have done the following, but it returns only the first value of the list in the Rating column:

def add_labels():
    return rating.pop(0)

labels_udf = udf(add_labels, IntegerType())

new_df = a.withColumn('Rating', labels_udf()).cache()

out:

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     5|
| Mouse|  Cat|     5|
+------+-----+------+

Zoe - Save the data dump · Accepted Answer · 2021-11-14 10:38:34Z

29

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql import Window

#sample data
a= sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                               ["Animal", "Enemy"])
a.show()

#convert list to a dataframe
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])

#add 'sequential' index and join both dataframe to get the final result
a = a.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
b = b.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))

final_df = a.join(b, a.row_idx == b.row_idx).\
             drop("row_idx")
final_df.show()

Input:

+------+-----+
|Animal|Enemy|
+------+-----+
|   Dog|  Cat|
|   Cat|  Dog|
| Mouse|  Cat|
+------+-----+

Output is:

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Cat|  Dog|     4|
|   Dog|  Cat|     5|
| Mouse|  Cat|     1|
+------+-----+------+

edited Nov 14, 2021 at 10:38

Zoe - Save the data dump

answered Jan 9, 2018 at 18:33

Prem

3 Comments

Ankit Agrawal Over a year ago

@IgorS I agree that "the generated ID is guaranteed to be monotonically increasing and unique", but the answer cannot give inconsistent results, because it does not use monotonically_increasing_id() to compare rows directly; it uses it only as the ordering from which row_number() generates consecutive row numbers starting from 1.

user3322581 Over a year ago

This worked for me. Creating the dataframe from the rating list can be simplified to: df = spark.createDataFrame(rating, IntegerType())

The Singularity Over a year ago

Tried this answer on a larger dataframe and the columns got mismatched.

mkaran · Accepted Answer · 2018-01-09 11:41:24Z

As mentioned by @Tw UxTLi51Nus, if you can order the DataFrame, let's say, by Animal, without this changing your results, you can then do the following:

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

def add_labels(indx):
    return rating[indx - 1]  # row_number() starts at 1

labels_udf = udf(add_labels, IntegerType())

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],["Animal", "Enemy"])
a.createOrReplaceTempView('a')
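# Caution: by default Spark SQL reads "Animal" (double quotes) as a string literal, so this
# orders by a constant; use a bare or backticked name (order by Animal) to sort by the column.
a = spark.sql('select row_number() over (order by "Animal") as num, * from a')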

a.show()


+---+------+-----+
|num|Animal|Enemy|
+---+------+-----+
|  1|   Dog|  Cat|
|  2|   Cat|  Dog|
|  3| Mouse|  Cat|
+---+------+-----+

new_df = a.withColumn('Rating', labels_udf('num'))
new_df.show()
+---+------+-----+------+
|num|Animal|Enemy|Rating|
+---+------+-----+------+
|  1|   Dog|  Cat|     5|
|  2|   Cat|  Dog|     4|
|  3| Mouse|  Cat|     1|
+---+------+-----+------+

And then drop the num column:

new_df.drop('num').show()
+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+

Edit:

Another - but perhaps ugly and a bit inefficient - way, if you cannot sort by a column, is to go back to rdd and do the following:

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],["Animal", "Enemy"])

# or create the rdd from the start:
# a = spark.sparkContext.parallelize([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")])

a = a.rdd.zipWithIndex()
a = a.toDF()
a.show()

+-----------+---+
|         _1| _2|
+-----------+---+
|  [Dog,Cat]|  0|
|  [Cat,Dog]|  1|
|[Mouse,Cat]|  2|
+-----------+---+

a = a.select(a._1.getItem('Animal').alias('Animal'), a._1.getItem('Enemy').alias('Enemy'), a._2.alias('num'))

def add_labels(indx):
    return rating[indx] # indx here will start from zero

labels_udf = udf(add_labels, IntegerType())

new_df = a.withColumn('Rating', labels_udf('num'))

new_df.show()

+---------+--------+---+------+
|Animal   |   Enemy|num|Rating|
+---------+--------+---+------+
|      Dog|     Cat|  0|     5|
|      Cat|     Dog|  1|     4|
|    Mouse|     Cat|  2|     1|
+---------+--------+---+------+

(I would not recommend this if you have much data)

Hope this helps, good luck!

Biggus · Accepted Answer · 2019-05-05 17:16:22Z

I might be wrong, but I believe the accepted answer will not work. monotonically_increasing_id only guarantees that the IDs are unique and increasing, not that they are consecutive. Hence, using it on two different dataframes will likely create two very different columns, and a join on them will return mostly empty results.
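To see why the raw IDs are not consecutive, here is a minimal pure-Python sketch of the layout described in the Spark documentation for monotonically_increasing_id (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The helper names simulated_monotonic_ids and ranks, and the partition sizes, are made up for illustration; no Spark session is involved.

```python
def simulated_monotonic_ids(rows_per_partition):
    """Model the IDs for a dataframe whose partitions hold the given row counts."""
    ids = []
    for partition_id, n_rows in enumerate(rows_per_partition):
        for offset in range(n_rows):
            # partition ID in the upper bits, row offset in the lower 33 bits
            ids.append((partition_id << 33) + offset)
    return ids


def ranks(ids):
    """Model row_number() over an ordering by the ID: 1-based rank of each ID."""
    ordered = sorted(ids)
    return [ordered.index(i) + 1 for i in ids]


# Two hypothetical 3-row dataframes that happen to be partitioned differently.
ids_a = simulated_monotonic_ids([2, 1])     # 2 partitions
ids_b = simulated_monotonic_ids([1, 1, 1])  # 3 partitions

ranks_a = ranks(ids_a)
ranks_b = ranks(ids_b)

print(ids_a, ranks_a)
print(ids_b, ranks_b)
```

In this model the raw IDs of the two dataframes only partly overlap, so a join on them would drop rows, whereas the ranks are 1, 2, 3 on both sides.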

Taking inspiration from this answer to a similar question, https://stackoverflow.com/a/48211877/7225303, we could change the incorrect answer to:

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

a= sqlContext.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],
                               ["Animal", "Enemy"])

a.show()

+------+-----+
|Animal|Enemy|
+------+-----+
|   Dog|  Cat|
|   Cat|  Dog|
| Mouse|  Cat|
+------+-----+



#convert list to a dataframe
rating = [5,4,1]
b = sqlContext.createDataFrame([(l,) for l in rating], ['Rating'])
b.show()

+------+
|Rating|
+------+
|     5|
|     4|
|     1|
+------+


a = a.withColumn("idx", F.monotonically_increasing_id())
b = b.withColumn("idx", F.monotonically_increasing_id())

windowSpec = W.orderBy("idx")
a = a.withColumn("idx", F.row_number().over(windowSpec))
b = b.withColumn("idx", F.row_number().over(windowSpec))

a.show()
+------+-----+---+
|Animal|Enemy|idx|
+------+-----+---+
|   Dog|  Cat|  1|
|   Cat|  Dog|  2|
| Mouse|  Cat|  3|
+------+-----+---+

b.show()
+------+---+
|Rating|idx|
+------+---+
|     5|  1|
|     4|  2|
|     1|  3|
+------+---+

final_df = a.join(b, a.idx == b.idx).drop("idx")
final_df.show()

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+

This works, but a window without a partitionBy moves all rows into a single partition, which can cause serious performance degradation on a large dataset.

Anahcolus · Accepted Answer · 2018-01-09 17:21:59Z

You can convert your rating list into an RDD:

rating = [5,4,1]
ratingrdd = sc.parallelize(rating)

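Then convert your dataframe to an RDD, attach each value of ratingrdd to the corresponding row using zip, and convert the zipped RDD back to a dataframe. Note that zip requires both RDDs to have the same number of partitions and the same number of elements in each partition.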

sqlContext.createDataFrame(a.rdd.zip(ratingrdd).map(lambda x: (x[0][0], x[0][1], x[1])), ["Animal", "Enemy", "Rating"]).show()

It should give you

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+

akoeltringer · Accepted Answer · 2018-01-09 09:06:37Z

1

What you are trying to do does not work, because the rating list is in your driver's memory, whereas the a dataframe is in the executor's memory (the udf works on the executors too).
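As a rough pure-Python model of this (not Spark itself), the sketch below serializes the list once and lets each of three hypothetical tasks deserialize its own copy, which is approximately what happens when a UDF's captured variables are shipped to the executors. The function run_task and the one-task-per-row assumption are illustrative only.

```python
import pickle

rating = [5, 4, 1]


def run_task(serialized_state):
    # Each task works on its own private copy of the captured list,
    # so popping from it never affects the driver or the other tasks.
    local_rating = pickle.loads(serialized_state)
    return local_rating.pop(0)


shipped = pickle.dumps(rating)                   # the driver serializes the list once
results = [run_task(shipped) for _ in range(3)]  # assumed: one task per row

print(results)  # [5, 5, 5]
print(rating)   # [5, 4, 1] (the driver's list is untouched)
```

Under that assumption every task pops the first element of a fresh copy, which matches the column of 5s shown in the question.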

What you need to do is add the keys to the ratings list, like so:

ratings = [('Dog', 5), ('Cat', 4), ('Mouse', 1)]

Then you create a ratings dataframe from the list and join both to get the new column added:

ratings_df = spark.createDataFrame(ratings, ['Animal', 'Rating'])
new_df = a.join(ratings_df, 'Animal')

answered Jan 9, 2018 at 9:06

akoeltringer

2 Comments

Bryce Ramgovind Over a year ago

The problem is that I can't put in a key. It's basically indexed in a specific order.

akoeltringer Over a year ago

Spark Dataframes do not guarantee a specific ordering, unless you call orderBy on it. So if you know the ordering of both the a dataframe and the ratings dataframe, you can come up with a key to combine them. If you do not know this, there is no way to combine the two dataframes...

bhargav3vedi · Accepted Answer · 2022-04-05 04:47:03Z

We can add a new column to a pandas DataFrame, and PySpark provides functions to convert between Spark and pandas DataFrames.

test_spark_df = spark.createDataFrame([(1,'A'), (2, 'B'), (3, 'C')], schema=['id', 'name'])
test_spark_df.show()

+---+----+
| id|name|
+---+----+
|  1|   A|
|  2|   B|
|  3|   C|
+---+----+

Convert this Spark DataFrame to a pandas DataFrame and add the column:

new_pandas_df = test_spark_df.toPandas()
new_pandas_df['gender'] = ['M', 'F', 'M']
new_pandas_df

    id  name  gender
0   1   A     M
1   2   B     F
2   3   C     M

Convert the pandas DataFrame back to a Spark DataFrame:

converted_spark_df = spark.createDataFrame(new_pandas_df)
converted_spark_df.show()

+---+----+------+
| id|name|gender|
+---+----+------+
|  1|   A|     M|
|  2|   B|     F|
|  3|   C|     M|
+---+----+------+

Converting between PySpark and pandas is inefficient, because all the data has to be collected from the worker nodes onto the driver (master) node. So this solution works for this problem, but it does not scale.

Jonathan Sales · Accepted Answer · 2021-10-27 13:32:03Z

Following the initial idea of using udf, you can do the following:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def add_labels(idx):
    lista = [5,4,1]
    return lista[idx]

a = spark.createDataFrame([("Dog", "Cat"), ("Cat", "Dog"), ("Mouse", "Cat")],["Animal", "Enemy"])
# Caution: the ids are only guaranteed to be 0, 1, 2, ... when all rows are in one partition
a = a.withColumn("idx", F.monotonically_increasing_id())
a.show()

+------+-----+---+
|Animal|Enemy|idx|
+------+-----+---+
|   Dog|  Cat|  0|
|   Cat|  Dog|  1|
| Mouse|  Cat|  2|
+------+-----+---+

labels_udf = F.udf(add_labels, IntegerType())
new_df = a.withColumn('Rating', labels_udf(F.col('idx'))).drop('idx')
new_df.show()

+------+-----+------+
|Animal|Enemy|Rating|
+------+-----+------+
|   Dog|  Cat|     5|
|   Cat|  Dog|     4|
| Mouse|  Cat|     1|
+------+-----+------+
