Pyspark: sum column values

Question

I have this RDD (showing two elements):

[['a', [1, 2]], ['b', [3, 0]]]

and I'd like to add up the list elements by index, so that the final result is

[4, 2]

How would I achieve this? I know the first element ('a'/'b') is irrelevant, since I could strip it out with a map, so the question becomes how to sum column values.

the.malkolm · Accepted Answer · 2016-03-04 16:26:19Z

2

$ pyspark
>>> x = [['a', [1, 2]], ['b', [3, 0]]]
>>> rdd = sc.parallelize(x)
>>> # Drop the key, then add the remaining lists element by element.
>>> rdd.map(lambda row: row[1]).reduce(lambda a, b: [sum(pair) for pair in zip(a, b)])
[4, 2]
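
As a rough illustration of what the reducer does, here is a sketch in plain Python, with `functools.reduce` standing in for `rdd.reduce` so it runs without a Spark session. The name `add_columns` is made up for this example and is not part of the answer.

```python
from functools import reduce

def add_columns(left, right):
    # zip pairs up the values at the same index; sum adds each pair.
    return [sum(pair) for pair in zip(left, right)]

data = [['a', [1, 2]], ['b', [3, 0]]]
rows = [row[1] for row in data]      # same as the map step: keep only the lists
totals = reduce(add_columns, rows)   # fold the rows together pairwise
print(totals)  # [4, 2]
```

Element-wise addition is associative and commutative, which is what `reduce` on an RDD needs, since partitions may be combined in any order. One caveat: `zip` stops at the shortest input, so rows of unequal length would be silently truncated.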


Jan van der Vegt · Accepted Answer · 2016-03-04 12:48:48Z

1

You can strip the keys with a map, as you said, and then reduce your RDD as follows (this assumes exactly 2 columns):

myRDD.reduce(lambda x, y: [x[0] + y[0], x[1] + y[1]])

This gives you the sum of each column.
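
The two-column assumption matters: with wider rows, the fixed-index lambda quietly ignores everything past index 1. A small plain-Python sketch (using `functools.reduce` in place of `myRDD.reduce`, with made-up three-column data) shows the difference next to a width-independent reducer:

```python
from functools import reduce

# Hypothetical rows with three columns, keys already stripped.
rows = [[1, 2, 5], [3, 0, 7]]

# Fixed-width reducer from the answer: only indices 0 and 1 are summed.
fixed = reduce(lambda x, y: [x[0] + y[0], x[1] + y[1]], rows)

# Width-independent variant: add whatever columns the rows have.
general = reduce(lambda x, y: [a + b for a, b in zip(x, y)], rows)

print(fixed)    # [4, 2]
print(general)  # [4, 2, 12]
```

If the number of columns is known and small, the explicit version is arguably easier to read; otherwise the `zip` form avoids having to touch the lambda when the row width changes.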
