I have this RDD (showing two elements):

[['a', [1, 2]], ['b', [3, 0]]]

and I'd like to add up the elements of the lists by index, so that the final result is

[4, 2]

How would I achieve this? I know the first element ('a'/'b') is irrelevant, since I could strip it out with a map, so the question becomes how to sum the column values.

2 Answers

$ pyspark
>>> data = [['a', [1, 2]], ['b', [3, 0]]]
>>> rdd = sc.parallelize(data)
>>> # Drop the key, then sum the lists position by position
>>> rdd.map(lambda kv: kv[1]).reduce(lambda a, b: [sum(pair) for pair in zip(a, b)])
[4, 2]

You can strip the keys with a map, as you said, and then reduce the resulting RDD as follows (assuming exactly 2 columns):

myRDD.reduce(lambda x,y:[x[0]+y[0], x[1]+y[1]])

This returns a two-element list holding the sum of each column, which is [4, 2] for the example in the question.
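
If the number of columns is not fixed at 2, the same idea can be written without hard-coded indices. The following is only a minimal sketch: `add_columns` and `rows` are hypothetical names, and `functools.reduce` stands in for `RDD.reduce` so that it runs without a Spark installation. It assumes every row has the same length, since `zip` silently truncates to the shortest input.

```python
from functools import reduce


def add_columns(a, b):
    """Add two equal-length lists position by position."""
    return [x + y for x, y in zip(a, b)]


# Values only, as they would look after stripping the keys with a map.
rows = [[1, 2], [3, 0]]

# functools.reduce is used here in place of RDD.reduce (an assumption made
# so the sketch is runnable locally).
column_sums = reduce(add_columns, rows)
print(column_sums)  # [4, 2]
```

Because element-wise addition is associative and commutative, the same function should be usable as the argument to `reduce` on the key-stripped RDD, for example `myRDD.reduce(add_columns)`.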
