I have managed to preprocess my data in PySpark to get something like this:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'), (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'), (u'key2', u'4'), (u'key8', u'5'), (u'key9', u'6'), (u'key10', u'7')]
Now I need to filter based on these conditions:
1) Filter values associated with at least 2 keys (a sketch follows the expected output below).
Output - only those (k, v) pairs whose value is '1', '2', or '4' should be present, since those values are associated with at least 2 keys:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'), (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'), (u'key2', u'4')]
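A minimal PySpark RDD sketch of condition 1, assuming the preprocessed data sits in an RDD (the variable names pairs, step1 and so on are illustrative, not from the original code):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([
    (u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'),
    (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'),
    (u'key2', u'4'), (u'key8', u'5'), (u'key9', u'6'), (u'key10', u'7'),
])

# count the distinct keys seen for each value
value_key_counts = (pairs.distinct()
                         .map(lambda kv: (kv[1], 1))
                         .reduceByKey(lambda a, b: a + b))

# keep only the values that appear with at least 2 keys
good_values = set(value_key_counts.filter(lambda vc: vc[1] >= 2)
                                  .keys()
                                  .collect())

step1 = pairs.filter(lambda kv: kv[1] in good_values)

Collecting the qualifying values into a local set is fine here because the number of distinct values is small; for very large value sets a join against the counted RDD would avoid the collect.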
2) Filter keys associated with at least 2 values (again, see the sketch below).
Output - only those (k, v) pairs whose key is key1 or key2 should be present, since those keys are associated with at least 2 values:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key1', u'4'), (u'key2', u'4')]
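Condition 2 is the mirror image and, judging by the expected output, runs on the result of condition 1; step1 below is the RDD from the previous sketch:

# count the distinct values seen for each key
key_value_counts = (step1.distinct()
                         .map(lambda kv: (kv[0], 1))
                         .reduceByKey(lambda a, b: a + b))

# keep only the keys that appear with at least 2 values
good_keys = set(key_value_counts.filter(lambda kc: kc[1] >= 2)
                                .keys()
                                .collect())

step2 = step1.filter(lambda kv: kv[0] in good_keys)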
Any suggestions would be of great help.
Update: I used groupBy and a filter to keep the keys with multiple values:
[(u'key1', [u'1', u'2', u'4']), (u'key2', [u'1', u'4'])]
Now how do I split these (key, list(values)) pairs into individual (k, v) pairs to apply further transformations?
You can filter based on the size of the values (len >= 2) and then emit one (k, v) pair per value, e.g. key1 -> 1, 2, 4.
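A short sketch of that flattening with flatMap, assuming the grouped result is an RDD named grouped (sc as in the first sketch):

# grouped: [(u'key1', [u'1', u'2', u'4']), (u'key2', [u'1', u'4'])]
grouped = sc.parallelize([(u'key1', [u'1', u'2', u'4']),
                          (u'key2', [u'1', u'4'])])

# keep keys with at least 2 values, then emit one (k, v) pair per value
flattened = (grouped.filter(lambda kv: len(kv[1]) >= 2)
                    .flatMap(lambda kv: [(kv[0], v) for v in kv[1]]))
# -> [(u'key1', u'1'), (u'key1', u'2'), (u'key1', u'4'),
#     (u'key2', u'1'), (u'key2', u'4')]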