I have managed to preprocess my data in PySpark to get something like this:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'), (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'), (u'key2', u'4'), (u'key8', u'5'), (u'key9', u'6'), (u'key10', u'7')]
Now I need to filter based on these conditions:
1) Filter values associated with at least 2 keys (a sketch follows the expected output below).
Output - only those (k, v) pairs whose value is '1', '2', or '4' should be present, since those values are associated with at least 2 keys:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'), (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'), (u'key2', u'4')]
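A minimal PySpark RDD sketch of condition 1, assuming the preprocessed data sits in an RDD (the variable names pairs, step1 and so on are illustrative, not from the original code):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
pairs = sc.parallelize([
    (u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key3', u'2'),
    (u'key4', u'1'), (u'key1', u'4'), (u'key5', u'1'), (u'key6', u'2'),
    (u'key2', u'4'), (u'key8', u'5'), (u'key9', u'6'), (u'key10', u'7'),
])

# count the distinct keys seen for each value
value_key_counts = (pairs.distinct()
                         .map(lambda kv: (kv[1], 1))
                         .reduceByKey(lambda a, b: a + b))

# keep only the values that appear with at least 2 keys
good_values = set(value_key_counts.filter(lambda vc: vc[1] >= 2)
                                  .keys()
                                  .collect())

step1 = pairs.filter(lambda kv: kv[1] in good_values)

Collecting the qualifying values into a local set is fine here because the number of distinct values is small; for very large value sets a join against the counted RDD would avoid the collect.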
2) Filter keys associated with at least 2 values (again, see the sketch below).
Output - only those (k, v) pairs whose key is key1 or key2 should be present, since those keys are associated with at least 2 values:
[(u'key1', u'1'), (u'key2', u'1'), (u'key1', u'2'), (u'key1', u'4'), (u'key2', u'4')]
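Condition 2 is the mirror image and, judging by the expected output, runs on the result of condition 1; step1 below is the RDD from the previous sketch:

# count the distinct values seen for each key
key_value_counts = (step1.distinct()
                         .map(lambda kv: (kv[0], 1))
                         .reduceByKey(lambda a, b: a + b))

# keep only the keys that appear with at least 2 values
good_keys = set(key_value_counts.filter(lambda kc: kc[1] >= 2)
                                .keys()
                                .collect())

step2 = step1.filter(lambda kv: kv[0] in good_keys)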
Any suggestions would be of great help.
Update: I used groupBy and a filter to keep the keys with multiple values:
[(u'key1', [u'1', u'2', u'4']), (u'key2', [u'1', u'4'])]
Now how do I split these (key, list(values)) pairs into individual (k, v) pairs to apply further transformations?
You can filter based on the size of the values (len >= 2) and then emit one (k, v) pair per value, e.g. key1 -> 1, 2, 4.
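A short sketch of that flattening with flatMap, assuming the grouped result is an RDD named grouped (sc as in the first sketch):

# grouped: [(u'key1', [u'1', u'2', u'4']), (u'key2', [u'1', u'4'])]
grouped = sc.parallelize([(u'key1', [u'1', u'2', u'4']),
                          (u'key2', [u'1', u'4'])])

# keep keys with at least 2 values, then emit one (k, v) pair per value
flattened = (grouped.filter(lambda kv: len(kv[1]) >= 2)
                    .flatMap(lambda kv: [(kv[0], v) for v in kv[1]]))
# -> [(u'key1', u'1'), (u'key1', u'2'), (u'key1', u'4'),
#     (u'key2', u'1'), (u'key2', u'4')]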