
I have a dataframe whose columns contain string values.

col1|col2
---------
aaa |bbb
ccc |ddd
aaa |ddd
eee |fff

I have to get the number of allowed values ({aaa, ddd}) present in each column.

import pyspark.sql.functions as F

cond = "`col1` = 'aaa' OR `col1` = 'ddd'"
dataframe.where(F.expr(cond)).count()

This way we get the required counts. We loop through all columns and perform this operation on each column.
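
For reference, the loop looks roughly like the sketch below (the dict name counts and the f-string condition are illustrative, not from the original code); each iteration triggers a separate Spark job:

import pyspark.sql.functions as F

# One where + count job per column; this is what becomes slow with many columns.
counts = {}
for c in dataframe.columns:
    cond = f"`{c}` = 'aaa' OR `{c}` = 'ddd'"
    counts[c] = dataframe.where(F.expr(cond)).count()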

This approach takes hours when the number of columns increases to 2000.

Is there a better and faster approach that processes all columns in parallel?

1 Answer

One alternative is to use a list comprehension in Python to apply the same condition to all columns of the dataframe, so all counts are computed in a single aggregation job instead of one job per column:

import pyspark.sql.functions as F

ok_values = ['aaa', 'ddd']

# isin() yields a boolean per row; casting to integer (0/1) and summing
# gives the count of allowed values in each column, all in one pass.
dataframe = dataframe.select(
    *[F.sum(F.col(c).isin(ok_values).cast('integer')).alias(c) for c in dataframe.columns]
)

dataframe.show()
+----+----+
|col1|col2|
+----+----+
|   2|   2|
+----+----+
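
If a plain Python dict of per-column counts is needed afterwards, the single-row result can be collected on the driver; asDict() is the standard Row method, shown here as a small usage sketch:

counts = dataframe.first().asDict()
# {'col1': 2, 'col2': 2}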

2 Comments

Yes @Ric S. This gives better performance than the trivial loop and works well for simple operations like the isin() check and count. I have some complex operations in a function that performs a series of computations on the given column and returns the corresponding column's result in a dictionary. Can I use UDF functions in the above statement?
Hi @Nandha, I'm afraid I don't know how to answer that. I suggest you make some attempts, and if it does not work, create a new question since, as I understand it, it is a little different and more difficult than the one you posted here.
