
I have a dataframe whose columns contain string values.

col1|col2
---------
aaa |bbb
ccc |ddd
aaa |ddd
eee |fff

I have to get the number of allowed values ({aaa, ddd}) present in each column.

import pyspark.sql.functions as F

cond = "`col1` = 'aaa' OR `col1` = 'ddd'"
dataframe.where(F.expr(cond)).count()

This way we get the required counts. We loop through all columns and perform this operation on each column.
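
For reference, the loop looks roughly like the sketch below (the dict name counts and the f-string condition are illustrative, not from the original code); each iteration triggers a separate Spark job:

import pyspark.sql.functions as F

# One where + count job per column; this is what becomes slow with many columns.
counts = {}
for c in dataframe.columns:
    cond = f"`{c}` = 'aaa' OR `{c}` = 'ddd'"
    counts[c] = dataframe.where(F.expr(cond)).count()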

This approach takes hours when the number of columns increases to 2000.

Is there a better and faster approach that processes all columns in parallel?

1 Answer

One alternative is to use a list comprehension in Python to apply the same condition to all columns of the dataframe, so all counts are computed in a single aggregation job instead of one job per column:

import pyspark.sql.functions as F

ok_values = ['aaa', 'ddd']

# isin() yields a boolean per row; casting to integer (0/1) and summing
# gives the count of allowed values in each column, all in one pass.
dataframe = dataframe.select(
    *[F.sum(F.col(c).isin(ok_values).cast('integer')).alias(c) for c in dataframe.columns]
)

dataframe.show()
+----+----+
|col1|col2|
+----+----+
|   2|   2|
+----+----+
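
If a plain Python dict of per-column counts is needed afterwards, the single-row result can be collected on the driver; asDict() is the standard Row method, shown here as a small usage sketch:

counts = dataframe.first().asDict()
# {'col1': 2, 'col2': 2}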

2 Comments

Yes @Ric S. This gives better performance than the trivial loop and works well for simple operations like the isin() check and count. I have some complex operations in a function that performs a series of computations on the given column and returns the corresponding column's result in a dictionary. Can I use UDF functions in the above statement?
Hi @Nandha, I'm afraid I don't know how to answer that. I suggest you make some attempts, and if it does not work, create a new question since, as I understand it, it is a little different and more difficult than the one you posted here.
