Pyspark dataframe.limit is slow

Question

I am working with a large dataset, but I only want to experiment with a small part of it. Each operation takes a long time, so I want to look at just the head or a limit of the dataframe.

So, for example, I call a UDF (user defined function) to add a column, but I only care to do so on the first, say, 10 rows.

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
sum_cols = F.udf(lambda x: x[0] + x[1], IntegerType())
df_with_sum = df.limit(10).withColumn('C', sum_cols(F.array('A', 'B')))

However, this still seems to take as long as it would if I did not use limit.

Ali Yesilli · Accepted Answer · 2018-09-26 07:39:56Z

3

If you only need to work with the first 10 rows, I think it is better to create a new df from them and cache it:

df2 = df.limit(10).cache()
df_with_sum = df2.withColumn('C',sum_cols(F.array('A','B')))

Chandan Ray · Accepted Answer · 2018-09-26 11:47:27Z

3

limit will first try to get the required rows from a single partition. If that partition does not contain enough rows, it will get the remaining rows from the next partition.
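
The behaviour described above can be pictured with a small pure-Python toy model. This is only a sketch of the description in this answer, not Spark's actual implementation; the function name take_from_partitions and the sample partitions are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple


def take_from_partitions(
    partitions: Sequence[Iterable[int]], n: int
) -> Tuple[List[int], int]:
    """Collect up to n rows, visiting partitions in order.

    Returns the collected rows and the number of partitions touched.
    """
    rows: List[int] = []
    touched = 0
    for part in partitions:
        if len(rows) >= n:
            break  # enough rows already, later partitions are never read
        touched += 1
        for row in part:
            rows.append(row)
            if len(rows) == n:
                break
    return rows, touched


# Hypothetical data: three partitions of different sizes.
partitions = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(take_from_partitions(partitions, 2))  # ([1, 2], 1)
print(take_from_partitions(partitions, 5))  # ([1, 2, 3, 4, 5], 2)
```

In this model the cost of a limit depends on how many partitions must be touched before enough rows are found, which is why the partition layout is worth checking.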

So please check how many partitions you have by using df.rdd.getNumPartitions()

To verify this, I would suggest first coalescing your df to one partition and then doing a limit. You should see that this time limit is faster, as it’s taking data from a single partition.
