I am trying to work with a large dataset, but just play around with a small part of it. Each operation takes a long time, and I want to look at the head or limit of the dataframe.
So, for example, I call a UDF (user defined function) to add a column, but I only care to do so on the first, say, 10 rows.
sum_cols = F.udf(lambda x:x[0] + x[1], IntegerType())
df_with_sum = df.limit(10).withColumn('C',sum_cols(F.array('A','B')))
However, this still to take the same long time it would take if I did not use limit.