I have 5 steps which produce df_a, df_b, df_c, df_e and df_f. Each step generates a dataframe (df_a, for instance) and persists it as a parquet file, and that file is used in the subsequent step (to produce df_b, for example). Processing df_e and df_f took about 30 minutes.

However, when I started a new Spark session, read the df_c parquet back into a dataframe, and then processed df_e and df_f, it took less than a minute.

Was it because the data is stored in a more compact way when it is read back from the parquet file? Should I overwrite the dataframe with a spark.read of the file I just wrote to storage to improve performance?
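For reference, the pattern in question looks roughly like the sketch below (PySpark; the paths, column name, and transformation are placeholders, not the actual pipeline):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-then-read-back").getOrCreate()

# Earlier step: df_c was persisted as parquet, so the next step can
# start from the materialized files instead of the original lineage.
df_c = spark.read.parquet("/data/df_c.parquet")   # placeholder path

# Next step: derive df_e, write it out, then (the question) optionally
# overwrite the in-memory dataframe with a fresh read of what was written.
df_e = df_c.groupBy("some_key").count()           # placeholder transformation
df_e.write.mode("overwrite").parquet("/data/df_e.parquet")
df_e = spark.read.parquet("/data/df_e.parquet")   # the "overwrite with spark.read" idea
```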

1 Answer

Using the Parquet file format is definitely best from a processing point of view, but as you said, in your case:

  1. Why don't you convert the very first step to parquet (i.e. start by reading a parquet file itself in the very first step)?

  2. If storing the intermediate dataframe as parquet improves your performance, then maybe you can instead apply a filter to focus on the part of the data you need to read; this is as good as reading from the parquet file format, because storing the intermediate dataframe as parquet also takes time to write the data and read it back into Spark (see the sketch after this list).
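A minimal sketch of the second suggestion, assuming PySpark; the path, column names, and date threshold are placeholders, not from the original pipeline. The idea is to read only the columns and rows the next step needs, so the scan stays small without an extra write/read round trip.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("filtered-parquet-read").getOrCreate()

# Read the intermediate parquet, but select only the needed columns and
# apply a row filter that Spark can push down to the parquet scan.
df_c = (
    spark.read.parquet("/data/df_c.parquet")
         .select("id", "event_date", "amount")
         .filter(F.col("event_date") >= "2023-01-01")
)

# The downstream step then works on the reduced slice instead of the full dataset.
df_e = df_c.groupBy("id").agg(F.sum("amount").alias("total_amount"))
```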
