I have a pipeline of 5 steps that produce df_a, df_b, df_c, df_e, and df_f.
Each step generates a DataFrame (df_a, for instance) and persists it as a Parquet file.
That file is then used by the following step (the one that produces df_b, for example).
Processing df_e and df_f took about 30 minutes.
However, when I started a new Spark session, read the df_c Parquet file into a DataFrame, and then processed df_e and df_f, it took less than a minute.
Is this because a DataFrame read back from a Parquet file is stored in a more compact way? Should I overwrite each DataFrame with a spark.read of the file I have just written to storage, to improve performance?
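To make the question concrete, here is roughly the pattern I am asking about. This is only a sketch: the paths, column names, and transformations are placeholders, not my actual code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One step of the pipeline: build df_c from the previous step's output and persist it.
# Paths and columns below are made up for illustration.
df_b = spark.read.parquet("/data/step_b")
df_c = df_b.withColumn("total", F.col("qty") * F.col("price"))
df_c.write.mode("overwrite").parquet("/data/step_c")

# What I do today: the next step continues from the in-memory df_c,
# whose plan still carries the lineage of all the earlier steps.
df_e = df_c.groupBy("key").agg(F.sum("total").alias("sum_total"))

# What I am considering: overwrite df_c with a fresh read of the file
# I just wrote, so the next step's plan starts from a plain Parquet scan.
df_c = spark.read.parquet("/data/step_c")
df_e = df_c.groupBy("key").agg(F.sum("total").alias("sum_total"))
df_e.write.mode("overwrite").parquet("/data/step_e")
```

Is the second variant the recommended way to structure this kind of multi-step pipeline, or is there a better approach?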