This is a blog post about things I've learnt about optimizing data pipelines written in Python and pandas. As the size of the data grows, pipelines can become slow and clunky. Spending time on optimizing the pipeline can make it leaner and faster and allow you to scale your systems.
Optimization is fundamentally of two types: speed and memory.
Speed optimization
The first step in optimizing any code is to find the bottlenecks.
Benchmark the time spent in different parts of the pipeline to understand the most time-consuming steps. Is it reading files, certain data operations, writes, or operations that depend on the network?
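As a rough illustration, here is a minimal timing sketch; the stage names (load_data, transform, write_output) are hypothetical placeholders for your own pipeline steps.

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print how long it took, and return its result."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.2f}s")
    return result

# Hypothetical pipeline stages - substitute your own functions
# df = timed("read", load_data, "input.csv")
# df = timed("transform", transform, df)
# timed("write", write_output, df, "output.parquet")
```

A proper profiler (e.g. cProfile) gives a more detailed breakdown, but coarse timings per stage are often enough to decide where to focus.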
Once you identify the bottlenecks, i.e. the operations that take the most time, it makes sense to prioritize optimizing those. Some examples could be:
Reading files - read_csv() - one optimization for read_csv in pandas (v2.0 and greater) is to use the 'pyarrow' engine instead of the default. You can also pass dtype_backend='pyarrow' to store the data in pyarrow types rather than numpy - read more about pyarrow support here.
A further optimization for reading files is to use a more performant format like Parquet instead of CSV. Parquet files are often around 10x smaller and 10x faster to read.
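A minimal sketch of both ideas, assuming a file named data.csv and that the pyarrow package is installed (it is required for the 'pyarrow' engine and commonly used for Parquet I/O):

```python
import pandas as pd

# Faster CSV parsing with the pyarrow engine (pandas >= 2.0),
# storing the data in pyarrow-backed dtypes instead of numpy
df = pd.read_csv("data.csv", engine="pyarrow", dtype_backend="pyarrow")

# Convert once to Parquet, then read the Parquet file in subsequent runs
df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")
```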
Slow pandas operations - some pandas operations are slow and inefficient and can often be rewritten in a more performant way. Mostly these are the operations that iterate over the dataframe rows; the most common examples are df.apply and series.map. If possible, replace these with vectorized pandas operations. Things like np.where can be useful here.
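For instance, a row-wise apply can often be replaced with np.where; the column names here are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [3, 12, 7, 25]})

# Slow: calls a Python function once per row
df["bucket"] = df["value"].apply(lambda x: "high" if x > 10 else "low")

# Fast: vectorized - the comparison runs over the whole column at once
df["bucket"] = np.where(df["value"] > 10, "high", "low")
```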
Network / async operations - this is not just about pandas but general Python optimization: if there are operations that rely on files being transferred over the network, use async and threading so that the CPU doesn't sit around waiting for them to complete.
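One simple way to overlap the waiting is a thread pool; the URLs below are placeholders, and asyncio would be another option for the same idea:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [
    "https://example.com/a.csv",  # placeholder URLs
    "https://example.com/b.csv",
]

def download(url):
    # While one thread waits on the network, the others keep downloading
    with urlopen(url) as resp:
        return resp.read()

with ThreadPoolExecutor(max_workers=8) as pool:
    contents = list(pool.map(download, urls))
```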
Multiprocessing - once you've optimized the individual steps, the next optimization is parallel processing. If the data can be split into batches, we can process the batches in parallel using all the CPU cores. Since Python doesn't allow "true" multithreading of CPU-bound work because of the GIL, we can use the multiprocessing library.
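A minimal sketch using multiprocessing.Pool over row batches; process_batch is a stand-in for whatever CPU-heavy per-batch work your pipeline does:

```python
import multiprocessing as mp
import pandas as pd

def process_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a CPU-heavy transformation on one batch of rows
    return batch.assign(value_squared=batch["value"] ** 2)

if __name__ == "__main__":
    df = pd.DataFrame({"value": range(1_000_000)})

    # Split the rows into one batch per CPU core
    n_batches = mp.cpu_count()
    batch_size = -(-len(df) // n_batches)  # ceiling division
    batches = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]

    with mp.Pool() as pool:
        df = pd.concat(pool.map(process_batch, batches))
```

Each batch is pickled and sent to a worker process, so this pays off when the per-batch work is heavy compared to the cost of copying the data across processes.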
That brings speed optimizations to an end - the fundamental thing is to focus on the bottlenecks.
For example: if the pipeline consists of three operations A, B and C, where A takes 10 seconds, B takes 100 seconds and C takes 15 seconds (125 seconds in total), then a 50% reduction in B saves 50 seconds - roughly a 40% reduction in overall time - while a 90% reduction in A saves only 9 seconds, about a 7% reduction overall. So pick the parts to optimize wisely.
Memory Optimization
The second type of optimization is memory optimization. Here you need to keep an eye on object lifetimes: when data is loaded into memory, what is its lifecycle, and are copies being made? Pandas does optimize things under the hood with lazy copies, but reviewing your code to see if you are creating unnecessary copies can help. You can also use memory profilers to get an idea of this.
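As a starting point, you can inspect how much memory each column actually takes and shrink the biggest ones; the columns below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["delhi", "mumbai", "pune"] * 100_000,  # repetitive strings
    "count": [1, 2, 3] * 100_000,
})

# Per-column memory footprint (deep=True counts the actual string data)
print(df.memory_usage(deep=True))

# Repetitive string columns often shrink a lot as categoricals,
# and integers can usually be downcast to a smaller dtype
df["city"] = df["city"].astype("category")
df["count"] = pd.to_numeric(df["count"], downcast="integer")

print(df.memory_usage(deep=True))
```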
Additional Resources
Libraries like polars aim to be highly optimized from the ground up and can be useful alternatives to explore.
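For a taste of what that looks like, here is a small polars sketch; the file and column names are made up, and scan_parquet builds a lazy query that polars can optimize (for example, reading only the needed columns) before executing:

```python
import polars as pl

result = (
    pl.scan_parquet("data.parquet")   # lazy: nothing is read yet
    .filter(pl.col("value") > 10)     # predicate can be pushed down to the scan
    .group_by("city")
    .agg(pl.col("value").sum())
    .collect()                        # the optimized plan runs here
)
```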
The pandas docs have a tutorial on enhancing performance. Read it here
Leave a comment if you have any thoughts, questions or would like to share ways you've optimized your pipelines.