If you’re in data, you’re probably using Parquet. It’s not officially the standard, but good luck trying to convince anyone to use something else.
This post is meant to open the black box that is Parquet to see what exactly makes it so damn good. I'll give a breakdown of the internals and show you the optimizations that take you from a badly tuned file to a blazingly fast™ one.
What Even Is Parquet?
Parquet is a columnar storage format: it stores data by column, not by row. That's ideal for analytical queries like SUM(driver_pay) or WHERE trip_miles < 2, where you only need a few columns, not the entire row. Engines can skip the rest, making reads faster and more efficient.
A Parquet file is composed of several layers, each designed to improve performance and storage:
- Columnar layout: Values from the same column are stored together, enabling tight compression and efficient scans.
- Row groups: Horizontal partitions of the dataset. These are the unit of parallelism and skipping; each contains all columns for a batch of rows.
- Column chunks: Within each row group, data is organized by column.
- Pages: Each column chunk is divided into pages (typically ~8KB). These are the unit of encoding and compression, allowing for localized reads and decompression.
- Encodings: Within pages, data can be encoded (dictionary, run-length, bit-packing) to reduce size.
- Compression: Pages can be compressed (ZSTD, Snappy, etc.) for storage and I/O efficiency.
- Metadata: stores stats and min/max indexes so engines can skip row groups that don’t match the filter.
This layered structure makes Parquet efficient, but also means you have a lot of tuning knobs and plenty of ways to screw things up if you’re not careful.
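If you want to see those layers for yourself, PyArrow exposes all of this metadata. Here's a minimal sketch (the filename is a placeholder) that prints row group, column chunk, encoding, and statistics info for a file:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("trips.parquet")  # placeholder path
meta = pf.metadata

print("row groups:", meta.num_row_groups)
print("columns:   ", meta.num_columns)

rg = meta.row_group(0)    # first row group
col = rg.column(0)        # first column chunk inside it
print("rows in RG 0:", rg.num_rows)
print("column:      ", col.path_in_schema)
print("encodings:   ", col.encodings)
print("codec:       ", col.compression)
print("stats:       ", col.statistics)
```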
What I Tested
The original data comes from NYC's TLC trip records, merged into a 1.6GB uncompressed Parquet file. It's one of the files I used in Which Compression Saves the Most Storage $? (gzip, Snappy, LZ4, zstd).
I used PyArrow to generate multiple Parquet variants, each applying a small tweak (row group size, encoding, compression, or sorting) to isolate its effect.
Then I ran DuckDB queries to benchmark performance.
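Generating a variant boils down to a single write_table call with the relevant knobs. This is a rough sketch rather than the exact script; the paths are placeholders and the parameter values shown roughly correspond to a 03_Dict+Comp-style file:

```python
import pyarrow.parquet as pq

table = pq.read_table("fhvhv_tripdata.parquet")  # placeholder source file

pq.write_table(
    table,
    "03_dict_comp.parquet",
    row_group_size=1_000_000,   # ~1M rows per row group
    compression="zstd",         # or "none", "snappy", ...
    use_dictionary=True,        # dictionary encoding on/off
    data_page_version="1.0",    # "1.0" or "2.0"
)
```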
Environment:
- MacBook Pro (2021, M1 Pro, 16GB RAM)
Variant Overview
Here’s a quick snapshot of what changed in each file.
Variant | Size | RGs | Comp | Enc | Dict | Sorted |
---|---|---|---|---|---|---|
00_Worst | 7.40 GB | 60,304 | None | V1 | No | No |
01_Base | 6.83 GB | 60 | None | V1 | No | No |
02a_Dict | 1.67 GB | 60 | None | V1 | Yes | No |
02b_Comp | 1.82 GB | 60 | ZSTD | V1 | No | No |
03_Dict+Comp | 1.37 GB | 60 | ZSTD | V1 | Yes | No |
04a_OptNS | 1.38 GB | 60 | ZSTD | V2 | Yes | No |
04b_SortV1 | 1.88 GB | 60 | ZSTD | V1 | Yes | Yes |
05b_OptSV2 | 1.88 GB | 60 | ZSTD | V2 | Yes | Yes |
Legend
- Variant: Short ID for each Parquet file variant
- Size: Final file size on disk
- RGs: Number of row groups
- Comp: Compression codec used (ZSTD or None)
- Enc: Parquet data page version (V1 or V2)
- Dict: Whether dictionary encoding was enabled
- Sorted: Whether the file was sorted by PULocationID
How come the worst file ballooned to 7.40GB?
In the post about compression, the same uncompressed file was 1.6GB. What happened here?
I thought, damn, I must have written the uncompressed file wrong last time, invalidating the previous post. But I wasn't wrong.
The 02a_Dict file confirms that dictionary encoding is what keeps the size down: turning it off is what made the file bloat from 1.6GB to 7.4GB, and with it back on, the size drops back down even without compression. Interestingly, compression alone (02b_Comp) doesn't push the file size lower than dictionary encoding alone does.
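If you want to verify this on your own files, the column chunk metadata lists the encodings actually used. A quick sketch (the filename is a placeholder):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("02a_dict.parquet").metadata  # placeholder path
col = meta.row_group(0).column(0)
print(col.path_in_schema, col.encodings)
# a dictionary-encoded chunk reports something like
# ('PLAIN_DICTIONARY', 'PLAIN', 'RLE') or ('RLE_DICTIONARY', ...)
```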
How the Internals Affected Query Performance
I ran seven queries on each file variant: filters, projections, aggregations, and full row reads.
BENCHMARK_QUERIES = {
"1. Count with selective filter (predicate pushdown on numeric 'trip_miles')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE trip_miles < 2.0;",
"2. Projection of specific columns (column pruning 'pickup_datetime', 'dropoff_datetime')": "SELECT pickup_datetime, dropoff_datetime FROM read_parquet('{file}') LIMIT 1000;",
"3. Aggregation on filtered data (mixed types, computation)": "SELECT AVG(trip_miles), SUM(base_passenger_fare) FROM read_parquet('{file}') WHERE tips > 0.0 AND trip_time > 600;",
"4. Filter on low-cardinality string (dictionary encoding potential 'hvfhs_license_num')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE hvfhs_license_num = 'HV0003';",
"5. Full scan and sum of one numeric column (I/O and decompression 'driver_pay')": "SELECT SUM(driver_pay) FROM read_parquet('{file}');",
"6. Count with filter on an integer ID (predicate pushdown 'PULocationID')": "SELECT COUNT(*) FROM read_parquet('{file}') WHERE PULocationID = 148;",
"7. Full row reconstruction (worst-case for column store, limited)": "SELECT * FROM read_parquet('{file}') LIMIT 10;",
}
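The exact benchmark runner isn't shown here, but it amounts to formatting each query with the file path and timing it in DuckDB, roughly like this sketch:

```python
import time
import duckdb

def run_benchmarks(file_path: str) -> dict:
    """Time every benchmark query against one Parquet variant."""
    con = duckdb.connect()
    results = {}
    for name, sql in BENCHMARK_QUERIES.items():
        start = time.perf_counter()
        con.execute(sql.format(file=file_path)).fetchall()
        results[name] = time.perf_counter() - start
    return results

print(run_benchmarks("04a_OptNS.parquet"))  # placeholder path
```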
Row Group Size Changed Everything
Just fixing the row group size and going from 60K tiny chunks to 60 made the biggest difference in performance.
In the baseline, a simple COUNT(*) WHERE trip_miles < 2 took 3.662s. With improved row groups? 0.124s. Same data, same engine, nearly 30x faster.
Small row groups mean a ton of metadata scanning and disk seeks. Fewer, larger row groups = more I/O locality and less overhead.
Impact of Larger Row Groups (4.37M vs. 1M Rows per RG)
The official Parquet documentation recommends large row groups (512MB - 1GB of uncompressed data). In our case that would have been around 14 row groups, but after trying it in practice I noticed something interesting: keeping everything else the same, the file size increased. This seems to be caused by dictionary encoding; larger row groups can contain more unique values per column, which results in larger dictionaries.
Variant | Rows per RG | Row Groups | File Size |
---|---|---|---|
04a_OptNS | ~1M (1,018,964) | 60 | 1375.28 MB |
04a_OptNS_LargeRG | ~4.37M (4,370,585) | 14 | 1444.29 MB |
What about performance?
Query Type | 04a_OptNS (s) | 04a_OptNS_LargeRG (s) | % Change |
---|---|---|---|
1. Count with selective filter (numeric) | 0.1633 | 0.1664 | +1.9% |
2. Projection of specific columns | 0.0055 | 0.0070 | +27.3% |
3. Aggregation on filtered data | 0.3541 | 0.3680 | +3.9% |
4. Filter on low-cardinality string | 0.0934 | 0.1018 | +9.0% |
5. Full scan and sum of one numeric column | 0.0883 | 0.0918 | +4.0% |
6. Count with filter on an integer ID | 0.0718 | 0.0788 | +9.7% |
7. Full row reconstruction (worst-case for column store) | 0.0298 | 0.0302 | +1.3% |
In my opinion, the slower performance was mostly due to the CPU having to do more work to decompress a larger block in one go.
This highlights a critical point: while a general recommendation for row group size exists, the optimal size is often workload and data dependent. In my case I kept row groups at ~1M rows.
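If you want to find the sweet spot for your own data, a small sweep over candidate sizes against a representative query is usually enough. A sketch (the sizes, query, and paths are illustrative):

```python
import os
import time
import duckdb
import pyarrow.parquet as pq

table = pq.read_table("fhvhv_tripdata.parquet")  # placeholder source file
query = "SELECT COUNT(*) FROM read_parquet('{file}') WHERE trip_miles < 2.0;"

for rows_per_rg in (100_000, 1_000_000, 4_000_000):
    out = f"rg_{rows_per_rg}.parquet"
    pq.write_table(table, out, row_group_size=rows_per_rg,
                   compression="zstd", use_dictionary=True)
    start = time.perf_counter()
    duckdb.sql(query.format(file=out)).fetchall()
    elapsed = time.perf_counter() - start
    size_mb = os.path.getsize(out) / 1e6
    print(f"{rows_per_rg:>9} rows/RG  {size_mb:8.1f} MB  {elapsed:.3f}s")
```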
Compression Was Worth It
ZSTD reduced the file size from 6.83 GB to 1.82 GB and kept query times low, often under 0.2s for filtered counts and scans. It adds minimal CPU overhead compared to the I/O savings.
That said, for full scans like SUM(driver_pay), neither compression alone nor compression combined with dictionary encoding outperforms the 01_Base version.
The decoding overhead slows those scans down a bit. That makes sense, though: compression isn't meant to improve query execution speed, it's meant to reduce the size of the data.
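To see where ZSTD actually earns its keep, you can compare compressed vs. uncompressed bytes per column chunk straight from the metadata. A sketch (the filename is a placeholder):

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("03_dict_comp.parquet").metadata  # placeholder path
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    ratio = col.total_uncompressed_size / max(col.total_compressed_size, 1)
    print(f"{col.path_in_schema:<25} {ratio:5.1f}x")
```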
Encodings and Page Versions Gave Smaller Gains
Switching to better encodings made small additional improvements in file size. Applied without compression, encodings can be extremely powerful on their own, but combined with compression they truly shine.
The column hvfhs_license_num has low cardinality, just a few distinct values. Fixing the tiny row groups brought the filter on it (query 4) down to 0.085s, and turning on dictionary encoding (02a_Dict) brought a small additional gain: 0.0846s.
Since in most cases encoding trades a little query performance (extra decoding overhead) for space savings, I'll take the W.
Parquet V2 pages gave marginal improvements. They’re worth turning on if your engine supports them, but they won’t move the needle like row groups or compression will.
Sorting Provided Targeted Speedups
Sorting the file by PULocationID had a huge impact on one query: a count filter on that column.
Unsorted: 0.07s
Sorted: 0.007s
That's a 10x improvement from better row group skipping. The rest of the queries didn’t change much, and the file got a bit larger, but it’s a trade-off that’s worth it if you know your access patterns.
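Producing the sorted variant and seeing why it skips so well is straightforward: sort before writing, then check the per-row-group min/max stats for PULocationID. After sorting, each row group covers a narrow ID range, so the engine can ignore most of them. A sketch (paths are placeholders; Table.sort_by needs a reasonably recent PyArrow):

```python
import pyarrow.parquet as pq

table = pq.read_table("fhvhv_tripdata.parquet").sort_by("PULocationID")
pq.write_table(table, "04b_sorted.parquet",
               row_group_size=1_000_000, compression="zstd")

pf = pq.ParquetFile("04b_sorted.parquet")
pu_idx = pf.schema_arrow.get_field_index("PULocationID")
for i in range(pf.metadata.num_row_groups):
    stats = pf.metadata.row_group(i).column(pu_idx).statistics
    print(f"RG {i:02d}: min={stats.min} max={stats.max}")
```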
Even Worst-Case Row Reads Got Faster
Reconstructing full rows (all columns, LIMIT 10) is the worst thing you can do to a columnar format.
Worst: 2.27s
Optimized: 0.023s
Good structure helps even when Parquet’s working against you.
So what about it?
Parquet has sensible defaults, and if you’re not actively trying to sabotage them, things usually work fine. But if you know which knobs to turn, you can make it blazingly fast™. Kachow.
The takeaway: the defaults don't know your workload. You do.
Thanks for reading!