Partitioning on a categorical (low-cardinality) column

Question

Similar to Improving performance with a common low-cardinality field

We receive a large dataset and load it by source; call that source team_id. Our data is currently partitioned by team_id and then by event timestamp. However, this means team_id is repeated in every row of each (e.g. monthly) partition for that team. Is there a way to save space, both in the tables and in any indexes that include team_id?

How many teams do you have? A smallint supports up to 32767 teams, integer up to 2147483647 teams, and bigint up to 9223372036854775807. Now you can choose between 2, 4, and 8 bytes per record. This represents roughly 2, 4, and 8 GB per one billion records. How many billion records do you have or expect?
– Frank Heikens, Commented Sep 8 at 16:34
Short answer: unless your data set is in the tens of TB or more, it's probably not worth your time to overthink this, except as a learning exercise.
– Bill Karwin, Commented Sep 8 at 22:07
The dataset is about 300-400 GB right now, we get about 80 M records per year.
– raphael, Commented Sep 9 at 14:56
80 million records per year translates to a few hundred MB of team_id storage per year (2 to 8 bytes per record). It's a waste of time to try to optimize storage.
– Frank Heikens, Commented Sep 10 at 15:57
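
The per-column arithmetic from the comments can be checked with a short sketch. It assumes the integer widths quoted above (2, 4, and 8 bytes) and the asker's figure of 80 M rows per year. It ignores alignment padding, tuple headers, and index overhead, so treat the result as raw column payload only. The function and constant names are illustrative, not part of any library.

```python
# Bytes per value for smallint / integer / bigint, as quoted in the comments.
TYPE_WIDTH_BYTES = {"smallint": 2, "integer": 4, "bigint": 8}

ROWS_PER_YEAR = 80_000_000  # figure given by the asker


def column_bytes(rows: int, type_name: str) -> int:
    """Raw payload of one column; ignores padding and index overhead."""
    return rows * TYPE_WIDTH_BYTES[type_name]


def yearly_estimate_mb(rows: int = ROWS_PER_YEAR) -> dict[str, float]:
    """Estimated team_id payload per year, in MB (10**6 bytes), per type."""
    return {
        name: column_bytes(rows, name) / 1_000_000
        for name in TYPE_WIDTH_BYTES
    }


if __name__ == "__main__":
    for name, mb in yearly_estimate_mb().items():
        print(f"{name:>8}: {mb:,.0f} MB per year")
```

Under these assumptions the team_id payload comes to roughly 160, 320, or 640 MB per year, which is small next to a 300-400 GB dataset.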

Erwin Brandstetter · Accepted Answer · 2025-09-10 01:47:21Z

For a very small, mostly constant number of teams and large cardinalities of otherwise narrow rows, it can make sense to have a separate table per team - and no team_id column at all, which mainly benefits index size (and performance). Especially if each team table is, in turn, partitioned (by time) like in your case.

I would only consider the added overhead if the bulk of your queries targets a single team. Else you create a lot of overhead when addressing multiple or all teams.

You do save some overhead by removing one level of partitioning in your case.
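
A minimal sketch of what the table-per-team layout could look like, assuming PostgreSQL declarative partitioning. The Python helper only builds DDL strings; the table name prefix `events_`, the columns `event_ts` and `payload`, and the monthly granularity are hypothetical placeholders, not taken from the question.

```python
from datetime import date


def next_month(d: date) -> date:
    """First day of the month following d."""
    if d.month == 12:
        return date(d.year + 1, 1, 1)
    return date(d.year, d.month + 1, 1)


def team_table_ddl(team: str, months: list[date]) -> list[str]:
    """Build DDL for one team: a parent table with no team_id column,
    range-partitioned by time, plus one partition per month."""
    # The team name becomes part of an identifier, so accept only
    # plain identifier-like names (no quoting or escaping is done here).
    if not team.isidentifier():
        raise ValueError(f"unsafe team name: {team!r}")
    parent = f"events_{team}"
    stmts = [
        f"CREATE TABLE {parent} (\n"
        "    event_ts timestamptz NOT NULL,\n"
        "    payload  jsonb\n"
        ") PARTITION BY RANGE (event_ts);"
    ]
    for start in months:
        end = next_month(start)
        stmts.append(
            f"CREATE TABLE {parent}_{start:%Y_%m} PARTITION OF {parent}\n"
            f"    FOR VALUES FROM ('{start.isoformat()}') "
            f"TO ('{end.isoformat()}');"
        )
    return stmts


if __name__ == "__main__":
    for stmt in team_table_ddl("alpha", [date(2025, 9, 1), date(2025, 10, 1)]):
        print(stmt, end="\n\n")
```

With this layout the team is implied by the table name, so neither rows nor indexes carry team_id. The trade-off noted above still applies: a query spanning several teams has to combine the per-team tables itself, for example with UNION ALL.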

Laurenz Albe · Accepted Answer · 2025-09-09 06:53:57Z

No, you have to keep the team_id in every row of every partition, even if it is the same for all rows in a partition.
