Sadeq Dousti


PostgreSQL + Outbox Pattern Revamped — Part 1

The outbox pattern is a well-known design pattern used in distributed systems to ensure messages are sent reliably, by first storing messages in a local database before sending them to a message broker. The outbox table often contains fields like id, payload, created_at, and published_at. There's a scheduled process that picks up N messages from the outbox, sends them to the message broker, and marks them as published.

While this seems pretty straightforward, I've seen quite a number of incidents related to the outbox pattern. In this blog post, I'll go over some of these incidents and discuss how to improve your setup to avoid them. With the last example, I'll demonstrate an issue that is often overlooked and suggest a "revamped" solution that can significantly improve performance and reliability. This solution is based on partitioning the outbox table.

I focus exclusively on PostgreSQL as my database of choice, but I think the principles can be applied to many other databases as well. All the experiments are done on PostgreSQL 17.5, on a MacBook Air M3 with 24 GB of RAM.

Part 1 of this blog post gives a somewhat complete answer on how to avoid almost all outbox-pattern issues, yet it does not cover the problem of converting a regular outbox table to a partitioned table. Part 2 will tackle this problem, as well as give a more in-depth explanation of how to remove index bloat and improve performance further by adjusting the settings of the autovacuum daemon for the partitioned outbox table.

Note: Some people may favor Debezium or similar tools for the outbox pattern. With such tools, you only need to insert into the outbox table, and the tool handles the rest by publishing the messages to the message broker. It does so by processing the database change logs, which in PostgreSQL means the WAL (Write-Ahead Log). I'm a proponent of Debezium myself, but there are situations where teams prefer to implement the outbox pattern manually. Examples include:

  1. avoiding the additional complexity of using Debezium,
  2. publishing to something other than Kafka (Debezium supports other sinks but at additional complexity), and
  3. avoiding logical replication in PostgreSQL (to simplify upgrades and avoid potential issues with logical replication slots).

As such, this blog post is not about Debezium, but rather about the outbox pattern itself.

Table of Contents

Incident 1: Sorting by created_at
Incident 2: Using the wrong data types
Incident 3: Incorrect indexing
  * Bad alternative 1
  * Bad alternative 2
Incident 4: Picking too many messages at each iteration
Incident 5: Poison pill messages
Incident 6: Inefficient index and visibility checking
  * Showcasing the issue
    - Setting up the outbox table and testing the index
    - Looking at the query plan
    - Simulating outbox message insertion and processing
    - Observing the query plan changes
  * How to fix the issue
    - Step 1: Create the partitioned outbox table
    - Step 2: Test the setup
    - Step 3: Test outbox message-processing performance
Conclusion

Incident 1: Sorting by created_at

In distributed systems, deriving a total ordering of events from local timestamps is a fatal error. The first incident occurred when each pod of the microservice labeled its outbox messages with its own created_at timestamp. The message publisher fetched messages from the outbox and sorted them by created_at before sending them to the message broker:

SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 10;

This led to a situation where messages were sent in an order that was not consistent with the total ordering of events across the system, since the pods could have slightly different clocks. In our case, the difference was only a fraction of a millisecond, but it was enough to cause issues in a highly concurrent system.

Setting created_at using a single source of truth, such as the database server's clock, can help mitigate this issue. However, experience shows that relying on time for ordering is not a good idea in distributed systems. Instead, consider using a database-supplied sequence number.

In PostgreSQL, define the id column as an identity column (this post uses GENERATED ALWAYS AS IDENTITY), and then use it to order the messages:

SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY id
LIMIT 10;

Incident 2: Using the wrong data types

Continuing with the previous tip, always use the int8 (= bigint) data type for the id column, rather than int4 (= int or integer). The latter allows for only about two billion entries (integers are signed in PostgreSQL), and I've seen incidents where the DB ran out of sequence numbers (though not in the context of the outbox).

Tips:

  1. Use this script to find int4 primary keys with out-of-range risks in a large database.
  2. Use this technique to redefine an int4 primary key to int8 without downtime.
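
As a quick first check (a sketch, not the script referenced above), you can query the pg_sequences view to see how much of the int4 range each sequence has already consumed:

-- Rough check: percentage of the int4 range already consumed by each sequence.
-- Sequences backing int4 identity/serial columns that approach 100% are at risk.
SELECT schemaname,
       sequencename,
       last_value,
       round(100.0 * last_value / 2147483647, 2) AS pct_of_int4_range
FROM pg_sequences
WHERE last_value IS NOT NULL
ORDER BY pct_of_int4_range DESC;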

Also, for timestamps, always use the variant that accounts for time zones, such as timestamptz in PostgreSQL.
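
As a quick illustration (assuming the session TimeZone is set to UTC), a plain timestamp silently drops the input's offset, while timestamptz preserves the instant:

SELECT '2025-06-01 12:00:00+02'::timestamptz AS with_tz,    -- 2025-06-01 10:00:00+00
       '2025-06-01 12:00:00+02'::timestamp   AS without_tz;  -- 2025-06-01 12:00:00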

For the payload column, you can use text, or the json/jsonb types in PostgreSQL. The latter are better if you want to store a serialized version of your data and possibly index it. However, they can be slower to store and retrieve, and may take more space. Also, if your JSON serializer may output the null character (\x00), you can't use jsonb, since it doesn't allow null characters:

create table o_jsonb (payload jsonb);
insert into o_jsonb values ('{"resp":"\x00"}');

Results in:

ERROR:  invalid input syntax for type json
LINE 1: insert into o_jsonb values ('{"resp":"\x00"}');
                                    ^
DETAIL:  Escape sequence "\x" is invalid.
CONTEXT:  JSON data, line 1: {"resp":"\x...

More info: (You don't) Insert unicode NULL character as Postgres jsonb
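
As for the indexing point above: if you do choose jsonb for the payload, a GIN index lets you filter on the payload's contents. Here's a minimal sketch with a hypothetical table and a hypothetical event_type field (it does not apply directly to the json column used in the rest of this post):

CREATE TABLE outbox_jsonb_demo (payload JSONB NOT NULL);
CREATE INDEX idx_payload_gin ON outbox_jsonb_demo USING GIN (payload);

-- The containment operator @> can use the GIN index:
SELECT * FROM outbox_jsonb_demo
WHERE payload @> '{"event_type": "order_created"}';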

All in all, the outbox table should look something like this:

CREATE TABLE outbox (
    id INT8 GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload JSON NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ
);

Incident 3: Incorrect indexing

This is the most common incident I've seen. For the following query:

SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY id
LIMIT 10;

The correct index is a partial index as follows. Note the column being indexed is id, and the condition is published_at IS NULL:

CREATE INDEX idx1 ON outbox (id)
WHERE published_at IS NULL;

Let's insert some data into the outbox table:

-- insert data with non-null published_at
INSERT INTO outbox(payload, published_at) 
SELECT '{}', now() FROM generate_series(1,10_000);

-- insert data with null published_at
INSERT INTO outbox(payload)
SELECT '{}' FROM generate_series(1,10_000);

Check the query plan:

EXPLAIN (COSTS OFF)
SELECT * FROM outbox
WHERE published_at IS NULL
ORDER BY id
LIMIT 10;

Which uses the index idx1:

              QUERY PLAN
---------------------------------------
 Limit
   ->  Index Scan using idx1 on outbox

Bad alternative 1

If you drop the index (DROP INDEX idx1;), the planner will use the primary key index instead, but since that index is not partial, it will be much bigger and less efficient; note the Filter condition in the query plan:

                  QUERY PLAN
----------------------------------------------
 Limit
   ->  Index Scan using outbox_pkey on outbox
         Filter: (published_at IS NULL)

If the query sorted by created_at (which is not indexed) instead of id, Postgres would perform a sequential scan and a sort, which is even worse:

                  QUERY PLAN
----------------------------------------------
 Limit
   ->  Sort
         Sort Key: created_at
         ->  Seq Scan on outbox
               Filter: (published_at IS NULL)

Bad alternative 2

I've seen several cases where the index is defined on the published_at column instead of id. This is essentially useless: the B-tree index will contain nothing but NULL values, and there's no way to use it for both filtering and sorting by id.

-- ❌ Don't do this ❌
CREATE INDEX idx2 ON outbox (published_at)
WHERE published_at IS NULL;

Incident 4: Picking too many messages at each iteration

PostgreSQL hates long-running transactions, and the outbox pattern is no exception. If you pick too many messages at once, you may end up with a long-running transaction that blocks other operations on the outbox table (such as autovacuum), leading to performance degradation. If N is too large, PostgreSQL may even prefer not to use the index at all!

Notice that the outbox pattern does the following steps:

  1. Open a transaction
  2. Fetch N messages from the outbox
  3. Publish the messages to the message broker
  4. Commit the transaction

Step 3 entails network I/O, which can take a long time, especially if the message broker is slow or the network is congested. If you pick too many messages at once, the transaction will take longer to complete. Also, any failure will result in the entire transaction being rolled back, which means all the messages will be reprocessed.

Do some performance testing to find the optimal number of messages to pick at each iteration. In my experience, picking ~100 messages at a time is a good starting point. Do not pick more than 1000 messages at once!

Also, if possible, send the messages to the message broker in batches, rather than one by one. The batching should honor the message order: group messages by some criterion (e.g., user ID or account ID), send the messages within each group in order, and allow different groups to be processed in parallel. We use Kotlin coroutines rather than threads for maximum performance.
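
On the SQL side, the grouping could look like the sketch below, assuming the outbox table had an extra grouping column (say, group_key holding the user or account ID; it is not part of the table defined in this post). Each array preserves the per-group order, and different groups can then be published in parallel:

-- group_key is a hypothetical column, not part of the outbox table defined above.
SELECT group_key,
       array_agg(id ORDER BY id) AS ids_in_order
FROM (
    SELECT id, group_key
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 100
) AS batch
GROUP BY group_key;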

Incident 5: Poison pill messages

A message that cannot be processed due to some error is called a "poison pill" message. If such a message is picked up by the outbox publisher, it will cause the entire transaction to fail, and the message will be reprocessed indefinitely. Grouping messages, as mentioned in the previous incident, can help limit the impact of poison pill messages: while the DB transaction still spans all the groups, each group can be processed independently, and exceptions in one group are caught so that they don't affect the others.

However, this is not a perfect solution, and alerting should be in place to notify the team when a poison pill message is detected. The team should then investigate the root cause and fix it. (Example: The Kafka topic to which the message is being published has a new schema in the Schema Registry that is incompatible with the message payload.)

Incident 6: Inefficient index and visibility checking

This is by far the most complex issue, and understanding it requires a bit of background on how PostgreSQL handles visibility and indexing. It is also the main reason I wrote this blog post, as the solution we came up with was really rewarding in terms of performance and reliability. It kind of revamps the outbox pattern, in that we use partitioned tables without additional indexes, and allows for easy removal of old messages in the blink of an eye!

The issue arises when the outbox table is being inserted into frequently, and at the same time the messages are being published to the message broker at a fast rate. This causes any index on the outbox table to become "inefficient" (full of dead entries and bloated), which in turn leads to performance degradation. The inefficient index takes up more space and requires more time to scan. Besides, the visibility-checking mechanism in PostgreSQL adds overhead, as Postgres needs to check the visibility of each row returned by the index, possibly fetching many rows from the heap.

Showcasing the issue

Let's first demonstrate the issue with a simple example. It is important to note that this kind of issue is most pronounced in high-throughput systems, where the outbox table is written to and processed at a fast rate. If you have a low-throughput system, you may never see this issue in practice, but it is still good to be aware of it.

Setting up the outbox table and testing the index

We first set up the outbox table and its partial index as described earlier. I'm using an UNLOGGED table to speed up the tests, but you should use a regular table in production:

CREATE UNLOGGED TABLE outbox (
    id INT8 GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload JSON NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ
);

CREATE INDEX idx1 ON outbox (id)
WHERE published_at IS NULL;

Next, we insert a large number of messages into the outbox table:

INSERT INTO outbox(payload)
SELECT '{}'
FROM generate_series(1, 1_000_000);

VACUUM ANALYZE outbox;

Looking at the query plan

Next, let's check the query plan for fetching messages from the outbox. Notice that I'm only fetching the id column, to encourage an "index-only scan." I'm also fetching a large number of messages (1000), in contrast to what was suggested earlier (see Incident 4). Both choices are for demonstration purposes only, to make a point that should become clear soon.

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
    SELECT id
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 1000;

The query plan shows that Postgres is using the index idx1 to fetch the messages:

                              QUERY PLAN
-----------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   ->  Index Only Scan using idx1 on outbox (actual rows=1000 loops=1)
         Heap Fetches: 0
 Planning Time: 0.137 ms
 Execution Time: 0.133 ms

As expected, the index idx1 is used, the planner uses an "index-only scan," and there are no heap fetches, meaning that all the required data is available in the index itself. In other words, after reading the data from the index, Postgres does not need to access the heap (the actual table) to fetch any extra information. The query is fast, and it only takes a fraction of a millisecond to execute.
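
If you want to see why there are no heap fetches, you can check how many of the table's pages are currently marked all-visible in the visibility map; right after the VACUUM ANALYZE above, virtually all of them should be:

SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'outbox';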

Simulating outbox message insertion and processing

Now, let's run two additional parallel sessions: one that inserts new messages into the outbox table, and another that processes messages from the outbox. While both parallel sessions are running, we'll check the query plan again and again to see how it changes over time. Here, I opted for a simple loop rather than a pgbench script, as I felt it's more beginner-friendly and easier to understand. However, you can use pgbench or any other tool to simulate the load in a more realistic way, and also gather statistics on the performance of each parallel session.

Parallel Session 1: Insert messages into the outbox table

DO $$
BEGIN
    LOOP
        INSERT INTO outbox(payload)
        SELECT '{}'
        FROM generate_series(1, 1000);

        -- Commit the transaction to avoid long-running transactions
        COMMIT;
    END LOOP;
END $$;

Parallel Session 2: Process messages from the outbox table

DO $$
BEGIN
    LOOP
        WITH t AS (
            SELECT id
            FROM outbox
            WHERE published_at IS NULL
            ORDER BY id
            LIMIT 10000
        )
        UPDATE outbox
        SET published_at = NOW()
        WHERE id IN (SELECT id FROM t);

        -- Commit the transaction to avoid long-running transactions
        COMMIT;
    END LOOP;
END $$;

Observing the query plan changes

While the two parallel sessions are running, let's check the query plan again and again:

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
    SELECT id
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 1000;

After just a few seconds, I got this awful query plan. Note that it fetches close to 100 million rows from the heap, which takes over 18.5 seconds to execute. This is a slowdown of five orders of magnitude compared to the initial query plan, and can easily lead to an incident.

                              QUERY PLAN
-----------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   ->  Index Only Scan using idx1 on outbox (actual rows=1000 loops=1)
         Heap Fetches: 96633220
 Planning Time: 0.067 ms
 Execution Time: 18553.061 ms

I should point out that not all plans were this bad. I executed it several times, and the results I got were as follows:

  # | Heap Fetches | Execution Time (ms)
 ---+--------------+---------------------
  1 |       870094 |                  48
  2 |     96633220 |               18553
  3 |        11000 |                2844
  4 |         1000 |                1791
  5 |        11000 |                2708

The reason for this behavior is that autovacuum kicks in, cleans up the table and the index, and updates the visibility map. As such, the number of heap fetches varies over time, but in most cases the query was still four orders of magnitude slower than the baseline!
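
One way to correlate the fast and slow runs with autovacuum activity is to watch the table's statistics while the two sessions are running:

SELECT relname, n_dead_tup, last_autovacuum, autovacuum_count
FROM pg_stat_user_tables
WHERE relname = 'outbox';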

Note: While vacuum can clean up the index, it only marks dead entries in the index as invalid, and does not actually remove them from the index (doing so would require rebuilding the index structure, which would slow down vacuum). Over time, this causes the index to become bloated. We'll discuss this in Part 2 of this blog post.

How to fix the issue

Fixing the issue is actually quite simple: Just use a partitioned outbox table. More specifically, the outbox table is partitioned by whether published_at is NULL or not. This way, one partition will always contain the messages that are not yet published, and the other partition will contain the messages that are already published. There will be no need for fetching tons of rows from the heap, since any row in the "unpublished" partition is guaranteed to have published_at IS NULL. Furthermore, it is straightforward to remove all messages from the "published" partition: You can simply TRUNCATE the partition regularly, which is a rapid operation in PostgreSQL. Finally, the operations are more "vacuum-friendly," as the "published" partition is only inserted to.
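
For example, once the partitions from Step 1 below are in place, the periodic cleanup can be as simple as this (assuming nothing downstream still needs the already-published rows):

-- Run periodically, e.g. from a scheduled job:
TRUNCATE TABLE outbox_published;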

Let's dive in.

Step 1: Create the partitioned outbox table

We create the outbox table as before (note that UNLOGGED is for testing purposes; use a regular table in production), with two differences:

  1. The table is partitioned by the published_at column.
  2. The table has no primary key: a primary key on a partitioned table must include the partition key column (published_at), which is not what we want here. Instead, we rely on GENERATED ALWAYS AS IDENTITY to generate unique IDs. Although in some cases (e.g., logical replication) the lack of a primary key can cause issues, this is an acceptable trade-off for the outbox table in most use cases.

CREATE UNLOGGED TABLE outbox (
    id INT8 GENERATED ALWAYS AS IDENTITY,
    payload JSON NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    published_at TIMESTAMPTZ
)
PARTITION BY LIST (published_at);

Next, we create two partitions: one for the messages that are not yet published, and another for the messages that are already published. The first partition is called outbox_unpublished, and the second partition is called outbox_published.

CREATE UNLOGGED TABLE outbox_unpublished 
    PARTITION OF outbox
    FOR VALUES IN (NULL);

CREATE UNLOGGED TABLE outbox_published 
    PARTITION OF outbox
    DEFAULT;

Finally, we create the index on the outbox_unpublished partition, as it is the one that will be queried for fetching messages to publish. The other partition will not be queried, so it does not need an index. Also, the index no longer needs to be a partial index, as the partition itself only contains rows where published_at IS NULL. However, for demonstration purposes (to show heap fetches later), I opt to create a partial index anyway:

CREATE INDEX idx1 ON outbox_unpublished (id)
-- This is unnecessary, as the partition is already filtered by published_at IS NULL
WHERE published_at IS NULL;
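
For completeness, outside of this demonstration you would simply drop the predicate:

-- What you'd create in production instead of the partial index above:
CREATE INDEX idx1 ON outbox_unpublished (id);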

Step 2: Test the setup

Let's first insert some messages into the outbox partition:

INSERT INTO outbox(payload)
SELECT '{}'
FROM generate_series(1, 1_000_000);

VACUUM ANALYZE outbox;

Next, let's check the stats of the number of tuples inserted, updated, and deleted in each partition:

SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname LIKE 'outbox_%';

This is as expected: All 1 million tuples are inserted into the outbox_unpublished partition:

      relname       | n_tup_ins | n_tup_upd | n_tup_del
--------------------+-----------+-----------+-----------
 outbox_unpublished |   1000000 |         0 |         0
 outbox_published   |         0 |         0 |         0

What if we update the published_at column for a single row? The wrong way to do this, which is an easy way to create an incident, is as follows:

-- ❌ Don't do this ❌
UPDATE outbox
SET published_at = NOW()
WHERE id = 1;

Why on earth is this a bad idea? Because the outbox table has no index on the id column, remember? Only the outbox_unpublished partition has an index on the id column, so the above query will perform a sequential scan on the outbox_published partition:

EXPLAIN (COSTS OFF, SUMMARY OFF)
    UPDATE outbox
    SET published_at = NOW()
    WHERE id = 1;

And here's the query plan:

                     QUERY PLAN
-----------------------------------------------------
 Update on outbox
   Update on outbox_unpublished outbox_1
   Update on outbox_published outbox_2
   ->  Append
         ->  Seq Scan on outbox_unpublished outbox_1
               Filter: (id = 1)
         ->  Seq Scan on outbox_published outbox_2
               Filter: (id = 1)

You see a Seq Scan on both partitions because I opted for a partial index on the outbox_unpublished partition: the planner can only use a partial index when the query's WHERE clause implies the index predicate, and WHERE id = 1 alone does not imply published_at IS NULL. As I pointed out earlier, the partial index is just for demonstration purposes.

Regardless, the point is that the outbox_published partition will be sequentially scanned. In the beginning, when outbox_published is empty, this Seq Scan will be fast. But as soon as the partition starts to accumulate messages, the query will become slower and slower, to the point where it can cause an incident.

With partitioned tables, it's always best to include the partition key in the WHERE clause. In our case, we can do this by using the published_at IS NULL condition:

EXPLAIN (COSTS OFF, SUMMARY OFF)
    UPDATE outbox
    SET published_at = NOW()
    WHERE id = 1
    AND published_at IS NULL; -- This is the correct way to do it

The above will only scan the outbox_unpublished partition, and will be fast:

                         QUERY PLAN
------------------------------------------------------------
 Update on outbox
   Update on outbox_unpublished outbox_1
   ->  Index Scan using idx1 on outbox_unpublished outbox_1
         Index Cond: (id = 1)
         Filter: (published_at IS NULL)

Let's now actually run the update query, and check the stats again:

-- Update the published_at column for a single row
UPDATE outbox
SET published_at = NOW()
WHERE id = 1
AND published_at IS NULL;

-- Check the stats again
SELECT relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_user_tables
WHERE relname LIKE 'outbox_%';

The stats now show that one tuple was deleted from the outbox_unpublished partition and one was inserted into the outbox_published partition; no tuples were counted as updated. This is because PostgreSQL executes a cross-partition UPDATE as a DELETE from the source partition plus an INSERT into the target partition:

      relname       | n_tup_ins | n_tup_upd | n_tup_del
--------------------+-----------+-----------+-----------
 outbox_unpublished |   1000000 |         0 |         1
 outbox_published   |         1 |         0 |         0

Step 3: Test outbox message-processing performance

Before running the message-processing query, let's check the query plan for fetching messages from the outbox:

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
    SELECT id
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 1000;

The result is as follows:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   ->  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 156
 Planning Time: 0.170 ms
 Execution Time: 0.165 ms

Wait, what?! There are heap fetches? Yes, because we updated the published_at column for a single row, and the change is not significant enough to trigger an autovacuum. Do a manual VACUUM outbox, and check the query plan again. You'll be surprised to see that there are still 156 heap fetches!

The reason is that, starting from Postgres 14, VACUUM may skip index cleanup when the number of removable index entries is not significant. It has an option called INDEX_CLEANUP, which is set to AUTO by default. To force the index cleanup, you can set it to ON:

VACUUM (INDEX_CLEANUP ON) outbox;

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
    SELECT id
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 1000;

The result is as expected:

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   ->  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 0
 Planning Time: 0.166 ms
 Execution Time: 0.135 ms

I'll reiterate that index cleanup is different from index bloat removal. This is a topic for the next part of this blog post, where we'll discuss how to remove index bloat and improve performance even further.

Okay, let's now run the two parallel sessions again. The code for session 1 is exactly the same as before, but for session 2, we need to change the query so that the UPDATE statement uses the additional condition published_at IS NULL to ensure that it only updates the outbox_unpublished partition:

Parallel Session 1: Insert messages into the outbox table

DO $$
BEGIN
    LOOP
        INSERT INTO outbox(payload)
        SELECT '{}'
        FROM generate_series(1, 1000);

        COMMIT;
    END LOOP;
END $$;

Parallel Session 2: Process messages from the outbox table

DO $$
BEGIN
    LOOP
        WITH t AS (
            SELECT id
            FROM outbox
            WHERE published_at IS NULL
            ORDER BY id
            LIMIT 10000
        )
        UPDATE outbox
        SET published_at = NOW()
        WHERE id IN (SELECT id FROM t)
        -- Ensure we only update the unpublished partition
        AND published_at IS NULL;

        COMMIT;
    END LOOP;
END $$;

Interestingly, the fetch query is pretty fast, and seemingly unaffected by the parallel sessions. I ran it many times over, and it gave me a consistent result of 1000 heap fetches and an execution time of 1–3 ms:

EXPLAIN (ANALYZE, TIMING OFF, COSTS OFF)
    SELECT id
    FROM outbox
    WHERE published_at IS NULL
    ORDER BY id
    LIMIT 1000;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Limit (actual rows=1000 loops=1)
   ->  Index Only Scan using idx1 on outbox_unpublished outbox (actual rows=1000 loops=1)
         Heap Fetches: 1000
 Planning Time: 0.042 ms
 Execution Time: 2.543 ms

Conclusion

The outbox pattern is a powerful tool for ensuring reliable message delivery in distributed systems. However, it can lead to incidents if not implemented correctly. In this blog post, we discussed several common pitfalls and how to avoid them, including:

  • Sorting by created_at
  • Using the wrong data types
  • Incorrect indexing
  • Picking too many messages at each iteration
  • Poison pill messages
  • Inefficient index and visibility checking

The last point is particularly important, as it can lead to significant performance degradation in high-throughput systems. We showed how to revamp the outbox pattern by using partitioned tables, which allows for easy removal of old messages and avoids the issues with stale indexes and visibility checking.

While this revamped solution is straightforward to implement on a greenfield project, it can be more challenging to apply to an existing system, because PostgreSQL does not allow converting a regular table to a partitioned table in place. In part 2 of this blog post, we'll tackle this problem, and also discuss how to remove index bloat and improve performance even further by adjusting the settings of the autovacuum daemon for the partitioned outbox table. Stay tuned, and I wish you an incident-free outbox pattern implementation!

Top comments (4)

Lord Vlad

Great content, great writing, looking forward to part 2

Eci Peci

Great post, thank you so much.
I have just one thing that is not clear to me. I see how it works amazingly well for published partition and that one can be truncated to be on a clean slate from time to time.
But how all this scales with unpublished partition? And if it scales well, then why?
Isn't it a problem for that partition that it is not insert-only and can never be truncated?

Sadeq Dousti

Thanks! The unpublished partition is supposed to contain a small number of messages, since we want to publish them ASAP. If the system throughput for message publication isn't enough, this partition size will grow indefinitely, causing issues.
This partition isn't insert-only. When we update messages, they are effectively deleted from it and inserted into the published partition.
We don't want to truncate it, but adjusting vacuum and doing index maintenance operations are important. These will be discussed in part 2 🙂

Rudy Zidan

Thanks for sharing <3
