Sebastian for epilot

Scaling Notification Systems: How a Single Timestamp Improved Our DynamoDB Performance

Introduction

The epilot platform contains a comprehensive notification system. Users receive notifications about ongoing tasks, such as new assignments or overdue tasks. They can also get notified about incoming emails or when someone mentions them in notes, and the list goes on. Users can choose to receive these notifications via email or as in-app notifications. This article focuses on the latter.
Initially, in-app notifications were stored in Aurora (AWS's managed relational database service). This setup soon became a major pain point, prompting us to migrate to DynamoDB. The simplicity of the notification data structure and the volume of read and write operations we expected made DynamoDB the perfect choice for scaling.
However, if you don't think carefully about how you design access patterns in DynamoDB, more problems arise than you'd expect.
Let's dive into why a bad implementation of a markAllAsRead feature caused us some headaches, and how we reduced the complexity from O(n) to O(1) by using a timestamp-based approach for unread notifications.

The Problem

The initial design was straightforward. Each notification is stored as its own item in the DynamoDB table. The partition key (pk) was a combination of user_id and organization_id, while the sort key (sk) contains the notification_id. The access patterns were simple: fetch all notifications for a given user, mark a notification as read, and, for the lazy ones, mark all notifications as read. The latter is the origin of this article.
An attribute read_state indicates whether a notification has already been read by the user. Marking a single notification as read was as simple as:

async function markAsRead(params: { ... }) {
  await ddb.update({
    TableName: config.NOTIFICATIONS_TABLE,
    Key: toUserNotificationSK(params),
    UpdateExpression: 'SET read_state = :read_state',
    ExpressionAttributeValues: {
      ':read_state': 1, // binary 1 is true
    },
  });
}
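The snippet above references a key helper, toUserNotificationSK, without showing it. A hypothetical sketch of what such a helper looks like (the exact key format here is an assumption based on the pk/sk layout described above, not epilot's actual code):

```typescript
// Hypothetical key helper for the notifications table. The key format is an
// assumption: pk combines org and user, sk carries the notification id.
type NotificationKeyParams = {
  orgId: string;
  userId: string;
  notificationId: string;
};

function toUserNotificationSK(params: NotificationKeyParams) {
  return {
    pk: `ORG#${params.orgId}#USER#${params.userId}`, // partition key: one tenant (org + user)
    sk: `NOTIFICATION#${params.notificationId}`,     // sort key: one notification item
  };
}
```

Deriving the key from typed parameters in one place keeps every read and write path addressing the same item consistently.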

Once a notification is read, the item is updated and read_state is set to 1. A Global Secondary Index (GSI) called byReadState then allows us to read all unread notifications for a given tenant (org + user). This design led to two operations that performed poorly:

  1. A bad implementation of the markAllAsRead feature. It first queried all unread notifications and then performed a batch operation to update all of them to read. As shown in the graph below, DynamoDB began to throttle under load when users with lots of unread notifications used the mark-all-as-read feature.

  2. To indicate that a user has unread messages, a getTotalUnreadCount endpoint is exposed. It drives the notification bell in the UI that shows the unread count.
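Both operations started with the same lookup: query the byReadState GSI for unread items. A rough sketch of that query, assuming a GSI keyed on pk and read_state (the table name, key schema, and loosely-typed client are assumptions for illustration, not epilot's production code):

```typescript
// Sketch of the old unread lookup via the byReadState GSI. The client type is
// kept loose so the sketch stands alone without the AWS SDK installed.
type QueryClient = { query: (input: any) => Promise<{ Items?: any[] }> };

async function getUnreadNotifications(
  ddb: QueryClient,
  params: { userId: string; orgId: string }
): Promise<any[]> {
  const { Items = [] } = await ddb.query({
    TableName: 'notifications',        // assumed table name
    IndexName: 'byReadState',          // the GSI from the article
    // Assumes the GSI uses pk as hash key and read_state as range key.
    KeyConditionExpression: 'pk = :pk AND read_state = :unread',
    ExpressionAttributeValues: {
      ':pk': `ORG#${params.orgId}#USER#${params.userId}`,
      ':unread': 0, // binary 0 = unread
    },
  });
  return Items;
}
```

Every item this query returns then had to be written back individually, which is exactly where the trouble started.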

Figure: DynamoDB throttles

The naive implementation to batch update all unread notifications worked surprisingly well in the beginning. However, as the volume of notifications increased, we started experiencing more and more throttling events in DynamoDB. What started as occasional hiccups became a serious bottleneck in our notification service's performance.

The issue was multi-faceted. First, DynamoDB has limits on batch operations, requiring us to split large batches into multiple smaller operations. This not only added complexity to our code but also increased the probability of partial failures.
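The batch splitting itself is simple; a minimal sketch of such a helper, assuming BatchWriteItem's 25-item limit (the real implementation may differ):

```typescript
// Split a list of items into chunks of at most batchSize elements, because
// DynamoDB's BatchWriteItem accepts at most 25 write requests per call.
function createBatches<T>(items: T[], batchSize = 25): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

The complexity is not in the chunking but in what follows: each chunk is a separate network call that can fail or return unprocessed items independently of the others.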
Second, each notification update consumed Write Capacity Units (WCUs) from our table's provisioned capacity. For users with hundreds or thousands of unread notifications, a single "Mark All as Read" action would consume a significant portion of our available WCUs, causing other notification operations to be throttled.
Importantly, these issues didn't affect the entire epilot platform, but were isolated to the notification service itself. Users would see timeouts or delayed responses specifically when interacting with notifications, while the rest of the platform continued to function normally.
However, this created a frustrating user experience, especially for power users who relied heavily on notifications to manage their workflows.
The problem was particularly severe for organizations with large teams, where notification counts could grow rapidly, and the "Mark All as Read" feature was used frequently to manage notification overload.

The Solution: Last Read Timestamp

After evaluating several options, we settled on a timestamp-based approach that would fundamentally change how we track read states while maintaining backward compatibility with our existing system.
Instead of updating each notification individually when a user clicks "Mark All as Read," we simply record the timestamp of when this action occurred. Any notification created before this timestamp is considered "read," while notifications arriving after it are "unread." This solution transforms what was an O(n) operation into an O(1) operation, regardless of how many notifications a user has.

The New Table Structure

We created a new DynamoDB table called notifications-read-state with the following structure:

{
  pk: `ORG#${orgId}#USER#${userId}`,  // Partition key
  sk: `READMARK#${timestamp}`,         // Sort key
  read_at: ISO8601Timestamp,           // When the user marked all as read
  created_at: ISO8601Timestamp         // When this record was created
}

The primary key design allows us to:

  • Efficiently lookup the most recent "mark all as read" timestamp for any user
  • Support multiple organizations per user
  • Maintain a history of read events if needed for analytics

The read_at attribute stores an ISO-formatted timestamp that serves as our "high water mark" for read notifications. This single attribute is the cornerstone of our solution.
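Because READMARK sort keys embed an ISO timestamp, the newest mark can be fetched by reading the partition in descending sort order with Limit: 1. A hedged sketch of that lookup (the loosely-typed client and exact wiring are assumptions):

```typescript
// Sketch: fetch the most recent "mark all as read" timestamp for a tenant.
// ISO 8601 timestamps sort lexicographically, so the newest READMARK item is
// simply the first one when reading the sort key in descending order.
type QueryClient = { query: (input: any) => Promise<{ Items?: any[] }> };

async function getLastReadTimestamp(
  ddb: QueryClient,
  params: { userId: string; orgId: string }
): Promise<string | undefined> {
  const { Items = [] } = await ddb.query({
    TableName: 'notifications-read-state',
    KeyConditionExpression: 'pk = :pk AND begins_with(sk, :prefix)',
    ExpressionAttributeValues: {
      ':pk': `ORG#${params.orgId}#USER#${params.userId}`,
      ':prefix': 'READMARK#',
    },
    ScanIndexForward: false, // descending: newest read mark first
    Limit: 1,                // we only need the latest one
  });
  return Items[0]?.read_at;  // undefined if the user never marked all as read
}
```

This read is a single, cheap key lookup, which is what makes the whole approach O(1) on both the write and the read side.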

Before:

async function markAllAsRead(userId, orgId) {
  // Step 1: Query for all unread notifications
  const unreadNotifications = await getUnreadNotifications(userId, orgId);

  // Step 2: Prepare batch updates (25 items per batch due to DynamoDB limits)
  const batches = createBatches(unreadNotifications, 25);

  // Step 3: Execute all batch updates
  for (const batch of batches) {
    await ddb.batchWrite({
      RequestItems: {
        [NOTIFICATIONS_TABLE]: batch.map(item => ({
          PutRequest: {
            Item: { ...item, read_state: 1 }
          }
        }))
      }
    });
  }
}

Complexity: O(n) - As the number of notifications increases, both processing time and database load increase linearly.

After:

async function markAllAsRead(userId, orgId) {
  const now = new Date().toISOString();

  // Single write operation
  await ddb.put({
    TableName: NOTIFICATIONS_READ_STATE_TABLE,
    Item: {
      pk: `ORG#${orgId}#USER#${userId}`,
      sk: `READMARK#${now}`,
      read_at: now,
      created_at: now
    }
  });
}

Complexity: O(1) - Constant time operation regardless of notification count.

This simple change drastically improved our system's performance. The "Mark All as Read" operation now completes in milliseconds instead of potentially seconds, uses a predictable amount of database capacity, and never times out, even for users with thousands of unread notifications.
What makes this approach particularly powerful is that we don't need to modify any existing notifications. Instead, we're recording a state transition that implicitly affects all notifications for a user at once.

Given the following pseudo-code, the byReadState index can be removed completely. All you need is to fetch the last read timestamp for a given user (via getLastReadTimestamp) and calculate whether each notification has already been seen or not.

const lastReadAt = await getLastReadTimestamp({ userId: params.userId, orgId: params.orgId });

const { Items, LastEvaluatedKey } = await ddb.query({
  ...
  TableName: config.NOTIFICATIONS_TABLE,
  IndexName: 'byTimestamp',
  ExclusiveStartKey: params.cursor ? decodeLastEvaluatedKey(params.cursor) : undefined,
});
...

const notificationsWithReadStatus = Items.map((item) => {
  // A notification is considered read if:
  // - It was created before the last "mark all as read" time OR
  // - It has been individually marked as read (read_state = 1)
  const isRead = item.timestamp <= lastReadAt || item.read_state === 1;

  return {
    ...item,
    read_state: isRead,
  };
});
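The same read predicate answers getTotalUnreadCount: a notification is unread only if it is newer than the last read mark and was never individually marked as read. A sketch of that logic as pure functions (names and shapes here are illustrative assumptions, not the production code):

```typescript
// Sketch: compute unread status from the last "mark all as read" timestamp.
// ISO 8601 strings compare correctly with plain lexicographic comparison.
type NotificationItem = { timestamp: string; read_state?: number };

function isUnread(item: NotificationItem, lastReadAt?: string): boolean {
  const readByMark = lastReadAt !== undefined && item.timestamp <= lastReadAt;
  const readIndividually = item.read_state === 1;
  return !readByMark && !readIndividually;
}

function countUnread(items: NotificationItem[], lastReadAt?: string): number {
  return items.filter((item) => isUnread(item, lastReadAt)).length;
}
```

Keeping the predicate pure also makes it trivial to unit-test, independent of any DynamoDB wiring.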

Lessons Learned

Our journey from the first version to the optimized solution taught us a lot about designing for scale—especially with DynamoDB.

At first, updating each notification one by one seemed fine. It was simple, worked great in dev, and handled early traffic just fine. But as usage grew, that approach quickly hit its limits. It was a good reminder: what works now might not work when your data grows 10x.

The breakthrough came when we stopped trying to optimize the old way and instead rethought the problem. Rather than updating every record, we started recording state changes with timestamps. That shift made things both simpler and faster—and it's a pattern that applies well beyond notifications.

Most importantly, we learned to play to DynamoDB’s strengths: fast, predictable access with simple operations. Once we aligned our design with that, everything clicked.

Conclusion

It’s easy to overthink scalability from the start, but the truth is: you won’t know your real problems until users are actually using the system. Our experience reminded us that it's totally fine to start simple and ship. You learn way more from real-world usage than from guessing at edge cases.

Scalability issues aren’t failures—they’re signs of growth. When we hit our limits, it forced us to rethink things. And funny enough, the fix—a single timestamp—ended up being both simple and powerful. It made the system faster, more reliable, and easier to reason about.

So if you’re torn between shipping something basic now or building for every possible future, go with the simple version. Ship it, learn from it, and improve as you go.

Do you want to work on features like this? Check out our career page or reach out to us on X or LinkedIn.
