A brief background about the app
It’s an app where users upload their data into categories. The app processes that data, runs some additional calculations and aggregations, and stores everything in a database. After that, users can view the resulting data in charts and grids on the website.
There’s also a requirement to capture that data and upload it as CSV files to an S3 bucket — daily, per account and per category.
There are around 20 categories and 10,000 accounts. So in total, the app needs to upload about 200,000 CSV files to S3 every day.
How it was
It all started with a simple cron job that ran a script every day at 5 AM.
The script did 3 things per account and category:
- Loaded data from the MySQL database
- Prepared the CSV file
- Uploaded it to S3
Here’s roughly what it looked like:
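A simplified Python sketch of that kind of sequential loop — pymysql and boto3 are assumptions here, and the table names are made up, not the actual schema:

```python
# export_all.py, run daily at 5 AM via cron (simplified sketch, not the real code)
import csv
import io
import os

import boto3
import pymysql

S3_BUCKET = "daily-exports"  # illustrative bucket name


def load_accounts(db):
    with db.cursor() as cur:
        cur.execute("SELECT id FROM accounts")        # assumed table name
        return [r[0] for r in cur.fetchall()]


def load_categories(db):
    with db.cursor() as cur:
        cur.execute("SELECT id FROM categories")      # assumed table name
        return [r[0] for r in cur.fetchall()]


def load_data(db, account_id, category_id):
    with db.cursor() as cur:
        cur.execute(
            "SELECT * FROM aggregated_data "          # assumed table name
            "WHERE account_id = %s AND category_id = %s",
            (account_id, category_id),
        )
        return cur.fetchall()


def to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("utf-8")


def main():
    s3 = boto3.client("s3")
    db = pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )

    accounts = load_accounts(db)      # ~10,000 accounts
    categories = load_categories(db)  # ~20 categories

    # One sequential pass over every account/category pair: ~200,000 iterations.
    for account in accounts:
        for category in categories:
            rows = load_data(db, account, category)              # 1. load data from MySQL
            body = to_csv(rows)                                  # 2. prepare the CSV file
            key = f"{account}/{category}.csv"
            s3.put_object(Bucket=S3_BUCKET, Key=key, Body=body)  # 3. upload to S3


if __name__ == "__main__":
    main()
```

Everything runs in a single process on a single machine, one pair at a time — which is exactly where the problems below come from.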
Problems 🙄
- It’s not scalable — any increase in categories or accounts would linearly increase execution time
- It’s not fault tolerant — there’s no built-in retry mechanism or partial-success tracking, so a single failure can mean losing an entire day’s exports
- It has a single point of failure — if the machine dies or the script crashes midway, there’s no recovery or continuation
- There’s no proper monitoring — no visibility into what succeeded, what failed, or how long it took
How it is now
Obviously, the previous solution didn’t scale well and needed a fix — fast.
Since we were already using AWS services, I decided to stay within that ecosystem.
I realized that to get the best scalability and fault tolerance, each CSV file had to be prepared and exported separately.
So, I broke the system down into three core components:
- Export Scheduler — handles everything related to scheduling
- Export Trigger — kicks off the export process for each file
- Export Runner — does the actual work: loads the data, prepares the CSV, and uploads it to S3
Here’s what I came up with, and how the new system works end to end:
- EventBridge kicks things off by emitting a daily export event — this is the trigger that starts the whole export flow.
- That event is picked up by the Export Trigger Lambda, which is responsible for fetching relevant application data from the database and generating export requests — one for each account and category.
- These requests are sent to a queue (or stream), which acts as a buffer and decouples the trigger from the actual export logic.
- The Export Runner Lambda then processes each export request: it loads the data, prepares the CSV file, and uploads it directly to the S3 bucket (both handlers are sketched right after this list).
- Meanwhile, logs, metrics, and alarms are pushed to CloudWatch, giving us visibility into what’s happening at every step — useful for monitoring and debugging.
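To make that flow concrete, here’s a simplified sketch of the two Lambda handlers. It assumes SQS as the queue, Python as the runtime, and boto3 for the AWS calls; the environment variables, table names, and resource names are illustrative, not the actual implementation:

```python
# Sketch of the two Lambda handlers; SQS is assumed as the queue between them.
import csv
import io
import json
import os

import boto3
import pymysql

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = os.environ["EXPORT_QUEUE_URL"]
BUCKET = os.environ["EXPORT_BUCKET"]


def connect():
    return pymysql.connect(
        host=os.environ["DB_HOST"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        database=os.environ["DB_NAME"],
    )


def export_trigger_handler(event, context):
    """Runs once a day from the EventBridge rule: fans out one export
    request per account/category pair onto the queue."""
    db = connect()
    with db.cursor() as cur:
        cur.execute("SELECT id FROM accounts")        # assumed table name
        accounts = [r[0] for r in cur.fetchall()]
        cur.execute("SELECT id FROM categories")      # assumed table name
        categories = [r[0] for r in cur.fetchall()]

    export_date = event["time"][:10]  # EventBridge puts the event timestamp in "time"
    requests = [
        {"account_id": a, "category_id": c, "date": export_date}
        for a in accounts
        for c in categories
    ]

    # SQS accepts at most 10 messages per batch call.
    for i in range(0, len(requests), 10):
        batch = requests[i : i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(i + j), "MessageBody": json.dumps(req)}
                for j, req in enumerate(batch)
            ],
        )


def export_runner_handler(event, context):
    """Consumes export requests from SQS: one CSV file per message."""
    db = connect()
    failures = []
    for record in event["Records"]:
        try:
            req = json.loads(record["body"])
            with db.cursor() as cur:
                cur.execute(
                    "SELECT * FROM aggregated_data "              # assumed table name
                    "WHERE account_id = %s AND category_id = %s",
                    (req["account_id"], req["category_id"]),
                )
                rows = cur.fetchall()
            buf = io.StringIO()
            csv.writer(buf).writerows(rows)
            key = f'{req["date"]}/{req["account_id"]}/{req["category_id"]}.csv'
            s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8"))
        except Exception:
            # Report only the failed message; the rest of the batch isn't retried
            # (requires ReportBatchItemFailures on the event source mapping).
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Returning batchItemFailures from the Runner is what keeps failures isolated: only the messages that actually failed go back onto the queue for another attempt, while every successful export stays done.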
What’s solved 🙂
- It’s scalable — each file is handled separately, so the system can easily grow with more categories or accounts without slowing down.
- It’s fault tolerant — if a single export fails, it won’t stop the others; failures are isolated and can be retried (see the retry wiring sketch after this list)
- It has no single point of failure — the whole process runs on distributed, managed services, so there’s no risk of a script crashing and killing the job.
- It has proper monitoring — with logs, metrics, and alarms, you can track what succeeded, what failed, and how long everything took.
- It’s built on AWS-native tools — everything runs on managed services like Lambda, EventBridge, and S3, so there’s no infrastructure to manage or maintain.
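For the retry side of that, here’s a rough sketch of how the queue and the Runner’s event source mapping could be wired, again assuming SQS; the ARNs, URLs, and names are made up for illustration:

```python
# One-off wiring script; all ARNs, URLs, and names below are made up.
import json

import boto3

sqs = boto3.client("sqs")
lambda_client = boto3.client("lambda")

QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/export-requests"
QUEUE_ARN = "arn:aws:sqs:eu-west-1:123456789012:export-requests"
DLQ_ARN = "arn:aws:sqs:eu-west-1:123456789012:export-requests-dlq"

# Messages that still fail after 3 attempts land in the dead-letter queue
# instead of being lost, so they can be inspected and replayed.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "3"}
        ),
        "VisibilityTimeout": "900",  # should exceed the Runner's timeout
    },
)

# Let the Runner report per-message failures, so one bad export doesn't
# force the whole batch back onto the queue.
lambda_client.create_event_source_mapping(
    EventSourceArn=QUEUE_ARN,
    FunctionName="export-runner",
    BatchSize=10,
    FunctionResponseTypes=["ReportBatchItemFailures"],
)
```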
Thanks for reading! If you want to stay updated with future posts, hit that follow button. And don’t hesitate to share your ideas or questions in the comments — I’m eager to hear from you!