Scenario: A large number of records (10k, maybe more), each of them small (50 bytes on average), must be processed. The processing must be done in parallel, or in some other way that improves performance (remember, there are a lot of records to go through). The processing itself is a very simple task (that's one of the reasons for using AWS Lambda). Despite its simplicity, some records may finish processing before or after others; the records are independent of each other, so the order of processing does not matter.
So far, Step Functions looks like the way to go.
With Step Functions, we can have the following graph:
I can define RecordsRetrieval as one task. After that, the records are processed in parallel by the tasks ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3. By the looks of it, all fine and dandy, right? Wrong!
First Problem: Dynamic Scaling
If I want those tasks to scale dynamically (let's say to 10, 100, 5k or 10k) according to the number of records to be processed, I would have to build the state machine JSON dynamically (not a very elegant solution, but it might work). I am fairly confident that the number of tasks has a limit, so I cannot rely on that. It would be much better if the heavy lifting of scaling were handled by the infrastructure and not by me.
Either way, for a well-defined set of parallel tasks like GetAddress, GetPhoneNumber, GetWhatever... it is great! Works like a charm!
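For illustration, building the JSON dynamically could look roughly like the sketch below. It generates a state machine definition with a Parallel state containing N branches, one per ProcessRecords task. The function name `build_state_machine` and both ARNs are placeholders made up for this example, and it does not address whatever limit Step Functions puts on the number of branches.

```python
import json

# Placeholder ARNs, for illustration only; replace with real function ARNs.
RETRIEVAL_ARN = "arn:aws:lambda:us-east-1:123456789012:function:RecordsRetrieval"
PROCESS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords"


def build_state_machine(branch_count, retrieval_arn=RETRIEVAL_ARN, process_arn=PROCESS_ARN):
    """Return a state machine definition (as a dict) with `branch_count` parallel branches."""
    branches = []
    for i in range(1, branch_count + 1):
        task_name = f"ProcessRecords-Task-{i}"
        # Each branch is a tiny state machine of its own with a single Task state.
        branches.append({
            "StartAt": task_name,
            "States": {
                task_name: {"Type": "Task", "Resource": process_arn, "End": True},
            },
        })

    return {
        "StartAt": "RecordsRetrieval",
        "States": {
            "RecordsRetrieval": {
                "Type": "Task",
                "Resource": retrieval_arn,
                "Next": "ProcessRecords",
            },
            # Every branch of a Parallel state receives the same input.
            "ProcessRecords": {"Type": "Parallel", "Branches": branches, "End": True},
        },
    }


if __name__ == "__main__":
    # Print the definition for three branches, matching the graph described above.
    print(json.dumps(build_state_machine(3), indent=2))
```

The definition has to be regenerated (and the state machine updated) every time the desired number of branches changes, which is exactly the part that feels inelegant.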
Second Problem: Payload Dispatch
After the RecordsRetrieval task, I need each of those records to be processed individually. With Step Functions I did not see any way of accomplishing that. Once the RecordsRetrieval task passes along its payload (in this case, the records), all the parallel tasks will be handling the same payload.
Again, just as I said for the first problem, for a well-defined set of parallel tasks it would be a perfect fit.
Conclusion
I think that AWS Step Functions is probably not the solution for my scenario. This is a summary of what I know about it, so feel free to comment if I missed something.
I am leaning towards the microservice approach for many reasons (scalability, serverless, simplicity and so forth).
I know that it is possible to retrieve those records and send them one by one to another Lambda, but again, that is not a very elegant solution.
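The one-by-one approach mentioned above could be sketched roughly as follows, using the `invoke` call of the boto3 Lambda client with `InvocationType="Event"` so that each invocation is asynchronous. The function name `dispatch_records` and the target Lambda name `ProcessRecord` are hypothetical, and the client is passed in as a parameter (in real use it would be `boto3.client("lambda")`).

```python
import json


def dispatch_records(records, lambda_client, function_name="ProcessRecord"):
    """Invoke the processing Lambda asynchronously, once per record.

    `lambda_client` is expected to behave like boto3.client("lambda").
    `function_name` is a hypothetical name for the processing Lambda.
    Returns the number of records dispatched.
    """
    dispatched = 0
    for record in records:
        lambda_client.invoke(
            FunctionName=function_name,
            InvocationType="Event",  # asynchronous: do not wait for the result
            Payload=json.dumps(record).encode("utf-8"),
        )
        dispatched += 1
    return dispatched
```

With 10k records this means 10k separate `invoke` calls from the retrieving Lambda, and error handling and retries for failed calls would still have to be added by hand.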
I also know that this is a batch job and that AWS has the Batch service. What I am trying to do is keep the microservice approach without depending on AWS Batch/EC2.
What are your thoughts about it? Feel free to comment. Any suggestions will be appreciated.
