5

Scenario: A bunch of records (10k, maybe more) of small size (about 50 bytes each on average) must be processed. The processing must be done in parallel, or in any other way that improves performance (remember, there are a lot of records to go through). Also, the processing itself is a very simple task (that's one of the reasons for using AWS Lambda). Despite its simplicity, some records may finish processing before or after others, which is fine: the records are independent of each other and the order of processing does not matter.

So far, Step Functions looks like the way to go.

With Step Functions, we can have the following graph:

[State machine diagram: a RecordsRetrieval task followed by three parallel tasks, ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3]

I can define RecordsRetrieval as one task. After that, the records will be processed in parallel by the tasks ProcessRecords-Task-1, ProcessRecords-Task-2 and ProcessRecords-Task-3. By the looks of it, all fine and dandy, right? Wrong!

First Problem: Dynamic Scaling

If I want those tasks to scale dynamically (to 10, 100, 5k or 10k of them, depending on the number of records to be processed), I would have to build the state machine JSON dynamically (not a very elegant solution, but it might work). I am fairly confident that the number of parallel tasks has a limit, so I cannot rely on that. It would be far better if the scaling heavy lifting were handled by the infrastructure and not by me.
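To make this concrete, here is roughly what such a Parallel state looks like in Amazon States Language (a sketch only; the function ARNs are placeholders). The Branches list is static, so adding a fourth worker means regenerating the whole document, which is exactly the scaling problem described above:

{
  "ProcessRecords": {
    "Type": "Parallel",
    "End": true,
    "Branches": [
      { "StartAt": "ProcessRecords-Task-1",
        "States": { "ProcessRecords-Task-1": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords", "End": true } } },
      { "StartAt": "ProcessRecords-Task-2",
        "States": { "ProcessRecords-Task-2": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords", "End": true } } },
      { "StartAt": "ProcessRecords-Task-3",
        "States": { "ProcessRecords-Task-3": { "Type": "Task", "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecords", "End": true } } }
    ]
  }
}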

Either way, for a well-defined set of parallel tasks like GetAddress, GetPhoneNumber, GetWhatever... it is great! Works like a charm!

Second Problem: Payload Dispatch

After the RecordsRetrieval task, I need each one of those records to be processed individually, and with Step Functions I did not see any way of accomplishing that. Once the RecordsRetrieval task passes along its payload (in this case, those records), all the parallel tasks receive the same payload.

Again, just like I said for the first problem, for a well-defined set of parallel tasks it is a perfect fit.

Conclusion

I think that AWS Step Functions is probably not the solution for my scenario. This is a summary of my understanding of it, so feel free to comment if I missed something.

I favor the microservice approach for many reasons (scalability, serverless, simplicity and so forth).

I know that it is possible to retrieve those records and send them one by one to another Lambda, but again, that is not a very elegant solution.

I also know that this is a batch job and that AWS has the Batch service. What I am trying to do is keep the microservice approach without depending on AWS Batch/EC2.

What are your thoughts about it? Feel free to comment. Any suggestions will be appreciated.

4 Answers

3

Given your inputs, I think the following solution can work within your criteria. You can use either AWS Lambda or AWS Batch for it.

// Number of records handled per Lambda invocation / batch job
var BATCH_RECORD_SIZE = 100;
var totalRecords = getTotalCountOfRecords();
// Round up so a final partial batch is still processed
var noOfBatchInvocations = Math.ceil(totalRecords / BATCH_RECORD_SIZE);
var start = 0;
for (var i = 0; i < noOfBatchInvocations; i++) {
    // invoke lambda / submit job for records [start, start + BATCH_RECORD_SIZE)
    invokeLambda(start, BATCH_RECORD_SIZE);
    // OR
    submitJobWith(start, BATCH_RECORD_SIZE);
    // move on to the next slice of records
    start += BATCH_RECORD_SIZE;
}
  • Define a driver Lambda whose only task is to get the number of records, as above. This Lambda can be triggered by an S3 event, a scheduled event, or whatever suits you. This is also where you define the number of records processed per Lambda invocation / batch job. The driver then invokes the worker Lambda / submits the batch job (total records) / (records per job or invocation) times.
  • If you prefer Lambda, define the worker Lambda so that it takes two parameters, start and limit, as input. These parameters decide where to start reading the records to be processed and where to stop. The worker Lambda also knows where to read the records from (see the sketch after this list).
  • If you prefer Batch, define the job definition with the same logic as above.
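If you go the Lambda route, the worker could look roughly like this (a sketch only; readRecords and processRecord are hypothetical placeholders for however the records are actually stored and handled):

// Hypothetical worker Lambda: processes one slice [start, start + limit) of the record set.
// readRecords() and processRecord() stand in for the real storage (S3 object, DynamoDB, ...)
// and the real per-record processing.
exports.handler = async (event) => {
    const { start, limit } = event;

    // Read only this invocation's slice of records from the source
    const records = await readRecords(start, limit);

    // Records are independent, so order does not matter
    for (const record of records) {
        await processRecord(record);
    }

    return { start, processed: records.length };
};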

You can use AWS Lambda since your record processing is not compute/memory intensive. If it were, I would suggest using AWS Batch for the processing instead.


1 Comment

But what if some of these records fail and I need to reprocess them? Then it becomes complex to manage the responses from the Lambdas in order to build retry logic. Invoking the Lambda alone is no assurance that all records were processed.

2

AWS Step Functions now supports spawning dynamic parallel tasks via the Map state: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-map-state.html.

The input is provided as an array and the state outputs an array once completed. You need to define ItemsPath (the location of the array within the input). See ItemsPath: https://docs.aws.amazon.com/step-functions/latest/dg/input-output-itemspath.html.
This solves both of your problems.

First Problem: Define your ProcessRecords task as a Map state. There is still the question of the maximum number of concurrent Lambda invocations; if that becomes a constraint, the Lambda task can be replaced by an ECS container with defined max resources doing the job for you. See: https://docs.aws.amazon.com/step-functions/latest/dg/connect-ecs.html.

Second Problem: ItemsPath lets you pass the array, and the Map state hands each element to its own iteration. See ItemsPath: https://docs.aws.amazon.com/step-functions/latest/dg/input-output-itemspath.html
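As a minimal sketch (assuming RecordsRetrieval returns its records under $.records; the ARNs and state names are placeholders), the state machine could look like this:

{
  "StartAt": "RecordsRetrieval",
  "States": {
    "RecordsRetrieval": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:RecordsRetrieval",
      "Next": "ProcessRecords"
    },
    "ProcessRecords": {
      "Type": "Map",
      "ItemsPath": "$.records",
      "MaxConcurrency": 100,
      "Iterator": {
        "StartAt": "ProcessRecord",
        "States": {
          "ProcessRecord": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessRecord",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}

Each element of $.records becomes the input of one iteration, and MaxConcurrency caps how many iterations run at once (0 means no limit), so neither the number of branches nor the payload fan-out has to be hard-coded.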

Edit: Example from AWS documentation using Map with Lambdas https://docs.aws.amazon.com/step-functions/latest/dg/tutorial-creating-map-state-machine.html

1 Comment

The Map state has limitations for large payloads.

0

First Problem: You're basically right. What else you can do is ask AWS support to increase the concurrent Lambda execution limit for certain functions. See "request a limit increase": https://docs.aws.amazon.com/lambda/latest/dg/limits.html. Anyway, make sure that each function is executed in parallel (i.e. insert a loop over the payload items, so each function gets executed more than once).

Second Problem: If you don't want to hand the whole payload over to each function, you can filter it for specific functions: https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-input-output-processing.html. So you can pass only the addresses, etc., to the functions that need them.
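For example (the field and function names here are just placeholders), a task inside one of the parallel branches can declare its own InputPath so it receives only the part of the payload it needs:

{
  "GetAddress": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:GetAddress",
    "InputPath": "$.address",
    "End": true
  }
}

A GetPhoneNumber branch could use "InputPath": "$.phone" in the same way.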


0

The bad news is that dumb parallelization in AWS Step Functions remains an open question; see: https://forums.aws.amazon.com/thread.jspa?threadID=244196&start=0&tstart=0

The good news is that in Nov 2017, AWS introduced support for Array Jobs in AWS Batch; see: https://aws.amazon.com/about-aws/whats-new/2017/11/aws-batch-adds-support-for-large-scale-job-submissions/. Array Jobs allow for dumb parallelization of ProcessRecord-Task-?, essentially what @Rishikesh Darandale did with the for loop and submitJobWith(start, BATCH_RECORD_SIZE).
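As a rough sketch (assuming an existing job queue and job definition; the names here are placeholders), submitting one array job with the AWS SDK for JavaScript could look like this, with each child job reading the AWS_BATCH_JOB_ARRAY_INDEX environment variable to work out which slice of records to process:

// Rough sketch: one array job fans out into many child jobs.
// 'process-records-queue' and 'process-records-def' are placeholder names.
const AWS = require('aws-sdk');
const batch = new AWS.Batch();

const BATCH_RECORD_SIZE = 100;
const totalRecords = 10000; // e.g. the result of getTotalCountOfRecords()

batch.submitJob({
    jobName: 'process-records',
    jobQueue: 'process-records-queue',
    jobDefinition: 'process-records-def',
    // One child job per slice of BATCH_RECORD_SIZE records
    arrayProperties: { size: Math.ceil(totalRecords / BATCH_RECORD_SIZE) },
    containerOverrides: {
        environment: [{ name: 'BATCH_RECORD_SIZE', value: String(BATCH_RECORD_SIZE) }]
    }
}, (err, data) => {
    if (err) console.error(err);
    // Each child job computes its own offset as
    // AWS_BATCH_JOB_ARRAY_INDEX * BATCH_RECORD_SIZE
    else console.log('Submitted array job', data.jobId);
});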

