feat(control-plane): add support for handling multiple events in a single invocation #4603

iainlane · 2025-05-29T12:13:55Z

⚠️ This hasn't been run in reality yet, but will be soon.⚠️

Currently we restrict the scale-up Lambda to only handle a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs that happen each time, and they can end up taking long enough that we can't process job queued events faster than they arrive.

In our environment we are also using a pool, and typically we have responded to the alerts generated by this (SQS queue length growing) by expanding the size of the pool. This helps because we will more frequently find that we don't need to scale up, which allows the lambdas to exit a bit earlier, so we can get through the queue faster. But it makes the environment much less responsive to changes in usage patterns.

At its core, this Lambda's task is to construct an EC2 CreateFleet call to create instances, after working out how many are needed. This is a job that can be batched. We can take any number of events, calculate the diff between our current state and the number of jobs we have, capping at the maximum, and then issue a single call.

The thing to be careful about is how to handle partial failures, if EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to ReportBatchItemFailures. In this mode, we return a list of failed messages from our handler and those are retried. We can make use of this to give back as many events as we failed to process.

Now we're potentially processing multiple events in a single Lambda, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation which is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation.

We also remove the same batch_size = 1 constraint from the job-retry Lambda. This Lambda is used to retry events that previously failed. However, instead of reporting failures to be retried, here we maintain the pre-existing fault-tolerant behaviour where errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events.

Tests are added for all of this.

Test in a private repo (sorry) look good. This was running ephemeral runners with no pool, SSM high throughput enabled, the job queued check _dis_abled, batch size of 200, wait time of 10 seconds. The workflow runs are each a matrix with 250 jobs.

modules/runners/scale-up.tf

iainlane · 2025-06-13T18:52:05Z

lambdas/libs/aws-powertools-util/src/logger/index.ts

    'aws-request-id': context.awsRequestId,
    'function-name': context.functionName,
    module: module,
  });

  // Add the context to all child loggers
  childLoggers.forEach((childLogger) => {
-    childLogger.addPersistentLogAttributes({


I was getting warnings about mixing this with persistentKeys. The function says

/** * @deprecated This method is deprecated and will be removed in the future major versions, please use {@link appendPersistentKeys() `appendPersistentKeys()`} instead. */ addPersistentLogAttributes(attributes: LogKeys): void;

so I've switched it!

Copilot

Pull Request Overview

This PR adds support for processing multiple events in a single Lambda invocation, improving efficiency in busy environments and reducing redundant API client instantiation. Key changes include:

Introducing new Terraform variables for configuring Lambda event source mapping batch size and batching window.
Updating the scale-up and job-retry Lambdas to handle batch events and partial failures.
Enhancing logging, documentation, and tests to support the batched processing feature.

Reviewed Changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
modules/runners/variables.tf	Adds new variables for Lambda event source mapping batch size and maximum batching window.
modules/runners/scale-up.tf	Updates event source mapping configuration to use the new batch parameters and response types.
modules/runners/job-retry/*	Removes batch_size constraint and configures new batch variables for job retry.
modules/multi-runner/*	Propagates new Lambda batch configuration to multi-runner modules.
lambdas/libs/aws-powertools-util/src/logger/index.ts	Adjusts logging methods from addPersistentLogAttributes to appendPersistentKeys.
lambdas/functions/control-plane/src/scale-runners/scale-up.ts	Refactors the scale-up handler to process an array of events rather than a single event.
lambdas/functions/control-plane/src/lambda.ts	Updates the Lambda handler to process batched SQS messages and return batch item failures.
Tests and README updates	Updates tests and documentation to reflect the batch processing changes.

Comments suppressed due to low confidence (1)

lambdas/functions/control-plane/src/scale-runners/scale-up.ts:342

[nitpick] The variable name 'scaleUp' is used to count the number of events triggering a runner creation, which can be ambiguous. Consider renaming it to 'requestedRunnerCount' or a similar descriptive name to clarify its purpose.

let scaleUp = 0;

…ngle invocation Currently we restrict the `scale-up` Lambda to only handle a single event at a time. In very busy environments this can prove to be a bottleneck: there are calls to GitHub and AWS APIs that happen each time, and they can end up taking long enough that we can't process job queued events faster than they arrive. In our environment we are also using a pool, and typically we have responded to the alerts generated by this (SQS queue length growing) by expanding the size of the pool. This helps because we will more frequently find that we don't need to scale up, which allows the lambdas to exit a bit earlier, so we can get through the queue faster. But it makes the environment much less responsive to changes in usage patterns. At its core, this Lambda's task is to construct an EC2 `CreateFleet` call to create instances, after working out how many are needed. This is a job that can be batched. We can take any number of events, calculate the diff between our current state and the number of jobs we have, capping at the maximum, and then issue a single call. The thing to be careful about is how to handle partial failures, if EC2 creates some of the instances we wanted but not all of them. Lambda has a configurable function response type which can be set to `ReportBatchItemFailures`. In this mode, we return a list of failed messages from our handler and those are retried. We can make use of this to give back as many events as we failed to process. Now we're potentially processing multiple events in a single Lambda, one thing we should optimise for is not recreating GitHub API clients. We need one client for the app itself, which we use to find out installation IDs, and then one client for each installation which is relevant to the batch of events we are processing. This is done by creating a new client the first time we see an event for a given installation. We also remove the same `batch_size = 1` constraint from the `job-retry` Lambda and make it configurable instead, using AWS's default of 10 for SQS if not configured. This Lambda is used to retry events that previously failed. However, instead of reporting failures to be retried, here we maintain the pre-existing fault-tolerant behaviour where errors are logged but explicitly do not cause message retries, avoiding infinite loops from persistent GitHub API issues or malformed events. Tests are added for all of this.

npalm · 2025-06-22T11:31:15Z

lambdas/functions/control-plane/src/lambda.ts

-    } else {
-      logger.warn(`Ignoring error: ${e}`);
-      return Promise.resolve();
+      throw e;


Not started a full review yet. But this part is a bit tricky.

In th current logic the runner.ts is making a difference in type type of errors. The one that can be recovered by a retry (scale errors) and the rest. It also assumes the creation of a single runner. Which means in case of an error that can auto recover the message goes imply back to the queue. Example is limit (max runners), but also not spot instance available.

With the changes introduce here in case of an error the whole batch message goes back to the queue (error in runner.ts). And a retry fro all messages are initated.

To address this the runner.ts needs to be made smarter and be able to return the the number of instances failed to created and retry could make sense. That number (or messages ids) can be added to the ScaleError object.

Keeping the excpetion logic in this way also will caus probelms when the createRunenr only create a few runner since the max is exceeded. If this combines with a ScaleError. The runners not created bec ause a max will not be reprorted back. Due to the Exception.

iainlane force-pushed the iainlane/many-events branch 3 times, most recently from a7720aa to 9056deb Compare May 29, 2025 12:45

iainlane commented May 29, 2025

View reviewed changes

modules/runners/scale-up.tf Show resolved Hide resolved

npalm self-requested a review May 30, 2025 07:48

iainlane force-pushed the iainlane/many-events branch 8 times, most recently from d7d1b7c to 54e4a8a Compare June 12, 2025 14:55

iainlane force-pushed the iainlane/many-events branch from 54e4a8a to 79eb6a5 Compare June 13, 2025 18:50

iainlane commented Jun 13, 2025

View reviewed changes

iainlane marked this pull request as ready for review June 13, 2025 18:58

iainlane requested review from a team as code owners June 13, 2025 18:58

npalm requested a review from Copilot June 13, 2025 21:40

Copilot AI reviewed Jun 13, 2025

View reviewed changes

iainlane force-pushed the iainlane/many-events branch from 69ff9b8 to 9443df8 Compare June 16, 2025 10:17

npalm reviewed Jun 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(control-plane): add support for handling multiple events in a single invocation #4603

feat(control-plane): add support for handling multiple events in a single invocation #4603

Uh oh!

iainlane commented May 29, 2025 •

edited

Loading

Uh oh!

iainlane Jun 13, 2025

Copilot AI left a comment

npalm Jun 22, 2025

feat(control-plane): add support for handling multiple events in a single invocation #4603

Are you sure you want to change the base?

feat(control-plane): add support for handling multiple events in a single invocation #4603

Uh oh!

Conversation

iainlane commented May 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

iainlane Jun 13, 2025

Choose a reason for hiding this comment

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

npalm Jun 22, 2025

Choose a reason for hiding this comment

iainlane commented May 29, 2025 •

edited

Loading