Intro
As software developers, we work with many types of files, often in formats like JSON, XML, or CSV. Data engineers, on the other hand, reach for more specialized formats such as Parquet. Beyond text files, we also process images: resizing them, adjusting colors, or changing their shape.
However, there is one type of medium that seems to be somewhat overlooked, despite surrounding us everywhere. After all, who doesn't enjoy listening to music to relax or using it as background sound while working?
But how do we work with audio files? How can we process them in the AWS cloud? What aspects should we consider to ensure our architecture is both scalable and cost-effective? I'll answer these and many other questions in this article.
Full code can be found on my GitHub. I used Terraform to describe all AWS resources, so you can set up this project without configuring it manually. I deliberately avoided elaborate architectures and design patterns because I wanted to keep things as simple as possible. Feel free to extend the code with layers, interfaces, hexagonal architecture, etc.
Technologies used
In the world of modern cloud solutions, we are constantly looking for ways to increase efficiency, reduce costs and eliminate unnecessary infrastructure. That is why choosing serverless technology seemed like a natural step. Thanks to the model in which I do not have to worry about managing servers, I can focus on what is most important - processing audio in a fast, scalable and cost-optimized way. The following services will help me with this:
- Amazon S3 - The storage where we will save our files. It scales very well and offers very high durability (99.999999999%, the famous "eleven nines"). Mechanisms such as Presigned URLs and S3 Events will be an important part of our architecture.
- AWS Lambda - This will be our main working tool, running our Node.js code. The main advantage is that we only pay for execution time, making it ideal for reacting to events. The cons? A maximum of 15 minutes of runtime and 10 GB of RAM.
- Amazon API Gateway - This service allows us to expose our Lambdas as REST API endpoints.
- DynamoDB - A NoSQL database that provides very good performance and scalability. Covering its capabilities would require a separate (and not so short) article. TL;DR: DDB is for you if you need an efficient, scalable database and you know exactly what your query patterns will be.
- Amazon SQS - A simple AWS queue that delivers events in two modes:
- Standard - virtually unlimited scaling, but duplicate messages are possible.
- FIFO - duplicates are automatically removed, but throughput is limited to 300 operations per second (without batching).
- FFmpeg - This tool allows us to customize and modify audio files to suit our needs. It is a CLI tool designed for multimedia processing and ships with two programs we will use:
- FFmpeg - enables conversion between formats, trimming and merging files, changing the sampling rate... and that's just the tip of the iceberg.
- FFprobe - allows us to analyze a file, including checking its size, format, and other attributes.
Wait, a CLI tool. Won't that be a problem in the case of Lambda? After all, it is a serverless, very high-level environment. Fortunately, there is a way to solve this problem, but more on that later.
Architecture
When developing our solution, we need to cover three fundamental aspects:
- Upload – How can we efficiently deliver new files?
- Audio Processing – Similar to text files, audio files come in a wide range of formats. Additionally, each file can have different sampling rates and channel configurations. Standardizing them will simplify further processing.
- Metadata – It’s essential to ensure that files can be easily searched and sorted later.
Here's what the final process looks like.
Upload
Uploading a file seems like the least of our problems. After all, it's just sending a file to our backend and dropping it into an S3 bucket, right?
Well, not really. Of course, it can be done this way, but it will be inefficient. In such an architecture, our Lambda acts a bit like a shovel whose only task is to move a large amount of data. What if our client could upload the file directly to the bucket?
Fortunately, AWS provides us with a Presigned URL mechanism that allows for direct upload to the S3 Bucket. It's very simple:
- In the first step, we ask S3 to generate a URL that allows us to upload the file directly.
- After receiving the URL in the response, we redirect to it, placing our file in the body of the new request.
Another factor to consider is file size: larger files take longer to upload and are more susceptible to network errors. The solution is Multipart Upload, which splits the file into parts, allows requests to run in parallel, and increases resilience, for example by enabling the re-sending of individual parts. For which files should it be used?
- >100 MB - you should consider using this mechanism.
- >5 GB - AWS requires Multipart Upload for objects larger than 5 GB.
In this example I will stick to PutObject for simplicity.
Audio Processing
With the raw audio file saved, the next step is to pass it to Lambda to initiate processing. The easiest way is the S3 Events mechanism, which allows us to listen for changes to objects in a bucket. We specify the event type, key prefix and suffix, and the destination service for the notification. From then on, each time a file matching those rules is added, an event is triggered that starts our Lambda. Of course, the event doesn't contain the file itself - only the metadata needed to download it from our S3 bucket.
Alternatively, you can use EventBridge, which is a more general solution that supports a much wider range of services and events.
As you can see, this is a very simple architecture, but unfortunately not a fully correct one. There is one detail that may be problematic: the at-least-once delivery of S3 events. This means that duplicates may occur, causing us to process the same file twice. The simplest fix is to add an SQS FIFO queue in between, which automatically rejects duplicates and saves us some computing resources. Note that this only works if we fit within the SQS FIFO limit (300 messages per second). To achieve higher throughput, we can use a solution like DynamoDB to track whether a given event has already been processed.
Okay, but what should our actual audio processing look like? How do we use FFmpeg in Node.js? There are two ways:
- You can call the CLI commands directly using Node's child_process:
import { exec } from 'child_process';
import { promisify } from 'util';
...
const execAsync = promisify(exec);

try {
  // Note: FFmpeg writes its progress output to stderr even on success
  const { stderr } = await execAsync(
    `ffmpeg -i ${audioFilePath} -b:a ${bitrate} ${transformedAudioPath}.${format}`
  );
  if (stderr) {
    console.warn('stderr:', stderr);
  }
} catch (err) {
  console.error('Error:', err.message);
  if (err.stderr) {
    console.error('stderr:', err.stderr);
  }
}
- You can also use fluent-ffmpeg. It's an NPM package that wraps the ugly CLI commands in a pleasant chain of functions. I know it's deprecated, but it can still be useful for most operations. Here's how we use it:
await new Promise((resolve, reject) => {
  ffmpeg(audioFilePath)
    .toFormat(format) // Change format
    .audioBitrate(bitrate) // Change bitrate
    .save(transformedAudioPath)
    .on('end', () => {
      console.log('File has been transformed successfully');
      return resolve(transformedAudioPath);
    })
    .on('error', (error: Error) => {
      console.log('Failed to transform audio file: ', error.message);
      return reject(error);
    });
});
Simple and easy to use. No matter what the input parameters are, we get a unified, predictable output, which helps with further processing. For example, we no longer need to worry about whether a given format is supported by the browser. We can also degrade the audio quality to save some space on S3.
"But wait. You mentioned that FFmpeg is a CLI tool. Can we just install the library and expect it to work?" Well, unfortunately, it is not that easy. We still need to have FFmpeg installed on the system. But how can we do this? Do we need to put it in a ZIP with the Lambda code? This is where Lambda Layers come in handy. This mechanism allows us to pack our dependencies into archives, which can then be used by our functions - predefined dependencies as well as external tools. In our case, FFmpeg and FFprobe will be packaged this way. We only need to zip the binaries, create the new layers, and attach them to our Lambda function. We also need to remember to set the appropriate FFMPEG_PATH and FFPROBE_PATH values using Lambda environment variables. From then on, we can use the FFmpeg CLI commands.
Metadata
The last element is the metadata needed for later filtering and searching of audio files, e.g. for frontend purposes. Here I will use DynamoDB as the database, which provides very good scalability and on-demand pricing (you pay only for the resources you use). During the entire flow, we will update the current state of file processing, which looks as follows:
And this is how it looks in the code:
- Create a new metadata record:
export const createAudioMetadataRecord = async (audioMetadata: AudioMetadata): Promise<void> => {
  const documentClient = initDocumentClient();
  const putCommand = new PutCommand({
    TableName: process.env.AUDIO_TABLE_NAME,
    Item: audioMetadata,
  });
  await documentClient.send(putCommand);
};
- Update the record with a new status:
const updateCommand = new UpdateCommand({
  TableName: process.env.AUDio_TABLE_NAME,
  Key: {
    id: audioId,
  },
  // "status" is a DynamoDB reserved word, hence the attribute name alias
  UpdateExpression: 'SET #status = :status',
  ExpressionAttributeNames: {
    '#status': 'status',
  },
  ExpressionAttributeValues: {
    ':status': 'UPLOADED' satisfies FileStatus,
  },
});
Summary
The final architecture looks as follows:
Let's test it!
First, we need to generate a new Presigned URL. We get it in the response body of our POST /api/files endpoint.
Then we use it in a PUT request. We can also leverage an HTTP 307 (Temporary Redirect) response to redirect automatically while preserving the PUT method and body. This starts the whole processing flow. We can then fetch the updated metadata from the GET /api/files endpoint, which lists all the uploaded files.
As you can see, even such a simple task should be well planned, with attention to details such as scaling, duplicates, and the operating cost of our solution. Of course, this architecture is only a base for more complex business cases, so I encourage you to experiment.