Extracting text and data from documents at scale is a common requirement in modern applications, from invoice processing to contract analysis. Amazon Textract, combined with AWS Lambda, provides a serverless approach to building automated pipelines for text extraction.
In this blog, we’ll walk through both synchronous and asynchronous integration approaches between Lambda and Textract, including how to use SNS and SQS for async jobs. By the end, you’ll have a clear roadmap to build a robust, serverless text-extraction pipeline.
Prerequisites
Before diving into the integration, make sure you have the following in place:
- AWS Account – with access to create and configure S3, Lambda, IAM, Textract, SNS, and SQS.
Architecture
Synchronous Architecture (S3 → Lambda → Textract → S3)
Sequence Diagram
Asynchronous Architecture (S3 → Lambda → Textract → SNS → SQS → Lambda → S3)
Sequence Diagram
Synchronous
Step 1: Create an S3 bucket
When setting up Amazon S3 for this workflow, we have two main options:
Two Buckets (Recommended for Separation of Concerns)
Bucket 1: Used for uploading the input documents that need to be processed.
Bucket 2: Used for storing the processed output documents.
✅ Advantage: Provides a clean separation between raw and processed data, making it easier to manage permissions, lifecycle policies, and logging.
Single Bucket with Prefixes
Use one bucket, but organize files with two prefixes (folders):
- `input/` for documents to be processed
- `output/` for processed results

✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
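If you go with the single-bucket layout, the input-to-output key mapping used throughout this post can be sketched in a few lines of Python (the prefix names follow the convention above; adjust them if yours differ):

```python
import os

def output_key_for(input_key):
    # Map an input document key to its processed-output key, mirroring the
    # input/ and output/ prefix convention, e.g.
    # 'input/invoice.pdf' -> 'output/invoice.txt'
    base = os.path.splitext(os.path.basename(input_key))[0]
    return f"output/{base}.txt"
```

For example, `output_key_for("input/invoice.pdf")` returns `"output/invoice.txt"`, so prefix-scoped IAM policies (like the `output/*` resource used later in this post) line up cleanly with where results are written.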
Step 2: Create an IAM customer-managed policy
Go to the IAM console and create a policy. Choose the JSON tab and paste the following JSON:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TextractPermissions",
            "Effect": "Allow",
            "Action": [
                "textract:DetectDocumentText"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3Permissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/*"
        },
        {
            "Sid": "CloudWatchLogGroupAccess",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Sid": "CloudWatchLogStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/YOUR_LAMBDA_FUNCTION_NAME:*"
            ]
        }
    ]
}
```

Note that the log-group resource must match your Lambda function's name (`/aws/lambda/<function-name>`), which you will pick in Step 4.
You can name the policy as you prefer. For this example, it will be named SyncLambdaTextractPolicy.
Step 3: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the SyncLambdaTextractPolicy to the permissions policies.
You can name the role as you prefer. For this example, it will be named SyncLambdaTextractRole.
Step 4: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Runtime: Python
- Execution Role: SyncLambdaTextractRole (or the role name that you created in Step 3)

Paste the following Python code in the Code tab:
```python
import boto3

def lambda_handler(event, context):
    textract = boto3.client('textract')

    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_key = event['Records'][0]['s3']['object']['key']

    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': s3_bucket,
                'Name': s3_key
            }
        }
    )

    text_lines = []
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            text_lines.append(item['Text'])

    print("Extracted Text:", "\n".join(text_lines))
    return {"status": "success", "lines": text_lines}
```
Use the following test event to verify the function:
```json
{
    "Records": [
        {
            "s3": {
                "bucket": { "name": "your-bucket-name" },
                "object": { "key": "your-file.pdf" }
            }
        }
    ]
}
```
Asynchronous
Step 1: Create an S3 bucket
Use one bucket, but organize files with two prefixes (folders):
- `input/` for documents to be processed
- `output/` for processed results

✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
Step 2: Lambda Start Textract Job
Step 2.1: Create an IAM customer-managed policy
Go to the IAM console and create a policy, then paste the following JSON. You can name the policy as you prefer. For this example, it will be named AsyncStartTextractJobPolicy.
Note: You can skip this step and instead create an inline policy directly on the role.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TextractPermissions",
            "Effect": "Allow",
            "Action": [
                "textract:StartDocumentTextDetection"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3ReadAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR_S3_BUCKET",
                "arn:aws:s3:::YOUR_S3_BUCKET/*"
            ]
        },
        {
            "Sid": "CloudWatchLogGroupAccess",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Sid": "CloudWatchLogStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncStartTextractJob:*"
            ]
        }
    ]
}
```
Step 2.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncStartTextractJobPolicy to the permissions policies.
You can name the role as you prefer. For this example, it will be named AsyncStartTextractJobRole.
Step 2.3: Create an AWS SNS Topic
To handle notifications from AWS Textract, you need to create an SNS topic. You have two main options:
Option 1: Create the Topic via the AWS Console
Navigate to the SNS Console. Click Create topic, give it a name (e.g., `textract-job-completion`), and configure any additional settings such as access policies or delivery protocols.
✅ Advantage: Intuitive interface with easy configuration and management of subscriptions.
In the information section of the topic, you will be able to see the `TopicArn`.
Option 2: Create the Topic via AWS CLI
Use AWS CloudShell or your local terminal to execute the following command:

```shell
aws sns create-topic --name textract-job-completion
```

This returns an output with the `TopicArn`:

```json
{
    "TopicArn": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
}
```
✅ Advantage: Quick and scriptable, ideal for automation or Infrastructure as Code.
Step 2.4: Create AWS Textract Assume Role
Create an IAM Role for Textract
Go to the IAM Console → Roles → Create role.
Under Trusted entity type, select AWS service.
Choose Textract as the service that will use this role.
Attach Permissions
Add a policy that allows `sns:Publish` access to your specific SNS topic. Example policy (replace the ARN with your own SNS topic ARN):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "SNS:Publish",
            "Resource": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
        }
    ]
}
```

You can name the role as you prefer. For this example, it will be named TextractSNSPublish.
We will need the ARN of this role. Once created, it will look something like `arn:aws:iam::************:role/TextractSNSPublish`.
Step 2.5: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncStartTextractJob (this can be renamed, but remember to update the log-group permissions)
- Runtime: Python
- Execution Role: AsyncStartTextractJobRole (or the role name that you created in Step 2.2)
- Environment Variables:
  - SNS_TOPIC_ARN: `arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion` (from Step 2.3)
  - TEXTRACT_ROLE_ARN: `arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/TextractSNSPublish` (from Step 2.4)

Paste the following Python code in the Code tab:
```python
import os
import urllib.parse  # for decoding S3 keys

import boto3

textract = boto3.client("textract")

SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
TEXTRACT_ROLE_ARN = os.environ['TEXTRACT_ROLE_ARN']

def start_textract_job(bucket, key, sns_topic_arn, textract_role_arn):
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },
        NotificationChannel={
            'SNSTopicArn': sns_topic_arn,
            'RoleArn': textract_role_arn
        }
    )
    return response["JobId"]

def lambda_handler(event, context):
    # S3 event records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Decode URL-encoded key
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        job_id = start_textract_job(bucket, key, SNS_TOPIC_ARN, TEXTRACT_ROLE_ARN)
        print(f"Started Textract job {job_id} for file {key} in bucket {bucket}")

    return {
        "statusCode": 200,
        "body": "Textract job(s) started successfully."
    }
```
Step 2.6 Create an AWS SQS & Subscribe the SNS to the Queue
When Amazon Textract completes a job, it sends a notification to the SNS topic you configured. To process these notifications reliably, you can subscribe an SQS queue to the SNS topic. This ensures that your application can consume the messages asynchronously and at its own pace.
You have two main options for creating the SQS queue:
Option 1: Create the Queue & subscribe via the AWS Console
1. Navigate to the Amazon SQS Console.
2. Choose Create queue and provide a name (e.g., `textract-results-queue`).
3. After creation, go to your SNS topic in the AWS Console and add a subscription to this SQS queue. Copy the Topic ARN from the SNS topic information section when prompted.
✅ Advantage: Simple, user-friendly interface — ideal for quick setup and testing.
Option 2: Create the Queue & subscribe via AWS CLI
You can also create the queue programmatically:
```shell
aws sqs create-queue --queue-name textract-results-queue
```

Output:

```json
{
    "QueueUrl": "https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue"
}
```

To get the queue ARN, run the following command:

```shell
aws sqs get-queue-attributes \
    --queue-url https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue \
    --attribute-names QueueArn
```

Output:

```json
{
    "Attributes": {
        "QueueArn": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
    }
}
```

After creating the queue, subscribe it to your SNS topic using the following command (replace with your Topic ARN and Queue ARN):

```shell
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion \
    --protocol sqs \
    --notification-endpoint arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
```
✅ Advantage: Easy to script, automate, and reproduce across environments.
Note: Once the SQS queue is subscribed to the SNS topic, every Textract job-completion notification will be delivered to the queue, where your application can consume and process it.
Step 2.7 Add an AWS Lambda Function Trigger event when S3 Upload
Option 1: Create the Trigger Event via AWS CLI
Use the AWS CloudShell or your local terminal to execute the following command:
```shell
aws s3api put-bucket-notification-configuration \
    --bucket YOUR_S3_BUCKET_NAME \
    --notification-configuration '{
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-2:YOUR_AWS_ACCOUNT_ID:function:AsyncStartTextractJob",
                "Events": ["s3:ObjectCreated:Put"]
            }
        ]
    }'
```
Option 2: Using the AWS Console (UI)
1. Open the S3 Console → https://s3.console.aws.amazon.com/s3
2. Click your bucket name (`your-bucket-name`).
3. Go to the Properties tab.
4. Scroll down to Event notifications → click Create event notification.
5. Fill in the details:
   - Name: `PutTriggerLambda`
   - Event types: select PUT (under ObjectCreated).
   - Destination: choose Lambda function → select `AsyncStartTextractJob`.
6. Click Save changes.
Step 3: Lambda Process Textract Job Results
Step 3.1: Create an IAM customer-managed policy
Go to the IAM console and create a policy, then paste the following JSON. You can name the policy as you prefer. For this example, it will be named AsyncProcessTextractJobResultsPolicy.
Note: You can skip this step and instead create an inline policy directly on the role.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncProcessTextractJobResults:*"
            ]
        },
        {
            "Sid": "AllowTextractRead",
            "Effect": "Allow",
            "Action": [
                "textract:GetDocumentTextDetection"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowS3Write",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/output/*"
        },
        {
            "Sid": "AllowSQSTrigger",
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
        }
    ]
}
```
This IAM policy gives the Lambda function the ability to:
- Write logs to CloudWatch (for monitoring).
- Read the Textract job results.
- Save processed data into S3.
- Consume and delete messages from the SQS queue that delivers Textract job notifications.
Step 3.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncProcessTextractJobResultsPolicy to the permissions policies and save. You can name the role as you prefer. For this example, it will be named AsyncProcessTextractJobResultsRole.
Step 3.3: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncProcessTextractJobResults (this can be renamed, but remember to update the log-group permissions)
- Runtime: Python
- Execution Role: AsyncProcessTextractJobResultsRole (or the role name that you created in Step 3.2)
- Environment Variables:
  - SQS_QUEUE_URL: `https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue` (see Step 2.6)
- Timeout: 30s (this can be increased, but make sure the SQS queue's visibility timeout is at least as long as the function timeout)

Paste the following Python code in the Code tab:
```python
import json
import os

import boto3

s3_client = boto3.client("s3")
textract_client = boto3.client("textract")
sqs_client = boto3.client("sqs")

SQS_QUEUE_URL = os.environ['SQS_QUEUE_URL']

def get_textract_job_info(record):
    # Extract the Textract job info from the SNS message wrapped in the SQS event
    message = json.loads(record["body"])
    textract_message = json.loads(message["Message"])
    return {
        "job_id": textract_message["JobId"],
        "status": textract_message["Status"],
        "bucket_name": textract_message["DocumentLocation"]["S3Bucket"],
        "object_name": textract_message["DocumentLocation"]["S3ObjectName"]
    }

def get_all_textract_blocks(job_id):
    # Page through the Textract output and collect all LINE blocks
    response = textract_client.get_document_text_detection(JobId=job_id)
    raw_text = []
    while True:
        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                raw_text.append(block["Text"])
        if "NextToken" in response:
            response = textract_client.get_document_text_detection(
                JobId=job_id, NextToken=response["NextToken"]
            )
        else:
            break
    return "\n".join(raw_text)

def get_base_filename(object_name):
    return os.path.splitext(os.path.basename(object_name))[0]

def delete_sqs_message(receipt_handle):
    # Delete the message from SQS once done
    sqs_client.delete_message(
        QueueUrl=SQS_QUEUE_URL,
        ReceiptHandle=receipt_handle
    )

def lambda_handler(event, context):
    print("Event:", json.dumps(event))
    for record in event["Records"]:
        job_info = get_textract_job_info(record)
        job_id = job_info["job_id"]
        status = job_info["status"]
        bucket_name = job_info["bucket_name"]
        object_name = job_info["object_name"]

        if status != "SUCCEEDED":
            print(f"Textract job {job_id} failed with status: {status}")
            # Delete the message anyway, so it doesn't retry forever
            delete_sqs_message(record["receiptHandle"])
            continue

        text = get_all_textract_blocks(job_id)

        # Save the result to S3 with the same filename but a .txt extension
        base_filename = get_base_filename(object_name)
        output_key = f"output/{base_filename}.txt"
        s3_client.put_object(
            Bucket=bucket_name,
            Key=output_key,
            Body=text.encode("utf-8"),
            ContentType="text/plain"
        )
        print(f"✅ Processed Textract result saved at s3://{bucket_name}/{output_key}")

        delete_sqs_message(record["receiptHandle"])
        print(f"🗑️ Deleted message from SQS: {record['messageId']}")

    return {"statusCode": 200, "body": "Processing complete"}
```
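To test this function from the console, you need an event shaped like what SQS delivers: the Textract notification is JSON-encoded inside an SNS envelope, which is itself JSON-encoded in the SQS record body. A small script like this (the JobId, bucket, and file names are placeholders) generates a matching test event:

```python
import json

# Inner Textract notification, in the shape the handler unwraps
textract_message = {
    "JobId": "your-textract-job-id",
    "Status": "SUCCEEDED",
    "DocumentLocation": {
        "S3ObjectName": "input/your-file.pdf",
        "S3Bucket": "your-bucket-name"
    }
}

# SNS envelope, JSON-encoded into the SQS record body
test_event = {
    "Records": [
        {
            "messageId": "11111111-2222-3333-4444-555555555555",
            "receiptHandle": "test-receipt-handle",
            "body": json.dumps({"Message": json.dumps(textract_message)})
        }
    ]
}

print(json.dumps(test_event, indent=2))
```

Keep in mind the handler will still call Textract and SQS with these values, so substitute a real JobId and expect the delete call to fail on the dummy receipt handle.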
Step 3.4 Add an AWS Lambda Function Trigger event for the SQS Queue
Option 1: Create the event source mapping (SQS → Lambda) via AWS CLI

```shell
aws lambda create-event-source-mapping \
    --function-name AsyncProcessTextractJobResults \
    --batch-size 10 \
    --event-source-arn arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
```

Explanation:
- `batch-size`: the maximum number of SQS messages passed per Lambda invocation (default 10, max 10,000).
- `event-source-arn`: the ARN of your SQS queue.
Option 2: Using the AWS Console (UI)
1. Open the Lambda Console → https://console.aws.amazon.com/lambda
2. Select your Lambda function.
3. Go to the Configuration tab.
4. Under Triggers → click Add trigger.
5. Choose SQS from the list.
6. Select your SQS queue.
7. (Optional) Set Batch size (default 10).
8. Click Add.
When to use Sync vs Async
Synchronous (Detect/Analyze — direct reply)
Best for: small, single‑page images (JPG/PNG/TIFF) or very small PDFs where the user is waiting for a response.
Pros: simple, fast to wire up, easy to expose via API Gateway.
Trade‑offs: request/response timeouts, Lambda runtime limits, not great for multi‑page or large files.
Asynchronous (Start/Get with SNS+SQS)
Best for: multi‑page PDFs/TIFFs, batches, or workloads where you don’t need an immediate response.
Pros: resilient, scalable, cost‑effective; decoupled with retries and DLQs.
Trade‑offs: more moving parts (SNS, SQS, extra Lambda), eventual consistency.
Best Practices
Use SQS between SNS and Lambda for reliability.
Break large documents into smaller batches when possible.
Encrypt sensitive documents in S3.
Monitor Textract usage with CloudWatch Metrics.
Use DLQs (Dead Letter Queues) for failed message processing.
Conclusion
Integrating AWS Lambda with Amazon Textract enables the creation of powerful document-processing pipelines with minimal infrastructure management. For small documents, the synchronous flow is simple and effective. For larger workloads, the asynchronous flow, combined with SNS and SQS, ensures reliability and scalability.
In this example, we saved the extracted text as a file in Amazon S3, but the same output could just as easily be stored in a database, or even used directly as input to an AWS Bedrock model for downstream tasks such as summarization, classification, or question answering.
By following the steps outlined in this guide, you now have the foundation to build automated document extraction systems tailored to your business needs.