Extracting text and data from documents at scale is a common requirement in modern applications, from invoice processing to contract analysis. Amazon Textract, combined with AWS Lambda, provides a serverless approach to building automated pipelines for text extraction.
In this blog, we’ll walk through both synchronous and asynchronous integration approaches between Lambda and Textract, including how to use SNS and SQS for async jobs. By the end, you’ll have a clear roadmap to build a robust, serverless text-extraction pipeline.
Prerequisites
Before diving into the integration, make sure you have the following in place:
- AWS Account – with access to create and configure S3, Lambda, IAM, Textract, SNS, and SQS.
Architecture
Synchronous Architecture (S3 → Lambda → Textract → S3)
Sequence Diagram
Asynchronous Architecture (S3 → Lambda → Textract → SNS → SQS → Lambda → S3)
Sequence Diagram
Synchronous
Step 1: Create an S3 bucket
When setting up Amazon S3 for this workflow, we have two main options:
Two Buckets (Recommended for Separation of Concerns)
Bucket 1: Used for uploading the input documents that need to be processed.
Bucket 2: Used for storing the processed output documents.
✅ Advantage: Provides a clean separation between raw and processed data, making it easier to manage permissions, lifecycle policies, and logging.
Single Bucket with Prefixes
Use one bucket, but organize files with two prefixes (folders):
- `input/` for documents to be processed
- `output/` for processed results

✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
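If you go with the single-bucket layout, the input-to-output key mapping used throughout this post can be sketched in a few lines of Python (the prefix names follow the convention above; adjust them if yours differ):

```python
import os

def output_key_for(input_key):
    # Map an input document key to its processed-output key, mirroring the
    # input/ and output/ prefix convention, e.g.
    # 'input/invoice.pdf' -> 'output/invoice.txt'
    base = os.path.splitext(os.path.basename(input_key))[0]
    return f"output/{base}.txt"
```

For example, `output_key_for("input/invoice.pdf")` returns `"output/invoice.txt"`, so prefix-scoped IAM policies (like the `output/*` resource used later in this post) line up cleanly with where results are written.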
Step 2: Create an IAM customer-managed policy
Go to the IAM console and create a policy. Choose the JSON tab and paste the following JSON:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TextractPermissions",
            "Effect": "Allow",
            "Action": [
                "textract:DetectDocumentText"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3Permissions",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/*"
        },
        {
            "Sid": "CloudWatchLogGroupAccess",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Sid": "CloudWatchLogStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/YOUR_LAMBDA_FUNCTION_NAME:*"
            ]
        }
    ]
}
```

Note that the log-group resource must match your Lambda function's name (`/aws/lambda/<function-name>`), which you will pick in Step 4.
You can name the policy as you prefer. For this example, it will be named SyncLambdaTextractPolicy.
Step 3: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the SyncLambdaTextractPolicy to the permissions policies.
You can name the role as you prefer. For this example, it will be named SyncLambdaTextractRole.
Step 4: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Runtime: Python
- Execution Role: SyncLambdaTextractRole (or the role name that you created in Step 3)

Paste the following Python code in the Code tab:
```python
import boto3

def lambda_handler(event, context):
    textract = boto3.client('textract')

    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_key = event['Records'][0]['s3']['object']['key']

    response = textract.detect_document_text(
        Document={
            'S3Object': {
                'Bucket': s3_bucket,
                'Name': s3_key
            }
        }
    )

    text_lines = []
    for item in response['Blocks']:
        if item['BlockType'] == 'LINE':
            text_lines.append(item['Text'])

    print("Extracted Text:", "\n".join(text_lines))
    return {"status": "success", "lines": text_lines}
```
Use the following test event to verify the function:
```json
{
    "Records": [
        {
            "s3": {
                "bucket": { "name": "your-bucket-name" },
                "object": { "key": "your-file.pdf" }
            }
        }
    ]
}
```
Asynchronous
Step 1: Create an S3 bucket
Use one bucket, but organize files with two prefixes (folders):
- `input/` for documents to be processed
- `output/` for processed results

✅ Advantage: Easier to set up and manage since you only maintain one bucket.
⚠️ Consideration: Requires stricter access control and prefix-based policies to prevent accidental overwrites or permission leaks.
Step 2: Lambda Start Textract Job
Step 2.1: Create an IAM customer-managed policy
Go to the IAM console and create a policy, then paste the following JSON. You can name the policy as you prefer. For this example, it will be named AsyncStartTextractJobPolicy.
Note: You can skip this step and instead create an inline policy directly on the role.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TextractPermissions",
            "Effect": "Allow",
            "Action": [
                "textract:StartDocumentTextDetection"
            ],
            "Resource": "*"
        },
        {
            "Sid": "S3ReadAccess",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::YOUR_S3_BUCKET",
                "arn:aws:s3:::YOUR_S3_BUCKET/*"
            ]
        },
        {
            "Sid": "CloudWatchLogGroupAccess",
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Sid": "CloudWatchLogStreamAccess",
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncStartTextractJob:*"
            ]
        }
    ]
}
```
Step 2.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncStartTextractJobPolicy to the permissions policies.
You can name the role as you prefer. For this example, it will be named AsyncStartTextractJobRole.
Step 2.3: Create an AWS SNS Topic
To handle notifications from AWS Textract, you need to create an SNS topic. You have two main options:
Option 1: Create the Topic via the AWS Console
Navigate to the SNS Console. Click Create topic, give it a name (e.g., `textract-job-completion`), and configure any additional settings such as access policies or delivery protocols.
✅ Advantage: Intuitive interface with easy configuration and management of subscriptions.
In the information section of the topic, you will be able to see the `TopicArn`.
Option 2: Create the Topic via AWS CLI
Use AWS CloudShell or your local terminal to execute the following command:

```shell
aws sns create-topic --name textract-job-completion
```

This returns an output with the `TopicArn`:

```json
{
    "TopicArn": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
}
```
✅ Advantage: Quick and scriptable, ideal for automation or Infrastructure as Code.
Step 2.4: Create AWS Textract Assume Role
Create an IAM Role for Textract
Go to the IAM Console → Roles → Create role.
Under Trusted entity type, select AWS service.
Choose Textract as the service that will use this role.
Attach Permissions
Add a policy that allows `sns:Publish` access to your specific SNS topic. Example policy (replace the ARN with your own SNS topic ARN):

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "SNS:Publish",
            "Resource": "arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion"
        }
    ]
}
```

You can name the role as you prefer. For this example, it will be named TextractSNSPublish.
We will need the ARN of this role. Once created, it will look something like `arn:aws:iam::************:role/TextractSNSPublish`.
Step 2.5: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncStartTextractJob (this can be renamed, but remember to update the log-group permissions)
- Runtime: Python
- Execution Role: AsyncStartTextractJobRole (or the role name that you created in Step 2.2)
- Environment Variables:
  - SNS_TOPIC_ARN: `arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion` (from Step 2.3)
  - TEXTRACT_ROLE_ARN: `arn:aws:iam::YOUR_AWS_ACCOUNT_ID:role/TextractSNSPublish` (from Step 2.4)

Paste the following Python code in the Code tab:
```python
import os
import urllib.parse  # for decoding S3 keys

import boto3

textract = boto3.client("textract")

SNS_TOPIC_ARN = os.environ['SNS_TOPIC_ARN']
TEXTRACT_ROLE_ARN = os.environ['TEXTRACT_ROLE_ARN']

def start_textract_job(bucket, key, sns_topic_arn, textract_role_arn):
    response = textract.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': bucket,
                'Name': key
            }
        },
        NotificationChannel={
            'SNSTopicArn': sns_topic_arn,
            'RoleArn': textract_role_arn
        }
    )
    return response["JobId"]

def lambda_handler(event, context):
    # S3 event records
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Decode URL-encoded key
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        job_id = start_textract_job(bucket, key, SNS_TOPIC_ARN, TEXTRACT_ROLE_ARN)
        print(f"Started Textract job {job_id} for file {key} in bucket {bucket}")

    return {
        "statusCode": 200,
        "body": "Textract job(s) started successfully."
    }
```
Step 2.6 Create an AWS SQS & Subscribe the SNS to the Queue
When Amazon Textract completes a job, it sends a notification to the SNS topic you configured. To process these notifications reliably, you can subscribe an SQS queue to the SNS topic. This ensures that your application can consume the messages asynchronously and at its own pace.
You have two main options for creating the SQS queue:
Option 1: Create the Queue & subscribe via the AWS Console
1. Navigate to the Amazon SQS Console.
2. Choose Create queue and provide a name (e.g., `textract-results-queue`).
3. After creation, go to your SNS topic in the AWS Console and add a subscription to this SQS queue. Copy the Topic ARN from the SNS topic information section when prompted.
✅ Advantage: Simple, user-friendly interface — ideal for quick setup and testing.
Option 2: Create the Queue & subscribe via AWS CLI
You can also create the queue programmatically:
```shell
aws sqs create-queue --queue-name textract-results-queue
```

Output:

```json
{
    "QueueUrl": "https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue"
}
```

To get the queue ARN, run the following command:

```shell
aws sqs get-queue-attributes \
    --queue-url https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue \
    --attribute-names QueueArn
```

Output:

```json
{
    "Attributes": {
        "QueueArn": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
    }
}
```

After creating the queue, subscribe it to your SNS topic using the following command (replace with your Topic ARN and Queue ARN):

```shell
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-job-completion \
    --protocol sqs \
    --notification-endpoint arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
```
✅ Advantage: Easy to script, automate, and reproduce across environments.
Note: Once the SQS queue is subscribed to the SNS topic, every Textract job-completion notification will be delivered to the queue, where your application can consume and process it.
Step 2.7 Add an AWS Lambda Function Trigger event when S3 Upload
Option 1: Create the Trigger Event via AWS CLI
Use the AWS CloudShell or your local terminal to execute the following command:
```shell
aws s3api put-bucket-notification-configuration \
    --bucket YOUR_S3_BUCKET_NAME \
    --notification-configuration '{
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-2:YOUR_AWS_ACCOUNT_ID:function:AsyncStartTextractJob",
                "Events": ["s3:ObjectCreated:Put"]
            }
        ]
    }'
```
Option 2: Using the AWS Console (UI)
1. Open the S3 Console → https://s3.console.aws.amazon.com/s3
2. Click your bucket name (`your-bucket-name`).
3. Go to the Properties tab.
4. Scroll down to Event notifications → click Create event notification.
5. Fill in the details:
   - Name: `PutTriggerLambda`
   - Event types: select PUT (under ObjectCreated).
   - Destination: choose Lambda function → select `AsyncStartTextractJob`.
6. Click Save changes.
Step 3: Lambda Process Textract Job Results
Step 3.1: Create an IAM customer-managed policy
Go to the IAM console and create a policy, then paste the following JSON. You can name the policy as you prefer. For this example, it will be named AsyncProcessTextractJobResultsPolicy.
Note: You can skip this step and instead create an inline policy directly on the role.
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-2:YOUR_AWS_ACCOUNT_ID:log-group:/aws/lambda/AsyncProcessTextractJobResults:*"
            ]
        },
        {
            "Sid": "AllowTextractRead",
            "Effect": "Allow",
            "Action": [
                "textract:GetDocumentTextDetection"
            ],
            "Resource": "*"
        },
        {
            "Sid": "AllowS3Write",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": "arn:aws:s3:::YOUR_S3_BUCKET/output/*"
        },
        {
            "Sid": "AllowSQSTrigger",
            "Effect": "Allow",
            "Action": [
                "sqs:ReceiveMessage",
                "sqs:DeleteMessage",
                "sqs:GetQueueAttributes"
            ],
            "Resource": "arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue"
        }
    ]
}
```
This IAM policy gives the Lambda function the ability to:
- Write logs to CloudWatch (for monitoring).
- Read the Textract job results.
- Save processed data into S3.
- Consume and delete messages from the SQS queue that delivers Textract job notifications.
Step 3.2: Create an IAM role for the AWS Lambda
Go to the IAM console and create a role, choosing Lambda as the service or use case.
Attach the AsyncProcessTextractJobResultsPolicy to the permissions policies and save. You can name the role as you prefer. For this example, it will be named AsyncProcessTextractJobResultsRole.
Step 3.3: Create the AWS Lambda Function
Create a Lambda function with the following settings:
- Name: AsyncProcessTextractJobResults (this can be renamed, but remember to update the log-group permissions)
- Runtime: Python
- Execution Role: AsyncProcessTextractJobResultsRole (or the role name that you created in Step 3.2)
- Environment Variables:
  - SQS_QUEUE_URL: `https://sqs.us-east-2.amazonaws.com/YOUR_AWS_ACCOUNT_ID/textract-results-queue` (see Step 2.6)
- Timeout: 30s (this can be increased, but make sure the SQS queue's visibility timeout is at least as long as the function timeout)

Paste the following Python code in the Code tab:
```python
import json
import os

import boto3

s3_client = boto3.client("s3")
textract_client = boto3.client("textract")
sqs_client = boto3.client("sqs")

SQS_QUEUE_URL = os.environ['SQS_QUEUE_URL']

def get_textract_job_info(record):
    # Extract the Textract job info from the SNS message wrapped in the SQS event
    message = json.loads(record["body"])
    textract_message = json.loads(message["Message"])
    return {
        "job_id": textract_message["JobId"],
        "status": textract_message["Status"],
        "bucket_name": textract_message["DocumentLocation"]["S3Bucket"],
        "object_name": textract_message["DocumentLocation"]["S3ObjectName"]
    }

def get_all_textract_blocks(job_id):
    # Page through the Textract output and collect all LINE blocks
    response = textract_client.get_document_text_detection(JobId=job_id)
    raw_text = []
    while True:
        for block in response["Blocks"]:
            if block["BlockType"] == "LINE":
                raw_text.append(block["Text"])
        if "NextToken" in response:
            response = textract_client.get_document_text_detection(
                JobId=job_id, NextToken=response["NextToken"]
            )
        else:
            break
    return "\n".join(raw_text)

def get_base_filename(object_name):
    return os.path.splitext(os.path.basename(object_name))[0]

def delete_sqs_message(receipt_handle):
    # Delete the message from SQS once done
    sqs_client.delete_message(
        QueueUrl=SQS_QUEUE_URL,
        ReceiptHandle=receipt_handle
    )

def lambda_handler(event, context):
    print("Event:", json.dumps(event))
    for record in event["Records"]:
        job_info = get_textract_job_info(record)
        job_id = job_info["job_id"]
        status = job_info["status"]
        bucket_name = job_info["bucket_name"]
        object_name = job_info["object_name"]

        if status != "SUCCEEDED":
            print(f"Textract job {job_id} failed with status: {status}")
            # Delete the message anyway, so it doesn't retry forever
            delete_sqs_message(record["receiptHandle"])
            continue

        text = get_all_textract_blocks(job_id)

        # Save the result to S3 with the same filename but a .txt extension
        base_filename = get_base_filename(object_name)
        output_key = f"output/{base_filename}.txt"
        s3_client.put_object(
            Bucket=bucket_name,
            Key=output_key,
            Body=text.encode("utf-8"),
            ContentType="text/plain"
        )
        print(f"✅ Processed Textract result saved at s3://{bucket_name}/{output_key}")

        delete_sqs_message(record["receiptHandle"])
        print(f"🗑️ Deleted message from SQS: {record['messageId']}")

    return {"statusCode": 200, "body": "Processing complete"}
```
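To test this function from the console, you need an event shaped like what SQS delivers: the Textract notification is JSON-encoded inside an SNS envelope, which is itself JSON-encoded in the SQS record body. A small script like this (the JobId, bucket, and file names are placeholders) generates a matching test event:

```python
import json

# Inner Textract notification, in the shape the handler unwraps
textract_message = {
    "JobId": "your-textract-job-id",
    "Status": "SUCCEEDED",
    "DocumentLocation": {
        "S3ObjectName": "input/your-file.pdf",
        "S3Bucket": "your-bucket-name"
    }
}

# SNS envelope, JSON-encoded into the SQS record body
test_event = {
    "Records": [
        {
            "messageId": "11111111-2222-3333-4444-555555555555",
            "receiptHandle": "test-receipt-handle",
            "body": json.dumps({"Message": json.dumps(textract_message)})
        }
    ]
}

print(json.dumps(test_event, indent=2))
```

Keep in mind the handler will still call Textract and SQS with these values, so substitute a real JobId and expect the delete call to fail on the dummy receipt handle.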
Step 3.4 Add an AWS Lambda Function Trigger event for the SQS Queue
Option 1: Create the event source mapping (SQS → Lambda) via AWS CLI

```shell
aws lambda create-event-source-mapping \
    --function-name AsyncProcessTextractJobResults \
    --batch-size 10 \
    --event-source-arn arn:aws:sqs:us-east-2:YOUR_AWS_ACCOUNT_ID:textract-results-queue
```

Explanation:
- `batch-size`: the maximum number of SQS messages passed per Lambda invocation (default 10, max 10,000).
- `event-source-arn`: the ARN of your SQS queue.
Option 2: Using the AWS Console (UI)
1. Open the Lambda Console → https://console.aws.amazon.com/lambda
2. Select your Lambda function.
3. Go to the Configuration tab.
4. Under Triggers → click Add trigger.
5. Choose SQS from the list.
6. Select your SQS queue.
7. (Optional) Set Batch size (default 10).
8. Click Add.
When to use Sync vs Async
Synchronous (Detect/Analyze — direct reply)
Best for: small, single‑page images (JPG/PNG/TIFF) or very small PDFs where the user is waiting for a response.
Pros: simple, fast to wire up, easy to expose via API Gateway.
Trade‑offs: request/response timeouts, Lambda runtime limits, not great for multi‑page or large files.
Asynchronous (Start/Get with SNS+SQS)
Best for: multi‑page PDFs/TIFFs, batches, or workloads where you don’t need an immediate response.
Pros: resilient, scalable, cost‑effective; decoupled with retries and DLQs.
Trade‑offs: more moving parts (SNS, SQS, extra Lambda), eventual consistency.
Best Practices
Use SQS between SNS and Lambda for reliability.
Break large documents into smaller batches when possible.
Encrypt sensitive documents in S3.
Monitor Textract usage with CloudWatch Metrics.
Use DLQs (Dead Letter Queues) for failed message processing.
Conclusion
Integrating AWS Lambda with Amazon Textract enables the creation of powerful document-processing pipelines with minimal infrastructure management. For small documents, the synchronous flow is simple and effective. For larger workloads, the asynchronous flow, combined with SNS and SQS, ensures reliability and scalability.
In this example, we saved the extracted text as a file in Amazon S3, but the same output could just as easily be stored in a database, or even used directly as input to an AWS Bedrock model for downstream tasks such as summarization, classification, or question answering.
By following the steps outlined in this guide, you now have the foundation to build automated document extraction systems tailored to your business needs.