✍️ Introduction
Throughout my experience working with clients in domains like healthcare, insurance, and legal, I often found myself curious about how certain backend document workflows functioned, especially in healthcare. While supporting these systems, I'd often get paged for incidents related to PDF pipelines: upload failures, script errors, or extraction gaps. At that stage, like many in support roles, I was limited to handling outcomes rather than building or understanding the full solution.
Over time, as we gain experience, build trust, and earn people's confidence in our abilities, we gradually get the opportunity to be part of architecture discussions and solution design conversations. But that curiosity about how these pipelines actually work, from PDF upload to raw text extraction, always stayed with me.
So I decided to finally explore this from scratch, hands-on, and document it as a small weekend project. This repository reflects that journey — one that started with a question and ended with deeper insights, hands-on practice, and a working prototype. My hope is that others who share this curiosity will find this just as helpful.
🔍 What This Project Is
This project focuses on extracting structured data from scanned or uploaded PDFs using AWS Textract, starting with a local Python-based flow.
It simulates real-world use cases commonly seen in healthcare, legal, and insurance sectors — where physical documents like visit summaries or forms need to be digitized and stored in structured formats like databases.
The goal?
To break down what typically happens behind the scenes — from raw scanned input to clean, queryable output — using AWS-native services.
📄 Why Document Parsing Matters
In many industries, large volumes of information are still locked inside unstructured files, like PDFs or images.
For example:
- A hospital stores patient visit summaries scanned from handwritten or printed forms.
- An insurance company receives thousands of claim forms uploaded as PDFs every week.
- A legal team scans documents, contracts, and evidence that need to be searchable.
Without parsing, this data remains buried and unusable.
Document parsing — especially automated parsing — allows organizations to:
- Extract critical fields (like patient name, ID, diagnosis)
- Store them in structured systems (like DynamoDB)
- Enable downstream use (dashboards, alerts, summaries, etc.)
This project is a hands-on way to explore how that all comes together.
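To make "structured" concrete, here is a purely illustrative sketch of what one parsed visit summary could look like as a record destined for a table such as DynamoDB. Every field name and value below is an assumption invented for the example, not output from this project.

```python
# Purely illustrative: a single visit summary reduced to a structured record.
# Field names and values are invented for the example.
parsed_record = {
    "patient_id": "P-10492",
    "patient_name": "Jane Doe",
    "visit_date": "2024-03-14",
    "diagnosis": "Seasonal allergic rhinitis",
}
```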
🧪 Local First: Why I Didn’t Start with Automation
While it’s tempting to jump straight into Lambda functions and triggers, I deliberately started with a local-first mindset.
Why?
- It helps build intuition: you understand exactly what Textract returns and how the parsing logic works.
- Easier to test and debug before handing things off to automation.
- You stay in control, tweaking and improving the logic before putting it behind an event trigger.
This mirrors how real-world teams prototype internally before scaling.
In my case:
- Took a sample patient visit summary in PDF format.
- Wrote a simple Python script to call AWS Textract.
- Parsed the returned lines into structured fields (see the sketch below).
- The script automatically saved the extracted text as a .txt file inside output-texts/. I opened it to manually check whether Textract returned the expected content.
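As a rough illustration of that parsing step, here is a minimal sketch assuming the visit summary comes back as "Label: value" lines (for example, "Patient Name: Jane Doe"). The helper name and labels are assumptions, not the exact logic in extract_textract.py.

```python
# Minimal sketch: turn Textract's detected lines into a dictionary of fields,
# assuming the document uses "Label: value" lines.
def parse_fields(lines):
    fields = {}
    for line in lines:
        if ":" in line:
            label, _, value = line.partition(":")
            # Normalize "Patient Name" -> "patient_name"
            fields[label.strip().lower().replace(" ", "_")] = value.strip()
    return fields

# parse_fields(["Patient Name: Jane Doe", "Diagnosis: Flu"])
# -> {"patient_name": "Jane Doe", "diagnosis": "Flu"}
```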
That local foundation made automation smoother and more predictable.
🧱 Prerequisites
To follow along or replicate this project, ensure the following are in place:
- An AWS account (root access only used for visual verification)
- A dedicated IAM user with the following permission: AmazonTextractFullAccess
- AWS CLI installed and configured with the IAM user credentials
- Python 3.9+ installed
- virtualenv installed
- VS Code (or any preferred IDE)
You should also:
- Create a virtual environment (python -m venv venv) and activate it.
- Install boto3 (pip install boto3) and freeze dependencies into requirements.txt
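Once the CLI is configured, a quick sanity check (optional, and not part of the repository) is to confirm that boto3 picks up the IAM user's credentials before calling Textract:

```python
# Optional sanity check: confirm boto3 can authenticate with the configured
# IAM user credentials (uses the STS GetCallerIdentity API).
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])
```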
📂 Project Structure
pdf-to-text-extractor/
├── input-pdfs/ # Local test PDFs
├── output-texts/ # Extracted raw text output
├── scripts/ # Python scripts
│ └── extract_textract.py
├── venv/ # Virtual environment (ignored in git)
├── requirements.txt
├── .gitignore
└── README.md # Project Documentation
Why venv and requirements.txt matter:
- Using a venv/ keeps dependencies isolated — it's a clean, repeatable habit in Python workflows.
- The requirements.txt file lists all the packages I used, so anyone can recreate the same environment instantly.
🧪 What the Local Script Does
In this phase, I wrote a simple Python script to:
- Load a PDF from the input-pdfs/ folder
- Send it to Textract for text extraction
- Save the output to output-texts/ as a .txt file
This helped validate whether Textract could read and extract meaningful content before jumping into parsing or automation.
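For reference, here is a minimal sketch of what such a script can look like, assuming a single-page PDF (which Textract's synchronous detect_document_text API accepts as raw bytes) and the folder layout shown earlier; the actual extract_textract.py in the repo may differ in its details.

```python
# Sketch of the local flow: read PDFs from input-pdfs/, call Textract
# synchronously, and write the detected lines to output-texts/ as .txt files.
from pathlib import Path

import boto3

INPUT_DIR = Path("input-pdfs")
OUTPUT_DIR = Path("output-texts")

def extract_text(pdf_path: Path) -> str:
    textract = boto3.client("textract")
    response = textract.detect_document_text(
        Document={"Bytes": pdf_path.read_bytes()}
    )
    # Keep only LINE blocks; each one is a detected line of text
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
    return "\n".join(lines)

if __name__ == "__main__":
    OUTPUT_DIR.mkdir(exist_ok=True)
    for pdf_file in INPUT_DIR.glob("*.pdf"):
        out_file = OUTPUT_DIR / f"{pdf_file.stem}.txt"
        out_file.write_text(extract_text(pdf_file))
        print(f"Saved {out_file}")
```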
✅ Outcome and What’s Next
By the end of this phase, I had a working local prototype that:
- Pulled a PDF from input-pdfs/
- Extracted raw text using AWS Textract
- Saved it to output-texts/
- Gave me a chance to test and fine-tune the logic manually
This local-first phase gave me the space to deeply understand what each piece does before scaling up.
🔗 References
- Python boto3 Documentation
- Amazon Textract Documentation
- Setting up a Python Virtual Environment
- What is requirements.txt
📂 Explore the Full Codebase
All the files used in this local setup are available here:
🔗 GitHub Repo → pdf-to-text-extractor/local
🔜 Coming Up in Part 2:
We’ll build on this by:
- Triggering Textract via AWS Lambda
- Parsing and storing results in DynamoDB
- Automating everything after the upload, using AWS services to handle extraction, parsing, and storage, just like a real-world backend system would (a rough preview follows below)
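As a very rough preview of that flow (everything below is an assumption and will be built out properly in Part 2), an S3-triggered Lambda handler could look roughly like this:

```python
# Hypothetical preview: an S3 upload event triggers this Lambda, which runs
# Textract on the uploaded object and stores the result in DynamoDB.
# The table name and attribute names are assumptions for the sketch.
import boto3

textract = boto3.client("textract")
table = boto3.resource("dynamodb").Table("VisitSummaries")

def handler(event, context):
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

    # Store the raw text for now; field-level parsing is covered in Part 2
    table.put_item(Item={"document_key": key, "raw_text": "\n".join(lines)})
```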
📘 Read Part 2 here: Building a Serverless PDF Ingestion Flow