Architecting your GenAI data pipeline with AWS native services

When I started building GenAI solutions, I felt confident working with models, prompts, and architecture — but the data part always felt like a black box. I kept asking myself: Where do I even begin if I want to use my own data? Every time I looked into it, I found a pile of scattered advice, incomplete setups, or tools that didn’t quite fit together. It reminded me of moving into a new house and opening the garage — only to find it packed with boxes from ten different people. You know there’s valuable stuff in there, but it’s all mixed up, mislabeled, and overwhelming.

This post is what I wish someone had handed me back then — a clear, hands-on walkthrough of how to turn that chaotic garage into a well-organized workshop. If you’re comfortable with AWS and GenAI but still wondering how to structure, clean, and prepare your own data properly, this is for you.

1. Getting Started

First things first: what do I need before I even begin? Beyond an AWS account and some basic AWS skills, start by taking inventory of your data. Ask yourself: What types of data do I have? (CSVs, JSON logs, PDFs of documents, etc.) Where is it coming from? (On-prem systems, databases, S3 buckets, etc.) This will inform your pipeline design. Honestly, at first I felt paralyzed: “Should I use Glue? Athena? Lambda? Everything?” I was like a newbie chef staring at a pantry full of ingredients. The answer is: you don’t need magic – just a plan.

Start small:

  • Create or identify an S3 bucket (or a couple) for raw data.

  • If you have existing sources (RDS, on-prem, SaaS APIs, etc.), consider how to move that data in – AWS Glue has connectors, and AWS DataSync or Transfer Family can help with files.

  • Enable AWS CloudTrail data events for your S3 bucket so object-level events (like new file uploads) can reach EventBridge and trigger processing (a minimal setup sketch follows this list).
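
If you want to script this first step, here's a minimal boto3 sketch, assuming a hypothetical bucket name and an existing CloudTrail trail (my-data-lake-trail); adjust both to your own setup:

```python
import boto3

s3 = boto3.client("s3")
cloudtrail = boto3.client("cloudtrail")

RAW_BUCKET = "myorg-data-raw"  # hypothetical name -- use your own

# 1. Create the raw bucket (outside us-east-1, add a CreateBucketConfiguration)
s3.create_bucket(Bucket=RAW_BUCKET)

# 2. Record object-level (data) events for that bucket on an existing trail,
#    so EventBridge rules can later match "AWS API Call via CloudTrail" events
cloudtrail.put_event_selectors(
    TrailName="my-data-lake-trail",  # assumes this trail already exists
    EventSelectors=[{
        "ReadWriteType": "WriteOnly",
        "IncludeManagementEvents": False,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            "Values": [f"arn:aws:s3:::{RAW_BUCKET}/"],
        }],
    }],
)
```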

Honestly, the first thing I realized was: you don’t have to nail everything at once. Get one data source flowing into S3, see how it goes, then expand. It’s like testing a new recipe by cooking a small batch first.

2. Structuring Your S3 Data Lake

Once you’re ready to store data, structure is key. AWS best practices recommend multiple S3 “zones” or layers, typically separate buckets for raw, stage/processing, and analytics/curated data. Think of it as organizing your garage: one rack for “just moved in – untouched stuff” (raw), one workbench for “mid-cleanup – in progress” (stage), and one shelf for “ready-to-use” (analytics). For example:

  • Raw layer – store files exactly as you received them (CSV, JSON, PDF, etc.). Enable versioning here so you never lose the original. This is your time-capsule.

  • Stage layer – place intermediate, cleaned-up data here. Convert formats (e.g. CSV→Parquet), perform initial transformations, and catalog the schema with AWS Glue.

  • Analytics layer – put your fully processed, query-ready tables (often Parquet or Iceberg) here. This is what your models or analysts will ultimately consume.

You will likely use separate S3 buckets named by layer (e.g. myorg-data-raw, myorg-data-stage, myorg-data-analytics), possibly including environment or account info for clarity. Good naming makes governance and cost-tracking easier.

A handy checklist:

  • Use at least 3 layers (raw, stage, analytics), each in its own S3 bucket.

  • Keep originals intact in raw (no manual edits!).

  • Plan a folder structure/prefixes inside each bucket (date-based partitions can help).

  • Enable default encryption (SSE-S3 or SSE-KMS via AWS KMS) and versioning on the raw and stage buckets (a boto3 sketch follows this list).
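
If you'd rather codify that checklist, here's a minimal boto3 sketch, assuming hypothetical bucket names and an existing KMS key alias:

```python
import boto3

s3 = boto3.client("s3")

for bucket in ["myorg-data-raw", "myorg-data-stage"]:  # hypothetical names
    # Keep every version of incoming files so originals are never lost
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )
    # Encrypt new objects by default with KMS (swap in "AES256" for SSE-S3)
    s3.put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration={
            "Rules": [{
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "alias/my-data-lake-key",  # assumed alias
                }
            }]
        },
    )
```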

By thinking of your S3 lake as a well-labeled garage, you will save countless headaches later.

3. Ingesting Mixed Data Types

Your data recipe likely has all sorts of ingredients: relational tables, CSVs, JSON logs, and even PDFs or images. AWS provides tools for each:

  • For batch files (CSV, JSON, images, PDFs in S3): you can simply upload them to S3 (via CLI, SDK, or GUI) or use AWS DataSync/Transfer for large/migrated datasets. Once in S3, set up AWS Glue Crawlers to auto-detect schema on CSV/JSON and populate the Glue Data Catalog. Crawlers are like diligent librarians that automatically say, “Hey, new data here!” and create tables you can query with Athena.

  • For streaming or real-time data (e.g. logs, IoT, social feeds): use Amazon Kinesis Data Firehose to ingest streams directly into S3. Firehose can even convert JSON to Parquet on the fly for you, and it supports invoking Lambda for custom transforms (e.g. turning CSV logs into JSON via a blueprint). It’s like having a smart conveyor belt that batches, compresses, and drops your data into S3.

  • For databases (on-prem or RDS): consider AWS Database Migration Service (DMS) or Glue’s JDBC connections to pull data into S3 or directly into Redshift/Athena as needed.

  • For documents and images (PDFs, scanned docs): Amazon Textract is your friend. It’s an AI OCR service that can extract text and structured data from scanned files. For example, drop a PDF of a report into S3 and trigger a Lambda that calls Textract to get the text (a sketch of such a Lambda follows this list). Save that output (say as plain text or JSON) back into S3 so it can join the rest of the lake. Bedrock Knowledge Bases (KBs) can even ingest common document formats like PDF, Word, HTML, and CSV directly. (Just keep each source file ≤50 MB.)
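
Here's a minimal sketch of such a Lambda handler, assuming it's wired to S3 event notifications and writes to a hypothetical stage bucket. It uses Textract's synchronous API, which handles single-page documents; multi-page PDFs need the asynchronous start_document_text_detection job instead:

```python
import json
import urllib.parse
import boto3

textract = boto3.client("textract")
s3 = boto3.client("s3")

STAGE_BUCKET = "myorg-data-stage"  # hypothetical name

def handler(event, context):
    # The S3 event notification carries the bucket and object key
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Synchronous OCR call (single-page docs; use the async API for multi-page PDFs)
    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

    # Keep only the detected text lines
    lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

    # Drop the plain text back into the lake so it can join the other data
    s3.put_object(
        Bucket=STAGE_BUCKET,
        Key=f"textract/{key}.txt",
        Body="\n".join(lines).encode("utf-8"),
    )
    return {"statusCode": 200, "body": json.dumps({"lines": len(lines)})}
```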

Key points: Your pipeline will likely be a mix of the above. You might set an EventBridge rule (via S3 events/CloudTrail) so that any new S3 upload triggers a Glue or Lambda job to process it. For example, new CSVs trigger a Glue ETL that cleans and moves data into the stage bucket; new PDFs trigger a Textract process that spits out text to S3. Remember, AWS is great at handling heterogeneous data – just string the right services together.

4. Cataloging and Schema Harmonization (Glue & DataBrew)

Once data is landing in S3, you need a catalog (an index) and some cleaning. AWS Glue handles the catalog and ETL magic. Glue Crawlers will scan your raw S3 folders and register tables in the Glue Data Catalog. Think of the Data Catalog as a library card catalog for all your datasets. You can then use AWS Glue jobs (PySpark) or AWS Glue DataBrew (no-code GUI) to clean and harmonize the data.

DataBrew is a visual data-prep tool – no coding needed – that can profile and transform your data. It has 250+ built-in functions (filtering, renaming columns, converting formats, handling missing values, etc.). It is like having a friendly spreadsheet on steroids: you load a dataset, click to apply fixes (e.g. fix date formats, standardize field names, remove duplicates), and output a clean file. For example, if one source has “FirstName” and another has “first_name”, you can use DataBrew to unify them.

Meanwhile, Glue ETL jobs can join, enrich, or further transform data. The advantage: everything can be automated in Glue Workflows. Glue (with DataBrew) lets you clean and normalize without writing code – perfect for teams where not everyone is a developer. After transformation, write outputs to the “stage” or “analytics” bucket, and update the Data Catalog with the new schema.

In practice, you can:

  • Run crawlers to auto-detect the schema of your raw data and register tables in the Data Catalog.

  • Run a DataBrew recipe to fix common issues across datasets (format dates, fill nulls, etc.).

  • Use a Glue job (Spark) to join datasets or convert everything to an analytics-friendly format like Parquet or Apache Iceberg tables (a minimal job script is sketched below).
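
Here's a minimal sketch of such a Glue job script (it runs inside the Glue job environment), assuming a hypothetical catalog database raw_db, table orders_csv, and stage bucket:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table the crawler registered over the raw CSV files
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw_db", table_name="orders_csv"  # hypothetical names
)

# Write it out as Parquet into the stage bucket, partitioned by date
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://myorg-data-stage/orders/",  # hypothetical bucket
        "partitionKeys": ["order_date"],          # assumes this column exists
    },
    format="parquet",
)

job.commit()
```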

This process is like cleaning that messy garage: Dust off the data, sort it onto shelves, and create a manifest (the Glue catalog) so you can find what you need. All those crawlers and DataBrew steps mean your files go from “hot mess” to “whew, this is actually usable”.

5. Normalizing Documents for GenAI (Bedrock KBs, Textract)

Text documents are a special case. For Generative AI (especially Retrieval-Augmented Generation), you often feed documents into a knowledge base or query engine. AWS’s Amazon Bedrock Knowledge Bases let you ingest text-based files (see supported formats above) and then query them with LLMs. But first you may need to normalize and chunk the text.

For example, if you have PDFs of manuals or reports, use Textract to extract the text and tables. Once you have raw text, you might run a Glue job or Lambda to split it into logical sections or “chunks” of a few hundred words each. This is because Bedrock KBs often work best when content is broken into pieces (like paragraphs). Think of it as chopping a long document into bite-size pieces.
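
A minimal chunking sketch in plain Python, assuming the Textract output from earlier lives at a hypothetical key and that ~300 words per chunk is a reasonable size for your content:

```python
import boto3

s3 = boto3.client("s3")
STAGE_BUCKET = "myorg-data-stage"  # hypothetical names throughout
CHUNK_WORDS = 300

def chunk_text(text, size=CHUNK_WORDS):
    """Split text into chunks of roughly `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# Read the extracted text, split it, and write each chunk as its own .txt object
doc = s3.get_object(Bucket=STAGE_BUCKET, Key="textract/manual.pdf.txt")
text = doc["Body"].read().decode("utf-8")

for n, chunk in enumerate(chunk_text(text)):
    s3.put_object(
        Bucket=STAGE_BUCKET,
        Key=f"kb-source/manual/chunk-{n:04d}.txt",
        Body=chunk.encode("utf-8"),
    )
```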

Advanced note: Bedrock now supports things like semantic and hierarchical chunking and even using foundation models to parse tricky PDFs. But at a beginner level, start simple: extract text, remove any scanned gibberish, and put everything in plain .txt or .md files in S3. Then connect that S3 as the data source for your Bedrock KB. The service will parse and index it for RAG queries.
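
Once the S3 data source is attached to your knowledge base, you can trigger (or re-run) the ingestion sync from code too; a minimal sketch, with placeholder knowledge base and data source IDs:

```python
import boto3

# Build-time Bedrock resources (knowledge bases, data sources) live in the
# "bedrock-agent" client, not in "bedrock-runtime"
bedrock_agent = boto3.client("bedrock-agent")

response = bedrock_agent.start_ingestion_job(
    knowledgeBaseId="KBXXXXXXXX",   # placeholder -- copy yours from the console
    dataSourceId="DSXXXXXXXX",      # placeholder
    description="Sync newly chunked documents from S3",
)
print(response["ingestionJob"]["status"])
```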

One more tip – normalizing terminology: Different documents might use “DOB”, “Date of Birth”, and “Birth Date”. You can use Glue or even LLM prompts to standardize these keys (so your GenAI sees them as the same concept). In intelligent document processing terms, this is “template and normalization” (defining aliases for different terms). It’s like telling the AI, “hey, whenever you see DOB or Birthdate, they all mean the same thing.” This ensures your knowledge base is clean and consistent.

6. Securing and Governing the Data Lake (IAM, Lake Formation, LF-Tags)

Now that you have valuable data, lock it up properly. Start with IAM: enforce the principle of least privilege, use IAM roles for your Glue jobs and Lambda functions (they get just the S3/Glue permissions they need), and consider AWS KMS for key management. But IAM alone can get tedious for per-table or per-column access. That’s where AWS Lake Formation comes in.

Lake Formation lets you set fine-grained access controls on your Data Catalog databases and tables (and the S3 data behind them). A powerful feature is LF-Tags: attribute-based tags you attach to databases, tables, or columns (like department=finance or sensitivity=PII); you then grant user roles access to those same tag values. It’s like parking passes in a corporate garage with zones: each dataset has a parking sticker (LF-Tag) and each data analyst has a matching pass. Only matching stickers let you “drive through” to access the data. LF-Tags scale better than hand-granting each table to each user one by one.

(Important note: LF-Tags are not the same as IAM tags. LF-Tags gate Lake Formation table access; IAM tags control IAM policies. Don’t confuse the two – one is for data access, the other for service permissions.)
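
Here's a minimal sketch of that flow with boto3, assuming hypothetical tag values, a catalog table sales_curated in analytics_db, and an existing analyst role:

```python
import boto3

lf = boto3.client("lakeformation")

# 1. Define the tag and its allowed values
lf.create_lf_tag(TagKey="department", TagValues=["finance", "sales"])

# 2. Stick the tag on a catalog table (the "parking sticker")
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "analytics_db", "Name": "sales_curated"}},
    LFTags=[{"TagKey": "department", "TagValues": ["sales"]}],
)

# 3. Give the analyst role a matching "parking pass"
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/SalesAnalyst"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "department", "TagValues": ["sales"]}],
        }
    },
    Permissions=["SELECT"],
)
```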

Also use Lake Formation to enforce column- or row-level filtering if needed (e.g. mask emails or only show Europe-region rows). And don’t forget S3 bucket policies or Access Points for cross-account sharing. In short, treat your data lake like Fort Knox: multi-layer security, logging everything.

Stick a tag on the dataset, hand matching passes to users, and voila, access granted – it’s way easier than juggling dozens of user policies as your lake grows.

7. Automating the Pipeline (EventBridge, Glue Workflows, Step Functions)

Manual clicks are error-prone – automate your pipeline so new data flows on its own. Amazon EventBridge is your event router: for example, set it to catch all S3 PutObject events (via CloudTrail) and trigger AWS Glue Workflows. Glue Workflows let you chain multiple Glue jobs, crawlers, and triggers as a single pipeline.

A common pattern: S3 → CloudTrail → EventBridge rule → Glue workflow. In practice I do this: the rule forwards upload events to the workflow, and the workflow’s event-trigger batching condition starts it once, say, ten files land or a 5-minute window elapses. Glue then runs a crawler (to detect new schema), an ETL job (to clean/convert data), and maybe another job to move data to the analytics zone. This is event-driven ETL. The nice part is you don’t have to poll or schedule useless jobs – it reacts to real events.
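
Here's a minimal boto3 sketch of wiring that rule to a Glue workflow, assuming a hypothetical raw bucket, workflow, and an EventBridge role allowed to notify Glue; the workflow itself needs an EVENT-type start trigger:

```python
import json
import boto3

events = boto3.client("events")

# Match object-level PutObject calls on the raw bucket, delivered via CloudTrail
pattern = {
    "source": ["aws.s3"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["s3.amazonaws.com"],
        "eventName": ["PutObject", "CompleteMultipartUpload"],
        "requestParameters": {"bucketName": ["myorg-data-raw"]},  # hypothetical bucket
    },
}

events.put_rule(Name="raw-upload-rule", EventPattern=json.dumps(pattern))

# Point the rule at the Glue workflow (placeholder ARNs)
events.put_targets(
    Rule="raw-upload-rule",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:us-east-1:111122223333:workflow/raw-to-stage",
        "RoleArn": "arn:aws:iam::111122223333:role/EventBridgeGlueRole",
    }],
)
```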

You could also use AWS Step Functions for orchestration (especially if you need branching logic or parallel tasks). Step Functions can call Glue, Lambda, SageMaker, etc., with built-in retries. I’ve used it to coordinate complex flows: e.g. after Glue finishes, step through custom quality checks (via Lambda) before loading data.

In short, trigger-on-upload or scheduled rules in EventBridge start your pipeline; Glue workflows or Step Functions manage the steps; and the pipeline runs itself. It’s like setting up a line of dominoes – once you tip the first piece (new data arrived), everything else happens automatically.

8. Making Your Data Discoverable (Amazon DataZone & SageMaker)

Lastly, once your data lake is humming, make it easy for your teams to find and use the data. Amazon DataZone is a data catalog and governance service – think of it as a Google for your company’s data assets. You can register your S3 data (via Glue tables) in DataZone, tag them (e.g. sales, raw-data), and write descriptions. Users across the org can then search or browse for the datasets they need. This is golden for collaboration.

The best part: SageMaker now integrates directly with DataZone. Data scientists and ML engineers can search the DataZone catalog inside SageMaker Studio or Canvas, and even pull data into notebooks or Canvas by clicking “Subscribe” to a dataset. It’s like having a built-in data shopping mall – shop for datasets (or features/models) right in your IDE.

For example, you might publish a cleaned Parquet table to DataZone as a “SalesStats” asset. Anyone on the ML team can now discover SalesStats from SageMaker and add it to their Jupyter environment. After training a model, they could even publish the model back to DataZone for others to reuse.

In short, leverage DataZone (linked to your Lake Formation / Glue Catalog) to catalog everything, and enable SageMaker’s data search/subscribe. That way, your data becomes a first-class citizen in the ML workflow. Your messy garage is now a public library where everyone can check out books (i.e. datasets) they need.

Putting It All Together

Phew! That was a lot of detail, but remember: start simple and iterate. The steps are roughly: set up your AWS account, organize S3, ingest all your data sources, clean and catalog with Glue/DataBrew, extract text from documents (Textract) for Bedrock, secure everything with IAM/LF-Tags, automate with EventBridge/Glue/Step Functions, and finally plug it into DataZone/SageMaker for others to find.

At first it might feel like there are a million AWS services to learn – trust me, I’ve been there. But take it one step at a time. Each piece you set up (a crawler, a Tag, a DataBrew recipe) is like cleaning one corner of that garage. Eventually you’ll stand back and say, “Wow, my GenAI application now actually sees my data!”

Good luck with your GenAI data adventure. I am still learning every day, and I will admit sometimes I screw up a bucket policy or mix up LF-Tag names. But as long as we keep our data organized, secure, and discoverable, our AI projects have the fuel they need.
