
Preparing Unstructured Data for AI? Forget ETL

Organizations must replace outdated ETL processes with intelligent metadata-driven workflows to effectively prepare the vast amounts of unstructured data needed for modern AI applications.


By Krishna Subramanian, Komprise

As AI transforms business operations, organizations need to focus on the data and, specifically, how to build efficient data pipelines to feed AI. The issue is that traditional data pipelines leveraging Extract, Transform, Load (ETL) were built for structured data and are fundamentally misaligned with AI's needs. ETL processes — the backbone of business intelligence for decades — were designed for a different era and different data types.

ETL, which was designed for structured data from databases, no longer works in a world where 90% of data is unstructured and lives in files of many different formats and types: documents, images, video and audio files, and instrument and sensor data.

This shift, from the data analytics of the past built on structured data to the AI of today that requires large amounts of unstructured data, demands a complete rethinking of how organizations prepare data for AI consumption.

The Unstructured Data Challenge

The core problem with unstructured data is its inherent lack of a common schema. You can't take a video file, an audio file, or even three video files from three different applications and place them in a tabular format because they all have different contexts and different semantics. This variety creates significant transformation challenges.


An MRI medical image and a marketing photograph may share the same file extension, but each requires its own metadata structure and processing approach. Perhaps most importantly, transformation requirements vary dramatically by context. The same document format might need entirely different preprocessing depending on whether it's being analyzed for legal compliance, customer sentiment, or research insights.

To make unstructured data usable, safe and searchable for AI pipelines, organizations need to accurately enrich metadata in ways that don't require tedious, Sisyphean manual work. The metadata that storage systems automatically generate is limited: file type, creation date, author, modification date, size, last access date, and user ID.
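
To see how thin that baseline is, here is a minimal Python sketch contrasting the metadata a file system provides with the kind of enriched record an AI pipeline needs. The stat fields are standard; the filename and every tag in the enrichment layer are hypothetical examples, not a fixed schema.

```python
import os
from datetime import datetime, timezone

def storage_metadata(path: str) -> dict:
    """The metadata a file system gives you for free: thin and content-blind."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": st.st_size,
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc).isoformat(),
        "last_accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc).isoformat(),
        "owner_uid": st.st_uid,  # a user ID, not a business-meaningful owner
    }

# The enrichment layer AI pipelines actually need. Tag names and values here
# are hypothetical examples, not a standard schema.
enriched = {
    **storage_metadata("scan_0042.dcm"),  # hypothetical file
    "content_type": "mri_image",   # from an automated classifier
    "project_code": "TRIAL-7",     # from a departmental user who knows the data
    "contains_pii": True,          # from a PII detector; excludes it from AI workflows
}
```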

To enrich metadata, you first need a way to create a global index of your unstructured data regardless of which storage or cloud houses the data. Once you have visibility, you can add tags manually with the help of departmental users who know their data and/or using AI and other automated tools. These new technologies — which can be standalone or exist within an unstructured data management platform — rapidly scan data sets and apply relevant tags describing their contents. This can identify sensitive data like personally identifiable information (PII) that must be excluded from AI workflows and add tags such as project code or research keywords that distinctly identify it for unique use cases. As you catalog unstructured data, it is important to ensure that metadata can follow the data wherever it moves, avoiding the need to re-create metadata.
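
One way to make metadata follow the data is to key tags to content rather than location. A minimal sketch using only the Python standard library: tags are stored against a SHA-256 hash of the file's bytes, so after a copy or migration a re-hash finds the same record and nothing needs re-tagging. The plain dictionary here stands in for a real global index.

```python
import hashlib

def content_key(path: str, chunk_size: int = 1 << 20) -> str:
    """Storage-independent identity: hash the bytes, not the path."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# The global index maps content hashes to tags, so tags survive data movement.
index: dict[str, dict] = {}

def tag(path: str, **tags) -> None:
    index.setdefault(content_key(path), {}).update(tags)

def lookup(path: str) -> dict:
    # Finds the same record no matter where the file has moved,
    # because the key is the content, not the location.
    return index.get(content_key(path), {})
```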


Copying and moving unstructured data to locations for AI analysis is also time-consuming and expensive, and due to the size of the data, it can take weeks to months. As a result, you only want to move the precise data sets that you need, further highlighting the need for metadata enrichment and classification.

Why AI Workflows Break the ETL Model

Beyond format challenges, AI processing itself fundamentally differs from traditional analytics. With AI, the workflows become iterative and non-linear.

For example, you might want Amazon Rekognition to look at images and tag them, then run PII detection to find and exclude sensitive data, and finally send the data to a large language model (LLM) such as Azure OpenAI for chat augmentation. You now have three different AI processes working on the same data at different points.


This creates an AI-feeding-AI scenario where outputs from one process become inputs for another. Traditional ETL simply wasn't designed for this cyclical enrichment process.
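
A rough Python sketch makes that cyclical shape concrete. The Rekognition and Comprehend calls are real boto3 APIs, but the wiring is illustrative: augment_chat_index() is a hypothetical stand-in for the LLM stage, and passing stage-one tags and the stage-two PII verdict downstream as metadata is one plausible design, not a prescribed one.

```python
import boto3

rekognition = boto3.client("rekognition")
comprehend = boto3.client("comprehend")

def augment_chat_index(image_bytes: bytes, metadata: dict) -> None:
    """Hypothetical stand-in for handing enriched data to an LLM service
    (e.g., Azure OpenAI) for chat augmentation."""

def enrich_image(image_bytes: bytes, extracted_text: str) -> dict | None:
    # Stage 1: one AI tags the image content.
    labels = rekognition.detect_labels(Image={"Bytes": image_bytes}, MaxLabels=10)
    tags = [label["Name"] for label in labels["Labels"]]

    # Stage 2: a second AI screens associated text for PII.
    pii = comprehend.detect_pii_entities(Text=extracted_text, LanguageCode="en")
    if pii["Entities"]:
        return None  # sensitive: excluded from downstream AI workflows

    # Stage 3: a third AI consumes what stages 1 and 2 produced.
    record = {"tags": tags, "pii_cleared": True}
    augment_chat_index(image_bytes, metadata=record)
    return record
```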

Additionally, AI introduces critical data governance challenges that are different from traditional analytics. Ensuring that employees do not inadvertently expose sensitive data to commercial (external) AI services is one challenge, while maintaining clear audit trails of what corporate data was processed by which AI service is another. ETL doesn't support these audit and verification requirements for unstructured data.

Finally, there is a need to incorporate human verification, which is becoming a core component of AI data governance. Organizations must keep a record of which metadata was enriched by AI alone and which was AI-enriched and then verified by a human.
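
One way to keep that record is to store provenance on the tag itself. A minimal sketch with hypothetical field names:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Tag:
    """A metadata tag that carries its own provenance for audit trails."""
    name: str
    value: str
    source: str = "ai"              # "ai" until a person signs off
    verified_by: str | None = None
    verified_at: str | None = None

    def verify(self, reviewer: str) -> None:
        self.source = "ai+human_verified"
        self.verified_by = reviewer
        self.verified_at = datetime.now(timezone.utc).isoformat()

t = Tag(name="content_type", value="mri_image")
t.verify("j.doe")  # the record now shows who confirmed the AI's work, and when
```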

Smart Data Workflows for AI

A modern approach to AI unstructured data preparation requires rethinking the entire data pipeline. Rather than immediately moving data, organizations are finding success by first building a comprehensive metadata index that spans all storage environments. This delivers intelligent curation that identifies the exact subset of data for AI processing based on content, context, and business requirements. A global metadata index should be designed to retain metadata and tags no matter where the data lives, so it is independent of your storage.

This approach delivers significant advantages. In one real-world example, when an organization needed to analyze three million documents for specific image content, the traditional ETL approach would have copied all files to a data lake before processing. Not only is copying this volume of files extremely time-consuming, but it also results in unnecessary delays in running the AI, plus added AI and storage costs. By first indexing and curating the data set to identify just 10,000 relevant images, this organization reduced processing costs by 97%.
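
A sketch of that curate-first pattern, assuming a hypothetical metadata_index.search() API over the global index described earlier. The point is the ordering: the cheap metadata query shrinks the candidate set before the expensive copy ever starts.

```python
import shutil
from pathlib import Path

def curate_and_stage(metadata_index, staging_dir: str) -> int:
    """Query first, move later: copy only the files the AI job actually needs."""
    # Cheap: a metadata query over the global index (hypothetical search API).
    matches = metadata_index.search(
        content_type="image",
        tags_any=["product_photo"],  # hypothetical tag from earlier enrichment
        contains_pii=False,          # governance filter applied before any movement
    )
    # Expensive: data movement, now limited to the curated subset
    # (e.g., 10,000 files instead of 3 million).
    dest = Path(staging_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for match in matches:
        shutil.copy2(match.path, dest / Path(match.path).name)
    return len(matches)
```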

The key elements of smart data workflows for AI include:

  • Global metadata indexing and curation: Discover and select relevant data before moving it, integrating with AI processors as needed for rapid content analysis and tagging.

  • User tagging: Allow end users to tag their own data since they know it best.

  • Iterative enrichment: Store results as reusable metadata to avoid redundant processing (see the sketch following this list).

  • Built-in governance: Automatically detect sensitive information and maintain comprehensive audit trails.
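
The iterative-enrichment idea reduces, at its simplest, to memoization keyed on content. In the sketch below, classify_with_ai() is a hypothetical stand-in for any tagging service; once its result is stored as metadata, later workflows reuse it instead of paying for the AI call again.

```python
import hashlib

enrichment_cache: dict[str, list[str]] = {}  # content hash -> tags already computed

def classify_with_ai(data: bytes) -> list[str]:
    """Hypothetical stand-in for any AI tagging service."""
    return ["placeholder_tag"]

def enrich(data: bytes) -> list[str]:
    key = hashlib.sha256(data).hexdigest()
    if key not in enrichment_cache:       # pay for the AI call once per content
        enrichment_cache[key] = classify_with_ai(data)
    return enrichment_cache[key]          # later workflows reuse the stored result
```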

These unstructured data workflows support specialized pre-processors for different data types: text extractors for documents, recognition services for images, and specialized analyzers for industry-specific data, all while maintaining data lineage and context throughout the process.
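
Routing files to those pre-processors is typically a dispatch on data type. A minimal sketch, with hypothetical extractor functions standing in for real services:

```python
from pathlib import Path

def extract_text(path: str) -> str: ...       # documents -> text for LLMs
def recognize_image(path: str) -> list: ...   # images -> labels via recognition
def parse_sensor_log(path: str) -> dict: ...  # industry-specific analyzer

# Hypothetical mapping from file type to specialized pre-processor.
PREPROCESSORS = {
    ".pdf": extract_text,
    ".docx": extract_text,
    ".jpg": recognize_image,
    ".png": recognize_image,
    ".csv": parse_sensor_log,
}

def preprocess(path: str):
    handler = PREPROCESSORS.get(Path(path).suffix.lower())
    if handler is None:
        raise ValueError(f"no pre-processor registered for {path}")
    return handler(path)
```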

A Modern Approach for AI Data Preparation

There are a number of steps to follow on the path toward modern AI data preparation:

  1. Map your unstructured data landscape: Understand what types of data you have, where it resides, and its potential AI value.

  2. Focus on targeted use cases first: Start with a well-defined AI initiative rather than attempting to transform all data preparation at once.

  3. Build governance into the foundation: Address privacy, security, and compliance requirements from the start.

  4. Measure new metrics: Track how effectively your pipeline supports diverse AI use cases, not just data movement efficiency.

  5. Deliver self-service: Find ways for data owners and AI practitioners to find, classify, and tag the right data, partnering closely with IT operations and infrastructure teams.

As AI becomes central to business strategy, the organizations that implement smart data workflows will gain significant advantages in agility, cost efficiency, and risk management. The question isn't whether your organization needs a new approach to unstructured data preparation for AI — it's how quickly you can implement one.

About the author:

Krishna Subramanian is COO, president, and co-founder of Komprise. In her career, Subramanian has built three successful venture-backed IT businesses and was named a "2021 Top 100 Women of Influence" by Silicon Valley Business Journal.
