Mastering Data Compaction in Apache Iceberg: A Production Deep Dive
Introduction
The relentless growth of data in modern applications presents a significant engineering challenge: maintaining query performance on ever-increasing datasets. We recently faced this at scale while building a real-time analytics platform for a large e-commerce company. Ingesting clickstream data at 100K events/second, coupled with historical order data totaling over 500TB, resulted in query latencies exceeding acceptable thresholds for dashboarding and reporting. The root cause wasn’t compute or storage capacity, but the proliferation of small files and inefficient data layout within our data lake. This led us to deeply investigate and optimize data compaction strategies within Apache Iceberg, a table format designed for large analytic datasets. This post details our journey, focusing on the architectural considerations, performance tuning, and operational realities of Iceberg compaction. We’ll cover how it fits into a modern data lakehouse architecture built on Spark, Flink, and cloud storage (AWS S3).
What is Data Compaction in Big Data Systems?
Data compaction, in the context of table formats like Iceberg, Delta Lake, and Hudi, is the process of rewriting small files into larger, more efficient files. This is crucial for several reasons. Small files lead to increased metadata overhead (listing files in object storage is expensive), increased I/O operations (more files to open and read), and reduced parallelism during query execution. Iceberg’s architecture relies on metadata layers (manifest lists, manifest files, data files) to track data. Frequent small file writes overwhelm these layers, degrading performance. Compaction isn’t simply copying data; it’s a carefully orchestrated process involving rewriting data files, updating metadata, and ensuring atomicity. Iceberg’s compaction process leverages its snapshot isolation and optimistic concurrency control mechanisms. At the protocol level, compaction jobs read data files based on manifest lists, rewrite them in a chosen format (typically Parquet), and then update the metadata to point to the new files, atomically replacing the old ones.
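To make those metadata layers concrete: Iceberg exposes them as queryable metadata tables, which is the quickest way to see whether a table is drowning in small files. A short PySpark sketch, assuming an already-configured SparkSession; the catalog and table names (glue_catalog.db.events) and the 32MB threshold are illustrative:
# Gauge small-file bloat from Iceberg's files and snapshots metadata tables.
# Catalog/table names and thresholds are placeholders, not our production values.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # Iceberg catalog settings assumed configured elsewhere

files = spark.read.format("iceberg").load("glue_catalog.db.events.files")
files.agg(
    F.count("*").alias("data_files"),
    F.sum((F.col("file_size_in_bytes") < 32 * 1024 * 1024).cast("int")).alias("files_under_32mb"),
    (F.avg("file_size_in_bytes") / (1024 * 1024)).alias("avg_file_mb"),
).show(truncate=False)

# Each commit adds a snapshot (and manifests); lots of tiny appends show up here.
spark.read.format("iceberg").load("glue_catalog.db.events.snapshots") \
    .select("committed_at", "snapshot_id", "operation").show(10, truncate=False)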
Real-World Use Cases
- Clickstream Analytics: High-velocity, append-only data streams require frequent compaction to prevent small file bloat. Without it, dashboard queries become sluggish.
- Change Data Capture (CDC) Ingestion: CDC pipelines often write data in micro-batches, resulting in numerous small files. Compaction consolidates these changes into larger, queryable units (a minimal streaming-write sketch of this pattern follows the list).
- Log Analytics: Aggregating logs from thousands of servers generates a high volume of small log files. Compaction is essential for efficient log querying and analysis.
- Machine Learning Feature Pipelines: Incremental feature updates frequently create new data files. Compaction ensures that feature stores remain performant for model training and inference.
- Historical Data Archiving: Compacting older, less frequently accessed data into larger files can reduce storage costs and improve query performance on recent data.
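To see why the CDC and clickstream cases generate small files in the first place, consider a micro-batch ingest: every trigger commits a new Iceberg snapshot with its own (often tiny) data files. A minimal Structured Streaming sketch, assuming the Kafka connector is on the classpath and a configured SparkSession; the broker, topic, checkpoint path, and table name are illustrative:
# Micro-batch ingest from Kafka into an Iceberg table; each one-minute trigger
# produces a new snapshot and new (frequently small) data files.
from pyspark.sql import functions as F

changes = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker-1:9092")
    .option("subscribe", "orders_cdc")
    .load()
    .select(F.col("value").cast("string").alias("payload"), F.col("timestamp"))
)

query = (
    changes.writeStream
    .format("iceberg")
    .outputMode("append")
    .trigger(processingTime="1 minute")
    .option("checkpointLocation", "s3://my-datalake/checkpoints/orders_cdc/")
    .toTable("glue_catalog.db.orders_changes")
)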
System Design & Architecture
graph LR
A["Data Sources (Kafka, CDC)"] --> B("Spark Streaming/Flink");
B --> C{"Iceberg Table (S3)"};
C --> D["Query Engines (Spark SQL, Presto/Trino)"];
C --> E[Iceberg Compaction Service];
E --> C;
subgraph AWS
C
E
end
style A fill:#f9f,stroke:#333,stroke-width:2px
style D fill:#ccf,stroke:#333,stroke-width:2px
This diagram illustrates a typical Iceberg deployment. Data is ingested from sources like Kafka or CDC systems using Spark Streaming or Flink. The data is written to an Iceberg table stored in S3. Query engines like Spark SQL and Presto/Trino query the table. A dedicated Iceberg compaction service (often a scheduled Spark job or a Flink application) monitors the table and triggers compaction when necessary.
We leverage AWS EMR for our Spark and Flink clusters, and S3 for storage. The Iceberg catalog is backed by the AWS Glue Data Catalog (Hive Metastore-compatible), while the table metadata and data files themselves live in S3. This allows seamless integration with our existing data governance tools.
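For reference, here is a minimal sketch of how such an EMR compaction job can point Spark at a Glue-backed Iceberg catalog on S3. The catalog name (glue_catalog) and warehouse path are placeholders, and the Iceberg Spark runtime and AWS bundle jars are assumed to be on the classpath:
# Minimal PySpark session wiring Iceberg to the Glue Data Catalog and S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-compaction")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-datalake/warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)
With this in place, the same session can both query the table and run the rewrite procedures shown later in this post.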
Performance Tuning & Resource Management
Effective compaction requires careful tuning. Here are key parameters:
- iceberg.compaction.target-file-size: The desired size of compacted files (e.g., 128MB, 256MB). Larger files reduce metadata overhead but can increase compaction latency.
- iceberg.compaction.max-files-per-compaction: The maximum number of files to compact in a single job. Too many files can lead to memory issues.
- spark.sql.shuffle.partitions: Crucial for parallelizing compaction. We found that setting this to 200-400 based on cluster size (70 worker nodes) provided optimal performance.
- fs.s3a.connection.maximum: Controls the number of concurrent connections to S3. Increase this to improve I/O throughput (e.g., 1000).
- iceberg.compaction.enabled: Enable or disable compaction.
- iceberg.compaction.schedule: Define a schedule for compaction jobs (e.g., daily, hourly).
We monitor compaction job performance using Spark UI and CloudWatch metrics. Key metrics include:
- Compaction Duration: Indicates the efficiency of the compaction process.
- Files Compacted: Shows the number of files consolidated.
- Data Rewritten: Tracks the amount of data processed.
- S3 Read/Write Throughput: Identifies I/O bottlenecks.
A typical compaction Spark job configuration snippet:
spark.conf.set("iceberg.compaction.target-file-size", "128M")
spark.conf.set("iceberg.compaction.max-files-per-compaction", "100")
spark.conf.set("spark.sql.shuffle.partitions", "300")
spark.conf.set("fs.s3a.connection.maximum", "1000")
Failure Modes & Debugging
Common failure modes include:
- Data Skew: Uneven data distribution can lead to some tasks taking significantly longer than others, causing job failures. Solution: Use Iceberg’s partitioning capabilities to distribute data more evenly.
- Out-of-Memory Errors: Compacting too many files simultaneously can exhaust memory. Solution: Reduce iceberg.compaction.max-files-per-compaction or increase executor memory.
- Job Retries: Transient network errors or S3 throttling can cause job retries. Solution: Implement exponential backoff and retry mechanisms (see the sketch after this list).
- DAG Crashes: Errors in the compaction logic or corrupted metadata can lead to DAG crashes. Solution: Review logs, check metadata integrity, and update Iceberg libraries.
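For the retry case, a minimal backoff wrapper around the compaction call is often enough. This is a sketch; the compact() callable, attempt limits, and delays are illustrative, and in practice you would narrow the caught exception types to known transient errors:
# Simple exponential backoff with jitter around a compaction run.
import random
import time

def run_with_backoff(compact, max_attempts=5, base_delay_s=30):
    for attempt in range(1, max_attempts + 1):
        try:
            return compact()
        except Exception as exc:  # narrow to transient error types (S3 throttling, network) in practice
            if attempt == max_attempts:
                raise
            delay = base_delay_s * (2 ** (attempt - 1)) + random.uniform(0, base_delay_s)
            print(f"Compaction attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)

# Usage:
# run_with_backoff(lambda: spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.events')"))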
Debugging tools:
- Spark UI: Provides detailed information about task execution, memory usage, and shuffle statistics.
- Flink Dashboard: Similar to Spark UI, but for Flink applications.
- Iceberg CLI: Useful for inspecting table metadata and data files.
- CloudWatch Logs: Captures application logs and error messages.
Data Governance & Schema Management
Iceberg’s schema evolution capabilities are critical. We use a schema registry (Confluent Schema Registry) to manage schema changes. Compaction jobs automatically handle schema evolution, ensuring backward compatibility. We leverage Iceberg’s column pruning feature to only read necessary columns during compaction, reducing I/O overhead. Metadata is stored in AWS Glue, providing a centralized catalog for all our data assets.
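Because Iceberg resolves columns by ID rather than by position, additive changes like the ones below do not rewrite existing data files, and subsequent compaction runs simply emit the current schema. The table and column names here are illustrative:
# Illustrative schema evolution DDL; existing data files are not rewritten.
spark.sql("ALTER TABLE glue_catalog.db.events ADD COLUMNS (session_id string)")
spark.sql("ALTER TABLE glue_catalog.db.events RENAME COLUMN user_agent TO client_user_agent")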
Security and Access Control
We use AWS Lake Formation to manage access control to our Iceberg tables. Lake Formation allows us to define granular permissions at the table and column level. Data is encrypted at rest using S3 encryption and in transit using TLS. Audit logging is enabled to track data access and modifications.
Testing & CI/CD Integration
We use Great Expectations to validate data quality during compaction. We define expectations for data completeness, accuracy, and consistency. We also use DBT tests to validate schema changes. Our CI/CD pipeline includes automated regression tests that run compaction jobs on staging environments before deploying to production.
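One cheap regression check worth bolting onto the pipeline is asserting that compaction changes only the file layout, i.e., the logical row count is unchanged across the rewrite. A sketch using Iceberg snapshot time travel; the table name is illustrative, and the Great Expectations and DBT suites sit alongside checks like this:
# Verify a compaction run did not change the logical row count.
# before_snapshot_id is captured (e.g., from the snapshots metadata table) before the rewrite.
def assert_row_count_stable(spark, table, before_snapshot_id):
    before = (
        spark.read.format("iceberg")
        .option("snapshot-id", before_snapshot_id)
        .load(table)
        .count()
    )
    after = spark.read.format("iceberg").load(table).count()
    assert before == after, f"Row count changed during compaction: {before} -> {after}"

# Usage:
# assert_row_count_stable(spark, "glue_catalog.db.events", 1234567890123456789)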
Common Pitfalls & Operational Misconceptions
- Ignoring Compaction: Assuming that storage is cheap and compaction is unnecessary. This leads to performance degradation and increased costs.
- Over-Compaction: Compacting too frequently can waste resources and introduce unnecessary overhead; a small gating sketch at the end of this section shows one way to avoid it.
- Incorrect target-file-size: Choosing a file size that is too small or too large.
- Insufficient Resources: Not allocating enough memory or CPU to compaction jobs.
- Lack of Monitoring: Not monitoring compaction job performance and identifying bottlenecks.
Example log snippet showing compaction skew:
23/10/27 10:00:00 WARN TaskSetManager: Task 0 in stage 1.0 failed 4 times due to Exception in app-xxxxxxxxxxxxx/stage-1.0/task-0.0
java.lang.OutOfMemoryError: Java heap space
This indicates a task is running out of memory, likely due to processing a disproportionately large amount of data.
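To steer between the first two pitfalls (never compacting vs. compacting too often), a scheduled job can check the small-file count first and skip the run when there is nothing worth rewriting. A minimal sketch; the 32MB and 50-file thresholds and the table name are illustrative and should be tuned per table:
# Gate compaction: only rewrite when enough small files have accumulated.
from pyspark.sql import functions as F

SMALL_FILE_BYTES = 32 * 1024 * 1024
MIN_SMALL_FILES = 50

small_files = (
    spark.read.format("iceberg").load("glue_catalog.db.events.files")
    .filter(F.col("file_size_in_bytes") < SMALL_FILE_BYTES)
    .count()
)

if small_files >= MIN_SMALL_FILES:
    spark.sql("CALL glue_catalog.system.rewrite_data_files(table => 'db.events')")
else:
    print(f"Skipping compaction: only {small_files} small files")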
Enterprise Patterns & Best Practices
- Data Lakehouse Architecture: Embrace a data lakehouse architecture that combines the benefits of data lakes and data warehouses.
- Batch vs. Streaming Compaction: Use batch compaction for historical data and streaming compaction for real-time data.
- File Format Selection: Parquet is generally the preferred file format for Iceberg due to its efficient compression and columnar storage.
- Storage Tiering: Tier data based on access frequency to reduce storage costs.
- Workflow Orchestration: Use Airflow or Dagster to orchestrate compaction jobs and ensure they run reliably, as sketched below.
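For orchestration, a bare-bones Airflow DAG that submits the nightly compaction job might look like the following. The DAG id, schedule, script path, and spark-submit arguments are illustrative:
# Illustrative Airflow DAG that runs the compaction Spark job nightly at 02:00.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="iceberg_nightly_compaction",
    start_date=datetime(2023, 10, 1),
    schedule_interval="0 2 * * *",
    catchup=False,
) as dag:
    compact_events = BashOperator(
        task_id="compact_events_table",
        bash_command=(
            "spark-submit --deploy-mode cluster "
            "s3://my-datalake/jobs/compact_iceberg_table.py "
            "--table glue_catalog.db.events"
        ),
    )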
Conclusion
Data compaction is a critical component of a well-managed Iceberg data lake. By understanding the architectural considerations, performance tuning strategies, and operational realities of compaction, you can ensure that your Big Data infrastructure remains reliable, scalable, and cost-effective. Next steps include benchmarking different compaction configurations, introducing schema enforcement, and migrating to newer Iceberg versions to leverage the latest features and optimizations. Regularly reviewing compaction metrics and adjusting parameters based on workload changes is essential for maintaining optimal performance.