DEV Community

Darren XU

Best Practices for Syncing Hive Data to Apache Doris — From Scenario Matching to Performance Tuning

Hive to Apache Doris Data Synchronization: A Comprehensive Guide

In the realm of big data, Hive has long been a cornerstone for massive data warehousing and offline processing, while Apache Doris shines in real-time analytics and ad-hoc query scenarios with its robust OLAP capabilities. When enterprises aim to combine Hive's storage prowess with Doris's analytical agility, the challenge lies in efficiently and reliably syncing data between these two systems. This article provides a comprehensive guide to Hive-to-Doris data synchronization, covering use cases, technical solutions, model design, and performance optimization.

I. Core Use Cases and Scope

When target data resides in a Hive data warehouse and requires accelerated analysis via Doris's OLAP capabilities, key scenarios include:

  • Reporting & Ad-Hoc Queries: Enable fast analytics through synchronization or federated queries.

  • Unified Data Warehouse Construction: Build layered data models in Doris to enhance query efficiency.

  • Federated Query Acceleration: Directly access Hive tables from Doris to avoid frequent data ingestion.

II. Technical Pathways and Synchronization Modes

(1) Synchronization Mode

  • Full/Incremental Sync: Suitable for low-update-frequency scenarios (e.g., log data, dimension tables) where a complete data model is needed in Doris.

  • Federated Query Mode: Ideal for high-frequency, small-data-volume scenarios (e.g., real-time pricing data) to reduce storage costs and ingestion latency by querying Hive directly from Doris.

(2) Technical Solutions Overview

Four mainstream approaches exist: Broker Load, Doris On Hive (federated Catalog), Spark Load, and DataX. The choice depends on data volume, update frequency, and ETL complexity; each is analyzed in turn below.

III. In-Depth Analysis of Four Synchronization Solutions

(1) Broker Load: Asynchronous Sync for Large Datasets

  • Core Principle: Leverage Doris's built-in Broker service to asynchronously load data from HDFS (where Hive data resides) into Doris, supporting full and incremental modes.

  • Use Case:

    • Suitable for datasets ranging from tens to hundreds of GB, stored in HDFS accessible by Doris.
    • Performance: Syncing a 5.8GB SSB dataset (60M rows) takes 140–164 seconds, achieving 370k–420k rows/sec (cluster-dependent).
  • Key Operations:

    • Table Optimization: Temporarily set replication_num=1 during ingestion for speed, then adjust to 3 replicas for durability.
    • Partition Conversion: Convert Hive partition fields (e.g., yyyymm) to Doris-compatible date types using str_to_date.
    • HA Configuration: Include namenode addresses in WITH BROKER for HDFS high-availability setups.
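Putting these operations together, a Broker Load job might look like the following sketch. The label, table, HDFS path, column list, broker name, and HA property values are all placeholders, not values from the benchmark above.

```sql
-- Hypothetical Broker Load job; names and addresses are placeholders.
LOAD LABEL demo_db.lineorder_20240101
(
    DATA INFILE("hdfs://mycluster/user/hive/warehouse/ssb.db/lineorder/*")
    INTO TABLE lineorder
    COLUMNS TERMINATED BY "|"
    (lo_orderkey, lo_custkey, lo_revenue, yyyymm)
    -- Convert the Hive partition string (e.g. "202401") to a Doris DATE.
    SET (lo_orderdate = str_to_date(concat(yyyymm, '01'), '%Y%m%d'))
)
WITH BROKER "broker_name"
(
    -- HA setup: list both namenodes so failover works during the load.
    "dfs.nameservices" = "mycluster",
    "dfs.ha.namenodes.mycluster" = "nn1,nn2",
    "dfs.namenode.rpc-address.mycluster.nn1" = "namenode1:8020",
    "dfs.namenode.rpc-address.mycluster.nn2" = "namenode2:8020",
    "dfs.client.failover.proxy.provider.mycluster" =
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
PROPERTIES ("timeout" = "3600");

-- After the load finishes, restore durability
-- (this form assumes an unpartitioned table created with replication_num=1):
ALTER TABLE lineorder SET ("replication_num" = "3");
```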

(2) Doris On Hive: Low-Latency Federated Queries

  • Core Principle: Use a Catalog to access Hive metadata, enabling direct queries or INSERT INTO SELECT syncs.

  • Use Case:

    • Small datasets (e.g., pricing tables) with frequent updates (minute-level), no pre-aggregation needed in Doris.
    • Supports Text, Parquet, ORC formats (Hive ≥2.3.7).
  • Advantages:

    • No data landing in Doris; direct join queries between Hive and Doris tables with sub-0.2-second latency.
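As an illustrative sketch, the Catalog setup and the two access patterns might look like this. The catalog name, metastore URI, and table names are assumptions, and the CREATE CATALOG syntax relies on Doris's multi-catalog feature.

```sql
-- Hypothetical Hive catalog; the metastore URI is a placeholder.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Federated query: join a Hive table with a Doris table, no data landing.
SELECT d.region_name, h.price
FROM hive_catalog.ods.price_table h
JOIN internal.dw.dim_region d ON h.region_id = d.region_id;

-- Or materialize a snapshot into a Doris table when needed:
INSERT INTO internal.dw.price_table
SELECT * FROM hive_catalog.ods.price_table;
```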

(3) Spark Load: Performance Acceleration for Complex ETL

  • Core Principle: Offload data preprocessing to an external Spark cluster, reducing Doris's computational pressure.

  • Use Case:

    • Complex data cleaning (e.g., multi-table JOINs, field transformations) with Spark accessing HDFS.
    • Performance: 5.8GB synced in 137 seconds (440k rows/sec), outperforming Broker Load.
  • Configuration:

    • Spark Settings: Update Doris FE config (fe.conf) with spark_home_default_dir and spark_resource_path:

enable_spark_load = true
spark_home_default_dir = /opt/cloudera/parcels/CDH/lib/spark
spark_resource_path = /opt/cloudera/parcels/CDH/lib/spark/spark-2x.zip
  • External Resource Creation:
CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
    "type" = "spark",
    "spark.master" = "yarn",
    "spark.submit.deployMode" = "cluster",
    "spark.executor.memory" = "1g",
    "spark.yarn.queue" = "queue0",
    "spark.hadoop.yarn.resourcemanager.address" = "nodename:8032",
    "spark.hadoop.fs.defaultFS" = "hdfs://nodename:8020",
    "working_dir" = "hdfs://nodename:8020/tmp/doris",
    "broker" = "broker_name_1"
);
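With the resource in place, a Spark Load job can reference it. The sketch below uses placeholder label, path, and table names; the per-job properties override the resource defaults.

```sql
-- Hypothetical Spark Load job using the "spark0" resource defined above.
LOAD LABEL demo_db.spark_lineorder_20240101
(
    DATA INFILE("hdfs://nodename:8020/user/hive/warehouse/ssb.db/lineorder/*")
    INTO TABLE lineorder
    COLUMNS TERMINATED BY "|"
)
WITH RESOURCE 'spark0'
(
    -- Per-job override of the resource's Spark settings.
    "spark.executor.memory" = "2g"
)
PROPERTIES ("timeout" = "3600");
```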

(4) DataX: Heterogeneous Data Source Compatibility

  • Core Principle: Use Alibaba's open-source DataX tool with custom hdfsreader and doriswriter plugins.

  • Use Case:

    • Non-standard file formats (e.g., CSV) or non-HA HDFS environments.
  • Drawback: Lower performance (5.8GB in 1,421 seconds, 40k rows/sec) – use as a fallback.

  • Configuration Example:

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/data/ssb/*",
            "defaultFS": "hdfs://xxxx:9000",
            "fileType": "text"
          }
        },
        "writer": {
          "name": "doriswriter",
          "parameter": {
            "feLoadUrl": ["xxxx:18040"],
            "database": "test",
            "table": "lineorder3"
          }
        }
      }
    ]
  }
}

IV. Decision Tree for Solution Selection

  • Priority: Broker Load – Large datasets (≥10GB), minimal ETL, high throughput needs.

  • Second Choice: Doris On Hive – Small datasets (<1GB), frequent updates, federated query requirements.

  • Complex ETL: Spark Load – Data preprocessing needed; leverage Spark cluster resources.

  • Fallback: DataX – Special formats or network constraints; prioritize compatibility over performance.

V. Data Modeling and Storage Optimization

(1) Data Model Selection

  • Aggregate Model: Ideal for log statistics; stores aggregated metrics by key to reduce data volume.

  • Unique Model: Ensures key uniqueness for slowly changing dimensions (equivalent to Replace in Aggregate).

  • Duplicate Model: Stores raw data for multi-dimensional analysis without aggregation.
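As illustrative sketches (all table and column names are made up), the three models differ mainly in the key clause and in whether value columns carry an aggregation type:

```sql
-- Aggregate: metrics are rolled up by key on ingest, shrinking stored data.
CREATE TABLE log_stats (
    event_date DATE,
    page_id INT,
    pv BIGINT SUM DEFAULT "0"
) AGGREGATE KEY (event_date, page_id)
DISTRIBUTED BY HASH(page_id) BUCKETS 10;

-- Unique: later loads replace rows with the same key (slowly changing dimensions).
CREATE TABLE dim_user (
    user_id BIGINT,
    user_name VARCHAR(96),
    city VARCHAR(96)
) UNIQUE KEY (user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

-- Duplicate: raw rows are stored as-is for multi-dimensional analysis.
CREATE TABLE raw_events (
    event_time DATETIME,
    event_type VARCHAR(64),
    payload VARCHAR(1024)
) DUPLICATE KEY (event_time)
DISTRIBUTED BY HASH(event_time) BUCKETS 10;
```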

(2) Data Type Mapping

  • String to Varchar: Use Varchar for Doris key columns (avoid String); reserve 3x Hive field length for Chinese characters.

  • Type Consistency: Convert Hive dates to Doris Date/DateTime and numeric types to Decimal/Float to avoid query-time conversions.
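For example, when pulling rows from a Hive catalog table (all names here are hypothetical), the conversions can be applied once at sync time rather than on every query:

```sql
-- Hypothetical source/target tables; cast once during ingestion.
INSERT INTO internal.dw.price_table (sku, price, dt)
SELECT
    CAST(sku_str AS VARCHAR(96)),          -- ~3x the Hive field length for Chinese text
    CAST(price_str AS DECIMAL(12, 2)),     -- numeric type fixed at sync time
    str_to_date(dt_str, '%Y-%m-%d')        -- Hive string date -> Doris DATE
FROM hive_catalog.ods.price_table;
```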

(3) Partitioning & Bucketing Strategies

  • Partition Keys: Reuse Hive partition fields (e.g., year-month) converted via str_to_date for pruning.

  • Bucket Keys: Choose high-cardinality fields (e.g., order ID); keep single bucket size under 10GB to avoid skew and segment limits (default ≤200).
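A sketch of a table applying both ideas; the partition ranges, bucket count, and column names are assumptions to be sized against the actual data:

```sql
-- Illustrative range-partitioned, bucketed table.
CREATE TABLE lineorder (
    lo_orderdate DATE,      -- derived from the Hive yyyymm partition field
    lo_orderkey BIGINT,     -- high-cardinality bucket key to avoid skew
    lo_revenue DECIMAL(18, 2)
) DUPLICATE KEY (lo_orderdate, lo_orderkey)
PARTITION BY RANGE (lo_orderdate) (
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
DISTRIBUTED BY HASH(lo_orderkey) BUCKETS 32;
```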

VI. Performance Comparison and Best Practices

| Solution | 5.8GB Sync Time | Throughput | Query Latency | Use Case |
| --- | --- | --- | --- | --- |
| Broker Load | 140–164 s | 370k–420k rows/s | 0.2–0.5 s | Large-scale full sync |
| Spark Load | 137 s | 440k rows/s | 0.3 s | ETL-intensive sync |
| Doris On Hive | Immediate (no sync) | N/A | 0.2–0.4 s | High-frequency federated queries |
| DataX | 1,421 s | 40k rows/s | 1–3 s | Special format compatibility |

Optimization Tips:

  • Small File Merging: Use HDFS commands to merge small files and reduce Broker Load scanning.

  • Model Tuning: Use Duplicate model for fast ingestion, then create materialized views for query speed.

  • Monitoring: Track load status with SHOW LOAD in Doris.
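For example (the label is a placeholder; an asynchronous job moves through PENDING, ETL, LOADING, and finally FINISHED or CANCELLED):

```sql
-- Check the state of a specific load job in the current database.
SHOW LOAD WHERE LABEL = "lineorder_20240101" ORDER BY CreateTime DESC LIMIT 1;
```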

VII. Conclusion

Combining Hive and Doris unlocks synergies between offline storage and real-time analytics. By choosing the right sync strategy (prioritizing Broker/Spark Load), optimizing data models (Aggregate for storage, bucketing for skew), and leveraging federated queries (Doris On Hive), enterprises can build efficient data architectures. Test with small datasets (e.g., SSB) before scaling to production, and stay updated with Doris community improvements (e.g., predicate pushdown) for ongoing performance gains.

If you want more information and help with Doris, you can join us on Slack:

https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-31dcopb90-zqBVqBrOIYhmy4U29fv9yQ
