Hive to Apache Doris Data Synchronization: A Comprehensive Guide
In the realm of big data, Hive has long been a cornerstone for massive data warehousing and offline processing, while Apache Doris shines in real-time analytics and ad-hoc query scenarios with its robust OLAP capabilities. When enterprises aim to combine Hive's storage prowess with Doris's analytical agility, the challenge lies in efficiently and reliably syncing data between these two systems. This article provides a comprehensive guide to Hive-to-Doris data synchronization, covering use cases, technical solutions, model design, and performance optimization.
I. Core Use Cases and Scope
When target data resides in a Hive data warehouse and requires accelerated analysis via Doris's OLAP capabilities, key scenarios include:
Reporting & Ad-Hoc Queries: Enable fast analytics through synchronization or federated queries.
Unified Data Warehouse Construction: Build layered data models in Doris to enhance query efficiency.
Federated Query Acceleration: Directly access Hive tables from Doris to avoid frequent data ingestion.
II. Technical Pathways and Synchronization Modes
(1) Synchronization Mode
Full/Incremental Sync: Suitable for low-update-frequency scenarios (e.g., log data, dimension tables) where a complete data model is needed in Doris.
Federated Query Mode: Ideal for high-frequency, small-data-volume scenarios (e.g., real-time pricing data) to reduce storage costs and ingestion latency by querying Hive directly from Doris.
(2) Technical Solutions Overview
Four mainstream approaches exist (Broker Load, Doris On Hive, Spark Load, and DataX), chosen based on data volume, update frequency, and ETL complexity.
III. In-Depth Analysis of Four Synchronization Solutions
(1) Broker Load: Asynchronous Sync for Large Datasets
Core Principle: Leverage Doris's built-in Broker service to asynchronously load data from HDFS (where Hive data resides) into Doris, supporting full and incremental modes.
Use Case:
- Suitable for datasets ranging from tens to hundreds of GB, stored in HDFS accessible by Doris.
- Performance: syncing a 5.8GB SSB dataset (60M rows) takes 140–164 seconds, achieving 370k–420k rows/sec (cluster-dependent).

Key Operations:
- Table Optimization: temporarily set `replication_num=1` during ingestion for speed, then adjust to 3 replicas for durability.
- Partition Conversion: convert Hive partition fields (e.g., `yyyymm`) to Doris-compatible date types using `str_to_date`.
- HA Configuration: include namenode addresses in the `WITH BROKER` properties for HDFS high-availability setups.
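Putting these operations together, a Broker Load statement might look like the following sketch. The label, paths, broker name, and column names are placeholders; adapt them to your cluster:

```sql
-- Hypothetical Broker Load job: label, paths, broker name, and columns are placeholders.
LOAD LABEL ssb.lineorder_202406
(
    DATA INFILE("hdfs://nameservice1/user/hive/warehouse/ssb.db/lineorder/yyyymm=202406/*")
    INTO TABLE lineorder
    COLUMNS TERMINATED BY "|"
    (lo_orderkey, lo_custkey, lo_revenue, yyyymm)
    -- Convert the Hive partition string (e.g., "202406") to a Doris DATE
    SET (lo_orderdate = str_to_date(yyyymm, '%Y%m'))
)
WITH BROKER "broker_name"
(
    -- HA namenode configuration so the Broker survives a namenode failover
    "dfs.nameservices" = "nameservice1",
    "dfs.ha.namenodes.nameservice1" = "nn1,nn2",
    "dfs.namenode.rpc-address.nameservice1.nn1" = "namenode01:8020",
    "dfs.namenode.rpc-address.nameservice1.nn2" = "namenode02:8020",
    "dfs.client.failover.proxy.provider.nameservice1" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
PROPERTIES ("timeout" = "3600");
```

Because Broker Load is asynchronous, the statement returns immediately; track progress with `SHOW LOAD`.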
(2) Doris On Hive: Low-Latency Federated Queries
Core Principle: Use a Catalog to access Hive metadata, enabling direct queries or `INSERT INTO SELECT` syncs.
Use Case:
- Small datasets (e.g., pricing tables) with frequent updates (minute-level), no pre-aggregation needed in Doris.
- Supports Text, Parquet, and ORC formats (Hive ≥ 2.3.7).

Advantages:
- No data landing in Doris; direct join queries between Hive and Doris tables with sub-0.2-second latency.
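As a sketch of this mode (the metastore URI and table names below are hypothetical), the catalog setup, a federated join, and an optional sync look like:

```sql
-- Hypothetical: expose Hive metadata to Doris via a Hive Metastore catalog.
CREATE CATALOG hive_catalog PROPERTIES (
    "type" = "hms",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Federated query: join a Hive table with a local Doris table directly,
-- no data landing in Doris
SELECT d.order_id, h.price
FROM internal.test.orders d
JOIN hive_catalog.ods.price_dim h ON d.sku = h.sku;

-- Or materialize the Hive table into Doris when pre-aggregation is needed
INSERT INTO internal.test.price_dim
SELECT * FROM hive_catalog.ods.price_dim;
```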
(3) Spark Load: Performance Acceleration for Complex ETL
Core Principle: Offload data preprocessing to an external Spark cluster, reducing Doris's computational pressure.
Use Case:
- Complex data cleaning (e.g., multi-table JOINs, field transformations) with Spark accessing HDFS.
- Performance: 5.8GB synced in 137 seconds (440k rows/sec), outperforming Broker Load.

Configuration:
- Spark Settings: update the Doris FE config (`fe.conf`) with `spark_home` and `spark_resource_path`:

```
enable_spark_load = true
spark_home_default_dir = /opt/cloudera/parcels/CDH/lib/spark
spark_resource_path = /opt/cloudera/parcels/CDH/lib/spark/spark-2x.zip
```
- External Resource Creation:
```sql
CREATE EXTERNAL RESOURCE "spark0"
PROPERTIES
(
    "type" = "spark",
    "spark.master" = "yarn",
    "spark.submit.deployMode" = "cluster",
    "spark.executor.memory" = "1g",
    "spark.yarn.queue" = "queue0",
    "spark.hadoop.yarn.resourcemanager.address" = "nodename:8032",
    "spark.hadoop.fs.defaultFS" = "hdfs://nodename:8020",
    "working_dir" = "hdfs://nodename:8020/tmp/doris",
    "broker" = "broker_name_1"
);
```
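With the resource registered, a Spark Load job references it via `WITH RESOURCE`. The label, path, and table below are illustrative placeholders:

```sql
-- Hypothetical Spark Load job using the "spark0" resource defined above;
-- preprocessing (sorting, aggregation) runs on the Spark cluster, not on Doris.
LOAD LABEL test.lineorder_spark_202406
(
    DATA INFILE("hdfs://nodename:8020/user/hive/warehouse/ssb.db/lineorder/*")
    INTO TABLE lineorder
    COLUMNS TERMINATED BY "|"
)
WITH RESOURCE 'spark0'
(
    -- Per-job overrides of the resource's Spark settings
    "spark.executor.memory" = "2g"
)
PROPERTIES ("timeout" = "3600");
```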
(4) DataX: Heterogeneous Data Source Compatibility
Core Principle: Use Alibaba's open-source DataX tool with its `hdfsreader` and `doriswriter` plugins.
Use Case:
- Non-standard file formats (e.g., CSV) or non-HA HDFS environments.
Drawback: Lower performance (5.8GB in 1,421 seconds, 40k rows/sec) – use as a fallback.
Configuration Example:
```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "hdfsreader",
          "parameter": {
            "path": "/data/ssb/*",
            "defaultFS": "hdfs://xxxx:9000",
            "fileType": "text"
          }
        },
        "writer": {
          "name": "doriswriter",
          "parameter": {
            "feLoadUrl": ["xxxx:18040"],
            "database": "test",
            "table": "lineorder3"
          }
        }
      }
    ]
  }
}
```
IV. Decision Tree for Solution Selection
Priority: Broker Load – Large datasets (≥10GB), minimal ETL, high throughput needs.
Second Choice: Doris On Hive – Small datasets (<1GB), frequent updates, federated query requirements.
Complex ETL: Spark Load – Data preprocessing needed; leverage Spark cluster resources.
Fallback: DataX – Special formats or network constraints; prioritize compatibility over performance.
V. Data Modeling and Storage Optimization
(1) Data Model Selection
Aggregate Model: Ideal for log statistics; stores aggregated metrics by key to reduce data volume.
Unique Model: Ensures key uniqueness for slowly changing dimensions (equivalent to `REPLACE` aggregation in the Aggregate model).
Duplicate Model: Stores raw data for multi-dimensional analysis without aggregation.
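The three models can be sketched as minimal DDLs (table and column names here are hypothetical examples, not from the SSB dataset):

```sql
-- Aggregate: metrics rolled up by key, shrinking stored data volume
CREATE TABLE site_visits (
    dt DATE,
    site_id INT,
    pv BIGINT SUM DEFAULT "0"   -- summed on ingestion for rows sharing a key
)
AGGREGATE KEY(dt, site_id)
DISTRIBUTED BY HASH(site_id) BUCKETS 10;

-- Unique: the latest row wins per key, like REPLACE in the Aggregate model
CREATE TABLE dim_user (
    user_id BIGINT,
    user_name VARCHAR(128),
    city VARCHAR(64)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 10;

-- Duplicate: raw rows kept as-is for flexible multi-dimensional analysis
CREATE TABLE raw_events (
    event_time DATETIME,
    event_type VARCHAR(32),
    payload VARCHAR(1024)
)
DUPLICATE KEY(event_time, event_type)
DISTRIBUTED BY HASH(event_type) BUCKETS 10;
```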
(2) Data Type Mapping
String to Varchar: Use Varchar for Doris key columns (avoid String); reserve 3x Hive field length for Chinese characters.
Type Consistency: Convert Hive dates to Doris `Date`/`DateTime` and numeric types to `Decimal`/`Float` to avoid query-time conversions.
(3) Partitioning & Bucketing Strategies
Partition Keys: Reuse Hive partition fields (e.g., year-month) converted via `str_to_date` for partition pruning.
Bucket Keys: Choose high-cardinality fields (e.g., order ID); keep single bucket size under 10GB to avoid skew and segment limits (default ≤ 200).
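A table combining both strategies might look like this sketch (the SSB-style column names and partition boundaries are illustrative):

```sql
-- Hypothetical table mirroring a Hive yyyymm partition layout:
-- range partitions for pruning, hash buckets on a high-cardinality key.
CREATE TABLE lineorder (
    lo_orderdate DATE,
    lo_orderkey  BIGINT,
    lo_revenue   BIGINT
)
DUPLICATE KEY(lo_orderdate, lo_orderkey)
PARTITION BY RANGE(lo_orderdate)
(
    PARTITION p202401 VALUES LESS THAN ("2024-02-01"),
    PARTITION p202402 VALUES LESS THAN ("2024-03-01")
)
DISTRIBUTED BY HASH(lo_orderkey) BUCKETS 32;
```

Size the bucket count so that each bucket stays well under 10GB as the table grows.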
VI. Performance Comparison and Best Practices
| Solution | 5.8GB Sync Time | Throughput | Query Latency | Use Case |
| --- | --- | --- | --- | --- |
| Broker Load | 140–164 s | 370k–420k rows/s | 0.2–0.5 s | Large-scale full sync |
| Spark Load | 137 s | 440k rows/s | 0.3 s | ETL-intensive sync |
| Doris On Hive | Immediate | – | 0.2–0.4 s | High-frequency federated queries |
| DataX | 1,421 s | 40k rows/s | 1–3 s | Special format compatibility |
Optimization Tips:
Small File Merging: Use HDFS commands to merge small files and reduce Broker Load scanning.
Model Tuning: Use Duplicate model for fast ingestion, then create materialized views for query speed.
Monitoring: Track load status with `SHOW LOAD` in Doris.
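For example, to check a job by its label (the database and label names are placeholders):

```sql
-- Inspect the most recent load job for a label: state, progress, error URL
SHOW LOAD FROM ssb WHERE LABEL = "lineorder_202406" ORDER BY CreateTime DESC LIMIT 1;
```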
VII. Conclusion
Combining Hive and Doris unlocks synergies between offline storage and real-time analytics. By choosing the right sync strategy (prioritizing Broker/Spark Load), optimizing data models (Aggregate for storage, bucketing for skew), and leveraging federated queries (Doris On Hive), enterprises can build efficient data architectures. Test with small datasets (e.g., SSB) before scaling to production, and stay updated with Doris community improvements (e.g., predicate pushdown) for ongoing performance gains.
If you want more information and help with Doris, you can join the community on Slack:
https://join.slack.com/t/apachedoriscommunity/shared_invite/zt-31dcopb90-zqBVqBrOIYhmy4U29fv9yQ