Introduction
Recently, I've had several opportunities to build data analytics platforms with Databricks.
This article summarizes what I've learned, as a reference for future work.
Prerequisites for this Article
- This article explains the basic concepts of Databricks.
- A hands-on build using sample scripts will be covered in a separate article.
- It assumes a deployment in combination with AWS.
What is Databricks
Databricks is a data analytics platform based on Apache Spark.
It covers the entire data analytics workflow, including data collection, processing, analysis, and visualization.
It can be deployed on multiple cloud vendors, including AWS, Azure, and GCP.
Architecture Overview
Databricks has an account as the top-level resource, with workspaces underneath it.
The account manages billing, users, and workspaces.
A workspace is where actual data analysis takes place, managing notebooks, jobs, SQL warehouses, and more.
Using AWS Organizations as an analogy, the Databricks account corresponds to the management account, and workspaces correspond to the member AWS accounts.
┌──────────────────────────────────────────────────────────────────┐
│                        Databricks Account                        │
│                                                                  │
│  ┌─ Management functions ─────────────────────────────────────┐  │
│  │ • Billing management                                       │  │
│  │ • User management                                          │  │
│  │ • Workspace management                                     │  │
│  │ • Security configuration                                   │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
│  ┌─ Workspace A ──────────────┐  ┌─ Workspace B ──────────────┐  │
│  │                            │  │                            │  │
│  │ ┌─ Control Plane ────────┐ │  │ ┌─ Control Plane ────────┐ │  │
│  │ │ • Web UI               │ │  │ │ • Web UI               │ │  │
│  │ │ • Job Scheduler        │ │  │ │ • Job Scheduler        │ │  │
│  │ │ • Metadata Store       │ │  │ │ • Metadata Store       │ │  │
│  │ │ • Security Manager     │ │  │ │ • Security Manager     │ │  │
│  │ └────────────────────────┘ │  │ └────────────────────────┘ │  │
│  │             │              │  │             │              │  │
│  │             ▼              │  │             ▼              │  │
│  │ ┌─ Compute Plane ────────┐ │  │ ┌─ Compute Plane ────────┐ │  │
│  │ │ • Clusters             │ │  │ │ • Clusters             │ │  │
│  │ │ • SQL Warehouses       │ │  │ │ • SQL Warehouses       │ │  │
│  │ │ • Notebooks            │ │  │ │ • Notebooks            │ │  │
│  │ │ • Jobs                 │ │  │ │ • Jobs                 │ │  │
│  │ └────────────────────────┘ │  │ └────────────────────────┘ │  │
│  └────────────────────────────┘  └────────────────────────────┘  │
└──────────────────────────────────────────────────────────────────┘
Within a workspace, there are two main components: the control plane for management and the compute plane for data processing.
The control plane is a web application hosted and managed by Databricks.
The compute plane consists of the resources that perform the actual data processing. There are two configurations:
- Classic configuration: clusters are created as EC2 instances inside the user's own AWS account (see the sketch below)
- Serverless configuration: compute runs inside Databricks-managed AWS accounts, so the underlying AWS resources are invisible to the user
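To make the classic configuration concrete, here is a minimal sketch of creating a classic (EC2-backed) cluster through the Databricks Clusters REST API. This is illustrative only: the workspace URL, token, instance type, and runtime version are placeholders you would replace with values valid for your own workspace.

```python
# A minimal sketch: create a classic cluster via the Databricks REST API.
# The URL, token, node type, and Spark version below are placeholders.
import requests

WORKSPACE_URL = "https://dbc-example.cloud.databricks.com"  # hypothetical
TOKEN = "<personal-access-token>"                           # hypothetical

resp = requests.post(
    f"{WORKSPACE_URL}/api/2.1/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "cluster_name": "demo-classic-cluster",
        "spark_version": "15.4.x-scala2.12",  # check your workspace's available runtimes
        "node_type_id": "i3.xlarge",          # EC2 instance type launched in your AWS account
        "num_workers": 2,
        "autotermination_minutes": 30,        # avoid idle EC2 costs
    },
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```

In the classic configuration, the EC2 instances this call launches appear in your own AWS account; in the serverless configuration there is no equivalent instance to see.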
The following page provides a clear explanation:
Databricks architecture overview
Databricks Features
What are the benefits of adopting Databricks as a data analytics platform? Its main distinguishing features are:
- Delta Lake
- Unity Catalog
Delta Lake is an open-source storage format.
It consists of Apache Parquet data files plus a transaction log (JSON-format delta logs) and metadata, enabling the ACID transactions that traditional data lakes lacked.
Specifically, while a read that overlaps an in-progress update on a traditional data lake can return inconsistent data, Delta Lake's transaction functionality guarantees that readers always see a consistent snapshot.
Databricks delta lake
Databricks provides default support for Delta Lake.
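As a quick illustration, here is a minimal PySpark sketch of that transactional behavior. It assumes a Databricks notebook where `spark` is predefined; the table name `main.demo.events` is a hypothetical placeholder.

```python
# Assumes a Databricks notebook: `spark` is predefined and Delta Lake
# is the default table format. `main.demo.events` is a placeholder name.
df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])

# Each write is an ACID transaction, recorded as a JSON entry in the
# delta log alongside the Parquet data files.
df.write.format("delta").mode("overwrite").saveAsTable("main.demo.events")

# Readers see a consistent snapshot, even if a write is in progress.
spark.read.table("main.demo.events").show()

# The transaction log also enables time travel to earlier versions.
spark.sql("SELECT * FROM main.demo.events VERSION AS OF 0").show()
```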
Unity Catalog is a data governance feature provided by Databricks.
It enables centralized access control and quality management for data and storage across workspaces.
Databricks unity catalog
Specific use cases include the following (a short example follows the list):
- Permission settings for external storage such as S3
- Permission settings for Databricks tables
- Capturing table lineage
- Collecting column statistics with Lakehouse Monitoring
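For the first two use cases, here is a minimal sketch of Unity Catalog access control issued as SQL from a notebook. The group `analysts`, the objects under `main.sales`, and the external location `raw_bucket` are hypothetical placeholders.

```python
# Assumes a Databricks notebook with Unity Catalog enabled; all object
# names below are hypothetical placeholders.

# Let a group browse a catalog and schema, and query one table in it.
spark.sql("GRANT USE CATALOG ON CATALOG main TO `analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `analysts`")

# Access to external storage such as S3 is granted through an external
# location object rather than directly on the bucket.
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION raw_bucket TO `analysts`")
```

Because these grants live in the Unity Catalog metastore rather than in a single workspace, they apply consistently across every workspace attached to that metastore.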
Why Choose Databricks
While a simple analytics system that stores data in S3 or a DWH and analyzes it with Python or SQL could be built from AWS services alone, Databricks offers the following advantages:
- Strong data consistency guarantees: transaction functionality through Delta Lake
- Integrated data quality management: quality monitoring and governance through Unity Catalog
- Multi-cloud support: integrated data collection and analysis even when an organization spans multiple cloud environments

These features enable the operation of a high-quality data platform, which I believe is the main reason for adopting Databricks.
Conclusion
That's all for now. This article briefly covered the basics of Databricks.
I plan to publish a hands-on build article using Terraform in a future post.