Hadoop - HDFS (Hadoop Distributed File System)

Last Updated : 12 Aug, 2025

Before learning about HDFS (Hadoop Distributed File System), it’s important to understand what a file system is. A file system is a way an operating system organizes and manages files on disk storage. It helps users store, maintain, and retrieve data from the disk.

Example: Windows uses file systems like NTFS (New Technology File System) and FAT32 (File Allocation Table 32). FAT32 is an older file system that is still supported on older versions such as Windows XP. Similarly, Linux uses file systems such as ext3 and ext4.

Distributed File System

DFS stands for Distributed File System. It is a way of storing files across multiple nodes in a distributed manner. DFS provides the abstraction of a single large system whose storage capacity equals the sum of the storage of all the nodes in the cluster.

Why We Need DFS?

Storing very large files (e.g., 30TB) on a single system is impractical because:

  • Disk capacity of one machine is limited and can only grow so much.
  • Processing huge datasets on a single machine is inefficient and slow.

Distributed File Systems (DFS) overcome these issues by storing data across multiple machines, enabling faster and scalable processing.

Example:

Suppose you have a 40TB file to process. On a single machine, it might take about 4 hours to complete. However, using a Distributed File System (DFS), as shown in the image below, the 40TB file is split across 4 nodes in a cluster, with each node storing 10TB. Since all nodes work simultaneously, processing time drops to just 1 hour. This demonstrates why DFS is essential for faster and more efficient big data processing.

Local File System Processing:  

[Figure: Local file system processing]

Distributed File System Processing: 

[Figure: Distributed file system processing]
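
The arithmetic behind this example can be sketched in a few lines of Java. The 4-hour single-machine time and the 4-node cluster are assumptions taken from the example above, not measured figures:

public class DfsSpeedup {
    public static void main(String[] args) {
        int totalTb = 40;                 // total file size in TB
        int nodes = 4;                    // nodes in the DFS cluster
        double singleMachineHours = 4.0;  // assumed processing time on one machine

        int tbPerNode = totalTb / nodes;                  // 10 TB stored on each node
        double clusterHours = singleMachineHours / nodes; // nodes work on their shares in parallel

        System.out.println(tbPerNode + " TB per node, about " + clusterHours + " hours on the cluster");
    }
}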

Now that you are familiar with the term file system, let's begin with HDFS.

HDFS 

HDFS (Hadoop Distributed File System) is the main storage system in Hadoop. It stores large files by breaking them into blocks (default 128 MB) and distributing them across multiple low-cost machines.

HDFS ensures fault-tolerance by keeping copies of data blocks on different machines. This makes it reliable, scalable and ideal for handling big data efficiently.
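
As a rough illustration of how blocking and replication add up, the sketch below assumes the default 128 MB block size, a replication factor of 3, and a hypothetical 500 MB file:

public class BlockMath {
    public static void main(String[] args) {
        long fileMb = 500;    // hypothetical file size in MB
        long blockMb = 128;   // default HDFS block size
        int replication = 3;  // default replication factor

        long blocks = (fileMb + blockMb - 1) / blockMb;       // ceil(500 / 128) = 4 blocks
        long lastBlockMb = fileMb - (blocks - 1) * blockMb;   // 116 MB in the last block
        long rawStorageMb = fileMb * replication;             // 1500 MB of raw cluster storage

        System.out.println(blocks + " blocks (last block " + lastBlockMb + " MB), "
                + rawStorageMb + " MB including replicas");
    }
}

Note that the last block only occupies as much disk space as the data it actually holds, so a small final block does not waste a full 128 MB.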

Features of HDFS

  • Files stored in HDFS are easy to access through standard tools and APIs (see the sketch after this list).
  • HDFS provides high availability and fault tolerance.
  • Provides scalability to scale up or scale down the number of nodes as per requirement.
  • Data is stored in a distributed manner, i.e., various DataNodes are responsible for storing the data.
  • HDFS provides replication, so there is no fear of data loss.
  • HDFS provides high reliability and can store data in the range of petabytes.
  • HDFS has built-in servers in the NameNode and DataNodes that help easily retrieve cluster information.
  • Provides high throughput.
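
A minimal sketch of accessing HDFS from Java using the standard FileSystem API. It assumes a running cluster reachable through the fs.defaultFS setting picked up from the classpath configuration, and the path /user/demo/hello.txt is purely illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsAccessExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // connects to the cluster named in fs.defaultFS

        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write a small file; for large files HDFS splits the data into blocks automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeBytes("Hello HDFS\n");
        }

        // Request a specific replication factor for this file.
        fs.setReplication(file, (short) 3);

        // Read the file back and print its contents.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}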

Storage Daemons in HDFS

Hadoop follows a master-slave architecture for processing data with MapReduce. Similarly, HDFS has two main components that follow this structure:

1. NameNode (Master):

The NameNode acts as the master of the Hadoop cluster. It is responsible for storing metadata, i.e., data about the actual data. This includes information like:

  • File names
  • File sizes
  • Block numbers and IDs
  • Locations of blocks stored in DataNodes

The NameNode manages and monitors the DataNodes and sends them instructions to create, delete or replicate data blocks. It also receives regular heartbeat signals and block reports from the DataNodes to ensure everything is functioning properly.

Since the NameNode controls the entire cluster, it requires high RAM and processing power to manage the system efficiently.
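
Because the NameNode holds all of this metadata, a client can list it without touching any data blocks. The sketch below, assuming the same hypothetical /user/demo directory as before, prints the per-file metadata that the NameNode serves:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Each FileStatus is answered from the NameNode's metadata, not from the DataNodes.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath()
                    + "  size=" + status.getLen()
                    + "  blockSize=" + status.getBlockSize()
                    + "  replication=" + status.getReplication());
        }
        fs.close();
    }
}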

2. DataNode (Slave):

The DataNodes act as slaves and are responsible for storing the actual data blocks in the Hadoop cluster. You can have dozens or even hundreds of DataNodes, and more DataNodes means more storage capacity.

Each DataNode performs tasks like creating, deleting or replicating data blocks, all based on instructions received from the NameNode. To store large files efficiently, DataNodes should have a high storage capacity.

[Figure: NameNode and DataNode architecture]
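
To see which DataNodes actually hold the blocks of a file, a client can ask for its block locations. A minimal sketch, with /user/demo/bigfile.dat as a hypothetical file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        FileStatus status = fs.getFileStatus(new Path("/user/demo/bigfile.dat"));
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // One line per block: its position in the file and the DataNodes holding its replicas.
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}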

Objectives and Assumptions of HDFS

  1. System Failure Tolerance: Since Hadoop uses low-cost hardware, node failures are expected. HDFS is built to detect and recover from such failures automatically.
  2. Handling Large Datasets: HDFS can manage files ranging from gigabytes to petabytes efficiently within a single cluster.
  3. Data-Local Computation: HDFS prioritizes moving computation closer to the data instead of transferring data over the network, improving speed and reducing congestion.
  4. Platform Portability: HDFS is designed to work across various hardware and software environments, ensuring flexibility and adaptability.
  5. Simple Coherency Model: HDFS follows a "write-once, read-many" model. Once a file is written and closed, it cannot be modified; it can only be appended to, which reduces data consistency issues (see the sketch after this list).
  6. Scalability: It supports easy scaling by adding or removing nodes, allowing the system to grow with data needs without affecting performance.
  7. Security: HDFS includes authentication, authorization, encryption, and integrity checks to ensure secure data storage and access.
  8. Data Locality: Computation is performed near the data location, improving performance and reducing network overhead.
  9. Cost-Effective: Runs efficiently on inexpensive hardware and scales incrementally, making it a budget-friendly choice for big data processing.
  10. Support for Multiple File Formats: HDFS can store structured, semi-structured, and unstructured data, simplifying data management across various formats.
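
The write-once, read-many model mentioned in point 5 can be seen directly in the API: a file is created and closed once, and afterwards new data can only be appended at the end (if append is enabled on the cluster). A minimal sketch, with /user/demo/events.log as a hypothetical path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceAppendExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path log = new Path("/user/demo/events.log");

        // Write the file once and close it; the bytes already written cannot be modified later.
        try (FSDataOutputStream out = fs.create(log, true)) {
            out.writeBytes("first record\n");
        }

        // New data can only be added at the end of the file via append.
        try (FSDataOutputStream out = fs.append(log)) {
            out.writeBytes("second record\n");
        }
        fs.close();
    }
}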
