Hadoop Series - HDFS#
1. HDFS Startup Process#
- Load the fsimage metadata file
- Load and replay the edits log
- Save a checkpoint
- Enter safe mode, whose purpose is to check the block replication rate and verify that redundancy meets the requirements.
2. HDFS Operating Mechanism#
User files are split into blocks and stored across multiple DataNode servers, with each file having multiple replicas stored throughout the cluster, which enhances data security.
3. Basic Architecture#
- NameNode: the management node of the entire file system, responsible for recording how files are split into blocks and for tracking where those blocks are stored. Its metadata directory contains:
  - fsimage: the on-disk image file that stores the metadata.
  - edits: the log file that records namespace operations.
  - fstime: the time of the last checkpoint.
  - seen_txid: the transaction ID of the last edit.
  - VERSION: namespace and cluster identification information.
- Secondary NameNode: an auxiliary background process (not a standby or disaster-recovery node for the NameNode) that communicates with the NameNode to periodically save snapshots of the HDFS metadata.
- DataNode: the data nodes, responsible for reading and writing HDFS data blocks on the local file system.
HDFS is not suitable for storing large numbers of small files: every file, directory, and block carries metadata held in the NameNode's memory, so as the number of small files grows, the metadata grows with it and puts pressure on the NameNode. A rough estimate of that pressure is sketched below.
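A back-of-the-envelope sketch (not from the article), assuming the commonly cited rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file, directory, or block):

```java
// Rough estimate, assuming ~150 bytes of NameNode heap per namespace object.
// The figures are illustrative assumptions, not measurements.
long bytesPerObject = 150;
long smallFiles = 10_000_000L;            // ten million small files
long objects = smallFiles * 2;            // roughly one file object + one block object each
System.out.println((objects * bytesPerObject) / (1024 * 1024) + " MB of NameNode heap"); // ~2861 MB
```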
Federated HDFS#
Each NameNode maintains a namespace, and the namespaces of different NameNodes are independent of each other. Each DataNode needs to register with each NameNode.
- Multiple NameNodes share the DataNode storage resources of a single cluster, and each NameNode can provide service independently.
- Each NameNode defines its own block pool with a separate ID, and each DataNode provides storage for all block pools.
- A DataNode reports block information to the corresponding NameNode according to the block pool ID, and also reports its available local storage to all NameNodes.
- If a client needs convenient access to resources spread across several NameNodes, a client-side mount table (ViewFs) can map different directories to different NameNodes; the mapped directories must actually exist on those NameNodes (see the sketch below).
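A minimal sketch of such a client mount table, assuming two hypothetical NameNodes; the `fs.viewfs.mounttable.*` keys are the standard ViewFs properties, while the cluster name, hosts, and paths are placeholders:

```java
// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem
// "ClusterX", "nn1-host" and "nn2-host" are placeholders, not values from the article.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "viewfs://ClusterX");
conf.set("fs.viewfs.mounttable.ClusterX.link./user", "hdfs://nn1-host:8020/user");
conf.set("fs.viewfs.mounttable.ClusterX.link./data", "hdfs://nn2-host:8020/data");
FileSystem viewFs = FileSystem.get(conf); // paths under /user and /data now resolve to different NameNodes
```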
4. NameNode Working Mechanism#
HDFS reads/writes are recorded in the rolling edits log -> the Secondary NameNode (SN) asks the NN whether a checkpoint is needed -> a checkpoint is triggered when the time limit is reached (one hour by default) or the edits log has accumulated enough transactions -> the SN requests the checkpoint -> the NN ships the edits file and fsimage file to the SN -> the SN merges the edits log into the fsimage -> the SN sends the merged fsimage back to the NN.
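A minimal sketch of the two checkpoint triggers mentioned above; the property names are the standard HDFS keys, and the values shown are the usual defaults:

```java
// Requires: org.apache.hadoop.conf.Configuration
Configuration conf = new Configuration();
conf.setLong("dfs.namenode.checkpoint.period", 3600);    // checkpoint at least once an hour (seconds)
conf.setLong("dfs.namenode.checkpoint.txns", 1_000_000); // ...or after a million uncheckpointed transactions
```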
5. DataNode Working Mechanism#
- DataNode starts and registers with NN, then periodically reports block data information.
- NN and DataNode communicate through a heartbeat detection mechanism, with a heartbeat every 3 seconds. If no heartbeat is received for more than 10 minutes, the node is considered unavailable.
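The "10 minutes" comes from the usual timeout formula; a hedged sketch of the arithmetic, assuming the common default property values (with those defaults it works out to 10 minutes 30 seconds):

```java
// Assumed defaults: dfs.namenode.heartbeat.recheck-interval = 300000 ms, dfs.heartbeat.interval = 3 s
long recheckIntervalMs = 300_000;
long heartbeatIntervalMs = 3_000;
long timeoutMs = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs; // = 630000 ms
System.out.println(timeoutMs / 60000.0 + " minutes");              // 10.5 minutes
```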
6. HDFS Data Read Process#
```java
// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path, FSDataInputStream},
// org.apache.hadoop.io.IOUtils
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataInputStream inStream = fs.open(file);
// Copy the file contents to stdout (readUTF() would only work for data written with writeUTF())
IOUtils.copyBytes(inStream, System.out, 4096, false);
inStream.close();
```
- The client initializes the `FileSystem` object and calls its `open()` method, obtaining a `DistributedFileSystem` object.
- The `DistributedFileSystem` requests the first batch of block locations from the NN via RPC.
- The first two steps produce an `FSDataInputStream`, which wraps a `DFSInputStream` object.
- The client calls the `read()` method; the `DFSInputStream` finds the DataNode closest to the client, connects to it, and starts reading the first data block of the file, with data streaming from the DataNode to the client.
- When the first data block has been read completely, the `DFSInputStream` closes that connection and connects to the DataNode holding the next data block.
- If the `DFSInputStream` hits a communication error with a DataNode while reading, it tries the next DataNode that holds the block and records which DataNode failed, skipping it for the remaining blocks.
- After the client finishes reading, it calls the `close()` method to close the connection and writes the data to the local file system.
7. HDFS Data Write Process#
```java
// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path, FSDataOutputStream}
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path file = new Path("demo.txt");
FSDataOutputStream outStream = fs.create(file);
outStream.write("Welcome to HDFS Java API !!!".getBytes("UTF-8"));
outStream.close();
```
- The client calls the `create()` method to create a file output stream (an `FSDataOutputStream` wrapping a `DFSOutputStream`) and requests the NN to upload the file.
- The `DistributedFileSystem` calls the NN via RPC to create a new file entry with no associated blocks. Before creating it, the NN checks whether the file already exists and whether the client has permission to create it. If the checks pass, the NN records the operation in the EditLog (WAL, write-ahead log) and returns the output stream object; otherwise it throws an IOException.
- The client splits the file into blocks; for example, with a 128 MB block size, a 300 MB file is split into three blocks (two of 128 MB and one of 44 MB). It then asks the NN which DataNode servers each block should be transferred to.
- The NN returns the writable DataNodes, and the client establishes a pipeline connection with the DataNodes assigned by the NameNode and writes data into the output stream object.
- Data written through the `FSDataOutputStream` object is split into small packets and queued in a data queue. As the client writes each packet to the first DataNode, that DataNode forwards it to the second, then the third, and so on down the pipeline; the client does not wait for a complete block or file before distributing data.
- Each DataNode in the pipeline returns an ack after receiving and writing a packet; a packet is only considered written once it has been acknowledged through the pipeline.
- After the client finishes writing data, it calls the `close()` method to close the stream.
Supplement:
- After the client performs a write, only completed blocks are visible; a block that is still being written is not visible to readers. The client can only guarantee that written data is complete and durable by calling the `sync` method (`hflush()`/`hsync()` in current APIs); `close()` calls it implicitly. Whether to call it manually is a trade-off between data robustness and throughput, depending on the program's needs.
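A small sketch of forcing visibility and durability explicitly; `hflush()` and `hsync()` are the `FSDataOutputStream` methods that replaced the older `sync()`:

```java
// 'fs' as in the write example above; the file name is illustrative.
FSDataOutputStream out = fs.create(new Path("demo.txt"));
out.write("important record\n".getBytes("UTF-8"));
out.hflush(); // flush to the DataNode pipeline: new readers can now see the data
out.hsync();  // additionally ask the DataNodes to persist to disk: stronger durability, lower throughput
out.close();  // close() also flushes any outstanding data
```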
Error Handling During Writing:#
Error in a DataNode Replica During Writing:#
- First, the pipeline will be closed.
- Packets that have been sent to the pipeline but have not yet received confirmation will be written back to the data queue to avoid data loss.
- The block on the remaining healthy DataNodes is given a new version number (the latest generation stamp, obtained via the lease information held by the NameNode), so that when the failed node recovers, its stale replica is deleted because of the version mismatch.
- The faulty node will be removed, and a normal DataNode will be selected to re-establish the pipeline and begin writing data again.
- If replicas are insufficient, the NN will create a new replica on other DataNode nodes.
Client Crash During Writing:#
When the client exits abnormally during a write, the replicas of the block being written may be left in inconsistent states. Through the lease-recovery mechanism, the NN selects a primary DataNode to coordinate with the others, determines the minimum length shared by all replicas of the block, and truncates every replica to that minimum length so they become consistent again.
For detailed reference: HDFS Recovery Process 1
8. HDFS Replication Mechanism#
First replica: if the client uploading the file runs on a DataNode, the replica is placed on that node; otherwise a DataNode is chosen at random.
Second replica: Placed on a DataNode in a different rack.
Third replica: Placed on a different DataNode in the same rack as the second replica.
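The target replica count itself is a per-file setting the client controls; a minimal sketch (the value 3 is the usual default, the path is illustrative):

```java
// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.{FileSystem, Path}
Configuration conf = new Configuration();
conf.set("dfs.replication", "3");                    // target replica count for newly created files
FileSystem fs = FileSystem.get(conf);
fs.setReplication(new Path("demo.txt"), (short) 2);  // change an existing file's target to 2 replicas
```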
9. HDFS Safe Mode#
Safe mode is a working state of HDFS, where it only provides a read-only view of files to clients and does not accept modifications to the namespace.
- When the NN starts, it first loads the fsimage into memory and then executes the operations recorded in the edits log. Once the mapping of file system metadata is successfully established in memory, a new fsimage and an empty edits log file are created, at which point the NN operates in safe mode.
- During this phase, the NN collects block reports from the DataNodes and counts the data blocks for each file. When the minimum replication condition is met, i.e. a configured proportion of blocks (dfs.namenode.safemode.threshold-pct, 99.9% by default) have reached the minimum number of replicas, it exits safe mode. If the condition is not met, it arranges for DataNodes to replicate the under-replicated blocks until the minimum number of replicas is reached.
- When starting a newly formatted HDFS, it will not enter safe mode because there are no data blocks.
Exit safe mode: hdfs dfsadmin -safemode leave
10. HA High Availability Mechanism#
Reference: Hadoop NameNode High Availability Implementation Analysis
Basic Architecture Implementation#
HDFS high availability (HA) relies on ZooKeeper for automatic failover. The basic architecture consists of:
- Active NameNode and Standby NameNode: the primary and standby NameNodes; only the Active NameNode serves client requests.
- Shared storage system: stores the metadata generated while the NN is running. The active and standby NNs synchronize metadata through the shared storage, and during a failover the new Active NN can serve clients only after confirming that its metadata is fully caught up.
- Failover controller (ZKFailoverController): ZKFC runs as an independent process, monitors the health of its NameNode in a timely manner, and uses the ZooKeeper cluster to elect a new Active NN automatically when the current one fails.
- DataNode: DataNodes report block information to both the active and standby NNs so that the mapping between HDFS blocks and DataNodes stays synchronized on both.
- ZooKeeper cluster: provides the election support used by the failover controller.
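For reference, a hedged sketch of what an HA-aware client configuration looks like; the nameservice name and hostnames are placeholders, while the property names are the standard HDFS HA keys:

```java
// Requires: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem
// "mycluster", "nn1-host" and "nn2-host" are placeholders, not values from the article.
Configuration conf = new Configuration();
conf.set("fs.defaultFS", "hdfs://mycluster");
conf.set("dfs.nameservices", "mycluster");
conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1-host:8020");
conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2-host:8020");
conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
FileSystem fs = FileSystem.get(conf); // requests go to the Active NN and fail over transparently
```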
Primary-Backup Switching Implementation#
Reference: Hadoop NameNode High Availability Implementation Analysis
The primary-backup switching of the NameNode is coordinated mainly by three components: `ZKFailoverController`, `HealthMonitor`, and `ActiveStandbyElector`:
- When the failover controller ZKFailoverController starts, it creates its two main internal components, HealthMonitor and ActiveStandbyElector, and registers the corresponding callback methods with both of them.
- HealthMonitor is responsible for monitoring the health of the NameNode. When it detects a change in the NameNode's state, it calls back into ZKFailoverController to trigger automatic primary-backup election.
- ActiveStandbyElector is responsible for completing the automatic primary-backup election, encapsulating the ZooKeeper handling logic internally. Once the ZooKeeper election finishes, it calls back into ZKFailoverController to switch the NameNode between Active and Standby states.
Process Analysis:
- After HealthMonitor initialization is complete, it will start an internal thread to periodically call the HAServiceProtocol RPC interface methods of the corresponding NameNode to check the health status of the NameNode.
- When a change in the NN status is detected, it calls back the corresponding method of ZKFailoverController for processing.
- When ZKFailoverController detects that a primary-backup switch is needed, it uses ActiveStandbyElector to handle it.
- ActiveStandbyElector interacts with ZK to complete the automatic election and then calls back the corresponding method of ZKFailoverController to notify the current NN.
- ZKFailoverController calls the corresponding HAServiceProtocol RPC interface methods of the NameNode to convert the NameNode to Active or Standby state.