Hadoop Introduction



Hadoop is open-source software from the ASF (Apache Software Foundation), designed according to the three big data papers published by Google (on GFS, MapReduce, and BigTable). It is a framework that allows large amounts of data to be processed in a distributed way across a computer cluster through a simple programming model. It is designed to scale from a single server to thousands of servers, each of which provides local computing and storage. Hadoop does not rely on expensive hardware for high availability; instead, it detects and handles failures at the application layer, shifting work from a failed server to other servers in the cluster. In this way, Hadoop provides an efficient service on top of a cluster of ordinary machines.

After more than ten years of development, the meaning of the term Hadoop itself has kept evolving. Today, when we mention Hadoop, we mostly mean the big data ecosystem around it, which includes numerous software technologies (e.g. HBase, Hive, Spark, etc.).

Just as the Spring framework has its most basic modules, Context, Bean, and Core, on which other modules and projects are built, Hadoop also has several basic modules.

Common: A common toolkit supporting other modules.

HDFS: A distributed file system that can access application data with high throughput.
YARN: A framework for managing cluster server resources and task scheduling.
MapReduce: A system for parallel computation over large data sets, built on top of YARN (a minimal word-count example is sketched after this list).
Other projects, such as HBase and Hive, are higher-level abstractions built on top of these basic modules. In addition, Hadoop is not the only big data solution available today; Amazon, for example, offers its own big data technology stack. The author has mostly worked with the Apache native software and Cloudera's commercial distribution, so the following chapters will use the Cloudera version as the blueprint for the explanations.
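
To give a concrete feel for the MapReduce programming model, below is a minimal word-count sketch using the standard org.apache.hadoop.mapreduce API; the class names and job wiring are illustrative, and error handling is omitted.

```java
// Minimal word-count sketch using the org.apache.hadoop.mapreduce API.
// Class names and job wiring are illustrative; error handling is omitted.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in a line of input.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, such a job is typically submitted with the hadoop jar command, and YARN schedules the map and reduce tasks across the cluster.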

Common

The Common module is the most basic module of Hadoop. It provides the fundamental implementations used by the other modules, such as file system access, I/O, serialization, and remote procedure calls (RPC). If you want to learn more about how Hadoop is implemented, reading the source code of Common is a good place to start.
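
As a small taste of what Common provides, the sketch below implements a custom Writable, the serialization interface in org.apache.hadoop.io that the other modules build on; the PageView type and its fields are made up purely for illustration.

```java
// A minimal custom Writable, sketched to illustrate Common's serialization API
// (org.apache.hadoop.io.Writable). The PageView type and fields are illustrative.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class PageView implements Writable {
  private final Text url = new Text();
  private long hits;

  @Override
  public void write(DataOutput out) throws IOException {
    url.write(out);       // delegate to Text's own serialization
    out.writeLong(hits);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    url.readFields(in);   // fields must be read in the same order they were written
    hits = in.readLong();
  }

  public void set(String u, long h) {
    url.set(u);
    hits = h;
  }
}
```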

HDFS

HDFS is the acronym for "Hadoop Distributed File System". It is a distributed file system designed to run on commodity hardware (not necessarily server-grade equipment, although better equipment certainly helps). It shares many similarities with existing distributed file systems, but differs in that HDFS is designed to run on low-cost hardware (e.g. ordinary PCs) while still providing a highly reliable service. HDFS is designed for applications with large data volumes and high throughput.

In order to better understand the distributed file system, let's start with the file.

File

The word "file" is familiar to everyone, but it means different things in different fields. In computer science, what is a file? Is it the icon you can see in a directory? Of course not. When a file is stored on a storage device, it is simply a byte sequence of length N. From the perspective of a computer user, a file is an abstraction over all I/O devices: every I/O device can be regarded as a file, including disks, keyboards, and the network. This simple yet elegant concept is remarkably rich; it gives applications a unified view of all the I/O devices a system may contain.

File system

A computer holds far more than one file, so how do we manage thousands of them? We need something to manage files: the file system. A file system is a method of storing and organizing data on a computer so that it is easy to access and find. It replaces physical devices such as hard disks and optical discs, which work in terms of data blocks, with the abstract logical concepts of files and tree-shaped directories. The idea is that a user who saves data through the file system does not need to care which data blocks on the disk actually hold it; remembering the directory and file name is enough. Before writing new data, the user does not need to check which block addresses on the disk are free; storage space management (allocation and release) is handled automatically by the file system, and the user only needs to remember which file the data was written to.
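
As a trivial illustration of this abstraction, the snippet below writes and reads a file purely by path, without knowing anything about the underlying data blocks; the path used here is made up.

```java
// Writing and reading by path only: block allocation is left to the file system.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileAbstractionDemo {
  public static void main(String[] args) throws IOException {
    Path path = Path.of("/tmp/demo/notes.txt");   // illustrative path
    Files.createDirectories(path.getParent());
    Files.writeString(path, "hello, file system");
    // The reader also only needs the directory and file name.
    System.out.println(Files.readString(path));
  }
}
```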

Distributed file system

Compared with a single-machine file system, a distributed file system is a file system that allows files to be shared across multiple hosts over a network, so that multiple users on multiple computers can share files and storage space.

In such a file system, the client does not access the underlying data blocks and disks directly. Instead, it reads and writes the file system over the network, on top of the single-machine file systems of the participating hosts, with the help of a specific communication protocol.

The most basic capability a distributed file system needs is replication and fault tolerance over the network. In other words, on the one hand a file is divided into multiple data blocks that are distributed across multiple devices; on the other hand, each data block has multiple copies stored on different devices. Even if a small number of devices go offline or crash, the file system as a whole can keep operating without losing data.
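
To make the two ideas of splitting and replication more concrete, here is a toy sketch that assigns fixed-size blocks and their replicas to a set of nodes; it only illustrates the concept and is not how a real distributed file system (HDFS included) places blocks.

```java
// Toy sketch of block splitting and replica placement. Purely conceptual;
// this is not HDFS's real placement policy.
import java.util.ArrayList;
import java.util.List;

public class BlockPlacementSketch {
  public static void main(String[] args) {
    long fileSize = 350L * 1024 * 1024;    // a 350 MB file (illustrative)
    long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
    int replication = 3;                   // 3 replicas per block
    List<String> nodes = List.of("dn1", "dn2", "dn3", "dn4", "dn5");

    long numBlocks = (fileSize + blockSize - 1) / blockSize;  // ceiling division
    for (long b = 0; b < numBlocks; b++) {
      List<String> replicas = new ArrayList<>();
      for (int r = 0; r < replication; r++) {
        // Round-robin over nodes so replicas of one block land on distinct devices.
        replicas.add(nodes.get((int) ((b + r) % nodes.size())));
      }
      System.out.println("block " + b + " -> " + replicas);
    }
  }
}
```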

Note: The boundary between a distributed file system and distributed data storage is blurred, but in general, a distributed file system is designed to be used within a local area network, and the emphasis is on extending the traditional file system concept and achieving fault tolerance through software. Distributed data storage generally refers to systems that apply distributed computing techniques to provide storage services for data such as files and databases.

HDFS

HDFS is the distributed file system of Hadoop. HDFS uses a master/slave architecture: an HDFS cluster is composed of one NameNode and a certain number of DataNodes. The NameNode is a central server responsible for managing the file system namespace and file access control. Each DataNode is typically deployed on its own device and manages the storage on the node where it runs. HDFS exposes the file system namespace, and users can store data on it in the form of files. Internally, a file is divided into one or more data blocks, which are stored on a group of DataNodes. The NameNode performs namespace operations of the file system, such as opening, closing, and renaming files or directories, and it determines the mapping of data blocks to specific DataNodes. The DataNodes handle read and write requests from file system clients, and they create, delete, and replicate data blocks under the unified scheduling of the NameNode. To ensure high reliability of the file system, a Standby NameNode is usually deployed so that it can take over immediately when the Active NameNode fails.
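
The sketch below shows how a client typically interacts with HDFS through the org.apache.hadoop.fs.FileSystem API: metadata operations go to the NameNode, while the actual bytes are streamed to and from DataNodes. The NameNode URI and paths are placeholders.

```java
// Minimal HDFS client sketch using org.apache.hadoop.fs.FileSystem.
// The NameNode URI and paths are placeholders for your own cluster.
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; in practice this usually comes from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the NameNode records the file in its namespace and chooses DataNodes;
    // the client then streams the bytes to those DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read: the NameNode returns the block locations; the data comes from DataNodes.
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }

    fs.close();
  }
}
```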

There are many installation and configuration manuals for HDFS on the Internet, so I will not repeat them in this article. I only provide one deployment architecture used in a previous project, for your reference.

This highly available HDFS architecture consists of 3 ZooKeeper devices, 2 devices providing domain name service (DNS) and time service (NTP), 2 NameNode devices (more Standby nodes can be added if necessary), a shared storage device (NFS), and N DataNodes.
ZooKeeper is responsible for receiving the heartbeats of the NameNodes. When the Active NameNode stops reporting its heartbeat to ZooKeeper, the monitoring process of the Standby NameNode is notified, which activates the Standby NameNode so that it takes over the work of the Active NameNode.
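
For reference, the client-side settings for such a NameNode pair can be sketched roughly as below; normally these live in hdfs-site.xml, and the nameservice id and host names here are placeholders.

```java
// Rough sketch of the client-side HA settings for a NameNode pair, set
// programmatically for illustration; in practice they live in hdfs-site.xml.
// The nameservice id "mycluster" and the host names are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HaClientConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://mycluster");
    conf.set("dfs.nameservices", "mycluster");
    conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
    conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1:8020");
    conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2:8020");
    // The failover proxy provider lets the client find whichever NameNode is active.
    conf.set("dfs.client.failover.proxy.provider.mycluster",
        "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

    // The client addresses the nameservice, not a specific NameNode host.
    try (FileSystem fs = FileSystem.get(conf)) {
      System.out.println("Connected to " + fs.getUri());
    }
  }
}
```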

NFS is responsible for storing the EditLog files for the two NameNodes. (When the NameNode performs a write operation submitted by an HDFS client, such as creating or moving a file, it first records the operation in the EditLog file, then updates the in-memory image of the file system, and finally flushes it to disk. The EditLog is only used during data recovery. Each operation recorded in the EditLog is called a transaction, and each transaction has an integer transaction id as its number. The EditLog is cut into many pieces, each of which is called a segment.) When a NameNode switch occurs and the Standby NameNode takes over, it first replays the unfinished write operations in the EditLog and then starts writing new operation records to the EditLog. (In addition, Hadoop also provides another EditLog solution, QJM, the Quorum Journal Manager.)
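
To illustrate the transaction and segment vocabulary used above, here is a toy model in which each namespace operation becomes a numbered transaction and the log is rolled into segments; it is purely illustrative and not how the NameNode's EditLog is actually implemented.

```java
// Toy model of an edit log: each namespace operation becomes a transaction with an
// increasing id, and the log is rolled into segments. Purely illustrative; this is
// not how the NameNode's EditLog is actually implemented.
import java.util.ArrayList;
import java.util.List;

public class EditLogSketch {
  record Transaction(long txId, String op) {}

  private final List<List<Transaction>> segments = new ArrayList<>();
  private List<Transaction> current = new ArrayList<>();
  private long nextTxId = 1;

  void log(String op) {
    current.add(new Transaction(nextTxId++, op));   // record the operation first
  }

  void rollSegment() {
    segments.add(current);                          // close the current segment
    current = new ArrayList<>();
  }

  public static void main(String[] args) {
    EditLogSketch log = new EditLogSketch();
    log.log("CREATE /user/demo/a.txt");
    log.log("RENAME /user/demo/a.txt -> /user/demo/b.txt");
    log.rollSegment();
    log.log("DELETE /user/demo/b.txt");
    System.out.println("segments so far: " + log.segments.size());
  }
}
```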

DNS and NTP are responsible for the domain name service and time service of the entire system (including the clients). Both are essential in a cluster deployment. First, the necessity of DNS: Hadoop strongly advocates using machine names as identifiers in an HDFS environment. You can of course record the mapping between machine names and IPs in the /etc/hosts file, but imagine adding a device to a cluster of thousands of machines: the colleague responsible for system maintenance would curse the cluster designer. Second, the necessity of NTP: about 90% of the problems I encountered when I first started working with Hadoop clusters were caused by inconsistent time across devices. Time synchronization of all devices is a basic guarantee of data consistency and management consistency.

