MapReduce is a programming model for parallel processing of large data sets (typically larger than 1 TB). Its core ideas are the "Map" and "Reduce" operations, both borrowed from functional programming languages, along with features taken from vector programming languages. The model makes it much easier for programmers to run their code on a distributed system without writing distributed parallel programs themselves. In the current software implementation, the programmer specifies a Map function that transforms a set of key-value pairs into a set of intermediate key-value pairs, and a Reduce function that merges all intermediate values associated with the same key.
1. What is MapReduce?
Let me first outline the four major components of Hadoop:
- HDFS: the distributed storage system.
- MapReduce: the distributed computing framework.
- YARN: Hadoop's resource scheduling system.
- Common: the underlying support layer for the other three components, mainly providing basic utilities and the RPC framework.

MapReduce is a programming framework for distributed computing programs and the core framework for developing Hadoop-based data analysis applications. Its core function is to combine the business logic written by the user with its own built-in components into a complete distributed computing program that runs concurrently on a Hadoop cluster.
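To make the Map and Reduce ideas above concrete, here is a minimal word-count sketch using the Hadoop Java API. The class names (WordCountMapper, WordCountReducer) and the word-count logic are illustrative examples, not something taken from the text above.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: turn each input record (offset, line) into new key-value pairs (word, 1).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce: all values that share the same key arrive together and are merged here.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

The map function emits a (word, 1) pair for every word it sees; the framework then groups all values that share the same key before handing them to the reduce function, which sums them.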
2. MapReduce job running process
You can run a MapReduce job by calling the submit() method or the waitForCompletion() method on a Job object. These methods hide a great deal of processing behind the scenes; let's uncover the steps Hadoop performs to run a job. The whole process is shown in the figure below (each step is explained later):
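As a concrete sketch of where these calls sit, the driver below builds a job and submits it with waitForCompletion(). It reuses the illustrative WordCountMapper/WordCountReducer classes from the earlier sketch, and the input and output paths are placeholders taken from command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion(true) submits the job and then polls its progress,
        // printing it to the console; submit() would return immediately instead.
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
```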
At the highest level, there are five independent entities:
- The client, which submits the MapReduce job.
- The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
- The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
- The MapReduce application master, which coordinates the tasks of a MapReduce job. The application master (MRAppMaster) and the MapReduce tasks run in containers that are scheduled (here better understood as carved out and allocated) by the resource manager and managed by the node managers.
- The distributed filesystem (usually HDFS), which is used to share job files between the other entities.
Job Submission
Calling submit() on the Job object creates a JobSubmitter instance internally and then calls its submitJobInternal() method (Figure 1, step 1). If you submit the job with waitForCompletion(), the method polls the job's progress once per second and reports it to the console whenever it changes. When the job completes successfully, the job counters are displayed; otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
- Asks the resource manager for a new application ID, which is used as the MapReduce job ID (step 2).
- Checks the output specification of the job. If, for example, the output directory has not been specified or already exists, the job is not submitted and an error is thrown to the MapReduce program (see the sketch after this list).
- Computes the input splits for the job. If the splits cannot be computed (perhaps because the input paths do not exist), the job is not submitted and an error is thrown to the MapReduce program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to a directory on the shared filesystem named after the job ID (step 3). The job JAR is copied with a high replication factor (controlled by the mapreduce.client.submit.file.replication property, which defaults to 10), so that when the job's tasks run there are plenty of copies of the JAR across the cluster for node managers to access.
- Submits the job by calling submitApplication() on the resource manager (step 4).
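The sketch below shows, purely as an illustration, how a client might clear an existing output directory before submission (only safe if the old output is disposable) and set the JAR replication property mentioned above. The class name SubmissionPrep is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SubmissionPrep {
    public static void prepare(Configuration conf, Path outputDir) throws Exception {
        // The job JAR is copied with this replication factor so that many node
        // managers can read a nearby copy; 10 is already the default value.
        conf.setInt("mapreduce.client.submit.file.replication", 10);

        // The submitter throws an error if the output directory already exists,
        // so delete it first (assumes the old output can be discarded).
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(outputDir)) {
            fs.delete(outputDir, true); // recursive delete
        }
    }
}
```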
Job Initialization
When the resource manager receives the call to submitApplication(), it hands the request to the YARN scheduler. The scheduler allocates a container, and the resource manager launches the application master process in that container, under the management of the node manager (steps 5a and 5b).
The application master of a MapReduce job is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks (step 6). Next, the application master retrieves the input splits computed by the client from the shared filesystem (step 7). It then creates one map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on the Job object; see the sketch below). Task IDs are assigned at this point.
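The reduce count mentioned above can be set in either of two equivalent ways. The following sketch shows both; 4 is an arbitrary example value and ReduceCountExample is a hypothetical class name.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountExample {
    public static Job configure() throws Exception {
        // Either set the property directly...
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.job.reduces", 4);   // 4 is an arbitrary example value

        // ...or call the equivalent Job API after creating the job.
        Job job = Job.getInstance(conf, "reduce-count-example");
        job.setNumReduceTasks(4);
        return job;
    }
}
```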
The application master must decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges that the overhead of allocating new containers and running the tasks in them in parallel outweighs the gain over running the tasks sequentially on one node. Such a job is said to be uberized, or run as an uber task.
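Whether a job qualifies as an uber task is governed by a few configuration properties. The sketch below lists them with their default thresholds (at most 9 maps, 1 reduce, and mapreduce.job.ubertask.maxbytes of input, which defaults to one HDFS block); UberConfig is a hypothetical class name, and the values are shown only for illustration.

```java
import org.apache.hadoop.conf.Configuration;

public class UberConfig {
    public static Configuration enableUber() {
        Configuration conf = new Configuration();
        conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow uber tasks
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // default: at most 9 maps
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // default: at most 1 reduce
        // mapreduce.job.ubertask.maxbytes defaults to the HDFS block size
        return conf;
    }
}
```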