A deep understanding of the MapReduce Shuffle


1. How MapReduce works

[Figure: MapReduce job execution flow across Client, JobTracker, and TaskTracker]

We can analyze how MapReduce works from the perspective of its three roles: the Client, the JobTracker, and the TaskTracker.

The client launches a job and requests a job ID from the JobTracker. The resource files needed to run the job, such as the jar package and the input split metadata computed by the client, are copied to HDFS. The split information determines the number of map tasks, which the JobTracker later assigns.
After the JobTracker receives the job, it places it in the job queue.

When the job scheduler picks the job up according to its scheduling algorithm, it uses the split information to create one map task per split and assigns the map tasks to TaskTrackers for execution. Note that map tasks are not assigned to TaskTrackers at random; for the reasons, see the data-locality principle introduced in the previous article. The number of map tasks each node can run is limited: every TaskTracker has a fixed number of map slots and reduce slots, determined by the host's core count and memory size.
Each TaskTracker periodically sends heartbeats to the JobTracker. If no heartbeat arrives within a certain interval, the JobTracker assumes the TaskTracker node has died, reclaims its resources, and reassigns its tasks to other nodes. The JobTracker also monitors task execution: when all tasks finish, it updates the job's status to complete; if a certain number of tasks fail, the job is marked as failed. In either case, the JobTracker sends the job's status information back to the Client.
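To make the client-side submission steps concrete, here is a minimal driver using the classic Hadoop mapreduce API. It is a sketch, not code from this article: the class names (WordCountDriver, TokenizerMapper, IntSumReducer) and the word-count logic are illustrative. The call to waitForCompletion() triggers the flow described above: the jar and split metadata are copied to HDFS, the job is submitted, and the client then polls for status.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Mapper: emits (word, 1) for every token in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                // Goes into the map-side buffer, not straight to disk (see section 2).
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts shuffled to it for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");     // Job.getInstance(conf, ...) in Hadoop 2+
        job.setJarByClass(WordCountDriver.class);  // the jar that gets shipped to the cluster
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input splits are computed from this path; one map task per split.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Submits the job (copying resources to HDFS) and polls until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```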

2. The Shuffle process in depth

[Figure: the Shuffle process, from map() output to reduce() input]

As the figure highlights: everything between the output of map() and the input of reduce() is called the shuffle. The Shuffle process divides into a Map-side part and a Reduce-side part.

Map-side process:

Circular in-memory buffer: each split is processed by one map task. The map's output is not written straight to disk; it first goes into a circular in-memory buffer, 100 MB by default (configurable). When the buffer reaches 80% capacity, it begins to overflow: the buffered content is spilled to disk, forming a spill file. Note that spill files have no fixed size.
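Both the buffer size and the spill threshold are configurable. A minimal sketch, using the Hadoop 1.x property names that match the JobTracker/TaskTracker architecture described here (later versions renamed them, e.g. to mapreduce.task.io.sort.mb); the class name SortBufferConfig is hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

public class SortBufferConfig {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Size of the circular in-memory sort buffer, in MB (default 100).
        conf.setInt("io.sort.mb", 200);
        // Fraction of the buffer that triggers a background spill (default 0.80).
        conf.setFloat("io.sort.spill.percent", 0.80f);
        return conf;
    }
}
```

Raising io.sort.mb means fewer, larger spill files and therefore less merging later, at the cost of more memory per map task.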
Partition and sort in memory, then spill to disk: the partition step determines which reduce task each map output record is assigned to. By default, the key of the map output is hashed (hashCode) and taken modulo the number of reduce tasks configured in the code; records with the same result fall into the same partition, so one reduce task consumes the data of exactly one partition.

The benefit is that this prevents some reduce tasks from being allocated a large amount of data while others get little or none, keeping the load across reducers even. Within each partition, records are sorted by key. If a Combiner has been set, the sorted result is also run through the combine operation, which is effectively a local reduce in the map phase; its purpose is to write as little data to disk as possible.
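The default behavior just described is equivalent to Hadoop's built-in HashPartitioner. A minimal sketch (the class name ModuloPartitioner is illustrative):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// Equivalent to the default hash partitioning described above:
// key.hashCode() modulo the number of reduce tasks.
public class ModuloPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so negative
        // hash codes still map into a valid index range [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

It would be wired in with job.setPartitionerClass(ModuloPartitioner.class); a Combiner is attached the same way via job.setCombinerClass(...), which is safe whenever the reduce function is commutative and associative, as a sum is.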
Merging spill files: as the map task runs, spills keep occurring until the last record has been written, potentially leaving a large number of spill files behind. These files must then be merged. The merge process repeats the sorting and (if configured) combine operations, which has two benefits: less data is written to disk on each pass, and less data is transmitted over the network in the subsequent reduce phase. The end result is a single large file that is partitioned and sorted. At this point compression can also be enabled in the configuration, reducing the volume of network traffic between nodes.
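Compressing the merged map output is a configuration switch. A sketch using the Hadoop 1.x property names (renamed in later versions to mapreduce.map.output.compress and mapreduce.map.output.compress.codec); the class name and the choice of GzipCodec are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;

public class MapOutputCompression {
    public static Configuration withCompressedMapOutput() {
        Configuration conf = new Configuration();
        // Compress map output before reducers fetch it over the network.
        conf.setBoolean("mapred.compress.map.output", true);
        // Codec used for the compressed map output.
        conf.setClass("mapred.map.output.compression.codec",
                      GzipCodec.class, CompressionCodec.class);
        return conf;
    }
}
```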

Once the merge is complete, the data is copied to the corresponding reduce tasks. So how does a reduce task find the partition data that belongs to it? Simply put, the JobTracker holds the macro-level view of the whole cluster, so a reduce task only needs to ask the JobTracker for the locations of the map outputs. For details, refer back to the working principle of MapReduce above.
