MapReduce execution process


One of the core components of Hadoop is MapReduce, its distributed computing framework.

Introduction to MapReduce

MapReduce is a distributed computing model proposed by Google. It was originally used in the search field to solve computation problems over massive data sets.
The execution process of MapReduce includes three phases: the Map phase, the Shuffle phase, and the Reduce phase.


The execution steps of MapReduce:

1.Map task processing

1.1 Read files in HDFS and parse each line into a <key, value> pair (by default, <line offset, line content>). The map function is called once for each key-value pair.

1.2 Override map(): receive the key-value pairs produced in 1.1, process them, and emit new key-value pairs.

1.3 Partition the output of 1.2. By default there is a single partition (the number of partitions equals the number of reduce tasks).

1.4 Sort (by key) and group the data within each partition. Grouping means collecting the values of the same key into one set. After sorting: <hello,1> <hello,1> <me,1> <you,1>. After grouping: <hello,{1,1}> <me,{1}> <you,{1}>.

1.5 (Optional) Run a local reduce (a combiner) on the grouped data.
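The Map-side steps above (read → map → partition → sort/group) can be sketched in plain Python for a word-count job. This is a simulation, not Hadoop's API; names such as `map_fn`, `partition`, and `num_partitions` are illustrative assumptions.

```python
from collections import defaultdict

def map_fn(offset, line):
    # 1.2: user-defined map(): emit a <word, 1> pair for each word in the line
    for word in line.split():
        yield word, 1

def partition(key, num_partitions):
    # 1.3: default-style partitioner: hash of the key modulo the partition count
    return hash(key) % num_partitions

def run_map_task(lines, num_partitions=1):
    partitions = defaultdict(list)
    # 1.1: each input line becomes a <offset, line> pair fed to map()
    for offset, line in enumerate(lines):
        for k, v in map_fn(offset, line):
            partitions[partition(k, num_partitions)].append((k, v))
    grouped = {}
    for p, pairs in partitions.items():
        # 1.4: sort by key, then collect the values of the same key into one list
        pairs.sort(key=lambda kv: kv[0])
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        grouped[p] = dict(groups)
    return grouped

print(run_map_task(["hello you", "hello me"]))
# → {0: {'hello': [1, 1], 'me': [1], 'you': [1]}}
```

With `num_partitions=1` everything lands in partition 0, matching the "one partition by default" behavior described in step 1.3.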

2.Shuffle

Shuffle is a very important process in MapReduce: without its sorting, MapReduce loses its soul. It occurs after the Map tasks finish executing and before the Reduce tasks begin.

The Shuffle phase spans the Map side and the Reduce side: it includes the spill process on the Map side and the copy and merge/sort processes on the Reduce side. Generally, the Shuffle stage is considered the process by which the output of map becomes the input of reduce.

Copy: the Reduce side starts several copy threads, which pull its own partition of each map task's output file to the local node over HTTP. Reduce pulls data from multiple map tasks, and the data from each map is already sorted.
Merge: the copied data is first put into a memory buffer, which is relatively large here. When the amount of buffered data reaches a threshold, it is spilled to disk (similar to the map side, the spill performs sort & combine). If multiple spill files are generated, they are merged into one ordered final file; this merge also keeps performing sort & combine operations.
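The merge step above can be simulated with the standard-library `heapq.merge`, which combines several already-sorted runs into one ordered stream; the in-memory lists stand in for spill files (an illustrative simplification):

```python
import heapq

# Each spill "file" is already sorted by key, as produced by the Map side.
spill_1 = [("hello", 1), ("me", 1)]
spill_2 = [("hello", 1), ("you", 1)]

# Merge the sorted runs into one ordered final "file".
merged = list(heapq.merge(spill_1, spill_2, key=lambda kv: kv[0]))
print(merged)
# → [('hello', 1), ('hello', 1), ('me', 1), ('you', 1)]
```

A real merge streams from disk rather than holding everything in memory, which is exactly why the Map side takes care to keep each spill file sorted.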

3.Reduce task processing

3.1 The outputs of multiple map tasks are copied over the network to different reduce nodes according to their partitions.

3.2 Merge and sort the outputs of the multiple maps. Override the reduce function: receive the grouped data, implement your own business logic, and produce new output after processing.

3.3 Write the output of reduce to HDFS.
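The Reduce-side steps can be sketched the same way: receive grouped <key, {values}> data, apply the user-defined reduce function, and collect the results (in Hadoop these would be written to HDFS; here they are simply returned). `reduce_fn` and `run_reduce_task` are illustrative names, not Hadoop's API.

```python
def reduce_fn(key, values):
    # 3.2: user-defined reduce(): the business logic here is summing the counts
    return key, sum(values)

def run_reduce_task(grouped):
    # grouped data as it arrives after shuffle, e.g. {"hello": [1, 1], ...}
    return [reduce_fn(k, vs) for k, vs in sorted(grouped.items())]

print(run_reduce_task({"hello": [1, 1], "me": [1], "you": [1]}))
# → [('hello', 2), ('me', 1), ('you', 1)]
```

Chained after the Map-side sketch, this completes a minimal word count: the grouped map output for one partition becomes the input of one reduce task.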

