1. Hadoop
1) Introduction to Hadoop
Hadoop is a distributed system infrastructure developed by the Apache Software Foundation. Hadoop implements a distributed file system, HDFS, which is highly fault tolerant, is designed to be deployed on low-cost hardware, and provides high-throughput access to application data, making it well suited to applications with very large data sets. The core of the Hadoop framework is HDFS plus MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over that data.
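As a rough illustration of the MapReduce programming model mentioned above, the following self-contained Python sketch runs the map, shuffle, and reduce phases of a word count in a single process. It only mimics the model; real Hadoop jobs are written against the Hadoop MapReduce API and run across a cluster.

```python
from collections import OrderedDict
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(word, counts):
    # Reduce: sum all partial counts for one key.
    return (word, sum(counts))

def word_count(lines):
    # Shuffle: sort/group intermediate pairs by key, as the framework would.
    pairs = sorted(kv for line in lines for kv in map_phase(line))
    return dict(
        reduce_phase(word, (c for _, c in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

print(word_count(["big data", "big clusters"]))  # {'big': 2, 'clusters': 1, 'data': 1}
```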
2) Advantages of Hadoop
Hadoop processes data in a reliable, efficient, and scalable way.
High reliability: Hadoop stores data in multiple replicas, so the loss of a single copy does not mean the loss of the data.
High scalability: Hadoop distributes data and computing tasks across the available cluster, which can easily be expanded to thousands of nodes.
High efficiency: Hadoop works in parallel, and this parallel processing speeds up data handling.
High fault tolerance: Hadoop automatically keeps multiple copies of data and automatically redistributes failed tasks.
Low cost: Hadoop can be deployed on low-cost hardware.
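For example, the number of replicas HDFS keeps for each block is controlled by the `dfs.replication` property (3 by default), set in `hdfs-site.xml`:

```xml
<!-- hdfs-site.xml: keep three copies of every block (the default) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```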
2. Spark
1) Introduction to Spark
Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark retains the advantages of Hadoop MapReduce, but the intermediate results of a job can be kept in memory, so there is no need to read and write HDFS between stages. As a result, Spark's performance and computing speed are higher than MapReduce's.
2) Advantages of Spark
Fast computation: Spark reads data from disk, keeps the intermediate data in memory, completes all necessary analysis and processing there, and writes only the final results back to the cluster, which makes Spark fast.
A rich set of libraries: Spark ships with Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX.
Support for multiple resource managers: Spark can run on Hadoop YARN or on its own standalone cluster manager.
Ease of use: the high-level API hides the details of the cluster itself, so Spark application developers can focus on the computation their application needs to perform.
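To show what this high-level chaining style looks like without a cluster, here is a toy `MiniRDD` class. It is a hypothetical stand-in written for this article, not Spark's real implementation; only the method names `map`, `filter`, and `reduce` mirror the corresponding RDD operations.

```python
from functools import reduce as _reduce

class MiniRDD:
    """Toy, single-process stand-in for Spark's RDD chaining style."""
    def __init__(self, data):
        self.data = list(data)
    def map(self, f):
        # Return a new MiniRDD with f applied to every element.
        return MiniRDD(f(x) for x in self.data)
    def filter(self, pred):
        # Return a new MiniRDD keeping only elements where pred is true.
        return MiniRDD(x for x in self.data if pred(x))
    def reduce(self, f):
        # Collapse the data to a single value.
        return _reduce(f, self.data)

result = (MiniRDD(range(1, 11))
          .filter(lambda x: x % 2 == 0)   # keep the even numbers
          .map(lambda x: x * x)           # square them
          .reduce(lambda a, b: a + b))    # 4 + 16 + 36 + 64 + 100
print(result)  # 220
```

Note that the caller never mentions nodes, disks, or task scheduling; in real Spark those concerns are handled by the engine behind the same kind of interface.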
3. Differences between Spark and Hadoop
1) Different application scenarios
Both Hadoop and Spark are big data frameworks, but their application scenarios differ. Hadoop is a distributed data storage architecture: it distributes huge data sets across the nodes of a cluster built from ordinary computers, reducing hardware costs. Spark is a tool designed to process data held in distributed storage; it does not provide storage of its own and typically reads its data from HDFS or a similar distributed file system.
2) Different processing speed
Hadoop's MapReduce processes data step by step: it reads data from disk, performs one processing pass, writes the result back to disk, then reads the updated data from disk for the next pass, and finally saves the result to disk. All of this disk access slows processing down. Spark reads data from disk, keeps the intermediate data in memory, completes all necessary analysis and processing there, and writes the results back to the cluster, so Spark is faster.
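The contrast above can be sketched in plain Python: a hypothetical `disk_pipeline` persists every intermediate result to a file and reads it back, as MapReduce does between jobs, while a `memory_pipeline` passes intermediate results along in memory, as Spark does. Both are simplifications for illustration only.

```python
import json
import os
import tempfile

def disk_pipeline(data, stages):
    """MapReduce-style: write every intermediate result to disk,
    then read it back before the next stage."""
    path = None
    for stage in stages:
        if path is not None:
            with open(path) as f:       # re-read the previous stage's output
                data = json.load(f)
            os.remove(path)
        data = [stage(x) for x in data]
        fd, path = tempfile.mkstemp(suffix=".json")
        with os.fdopen(fd, "w") as f:   # persist this stage's output
            json.dump(data, f)
    with open(path) as f:
        result = json.load(f)
    os.remove(path)
    return result

def memory_pipeline(data, stages):
    """Spark-style: keep intermediate results in memory between stages."""
    for stage in stages:
        data = [stage(x) for x in data]
    return data

stages = [lambda x: x + 1, lambda x: x * 2]
print(disk_pipeline([1, 2, 3], stages))    # [4, 6, 8]
print(memory_pipeline([1, 2, 3], stages))  # [4, 6, 8]
```

The two functions compute the same answer; the difference is only where the intermediate data lives, which is exactly the source of the speed gap described above.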
3) Different fault tolerance
Hadoop writes the processed data to disk at every step, so power outages or errors rarely cause data loss. Spark stores its data objects in resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a group of nodes; if part of a data set is lost, it can be reconstructed from the record of how it was derived (its lineage). In addition, checkpointing (CheckPoint) can be used for fault tolerance during RDD computation.
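A minimal sketch of the lineage idea, using a toy `LineagePartition` class invented for this article (not Spark's API): instead of replicating the derived data, it remembers the durable source data plus the list of transformations, and simply replays them when a partition is "lost".

```python
class LineagePartition:
    """Toy model of RDD lineage: remember how data was derived,
    so a lost partition can be recomputed rather than restored from a replica."""
    def __init__(self, source, transforms=()):
        self.source = list(source)        # durable input (e.g. a file on HDFS)
        self.transforms = list(transforms)
        self.data = self._compute()
    def _compute(self):
        data = self.source
        for f in self.transforms:
            data = [f(x) for x in data]
        return data
    def map(self, f):
        # Read-only: derive a NEW partition with an extended lineage.
        return LineagePartition(self.source, self.transforms + [f])
    def lose(self):
        self.data = None                  # simulate a node failure
    def recover(self):
        self.data = self._compute()       # replay the lineage from the source

part = LineagePartition([1, 2, 3]).map(lambda x: x * 10)
part.lose()
part.recover()
print(part.data)  # recomputed from source: [10, 20, 30]
```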
4. The connection between Spark and Hadoop
Hadoop provides HDFS for distributed data storage and MapReduce for data processing, and MapReduce can process data without relying on Spark at all. Conversely, Spark can operate without HDFS, since it can rely on other distributed file systems. But the two can also be combined: Hadoop provides the distributed cluster and the distributed file system, while Spark runs over Hadoop's HDFS in place of MapReduce, making up for MapReduce's limited computing power.
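In practice, combining the two typically means submitting a Spark application to Hadoop's YARN resource manager and pointing it at data stored in HDFS, for example (the script name and HDFS path here are placeholders):

```shell
# Run a Spark application on a Hadoop cluster: YARN schedules it,
# and the job reads its input directly from HDFS.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  my_app.py hdfs:///data/input
```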