Spark: fast and easy to use
Spark is known not only for its performance but also for its ease of use: it ships with a friendly API for Scala (its native language), Java, and Python, as well as Spark SQL. Spark SQL closely resembles SQL-92, so you can get started right away with almost no learning curve.
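As a rough sketch (the file path, view name, and columns are hypothetical), a SQL-92-style query can be issued from Scala through Spark SQL:

    import org.apache.spark.sql.SparkSession

    object SqlExample {
      def main(args: Array[String]): Unit = {
        // Build a local SparkSession; "local[*]" uses every core on this machine.
        val spark = SparkSession.builder()
          .appName("SqlExample")
          .master("local[*]")
          .getOrCreate()

        // Expose a JSON file as a temporary view (path and schema are hypothetical).
        spark.read.json("people.json").createOrReplaceTempView("people")

        // A familiar SQL-92-style query; no new query language to learn.
        spark.sql("SELECT name, age FROM people WHERE age >= 21 ORDER BY age").show()

        spark.stop()
      }
    }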
Spark, open sourced by UC Berkeley's AMPLab, is a general-purpose parallel computing framework in the same vein as Hadoop MapReduce. Its distributed computation follows the map-reduce model and retains MapReduce's advantages; unlike MapReduce, however, a job's intermediate results can be kept in memory, so there is no need to read and write them to HDFS between stages. This makes Spark better suited to algorithms that require iterative map-reduce passes, such as data mining and machine learning.
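A minimal sketch of that idea, assuming a hypothetical HDFS file with one number per line: the data set is cached in memory once, and each iteration of a simple gradient-descent loop reuses it without touching HDFS again.

    import org.apache.spark.{SparkConf, SparkContext}

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("IterativeSketch").setMaster("local[*]"))

        // Parse the data set once and keep it in memory across iterations (path is hypothetical).
        val points = sc.textFile("hdfs:///data/points.txt").map(_.toDouble).cache()

        // Iteratively estimate the mean by gradient descent; each pass is a map plus a
        // reduce over the cached RDD, with no intermediate writes to HDFS in between.
        var estimate = 0.0
        for (_ <- 1 to 10) {
          val gradient = points.map(p => estimate - p).mean()
          estimate -= 0.5 * gradient
        }
        println(s"Estimated mean: $estimate")

        sc.stop()
      }
    }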
Spark also has an interactive mode, so both developers and users can get immediate feedback on queries and other operations. MapReduce has no interactive mode, although add-on modules such as Hive and Pig make it easier for adopters to work with.
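On the Spark side, the interactive shell (spark-shell) drops you into a Scala REPL with a ready-made SparkSession named spark; a hypothetical exploratory session might look like this:

    scala> val logs = spark.read.textFile("hdfs:///logs/app.log")   // path is hypothetical
    scala> logs.filter(_.contains("ERROR")).count()                 // the answer comes back immediately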
Cost: Spark requires a lot of memory, but it can get by with an ordinary number of standard-speed disks. Some users have complained that it generates temporary files that need to be cleaned up; these files are typically kept for seven days to speed up any further processing on the same data set. Disk space is relatively cheap, and since Spark does not rely on disk input/output for processing, the disk space it does use can sit on a SAN or NAS.
Fault tolerance: Spark uses Resilient Distributed Datasets (RDDs), fault-tolerant collections of data elements that can be operated on in parallel. An RDD can reference a data set in an external storage system such as a shared file system, HDFS, HBase, or any data source that offers a Hadoop InputFormat. Spark can create RDDs from any storage source supported by Hadoop, including the local file system and the file systems listed above.
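A brief sketch of creating RDDs from several of those sources (all paths are hypothetical); because each RDD records how it was derived, a lost partition can simply be recomputed from its source:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddSources {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("RddSources").setMaster("local[*]"))

        // RDD backed by a local file (hypothetical path).
        val localLines = sc.textFile("file:///tmp/input.txt")

        // RDD backed by HDFS, or any file system exposing a Hadoop InputFormat (hypothetical path).
        val hdfsLines = sc.textFile("hdfs:///data/input.txt")

        // RDD built from an in-memory collection, partitioned across the cluster.
        val numbers = sc.parallelize(1 to 1000)

        println(s"local=${localLines.count()}, hdfs=${hdfsLines.count()}, numbers=${numbers.count()}")

        sc.stop()
      }
    }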
Hadoop: Distributed File System
Hadoop is an Apache.org project. It is essentially a software library and framework for the distributed processing of huge data sets (big data) across clusters of computers using a simple programming model. Hadoop scales flexibly, from a single computer to thousands of commodity machines that each contribute local storage and compute. In practice, Hadoop is the heavyweight platform in the field of big data analysis.
Hadoop is composed of multiple modules that work together to form the framework. The main ones are:
• Hadoop Common
• Hadoop Distributed File System (HDFS)
• Hadoop YARN
• Hadoop MapReduce
Although these four modules form Hadoop's core, there are several others, including Ambari, Avro, Cassandra, Hive, Pig, Oozie, Flume, and Sqoop. They further extend Hadoop's capabilities, broadening its reach into big data applications and the processing of huge data sets.
Many companies that work with large data sets and analytics use Hadoop; it has become the de facto standard for big data applications. Hadoop was originally designed to handle one task: crawling and indexing billions of web pages and collecting that information into a database. It is precisely that desire to search and index the Internet that gave rise to Hadoop's HDFS and its distributed processing engine, MapReduce.
Cost: MapReduce uses an ordinary amount of memory. Because its processing is disk-based, companies have to buy faster disks and a lot of disk space to run MapReduce, and they also need more machines to spread the disk input/output across multiple systems.
Fault tolerance: MapReduce relies on TaskTracker nodes that send heartbeats to the JobTracker node. If a heartbeat is missed, the JobTracker reschedules all of that tracker's pending and in-progress operations onto another TaskTracker. This approach is effective at providing fault tolerance, but it can greatly extend the completion time of some operations (even when there is only a single failure).
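This is not Hadoop's actual implementation; a minimal sketch of the heartbeat-and-reschedule idea, with hypothetical names and an illustrative timeout, might look like this:

    import scala.collection.mutable

    object HeartbeatSketch {
      // Hypothetical bookkeeping: last heartbeat time (ms) and assigned tasks per TaskTracker.
      val lastHeartbeat = mutable.Map[String, Long]()
      val assignedTasks = mutable.Map[String, List[String]]()
      val timeoutMs = 10 * 60 * 1000L  // illustrative 10-minute expiry window

      // Called whenever a TaskTracker reports in.
      def recordHeartbeat(tracker: String): Unit =
        lastHeartbeat(tracker) = System.currentTimeMillis()

      // Reschedule every task belonging to a tracker whose heartbeat has gone silent.
      def expireDeadTrackers(reschedule: List[String] => Unit): Unit = {
        val now = System.currentTimeMillis()
        for ((tracker, seen) <- lastHeartbeat.toList if now - seen > timeoutMs) {
          reschedule(assignedTasks.getOrElse(tracker, Nil))
          assignedTasks -= tracker
          lastHeartbeat -= tracker
        }
      }
    }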
Summary: Spark and MapReduce have a symbiotic relationship. Hadoop provides features that Spark lacks, such as a distributed file system, while Spark provides real-time, in-memory processing for the data sets that need it. The ideal big data scenario is exactly what the designers envisioned: Hadoop and Spark working together on the same team.