Summary of popular big data engines

Teradata

A long-established data warehouse company that has been publicly listed for more than ten years. Gartner ranks it as a leader in data warehousing, and it is currently moving towards the cloud. It mainly sells all-in-one appliances built on an MPP architecture and known for stable operation. ICBC previously used a Teradata system. The price is relatively high.

Greenplum

The first product was released in 2006. Greenplum is based on PostgreSQL, adopts a shared-nothing MPP architecture, and is mainly used for analytical (OLAP) workloads. It was acquired by EMC in 2010 and open sourced in 2015, and it has a complete ecosystem. Greenplum is reported to be the only open source database among the world's top ten classic and real-time data analysis products.

Vertica

A shared-nothing, column-store MPP database and a pioneer of the column-oriented DBMS. Version 1.0 came out in 2006, and the company was acquired by Hewlett-Packard in 2011. The commercial version is powerful and is purchased and used by many data-driven companies. It is mainly used for data warehousing and OLAP, supports time series data, machine learning and more, and can also be integrated with Hadoop, Spark and others. Even when running on top of Hadoop it is significantly faster than Impala, let alone Hive on Tez.

Hadoop (HDFS+MapReduce+Yarn)

HDFS and MapReduce became part of the Hadoop project in 2006, and Cloudera began offering Hadoop-based services in 2008. Hadoop is a software framework for distributed processing of large amounts of data. With its high scalability, high fault tolerance and low cost, it can be regarded as having opened a new door for the big data field.

Hive

Hive is a data warehouse analysis system built on Hadoop. It provides a rich set of SQL query features for analyzing data stored in the Hadoop distributed file system. Hive is better suited to offline processing, because translating SQL into MapReduce jobs makes its response time slow. Hive on Tez speeds this up by executing the query as a DAG, which reduces the number of job stages.

HBase

A Hadoop-based column-oriented database characterized by support for large, wide tables and for both structured and semi-structured data. The first usable HBase was released in 2007 together with Hadoop 0.15.0. HBase's LSM-tree architecture greatly improves write performance, at some cost to real-time read performance.
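
To make the write/read trade-off of an LSM-tree concrete, here is a minimal sketch in Python. It is a toy model and not HBase's implementation: the class name TinyLSM, the flush threshold and the in-memory list of segments are all assumptions made for illustration.

```python
class TinyLSM:
    """Toy LSM-tree: writes go to memory, reads may have to check several segments."""

    def __init__(self, flush_threshold=2):
        self.flush_threshold = flush_threshold  # hypothetical memtable size limit
        self.memtable = {}   # recent writes, mutable
        self.segments = []   # flushed, immutable sorted runs (newest last)

    def put(self, key, value):
        # A write only touches the in-memory table, which is why writes are cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Freeze the memtable into a sorted, immutable segment.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # A read checks the memtable, then every segment from newest to oldest.
        # This extra searching is the read-side cost of the design.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all segments into one; newer values overwrite older ones.
        merged = {}
        for segment in self.segments:  # oldest first
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM()
db.put("row1", "a")
db.put("row2", "b")   # memtable is full, flushed as segment 1
db.put("row1", "c")
db.put("row3", "d")   # flushed as segment 2
```

After these four writes the toy store holds two segments, and db.get("row1") returns the newer value "c". Calling db.compact() merges them into a single segment, which is how an LSM-tree wins back read performance in the background.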

Impala

An open source MPP query engine on Hadoop, written in C++. Its storage back ends include HDFS, HBase, S3 and others. It mainly addresses the problem that Hive is too slow, and it is backed by Cloudera. It can be used with HDFS plus the Parquet storage format, or integrated with Kudu.

Spark

A unified analytics engine for large-scale data processing. It supports not only batch jobs but also stream processing (Spark Streaming) and SQL (Spark SQL), as well as machine learning and graph processing. It is generally accepted that Spark is significantly faster than MapReduce thanks to in-memory computing. The Spark community is active and very mature, and many of the platforms and big data components mentioned later integrate seamlessly with Spark.

Kylin

An open source distributed analytics engine that provides a SQL query interface and multidimensional analysis (OLAP) on Hadoop/Spark for extremely large datasets. It was originally developed by eBay Inc. and contributed to the open source community. It can query huge Hive tables with sub-second latency. The core idea is to precompute and build a cube, which is defined by its dimensions and measures. I think it is essentially a materialized view.
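
The cube idea can be sketched in a few lines of Python. This is only an illustration of precomputation, not Kylin's actual build process: the function names build_cube and query, the sample rows and the choice of SUM as the only measure are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of the dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube


def query(cube, group_by):
    """Answer a GROUP BY from the precomputed cuboid without scanning the raw rows.

    group_by must list dimensions in the same order used when building the cube.
    """
    return cube[tuple(group_by)]


rows = [
    {"city": "Beijing", "year": 2019, "sales": 10},
    {"city": "Beijing", "year": 2020, "sales": 20},
    {"city": "Shanghai", "year": 2020, "sales": 5},
]
cube = build_cube(rows, dimensions=("city", "year"), measure="sales")
```

Here query(cube, ["city"]) is a dictionary lookup that returns the per-city totals. The price is that n dimensions produce 2^n cuboids (4 in this example), which is the dimension explosion that cube-based engines have to manage.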

Apache Kudu

Kudu is Cloudera's open source columnar storage system ("fast analytics on fast data") that runs on the Hadoop platform and is written in C++. With HDFS and HBase already available, why do we need Kudu? First, Kudu's table structure resembles that of a relational database and is simple to use. Second, it provides an efficient insert/update mechanism, and its performance on large numbers of random reads is significantly higher than HBase's, so it suits near-real-time analysis of rapidly changing data. Kudu is a good fit for SQL-based OLAP, its storage does not depend on HDFS, and it can be integrated with Impala.

ClickHouse

Often described as the fastest open source OLAP engine. Column storage with fixed-length values lets it take full advantage of vectorized computation. A sparse index plus approximate calculation improve response time, but it is poorly suited to point queries. It claims to be 3 to 5 times faster than Vertica and 300 times faster than Hive. It was developed in Russia (at Yandex). The open source code is very valuable, but the detailed documentation is not rich enough.
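
ClickHouse's primary index is sparse: it keeps one entry per block of rows (a granule) rather than one per row. The sketch below illustrates that idea only. It is not ClickHouse code, and the granule size of 4, the function names and the assumption of unique sorted keys are all made up for the example.

```python
from bisect import bisect_right

GRANULE = 4  # hypothetical granule size; real systems use thousands of rows


def build_sparse_index(sorted_keys, granule=GRANULE):
    """Keep only the first key of every granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule)]


def lookup(sorted_keys, index, key, granule=GRANULE):
    """Return the row position of key, or None if it is absent.

    The index only narrows the search to one granule, which is then scanned.
    """
    g = bisect_right(index, key) - 1  # last granule whose first key <= key
    if g < 0:
        return None
    start = g * granule
    for pos in range(start, min(start + granule, len(sorted_keys))):
        if sorted_keys[pos] == key:
            return pos
    return None


keys = list(range(0, 40, 2))      # 20 unique sorted keys: 0, 2, ..., 38
index = build_sparse_index(keys)  # 5 index entries instead of 20
```

The index stays small enough to live in memory, which helps large range scans, but finding a single row still means reading a whole granule. That is one reason this kind of engine is a weak fit for point queries.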

SnappyData

A unified in-memory distributed database for OLTP, OLAP and streaming writes, built by combining Spark and GemFire. It supports whatever storage Spark can connect to, and speeds up access by caching data in memory. Some enhancements (such as approximate computation) are provided only by the commercial version.

Druid

An OLAP database with time series support and support for high-frequency inserts. Ingested data is divided into three parts: Timestamp, Dimensions and Metrics. Data is pre-aggregated by time granularity, which avoids the dimension explosion seen with Kylin. Indexes use bitmaps, which makes them fast. Data is stored permanently in Deep Storage and loaded into Historicals, which can be understood as cold data and hot data respectively, and various back-end stores can be plugged in. It comes with a graphical interface, which is pretty good.
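
Pre-aggregation by time granularity can be sketched as follows. This is a simplified illustration, not Druid's ingestion code: the rollup function, the hourly granularity, the SUM aggregation and the extra "count" field are assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metrics):
    """Pre-aggregate events per (hour, dimension values), summing each metric."""
    table = defaultdict(lambda: defaultdict(int))
    for e in events:
        ts = datetime.fromisoformat(e["timestamp"])
        bucket = ts.replace(minute=0, second=0, microsecond=0)  # truncate to the hour
        key = (bucket.isoformat(),) + tuple(e[d] for d in dimensions)
        for m in metrics:
            table[key][m] += e[m]
        table[key]["count"] += 1  # number of raw events folded into this row
    return {k: dict(v) for k, v in table.items()}


events = [
    {"timestamp": "2020-01-01T10:05:00", "page": "home", "clicks": 1},
    {"timestamp": "2020-01-01T10:40:00", "page": "home", "clicks": 3},
    {"timestamp": "2020-01-01T11:02:00", "page": "home", "clicks": 2},
    {"timestamp": "2020-01-01T10:59:00", "page": "cart", "clicks": 1},
]
rolled = rollup(events, dimensions=["page"], metrics=["clicks"])
```

Four raw events collapse into three stored rows. Only combinations of dimension values that actually occur within a time bucket are stored, in contrast to a cube that materializes every subset of dimensions.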

Presto

A pure in-memory OLAP compute engine. The Facebook team used Hive before, but the MapReduce that Hive relies on was too slow, so they built Presto, which speeds up queries through parallel in-memory computation and does not store data itself. It is used for data warehousing and OLAP. The Teradata team used to support this project but no longer does.

Google Mesa

Mesa is a distributed, replicated, highly available system for processing, storing and querying structured data. Typically data is produced by upstream services (such as a batch of Spark Streaming jobs), then aggregated and stored inside Mesa. It supports near-real-time updates (in contrast to the cube approach). Data is divided into dimension columns and metric columns, and each metric column specifies an aggregation function.
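
A data model in which each non-key column carries its own aggregation function can be sketched like this. It is only an illustration of the idea, not Mesa's design: the column names, the two aggregators and the apply_batch function are hypothetical.

```python
import operator

# Hypothetical schema: key (dimension) columns, plus one aggregation
# function for every value (metric) column.
KEY_COLUMNS = ("date", "ad_id")
AGGREGATORS = {"clicks": operator.add, "max_bid": max}


def apply_batch(table, batch):
    """Merge a batch of update rows into the table, combining rows with equal keys."""
    for row in batch:
        key = tuple(row[c] for c in KEY_COLUMNS)
        if key not in table:
            table[key] = {c: row[c] for c in AGGREGATORS}
        else:
            for col, fn in AGGREGATORS.items():
                table[key][col] = fn(table[key][col], row[col])
    return table


table = {}
apply_batch(table, [{"date": "d1", "ad_id": 1, "clicks": 2, "max_bid": 5}])
apply_batch(table, [
    {"date": "d1", "ad_id": 1, "clicks": 3, "max_bid": 4},
    {"date": "d1", "ad_id": 2, "clicks": 1, "max_bid": 1},
])
```

After the second batch, the row for ("d1", 1) holds clicks 5 and max_bid 5. Because updates are merged into existing rows as they arrive, there is no full rebuild step, which is what makes near-real-time updates feasible in this style of model.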

Apache Doris

Its predecessor was Palo, a system Baidu open sourced in 2017, which was later donated to Apache and renamed Doris. Doris is an MPP OLAP system that mainly combines the technologies of Google Mesa (data model), Apache Impala (MPP query engine) and Apache ORCFile (storage format, encoding and compression). It is highly compatible with the MySQL protocol. Its metadata management revises Impala's P2P model: Doris uses the Paxos protocol and a Memory + Checkpoint + Journal mechanism to ensure high performance and reliability of metadata.

Elasticsearch

The first version was released in 2010, and the company was founded in 2012. Elasticsearch is currently the most popular enterprise search engine, followed by Apache Solr; both are based on Lucene. It is used for real-time retrieval. Its signature capability is full-text search backed by an inverted index.
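
An inverted index maps each term to the set of documents that contain it. Here is a minimal sketch, assuming whitespace tokenization and AND semantics for multi-term queries. The function names and sample documents are made up, and real Lucene indexes add analysis, scoring and compressed posting lists on top of this.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map every lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "fast columnar storage",
    2: "fast full text search",
    3: "text storage format",
}
index = build_inverted_index(docs)
```

With this index, search(index, "fast text") intersects two small sets and returns {2} without reading any document text, which is why lookups stay fast as the corpus grows.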

Parquet

An elegant columnar storage format based on Google Dremel. The core idea is to use the record shredding and assembly algorithm to represent nested data. It supports multiple query engines and compute frameworks, and converts smoothly from multiple data models (such as Avro, Thrift, etc.). The heart of the algorithm is the pair of values r and d stored with each column value: the repetition level and the definition level. r indicates at which repeated field in the path the value repeats. d counts the optional or repeated fields in the path, which may be absent: each one that is actually present adds 1 to d. With these two values, a finite state machine can reconstruct the original record structure.
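
The r and d values are easiest to see on the simplest possible case. The sketch below assumes a single column, a top-level repeated string field (call it tags), so both levels have a maximum of 1. The functions shred and assemble are hypothetical names, and real Parquet handles arbitrary nesting depth.

```python
def shred(records):
    """Flatten lists of values into one column of (r, d, value) triples."""
    column = []
    for values in records:
        if not values:
            # The record exists but the field is absent: d = 0, no value stored.
            column.append((0, 0, None))
            continue
        for i, v in enumerate(values):
            r = 0 if i == 0 else 1    # r = 0 marks the start of a new record
            column.append((r, 1, v))  # the field is present, so d = 1
    return column


def assemble(column):
    """Rebuild the original records from the (r, d, value) triples."""
    records = []
    for r, d, v in column:
        if r == 0:
            records.append([])     # a new record begins
        if d == 1:
            records[-1].append(v)  # only defined values are appended
    return records


records = [["a", "b"], [], ["c"]]
column = shred(records)
```

The column becomes (0, 1, "a"), (1, 1, "b"), (0, 0, None), (0, 1, "c"), and assemble(column) returns the original three records, including the empty one. The empty record survives only because d = 0 records that the field was missing.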

CarbonData

An indexed columnar store open sourced by Huawei for interactive queries. Multi-level indexes support both multidimensional OLAP and point queries. Materialized views are supported. It is advisable to choose index dimensions according to the characteristics of the column values, to avoid the negative effects of columns with very many distinct values. Integration with Spark is fairly smooth, but watch out for the extra work that using multiple back-end storage formats can cause.

MongoDB

MongoDB is an open source NoSQL document database. The basic concepts are database, collection and document. A document is a set of key-value pairs (stored as BSON). Documents in a collection do not need to have the same fields, and the same field does not need to have the same data type, which is very different from relational databases. It supports fixed-size (capped) collections, as well as MapReduce, aggregation, sharding and replication. The MongoDB ecosystem continues to improve, it keeps climbing the database rankings, and it is popular with major Internet companies. Commercial support comes from 10gen, which open sourced the first version of MongoDB in February 2009 and later renamed itself MongoDB. The commercial version adds encryption, LDAP and Kerberos integration, and other more complete features.

Cassandra

Apache Cassandra is an open source NoSQL database system, originally developed at Facebook and later open sourced. It uses a wide-column storage model similar to HBase, but it does not need HDFS: data is stored directly on local disk, and reads and writes combine memory and disk. Write performance is higher than read performance. Judging from benchmark results, its overall performance is better than HBase's.