First, let's look at Volume (large). In the past, data was stored almost entirely in relational databases, typically on an IOE architecture (IBM minicomputers + Oracle databases + EMC storage). This architecture meets demand quickly while data volumes are small, but once data accumulates to a certain scale it becomes difficult to process. For example, at one customer running an IOE architecture, we found a historical table that had accumulated for many years and held more than 100 million rows; fuzzy-match or join queries on that table were extremely slow and sometimes never returned results at all. At that point people began to wonder whether there was an alternative. IOE is like an elephant: very strong, but with a ceiling on its processing capacity, and a typical example of the centralized processing model. Its natural counterpart is the distributed processing model, the so-called ant-colony tactic, which processes massive amounts of data across many commodity PC servers. When it comes to big data technology, we have to mention Google, the most technologically advanced company of this era. Between 2003 and 2006 Google published three technical papers on GFS, MapReduce, and BigTable, and from them the open-source community produced Hadoop, which is why so many of us mention Hadoop when we talk about big data. GFS corresponds to the distributed file system HDFS, Google's MapReduce corresponds to the MapReduce parallel processing framework on Hadoop, and BigTable corresponds to HBase.
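To make the map/reduce idea concrete, here is a minimal single-process sketch in Python. It is illustrative only, not Hadoop's actual API: records are mapped to key/value pairs, grouped by key, and each group is reduced independently, which is exactly the step a cluster can spread across many PC servers.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (word, 1) for every word in one line of input.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Combine all the counts emitted for the same word.
    return word, sum(counts)

def word_count(lines):
    grouped = defaultdict(list)
    # "Shuffle": group the mapped values by key.
    for key, value in chain.from_iterable(map_phase(l) for l in lines):
        grouped[key].append(value)
    # Reduce each group independently (parallelizable on a real cluster).
    return dict(reduce_phase(k, v) for k, v in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data needs big clusters", "big ideas"]))
    # {'big': 3, 'data': 1, 'needs': 1, 'clusters': 1, 'ideas': 1}
```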
The second is Velocity (high speed). The big data era produces a great deal of sensor data that must be uploaded and processed in real time, an application scenario that traditional processing barely had. In the big data era we think of this kind of data as a river that is continuously generated and continuously processed, which is where stream processing technology comes in. As businesses grow and their processes become more complex, our attention shifts more and more from the "data set" to the "data flow". Decision makers want to stay close to the lifeblood of their organization and obtain results in real time; what they need is an architecture that can handle data streams arriving at any moment. Traditional database technology is not well suited to processing data streams.
For example, to calculate the average of a fixed set of data, a traditional script is enough. But to calculate the average of moving data, where values keep arriving, accumulating, one after another, there are more efficient incremental algorithms. If you want to build a data warehouse and perform arbitrary analysis and statistics on it, the open-source product R or commercial products such as SAS can do the job. But if what you want is statistics over a data stream, where blocks of data are gradually added or removed and moving averages are computed continuously, the corresponding database products either do not exist or are not yet mature. The ecosystem around data streams is underdeveloped. In other words, if you are negotiating a big data project with a supplier, you must know whether stream processing matters to your project and whether the supplier is able to provide it.
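As a sketch of what such an incremental algorithm looks like, the Python below updates a running mean and a fixed-window moving average in O(1) per new value, instead of re-scanning the whole data set each time. The class and variable names are illustrative.

```python
from collections import deque

class RunningMean:
    """Incrementally updated mean: O(1) per new value, no re-scan of history."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count   # incremental update
        return self.mean

class MovingAverage:
    """Average over a fixed-size sliding window of a stream; O(1) per value."""
    def __init__(self, window):
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, x):
        self.values.append(x)
        self.total += x
        if len(self.values) > self.window:
            self.total -= self.values.popleft()     # drop the oldest value
        return self.total / len(self.values)

if __name__ == "__main__":
    ma = MovingAverage(window=3)
    for x in [10, 20, 30, 40, 50]:
        print(ma.update(x))   # 10.0, 15.0, 20.0, 30.0, 40.0
```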
Next is Variety (diversity). In the big data era, besides the traditional structured data, we also need to process semi-structured and unstructured data; a large amount of log data, for example, is unstructured. Many Internet companies have their own massive data collection tools, mostly used for system log collection, such as Hadoop's Chukwa, Cloudera's Flume, and Facebook's Scribe. These tools all adopt a distributed architecture and can meet log collection and transmission requirements of hundreds of MB per second. MongoDB is another important NoSQL product. MongoDB is a database based on distributed file storage. The data structures it supports are very loose: documents are stored in BSON, a binary format similar to JSON, so it can hold fairly complex data types. MongoDB's query language is powerful, with a syntax somewhat like an object-oriented query language; it can cover most of the functionality of single-table queries in a relational database, and it also supports indexing the data. Beyond that there are in-memory databases, graph databases, and other products suited to various application scenarios. In the big data era, the diversity of data and of scenarios has produced an equally diverse range of individual technical products and forms.
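A small sketch of how such loosely structured data might be stored and queried in MongoDB, assuming a local mongod instance and the pymongo driver; the database, collection, and field names here are hypothetical.

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017/")
logs = client["demo"]["access_logs"]

# Documents in the same collection need not share a schema (BSON is schemaless).
logs.insert_one({"host": "10.0.0.1", "status": 200, "path": "/index.html"})
logs.insert_one({"host": "10.0.0.2", "status": 500,
                 "error": {"type": "Timeout", "upstream": "cache"}})

# A secondary index plus a query that resembles a single-table SQL filter.
logs.create_index([("status", ASCENDING)])
for doc in logs.find({"status": {"$gte": 500}}).limit(10):
    print(doc)
```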
Finally, there is Value. Two kinds of technology matter here: one mines the value in the data, the other presents it. The main purpose of data analysis and mining is to sift through large volumes of seemingly chaotic data, extract and refine what is hidden inside, and thereby uncover potentially useful information and the inner laws governing the object under study.
A data mining algorithm is a set of heuristics and calculations that builds a mining model from data. To create the model, the algorithm first analyzes the data supplied by the user, looking for particular kinds of patterns and trends; it then uses the results of that analysis to define the best parameters for the mining model and applies those parameters to the entire data set to extract actionable patterns and detailed statistics. Data mining algorithms are the theoretical core of big data analysis. There are many of them, and different algorithms applied to different data types and formats reveal different characteristics of the data; a range of statistical methods can go deep into the data and dig out its value.
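As one concrete example of this "find patterns, fit parameters, apply to the whole data set" loop, here is a clustering sketch using scikit-learn's k-means on synthetic data; the data and parameter choices are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two artificial "behaviour patterns" hidden in noisy 2-D data.
data = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
                  rng.normal(loc=5.0, scale=1.0, size=(100, 2))])

# The algorithm searches for group structure, fits its parameters (the
# cluster centers), then assigns every record in the data set to a group.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster centers:", model.cluster_centers_)
print("labels of the first records:", model.labels_[:5])
```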
One of the most important application areas of big data analysis is predictive analysis. Predictive analysis combines a variety of advanced analytical capabilities, including specialized statistical analysis, predictive modeling, data mining, text analysis, entity analytics, optimization, real-time scoring, and machine learning, in order to predict the future or other uncertain events. Mining the characteristics of complex data helps us understand the current situation and determine the next course of action, moving decision making from guesswork to prediction. It helps analyze the trends, patterns, and relationships in users' structured and unstructured data, uses these indicators to gain insight and predict future events, and lets us take corresponding measures.
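A minimal sketch of predictive modeling in that spirit: fit a model on historical records, then score new records to support a decision. It assumes scikit-learn, and the features and labels are synthetic stand-ins for real business data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))                 # e.g. usage, tenure, spend
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Train on historical data, hold out a portion to check predictive power.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("held-out accuracy:", model.score(X_test, y_test))
# "Real-time scoring": estimate the probability for one new record.
print("score for a new record:", model.predict_proba([[0.2, -1.0, 0.4]])[0, 1])
```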
Data visualization uses graphical means to convey and communicate information clearly and effectively, and is mainly applied in the correlation analysis of massive data. Because the information involved is scattered and the data structures may not be uniform, a powerful visual analysis platform can assist analysts in examining the data and producing complete analysis charts that are simple, clear, and intuitive, and therefore easier to accept.
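A small sketch of this "present the value" step: turning summary numbers into a chart a decision maker can read at a glance. It assumes matplotlib, and the figures plotted are made up for illustration.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
orders = [120, 135, 160, 150, 210, 240]   # hypothetical monthly totals

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, orders, color="steelblue")
ax.set_title("Monthly orders")
ax.set_ylabel("Orders")
fig.tight_layout()
fig.savefig("monthly_orders.png")   # or plt.show() for interactive use
```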
D3 is a JavaScript library for document-driven data visualization. It is powerful and innovative, letting us see the information directly and interact with it naturally. Its author is Michael Bostock, a graphics editor at the New York Times. With D3 you can, for example, build HTML tables from arbitrary arrays, or create interactive progress bars and other components from any data. Using D3, programmers can build interfaces between programs and organize all kinds of data.