Introduction
The concept of "big data" has been proposed for many years. From the beginning of mystery to today's rapid development, it has had a huge impact on people's lives. From the long wait of people at bus stops and taxi stands to the quick pick-up of Didi taxis, from the beginning of Taobao shopping, looking for the desired product like a needle in a haystack, to the accurate product recommendation on Taobao homepage today, from the beginning for food, family by family The store’s attempts have now been to choose any hotel you like based on Meituan’s recommendations. It can be said that “big data” has penetrated into all aspects of life and has provided great convenience to people’s lives.
With the advent of the big data wave, the government has issued a series of big data industry development plans and strategic documents to promote the full integration of big data technology with traditional industries, drive the transformation and upgrading of China's economic structure, and improve the quality of economic development and international competitiveness.
With the advent of the era of information explosion, the sheer volume, high velocity, and diversity of information data have created natural application scenarios for data science and big data technology.
The big data technology system is huge and complex. Its basic technologies include data collection, data preprocessing, distributed storage, NoSQL databases, data warehouses, machine learning, parallel computing, visualization, and other technical categories and layers. A generalized big data processing framework can first be given, divided mainly into the following stages: data collection and preprocessing, data storage, data cleaning, data query and analysis, and data visualization; a minimal pipeline sketch of these stages follows.
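To make these stages concrete, here is a minimal, hypothetical Python sketch of such a pipeline run end to end on a handful of in-memory records. The function names, the record fields, and the list standing in for a database are illustrative assumptions, not the API of any particular framework.

```python
# A minimal, hypothetical sketch of the generalized pipeline described above.
# All function and field names are illustrative assumptions, not a real framework.

def collect():
    """Data collection: gather raw records from a source (here, a hard-coded list)."""
    return [
        {"user": "alice", "item": "book", "price": "12.5"},
        {"user": "bob", "item": "", "price": "oops"},   # malformed record
        {"user": "carol", "item": "phone", "price": "199"},
    ]

def clean(records):
    """Data cleaning: drop records with missing fields or non-numeric prices."""
    cleaned = []
    for r in records:
        try:
            price = float(r["price"])
        except ValueError:
            continue
        if r["user"] and r["item"]:
            cleaned.append({"user": r["user"], "item": r["item"], "price": price})
    return cleaned

def store(records, table):
    """Data storage: append cleaned records to a 'table' (a list standing in for a database)."""
    table.extend(records)

def query(table):
    """Data query and analysis: total spend per user."""
    totals = {}
    for r in table:
        totals[r["user"]] = totals.get(r["user"], 0.0) + r["price"]
    return totals

def visualize(totals):
    """Data visualization: a crude text bar chart instead of a plotting library."""
    for user, total in totals.items():
        print(f"{user:10s} {'#' * int(total // 10)} ({total:.1f})")

table = []
store(clean(collect()), table)
visualize(query(table))
```

In a real deployment each stage would be backed by the technologies listed above (distributed storage, NoSQL databases, parallel computing engines, visualization tools); the sketch only shows how data flows from one stage to the next.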
Big data refers to collections of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame: massive, high-growth, and diversified information assets that require new processing models in order to deliver stronger decision-making power, insight and discovery, and process optimization capabilities. IBM proposed the 5V characteristics of big data: Volume (large scale), Velocity (high speed), Variety (diversity), Value (low value density), and Veracity (authenticity).
Definition of Big Data
"Big data" refers to a collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame. Gartner, the research organization of “Big data”, gave this definition: “Big data” requires new processing models to have stronger decision-making power, insight and discovery, and process optimization capabilities to adapt to massive, high growth rates and Diversified information assets.
The strategic significance of big data technology lies not in possessing huge amounts of data but in processing meaningful data professionally. In other words, if big data is compared to an industry, then the key to this industry's profitability lies in improving the "processing capacity" of data and realizing the "value-added" of data through "processing".
Big data requires special technologies to effectively process large amounts of data within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
Characteristics of Big Data
When it comes to big data, many people have only heard the concept but have no standard sense of what it is or how to define it, because so many companies call themselves big data companies while their business forms vary widely.
In the "Big Data Era", four characteristics of big data are mentioned: Volume (large), Variety (diversity), Velocity (high speed), Value (value), generally we call it 4V.
Volume
The characteristics of big data are first reflected in "big". In the early MP3 era, a small MB-scale MP3 file could meet many people's needs; since then, however, storage units have grown from GB to TB, and now even to PB and EB. With the rapid development of information technology, data has begun to grow exponentially. Social networks (Weibo, Twitter, Facebook), mobile networks, various smart devices, service tools, and so on have all become sources of data. Taobao's nearly 400 million members generate about 20 TB of commodity transaction data every day; Facebook's roughly 1 billion users generate more than 300 TB of log data every day. There is an urgent need for intelligent algorithms, powerful data processing platforms, and new data processing technologies to count, analyze, predict, and process such large-scale data in real time.
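To see why the discussion has moved from GB and TB to PB and EB, a back-of-the-envelope calculation with the approximate daily volumes quoted above is enough; this is a minimal sketch that uses only the two figures from the text.

```python
# Back-of-the-envelope scale check using the approximate figures quoted above.
TB_PER_PB = 1024

taobao_daily_tb = 20      # ~20 TB of transaction data per day
facebook_daily_tb = 300   # ~300 TB of log data per day

for name, daily_tb in [("Taobao", taobao_daily_tb), ("Facebook", facebook_daily_tb)]:
    yearly_pb = daily_tb * 365 / TB_PER_PB
    print(f"{name}: {daily_tb} TB/day  ->  about {yearly_pb:.1f} PB/year")

# Facebook alone accumulates on the order of a hundred petabytes of logs per year,
# which is why storage is now discussed in PB and EB rather than GB and TB.
```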
Variety (diversity)
A wide range of data sources determines the diversity of big data forms, and any form of data can play a role. The most widely used recommendation systems at present, such as those of Taobao, NetEase Cloud Music, and Toutiao, analyze users' log data in order to recommend things users are likely to enjoy. Log data has an obvious structure, while some data, such as images, audio, and video, is not obviously structured; the causal relationships in such data are weak, so it often needs to be labeled manually. A rough illustration of this difference is sketched below.
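The sketch below parses a hypothetical tab-separated click-log line into named fields that can be aggregated directly, while an image arrives as opaque bytes and only becomes analyzable once a human-supplied label is attached; the log format, field names, and label are assumptions made for the example.

```python
# Hypothetical examples; the log format and the label are assumptions for illustration.

# Structured: a click log with an obvious schema can be parsed and aggregated directly.
log_line = "2023-04-01T08:15:02\tuser_42\tsong_1001\tplay"
timestamp, user_id, item_id, action = log_line.split("\t")
event = {"timestamp": timestamp, "user": user_id, "item": item_id, "action": action}
print(event["user"], "performed", event["action"], "on", event["item"])

# Unstructured: an image is just bytes; without a label there is nothing to aggregate.
image_bytes = b"\x89PNG\r\n\x1a\n..."                        # raw pixel data, no inherent fields
manual_label = {"file": "cover.png", "tag": "album_cover"}   # added by a human annotator
print(manual_label["file"], "labeled as", manual_label["tag"])
```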