Talking about Big Data - Data Source

in data •  4 years ago 

Content

In the era of the Internet of Things, every person/device is a producer and a user of data. Online connection is a data-based process, and interaction is the coming and going of data. The energy required for network collaboration comes from the tension and kinetic energy of data. --Professor Zeng

As the quote above suggests, if we are going to talk about big data, the first requirement is to have data. Otherwise, why would people say that "data is the first factor of production in the DT era"? Data in the era of big data is as important as land was in the agricultural era and capital was in the industrial era.

Where does the data come from and where will it be generated?

Data is everywhere. Ever since humans invented writing, they have been recording all kinds of data, but the storage medium was generally books, which are difficult to analyze and process. With the rapid development of computer and storage technology and the digitization of everything (audio, images, and so on), the volume of data has exploded, and with the development of Internet of Things technology this explosion will only accelerate. At the same time, the requirements on data storage and processing technology will become higher and higher.

According to the Digital Universe report published by IDC, the amount of data generated, copied and consumed by humans in 2013 reached 4.4 ZB. By 2020, the amount of data was projected to grow tenfold, reaching 44 ZB. Big data has become one of humanity's most precious assets. How to use this data reasonably and effectively, so that it plays its due role, is what big data is about.

Early companies were relatively simple. The data stored in their relational databases was often their only source of data, and the corresponding big data technology was the traditional OLAP data warehouse. Because the relational database contained basically all of their data, the technology was relatively simple: pull statistics directly from the relational database, or at most build a unified OLAP data warehouse.

Judging from the history of Taobao, the data in the early data warehouse basically came from the main business OLTP database. It was nothing more than user information (obtained through registration and authentication), product information (obtained through seller uploads), transaction data (obtained through buying and selling), and favorites data (obtained through users adding items to their collections). From the perspective of the company's business, the focus was on statistics over this data, such as the total number of users, the number of active users, the number and amount of transactions (which can be drilled down by category, province, and so on), and the number and amount of Alipay transactions. Because there was no marketing system and no advertising system at the time, the company only paid attention to data about users, products, and transactions. The statistical processing of this data was all of Taobao's big data at that time.
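
The drill-down statistics described above can be illustrated with a small sketch. The rows, field layout, and the `rollup` helper are hypothetical and chosen only for illustration; a real warehouse would run this kind of aggregation as SQL over the OLTP data.

```python
from collections import defaultdict

# Hypothetical transaction rows: (user_id, category, province, amount)
transactions = [
    ("u1", "clothing", "Zhejiang", 120.0),
    ("u2", "clothing", "Jiangsu", 80.0),
    ("u1", "electronics", "Zhejiang", 999.0),
    ("u3", "electronics", "Zhejiang", 450.0),
]


def rollup(rows, key_index):
    """Group rows by one column and return {key: (transaction_count, total_amount)}."""
    stats = defaultdict(lambda: [0, 0.0])
    for row in rows:
        entry = stats[row[key_index]]
        entry[0] += 1        # number of transactions
        entry[1] += row[3]   # transaction amount
    return {key: (count, total) for key, (count, total) in stats.items()}


# Users with at least one transaction in the sample
active_users = len({row[0] for row in transactions})

# Drill down the same measures by category and by province
by_category = rollup(transactions, key_index=1)
by_province = rollup(transactions, key_index=2)
```

The same `rollup` call works for any dimension column, which is the essence of drilling down: the measures stay the same and only the grouping key changes.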

However, as the business developed, for example with the emergence of personalized recommendation and advertising systems, more data was needed for support. In the database, apart from favorites and the shopping cart, which do reflect user behavior, the user's other behaviors, such as browsing and searching, were completely unknown at this point.

This is where another data source comes in: log data, which records user behavior. With cookie technology, as long as the user logs in once, the log records can be associated with the real user. For example, from a user's browsing and purchasing behavior, the system can recommend products that the user may be interested in. "Viewed this, also viewed" and "bought this, also bought" are recommendation algorithms based on this most basic user behavior data. The behavior data can also be used to analyze the user's browsing path and browsing time, which is an important basis for improving the related Taobao products.
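
A recommendation of the "also bought" kind mentioned above can be sketched with simple co-occurrence counting over behavior data. This is a minimal illustration, not Taobao's actual algorithm; the `purchases` data and the function names are assumptions made for the example.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical behavior data: user -> set of purchased items
purchases = {
    "u1": {"phone", "case", "charger"},
    "u2": {"phone", "case"},
    "u3": {"phone", "charger"},
    "u4": {"case", "screen_film"},
}


def build_cooccurrence(user_items):
    """Count, for every item, how often each other item was bought by the same user."""
    co = defaultdict(Counter)
    for items in user_items.values():
        for a, b in permutations(sorted(items), 2):
            co[a][b] += 1
    return co


def also_bought(co, item, top_n=2):
    """Return the items most often bought together with `item`."""
    # Sort by count (descending), then by name so ties are deterministic
    ranked = sorted(co[item].items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


co = build_cooccurrence(purchases)
recommendations = also_bought(co, "phone")
```

Replacing the purchase sets with browsing sets gives the "also viewed" variant with no change to the counting logic.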

In 2009, the mobile Internet developed rapidly. With the large-scale emergence of native apps, it was no longer possible to obtain mobile user behavior data using the traditional log methods. A number of new mobile data collection and analysis tools emerged, such as Youmeng, TalkingData, and Taobao's internal "wireless speed reading" tool. Through an SDK built into the app, they can collect user behavior data in native apps.

The data has been collected, but new problems have emerged. For example, how do I match a user's behavior on the PC with the same user's behavior on mobile? The two are disconnected, because the PC side uses PC standards and the mobile side uses mobile standards. If there is a unified user database that uniquely identifies a user through identifiers such as login name, email address, ID number, mobile phone number, IMEI, and MAC address, then no matter where the data is generated, once the identifiers have been linked for the first time, they can be matched afterwards.
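
One common way to keep such a unified user database is to treat every identifier as a node and merge nodes whenever two identifiers are observed together. The sketch below uses a union-find structure for this; the class name, identifier types, and sample values are all hypothetical.

```python
class IdentityGraph:
    """Union-find over identifiers such as ("login", "alice") or ("imei", "imei-0001")."""

    def __init__(self):
        self.parent = {}

    def find(self, ident):
        """Return the representative identifier of the user that `ident` belongs to."""
        self.parent.setdefault(ident, ident)
        root = ident
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited node directly at the root
        while self.parent[ident] != root:
            next_ident = self.parent[ident]
            self.parent[ident] = root
            ident = next_ident
        return root

    def link(self, a, b):
        """Record that identifiers a and b were seen together (for example at login)."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def same_user(self, a, b):
        return self.find(a) == self.find(b)


graph = IdentityGraph()
# PC side: a browser cookie is seen together with a login name
graph.link(("cookie", "pc-abc"), ("login", "alice"))
# Mobile side: a device IMEI is seen together with the same login name
graph.link(("imei", "imei-0001"), ("login", "alice"))
# A different person on a different device
graph.link(("imei", "imei-0002"), ("login", "bob"))
```

After the first link on each side, PC behavior keyed by cookie and mobile behavior keyed by IMEI resolve to the same user, which is the "once linked, matched afterwards" property described above.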

This brings up an important topic: data standards. Data standards are not only about solving the problem of data association inside the enterprise; a good user database can also solve many problems in future big data association. Suppose public security data is to be connected with hospital data in order to create greater value. Public security identifies a person by ID card number, while the hospital identifies a person by mobile phone number. With a unified user database, the two sides' data can easily be associated through ID mapping.
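
The public security and hospital example can be sketched as a join that goes through the unified user database. All records, field names, and the `associate` helper here are made up for illustration.

```python
# Hypothetical unified user database: one row per person with all known identifiers
user_db = [
    {"user_key": "p1", "id_card": "ID-001", "phone": "13800000001"},
    {"user_key": "p2", "id_card": "ID-002", "phone": "13800000002"},
]

# Each organization keys its records by a different identifier
police_records = [{"id_card": "ID-001", "household": "Hangzhou"}]
hospital_records = [
    {"phone": "13800000001", "visits": 3},
    {"phone": "13800000009", "visits": 1},  # phone number not in the user database
]


def index_by(rows, field):
    """Map one identifier type to the unified user key."""
    return {row[field]: row["user_key"] for row in rows}


def associate(left, left_field, right, right_field, users):
    """Join two record sets that use different identifiers, via the user database."""
    left_keys = index_by(users, left_field)
    right_keys = index_by(users, right_field)

    # Resolve the right-hand records to unified user keys first
    right_by_user = {}
    for row in right:
        key = right_keys.get(row[right_field])
        if key is not None:
            right_by_user[key] = row

    joined = []
    for row in left:
        key = left_keys.get(row[left_field])
        if key in right_by_user:
            joined.append({"user_key": key, **row, **right_by_user[key]})
    return joined


linked = associate(police_records, "id_card", hospital_records, "phone", user_db)
```

Records whose identifier is missing from the user database simply do not join, which is why the coverage of the user database determines how much of the two data sets can be connected.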

Data standards matter not only for data association within an enterprise, but also for association across organizations and enterprises. Not many companies in the industry are able to establish data standards such as a user database; Alibaba is one of them. The government actually saw the value here very early. As early as July 2002, the second meeting of the National Informatization Leading Group reviewed and approved the "Guiding Opinions on China's E-government Construction" (hereinafter referred to as the "Opinions"). Following the guiding principles of the "Opinions", the Office of the State Council Information Leading Group formulated the "China's E-Government Phase I Project Construction Plan", which identified four basic, strategic resource databases to be built during the "10th Five-Year Plan" period: the "Population Basic Information Database", the "Basic Information Database of Legal Person Units", the "Basic Information Database of Natural Resources and Spatial Geography", and the "Macroeconomic Information Database". Together they are referred to as the four basic information databases.

In the later stage of big data development, the more data the better, and a company's internal data can no longer meet its needs. Taobao, for example, wants to build a complete profile of its users for precision marketing: their real-time location, hobbies, zodiac sign, consumption level, what kind of car they drive, and so on. Taobao's own data is not enough for this. At this point many companies buy data (some also crawl information themselves, which is relatively simple). For example, Ali bought AutoNavi and Youmeng, and also purchased Weibo-related data, which it uses for user tagging in order to obtain more accurate user portraits.

However, data transactions are not so simple.

Data transactions involve several major problems:

How to protect user privacy information

The European Union has issued strict data protection regulations, and the United States has imposed heavy penalties on operators that sell customer data. China's big data industry is still in its infancy, so how can it ensure that users' private information is not leaked? For non-private information such as geographic, meteorological, and map data, opening it up, trading it, and analyzing it is very valuable. However, once users' private data is involved, especially the private data of an individual person, moral and legal risks arise.
Desensitizing data before a transaction may be one solution, but it cannot solve the problem completely. Ali therefore proposed another solution: platform-guaranteed "available but invisible" technology. For example, Alibaba Cloud acts as the trading platform, an intermediary guarantor much like Alipay. Both parties upload their data to the Alibaba Cloud big data trading platform, and each can use the other's data to obtain specific results, for example by uploading algorithms and models, yet neither party can see any of the other's detailed data.
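
The "available but invisible" idea can be illustrated with a toy platform that holds both parties' rows and only ever returns aggregates. This is a sketch under strong assumptions and not a description of Alibaba Cloud's actual platform: the `TradingPlatform` class, its method names, and the minimum group size rule are all invented for the example, and the leading underscores in Python are only a naming convention, so a real platform would need genuine isolation between the parties and the stored data.

```python
class TradingPlatform:
    """Holds each party's raw rows and releases only aggregate results."""

    def __init__(self, min_group_size=2):
        self._datasets = {}
        self._min_group_size = min_group_size

    def upload(self, owner, rows):
        """Store a party's raw rows; no method returns them again."""
        self._datasets[owner] = list(rows)

    def average_by_segment(self, segment_owner, value_owner, key,
                           segment_field, value_field):
        """Join two parties' data on `key` and return the average value per segment.

        Segments with fewer matched users than min_group_size are suppressed,
        so a single user's value is never returned on its own.
        """
        values = {row[key]: row[value_field] for row in self._datasets[value_owner]}
        groups = {}
        for row in self._datasets[segment_owner]:
            if row[key] in values:
                groups.setdefault(row[segment_field], []).append(values[row[key]])
        return {
            segment: sum(vals) / len(vals)
            for segment, vals in groups.items()
            if len(vals) >= self._min_group_size
        }


platform = TradingPlatform(min_group_size=2)
platform.upload("retailer", [
    {"user": "u1", "city": "Hangzhou"},
    {"user": "u2", "city": "Hangzhou"},
    {"user": "u3", "city": "Beijing"},
])
platform.upload("bank", [
    {"user": "u1", "spend": 100.0},
    {"user": "u2", "spend": 300.0},
    {"user": "u3", "spend": 500.0},
])

# The retailer learns average spend per city without seeing any bank row
result = platform.average_by_segment("retailer", "bank", "user", "city", "spend")
```

In this sample the Beijing segment has only one matched user, so it is dropped from the result rather than exposing that user's spend.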

Who owns the data

As a means of production, data differs from land in the agricultural era and capital in the industrial era: it does not disappear after use. When data is purchased, who owns it? How can we ensure that the purchaser does not resell the data? And after the purchaser processes the data, who owns the processed data?

Whether the use of data is legal

In big data marketing, precision marketing is currently the most common application, and in data transactions, personal data is the most valuable. The purpose of the customer portraits we build in daily analysis is to group and label large numbers of customers, and then carry out targeted marketing and services. However, if a user's personal information (such as age, gender, or occupation) is used for marketing, must the user's consent be obtained before advertising can be sent, or can the information be used directly?

Therefore, data transactions and the related use of data require solving problems of data standards, legislation, and supervision. It is not ruled out that in the future there will be dedicated laws or even specialized regulators, such as a data supervision committee that oversees problems in data transactions and data use. If that day really comes, it will be a good thing. Data creates greater value only when it circulates. If every company has only its own data, then even if it eliminates the information islands inside the company, information islands still exist between companies.

If data from multiple parties can be used reasonably and appropriately, the so-called "wool comes from the pig" effect will occur. For example, Ali Xiaodai uses B2B and Taobao data. For the pig (B2B and Taobao), this is a spillover effect of massive data into a new business scenario; for the sheep (Ant microloans), it is the added value of the "chemical reaction" that takes place once data from different dimensions has been gathered at low cost. This is a typical feature of intelligent business in the era of big data.

