Why we "mine" voice data


Your voice is valuable. Even as virtual assistants and smart speakers bring language technologies into the mainstream, more language data is required to expand their application across additional languages, dialects and use cases. Your voice can help make this happen.

Take Amazon’s Alexa, which started with three languages (American English, German and Japanese) and broad consumer use cases such as playing music and checking the weather. Amazon is gradually rolling out more languages and new use cases such as healthcare, but it will need to keep collecting data to make this happen.

There is also a huge amount of latent demand from enterprises that want virtual assistants but have specific needs around security, privacy and industry domain knowledge. These needs may be best addressed by alternative, enterprise-focused solutions, which are likely to be far more company- and industry-specific and to be developed by a broader set of tech companies, startups and consultancies. All of these solutions will require more language data, much of which is unavailable today.

We as a community can aggregate basic language data and resources and help advance language technologies more quickly. When you contribute your voice data (the way you speak and express an intent), you provide the raw material developers need to train a system that will better understand the way you and people similar to you speak. The more data we can aggregate from a broader set of people, the more robust these voice assistants will become.

Everything on LangNet is kept open source so more companies and developers can start building for new languages, dialects and use cases. While open source resources may not always be sufficient for fully optimized, commercial-grade systems, they should start to address the data sparsity problem that today makes it very expensive to take on projects in new languages and domains. In turn, accessible data and broader participation should result in faster innovation cycles and more interesting, useful apps across more languages and markets.

Changing the economic model for open source through blockchain

Blockchain allows us to create a new economic model for open source that distributes value through tokens rather than relying on donations or services revenue.

Because open source resources are shared and not owned, the aim is to transmit value at the ecosystem level rather than at the individual or transactional level. Resource consumers, such as companies or researchers, lock up tokens to access the platform. New tokens are created weekly through inflation to incentivize resource providers, such as data storage nodes or NLP model developers. In effect, everyone who holds tokens pays, through inflation, to support the open source ecosystem.
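To make the "holders pay through inflation" idea concrete, here is a minimal Python sketch of how a holder's share of the total supply shrinks when new tokens are issued to resource providers. The function name and all numbers are hypothetical and chosen only for illustration; the post does not state LangNet's actual supply or weekly inflation rate.

```python
from fractions import Fraction


def share_after_inflation(holding: int, supply: int, minted: int) -> Fraction:
    """Share of total supply a holder owns after `minted` new tokens go to others.

    Illustrative helper only; not part of any LangNet API.
    """
    if supply <= 0 or minted < 0 or not 0 <= holding <= supply:
        raise ValueError("expected supply > 0, minted >= 0, 0 <= holding <= supply")
    return Fraction(holding, supply + minted)


# Hypothetical figures: 1,000 tokens held out of 1,000,000, with 10,000 newly minted.
before = Fraction(1_000, 1_000_000)
after = share_after_inflation(1_000, 1_000_000, 10_000)
print(before, after)  # 1/1000 1/1010
```

The holder still has the same number of tokens, but they now represent a slightly smaller fraction of the network. That difference is the implicit contribution to the ecosystem.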

This is where voice data mining comes in. Again, because all data is open source and not owned by any single entity, it makes most sense to compensate data contributions through token inflation. However, the aggregation of voice-based intents is a reasonably well-defined problem: we generally know which languages and use cases need to be addressed, and once aggregated, this voice data becomes a non-scarce resource that all stakeholders can reuse. We have therefore scoped the problem as 50 languages with 100,000 hours of voice data per language, and assigned a fixed number of tokens to pay for this data.

By fixing the number of tokens and amount of data, we can create a mining schedule, much like Bitcoin, that decays over defined periods. Instead of being time based, however, our periods are based on the amount of data accumulated for that language. We split the 100,000 hours per language into 20 periods of 5,000 hours each.


[Figure: Mining schedule for the first 5 periods of a single language]

This allows us to frontload payouts to reward early adopters. Furthermore, as more data is accumulated, the value of the network should increase and put upward pressure on the price of the token, even as the number of tokens paid out decreases. Finally, because each language’s payout decays independently, languages that remain unaddressed should become more lucrative over time. More details are available in our whitepaper.
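The data-based periods described above can be sketched in a few lines of Python. The 100,000 hours, 20 periods, and 1,000 LANG per hour starting rate come from this post; the geometric decay and the factor of 0.5 used in the example are assumptions made purely for illustration, since the actual decay curve is defined in the whitepaper.

```python
HOURS_PER_LANGUAGE = 100_000                       # from the post
PERIODS = 20                                       # from the post
HOURS_PER_PERIOD = HOURS_PER_LANGUAGE // PERIODS   # 5,000 hours each


def period_index(hours_collected: float) -> int:
    """0-based period a language is in, given the hours already collected for it."""
    if not 0 <= hours_collected < HOURS_PER_LANGUAGE:
        raise ValueError("hours_collected must be in [0, 100000)")
    return int(hours_collected // HOURS_PER_PERIOD)


def rate_per_hour(hours_collected: float, initial_rate: float, decay: float) -> float:
    """LANG paid per contributed hour, ASSUMING the rate shrinks by `decay` each period."""
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    return initial_rate * decay ** period_index(hours_collected)


# Each language tracks its own progress, so each decays independently.
# The language codes and hour counts below are hypothetical.
progress = {"en": 12_000, "sw": 0}
for lang, hours in progress.items():
    print(lang, period_index(hours), rate_per_hour(hours, 1_000, decay=0.5))
```

Because the period is derived from accumulated hours rather than elapsed time, a language with no contributions stays at the full starting rate no matter how long it waits.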

We have set the initial price of LANG tokens at $0.01 per token, which means you can earn 1,000 LANG × $0.01 = $10 per hour of contributed voice data during the first 5,000 hours of data for each language. Again, this is a fixed token payout, so its fiat value may change in the future.
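As a quick check of the arithmetic above, the following Python sketch separates the fixed token payout from its fiat value. The rate and initial price are taken from this post; the function names and the alternative price of $0.02 are hypothetical and used only to show that the token amount stays fixed while the dollar value moves.

```python
from decimal import Decimal

FIRST_PERIOD_RATE = Decimal(1000)   # LANG per contributed hour, first 5,000 hours (from the post)
INITIAL_PRICE = Decimal("0.01")     # USD per LANG at launch (from the post)


def payout_lang(hours: Decimal) -> Decimal:
    """Tokens earned for `hours` of voice data during a language's first period."""
    return hours * FIRST_PERIOD_RATE


def fiat_value(tokens: Decimal, price: Decimal = INITIAL_PRICE) -> Decimal:
    """USD value of a token amount at a given token price."""
    return tokens * price


tokens = payout_lang(Decimal(1))
print(tokens, fiat_value(tokens))                    # 1000 LANG, worth 10.00 USD at launch
print(tokens, fiat_value(tokens, Decimal("0.02")))   # same 1000 LANG at a hypothetical later price
```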

This also means that we will need to start capping individual contributions per period, as it is important for us to aggregate a diverse set of voices for each language. We will soon be announcing new data campaigns, individual caps and allocations, and opportunities to earn additional LANG through our referral program.

For now, we have allocated a separate pool of tokens, independent of the mining schedules, for building community through voice data, so please keep contributing through our Telegram bot, earn LANG, and spread the word!


Talk to us on:
Twitter | Telegram | Facebook

Visit us: langnet.io
