Data science and cryptocurrencies: Finding similar altcoins

in crytocurrency •  7 years ago  (edited)

In the past few months, the number of cryptocurrencies and ICOs has gone up significantly, and it's really hard to keep up with all of the news and hype around them.

Machine learning and data science have played a really important role in understanding text and gaining insights from it. I wanted to use these techniques to mine important information from ICO whitepapers and from comments on different forums.

This is part one of several posts I will be writing on understanding cryptocurrencies and automating the extraction of information about them.

Understanding the need to find similar cryptocurrencies

  • Diversifying portfolios. I really don't like to bet on a single sector or application. Currently, the blockchain is used for building cryptocurrencies, coins for mining or for understanding user behavior, coins for storage, and so on.
  • Catching up if you have missed the train on a coin. You can find similar altcoins in the same area and invest in them. For example, let's say you want to invest in semiconductor stocks and missed the opportunity to buy Nvidia. You could look at similar stocks such as AMD or Intel.

Clustering

Clustering is a traditional method for grouping together similar data. In my example, I have downloaded around 50 to 60 ICO whitepapers and clustered them together.
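Before anything can be clustered, the whitepapers have to end up as a list of strings. The snippet below is a minimal sketch of that loading step, not the code I actually used. It assumes the papers have already been converted to plain-text files (whitepapers usually come as PDFs, and the conversion is not shown here); the function name `load_whitepapers` and the folder name `whitepapers/` are hypothetical.

```python
from pathlib import Path


def load_whitepapers(directory):
    """Read every .txt file in `directory`.

    Returns (names, texts), both sorted by file name, where each name is
    the file name without its extension.
    """
    paths = sorted(Path(directory).glob("*.txt"))
    names = [p.stem for p in paths]
    # errors="ignore" skips bytes that are not valid UTF-8, which is common
    # in text extracted from PDFs
    texts = [p.read_text(encoding="utf-8", errors="ignore") for p in paths]
    return names, texts


# Hypothetical usage, assuming a folder of converted whitepapers exists:
# names, texts = load_whitepapers("whitepapers")
```

Keeping `names` alongside `texts` makes it easy to map cluster labels back to the coins later on.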

Steps:

  • Downloading ICO papers: This is one of the more tedious steps. Unfortunately, there is no API from which we can download the whitepapers, so this step has to be done manually.
  • Vectorizing and clustering: Using scikit-learn's TfidfVectorizer, we convert the text to numeric vectors that algorithms like KMeans can use to cluster the documents.
import re
import nltk  # the tokenizers below need the NLTK 'punkt' data to be downloaded once
from nltk.stem.snowball import SnowballStemmer

# Assumption: the stemmer definition is not shown in this post; Snowball is one common choice.
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word, so that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # keep only tokens containing letters (drops numeric tokens and raw punctuation)
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    return [stemmer.stem(t) for t in filtered_tokens]

def tokenize_only(text):
    # same as above, but lowercases the tokens and skips stemming
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    return [token for token in tokens if re.search('[a-zA-Z]', token)]
.....
.....
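from sklearn.feature_extraction.text import TfidfVectorizer

# ignore terms that appear in more than 80% or fewer than 20% of the documents
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1, 3))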

tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
....
...
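from sklearn.cluster import KMeans

# num_clusters is defined in the part of the code not shown here
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)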

Clusters Found

  • Cluster 1 (Ads and consumer privacy)
  • Cluster 2 (Financial domain)
    • Keywords: minting, white, white, paper, bank, true, holder, liquidity, voting
    • Whitepapers: Chronobank, TrueFlip, Vivacoin
  • Cluster 3 (Prediction market and risks)
    • Keywords: business, rewards, values, event, purchased, applications, price, smart, risks
    • Whitepapers: Adel, Augur, Bancor, Civic, Gnosis
  • Cluster 4 (Decentralized applications)
    • Keywords: organizes, upgradeability, released, decentralized, government, page, voting, ethereum, run
    • Whitepaper: Aragon
  • Cluster 5 (Storage and mining)
    • Keywords: data, miners, nodes, computing, agent, dividends, tasks, storage, obligation, message
    • Whitepapers: Filecoin, Sonm, Storj
  • Cluster 6 (Gaming and mobile related)
    • Keywords: game, item, players, purchased, mobile, money, sales, eth, monetization, smart
    • Whitepapers: Dmarket, Skincoin, Mobilego
  • Cluster 7 (Investment platforms)
    • Keywords: trade, investments, ico, investors, assets, crypto, coin, profits
    • Whitepapers: Coindash, Ethbits, Iconomi

As you can see, some clear sectors show up. For example, cluster 1 is all about privacy and ads, and cluster 5 is about storage and mining.
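
Clusters give you sectors, but if you have one specific coin in mind (the "missed the train" case), you can also rank the other whitepapers directly by how close their TF-IDF vectors are to it. This is a minimal sketch using cosine similarity on a made-up toy corpus. It is a swapped-in technique for illustration, not something I ran for the results above, and `most_similar`, `sim_names`, and `sim_texts` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def most_similar(name, names, tfidf_matrix, top_n=3):
    """Return up to top_n (name, score) pairs most similar to `name`."""
    idx = names.index(name)
    # similarity of the chosen document against every document (including itself)
    scores = cosine_similarity(tfidf_matrix[idx], tfidf_matrix).ravel()
    ranked = scores.argsort()[::-1]  # highest similarity first
    return [(names[i], float(scores[i])) for i in ranked if i != idx][:top_n]


# Made-up stand-ins for whitepaper texts (assumption, for illustration only)
sim_names = ["storage_a", "storage_b", "game_a", "game_b"]
sim_texts = [
    "decentralized storage network where miners store data on nodes",
    "storage miners earn tokens for keeping data available on nodes",
    "game items players trade skins in mobile game",
    "players buy game items with tokens in a mobile marketplace",
]

sim_matrix = TfidfVectorizer(stop_words="english").fit_transform(sim_texts)
print(most_similar("storage_a", sim_names, sim_matrix, top_n=2))
```

With the real data, `names` would be the list of coins and `tfidf_matrix` the matrix produced by the vectorizer earlier in the post.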

Let me know if you found this post useful and what you would like to know more about.


Very cool. I hadn't thought of looking into whitepapers. Did you notice any trends in the whitepapers' time-series as people churn out new ideas?