In the past few months, the number of cryptocurrencies and ICOs has grown significantly, and it's hard to keep up with all the news and hype around them.
Machine learning and data science have played an important role in understanding text and extracting insights from it. I wanted to use these techniques to mine information from ICO whitepapers and comments on different forums.
This is part one of a series of posts on understanding cryptocurrencies and automating the extraction of information about them.
Understanding the need to find similar cryptocurrencies
- Diversifying portfolios. I don't like to bet on a single sector or application. Blockchains are currently used to build cryptocurrencies, coins for mining, coins for understanding user behavior, coins for storage, etc.
- Also, if you have missed the train on a coin, you can find similar altcoins in the same area and invest in them. For example, if you wanted to invest in semiconductor stocks and missed the opportunity to buy Nvidia, you could look at similar stocks like AMD or Intel.
Clustering
Clustering is a classic method for grouping similar data together. For this example, I downloaded around 50 to 60 ICO whitepapers and clustered them.
Steps:
- Downloading ICO whitepapers: This is a tedious step; unfortunately, there is no API from which we can download the papers, so this has to be done manually (a sketch of loading the downloaded texts follows this list).
- Using scikit-learn's TfidfVectorizer, we convert the text into numeric vectors that algorithms like KMeans can use to cluster the documents.
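Before vectorizing, the downloaded whitepapers have to be read into memory. The snippet below is a minimal sketch, assuming each whitepaper has already been converted to a plain-text file under a whitepapers/ folder; the folder name, the .txt conversion, and the names/texts variables are illustrative assumptions, not part of the original pipeline.
import glob
import io

# Minimal loading sketch (assumption): each whitepaper has been converted to
# a plain-text file under whitepapers/.
names = []   # file names, kept so clusters can be labelled later
texts = []   # raw whitepaper text, vectorized further below
for path in sorted(glob.glob('whitepapers/*.txt')):
    with io.open(path, encoding='utf-8', errors='ignore') as f:
        names.append(path)
        texts.append(f.read())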
import re

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
.....
.....
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
....
...
from sklearn.cluster import KMeans

km = KMeans(n_clusters=num_clusters)  # num_clusters is set in the elided code above
km.fit(tfidf_matrix)
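The keywords and whitepaper groupings listed below come from the fitted model. A common way to surface them (a sketch of the usual TF-IDF/KMeans recipe, not necessarily the exact code used for this post) is to sort each cluster centroid's term weights and map the indices back to the vectorizer's vocabulary, and to group document names by km.labels_; the names list here is the one from the loading sketch above.
# Sketch: top terms per cluster, read off the sorted centroid weights.
terms = tfidf_vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print("Cluster %d keywords: %s" % (i + 1, ", ".join(top_terms)))

# Sketch: which whitepapers landed in which cluster, using the file names
# collected in the loading sketch.
for i in range(num_clusters):
    members = [name for name, label in zip(names, km.labels_) if label == i]
    print("Cluster %d whitepapers: %s" % (i + 1, ", ".join(members)))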
Clusters Found
- Cluster 1 (Ads and consumer privacy)
- Keywords: data, cared, advertising, person, costs, communication, publishing, privacy, number, encrypted
- Whitepapers: Basic Attention Token, encryptotel, Patientory, Pillar Project, ScriptDrop
- Cluster 2 (Financial domain)
- Keywords: minting, white, white paper, bank, true, holder, liquidity, voting
- Whitepapers: Chronobank, TrueFlip, Vivacoin
- Cluster 3 (Prediction market and risks)
- Cluster 4 (Decentralized applications)
- Keywords: organizes, upgradeability, released, decentralized, government, page, voting, ethereum, run
- Whitepaper: Aragon
- Cluster 5 (Storage and mining)
- Cluster 6 (Gaming and mobile related)
- Cluster 7 (Investment platforms)
As you can see, some clear sectors show up: for example, cluster 1 is all about privacy and ads, and cluster 5 is about storage and mining.
Let me know if you find this post useful, and what you would like to know more about.
Very cool. I hadn't thought of looking into whitepapers. Did you notice any trends in the whitepapers' time-series as people churn out new ideas?