In the past few months, the number of cryptocurrencies and ICOs has grown significantly, and it's hard to keep up with all the news and hype around them.
Machine learning and data science have played an important role in understanding text and extracting insights from it. I wanted to use these techniques to mine information from ICO whitepapers and comments on different forums.
This is part one of a series of posts on understanding cryptocurrencies and automating the extraction of information about them.
Understanding the need to find similar cryptocurrencies
- Diversifying portfolios. I don't like to bet on a single sector or application. Blockchains are currently used to build cryptocurrencies, coins for mining, coins for understanding user behavior, coins for storage, etc.
- Also, if you have missed the train on a coin, you can find similar altcoins in the same area and invest in them. For example, if you wanted to invest in semiconductor stocks and missed the opportunity to buy Nvidia, you could look at similar stocks like AMD or Intel.
Clustering
Clustering is a classic method for grouping similar data together. For this example, I downloaded around 50 to 60 ICO whitepapers and clustered them.
Steps:
- Downloading ICO whitepapers: This is a tedious step; unfortunately, there is no API from which we can download the papers, so this has to be done manually (a sketch of loading the downloaded texts follows this list).
- Using scikit-learn's TfidfVectorizer, we convert the text into numeric vectors that algorithms like KMeans can use to cluster the documents.
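Before vectorizing, the downloaded whitepapers have to be read into memory. The snippet below is a minimal sketch, assuming each whitepaper has already been converted to a plain-text file under a whitepapers/ folder; the folder name, the .txt conversion, and the names/texts variables are illustrative assumptions, not part of the original pipeline.
import glob
import io

# Minimal loading sketch (assumption): each whitepaper has been converted to
# a plain-text file under whitepapers/.
names = []   # file names, kept so clusters can be labelled later
texts = []   # raw whitepaper text, vectorized further below
for path in sorted(glob.glob('whitepapers/*.txt')):
    with io.open(path, encoding='utf-8', errors='ignore') as f:
        names.append(path)
        texts.append(f.read())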
import re

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")


def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens
.....
.....
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)
....
...
from sklearn.cluster import KMeans

km = KMeans(n_clusters=num_clusters)  # num_clusters is set in the elided code above
km.fit(tfidf_matrix)
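The keywords and whitepaper groupings listed below come from the fitted model. A common way to surface them (a sketch of the usual TF-IDF/KMeans recipe, not necessarily the exact code used for this post) is to sort each cluster centroid's term weights and map the indices back to the vectorizer's vocabulary, and to group document names by km.labels_; the names list here is the one from the loading sketch above.
# Sketch: top terms per cluster, read off the sorted centroid weights.
terms = tfidf_vectorizer.get_feature_names_out()  # use get_feature_names() on older scikit-learn
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print("Cluster %d keywords: %s" % (i + 1, ", ".join(top_terms)))

# Sketch: which whitepapers landed in which cluster, using the file names
# collected in the loading sketch.
for i in range(num_clusters):
    members = [name for name, label in zip(names, km.labels_) if label == i]
    print("Cluster %d whitepapers: %s" % (i + 1, ", ".join(members)))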
Clusters Found
- Cluster 1 (Ads and consumer privacy)
- Keywords: data, cared, advertising, person, costs, communication, publishing, privacy, number, encrypted
- Whitepapers: Basic Attention Token, encryptotel, Patientory, Pillar Project, ScriptDrop
- Cluster 2 (Financial domain)
- Keywords: minting, white, white paper, bank, true, holder, liquidity, voting
- Whitepapers: Chronobank, TrueFlip, Vivacoin
- Cluster 3 (Prediction market and risks)
- Cluster 4 (Decentralized applications)
- Keywords: organizes, upgradeability, released, decentralized, government, page, voting, ethereum, run
- Whitepaper: Aragon
- Cluster 5 (Storage and mining)
- Cluster 6 (Gaming and mobile related)
- Cluster 7 (Investment platforms)
As you can see, some clear sectors show up: for example, cluster 1 is all about privacy and ads, and cluster 5 is about storage and mining.
Let me know if you find this post useful, and what you would like to know more about.
Very cool. I hadn't thought of looking into whitepapers. Did you notice any trends in the whitepapers' time-series as people churn out new ideas?