If you want to train a machine learning model on textual input, or find relations between words, you first need to convert the text to vectors. These vectors are called word embeddings. There are many ways of doing this. One of the simplest is to take a large sample of text, find all the unique words and assign an ID to each of them. That works, but how do you cluster similar or related words together? You would need to run a separate clustering algorithm over the data to group related words.
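To make the ID approach concrete, here is a minimal sketch in Python. The function names `build_vocabulary` and `encode` and the sample sentence are made up for illustration; real tokenizers handle punctuation and casing far more carefully.

```python
import re

def build_vocabulary(text):
    """Map each unique lowercase word to an integer ID, in order of first appearance."""
    word_to_id = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in word_to_id:
            word_to_id[word] = len(word_to_id)
    return word_to_id

def encode(text, word_to_id):
    """Convert text to a list of IDs, skipping words that are not in the vocabulary."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word_to_id[w] for w in words if w in word_to_id]

# Hypothetical sample corpus, just for illustration.
sample = "the cat chased the dog and the dog chased the rat"
vocab = build_vocabulary(sample)
ids = encode("the dog chased the cat", vocab)

print(vocab)
print(ids)
```

Notice that the IDs say nothing about meaning: "cat" and "dog" get numbers that are no closer together than "cat" and "and". That is the gap Word2Vec is meant to fill.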
Word2Vec is a machine learning model used to generate word embeddings in which words that are similar to each other lie close together in vector space. Word2Vec was developed by a team led by Tomas Mikolov while he was at Google. You can read his paper on word2vec here
Word2Vec can automatically capture relations between words: Paris, Beijing, Tokyo, Delhi and New York are all clustered together in vector space. Similarly, Cat, Dog, Rat and Duck are all clustered together in vector space.
This also helps us find interesting relations between words. What if you remove Man from King and add Woman? You get Queen.
King - Man + Woman = Queen
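Here is a small sketch of how that arithmetic works. The 2-D vectors below are hand-made toy values (not output from a real Word2Vec model, which typically uses hundreds of dimensions), and `analogy` is a hypothetical helper. The idea is to add and subtract the vectors, then pick the remaining word whose vector points in the most similar direction, measured by cosine similarity.

```python
import math

# Hand-made toy vectors (an assumption for illustration): [royalty, gender],
# where a positive gender value means male and a negative one means female.
vectors = {
    "king":   (0.9,  0.9),
    "queen":  (0.9, -0.9),
    "man":    (0.1,  0.9),
    "woman":  (0.1, -0.9),
    "prince": (0.7,  0.8),
    "apple":  (-0.5, 0.1),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    dims = len(vectors[positive[0]])
    target = [0.0] * dims
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the input words themselves, otherwise "king" or "woman" could win.
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

With real embeddings the result is rarely an exact match. The answer is simply the closest word to the computed point, which is why the input words are excluded from the candidates.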
Word2Vec can be used to perform relational operations like the one above. Provided the model is trained on a large enough corpus of data, it can also return the words most similar to a given word, measure how similar two words are, and pick the odd word out of a group of words.
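The three operations can be sketched with plain Python on toy data. The vectors are again hand-made assumptions, and the helper names `similarity`, `most_similar` and `odd_one_out` are chosen for illustration (libraries such as gensim offer comparable operations on trained models). The odd-one-out rule here is a simplified one: pick the word least similar to the average of the group.

```python
import math

# Hand-made toy vectors (an assumption for illustration): [city-ness, animal-ness].
vectors = {
    "paris": (0.90, 0.10),
    "tokyo": (0.80, 0.20),
    "delhi": (0.85, 0.15),
    "cat":   (0.10, 0.90),
    "dog":   (0.20, 0.80),
    "rat":   (0.15, 0.85),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity(word1, word2):
    """How similar two words are, as the cosine of their vectors."""
    return cosine(vectors[word1], vectors[word2])

def most_similar(word, topn=3):
    """The topn other words ranked by similarity to `word`, best first."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

def odd_one_out(words):
    """The word whose vector is least similar to the average vector of the group."""
    dims = len(vectors[words[0]])
    mean = [sum(vectors[w][i] for w in words) / len(words) for i in range(dims)]
    return min(words, key=lambda w: cosine(vectors[w], mean))

print(similarity("paris", "tokyo"), similarity("paris", "cat"))
print(most_similar("cat", topn=2))
print(odd_one_out(["paris", "tokyo", "delhi", "cat"]))
```

A word that is not in `vectors` raises a `KeyError` here, which is the toy version of a word missing from a model's vocabulary.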
Here are two online demos where you can try out Word2Vec and see what it can do. One demo is based on a model trained on Google News and the other on a model trained on much less data. You can see what happens when a Word2Vec model is trained on too little data and words are missing from its vocabulary.
http://bionlp-www.utu.fi/wv_demo/ (Model based on Google News)
http://simpledemos.online/word2vec (Model trained on small corpus of data, missing vocabulary)
Sources:
Image: www.samyzaf.com/ML/nlp/word2vec2.png
PS: This is my first public post ever, not just on Steemit, so I would appreciate any comments and tips on how to improve. I will mostly be doing Python tutorials, from GUI creation using PyQt to sentiment analysis. If you want a basic tutorial or example on something, please leave a comment.