In NLP, words are usually the basic semantic units, which is convenient for standardization. Yet word roots carry meaning on their own, so we can try anchoring word representations to the roots a word contains. The word-root relationship, however, is not uniform: some words contain no root and are themselves units of meaning, while others contain varying numbers of roots. Therefore, I picture a root vectorization that is affiliated with a word vectorization, making the roots not quite their own level of representation, but meaningful only in conjunction with the words. That also means that if we take a set of word vectors as given, the root vectors will change with the particular word vectorization.
It makes obvious sense that the root vectors should contribute to the word vectors by making the latter a function of the former. While a linear mapping would be desirable for simplicity, a word might not always be a “summation” of its roots, so I will stay open to various functional forms as I dig into the data. Nevertheless, I imagine a relatively simple function, since what we are building should be very small “lego pieces” in an NLP model. I will start with a linear mapping and try other functions, taking cues from the results I see.
As for the nature of the word-root relationship, many words are not made of roots at all. Other words do contain roots, but their meanings include something beyond the roots’ meanings. So there should be an important “residual” component in the root-to-word function. In many cases, the residual simply covers the meaning of the entire word. This setup ties in with the affiliative nature of the word-root relationship. In fact, let’s call the residual the “idiosyncratic” denotation of the word.
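The setup above can be sketched in a few lines of code. This is a minimal sketch under assumptions of my own: the weights, dimensions, and variable names are hypothetical placeholders, and the linear form is just the starting point discussed above.

```python
import torch

torch.manual_seed(0)
d = 50  # embedding dimension, matching the 50-d GloVe vectors used below

# Hypothetical example: a word containing two roots.
root_vecs = torch.randn(2, d)    # vectors of the roots the word contains
idiosyncratic = torch.randn(d)   # the word's "residual" meaning

# Linear root-to-word mapping: the word vector is a weighted sum of
# its root vectors plus an idiosyncratic residual term.
root_weights = torch.tensor([0.5, 0.5])  # placeholder weights
word_vec = root_weights @ root_vecs + idiosyncratic

# A word containing no roots is purely idiosyncratic.
rootless_word_vec = idiosyncratic.clone()
```

For a rootless word, the residual is the whole word vector, which is exactly the affiliative relationship described above.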
Modelling:
Let’s first look at the data we’ll be working with. I’m using a pre-trained GloVe word representation.
import torchtext

glove = torchtext.vocab.GloVe(name="6B",  # trained on the Wikipedia 2014 corpus
                              dim=50)
As for word roots, I’m starting with a very small dataset found online here. It contains the roots’ meanings as well as sample words, which is handy as you will see below. This is just to get the snowball rolling, and I will look for better data as I build onto my model. Of course, if you know of a good word root dataset, or if you have a dataset you would like me to work on, please drop me a note.
Since I have the meanings of the roots, I can build my root vectors using the words appearing in each definition. As a start, I’m using a linear mapping. In the absence of further information, I simply initialize the weights to 1/M, with M being the length of the definition. In matrix form, let’s call these weights Wrm, with entries initialized at 1/M if word m is in the definition of root r, and at 0 otherwise. Since the definitions are usually a couple of words or a short phrase, with minimal sight words, I may not need much adjustment from a non-semantic point of view. During later fitting, or other tasks to be accomplished, I would like these weights to be fixed or variable depending on what I’m doing. Depending on the specific purpose, root vectors, word vectors, and word-to-root weights could each be treated as a constant or a variable.
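The Wrm initialization can be sketched as follows. The root definitions here are a tiny hypothetical sample standing in for the real dataset, and the variable names are my own; given a word-vector matrix E, the initialized root vectors are then simply W @ E.

```python
import torch

# Hypothetical sample data: each root maps to the words in its definition.
root_defs = {
    "ian": ["related", "to", "like"],
    "ism": ["condition", "belief", "in"],
}
roots = list(root_defs.keys())
vocab = sorted({w for d in root_defs.values() for w in d})
w_idx = {w: i for i, w in enumerate(vocab)}

# W (i.e. Wrm): rows index roots, columns index words; an entry is 1/M if
# word m appears in the length-M definition of root r, and 0 otherwise.
W = torch.zeros(len(roots), len(vocab))
for r, definition in enumerate(root_defs.values()):
    for w in definition:
        W[r, w_idx[w]] = 1.0 / len(definition)

# Given word vectors E of shape (len(vocab), d), e.g. rows pulled from
# GloVe, the initialized root vectors are the rows of W @ E.
```

Because each row of W sums to one, W @ E averages the definition’s word vectors, which is exactly the initialization used for the results below.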
Results and Discussions:
After initializing each root as the average of the words appearing in its definition, I ran a similarity search on my roots. Below are some of the results. Each word root along with its definition is shown in the first row, followed by the five closest words and their distances to the initialized root vector.
ian---related to; like
well 1.8041103
as 1.9017267
instance 1.9191492
and 1.9456112
example 1.9542902
ile---related to
to 2.1357174
related 2.1357174
instance 2.1470504
which 2.18688
for 2.221224
ism---condition; belief in
however 2.1781738
indeed 2.2207263
although 2.2463272
taken 2.2809665
reasons 2.3036516
ist---person who does
why 2.142225
thought 2.1434772
neither 2.1972415
person 2.2037826
but 2.2165987
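The similarity search can be sketched as below. This is a minimal sketch, not the exact script used: the distance metric is assumed to be Euclidean (consistent with the scale of the distances reported above), and the tiny embedding table is a synthetic stand-in — in the actual experiment, the arguments would be glove.vectors and glove.itos from the torchtext GloVe object loaded earlier.

```python
import torch

def closest_words(vec, vectors, itos, n=5):
    # Euclidean distance from the query vector to every word vector,
    # then take the n smallest.
    dists = torch.norm(vectors - vec, dim=1)
    best = torch.argsort(dists)[:n]
    return [(itos[i], dists[i].item()) for i in best]

# Synthetic 2-d embedding for illustration only.
itos = ["condition", "belief", "in", "person"]
vectors = torch.tensor([[1.0, 0.0],
                        [0.0, 1.0],
                        [0.5, 0.5],
                        [3.0, 3.0]])

# A root vector initialized as the average of its definition's word vectors.
root_vec = vectors[:3].mean(dim=0)
print(closest_words(root_vec, vectors, itos, n=2))
```

Swapping in the 50-d GloVe table reproduces the kind of ranked (word, distance) lists shown above.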
I’m omitting roots with a one-word definition, because they would be initialized identically to that word. Among the examples shown, although Wrm is still unadjusted, we can already see signs of digging into the meanings. For instance, there is some intelligence in the relationship between “ism” and the words “indeed” and “reasons”. Comparing the results for “ian” and “ile”, which share a portion of their definitions, we can see GloVe at work: when “like” is added to the definition, words like “as” and “example” become more relevant. In the case of “ist”, whose definition is a phrase, randomness in the results is expected, since we are only working with word data and not considering syntax. However, the fact that “person” did show up as relevant implies that certain relations among words are already captured in the word data.