5 Days ~ Further Efforts on Automated Curation


Alright, I'm sorry that I haven't written as much on #steemsilvergold recently, but my curation bot @aicu absorbs a lot of time. I'll write something for SSG soon, I promise :)

Anyway, after some sleepless nights I made some progress on the performance of my bot aicu. I stumbled upon an interesting paper about improvements to the TF-IDF model. Aside from basic techniques like stop-word removal and stemming, it covered some more interesting concepts. One was the use of Singular Value Decomposition; in combination with TF-IDF, this method is also known as Latent Semantic Analysis/Indexing. Albeit fascinating, that wasn't the only interesting thing mentioned there.

Log Normalisation of TF-IDF


The other interesting bit must have slipped the mind of my lecturer back at uni: the normalisation of the term frequencies in TF-IDF.

It works as follows:

if log(TF) >= 1:
    TF-IDF = log(TF) * log(IDF)
else:
    TF-IDF = 0

That way, very high and very low frequencies don't contribute as much to the model. After typing out a couple of lines of code, the normalisation was implemented and ready for training. This small step improved the F1 score of my model by 0.011, from 0.821 to 0.832. I also noticed far less noise in the feature importances.
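Here is a rough sketch of how that rule might look in plain NumPy. The function name, the variable names and the assumption that IDF here means the raw ratio N/df (so that log(IDF) is the familiar logged inverse document frequency) are mine for illustration, not aicu's actual code.

import numpy as np

def log_normalised_tfidf(tf_counts, df_ratio):
    """Log-normalised TF-IDF following the rule above.

    tf_counts : raw term frequencies, shape (n_documents, n_terms)
    df_ratio  : N / document frequency per term, shape (n_terms,)
    Both names are placeholders, not taken from the bot's code.
    """
    tf_counts = np.asarray(tf_counts, dtype=float)

    # log(TF), leaving zero counts at zero instead of -inf
    log_tf = np.zeros_like(tf_counts)
    np.log(tf_counts, out=log_tf, where=tf_counts > 0)

    # log(TF) * log(IDF) wherever log(TF) >= 1, zero everywhere else
    weights = log_tf * np.log(df_ratio)
    weights[log_tf < 1] = 0.0
    return weights

If you build your features with scikit-learn's TfidfVectorizer, the sublinear_tf=True option gives you a closely related built-in variant that rescales raw counts to 1 + log(TF).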

SVD meets normalised TF-IDF

Singular Value Decomposition is a convenient tool for improving training data: it helps single out the most informative features. Luckily, scikit-learn already has an SVD implementation for sparse matrices, the TruncatedSVD. After some data preparation, I plugged in the TF-IDF training data and computed the smaller training matrix. On a first attempt, an SVD with 100 components achieved the highest performance, with an F-score of 0.85. This brings the feature count down to just 213, a massive improvement from over 130,000.
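For anyone who wants to try the same combination, here is a minimal sketch with scikit-learn. The toy corpus and the tiny component count only exist to keep the example self-contained; aicu's real training posts and the 100-component setting mentioned above are not shown.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Placeholder corpus; the bot trains on real Steem posts instead.
documents = [
    "stacking silver coins and bars",
    "a new gold round arrived today",
    "python tips for machine learning",
    "training a text classifier with tf-idf",
]

# sublinear_tf=True applies scikit-learn's built-in 1 + log(TF) scaling,
# a close relative of the normalisation described above.
vectoriser = TfidfVectorizer(stop_words="english", sublinear_tf=True)

# TruncatedSVD works directly on the sparse TF-IDF matrix (the LSA part).
# aicu uses 100 components; 2 is enough for this tiny placeholder corpus.
lsa = make_pipeline(vectoriser, TruncatedSVD(n_components=2))

dense_features = lsa.fit_transform(documents)
print(dense_features.shape)  # (4, 2) instead of (4, vocabulary size)

The resulting small, dense matrix is what the classifier then trains on instead of the huge sparse TF-IDF matrix.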

I think it's remarkable that a comparatively simple approach like LSA leads to a model which captures some basic semantic features, and all of that emerges from a bit of word counting. Pretty fascinating, if you ask me. Aicu is now running with this version, and I'll see whether the improvement on the test dataset carries over to real-life data as well.
