RE: Text Analysis On The Dr. Seuss - The Cat In The Hat Book With R Programming


in programming •  7 years ago  (edited)

Nice work! However, the sentiment analysis seems really subjective. Why is 'mother' sometimes negative and sometimes positive? Why is 'sir' positive? Why is 'funny' negative? Why is 'fast' positive? Is it positive if you 'drive too fast'? Is 'dark' negative in 'dark-haired beauty'? Is 'eat' positive if you "have nothing to eat"? Well, perhaps we can count more on 'nothing'. But then, is 'nothing' negative if you have "nothing to fear"? That last phrase consists of two supposedly negative words, but overall it has a positive sentiment.

Perhaps there is something I don't understand about how this works, but it seems very simplistic and misleading to take words out of context like that. Is sentiment analysis really a thing? I guess it could be improved if it worked more like automated translators do, learning from large volumes of data to associate words with their context, but it still seems odd to me to decide on "sentiment" this way. This is the first time I'm seeing this, so please let me know if I'm wrong.


Sentiment analysis is indeed subjective. The lexicons look at single words without context.
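To make the "single words without context" point concrete, here is a minimal sketch (assuming the tidytext and dplyr packages are installed) that tokenizes the phrase "nothing to fear" into unigrams and joins them against the bing lexicon. Each word is scored in isolation, so the negation that makes the whole phrase positive is invisible to the lexicon.

```r
library(dplyr)
library(tidytext)

phrase <- tibble(text = "nothing to fear")

# Break the phrase into unigrams -- this is all the lexicon ever sees.
tokens <- phrase %>% unnest_tokens(word, text)

# Join against the bing lexicon: each word is matched in isolation,
# so any negation or idiom in the surrounding phrase is lost.
tokens %>% inner_join(get_sentiments("bing"), by = "word")
```

Chapter 2 of the reference above discusses one common workaround: tokenize into bigrams instead and flip the score of words preceded by negators like "not" or "nothing".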

This reference (Chapter 2) may answer a few questions. https://www.tidytextmining.com/sentiment.html

Here are a few parts from that reference.

The three general-purpose lexicons are:

- AFINN from Finn Årup Nielsen,
- bing from Bing Liu and collaborators, and
- nrc from Saif Mohammad and Peter Turney.

All three of these lexicons are based on unigrams, i.e., single words. These lexicons contain many English words and the words are assigned scores for positive/negative sentiment, and also possibly emotions like joy, anger, sadness, and so forth. The nrc lexicon categorizes words in a binary fashion (“yes”/“no”) into categories of positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The bing lexicon categorizes words in a binary fashion into positive and negative categories. The AFINN lexicon assigns words with a score that runs between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. All of this information is tabulated in the sentiments dataset, and tidytext provides a function get_sentiments() to get specific sentiment lexicons without the columns that are not used in that lexicon.

How were these sentiment lexicons put together and validated? They were constructed via either crowdsourcing (using, for example, Amazon Mechanical Turk) or by the labor of one of the authors, and were validated using some combination of crowdsourcing again, restaurant or movie reviews, or Twitter data. Given this information, we may hesitate to apply these sentiment lexicons to styles of text dramatically different from what they were validated on, such as narrative fiction from 200 years ago. While it is true that using these sentiment lexicons with, for example, Jane Austen’s novels may give us less accurate results than with tweets sent by a contemporary writer, we still can measure the sentiment content for words that are shared across the lexicon and the text.