Natural Language Processing (Part 3 - WordNets & Info Extraction)


Introduction

Again, continuing this tutorial series on Natural language processing, I'll introduce wordnets with the Python module, nltk. For reference, you can check out previous posts here and here.

0.0 Setup

This guide was written in Python 3.6.

0.1 Python & Pip

If you haven't already, please download Python and Pip.

0.2 Libraries

We'll be working with the re library for regular expressions and nltk for natural language processing techniques, so make sure to install them! To install these libraries, enter the following commands into your terminal:

pip3 install nltk==3.2.4

0.3 Other

The WordNet, SentiWordNet, and stop word corpora ship separately from nltk itself, so enter the following command in your terminal to download them:

python3 -m nltk.downloader wordnet sentiwordnet stopwords

Cool, now we're ready to start!

1.0 Background

1.1 Polarity Flippers

Polarity flippers are words that change positive expressions into negative ones or vice versa.

1.1.1 Negation

Negations directly flip an expression's sentiment by preceding the word they modify. An example would be

The cat is not nice.
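As a toy illustration of the idea (not part of nltk), a negation handler can flip the sign of the sentiment word that follows a negator. The word lists here are made up for the example:

```python
# Made-up mini lexicon; a negator flips the sign of the word that follows it.
NEGATORS = {"not", "never", "no"}
LEXICON = {"nice": 1, "evil": -1}

def sentence_polarity(sentence):
    tokens = sentence.lower().rstrip(".").split()
    score = 0
    for i, token in enumerate(tokens):
        if token in LEXICON:
            value = LEXICON[token]
            # Flip the sign when the previous token is a negator.
            if i > 0 and tokens[i - 1] in NEGATORS:
                value = -value
            score += value
    return score

print(sentence_polarity("The cat is nice."))      # 1
print(sentence_polarity("The cat is not nice."))  # -1
```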

1.1.2 Contrastive Discourse Connectives

Contrastive discourse connectives are words like "but" that indirectly shift an expression's meaning. An example would be

I usually like cats, but this cat is evil.
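One simple heuristic (a sketch, not an nltk feature) is to split a sentence on "but" and weight the second clause more heavily, since it usually carries the speaker's real stance. The lexicon and weights below are invented for illustration:

```python
# Made-up mini lexicon; real systems use far larger ones.
LEXICON = {"like": 1, "evil": -1}

def clause_score(clause):
    tokens = clause.lower().replace(",", "").rstrip(".").split()
    return sum(LEXICON.get(t, 0) for t in tokens)

def sentence_score(sentence):
    # The clause after "but" usually carries the speaker's real stance,
    # so it is weighted more heavily than what comes before it.
    if " but " in sentence.lower():
        before, after = sentence.lower().split(" but ", 1)
        return 0.5 * clause_score(before) + 1.5 * clause_score(after)
    return clause_score(sentence)

print(sentence_score("I usually like cats, but this cat is evil."))  # -1.0
```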

1.2 Multiword Expressions

Multiword expressions are important because, depending on the context, they can be considered positive or negative. For example,

This song is shit.

is definitely considered negative. Whereas

This song is the shit.

is actually considered positive, simply because of the addition of 'the' before the word 'shit'.
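A common way to handle this is longest-match lookup: try multiword phrases before falling back to single words. A minimal sketch with an invented phrase lexicon:

```python
# Made-up phrase lexicon; longest match wins, so "the shit" beats "shit".
PHRASES = {("the", "shit"): 1, ("shit",): -1}
MAX_LEN = max(len(p) for p in PHRASES)

def score(sentence):
    tokens = sentence.lower().rstrip(".").split()
    total, i = 0, 0
    while i < len(tokens):
        # Try the longest phrase first at position i.
        for n in range(MAX_LEN, 0, -1):
            gram = tuple(tokens[i:i + n])
            if len(gram) == n and gram in PHRASES:
                total += PHRASES[gram]
                i += n
                break
        else:
            i += 1
    return total

print(score("This song is shit."))      # -1
print(score("This song is the shit."))  # 1
```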

1.3 WordNet

WordNet is an English lexical database with emphasis on synonymy - sort of like a thesaurus. Specifically, nouns, verbs, adjectives, and adverbs are grouped into synonym sets.

1.3.1 Synsets

nltk has a built-in WordNet that we can use to find synonyms. We import it as such:

from nltk.corpus import wordnet as wn

If we feed a word to the synsets() method, the return value will be the list of synonym sets the word belongs to. For example, if we call the method on motorcar,

print(wn.synsets('motorcar'))

we get:

[Synset('car.n.01')]

Awesome stuff! But if we want to take it a step further, we can. We've previously learned what lemmas are - if you want to obtain the lemmas for a given synonym set, you can use the following method:

print(wn.synset('car.n.01').lemma_names())

This will get you:

['car', 'auto', 'automobile', 'machine', 'motorcar']

Even more, you can do things like get the definition of a word:

print(wn.synset('car.n.01').definition())

'a motor vehicle with four wheels; usually propelled by an internal combustion engine'

Again, pretty neat stuff.

1.3.2 Negation

With WordNet, we can easily detect negations. This is great because it's not only fast, but it requires no training data and has a fairly good predictive accuracy. On the other hand, it's not able to handle context well or work with multiple word phrases.

1.4 SentiWordNet

Based on WordNet synsets, SentiWordNet is a lexical resource for opinion mining, where each synset is assigned three sentiment scores: positivity, negativity, and objectivity.

from nltk.corpus import sentiwordnet as swn
cat = swn.senti_synset('cat.n.03')
cat.pos_score()
cat.neg_score()
cat.obj_score()

1.5 Stop Words

Stop words are extremely common words that carry little value in our analysis, so they are often excluded from the vocabulary entirely. Some common examples are determiners like the, a, an, and another, but your list of stop words (or stop list) depends on the context of the problem you're working on.

2.0 Information Extraction

Information Extraction is the process of acquiring meaning from text in a computational manner.

2.1 Data Forms

2.1.1 Structured Data

Structured Data is when there is a regular and predictable organization of entities and relationships.

2.1.2 Unstructured Data

Unstructured data, as the name suggests, assumes no organization. This is the case with most written textual data.

2.2 What is Information Extraction?

With that said, information extraction is the means by which you acquire structured data from a given unstructured dataset. There are a number of ways in which this can be done, but generally, information extraction consists of searching for specific types of entities and relationships between those entities.

An example is being given the following text,

Martin received a 98% on his math exam, whereas Jacob received an 84%. Eli, who also took the same test, received an 89%. Lastly, Ojas received a 72%.

This is clearly unstructured. It requires reading for any logical relationships to be extracted. Through the use of information extraction techniques, however, we could output structured data such as the following:

Name     Grade
Martin   98
Jacob    84
Eli      89
Ojas     72
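For a text this regular, even a regular expression can do the extraction. The pattern below is hand-tailored to this one example (it assumes every fact reads "<Name> ... received a(n) <grade>%"), so it's a sketch of the idea rather than a general extractor:

```python
import re

text = ("Martin received a 98% on his math exam, whereas Jacob received a 84%. "
        "Eli, who also took the same test, received an 89%. Lastly, Ojas received a 72%.")

# Each fact follows "<Name>(, optional aside,)? received a(n) <grade>%".
pattern = r"([A-Z][a-z]+)(?:, [^,.]*,)? received an? (\d+)%"
rows = [(name, int(grade)) for name, grade in re.findall(pattern, text)]
print(rows)  # [('Martin', 98), ('Jacob', 84), ('Eli', 89), ('Ojas', 72)]
```

Real pipelines replace the regex with named entity recognition and relation extraction, which the next tutorial covers.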

Final Words

In the next tutorial, we'll go deeper into information extraction, named entity extraction, and relationship extraction. Stay tuned for more!
