Python code for March 20/22

Hidden Markov Models can be used for part-of-speech tagging based on sequences: given the sequence of words, what is the most likely sequence of tags? The central problem is to compute that most likely tag sequence.

Here we show how to compare two tag sequences and see which one is more likely.

Hidden Markov Models for part-of-speech tagging rely on two types of probabilities: P(current tag | previous tag) and P(current word | current tag).
Here is how we can estimate these probabilities from the Brown corpus using the Natural Language Toolkit. (Note: In a real-world application, you would not use the Brown corpus but something much bigger.)
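
Concretely, for a sentence w1 ... wn with a candidate tag sequence t1 ... tn, the Hidden Markov Model scores the pair with the product P(w1 | t1) P(t2 | t1) P(w2 | t2) ... P(tn | tn-1) P(wn | tn). (We leave out the probability of the very first tag here, since it will be the same for all the tag sequences we compare below.)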

Here is how to estimate all probabilities P(current tag | previous tag):
import nltk
# (word, tag) pairs from the tagged Brown corpus
brown_word_tags = nltk.corpus.brown.tagged_words()
# keep just the sequence of tags
brown_tags = [tag for (word, tag) in brown_word_tags]
# count how often each tag follows each other tag ...
cfd_tags = nltk.ConditionalFreqDist(nltk.bigrams(brown_tags))
# ... and turn the counts into maximum likelihood estimates of P(current tag | previous tag)
cpd_tags = nltk.ConditionalProbDist(cfd_tags, nltk.MLEProbDist)

You can now get the estimated probability of seeing the tag "VB" given that the previous tag was "TO":
cpd_tags["TO"].prob("VB")

Here is how to estimate all probabilities P(current word | current tag). One problem is that nltk.corpus.brown.tagged_words() consists of (word, tag) pairs, while we need pairs (tag, word) in order to estimate the probability of a word given a tag, rather than the probability of a tag given a word:

# swap each (word, tag) pair around, so that the tag is the condition
brown_tag_words = [(tag, word) for (word, tag) in brown_word_tags]
# count how often each tag occurs with each word ...
cfd = nltk.ConditionalFreqDist(brown_tag_words)
# ... and turn the counts into maximum likelihood estimates of P(current word | current tag)
cpd_wordtags = nltk.ConditionalProbDist(cfd, nltk.MLEProbDist)

Now we can get the estimated probability of seeing the word "race" given that the current tag is "VB":
cpd_wordtags["VB"].prob("race")

Example sentence from Jurafsky and Martin (p. 142ff):
Secretariat is expected to race tomorrow

What is more likely, the sequence
NNP VBZ VBN TO VB NR

or the sequence
NNP VBZ VBN TO NN NR
?

So we need to compare the following two numbers:
For the sequence with "VB",
P(Secretariat | NNP) P(VBZ | NNP) P(is | VBZ) P(VBN | VBZ) P(expected | VBN) P(TO | VBN) P(to | TO) P(VB | TO) P(race | VB) P(NR | VB) P(tomorrow | NR)

For the sequence with "NN",
P(Secretariat | NNP) P(VBZ | NNP) P(is | VBZ) P(VBN | VBZ) P(expected | VBN) P(TO | VBN) P(to | TO) P(NN | TO) P(race | NN) P(NR | NN) P(tomorrow | NR)

Most of the factors in these two products are the same. The two products differ only in the following:

For the sequence with "VB", we have
P(VB | TO) P(race | VB) P(NR | VB)

While for the sequence with "NN", we have
P(NN | TO) P(race | NN) P(NR | NN)

If we compute these two remaining products, we will know which sequence is more likely, and we can do so with the Python objects estimated above.

For VB, we get
>>> cpd_tags["TO"].prob("VB") * cpd_wordtags["VB"].prob("race") * cpd_tags["VB"].prob("NR")
2.4385773143342077e-07

For NN, we get
>>> cpd_tags["TO"].prob("NN") * cpd_wordtags["NN"].prob("race") * cpd_tags["NN"].prob("NR")
2.6671260955826017e-10

So the tag sequence with "VB" is overall more likely.
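
As a follow-up, here is a small helper function (just a sketch, using the cpd_tags and cpd_wordtags objects from above) that computes this kind of product for an arbitrary tagged sequence. Note that with plain maximum likelihood estimates from the Brown corpus, the full products for our example sentence may well both come out as 0, for instance because "Secretariat" need not occur in the corpus at all (and the Brown tagset uses "NP" rather than "NNP" for proper nouns); that is one more reason to compare only the factors in which the two tag sequences differ, as we did above, and why real taggers use smoothed estimates rather than plain MLE.

def hmm_product(words, tags, prev_tag=None):
    # Product of P(tag | previous tag) * P(word | tag) along the sequence.
    # If prev_tag is None, the first tag contributes only its word probability,
    # exactly as in the hand-written products above.
    prob = 1.0
    for word, tag in zip(words, tags):
        if prev_tag is not None:
            prob *= cpd_tags[prev_tag].prob(tag)
        prob *= cpd_wordtags[tag].prob(word)
        prev_tag = tag
    return prob

words = "Secretariat is expected to race tomorrow".split()
print(hmm_product(words, ["NNP", "VBZ", "VBN", "TO", "VB", "NR"]))
print(hmm_product(words, ["NNP", "VBZ", "VBN", "TO", "NN", "NR"]))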
