
Worksheet: an introduction to computational linguistics

What is computational linguistics about? Computational linguistics is an interdisciplinary field between linguistics and computer science. It uses methods from computer science for two different purposes: first, to build systems that can automatically understand natural language (at least to some degree); and second, to better understand how language works by using mathematical models.

This worksheet introduces some of the main tasks and methods from computational linguistics by way of a programmed example, relying heavily on the Natural Language Toolkit.

###############
# text and words

import nltk

from nltk.book import *

# text consists of words
# here are the first 100 words of the Moby Dick text
text1[:100]

# and here is the start of the collection of inaugural addresses
text4[:100]

# by just using word counts, we can get an idea of the topic of a text.
# for example like this
# (Warning: this command requires matplotlib)
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

# or by hand:
from nltk.corpus import inaugural

# the list of inaugural addresses
inaugural.fileids()

# The first inaugural address
inaugural.words(inaugural.fileids()[0])

# relative frequency of "freedom" (as a percentage)
for address in inaugural.fileids():
    address_words = inaugural.words(address)
    percentage = 100 * address_words.count("freedom") / len(address_words)
    print(address, "freedom:", round(percentage, 3))

# relative frequency of "duties" (as a percentage)
for address in inaugural.fileids():
    address_words = inaugural.words(address)
    percentage = 100 * address_words.count("duties") / len(address_words)
    print(address, "duties:", round(percentage, 3))

# What are the most frequent words in the 1st inaugural address,
# Washington 1789
from nltk.probability import FreqDist

address1 = inaugural.words(inaugural.fileids()[0])
fd = FreqDist(address1)

# 10 most frequent words
fd.pprint(10)
# or much more readably:
fd.tabulate(10)
# or like this:
fd.most_common(10)

# hm, that is not very informative. The problem is that we are seeing
# a lot of stopwords.
# Let's get rid of those
from nltk.corpus import stopwords
en_stopwords = stopwords.words("english")
for stopword in en_stopwords: del fd[stopword]

# now let's try again
fd.most_common(20)
# better, though we haven't caught everything.
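
# One possible refinement (my own addition, not in the original worksheet):
# punctuation and other non-alphabetic tokens also show up in the counts,
# so we can drop those along with the stopwords.
fd_clean = FreqDist(w.lower() for w in address1
                    if w.isalpha() and w.lower() not in en_stopwords)
fd_clean.most_common(20)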

# compare that to the most recent address in the corpus
# (Obama 2009, at the time this worksheet was written)
addressO = inaugural.words(inaugural.fileids()[-1])
fdO = FreqDist(addressO)
for stopword in en_stopwords: del fdO[stopword]
fdO.most_common(20)

# or we can look at positive and negative sentiment
from nltk.corpus import opinion_lexicon
# we have two wordlists, one positive and one negative
opinion_lexicon.fileids()
# here is what is in there:
opinion_lexicon.words("negative-words.txt")
opinion_lexicon.words("positive-words.txt")

# What fraction of each inaugural address
# consists of positive or negative words?
poswords = set(opinion_lexicon.words("positive-words.txt"))
negwords = set(opinion_lexicon.words("negative-words.txt"))

for address in inaugural.fileids():
    address_words = inaugural.words(address)
    percpos = len([w for w in address_words if w in poswords]) / len(address_words)
    percneg = len([w for w in address_words if w in negwords]) / len(address_words)
    print(address, "pos:", round(percpos, 3), "neg:", round(percneg, 3))

# Looks like everybody uses more positive than negative opinion words.
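
# A follow-up sketch (not part of the original worksheet): which address is
# the most positive overall? Here we rank by the difference between the
# positive and negative fractions, lowercasing words to match the lexicon.
def sentiment_score(fileid):
    address_words = inaugural.words(fileid)
    pos = sum(1 for w in address_words if w.lower() in poswords)
    neg = sum(1 for w in address_words if w.lower() in negwords)
    return (pos - neg) / len(address_words)

ranked = sorted(inaugural.fileids(), key=sentiment_score)
print("least positive:", ranked[0])
print("most positive:", ranked[-1])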

##########
# But a text is more than a "bag of words"! What else can we do?

# The simplest next step is to take word order into account,
# by counting word sequences (n-grams).
# Here, for example, are the first 10 trigrams of the inaugural addresses:
for ngram in list(nltk.ngrams(text4, 3))[:10]: print(ngram)
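
# A quick sketch (not in the original worksheet): put the trigrams into a
# FreqDist to see which three-word sequences are most frequent overall.
trigram_fd = nltk.FreqDist(nltk.ngrams(text4, 3))
trigram_fd.most_common(10)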

# some magic that determines the probability of one word
# following another (a bigram model, estimated by maximum likelihood)
ns = list(nltk.ngrams(text4, 2))
cpd = nltk.ConditionalProbDist(nltk.ConditionalFreqDist(ns), nltk.MLEProbDist)

# what words do we know about?
cpd.conditions()

# for example, what can follow "selfish"?
cpd["selfish"].samples()

# if we've just seen "selfish", what's the most likely next word?
cpd["selfish"].max()

# and how likely is it that the next word is going to be "men"?
cpd["selfish"].prob("men")

# So, what do you think this is good for?

# Let's generate some text at random
word = "I"
for i in range(100):
    print(word, end=" ")
    word = cpd[word].generate()

print(word)

# And what could this be good for?

###############
# Beyond raw text: linguistic analysis

# The original version of this worksheet used a bit of text
# from the Washington Post, January 15, 2015.

# <<< at this point, grab some text from a current news article and store it
# in the variable 'text', like so: >>>

text = """ This is a placeholder.
Please place some more interesting text here."""

# Splitting the text into words
text.split()

# Or like this: do you see the difference?
nltk.word_tokenize(text)
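
# A hedged illustration (my own example sentence, not from the original
# worksheet): the difference shows up with punctuation and contractions.
# word_tokenize splits off punctuation marks and contraction endings
# ('Do', "n't", ...), while split() only breaks on whitespace.
example = "Don't worry, it's fine."
example.split()
nltk.word_tokenize(example)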

# Part of speech: is this a noun, an adjective, a verb?
words = nltk.word_tokenize(text)

nltk.pos_tag(words)
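
# A sketch (not in the original worksheet): once the words are tagged,
# we can count which parts of speech are most common in the text.
tagged = nltk.pos_tag(words)
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
tag_fd.most_common(5)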

####
# syntactic analysis
# Here is an example of what we want the system to do
from nltk.corpus import treebank
print(treebank.parsed_sents('wsj_0001.mrg')[0])

# or like this
treebank.parsed_sents('wsj_0001.mrg')[0].draw()
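
# A minimal sketch of how such a tree can be produced automatically, using a
# tiny hand-written toy grammar (my own example, not from the original
# worksheet). Real parsers use much larger grammars or statistical models
# learned from treebanks like the one above.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")
toy_parser = nltk.ChartParser(toy_grammar)
for tree in toy_parser.parse("the dog chased the cat".split()):
    print(tree)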

###########
# Can we learn about semantic similarity from data?
text4.similar("freedom")
text2.similar("freedom")

# Here's how this works:
text4.common_contexts(["freedom", "peace"])
text2.common_contexts(["freedom", "hair"])
# (This is a pretty simplistic approach.
# You can do the same thing in more sophisticated ways,
# and get better answers. In particular you should use more data
# than a single book. Yes, a book counts as "too little data".)
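
# A rough sketch of the idea behind similar() and common_contexts()
# (my own simplification, not the exact NLTK implementation): two words
# count as similar when they occur between the same neighbouring words.
from collections import defaultdict

contexts = defaultdict(set)
for prev, word, nxt in nltk.ngrams(list(text4), 3):
    contexts[word.lower()].add((prev.lower(), nxt.lower()))

shared = contexts["freedom"] & contexts["peace"]
print(len(shared), "shared contexts, for example:", list(shared)[:5])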

