Playing with TextBlob

January 22, 2014

TextBlob is a fun Python library that allows one to parse blocks of text in neat ways.

To use it, all you need is a computer with Python on it. I’m using Linux Mint with Python 2.7.3. Installation of TextBlob is covered pretty well on Steven Loria’s TextBlob page.

To begin, I open my Python interpreter and import TextBlob.

>>> from textblob import TextBlob

Then I load my text. I’m using a chunk of The Brothers Karamazov.

>>> with open(r"/home/sean/Documents/text-blobs/the-brothers-karamazov/brothers-044") as infile:
...   data = infile.read()
...   myblob = TextBlob(data)
...

Now I have a TextBlob object named “myblob” and I can do fun stuff with it. For instance, I can loop through it and pull out all the adjectives.

>>> for value,key in sorted(set(myblob.tags)):
...   if key == "JJ":
...     print key,value
...
JJ back
JJ back-way
JJ black
JJ certain
JJ civil
JJ clear
--and so on...

By wrapping myblob.tags in Python’s built-in set() and sorted() functions, I get output that contains no duplicates and is sorted by word (capitalized words come before lowercase ones).
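To see what set() and sorted() are each doing, here is a minimal sketch that needs no TextBlob at all. The list sample_tags is a made-up stand-in for myblob.tags (a list of (word, tag) pairs); the pairs are for illustration only and are not taken from the novel.

```python
# Hypothetical stand-in for myblob.tags: a list of (word, tag) tuples.
sample_tags = [
    ("black", "JJ"),
    ("civil", "JJ"),
    ("black", "JJ"),   # an exact duplicate of the first pair
    ("April", "NNP"),
    ("about", "IN"),
]

# set() drops the duplicate pair; sorted() orders the tuples by word, then tag.
unique_sorted = sorted(set(sample_tags))

for word, tag in unique_sorted:
    print("%s %s" % (tag, word))
```

Because the comparison is by character code, "April" lands ahead of "about". Sorting with key=lambda pair: pair[0].lower() would be one way to get a case-insensitive order instead.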

But suppose I only want to see the adjectives that are five characters long. Then I use Python’s len() function. Like so:

>>> for value,key in sorted(set(myblob.tags)):
...   if key == "JJ" and len(value) == 5:
...     print key,value
...
JJ black
JJ civil
JJ clear
JJ equal
JJ first
--and so on...

I can filter for verbs, too; in fact, any part of speech listed in the Penn Treebank II tag set will work.
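Since the Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ) all begin with "VB", one way to grab every verb form at once is to match on the tag prefix. The sketch below is only an illustration: words_with_tag is a hypothetical helper of my own, not part of TextBlob, and sample_tags is a made-up stand-in for myblob.tags.

```python
# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("act", "VB"),
    ("behaving", "VBG"),
    ("accustomed", "VBN"),
    ("black", "JJ"),
    ("anything", "NN"),
]

def words_with_tag(tags, prefix):
    """Return the sorted, de-duplicated words whose tag starts with prefix."""
    return sorted(set(word for word, tag in tags if tag.startswith(prefix)))

all_verbs = words_with_tag(sample_tags, "VB")    # any verb form
adjectives = words_with_tag(sample_tags, "JJ")   # adjectives only

print(all_verbs)
print(adjectives)
```

Passing a full tag such as "VBG" narrows the match to a single verb form, but note that a prefix like "VB" will never be an exact-match filter; for that, compare with == as in the loops above.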

The Penn Treebank code for gerunds (and present participles) is VBG. But sometimes I want all the words that end in “ing” even if they’re not gerunds. In that case, I use Python’s string slicing instead. Like so:

>>> for value,key in sorted(set(myblob.tags)):
...   if value[-3:] == "ing":
...     print key,value
...
VBG according
NN anything
VBG behaving
VBG bringing
--and so on...

Using Python’s handy string methods I can easily test for a word that begins with a particular letter, too. Here I’ll throw in the lower() method to match regardless of case:

>>> for value,key in sorted(set(myblob.tags)):
...   if value[0].lower() == "a":
...     print key,value
...
DT A
IN Among
NNP April
IN At
DT a
IN about
VBG according
VBN accustomed
--and so on...
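The slicing and indexing tests above can also be written with the built-in string methods endswith() and startswith(), which some people find easier to read. This is just an alternative sketch, again using a made-up sample_tags list in place of myblob.tags.

```python
# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("according", "VBG"),
    ("anything", "NN"),
    ("April", "NNP"),
    ("about", "IN"),
    ("black", "JJ"),
]

# Words ending in "ing", whatever their part of speech.
ing_words = sorted(set(w for w, t in sample_tags if w.endswith("ing")))

# Words starting with "a" or "A"; lower() makes the test case-insensitive.
a_words = sorted(set(w for w, t in sample_tags if w.lower().startswith("a")))

print(ing_words)
print(a_words)
```

One small difference: startswith() simply returns False for an empty string, whereas indexing with [0] would raise an IndexError.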

But what if I want to match all the words that start with vowels? Well, I think I’m going to need a regular expression to do that. (I love regular expressions.)

First I’ll import Python’s regex library and then create my regular expression.

>>> import re
>>> reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

As you can see, I’m looking for any word that begins (“^”) with a vowel (“[aeiou]”) followed by zero or more (“*”) word characters (“\w”, meaning letters, digits, and underscores), and I want to ignore case. Then I just use another for loop, only this time with my new regex. Like so:

>>> for value,key in sorted(set(myblob.tags)):
...   if reg.match(value):
...     print key,value
...
DT A
IN Among
NNP April
IN At
DT Every
IN If
IN In
PRP It
IN Of
DT a
IN about
--and so on...
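To try the pattern on its own, here is a small sketch that runs it against a hand-picked word list (sample_words is invented for illustration). It also shows a plain-Python test that should give the same answer for ordinary words, in case a regex feels like overkill.

```python
import re

# Same pattern as above, written as a raw string so "\w" reaches the
# regex engine untouched.
reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

# Hypothetical sample words, not taken from the novel.
sample_words = ["Among", "black", "Every", "if", "under", "civil"]

# Keep the words the regex matches.
vowel_words = [w for w in sample_words if reg.match(w)]

# Alternative without a regex: check the lowercased first character.
# A tuple is used because '"" in "aeiou"' would be True for an empty string.
VOWELS = ("a", "e", "i", "o", "u")
vowel_words_plain = [w for w in sample_words if w[:1].lower() in VOWELS]

print(vowel_words)
print(vowel_words_plain)
```

Since match() only ever looks at the start of the string, the "^" and the trailing "\w*" are not strictly required here; they mostly serve to document the intent.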

All the base form verbs that start with a vowel:

>>> for value,key in sorted(set(myblob.tags)):
...   if key == "VB" and reg.match(value):
...     print key,value
...
VB act
VB entertain
VB estrange
VB in
VB into

Pretty cool, right?