Playing with TextBlob

January 22, 2014

TextBlob is a fun Python library that allows one to parse blocks of text in neat ways.

To use it, all you need is a computer with Python on it. I’m using Linux Mint with Python 2.7.3. Installation of TextBlob is covered pretty well on Steven Loria’s TextBlob page.

To begin, I open my Python interpreter and import TextBlob.

>>> from textblob import TextBlob

Then I load my text. I’m using a chunk of The Brothers Karamazov.

>>> with open(r"/home/sean/Documents/text-blobs/the-brothers-karamazov/brothers-044") as infile:
...     data = infile.read()
...     myblob = TextBlob(data)
...

Now I have a TextBlob object named “myblob” and I can do fun stuff with it. For instance, I can loop through it and pull out all the adjectives.

>>> for word, tag in sorted(set(myblob.tags)):
...     if tag == "JJ":
...         print tag, word
...
JJ back
JJ back-way
JJ black
JJ certain
JJ civil
JJ clear
--and so on...

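By wrapping myblob.tags in the built-in set() and sorted() functions, I get output that contains no duplicates and is sorted by word. Note that the sort goes by character code, so capitalized words come before lowercase ones.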
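To see what set() and sorted() each contribute, here's a minimal sketch that runs without TextBlob. The `sample_tags` list is made-up data that only imitates the (word, tag) pairs found in myblob.tags, and the print call is written so it behaves the same on Python 2 and 3.

```python
# Hypothetical stand-in for myblob.tags: a list of (word, tag) pairs.
sample_tags = [
    ("clear", "JJ"),
    ("black", "JJ"),
    ("Black", "NNP"),
    ("clear", "JJ"),   # exact duplicate of the first pair
    ("back", "RB"),
    ("back", "JJ"),    # same word, different tag: not a duplicate
]

# set() drops exact duplicate pairs; sorted() orders the pairs by word, then by tag.
unique_sorted = sorted(set(sample_tags))

for word, tag in unique_sorted:
    print("%s %s" % (tag, word))
```

A word that was tagged two different ways still shows up once per tag, since the set compares whole (word, tag) pairs rather than words alone.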

But suppose I only want to see the adjectives that are five characters long. Then I use Python’s built-in len() function. Like so:

>>> for word, tag in sorted(set(myblob.tags)):
...     if tag == "JJ" and len(word) == 5:
...         print tag, word
...
JJ black
JJ civil
JJ clear
JJ equal
JJ first
--and so on...

I can filter for verbs, too; in fact, any part-of-speech tag in the Penn Treebank II tag set will work.
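For instance, every verb tag in the Penn Treebank set starts with VB (VB, VBD, VBG, VBN, VBP, VBZ), so a prefix test on the tag collects all verb forms at once. Here's a rough sketch on made-up (word, tag) pairs rather than real TextBlob output:

```python
# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("act", "VB"),
    ("behaving", "VBG"),
    ("accustomed", "VBN"),
    ("black", "JJ"),
    ("anything", "NN"),
    ("was", "VBD"),
]

# Keep any pair whose tag begins with "VB", i.e. any verb form.
verbs = sorted(pair for pair in set(sample_tags) if pair[1].startswith("VB"))

for word, tag in verbs:
    print("%s %s" % (tag, word))
```

Swapping the prefix for "NN" or "JJ" would gather the noun or adjective tags in the same way.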

The Penn Treebank tag for gerunds and present participles is VBG. But sometimes I want all the words that end in “ing”, even the ones that aren’t gerunds. In that case, I test the end of the string with a slice instead. Like so:

>>> for word, tag in sorted(set(myblob.tags)):
...     if word[-3:] == "ing":
...         print tag, word
...
VBG according
NN anything
VBG behaving
VBG bringing
--and so on...
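The slice test works fine; str.endswith() is an equivalent string method that states the intent a little more directly. A small sketch, again on made-up pairs standing in for myblob.tags:

```python
# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("according", "VBG"),
    ("anything", "NN"),
    ("bringing", "VBG"),
    ("black", "JJ"),
    ("Ring", "NNP"),
]

# endswith() is case-sensitive, so lower-case the word first
# if capitalized or all-caps words should count too.
ing_words = sorted(
    pair for pair in set(sample_tags) if pair[0].lower().endswith("ing")
)

for word, tag in ing_words:
    print("%s %s" % (tag, word))
```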

With string indexing I can just as easily test for a word that begins with a particular letter. Here I’ll throw in the lower() method to match regardless of case:

>>> for word, tag in sorted(set(myblob.tags)):
...     if word[0].lower() == "a":
...         print tag, word
...
DT A
IN Among
NNP April
IN At
DT a
IN about
VBG according
VBN accustomed
--and so on...
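Notice that the capitalized words come out ahead of the lowercase ones, because sorted() compares strings by character code. To see “A” next to “a” instead, one option is to give sorted() a key function that lower-cases the word. A sketch on made-up pairs:

```python
# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("about", "IN"),
    ("At", "IN"),
    ("a", "DT"),
    ("April", "NNP"),
    ("A", "DT"),
    ("black", "JJ"),
]

# Same first-letter test as in the loop above.
a_words = [pair for pair in set(sample_tags) if pair[0][0].lower() == "a"]

# Sort on the lower-cased word; the original pair breaks ties,
# so the order is the same on every run.
folded = sorted(a_words, key=lambda pair: (pair[0].lower(), pair))

for word, tag in folded:
    print("%s %s" % (tag, word))
```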

But what if I want to match all the words that start with vowels? Well, I think I’m going to need a regular expression to do that. (I love regular expressions.)

First I’ll import Python’s regex library and then create my regular expression.

>>> import re
>>> reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

As you can see, the pattern matches any word that begins (“^”) with a vowel (“[aeiou]”) followed by zero or more (“*”) word characters (“\w”, meaning letters, digits, or underscores), and the re.IGNORECASE flag tells it to ignore case. Then I just use another for loop, only this time with my new regex. Like so:

>>> for word, tag in sorted(set(myblob.tags)):
...     if reg.match(word):
...         print tag, word
...
DT A
IN Among
NNP April
IN At
DT Every
IN If
IN In
PRP It
IN Of
DT a
IN about
--and so on...
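To check the pattern on its own, here's a short sketch that runs it against a handful of made-up words. One thing worth knowing: match() already anchors at the start of the string, so the “^” is harmless but not strictly required.

```python
import re

reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

# Made-up test words, not taken from the novel.
words = ["April", "about", "black", "Every", "under", "yes", "8am"]

# Keep only the words whose first character is a vowel, in either case.
vowel_words = [w for w in words if reg.match(w)]
print(vowel_words)

# match() returns a match object; group() is the text it matched.
# \w does not match a hyphen, so the match stops before it.
print(reg.match("in-law").group())  # prints: in
```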

And here are all the base-form verbs that start with a vowel:

>>> for word, tag in sorted(set(myblob.tags)):
...     if tag == "VB" and reg.match(word):
...         print tag, word
...
VB act
VB entertain
VB estrange
VB in
VB into
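Since every loop in this post has the same shape, the filtering could be wrapped in one small function. This is only a sketch: `filter_tags` and its `pos` and `pattern` parameters are names invented for illustration, not part of TextBlob, and the sample pairs stand in for myblob.tags.

```python
import re


def filter_tags(tags, pos=None, pattern=None):
    """Return sorted, de-duplicated (word, tag) pairs that pass the filters.

    pos     -- exact Penn Treebank tag to keep, or None for any tag
    pattern -- compiled regex matched at the start of the word, or None for any word
    """
    kept = []
    for word, tag in sorted(set(tags)):
        if pos is not None and tag != pos:
            continue
        if pattern is not None and not pattern.match(word):
            continue
        kept.append((word, tag))
    return kept


# Hypothetical stand-in for myblob.tags.
sample_tags = [
    ("entertain", "VB"),
    ("act", "VB"),
    ("go", "VB"),
    ("April", "NNP"),
    ("act", "VB"),
]

vowel_start = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

for word, tag in filter_tags(sample_tags, pos="VB", pattern=vowel_start):
    print("%s %s" % (tag, word))
```

With a real blob, the call would presumably look like filter_tags(myblob.tags, pos="VB", pattern=reg).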

Pretty cool, right?