TextBlob is a fun Python library that allows one to parse blocks of text in neat ways.
To use it, all you need is a computer with Python on it. I’m using Linux Mint with Python 2.7.3. Installation of TextBlob is covered pretty well on Steven Loria’s TextBlob page.
To begin, I open my Python interpreter and import TextBlob.
>>> from textblob import TextBlob
Then I load my text. I’m using a chunk of The Brothers Karamazov.
>>> with open(r"/home/sean/Documents/text-blobs/the-brothers-karamazov/brothers-044") as infile:
...     data = infile.read()
...     myblob = TextBlob(data)
...
Now I have a TextBlob object named “myblob” and I can do fun stuff with it. For instance, I can loop through it and pull out all the adjectives.
>>> for value,key in sorted(set(myblob.tags)):
...     if key == "JJ":
...         print key,value
...
JJ back
JJ back-way
JJ black
JJ certain
JJ civil
JJ clear
--and so on...
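Each item in myblob.tags is a (word, tag) pair, so in my loop value holds the word and key holds its part-of-speech tag. By wrapping the tags in the set() and sorted() functions, I get output that contains no duplicates and is alphabetized.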
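To see what set() and sorted() are doing, here is a small sketch that doesn’t need TextBlob at all. The sample_tags list is made up for illustration; it only mimics the (word, tag) pairs that myblob.tags holds.

```python
# Hypothetical sample mimicking the (word, tag) pairs in myblob.tags.
sample_tags = [
    ("clear", "JJ"),
    ("black", "JJ"),
    ("horse", "NN"),
    ("black", "JJ"),  # exact duplicate of an earlier pair
    ("back", "JJ"),
]

# set() drops the repeated pair; sorted() orders the tuples by word, then by tag.
unique_sorted = sorted(set(sample_tags))

# Keep only the adjectives, the same test the loop above uses.
adjectives = [word for word, tag in unique_sorted if tag == "JJ"]
print(adjectives)  # ['back', 'black', 'clear']
```

Note that set() only removes pairs that are identical in both word and tag, so a word tagged two different ways would still show up twice.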
But suppose I only want to see the adjectives that are five characters long. Then I use Python’s len() function. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...     if key == "JJ" and len(value) == 5:
...         print key,value
...
JJ black
JJ civil
JJ clear
JJ equal
JJ first
--and so on...
I can filter for verbs, too; in fact, any part-of-speech tag listed in the Penn Treebank II tag set will work.
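The Penn Treebank verb tags (VB, VBD, VBG, VBN, VBP, VBZ) all start with “VB”, so one way to catch every verb form at once is to test the tag’s prefix. This is just a sketch; sample_tags is invented data standing in for myblob.tags.

```python
# Hypothetical (word, tag) pairs standing in for myblob.tags.
sample_tags = [
    ("according", "VBG"),
    ("accustomed", "VBN"),
    ("act", "VB"),
    ("black", "JJ"),
    ("went", "VBD"),
]

# Every Penn Treebank verb tag begins with "VB", so a prefix test catches them all.
verbs = [
    (tag, word)
    for word, tag in sorted(set(sample_tags))
    if tag.startswith("VB")
]
print(verbs)
# [('VBG', 'according'), ('VBN', 'accustomed'), ('VB', 'act'), ('VBD', 'went')]
```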
The Penn Treebank tag VBG covers gerunds and present participles. But sometimes I want all the words that end in “ing”, whether or not they’re tagged VBG. In that case, I slice the string instead. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...     if value[-3:] == "ing":
...         print key,value
...
VBG according
NN anything
VBG behaving
VBG bringing
--and so on...
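The same check can also be written with the string method endswith(), which some people find easier to read than a slice. In this sketch I’ve added lower() so that an all-caps or capitalized word would match too; that is my own tweak, not something the slice version above does. The sample_tags list is made up.

```python
# Hypothetical (word, tag) pairs standing in for myblob.tags.
sample_tags = [
    ("bringing", "VBG"),
    ("anything", "NN"),
    ("Nothing", "NN"),
    ("black", "JJ"),
]

# endswith() tests the suffix directly; lower() makes the test case-insensitive.
ing_words = [
    (tag, word)
    for word, tag in sorted(set(sample_tags))
    if word.lower().endswith("ing")
]
print(ing_words)
# [('NN', 'Nothing'), ('NN', 'anything'), ('VBG', 'bringing')]
```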
Using Python’s handy string methods, I can easily test for words that begin with a particular letter, too. Here I’ll throw in the lower() method to match regardless of case:
>>> for value,key in sorted(set(myblob.tags)):
...     if value[0].lower() == "a":
...         print key,value
...
DT A
IN Among
NNP April
IN At
DT a
IN about
VBG according
VBN accustomed
--and so on...
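One thing this output shows is that capitalized words sort ahead of lowercase ones, because Python compares strings character code by character code. If I wanted a dictionary-style ordering instead, I could hand sorted() a key function. A minimal sketch, again with invented sample data:

```python
# Hypothetical (word, tag) pairs standing in for myblob.tags.
sample_tags = [
    ("about", "IN"),
    ("At", "IN"),
    ("a", "DT"),
    ("April", "NNP"),
    ("A", "DT"),
]

# Default ordering: uppercase letters come before lowercase ones.
default_order = [word for word, tag in sorted(set(sample_tags))]

# Sort on the lowercased word first; the original pair breaks ties.
folded_order = [
    word
    for word, tag in sorted(
        set(sample_tags), key=lambda pair: (pair[0].lower(), pair)
    )
]

print(default_order)  # ['A', 'April', 'At', 'a', 'about']
print(folded_order)   # ['A', 'a', 'about', 'April', 'At']
```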
But what if I want to match all the words that start with vowels? Well, I think I’m going to need a regular expression to do that. (I love regular expressions.)
First I’ll import Python’s regex library and then create my regular expression.
>>> import re
>>> reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)
As you can see, I’m looking for any word that begins “^” with a vowel “[aeiou]” and is followed by zero or more “*” word characters “\w” (letters, digits, and underscores), and I want to ignore case. Then I just use another for loop, only this time with my new regex. Like so:
>>> for value,key in sorted(set(myblob.tags)):
...     if reg.match(value):
...         print key,value
...
DT A
IN Among
NNP April
IN At
DT Every
IN If
IN In
PRP It
IN Of
DT a
IN about
--and so on...
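Here is a quick way to try the regex on its own, away from TextBlob. The word list is made up, and I’ve written the pattern as a raw string (the r prefix) so the backslash in \w reaches the regex engine untouched.

```python
import re

# Same pattern as above, written as a raw string.
reg = re.compile(r'^[aeiou]\w*', re.IGNORECASE)

# Hypothetical words to test against.
sample_words = ["Among", "about", "black", "Every", "the", "up"]

# match() returns a match object for a hit and None for a miss.
vowel_words = [word for word in sample_words if reg.match(word)]
print(vowel_words)  # ['Among', 'about', 'Every', 'up']
```

Since match() only ever looks at the start of the string, the “^” is there for readability more than necessity.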
All the base form verbs that start with a vowel:
>>> for value,key in sorted(set(myblob.tags)):
...     if key == "VB" and reg.match(value):
...         print key,value
...
VB act
VB entertain
VB estrange
VB in
VB into
Pretty cool, right?