I'm taking my first steps writing code to do linguistic analysis of texts. I use Python and the NLTK library. The problem is that the actual counting of words takes up close to 100 % of my CPU (iCore5, 8GB RAM, macbook air 2014) and ran for 14 hours before I shut the process down. How can I speed the looping and counting up?
I have created a corpus in NLTK out of three Swedish UTF-8 formatted, tab-separated files Swe_Newspapers.txt, Swe_Blogs.txt, Swe_Twitter.txt. It works fine:
import nltk
my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")
Then I've loaded a text-file with one word per line into NLTK. That also works fine.
my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")
The text-file I want to analyse (Swe_Blogs.txt) has this structure, and works fine to parse:
Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …
bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.
wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven
EDIT: The suggestion to produce the counter as below, does not work, but can be fixed:
counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)
This produces the error:
IOError Traceback (most recent call last)
in ()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
182 def words(self, fileids=None, categories=None):
183 return PlaintextCorpusReader.words(
--> 184 self, self._resolve(fileids, categories))
185 def sents(self, fileids=None, categories=None):
186 return PlaintextCorpusReader.sents(
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site- packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
89 encoding=enc)
90 for (path, enc, fileid)
---> 91 in self.abspaths(fileids, True, True)])
92
93
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
165 fileids = [fileids]
166
--> 167 paths = [self._root.join(f) for f in fileids]
168
169 if include_encoding and include_fileid:
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
174 def join(self, fileid):
175 path = os.path.join(self._path, *fileid.split('/'))
--> 176 return FileSystemPathPointer(path)
177
178 def __repr__(self):
/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/ lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
152 path = os.path.abspath(path)
153 if not os.path.exists(path):
--> 154 raise IOError('No such file or directory: %r' % path)
155 self._path = path
IOError: No such file or directory: '/Users/mos/Documents/Blogs'
A fix is to assign my_corpus(categories=["Blogs"] to a variable:
blogs_text = my_corpus.words(categories=["Blogs"])
It's when I try to count all occurrences of each word (about 20K words) in the wordlist within the blogs in the corpus (115,7 MB) that my computer get's a little tired. How can I speed up the following code? It seems to work, no error messages, but it takes >14h to execute.
import collections
counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
else:
continue
Any help to improve my coding skills is much appreciated!
解决方案
It seems like your double loop could be improved:
for word in mycorp.words(categories="Blogs"):
for token in my_wordlist.words():
if token == word:
counter[token]+=1
This would be much faster as:
words = set(my_wordlist.words()) # call once, make set for fast check
for word in mycorp.words(categories="Blogs"):
if word in words:
counter[word] += 1
This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:
counter = Counter(word for word in mycorp.words(categories="Blogs")
if word in words)