Counting words with Python: how can I optimize word counting in Python?

I'm taking my first steps writing code for linguistic analysis of texts, using Python and the NLTK library. The problem is that the actual counting of words pegs my CPU at close to 100% (Core i5, 8 GB RAM, 2014 MacBook Air) and had run for 14 hours before I killed the process. How can I speed up the looping and counting?

I have created a corpus in NLTK out of three Swedish UTF-8, tab-separated files: Swe_Newspapers.txt, Swe_Blogs.txt, and Swe_Twitter.txt. This works fine:

import nltk

my_corpus = nltk.corpus.CategorizedPlaintextCorpusReader(".", r"Swe_.*", cat_pattern=r"Swe_(\w+)\.txt")

Then I've loaded a text file with one word per line into NLTK. That also works fine.

my_wordlist = nltk.corpus.WordListCorpusReader("/Users/mos/Documents/", "wordlist.txt")

The text file I want to analyse (Swe_Blogs.txt) has this structure, and parses fine:

Wordpress.com 2010/12/08 3 1,4,11 osv osv osv …

bloggagratis.se 2010/02/02 3 0 Jag är utled på plogade vägar, matte är lika utled hon.

wordpress.com 2010/03/10 3 0 1 kruka Sallad, riven

EDIT: The suggestion below to build the counter in a single expression does not work as written, but it can be fixed:

counter = collections.Counter(word for word in my_corpus.words(categories=["Blogs"]) if word in my_wordlist)

This produces the error:

IOError                                   Traceback (most recent call last)
in ()
----> 1 counter = collections.Counter(word for word in my_corpus.words("Blogs") if word in my_wordlist)

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, categories)
    182     def words(self, fileids=None, categories=None):
    183         return PlaintextCorpusReader.words(
--> 184             self, self._resolve(fileids, categories))
    185     def sents(self, fileids=None, categories=None):
    186         return PlaintextCorpusReader.sents(

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/plaintext.pyc in words(self, fileids, sourced)
     89                    encoding=enc)
     90                 for (path, enc, fileid)
---> 91                 in self.abspaths(fileids, True, True)])
     92
     93

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/corpus/reader/api.pyc in abspaths(self, fileids, include_encoding, include_fileid)
    165             fileids = [fileids]
    166
--> 167         paths = [self._root.join(f) for f in fileids]
    168
    169         if include_encoding and include_fileid:

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in join(self, fileid)
    174     def join(self, fileid):
    175         path = os.path.join(self._path, *fileid.split('/'))
--> 176         return FileSystemPathPointer(path)
    177
    178     def __repr__(self):

/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/data.pyc in __init__(self, path)
    152         path = os.path.abspath(path)
    153         if not os.path.exists(path):
--> 154             raise IOError('No such file or directory: %r' % path)
    155         self._path = path

IOError: No such file or directory: '/Users/mos/Documents/Blogs'

A fix is to assign my_corpus.words(categories=["Blogs"]) to a variable:

blogs_text = my_corpus.words(categories=["Blogs"])

It's when I try to count all occurrences of each word in the wordlist (about 20,000 words) within the blogs in the corpus (115.7 MB) that my computer gets a little tired. How can I speed up the following code? It seems to work, with no error messages, but it takes more than 14 hours to execute.

import collections

counter = collections.Counter()
for word in my_corpus.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1
        else:
            continue

Any help to improve my coding skills is much appreciated!

Solution

It seems like your double loop could be improved:

for word in mycorp.words(categories="Blogs"):
    for token in my_wordlist.words():
        if token == word:
            counter[token] += 1

This would be much faster as:

words = set(my_wordlist.words())  # call once, build a set for fast membership checks

for word in mycorp.words(categories="Blogs"):
    if word in words:
        counter[word] += 1

This takes you from doing len(my_wordlist.words()) * len(mycorp.words(...)) operations to closer to len(my_wordlist.words()) + len(mycorp.words(...)) operations, as building the set is O(n) and checking whether a word is in the set is O(1) on average.
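The difference is easy to demonstrate with a self-contained timing sketch. The data below is synthetic (the word lists and sizes are made up for illustration, not the actual Swedish corpus), but the shape of the comparison is the same: list membership scans the list on every check, set membership hashes once.

```python
import time
import collections

# Synthetic stand-ins for the wordlist and corpus (hypothetical data).
wordlist = ["word%d" % i for i in range(20000)]
corpus_words = ["word%d" % (i % 40000) for i in range(200000)]

# Slow: "w in wordlist" is a linear scan, O(len(wordlist)) per check.
# Only a small slice of the corpus is used here, or it would take too long.
start = time.time()
slow = collections.Counter(w for w in corpus_words[:2000] if w in wordlist)
slow_time = time.time() - start

# Fast: set membership is O(1) on average, so the whole corpus is cheap.
wordset = set(wordlist)
start = time.time()
fast = collections.Counter(w for w in corpus_words if w in wordset)
fast_time = time.time() - start

print("slow (2,000 words):  %.3fs" % slow_time)
print("fast (200,000 words): %.3fs" % fast_time)
```

Even though the fast version processes 100 times more words, it typically finishes in a comparable or shorter time, which is exactly the list-versus-set gap the paragraph above describes.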

You can also build the Counter direct from an iterable, as Two-Bit Alchemist points out:

counter = Counter(word for word in mycorp.words(categories="Blogs")
                  if word in words)
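Once built this way, the Counter also gives you the frequency table directly via most_common. A tiny end-to-end sketch with toy data (the three-word wordlist and one-sentence corpus here are made up for illustration):

```python
from collections import Counter

# Toy stand-ins for my_wordlist and the blog corpus.
words = {"jag", "och", "det"}
corpus = "jag tycker att det regnar och jag fryser".split()

# Build the Counter in one pass, filtering by set membership.
counter = Counter(w for w in corpus if w in words)

print(counter.most_common())  # highest-frequency words first
```

most_common(n) returns the n most frequent entries, which is usually the output a word-frequency analysis wants anyway.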
