A cell may or may not contain several words, so you have to split its text after replacing the punctuation. Here this is done with a translation map:

    import os
    from collections import Counter
    from string import punctuation

    import xlrd


    def count_words_trans():
        sheet_no = 1  # 0-based sheet index, so 1 selects the second sheet
        path = '.'
        # Map every ASCII punctuation character to a space
        punctuation_map = dict((ord(c), u' ') for c in punctuation)
        for filename in os.listdir(path):
            if filename.endswith('.xls'):
                print(filename)
                # Join with path so this also works for directories other than '.'
                workbook = xlrd.open_workbook(os.path.join(path, filename))
                sheet = workbook.sheet_by_index(sheet_no)
                values = []
                for row in range(sheet.nrows):
                    for col in range(sheet.ncols):
                        c = sheet.cell(row, col)
                        # Only text cells; xlrd returns their value as a unicode string
                        if c.ctype == xlrd.XL_CELL_TEXT:
                            wordlist = c.value.translate(punctuation_map).split()
                            values.extend(wordlist)
                cnt = Counter(values)
                print('%d words counted, %d unique' % (sum(cnt.values()), len(cnt)))

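Text such as 'operation:run' is correctly split into two words (unlike what happens when you merely delete the punctuation). The translate method is unicode-safe. For efficiency, only cells that contain text are read (no blank cells, no dates, no numbers).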
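
To make the difference between replacing and deleting punctuation concrete, here is a small sketch. The sample string and the `deletion_map` name are made up for illustration; only `punctuation_map` comes from the code above.

```python
from string import punctuation

# Map each ASCII punctuation character to a space, as in the code above
punctuation_map = dict((ord(c), u' ') for c in punctuation)
# For comparison (hypothetical): mapping to None deletes the character instead
deletion_map = dict((ord(c), None) for c in punctuation)

text = u'operation:run'  # hypothetical cell value

# Replacing ':' with a space lets split() see two separate words
split_words = text.translate(punctuation_map).split()
# Deleting ':' glues both words together into a single token
joined_words = text.translate(deletion_map).split()

print(split_words)
print(joined_words)
```

With the replacement map the colon becomes a word boundary, whereas the deletion map leaves one merged token, which would distort the word count.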
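The resulting `Counter` object `cnt` also gives you a word frequency list, since it maps each word to the number of times it occurs.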
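
One possible way to print such a frequency list is sketched below. It assumes `cnt` is a `collections.Counter` built as in the function above; the sample word list is invented so that the snippet runs on its own, and `Counter.most_common()` is just one of several ways to order the result.

```python
from collections import Counter

# Hypothetical word list standing in for the `values` collected from a sheet
values = [u'run', u'operation', u'run', u'test', u'run', u'test']
cnt = Counter(values)

# most_common() returns (word, count) pairs, highest count first
frequency_list = cnt.most_common()
for word, count in frequency_list:
    print('%s: %d' % (word, count))
```

Passing a number, as in `cnt.most_common(10)`, limits the list to the most frequent words.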