These notes were written in IPython Notebook.
1. The basic skeleton: open a file and read it, then open a file and write to it
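```python
# in_file and out_file are the paths of the input and output files
for line in open(in_file):   # read the input file line by line
    continue                 # placeholder: process each line here
out = open(out_file, 'w')    # open the output file for writing
out.write('')                # placeholder: write the results here
out.close()
```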
2. A rough template for simple word-frequency counting
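```python
def count(in_file, out_file):
    # Read the file and count how often each word occurs
    word_count = {}  # dictionary mapping each word to its count
    for line in open(in_file):
        words = line.strip().split(" ")
        for word in words:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    out = open(out_file, 'w')  # open the output file
    for word in word_count:
        print('%s %d' % (word, word_count[word]))  # print each key and its value
        out.write(word + "--" + str(word_count[word]) + "\n")  # write to the file
    out.close()

count(in_file, out_file)  # in_file and out_file are the input and output paths
```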
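For a long English text, this code treats anything separated by a space, via split(" "), as a word, which is clearly not good enough. For example, in "I will endeavor," said he, the tokens "I and he, (with the punctuation still attached) are each counted as a word. The code is only meant to show the basic idea of counting word frequencies. An exercise that builds on it follows.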
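To make the problem concrete, here is a small check of what split(" ") returns for the sentence quoted above. It is only an illustration; the variable names `sentence` and `tokens` are chosen here and do not come from the template.

```python
# The example sentence from the paragraph above
sentence = '"I will endeavor," said he,'

# Same tokenization as the template: strip the line, then split on single spaces
tokens = sentence.strip().split(" ")

print(tokens)  # ['"I', 'will', 'endeavor,"', 'said', 'he,']
```

The quotation mark and the commas stay glued to the neighbouring words, so "I and I would end up as two different dictionary keys.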
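1. Copy a passage of English text from the web (the longer the better) and paste it into input.txt. Count how often each word occurs, then write the results to out.txt in order of frequency, one "word:count" entry per line.
Using the template: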
```python
# Count word frequencies and write them to a file in order of frequency
in_file = 'input_word.txt'
out_file = 'output_word.txt'

def count_word(in_file, out_file):
    word_count = {}  # dictionary mapping each word to its count
    for line in open(in_file):
        words = line.strip().split(" ")
        for word in words:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    out = open(out_file, 'w')
    # Visit the words in order of frequency, most frequent first
    for word in sorted(word_count, key=word_count.get, reverse=True):
        print('%s %d' % (word, word_count[word]))
        out.write('%s:%d' % (word, word_count.get(word)))
        out.write('\n')
    out.close()

count_word(in_file, out_file)
```
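For comparison, the standard library's `collections.Counter` can do both the counting and the frequency ordering. This is an alternative to the dictionary loop in the template, not the method these notes use; the function name `count_words` and the sample sentence are made up for illustration, and it splits on any whitespace rather than on single spaces.

```python
from collections import Counter

def count_words(text):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(text.split())   # split() with no argument splits on any whitespace
    return counts.most_common()      # list of (word, count), highest count first

sample = "the cat and the dog and the bird"  # made-up sample text
pairs = count_words(sample)
for word, n in pairs:
    print('%s:%d' % (word, n))       # same "word:count" line format as the exercise
```

Writing each formatted line to a file instead of printing it would give the out.txt layout the exercise asks for.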
The regular-expression approach:
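```python
import re

f = open('input_word.txt')
words = {}
rc = re.compile(r'\w+')  # matches runs of letters, digits and underscores
for l in f:
    w_l = rc.findall(l)  # all words on this line, punctuation dropped
    for w in w_l:
        if w in words:
            words[w] += 1
        else:
            words[w] = 1
f.close()
f = open('out.txt', 'w')
# Visit the words in order of frequency, most frequent first
for k in sorted(words, key=words.get, reverse=True):
    print('%s %d' % (k, words[k]))
    f.write('%s:%d' % (k, words.get(k)))
    f.write('\n')
f.close()
```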
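As a quick check of the regular-expression approach, the sketch below runs the same `\w+` pattern on the example sentence "I will endeavor," said he, from earlier. The names `word_re` and `lowered` are chosen here, and the lowercasing step is an optional extra that the notes themselves do not do.

```python
import re

sentence = '"I will endeavor," said he,'
word_re = re.compile(r'\w+')   # runs of letters, digits and underscores

words = word_re.findall(sentence)
print(words)    # ['I', 'will', 'endeavor', 'said', 'he']

# Optional, and an assumption beyond the notes: lowercase so "He" and "he" count together
lowered = [w.lower() for w in word_re.findall('He said he')]
print(lowered)  # ['he', 'said', 'he']
```

The punctuation no longer sticks to the words. One limitation worth knowing: `\w+` also breaks at apostrophes, so a contraction such as don't comes out as don and t.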