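Problem 1:
Write a program that counts the Chinese characters and punctuation marks appearing in the 《天龙八部》 text. Separate each character from its count with a colon (:), and save the output to the file "天龙八部-汉字统计.txt". The file must be stored in CSV format, that is, with entries separated by commas (note: spaces and newline characters are not counted):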
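# Read the whole text (platform default encoding; pass encoding=... if the file differs)
with open("天龙八部.txt") as f:
    txt = f.read()
d = {}
# Remove spaces and newlines so they are not counted
for ch in " \n":
    txt = txt.replace(ch, "")
# Count every remaining character (Chinese characters and punctuation)
for ch in txt:
    d[ch] = d.get(ch, 0) + 1
ls = list(d.items())
ls.sort(key=lambda x: x[1], reverse=True)  # sort by frequency, highest first
# Build "character:count" entries and join them with commas (no trailing comma)
string = ",".join("{}:{}".format(ch, n) for ch, n in ls)
with open("天龙八部-汉字统计.txt", "w") as f:
    f.write(string)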
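The same counting can also be written with collections.Counter. What follows is only a minimal sketch, not the reference answer: the helper name char_stats and the sample string are made up for illustration, and it works on an in-memory string rather than on the novel file.

```python
from collections import Counter


def char_stats(text):
    """Return 'char:count' entries joined by commas, most frequent first.

    char_stats is a hypothetical helper used only for this illustration.
    """
    # Skip spaces and newlines, as the problem requires
    counts = Counter(ch for ch in text if ch not in " \n")
    # most_common() orders the entries by count, highest first
    return ",".join("{}:{}".format(ch, n) for ch, n in counts.most_common())


sample = "天龙 天\n龙天。"
print(char_stats(sample))  # 天:3,龙:2,。:1
```

Writing the returned string to "天龙八部-汉字统计.txt" would then be the only remaining step.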
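Problem 2:
Write a program that counts the Chinese words appearing in the 《天龙八部》 text, using the jieba library for word segmentation. Separate each word from its count with a colon (:), and save the output to the file "天龙八部-词语统计.txt". The reference format is as follows (note: punctuation marks are not counted):
天龙:100, 八部:10 (remainder omitted)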
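import jieba

# Open the input and output files (both UTF-8)
fi = open("天龙八部-网络版.txt", "r", encoding='utf-8')
fo = open("天龙八部-词语统计.txt", "w", encoding='utf-8')
txt = fi.read()
# Remove spaces and newlines before segmentation so they never become tokens
for ch in " \n":
    txt = txt.replace(ch, "")
words = jieba.lcut(txt)
d = {}
for w in words:
    # Keep only tokens made entirely of Chinese characters; this drops punctuation
    if all('\u4e00' <= c <= '\u9fff' for c in w):
        d[w] = d.get(w, 0) + 1
ls = []
for key in d:
    ls.append("{}:{}".format(key, d[key]))  # d[key] is the count for this word
fo.write(",".join(ls))
fi.close()
fo.close()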
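The requirement that punctuation not be counted can be checked on a small token list without segmenting the whole novel. The sketch below assumes the tokens have already been produced (for example by jieba.lcut); word_stats and is_chinese are hypothetical helper names, and treating only the range U+4E00 to U+9FFF as Chinese is a simplifying assumption.

```python
def is_chinese(word):
    """True if word is non-empty and every character is in U+4E00..U+9FFF.

    Hypothetical helper; the range check is a simplifying assumption.
    """
    return bool(word) and all("\u4e00" <= ch <= "\u9fff" for ch in word)


def word_stats(tokens):
    """Count the Chinese tokens and return 'word:count' entries joined by commas."""
    d = {}
    for w in tokens:
        if is_chinese(w):  # punctuation, spaces and newlines are skipped
            d[w] = d.get(w, 0) + 1
    # Entries appear in first-seen order (dict insertion order)
    return ",".join("{}:{}".format(k, v) for k, v in d.items())


tokens = ["天龙", "八部", ",", "天龙", "\n", "。"]
print(word_stats(tokens))  # 天龙:2,八部:1
```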