Counting Word Occurrences in a Text File

The input is a plain text file; its contents are reproduced in full in the walkthrough below.

# coding=utf-8
import re
from collections import Counter

def words(filename='qq.txt'):
    with open(filename, 'r') as f:
        data = f.read()
    data = data.lower()
    print(data)
    # re.split splits the text at every comma, period, apostrophe,
    # exclamation mark, space, newline, and question mark
    datalist = re.split(r"[,.'! \n?]", data)
    print(datalist)
    # adjacent delimiters leave empty strings in the list; drop them before counting
    datalist = [w for w in datalist if w]
    return Counter(datalist).most_common()

if __name__ == '__main__':
    dic = words(r'C:\Users\DELL\Desktop\python\words.txt')
    for word, count in dic:
        print('%15s  ---->   %3s' % (word, count))

Above is the complete code.

[Run output: each word printed alongside its count, ordered from most frequent to least.]

Detailed reference on regular expressions: http://www.runoob.com/python/python-reg-expressions.html

Analysis:

First: import the two modules, import re (for regular expressions) and from collections import Counter.

A word about what the Counter class is for: Counter is a subclass of dict, so it is used just like a dictionary; it returns a Counter object whose keys are the elements and whose values are the element counts. The example below shows that this Counter thing is actually quite powerful.
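A minimal sketch of Counter in action (the fruit list here is made-up illustration data):

from collections import Counter

c = Counter(['apple', 'banana', 'apple', 'cherry', 'apple'])
print(c)             # Counter({'apple': 3, 'banana': 1, 'cherry': 1})
print(c['apple'])    # 3  -- dict-style lookup, since Counter subclasses dict
print(c['missing'])  # 0  -- unlike a plain dict, an absent key counts as 0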

In addition, Counter provides the methods elements() and most_common(n):

The former returns the Counter's contents as an iterable, repeating each element as many times as its count; you can sort it, or loop over it with for i in c.elements(): print(i).

The latter returns the n key-value pairs with the largest counts from the dictionary the Counter built; if called without an argument, it returns all of the Counter's key-value pairs sorted by count in descending order.
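A quick demonstration of both methods on toy data:

from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(sorted(c.elements()))  # ['a', 'a', 'a', 'b', 'b', 'c']  -- each element repeated by its count
print(c.most_common(2))      # [('a', 3), ('b', 2)]  -- the two largest counts
print(c.most_common())       # [('a', 3), ('b', 2), ('c', 1)]  -- all pairs, sorted descending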

That is a lot of words about just those two import lines; let's move on.

Next: read the contents of the text file.

filename='qq.txt' gives the words function a default parameter value of 'qq.txt'; when the function is actually called later, any argument you pass simply replaces it, so the default can be ignored. You can of course also call the function without an argument (but then you must make sure the file exists in the directory the script runs from).
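A short usage sketch of the default parameter (essay.txt is a hypothetical filename, used only for illustration):

words()             # no argument: falls back to the default 'qq.txt' in the working directory
words('essay.txt')  # an explicit argument overrides the default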

with open(filename, 'r') as f: opens the file named filename and binds it to f. The benefit of with is that the file is closed automatically when the block ends, so there is no need for a separate f.close() call afterwards.
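For comparison, this is roughly what the with statement saves you from writing by hand (a sketch of the equivalent manual version, not part of the code above):

filename = 'qq.txt'  # same default as above
f = open(filename, 'r')
try:
    data = f.read()
finally:
    f.close()  # must remember this call; 'with' performs it automatically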

data = f.read() reads the contents of the file f that was just opened, shown below:

Last sunday, it was a fine day. My friend and I went to Mount Daifu.
In the morning,we rode bikes to the foot of the mountain.After a short rest,we climbed the mountain.on its peak,
we shared the beautiful scenery in our eyes.
 
There were lots of light foggy clouds around us.What's more, varieties of birds flying around us were pretty.
What a harmony situation! At noon,we had lunch in a restaurant. With a happy emotion, we finished our climbing.
Although tired, we still felt happy and relaxed.we hope our country will become more and more beautiful.

data = data.lower() converts all the text to lowercase first, because some words start with a capital letter; the result is shown below:

last sunday, it was a fine day. my friend and i went to mount daifu.
in the morning,we rode bikes to the foot of the mountain.after a short rest,we climbed the mountain.on its peak,
we shared the beautiful scenery in our eyes.
 
there were lots of light foggy clouds around us.what's more, varieties of birds flying around us were pretty.
what a harmony situation! at noon,we had lunch in a restaurant. with a happy emotion, we finished our climbing.
although tired, we still felt happy and relaxed.we hope our country will become more and more beautiful.

re.split does not replace anything; it splits the string at every comma, period, apostrophe, exclamation mark, space, newline, and question mark: datalist = re.split(r"[,.'! \n?]", data). The result of the split looks like this:

['last', 'sunday', '', 'it', 'was', 'a', 'fine', 'day', '', 'my', 'friend', 'and', 'i', 'went', 'to', 'mount', 'daifu', '', 'in', 'the', 'morning', 'we', 'rode', 'bikes', 'to', 'the', 'foot', 'of', 'the', 'mountain', 'after', 'a', 'short', 'rest', 'we', 'climbed', 'the', 'mountain', 'on', 'its', 'peak', '', 'we', 'shared', 'the', 'beautiful', 'scenery', 'in', 'our', 'eyes', '', '', '', 'there', 'were', 'lots', 'of', 'light', 'foggy', 'clouds', 'around', 'us', 'what', 's', 'more', '', 'varieties', 'of', 'birds', 'flying', 'around', 'us', 'were', 'pretty', '', 'what', 'a', 'harmony', 'situation', '', 'at', 'noon', 'we', 'had', 'lunch', 'in', 'a', 'restaurant', '', 'with', 'a', 'happy', 'emotion', '', 'we', 'finished', 'our', 'climbing', '', 'although', 'tired', '', 'we', 'still', 'felt', 'happy', 'and', 'relaxed', 'we', 'hope', 'our', 'country', 'will', 'become', 'more', 'and', 'more', 'beautiful', '']
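Note the empty strings scattered through the list: re.split produces one between every pair of adjacent delimiters (for example a period followed by a space), which is why the code filters them out before counting. A tiny illustration:

import re

print(re.split(r"[,. ]", "day. my"))   # ['day', '', 'my']  -- '.' and ' ' are adjacent, so an empty string appears
print(re.split(r"[,. ]+", "day. my"))  # ['day', 'my']  -- a '+' merges delimiter runs, an alternative fix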

Then Counter tallies the words in the list into a dictionary, and Counter(datalist).most_common() sorts the output by frequency, so the printout shows exactly how many times each word occurred.
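The same pipeline run on a tiny hand-made sentence (illustrative data only), using str.split instead of re.split since there is no punctuation:

from collections import Counter

tokens = "the cat and the dog and the bird".split()
for word, count in Counter(tokens).most_common():
    print('%15s  ---->   %3s' % (word, count))
# prints: the ----> 3, and ----> 2, then cat, dog, bird ----> 1 each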

And that's it.