Counting Word Occurrences in a Text File

The input is a plain text file; its contents are reproduced in full in the walkthrough below.

# coding=utf-8
import re
from collections import Counter

def words(filename='qq.txt'):
    with open(filename, 'r') as f:
        data = f.read()
    data = data.lower()
    print(data)
    # re.split splits the text at every comma, period, apostrophe,
    # exclamation mark, space, newline, and question mark
    datalist = re.split(r"[,.'! \n?]", data)
    print(datalist)
    # adjacent delimiters leave empty strings in the list; drop them before counting
    datalist = [w for w in datalist if w]
    return Counter(datalist).most_common()

if __name__ == '__main__':
    dic = words(r'C:\Users\DELL\Desktop\python\words.txt')
    for word, count in dic:
        print('%15s  ---->   %3s' % (word, count))

Above is the complete code.

[Run output: each word printed alongside its count, ordered from most frequent to least.]

Detailed reference on regular expressions: http://www.runoob.com/python/python-reg-expressions.html

Analysis:

First: import the two modules, import re (for regular expressions) and from collections import Counter.

A word about what the Counter class is for: Counter is a subclass of dict, so it is used just like a dictionary; it returns a Counter object whose keys are the elements and whose values are the element counts. The example below shows that this Counter thing is actually quite powerful.
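A minimal sketch of Counter in action (the fruit list here is made-up illustration data):

from collections import Counter

c = Counter(['apple', 'banana', 'apple', 'cherry', 'apple'])
print(c)             # Counter({'apple': 3, 'banana': 1, 'cherry': 1})
print(c['apple'])    # 3  -- dict-style lookup, since Counter subclasses dict
print(c['missing'])  # 0  -- unlike a plain dict, an absent key counts as 0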

In addition, Counter provides the methods elements() and most_common(n):

The former returns the Counter's contents as an iterable, repeating each element as many times as its count; you can sort it, or loop over it with for i in c.elements(): print(i).

The latter returns the n key-value pairs with the largest counts from the dictionary the Counter built; if called without an argument, it returns all of the Counter's key-value pairs sorted by count in descending order.
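A quick demonstration of both methods on toy data:

from collections import Counter

c = Counter(['a', 'b', 'a', 'c', 'a', 'b'])
print(sorted(c.elements()))  # ['a', 'a', 'a', 'b', 'b', 'c']  -- each element repeated by its count
print(c.most_common(2))      # [('a', 3), ('b', 2)]  -- the two largest counts
print(c.most_common())       # [('a', 3), ('b', 2), ('c', 1)]  -- all pairs, sorted descending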

That is a lot of words about just those two import lines; let's move on.

Next: read the contents of the text file.

filename='qq.txt' gives the words function a default parameter value of 'qq.txt'; when the function is actually called later, any argument you pass simply replaces it, so the default can be ignored. You can of course also call the function without an argument (but then you must make sure the file exists in the directory the script runs from).
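A short usage sketch of the default parameter (essay.txt is a hypothetical filename, used only for illustration):

words()             # no argument: falls back to the default 'qq.txt' in the working directory
words('essay.txt')  # an explicit argument overrides the default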

with open(filename, 'r') as f: opens the file named filename and binds it to f. The benefit of with is that the file is closed automatically when the block ends, so there is no need for a separate f.close() call afterwards.
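For comparison, this is roughly what the with statement saves you from writing by hand (a sketch of the equivalent manual version, not part of the code above):

filename = 'qq.txt'  # same default as above
f = open(filename, 'r')
try:
    data = f.read()
finally:
    f.close()  # must remember this call; 'with' performs it automatically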

data = f.read() reads the contents of the file f that was just opened, shown below:

Last sunday, it was a fine day. My friend and I went to Mount Daifu.
In the morning,we rode bikes to the foot of the mountain.After a short rest,we climbed the mountain.on its peak,
we shared the beautiful scenery in our eyes.
 
There were lots of light foggy clouds around us.What's more, varieties of birds flying around us were pretty.
What a harmony situation! At noon,we had lunch in a restaurant. With a happy emotion, we finished our climbing.
Although tired, we still felt happy and relaxed.we hope our country will become more and more beautiful.

data = data.lower() converts all the text to lowercase first, because some words start with a capital letter; the result is shown below:

last sunday, it was a fine day. my friend and i went to mount daifu.
in the morning,we rode bikes to the foot of the mountain.after a short rest,we climbed the mountain.on its peak,
we shared the beautiful scenery in our eyes.
 
there were lots of light foggy clouds around us.what's more, varieties of birds flying around us were pretty.
what a harmony situation! at noon,we had lunch in a restaurant. with a happy emotion, we finished our climbing.
although tired, we still felt happy and relaxed.we hope our country will become more and more beautiful.

re.split does not replace anything; it splits the string at every comma, period, apostrophe, exclamation mark, space, newline, and question mark: datalist = re.split(r"[,.'! \n?]", data). The result of the split looks like this:

['last', 'sunday', '', 'it', 'was', 'a', 'fine', 'day', '', 'my', 'friend', 'and', 'i', 'went', 'to', 'mount', 'daifu', '', 'in', 'the', 'morning', 'we', 'rode', 'bikes', 'to', 'the', 'foot', 'of', 'the', 'mountain', 'after', 'a', 'short', 'rest', 'we', 'climbed', 'the', 'mountain', 'on', 'its', 'peak', '', 'we', 'shared', 'the', 'beautiful', 'scenery', 'in', 'our', 'eyes', '', '', '', 'there', 'were', 'lots', 'of', 'light', 'foggy', 'clouds', 'around', 'us', 'what', 's', 'more', '', 'varieties', 'of', 'birds', 'flying', 'around', 'us', 'were', 'pretty', '', 'what', 'a', 'harmony', 'situation', '', 'at', 'noon', 'we', 'had', 'lunch', 'in', 'a', 'restaurant', '', 'with', 'a', 'happy', 'emotion', '', 'we', 'finished', 'our', 'climbing', '', 'although', 'tired', '', 'we', 'still', 'felt', 'happy', 'and', 'relaxed', 'we', 'hope', 'our', 'country', 'will', 'become', 'more', 'and', 'more', 'beautiful', '']
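Note the empty strings scattered through the list: re.split produces one between every pair of adjacent delimiters (for example a period followed by a space), which is why the code filters them out before counting. A tiny illustration:

import re

print(re.split(r"[,. ]", "day. my"))   # ['day', '', 'my']  -- '.' and ' ' are adjacent, so an empty string appears
print(re.split(r"[,. ]+", "day. my"))  # ['day', 'my']  -- a '+' merges delimiter runs, an alternative fix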

Then Counter tallies the words in the list into a dictionary, and Counter(datalist).most_common() sorts the output by frequency, so the printout shows exactly how many times each word occurred.
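The same pipeline run on a tiny hand-made sentence (illustrative data only), using str.split instead of re.split since there is no punctuation:

from collections import Counter

tokens = "the cat and the dog and the bird".split()
for word, count in Counter(tokens).most_common():
    print('%15s  ---->   %3s' % (word, count))
# prints: the ----> 3, and ----> 2, then cat, dog, bird ----> 1 each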

And that's it.