A Simple Python Example: Word Frequency Counting (with Word Segmentation)

一口一个小孩

Article link: https://blog.csdn.net/Zong_zongzong/article/details/110915455

1. Word frequency count for English text, printing the 15 most frequent words

txt="'Hooray!It's snowing!It's time to make a snowman.James runs out.He makes a big pile of snow." \
    "He puts a big snowball on top.He " \
    "adds a scarf and a hat.He adds an orange for thenoe.He adds coal for the eyes and buttons." \
    "Inthe evening,James opens the door.What does he see?The snowman is moving!James invites him in." \
    "The snowman hasneever been inside a house.He says hello to the cat.He plays with paper towels." \
    "A moment later,the snowman takes Jame's hand and goes out.They go up,up into the air!They are" \
    "flying! What a wonderful night! The next morning, James jumps out of the bed.He runs to the door." \
    "He wants to thank the snowman.But he's gone.'"

for ch in '~’!#$%^&*()_+-=|\';"：/.,?><~!@#￥%……&*（）——+-=":‘；、。，？《》{}':
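    # Replace each punctuation mark or special character with a space
    txt=txt.replace(ch," ")
txt = txt.lower()

# Split on whitespace and store the words in a list
words_list=txt.split()

counts={}
# Tally each word: get() returns 0 for a word not yet in the dict, then 1 is added
for word in words_list:
    counts[word]=counts.get(word,0)+1
items=list(counts.items())
# Sort the (word, count) tuples by their second element (the count), largest first.
# The key lambda tells sort() which element of each tuple to sort on.
# sort() is ascending by default; reverse=True makes it descending.
items.sort(key=lambda x:x[1],reverse=True)
for word,count in items[:15]:# print the 15 most frequent words (fewer if the text has fewer distinct words)
    # Left-align the word, right-align the count, and pad the gap with spaces
    print("{0:<10}{1:>5}".format(word,count))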

Output:

he           11
the          11
a             8
snowman       5
s             4
to            4
james         4
out           3
adds          3
and           3
it            2
runs          2
big           2
of            2
for           2
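
The `s` row in the output comes from contractions such as `It's` and `he's`: replacing the apostrophe with a space splits off a standalone `s`. As a minimal sketch of an alternative (not the approach used above), the standard library's `re` and `collections.Counter` can tokenize and count in a few lines. The regular expression and the `top_words` helper name are assumptions chosen for illustration.

```python
import re
from collections import Counter

def top_words(text, n=15):
    """Return the n most common words in text as (word, count) pairs."""
    # Keep runs of letters plus one internal apostrophe part, so "it's" stays a single token
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    # most_common() orders by count, descending; equal counts keep first-seen order
    return Counter(words).most_common(n)

sample = "It's snowing! It's time to make a snowman. He's gone."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

Because `most_common(n)` simply returns fewer pairs when the text has fewer than `n` distinct words, no index check is needed.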

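2. Word frequency count for Chinese text
① jieba, a Chinese word segmentation module for Python: full mode, precise mode, and search engine mode
(jieba.cut and jieba.cut_for_search return generators rather than strings, so the tokens have to be joined into a string before they can be printed.)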

import jieba
ex='南京市长江大桥'
# Full mode
all_cut=jieba.cut(ex,cut_all=True)
# Precise mode
precise_cut=jieba.cut(ex,cut_all=False)
# cut_all defaults to False, so omitting it
# also selects precise mode
default_precise_cut=jieba.cut(ex)
# Search engine mode
search_cut=jieba.cut_for_search(ex)

print("Full mode:","/".join(all_cut))
print("Precise mode:","/".join(precise_cut))
print("Default (precise) mode:","/".join(default_precise_cut))
print("Search engine mode:","/".join(search_cut))

Segmentation output:

Full mode: 南京/南京市/京市/市长/长江/长江大桥/大桥
Precise mode: 南京市/长江大桥
Default (precise) mode: 南京市/长江大桥
Search engine mode: 南京/京市/南京市/长江/大桥/长江大桥
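
Since `jieba.cut` returns a generator, its tokens can be consumed only once. The sketch below illustrates that behavior with `fake_cut`, a hypothetical stand-in written for this example (it is not part of jieba), so it runs without jieba installed. jieba also provides `lcut`, which returns a list directly.

```python
def fake_cut(tokens):
    """Hypothetical stand-in for jieba.cut: yields tokens lazily."""
    for token in tokens:
        yield token

segments = fake_cut(["南京市", "长江大桥"])
first = "/".join(segments)   # consumes the generator
second = "/".join(segments)  # the generator is already exhausted here
print(repr(first))
print(repr(second))

# Materializing the tokens in a list allows them to be reused
tokens = list(fake_cut(["南京市", "长江大桥"]))
print("/".join(tokens), len(tokens))
```

If the tokens are needed more than once, for example to print them and then count them, converting to a list first avoids silently working on an empty sequence.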

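② Chinese word frequency count, printing the 15 most frequent words
Chinese words usually consist of two or more characters, so tokens of length 1 are simply filtered out. This also discards single-character punctuation tokens.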

import jieba
txt="好棒哦！下雪了!是时候堆个雪人了。詹姆斯跑了出去。他弄了一大堆雪。他把一个大雪球放到了最上面来充当头部。" \
    "他给雪人加了一条围巾和一顶帽子，又给雪人添了一个桔子当鼻子。他又加了煤炭来充当眼睛和纽扣。傍晚，詹姆斯打开了门。" \
    "他看见了什么？雪人在移动！詹姆斯邀请他进来。雪人从来没有去过房间里面。它对喵咪打了个招呼。猫咪玩着纸巾。不久之后，" \
    "雪人牵着詹姆斯的手出去了。他们一直向上升，一直升到空中！他们在飞翔！多么美妙的夜晚！第二天早上，詹姆斯从床上蹦了起来。" \
    "他向门口跑去。他想感谢雪人，但是它已经消失了。"

words=jieba.cut(txt)
counts={}
for word in words:
    if len(word)==1:
        continue
    else:
        counts[word]=counts.get(word,0)+1

items=list(counts.items())

# Sort by frequency, highest first
items.sort(key=lambda x:x[1],reverse=True)
for word,count in items[:15]:
    # Use count rather than counts so the loop does not overwrite the counts dict
    print("{0:<5}\t{1:>5}".format(word,count))

Output:

雪人   	    7
詹姆斯  	    5
出去   	    2
一个   	    2
充当   	    2
他们   	    2
一直   	    2
下雪   	    1
时候   	    1
堆个   	    1
一大堆  	    1
雪球   	    1
放到   	    1
上面   	    1
头部   	    1
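
The filter-and-count step can also be written with `collections.Counter`. This is a minimal sketch, not the code used above: `count_words` is a name chosen for illustration, and the token list is hand-written as a stand-in for the output of `jieba.cut(txt)` so that the example runs without jieba.

```python
from collections import Counter

def count_words(tokens, min_len=2, n=15):
    """Count tokens of at least min_len characters and return the top n."""
    # Drops single-character tokens, including punctuation such as "！" and "。"
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common(n)

# Hypothetical, hand-written tokens standing in for jieba.cut(txt)
tokens = ["雪人", "在", "移动", "！", "詹姆斯", "邀请", "他", "进来", "。", "雪人"]
for word, count in count_words(tokens, n=3):
    print("{0:<5}\t{1:>5}".format(word, count))
```

With real input, `count_words(jieba.cut(txt))` would work the same way, since `Counter` accepts any iterable of tokens.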