[Python] Word Frequency Counting (written in Python and MapReduce)

Manchesterr

Article link: https://blog.csdn.net/weixin_44115606/article/details/108359994

I. Word Frequency Counting with Python

(i) The method commonly used in the National Computer Rank Examination
First, a fairly standard method used in the exam, for English text:

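def getText():
    # Read the Hamlet text file and return its contents as txt
    # (raw string so the backslash in the Windows path is taken literally)
    with open(r"E:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()          # convert all words to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()  # split the text on whitespace
counts = {}
for word in words:  # count occurrences of each word in the counts dict
    counts[word] = counts.get(word, 0) + 1  # get() returns 0 if word is not yet a key
items = list(counts.items())   # convert the dict to a list so it can be sorted
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending (explained below)
for word, count in items[:10]:  # top 10; slicing is safe with fewer than 10 words
    print("{0:<10}{1:>5}".format(word, count))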
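
The character-by-character replace loop above can also be written with `str.translate`. The following is only a minimal sketch: it works on an in-memory sample string instead of hamlet.txt, the helper name `normalize` is hypothetical, and `string.punctuation` also contains the apostrophe, which the character list above does not.

```python
import string

def normalize(text):
    """Lowercase text and replace ASCII punctuation with spaces."""
    # Map every ASCII punctuation character to a space in a single pass.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table)

sample = "To be, or not to be: that is the question."
words = normalize(sample).split()

counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

print(counts["to"], counts["be"])
```

The counting and sorting steps stay exactly the same; only the cleaning step changes.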

For Chinese text, the jieba library is generally used. Here is an example (though it is not tested very often):

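# Word frequency counting with the jieba library
import jieba
# specify the encoding to avoid decoding problems
with open("Jieba词频统计素材.txt", "r", encoding='utf-8') as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character tokens such as "的" or "好"
    counts[word] = counts.get(word, 0) + 1  # put the token in the dict and count it
# To exclude particular words, create a list of the words you do not want
# counted and remove them from the dict once counting is done, for example:
# exclude_words = ["统计", "排除"]
# for word in exclude_words:
#     counts.pop(word, None)  # pop with a default avoids KeyError if the word is absent
# This keeps unwanted words out of the results.
# Now sort the dict entries by frequency and print the top 10
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))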
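
The exclusion idea described in the comments can also be applied while counting, instead of deleting keys afterwards. This is a sketch under assumptions: the token list is made-up sample data standing in for the output of `jieba.lcut(txt)`, so the snippet runs without jieba installed.

```python
# Hypothetical tokens standing in for the result of jieba.lcut(txt).
words = ["统计", "词频", "的", "统计", "方法", "词频", "排除", "词频"]
exclude_words = {"统计", "排除"}  # words we do not want in the result

counts = {}
for word in words:
    # Skip single-character tokens and anything in the exclusion set.
    if len(word) == 1 or word in exclude_words:
        continue
    counts[word] = counts.get(word, 0) + 1

items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
```

Using a set for `exclude_words` keeps the membership test cheap even when the exclusion list grows.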

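(ii) Going further

Core syntax for word frequency counting in Python
To get comfortable with word frequency counting in Python (specifically the simplest method shown above), I think the following points are the important ones to know.
(1) Putting words into a dict and counting them at the same time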

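words = txt_file.split()  # split the text on whitespace
words2 = jieba.lcut(txt_file)  # or tokenize Chinese text with jieba (lcut is a jieba function, not a str method)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # dict.get(key, default) returns default if key is missing; this one line does the counting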

(2) Getting the dict's key-value pairs as a list and sorting them along the way

items = list(counts.items())  # items() returns the key-value pairs
items.sort(key=lambda x: x[1], reverse=True)

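First, a quick word on lambda functions: lambda x: y takes x as input and returns y. sort calls the function passed as key once for each element of the list and orders the elements by the values it returns. Here each input x is one element of items, that is, a (word, count) tuple, and the output x[1] is its second field, the word's frequency.
In other words, items is sorted by frequency, and reverse=True puts the highest counts first.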
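
To make the role of the key function concrete, here is a small sketch with made-up sample pairs. It also shows `operator.itemgetter(1)`, a standard-library equivalent of `lambda x: x[1]`.

```python
from operator import itemgetter

# Hypothetical (word, count) pairs, as items() would return them.
items = [("the", 5), ("hamlet", 2), ("and", 9)]

# The key function receives one (word, count) tuple at a time and returns the count.
by_lambda = sorted(items, key=lambda x: x[1], reverse=True)

# itemgetter(1) builds an equivalent key function.
by_getter = sorted(items, key=itemgetter(1), reverse=True)

print(by_lambda)
```

Both calls produce the same ordering, from the highest count to the lowest.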
(3) Three ways to count word frequencies in Python, with an example

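import pandas as pd
from collections import Counter
words_list = ["Monday","Tuesday","Thursday","Zeus","Venus","Monday","Monday","Zeus","Venus","Venus"]
# Method 1: a plain dict with get() (named word_counts to avoid shadowing the built-in dict)
word_counts = {}
for word in words_list:
    word_counts[word] = word_counts.get(word, 0) + 1
print("Result1:\n", word_counts)
# Method 2: collections.Counter
result2 = Counter(words_list)
print("Result2:\n", result2)
# Method 3: pandas (Series.value_counts replaces the deprecated top-level pd.value_counts)
result3 = pd.Series(words_list).value_counts()
print("Result3:\n", result3)

Output (the exact formatting of Result3 may differ between pandas versions):
Result1:
 {'Monday': 3, 'Tuesday': 1, 'Thursday': 1, 'Zeus': 2, 'Venus': 3}
Result2:
 Counter({'Monday': 3, 'Venus': 3, 'Zeus': 2, 'Tuesday': 1, 'Thursday': 1})
Result3:
 Monday      3
Venus       3
Zeus        2
Thursday    1
Tuesday     1
dtype: int64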
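
When only the most frequent words are needed, `Counter.most_common(n)` returns them already sorted, so the manual sort-and-slice step is not required. A short sketch using the same sample list:

```python
from collections import Counter

words_list = ["Monday", "Tuesday", "Thursday", "Zeus", "Venus",
              "Monday", "Monday", "Zeus", "Venus", "Venus"]

# most_common(n) returns the n highest-count (word, count) pairs;
# words with equal counts keep the order in which they were first seen.
top2 = Counter(words_list).most_common(2)
print(top2)
```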

II. Word Frequency Counting with MapReduce

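When large files need to be counted, the work has to be spread over a cluster. We plan to run Python programs on the Hadoop platform so that the computation is distributed and the word count runs more efficiently, which is why the code is written in MapReduce style.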
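
Before moving to a cluster, it can help to see the map, sort, and reduce steps in a single process. The sketch below is an illustration only and not how Hadoop itself is implemented: the function names `map_phase` and `reduce_phase` are hypothetical, and Python's `sorted` stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reduce_phase(sorted_pairs):
    """Sum the counts of consecutive pairs that share the same word."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["a b a", "b c", "a"]
pairs = sorted(map_phase(lines))    # stands in for the sort between map and reduce
result = dict(reduce_phase(pairs))
print(result)
```

Because `groupby` only merges adjacent items, the sort step is what guarantees that each word reaches the reduce step as one contiguous group.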

(i) Code example and explanation
Map:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

def main():
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

if __name__ == "__main__":
    main()

Reduce:

#!/usr/bin/env python

import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (tab-delimited, like every other output line)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
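
The comment in the reducer about sorted input can be checked without a cluster. The following sketch wraps the same running-total logic in a hypothetical function, `reduce_lines`, which takes any iterable of tab-separated lines instead of reading STDIN, and compares sorted with unsorted input.

```python
def reduce_lines(lines):
    """Apply the streaming-reducer logic to tab-separated word/count lines."""
    results = []
    current_word, current_count = None, 0
    for line in lines:
        word, count = line.strip().split("\t", 1)
        count = int(count)
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    # Flush the last word after the loop ends.
    if current_word is not None:
        results.append((current_word, current_count))
    return results

mapped = ["a\t1", "b\t1", "a\t1"]
print(reduce_lines(sorted(mapped)))  # sorted input: each word appears once
print(reduce_lines(mapped))          # unsorted input: 'a' is emitted twice
```

With unsorted input the same word is reported more than once, which is why the reducer relies on the map output being sorted by key before it arrives.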