[Python] Word Frequency Counting (written in Python and MapReduce)

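I. Word frequency counting with Python

(1) The method commonly used in the computer rank examination
First, a fairly standard method used in the exam, for English text: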

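def getText():
    # Read the Hamlet text file; the raw string keeps the backslash in the Windows path literal
    with open(r"E:\hamlet.txt", "r") as f:
        txt = f.read()
    txt = txt.lower()          # convert every word to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, " ")   # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()  # split the text on whitespace
counts = {}
for word in words:  # count each word's occurrences and store them in the counts dict
    counts[word] = counts.get(word, 0) + 1  # get returns 0 when word is not yet a key
items = list(counts.items())   # convert the dict to a list of (word, count) pairs so it can be sorted
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending (lambda is explained below)
for word, count in items[:10]:  # the ten most frequent words
    print("{0:<10}{1:>5}".format(word, count))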
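The character-by-character replace loop above can also be written with `str.translate`. The following is only a minimal sketch of that alternative: the `clean` helper and the sample sentence are made up for illustration, and `string.punctuation` covers a slightly different set of characters than the hand-written string (it also includes the apostrophe).

```python
import string

# Translation table mapping every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean(text):
    """Lowercase the text and replace punctuation with spaces (hypothetical helper)."""
    return text.lower().translate(PUNCT_TO_SPACE)

sample = "To be, or not to be: that is the question!"  # made-up sample input
words = clean(sample).split()
print(words)
```

A single `translate` call walks the text once, whereas the loop version rescans the whole text for each punctuation character.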

For Chinese text the jieba library is normally used. Here is an example (although it is not tested very often):

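# Word frequency counting with the jieba library
import jieba
# encoding is passed explicitly to avoid encoding problems
txt = open("Jieba词频统计素材.txt", "r", encoding='utf-8').read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue  # skip single-character words such as "的" or "好"
    counts[word] = counts.get(word, 0) + 1  # put the segmented word into the dict
# To leave out specific words, build a list of them and remove them from counts
# once counting is done, for example:
# exclude_words = ["统计", "排除"]
# for word in exclude_words:
#     counts.pop(word, None)  # pop with a default does not fail if the word is absent
# This keeps unwanted words out of the result.
# Now sort the words in the dict by frequency
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))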
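The exclusion idea mentioned in the comments above can also be applied while counting instead of deleting keys afterwards. This is a small sketch under assumptions: `sample_words` is a made-up, already-segmented list standing in for the output of `jieba.lcut(txt)`, and `exclude_words` is a hypothetical stop list.

```python
# Made-up, already-segmented words standing in for jieba.lcut(txt)
sample_words = ["统计", "词频", "统计", "排除", "的", "词频", "统计"]
# Hypothetical set of words that should not be counted
exclude_words = {"统计", "排除"}

counts = {}
for word in sample_words:
    # Skip single-character words and anything on the exclusion list
    if len(word) == 1 or word in exclude_words:
        continue
    counts[word] = counts.get(word, 0) + 1

print(counts)
```

Using a set for `exclude_words` keeps the membership test cheap even when the list of unwanted words grows.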

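(2) Going further

  1. Core syntax for word frequency counting in Python
    To get comfortable with word frequency counting in Python (specifically the simplest method shown above), I think the following points need to be familiar.
    (1) Putting words into a dict while counting them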
    (1)将词放入字典,并同时统计频数的过程
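words = txt_file.split()  # split the text on whitespace (txt_file is the text read from the file)
# words = jieba.lcut(txt_file)  # or, for Chinese text, segment it with jieba (needs import jieba)
counts = {}
for word in words:
    # dict.get(key, default) returns default when key is missing,
    # so this one line both initializes and increments the count
    counts[word] = counts.get(word, 0) + 1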
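To make the role of `dict.get` concrete, here is a short sketch on a made-up word list. It also shows `collections.defaultdict(int)`, a standard-library alternative that supplies the missing-key default automatically; the names `words` and `auto_counts` are illustrative only.

```python
from collections import defaultdict

words = ["to", "be", "or", "not", "to", "be"]  # made-up sample input

# Variant 1: dict.get supplies 0 the first time a word is seen
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Variant 2: defaultdict(int) creates missing keys with the value 0
auto_counts = defaultdict(int)
for word in words:
    auto_counts[word] += 1

print(counts)
print(dict(auto_counts))
```

Both loops end with the same mapping; the choice is mostly a matter of taste.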

(2) Turning the dict's key-value pairs into a list and sorting it along the way

items = list(counts.items())  # items() returns the (key, value) pairs
items.sort(key=lambda x: x[1], reverse=True)

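A quick word on lambda first: lambda x: y takes x and returns y, so the key parameter of sort receives a function whose return value is used as the sort key. Here x is one element of the items list, that is, one (word, count) tuple, and x[1] is its second field, the word's frequency.
In other words, items is sorted by frequency, and reverse=True puts the largest counts first.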
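A tiny example may help here. The pairs below are made up; the sketch compares the lambda key with `operator.itemgetter(1)`, which is a standard-library way of writing the same key function.

```python
from operator import itemgetter

# Made-up (word, count) pairs, in the shape produced by list(counts.items())
items = [("the", 5), ("hamlet", 2), ("and", 7)]

# The key function is called once per element; x is a single (word, count) tuple
by_lambda = sorted(items, key=lambda x: x[1], reverse=True)

# itemgetter(1) builds an equivalent key function
by_itemgetter = sorted(items, key=itemgetter(1), reverse=True)

print(by_lambda)
```

`sorted` returns a new list and leaves `items` untouched, while `items.sort(...)` reorders the list in place.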
  2. Three ways to count word frequencies in Python

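import pandas as pd
from collections import Counter
words_list = ["Monday","Tuesday","Thursday","Zeus","Venus","Monday","Monday","Zeus","Venus","Venus"]
# Method 1: a plain dict (named counts so the built-in dict is not shadowed)
counts = {}
for word in words_list:
    counts[word] = counts.get(word, 0) + 1
print("Result1:\n", counts)
# Method 2: collections.Counter
result2 = Counter(words_list)
print("Result2:\n", result2)
# Method 3: pandas (the top-level pd.value_counts is deprecated in recent pandas releases,
# so the Series method is used here)
result3 = pd.Series(words_list).value_counts()
print("Result3:\n", result3)

Output:
Result1: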
 {'Monday': 3, 'Tuesday': 1, 'Thursday': 1, 'Zeus': 2, 'Venus': 3}
Result2:
 Counter({'Monday': 3, 'Venus': 3, 'Zeus': 2, 'Tuesday': 1, 'Thursday': 1})
Result3:
 Monday      3
Venus       3
Zeus        2
Thursday    1
Tuesday     1
dtype: int64
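When only the top few words are needed, `Counter.most_common(n)` returns them already sorted, so no separate sort step is required. A brief sketch reusing the same sample list (the name `top_two` is illustrative):

```python
from collections import Counter

words_list = ["Monday", "Tuesday", "Thursday", "Zeus", "Venus",
              "Monday", "Monday", "Zeus", "Venus", "Venus"]

# most_common(n) returns the n most frequent (word, count) pairs;
# words with equal counts keep the order in which they were first seen
top_two = Counter(words_list).most_common(2)
print(top_two)
```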

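II. Word frequency counting with MapReduce

Counting words in very large files calls for a cluster. We plan to run Python programs on the Hadoop platform and distribute the computation to make the word count more efficient, so the programs are written in the MapReduce style.

(1) Code example and explanation
Map: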
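Before looking at code meant for a cluster, it may help to picture the data flow in memory. The sketch below is only an illustration of the map, shuffle, and reduce stages on two made-up lines; the function names are hypothetical and nothing here touches Hadoop.

```python
from collections import defaultdict

def map_phase(lines):
    """Emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def shuffle(pairs):
    """Group values by key and order the keys, roughly what the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Sum the list of ones collected for each word."""
    return {word: sum(values) for word, values in grouped.items()}

lines = ["to be or not to be", "to do"]  # made-up input
result = reduce_phase(shuffle(map_phase(lines)))
print(result)
```

On a cluster the map and reduce stages run on many machines at once and the framework performs the grouping, but the shape of the data at each stage is the same.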

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

def main():
    # input comes from STDIN (standard input)
    for line in sys.stdin:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split()
        # increase counters
        for word in words:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            # tab-delimited; the trivial word count is 1
            print('%s\t%s' % (word, 1))

if __name__ == "__main__":
    main()

Reduce:

#!/usr/bin/env python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line had no tab or count was not a number,
        # so silently ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
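The manual `current_word` bookkeeping in the reducer can also be expressed with `itertools.groupby`, which relies on the same guarantee that the input arrives sorted by key. The following is a sketch of that variant rather than the script used above: `parse`, `reduce_sorted`, and the sample lines are hypothetical, and the function takes any iterable of lines so it can be tried without Hadoop.

```python
from itertools import groupby

def parse(lines):
    """Turn 'word<TAB>count' lines into (word, int) pairs, skipping malformed lines."""
    for line in lines:
        word, _, count = line.strip().partition("\t")
        if count.isdigit():
            yield word, int(count)

def reduce_sorted(lines):
    """Sum the counts of consecutive equal words; the input must be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# Made-up mapper output, already sorted by key, plus one malformed line
sample = ["be\t1", "be\t1", "not\t1", "to\t1", "to\t1", "to\t1", "bad line"]
totals = list(reduce_sorted(sample))
print(totals)
```

In an actual job the same function could be fed `sys.stdin` and each pair printed as a tab-separated line.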

(2) Exploring the core syntax
