09 Word Frequency Counting with Python

张力的程序园

Tags: python, linux, big data, hadoop, ubuntu

Article link: https://blog.csdn.net/langli204910/article/details/113789157

1 System, software, and prerequisites

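CentOS-7 64-bit
To keep Linux permission issues out of the way for beginners, all commands are run as root.
hadoop-2.5.2 is installed: https://www.jianshu.com/p/5707c5ccd85b
Python 3.7.3 is already installed on CentOS 7. (A stock CentOS 7 image ships Python 2.7 as python; the scripts below run under either version.)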

2 Steps

Create the file mapper.py:

#!/usr/bin/python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit one (word, 1) pair per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

To verify, run the following command:

echo aa bb cc dd aa cc|python mapper.py

This prints one tab-separated pair per word, in input order:

aa	1
bb	1
cc	1
dd	1
aa	1
cc	1
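The mapping step can also be checked without a shell. The following is a minimal sketch and not part of the original post: map_words is a hypothetical helper name that mirrors the loop in mapper.py but yields the pairs instead of printing them, which makes the logic easy to test.

```python
def map_words(lines):
    """Yield (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


if __name__ == "__main__":
    # Same input as the echo command above.
    for word, count in map_words(["aa bb cc dd aa cc\n"]):
        print("%s\t%s" % (word, count))
```

Repeated words are emitted once per occurrence; adding them up is left entirely to the reduce step.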

Create the file reducer.py:

#!/usr/bin/python

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

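    # parse the input we got from mapper.py; skip lines without a tab
    # (e.g. blank lines), which would otherwise fail to unpack
    if '\t' not in line:
        continue
    word, count = line.split('\t', 1)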

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print ('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word; the None check avoids
# printing a bogus "None 0" line when the input is empty
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))

To verify, run the following command:

echo aa bb cc dd aa cc|python mapper.py|sort|python reducer.py

This prints the total for each word:

aa	2
bb	1
cc	2
dd	1
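The reducer only compares each word with the previous one, so it depends on equal words arriving next to each other. That is why the pipeline needs sort, and why Hadoop sorts map output by key. The sketch below illustrates this in a single process. It is an illustration under assumptions, not the author's code: map_words, reduce_sorted and word_count are hypothetical names, and itertools.groupby stands in for the hand-written current_word bookkeeping.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: one (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Reduce step: sum the counts of adjacent pairs that share a key.

    Like reducer.py, this is only correct when the input is sorted by key.
    """
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() plays the role of `sort` in the shell pipeline.
    return list(reduce_sorted(sorted(map_words(lines))))


if __name__ == "__main__":
    for word, total in word_count(["aa bb cc dd aa cc"]):
        print("%s\t%s" % (word, total))
```

Feeding reduce_sorted unsorted pairs does not raise an error. It simply reports the same word more than once, which is also what reducer.py would do if the sort step were left out.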

Create a file info.txt with the following content:

aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
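Before running the job on Hadoop, it helps to know what totals to expect for this file. The following is a small sketch using collections.Counter from the standard library; INFO_TXT and expected_counts are hypothetical names, and the text is embedded as a string so the example runs without the file.

```python
from collections import Counter

# Contents of info.txt, embedded so the sketch is self-contained.
INFO_TXT = """\
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc
aa bb cc dd aa cc cc dd
"""


def expected_counts(text):
    """Count whitespace-separated words, the same tokenization mapper.py uses."""
    return Counter(text.split())


if __name__ == "__main__":
    for word, total in sorted(expected_counts(INFO_TXT).items()):
        print("%s\t%s" % (word, total))
```

For the content above this gives aa 10, bb 5, cc 11 and dd 6, which the Hadoop job output should match.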

Upload the file to HDFS as /data/info:

hdfs dfs -mkdir /data
hdfs dfs -put info.txt /data/info

Make sure /out99 does not exist on HDFS, then run the following command:

$HADOOP_HOME/bin/hadoop jar \
 $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar \
 -input "/data/*" \
 -output "/out99" \
 -mapper "python mapper.py" \
 -reducer "python reducer.py" \
 -file "/root/mapper.py" \
 -file "/root/reducer.py"
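The job fails if the output directory already exists, so the cleanup and the submission are often scripted together. The sketch below is one possible way to do that from Python and is not part of the original post: build_commands and run_job are hypothetical helpers, and /opt/hadoop is only a placeholder used when HADOOP_HOME is not set. The cleanup relies on hdfs dfs -rm -r -f, where -f suppresses the error for a path that does not exist.

```python
import os
import subprocess


def build_commands(hadoop_home, output_dir="/out99"):
    """Return the cleanup and streaming commands as argument lists."""
    cleanup = ["hdfs", "dfs", "-rm", "-r", "-f", output_dir]
    streaming = [
        os.path.join(hadoop_home, "bin", "hadoop"), "jar",
        os.path.join(hadoop_home, "share", "hadoop", "tools", "lib",
                     "hadoop-streaming-2.5.2.jar"),
        # No shell is involved, so "/data/*" reaches Hadoop unexpanded,
        # just as the quotes ensure in the shell version.
        "-input", "/data/*",
        "-output", output_dir,
        "-mapper", "python mapper.py",
        "-reducer", "python reducer.py",
        "-file", "/root/mapper.py",
        "-file", "/root/reducer.py",
    ]
    return cleanup, streaming


def run_job(hadoop_home):
    """Run cleanup, then the job; needs a working Hadoop installation."""
    for cmd in build_commands(hadoop_home):
        subprocess.check_call(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # "/opt/hadoop" is a placeholder, not a path from the article.
    home = os.environ.get("HADOOP_HOME", "/opt/hadoop")
    for cmd in build_commands(home):
        print(" ".join(cmd))
```

Calling run_job(os.environ["HADOOP_HOME"]) on the cluster node would execute both steps; the dry run above only shows what would be run.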

Note: $HADOOP_HOME is the Hadoop installation directory.
That completes word frequency counting with Python.
