Hadoop: Writing MapReduce in Native Python

Alvis-lby

Tags: Python, Hadoop, MapReduce, text analysis, word frequency

Article link: https://blog.csdn.net/weixin_33990147/article/details/113896742

Implementation

Goal: count how often each word occurs in a text file.

The text file to be analyzed is shown below.

【/root/hadooptest/input.txt】

foo foo quux labs foo bar quux abc bar see you by test welcome test

abc labs foo me python hadoop ab ac bc bec python

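Writing the Map code

The Map code reads data from standard input (stdin), splits each line into words on whitespace by default, and writes each word with its count to standard output (stdout), one pair per line. The Map step does not total the occurrences of each word. It simply emits "word<TAB>1" for every word it sees, so that the Reduce step can do the counting. mapper.py must have execute permission (for example, chmod +x mapper.py).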

【/root/hadooptest/mapper.py】

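#!/usr/bin/env python
# -*- coding:utf-8 -*-
import sys

# Input comes from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split the line into a list of words on whitespace (the default separator)
    words = line.split()
    for word in words:
        # Emit every word as "word<TAB>1" to serve as input for the Reduce step
        print('%s\t%s' % (word, 1))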

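Writing the Reduce code

The Reduce code reads the output of mapper.py from standard input (stdin), totals the occurrences of each word, and writes the result to standard output (stdout). reducer.py must also have execute permission.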

【/root/hadooptest/reducer.py】

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import sys

current_word = None
current_count = 0
word = None

# Read standard input, i.e. the standard output of mapper.py
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    try:
        # Parse the mapper.py output, using tab as the separator
        word, count = line.split('\t', 1)
        # Convert count from string to integer
        count = int(count)
    except ValueError:
        # Ignore lines that have no tab or a non-numeric count
        continue

    # This relies on the mapper.py output being sorted,
    # so that identical words arrive on consecutive lines
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the total for the current word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word
if current_word:
    print('%s\t%s' % (current_word, current_count))
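The same reduce logic can also be written with itertools.groupby, which groups consecutive lines that share a key. The following is only a sketch under the same assumption as reducer.py (the input is already sorted by word). The helper names parse_pairs and reduce_counts are made up for this illustration, and the sample list stands in for real mapper output.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) tuples from tab-separated lines."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep or not count.isdigit():
            # Skip lines with no tab or a non-numeric count
            continue
        yield word, int(count)


def reduce_counts(lines):
    """Sum the counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# In-memory demonstration with already-sorted mapper output
sample = ['bar\t1', 'bar\t1', 'foo\t1', 'foo\t1', 'foo\t1', 'quux\t1']
for word, total in reduce_counts(sample):
    print('%s\t%s' % (word, total))
```

To use this as a streaming reducer, one would pass sys.stdin to reduce_counts instead of the sample list.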

Testing the code

Test locally before running on the Hadoop platform.

[root@wx ~]# cd /root/hadooptest/
[root@wx hadooptest]# cat input.txt | ./mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1
abc	1
bar	1
see	1
you	1
by	1
test	1
welcome	1
test	1
abc	1
labs	1
foo	1
me	1
python	1
hadoop	1
ab	1
ac	1
bc	1
bec	1
python	1
[root@wx hadooptest]# cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py
ab	1
abc	2
ac	1
bar	2
bc	1
bec	1
by	1
foo	4
hadoop	1
labs	2
me	1
python	2
quux	2
see	1
test	2
welcome	1
you	1
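The local pipeline above (map, sort, reduce) can also be mimicked in a single Python script, which may help to see what each stage contributes. This is a rough sketch, not how Hadoop itself executes the job: the list sort stands in for `sort -k1,1`, and the input text is copied from input.txt.

```python
from collections import Counter

text = """foo foo quux labs foo bar quux abc bar see you by test welcome test
abc labs foo me python hadoop ab ac bc bec python"""

# Map phase: emit one (word, 1) pair per word, as mapper.py does
mapped = [(word, 1) for line in text.splitlines() for word in line.split()]

# Shuffle phase: sort by key, standing in for `sort -k1,1`
mapped.sort(key=lambda pair: pair[0])

# Reduce phase: sum the counts of consecutive equal keys, as reducer.py does
reduced = []
for word, count in mapped:
    if reduced and reduced[-1][0] == word:
        reduced[-1] = (word, reduced[-1][1] + count)
    else:
        reduced.append((word, count))

# Cross-check against a direct count of the same text
assert dict(reduced) == Counter(text.split())

for word, total in reduced:
    print('%s\t%s' % (word, total))
```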

Running on the Hadoop platform

Create a directory on HDFS to hold the text file. In this example it is /user/root/word.

/usr/local/hadoop-2.6.4/bin/hadoop fs -mkdir -p /user/root/word

Upload the input file to HDFS. In this example it is /root/hadooptest/input.txt.

/usr/local/hadoop-2.6.4/bin/hadoop fs -put /root/hadooptest/input.txt /user/root/word

List the files under /user/root/word.

/usr/local/hadoop-2.6.4/bin/hadoop fs -ls /user/root/word
# Result:
Found 1 items
-rw-r--r--   2 root supergroup        118 2016-03-22 13:36 /user/root/word/input.txt

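Run the MapReduce job, with the output directory set to /output/word. The command is run from /root/hadooptest, since -files refers to mapper.py and reducer.py by relative path.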

/usr/local/hadoop-2.6.4/bin/hadoop jar /usr/local/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar -files 'mapper.py,reducer.py' -input /user/root/word -output /output/word -mapper ./mapper.py -reducer ./reducer.py

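Parameter notes:

/usr/local/hadoop-2.6.4/bin/hadoop jar /usr/local/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar \
-input \          # multiple input paths may be given, e.g. -input '/user/foo/dir1' -input '/user/foo/dir2'
-inputformat \
-output \
-outputformat \
-mapper \
-reducer \
-combiner \
-partitioner \
-cmdenv \         # passes environment variables into the tasks, where they can act as parameters; may be given several times
-file \           # configuration files, dictionaries, and other dependencies
-D \              # job property settings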
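When a job is launched from a script, the options above can be assembled as an argument list. The sketch below only builds and prints the command and does not run it. The function name build_streaming_command, the MIN_LENGTH variable, and the choice of mapreduce.job.reduces as an example property are assumptions for illustration. Generic options such as -D and -files are placed before the streaming options, as Hadoop Streaming requires.

```python
HADOOP_HOME = '/usr/local/hadoop-2.6.4'
STREAMING_JAR = HADOOP_HOME + '/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar'


def build_streaming_command(inputs, output, mapper, reducer,
                            files=(), properties=None, env=None):
    """Return the hadoop streaming command as an argument list."""
    cmd = [HADOOP_HOME + '/bin/hadoop', 'jar', STREAMING_JAR]
    # Generic options (-D, -files) go before the streaming options
    for key, value in sorted((properties or {}).items()):
        cmd += ['-D', '%s=%s' % (key, value)]
    if files:
        cmd += ['-files', ','.join(files)]
    # -input may be repeated, once per input path
    for path in inputs:
        cmd += ['-input', path]
    cmd += ['-output', output, '-mapper', mapper, '-reducer', reducer]
    # -cmdenv may be repeated, once per environment variable
    for name, value in sorted((env or {}).items()):
        cmd += ['-cmdenv', '%s=%s' % (name, value)]
    return cmd


cmd = build_streaming_command(
    inputs=['/user/root/word'],
    output='/output/word',
    mapper='./mapper.py',
    reducer='./reducer.py',
    files=['mapper.py', 'reducer.py'],
    properties={'mapreduce.job.reduces': 1},  # example property, an assumption
    env={'MIN_LENGTH': '2'},                  # hypothetical variable
)
print(' '.join(cmd))
```

The resulting list could be handed to subprocess.call on a machine where Hadoop is installed. Inside mapper.py, a variable passed with -cmdenv would be read through os.environ.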

List the generated result files. /output/word/part-00000 is the file that holds the analysis results.

[root@wx hadooptest]# /usr/local/hadoop-2.6.4/bin/hadoop fs -ls /output/word
Found 2 items
-rw-r--r--   2 root supergroup          0 2016-03-22 13:47 /output/word/_SUCCESS
-rw-r--r--   2 root supergroup        110 2016-03-22 13:47 /output/word/part-00000

View the result data.

[root@wx hadooptest]# /usr/local/hadoop-2.6.4/bin/hadoop fs -cat /output/word/part-00000
ab	1
abc	2
ac	1
bar	2
bc	1
bec	1
by	1
foo	4
hadoop	1
labs	2
me	1
python	2
quux	2
see	1
test	2
welcome	1
you	1
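Once the result file has been fetched, its tab-separated lines are easy to post-process. As a small sketch, the snippet below parses the output shown above (pasted in as a string rather than read from HDFS) and ranks the words by frequency. The variable names are illustrative only.

```python
# Contents of part-00000, copied from the listing above
output = (
    "ab\t1\nabc\t2\nac\t1\nbar\t2\nbc\t1\nbec\t1\nby\t1\nfoo\t4\n"
    "hadoop\t1\nlabs\t2\nme\t1\npython\t2\nquux\t2\nsee\t1\ntest\t2\n"
    "welcome\t1\nyou\t1\n"
)

# Parse each "word<TAB>count" line into a dictionary
counts = {}
for line in output.splitlines():
    word, count = line.split('\t', 1)
    counts[word] = int(count)

# Rank by descending count, breaking ties alphabetically
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, total in ranked[:3]:
    print('%s\t%s' % (word, total))
```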

References:

Compiled from Liu Tiansi (刘天斯), 《Python自动化运维技术与最佳实践》 (Python Automated Operations: Techniques and Best Practices).
