Python: Writing a Hadoop MapReduce Program in Native Python (Based on Hadoop 2.5.2)

冰河

Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79055459

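I. Overview

In this example we count how often each word occurs in a text file, writing the MapReduce job in native Python. The input text file input.txt and the Python scripts are placed in the /usr/local/python/source directory. The text content is as follows: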

hello hello liuyazhuang lyz liuyazhuang lyz where is your home home see you by test welcome test adc abc labs me python hadoop ab bc bec python hadoop bar ccc bar ccc bbb aaa bbb iii ooo xxx yyy xxyy xxx iii ooo yyy

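II. Installing the Zookeeper Cluster

See the blog posts "Hadoop: Building a Hadoop 2.5.2 HA High-Availability Cluster (Hadoop+Zookeeper)" or "Storm: Setting Up a Storm Cluster".

III. Installing Hadoop

In this post, Hadoop is likewise installed on a single node, with HBase and Hadoop on the same server. Because HBase depends on Zookeeper to run, a single-node Zookeeper is installed on that server as well.

IV. Installing the Storm Cluster

See the blog post "Storm: Setting Up a Storm Cluster".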

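V. Writing the Map Code

Here we create a mapper.py script that reads data from standard input (stdin), splits each line into words on whitespace (the default), and writes each word and its count to standard output (stdout), one pair per line. The Map step does not total the occurrences of each word. It simply emits "word 1" (the word, a tab, then 1) for every word it sees, so that Reduce can do the counting. mapper.py must be executable, so run chmod +x /usr/local/python/source/mapper.py.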

【/usr/local/python/source/mapper.py】

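#!/usr/bin/env python
# -*- coding:UTF-8 -*-
'''
Created on 2018-01-14

@author: liuyazhuang
'''
import sys

# Read input from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split the line on whitespace (the default) into the words list
    words = line.split()
    for word in words:
        # Emit every word as "word<TAB>1" to serve as the Reduce input.
        # Calling print with parentheses runs under both Python 2 and Python 3.
        print('%s\t%s' % (word, 1))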

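VI. Writing the Reduce Code

Here we create a reducer.py script that reads the output of mapper.py from standard input (stdin), totals the occurrences of each word, and writes the result to standard output (stdout). reducer.py must be executable, so run chmod +x /usr/local/python/source/reducer.py.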

【/usr/local/python/source/reducer.py】

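#!/usr/bin/env python
# -*- coding:UTF-8 -*-
'''
Created on 2018-01-14

@author: liuyazhuang
'''

import sys

current_word = None
current_count = 0
word = None

# Read standard input, i.e. the output of mapper.py
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()

    # Parse the mapper.py output, using tab as the separator;
    # skip lines that do not contain a tab (for example blank lines)
    parts = line.split('\t', 1)
    if len(parts) != 2:
        continue
    word, count = parts

    # Convert count from string to integer
    try:
        count = int(count)
    except ValueError:
        # count is not a number, so ignore this line
        continue

    # This relies on the mapper.py output being sorted, so that identical
    # words arrive consecutively; Hadoop sorts the map output automatically
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the total for the current word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Write the total for the last word (nothing is printed for empty input)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))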
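
The consecutive-word bookkeeping in reducer.py can also be expressed with itertools.groupby. The following is only a sketch of that alternative, not the script used in this article. The function names parse_pairs and reduce_counts are made up for illustration, and the code assumes Python 3 (it should also run under Python 2.7).

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) pairs from "word<TAB>count" lines, skipping malformed ones."""
    for line in lines:
        parts = line.strip().split('\t', 1)
        if len(parts) != 2:
            continue  # no tab on this line
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue  # count is not a number


def reduce_counts(lines):
    """Sum the counts of consecutive identical words; input must be sorted by word."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # Hypothetical sample input, already sorted by word
    sample = ['bar\t1', 'bar\t1', 'ccc\t1', 'hello\t1', 'hello\t1']
    for word, total in reduce_counts(sample):
        print('%s\t%s' % (word, total))
```

To use this as a streaming reducer, the sample list would be replaced by sys.stdin. Like reducer.py, it only merges adjacent keys, so it still depends on sorted input.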

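VII. Testing the Code

Before running on the Hadoop platform, we can test locally to check that mapper.py and reducer.py produce correct results.

Note: when testing reducer.py, the output of mapper.py must be sorted first (sort). In the Hadoop environment the sorting happens automatically.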
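
To see why the sort step matters, the map, sort, and reduce stages can be simulated inside a single Python process. This is a minimal sketch under the assumption of Python 3. The functions map_words and reduce_sorted are illustrative stand-ins that mirror the logic of mapper.py and reducer.py, not the scripts themselves.

```python
def map_words(lines):
    """Emit (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_sorted(pairs):
    """Total the counts of consecutive identical words, like reducer.py."""
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    if current_word is not None:
        yield current_word, current_count


if __name__ == '__main__':
    text = ['home see home']
    # Without sorting, the two "home" pairs are not adjacent and are never merged
    print(list(reduce_sorted(map_words(text))))
    # With sorting (what "sort -k1,1" or the Hadoop shuffle provides) they are merged
    print(list(reduce_sorted(sorted(map_words(text)))))
```

The first call reports "home" twice with a count of 1 each, while the second reports it once with a count of 2, which is the behavior the note above warns about.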

1. Running mapper.py locally

[root@liuyazhuang121 source]# cat input.txt | ./mapper.py 
hello   1
hello   1
liuyazhuang     1
lyz     1
liuyazhuang     1
lyz     1
where   1
is      1
your    1
home    1
home    1
see     1
you     1
by      1
test    1
welcome 1
test    1
adc     1
abc     1
labs    1
me      1
python  1
hadoop  1
ab      1
bc      1
bec     1
python  1
hadoop  1
bar     1
ccc     1
bar     1
ccc     1
bbb     1
aaa     1
bbb     1
iii     1
ooo     1
xxx     1
yyy     1
xxyy    1
xxx     1
iii     1
ooo     1
yyy     1

This prints the Map output.

2. Running reducer.py locally

[root@liuyazhuang121 source]# cat input.txt  | ./mapper.py | sort -k1,1 | ./reducer.py 
aaa     1
ab      1
abc     1
adc     1
bar     2
bbb     2
bc      1
bec     1
by      1
ccc     2
hadoop  2
hello   2
home    2
iii     2
is      1
labs    1
liuyazhuang     2
lyz     2
me      1
ooo     2
python  2
see     1
test    2
welcome 1
where   1
xxx     2
xxyy    1
you     1
your    1
yyy     2

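This prints the Reduce output.

VIII. Running the Code on Hadoop

1. Creating the directory and uploading the file

First, create a directory on HDFS to hold the text file. In this example it is /user/root/word. Run the following command (add -p after -mkdir if the parent directory /user/root does not exist yet):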

hdfs dfs -mkdir /user/root/word

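Upload the text file to HDFS. In this example it is /usr/local/python/source/input.txt. If there are several files, upload them the same way: Hadoop takes a directory as its input by default, and every file in that directory is included in the computation.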

[root@liuyazhuang121 source]# hadoop fs -put /usr/local/python/source/input.txt /user/root/word/
[root@liuyazhuang121 source]# hadoop fs -ls /user/root/word/           
Found 1 items
-rw-r--r--   1 root supergroup        215 2018-01-14 09:59 /user/root/word/input.txt

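2. Running the MapReduce job

Here we specify /output/word as the output directory (Hadoop creates it, and the job fails if it already exists) and run the following command: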

[root@liuyazhuang121 source]# hadoop jar /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word

The map and reduce progress percentages are shown while the job runs. The log output is as follows:

18/01/14 10:54:19 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /usr/local/hadoop-2.5.2/tmp/hadoop-unjar3958497380381943575/] [] /tmp/streamjob1400075475828443108.jar tmpDir=null
18/01/14 10:54:22 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 10:54:22 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 10:54:24 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/14 10:54:25 INFO mapreduce.JobSubmitter: number of splits:2
18/01/14 10:54:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515893542122_0001
18/01/14 10:54:26 INFO impl.YarnClientImpl: Submitted application application_1515893542122_0001
18/01/14 10:54:26 INFO mapreduce.Job: The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0001/
18/01/14 10:54:26 INFO mapreduce.Job: Running job: job_1515893542122_0001
18/01/14 10:54:43 INFO mapreduce.Job: Job job_1515893542122_0001 running in uber mode : false
18/01/14 10:54:43 INFO mapreduce.Job:  map 0% reduce 0%
18/01/14 10:55:16 INFO mapreduce.Job:  map 33% reduce 0%
18/01/14 10:55:17 INFO mapreduce.Job:  map 100% reduce 0%
18/01/14 10:55:31 INFO mapreduce.Job:  map 100% reduce 100%
18/01/14 10:55:32 INFO mapreduce.Job: Job job_1515893542122_0001 completed successfully
18/01/14 10:55:32 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=398
                FILE: Number of bytes written=302280
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=529
                HDFS: Number of bytes written=202
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=62800
                Total time spent by all reduces in occupied slots (ms)=11416
                Total time spent by all map tasks (ms)=62800
                Total time spent by all reduce tasks (ms)=11416
                Total vcore-seconds taken by all map tasks=62800
                Total vcore-seconds taken by all reduce tasks=11416
                Total megabyte-seconds taken by all map tasks=64307200
                Total megabyte-seconds taken by all reduce tasks=11689984
        Map-Reduce Framework
                Map input records=1
                Map output records=44
                Map output bytes=304
                Map output materialized bytes=404
                Input split bytes=206
                Combine input records=0
                Combine output records=0
                Reduce input groups=30
                Reduce shuffle bytes=404
                Reduce input records=44
                Reduce output records=30
                Spilled Records=88
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=159
                CPU time spent (ms)=3040
                Physical memory (bytes) snapshot=571060224
                Virtual memory (bytes) snapshot=2657177600
                Total committed heap usage (bytes)=378011648
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=323
        File Output Format Counters 
                Bytes Written=202
18/01/14 10:55:32 INFO streaming.StreamJob: Output directory: /output/word
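
The WARN line at the top of the log says that -file is deprecated in favor of the generic option -files. The sketch below assembles an equivalent command line that uses -files. It was not run against the cluster in this article, so treat it as an assumption to verify. The helper name build_streaming_command is made up for illustration, and Python 3 is assumed. Generic options such as -files have to be placed before the streaming options.

```python
def build_streaming_command(streaming_jar, input_dir, output_dir):
    """Return the argument list for a streaming job that ships both scripts via -files."""
    return [
        'hadoop', 'jar', streaming_jar,
        # Generic option: comma-separated local files to ship to the tasks
        '-files', './mapper.py,./reducer.py',
        # Streaming options follow the generic options
        '-mapper', './mapper.py',
        '-reducer', './reducer.py',
        '-input', input_dir,
        '-output', output_dir,
    ]


if __name__ == '__main__':
    cmd = build_streaming_command(
        '/usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar',
        '/user/root/word',
        '/output/word',
    )
    # Print the command for inspection instead of executing it
    print(' '.join(cmd))
```

The printed line can be pasted into the shell from /usr/local/python/source, or the list can be handed to subprocess.call once the command is confirmed to work on the cluster.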

Now run the following command to list the output:

[root@liuyazhuang121 source]# hadoop fs -ls /output/word
Found 2 items
-rw-r--r--   1 root supergroup          0 2018-01-14 10:55 /output/word/_SUCCESS
-rw-r--r--   1 root supergroup        202 2018-01-14 10:55 /output/word/part-00000
[root@liuyazhuang121 source]#

Here, part-00000 holds the analysis results. View its contents:

[root@liuyazhuang121 source]# hadoop fs -cat /output/word/part-00000
aaa     1
ab      1
abc     1
adc     1
bar     2
bbb     2
bc      1
bec     1
by      1
ccc     2
hadoop  2
hello   2
home    2
iii     2
is      1
labs    1
liuyazhuang     2
lyz     2
me      1
ooo     2
python  2
see     1
test    2
welcome 1
where   1
xxx     2
xxyy    1
you     1
your    1
yyy     2

As you can see, the result matches what we got in the local test.
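
As an independent cross-check, the same counts can be computed in plain Python with collections.Counter. This sketch assumes Python 3 and hard-codes the contents of input.txt from the Overview section. It is only a sanity check and does not replace the MapReduce job.

```python
from collections import Counter

# Contents of input.txt, copied from the Overview section
INPUT_TEXT = (
    "hello hello liuyazhuang lyz liuyazhuang lyz where is your home home "
    "see you by test welcome test adc abc labs me python hadoop ab bc bec "
    "python hadoop bar ccc bar ccc bbb aaa bbb iii ooo xxx yyy xxyy xxx "
    "iii ooo yyy"
)

counts = Counter(INPUT_TEXT.split())

if __name__ == '__main__':
    # Print in the same sorted "word<TAB>count" form as part-00000
    for word in sorted(counts):
        print('%s\t%s' % (word, counts[word]))
    print('total words: %d' % sum(counts.values()))
    print('distinct words: %d' % len(counts))
```

The totals line up with the job counters in the log: 44 words in total (Map output records=44) and 30 distinct words (Reduce output records=30).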

To shorten the command for running Hadoop MapReduce, we can expose the path of hadoop-streaming-*.jar as an environment variable in /etc/profile. Add the following lines to /etc/profile:

HADOOP_STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
export HADOOP_STREAM

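This relies on the Hadoop environment variables (HADOOP_HOME above) having been configured earlier. Run source /etc/profile for the new variable to take effect in the current shell.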
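
The wildcard in HADOOP_STREAM is left for the shell to expand when the variable is used. If a Python launcher script ever needs the same path, the wildcard can be resolved with the glob module. This is a sketch assuming Python 3, and the helper name find_streaming_jar is made up for illustration.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the first hadoop-streaming-*.jar under share/hadoop/tools/lib."""
    pattern = os.path.join(
        hadoop_home, 'share', 'hadoop', 'tools', 'lib', 'hadoop-streaming-*.jar')
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise IOError('no hadoop-streaming jar found under %s' % hadoop_home)
    return matches[0]


if __name__ == '__main__':
    # HADOOP_HOME is assumed to be set, as in /etc/profile
    home = os.environ.get('HADOOP_HOME')
    if home:
        print(find_streaming_jar(home))
    else:
        print('HADOOP_HOME is not set')
```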

Now we can run the MapReduce program with the following command:

[root@liuyazhuang121 source]# hadoop jar $HADOOP_STREAM -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word1

Again we can see the Map and Reduce progress percentages. The log output is as follows:

18/01/14 11:04:46 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /usr/local/hadoop-2.5.2/tmp/hadoop-unjar2463144927504143769/] [] /tmp/streamjob3106204875058057023.jar tmpDir=null
18/01/14 11:04:47 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 11:04:48 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 11:04:48 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/14 11:04:48 INFO mapreduce.JobSubmitter: number of splits:2
18/01/14 11:04:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515893542122_0002
18/01/14 11:04:49 INFO impl.YarnClientImpl: Submitted application application_1515893542122_0002
18/01/14 11:04:49 INFO mapreduce.Job: The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0002/
18/01/14 11:04:49 INFO mapreduce.Job: Running job: job_1515893542122_0002
18/01/14 11:04:55 INFO mapreduce.Job: Job job_1515893542122_0002 running in uber mode : false
18/01/14 11:04:55 INFO mapreduce.Job:  map 0% reduce 0%
18/01/14 11:05:05 INFO mapreduce.Job:  map 100% reduce 0%
18/01/14 11:05:19 INFO mapreduce.Job:  map 100% reduce 100%
18/01/14 11:05:19 INFO mapreduce.Job: Job job_1515893542122_0002 completed successfully
18/01/14 11:05:20 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=398
                FILE: Number of bytes written=302283
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=529
                HDFS: Number of bytes written=202
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters 
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=15700
                Total time spent by all reduces in occupied slots (ms)=10749
                Total time spent by all map tasks (ms)=15700
                Total time spent by all reduce tasks (ms)=10749
                Total vcore-seconds taken by all map tasks=15700
                Total vcore-seconds taken by all reduce tasks=10749
                Total megabyte-seconds taken by all map tasks=16076800
                Total megabyte-seconds taken by all reduce tasks=11006976
        Map-Reduce Framework
                Map input records=1
                Map output records=44
                Map output bytes=304
                Map output materialized bytes=404
                Input split bytes=206
                Combine input records=0
                Combine output records=0
                Reduce input groups=30
                Reduce shuffle bytes=404
                Reduce input records=44
                Reduce output records=30
                Spilled Records=88
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=167
                CPU time spent (ms)=3260
                Physical memory (bytes) snapshot=598515712
                Virtual memory (bytes) snapshot=2668818432
                Total committed heap usage (bytes)=429916160
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters 
                Bytes Read=323
        File Output Format Counters 
                Bytes Written=202
18/01/14 11:05:20 INFO streaming.StreamJob: Output directory: /output/word1

Checking the result now shows the same output as before.