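Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/79055459
I. Overview
In this example we count how often each word occurs in a text file, writing the MapReduce job in plain Python. The input text file input.txt and the Python scripts are placed in the /usr/local/python/source directory. The file contains: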
hello hello liuyazhuang lyz liuyazhuang lyz where is your home home see you by test welcome test adc abc labs me python hadoop ab bc bec python hadoop bar ccc bar ccc bbb aaa bbb iii ooo xxx yyy xxyy xxx iii ooo yyy
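II. Install a Zookeeper Cluster
See the posts "Hadoop: Building a Hadoop 2.5.2 HA High-Reliability Cluster (Hadoop + Zookeeper)" or "Storm: Building a Storm Cluster".
III. Install Hadoop
1. Pseudo-distributed installation
See the post "Hadoop: Hadoop 2.4.1 Pseudo-Distributed Setup".
2. Cluster installation
See the post "Hadoop: CentOS + Hadoop 2.5.2 Distributed Environment Configuration".
3. High-availability cluster installation
See the posts "Hadoop: Building a Hadoop 2.5.2 HA High-Reliability Cluster (Hadoop + Zookeeper), Preparation" and "Hadoop: Building a Hadoop 2.5.2 HA High-Reliability Cluster (Hadoop + Zookeeper)".
In this post, I likewise installed Hadoop on a single node, with HBase and Hadoop on the same server. Because HBase depends on Zookeeper, I also installed a single-node Zookeeper on that server.
IV. Install a Storm Cluster
See the post "Storm: Building a Storm Cluster".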
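V. Write the Map Code
Here we create a script named mapper.py that reads data from standard input (stdin), splits each line into words on whitespace (the default), and writes each word with its count to standard output (stdout), one pair per line. The Map step does not total the occurrences of each word. It simply emits "word 1" for every occurrence so that the Reduce step can do the counting. mapper.py must be executable, so run chmod +x /usr/local/python/source/mapper.py.
[/usr/local/python/source/mapper.py]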
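#!/usr/bin/env python
# -*- coding:UTF-8 -*-
'''
Created on January 14, 2018
@author: liuyazhuang
'''
import sys

# Read input from standard input (stdin)
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    # Split the line on whitespace (the default) into a list of words
    words = line.split()
    for word in words:
        # Emit every word as "word<TAB>1" to serve as input for Reduce
        # (the single-argument print() form works in Python 2 and Python 3)
        print('%s\t%s' % (word, 1))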
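The tab in the mapper output matters: by default Hadoop Streaming treats everything up to the first tab of a line as the key and the rest as the value, and a line with no tab becomes a key with an empty value. The following is a minimal sketch of that convention, not Hadoop's own code. The helper names split_key_value and map_line are hypothetical and exist only for illustration.

```python
def split_key_value(line, separator="\t"):
    """Split one streaming record into (key, value) at the first separator.

    Illustrative helper: mirrors the default Hadoop Streaming convention
    that the text before the first tab is the key.
    """
    key, _, value = line.rstrip("\n").partition(separator)
    return key, value


def map_line(line):
    """Apply the same logic as mapper.py to a single input line."""
    return ["%s\t%s" % (word, 1) for word in line.split()]


records = map_line("hello hello lyz")
pairs = [split_key_value(record) for record in records]
print(pairs)
```

Every occurrence of a word becomes its own record, so "hello" appears twice here, each time with the value "1". Summing those values is left to the reducer.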
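VI. Write the Reduce Code
Here we create a script named reducer.py that reads the output of mapper.py from standard input (stdin), totals the occurrences of each word, and writes the result to standard output (stdout). reducer.py must be executable, so run chmod +x /usr/local/python/source/reducer.py.
[/usr/local/python/source/reducer.py]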
#!/usr/bin/env python
# -*- coding:UTF-8 -*-
'''
Created on January 14, 2018
@author: liuyazhuang
'''
import sys

current_word = None
current_count = 0

# Read standard input, i.e. the output of mapper.py
for line in sys.stdin:
    # Strip leading and trailing whitespace
    line = line.strip()
    try:
        # Parse the mapper.py output, using tab as the separator,
        # and convert count from a string to an integer
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        # Skip the line if it has no tab or count is not a number
        continue
    # This check relies on the mapper output being sorted so that equal
    # words are adjacent; Hadoop sorts by key automatically
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # Write the total for the finished word to standard output
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Write the total for the last word, if there was any valid input
if current_word:
    print('%s\t%s' % (current_word, current_count))
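The same reduce step can be written more compactly with itertools.groupby, which groups adjacent records that share a key. This is an alternative sketch rather than the script used in this post. The function names parse and reduce_counts are hypothetical, and the sample data is made up to show how malformed lines are skipped.

```python
from itertools import groupby
from operator import itemgetter


def parse(lines):
    """Yield (word, count) pairs, skipping lines that are malformed."""
    for line in lines:
        word, sep, count = line.strip().partition("\t")
        if not sep:
            continue  # no tab, so this is not a "word<TAB>count" record
        try:
            yield word, int(count)
        except ValueError:
            continue  # count is not a number


def reduce_counts(sorted_lines):
    """Sum the counts of adjacent records with the same word.

    Like reducer.py, this assumes the input is already sorted by word.
    """
    for word, group in groupby(parse(sorted_lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


sample = ["bar\t1", "bar\t1", "ccc\t1", "bad line", "ccc\tx", "ccc\t1"]
result = list(reduce_counts(sample))
print(result)
```

To use it as a streaming reducer, one would feed it sys.stdin and print each pair as "word<TAB>count". groupby only merges neighbouring keys, so it has the same sorted-input requirement as the loop above.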
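VII. Test the Code
Before running on the Hadoop platform, we can test locally to check that mapper.py and reducer.py produce correct results.
Note: when testing reducer.py, the output of mapper.py must be sorted first (with sort). In a Hadoop environment this sorting happens automatically.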
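The reason for the sort can be seen in a small in-memory sketch. The function reduce_adjacent below is a hypothetical stand-in for the loop in reducer.py: it only merges counts for equal words that sit next to each other, so unsorted input produces the same word more than once.

```python
def reduce_adjacent(pairs):
    """Sum counts over runs of adjacent equal words, as reducer.py does."""
    results = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results


# "test welcome test" is a fragment of input.txt
mapped = [(word, 1) for word in "test welcome test".split()]

unsorted_result = reduce_adjacent(mapped)
sorted_result = reduce_adjacent(sorted(mapped))

print(unsorted_result)  # "test" is reported twice, each with count 1
print(sorted_result)    # "test" is reported once with count 2
```

Without sorting, the two occurrences of "test" are separated by "welcome" and are never merged. This is why the local pipeline places sort between the two scripts.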
1. Run mapper.py locally
[root@liuyazhuang121 source]# cat input.txt | ./mapper.py
hello 1
hello 1
liuyazhuang 1
lyz 1
liuyazhuang 1
lyz 1
where 1
is 1
your 1
home 1
home 1
see 1
you 1
by 1
test 1
welcome 1
test 1
adc 1
abc 1
labs 1
me 1
python 1
hadoop 1
ab 1
bc 1
bec 1
python 1
hadoop 1
bar 1
ccc 1
bar 1
ccc 1
bbb 1
aaa 1
bbb 1
iii 1
ooo 1
xxx 1
yyy 1
xxyy 1
xxx 1
iii 1
ooo 1
yyy 1
This prints the Map result.
2. Run reducer.py locally
[root@liuyazhuang121 source]# cat input.txt | ./mapper.py | sort -k1,1 | ./reducer.py
aaa 1
ab 1
abc 1
adc 1
bar 2
bbb 2
bc 1
bec 1
by 1
ccc 2
hadoop 2
hello 2
home 2
iii 2
is 1
labs 1
liuyazhuang 2
lyz 2
me 1
ooo 2
python 2
see 1
test 2
welcome 1
where 1
xxx 2
xxyy 1
you 1
your 1
yyy 2
This prints the Reduce result.
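As an independent cross-check of the local result, the same counts can be computed in a few lines with collections.Counter. This sketch embeds the contents of input.txt as a string (the name INPUT_TEXT is only for illustration) instead of reading the file, so it runs anywhere.

```python
from collections import Counter

# Contents of input.txt, as listed at the start of this post
INPUT_TEXT = (
    "hello hello liuyazhuang lyz liuyazhuang lyz where is your home home "
    "see you by test welcome test adc abc labs me python hadoop ab bc bec "
    "python hadoop bar ccc bar ccc bbb aaa bbb iii ooo xxx yyy xxyy xxx "
    "iii ooo yyy"
)

counts = Counter(INPUT_TEXT.split())

total_words = sum(counts.values())   # one mapper output line per word
distinct_words = len(counts)         # one reducer output line per word

for word in sorted(counts):
    print("%s\t%s" % (word, counts[word]))
```

The totals should agree with the transcripts above: 44 words in the mapper output and 30 distinct words in the reducer output.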
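VIII. Run the Code on the Hadoop Platform
1. Create a directory and upload the file
First create a directory on HDFS to hold the text file. In this example it is /user/root/word. Run the following command: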
hdfs dfs -mkdir /user/root/word
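Upload the text file to HDFS. In this example the file is /usr/local/python/source/input.txt. If you have several files, upload them the same way: Hadoop takes a directory as its input by default, and every file in that directory is included in the job.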
[root@liuyazhuang121 source]# hadoop fs -put /usr/local/python/source/input.txt /user/root/word/
[root@liuyazhuang121 source]# hadoop fs -ls /user/root/word/
Found 1 items
-rw-r--r-- 1 root supergroup 215 2018-01-14 09:59 /user/root/word/input.txt
2. Run the MapReduce program
Here we specify /output/word as the output directory and run the following command:
[root@liuyazhuang121 source]# hadoop jar /usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word
You can watch the map and reduce progress percentages. The log output is as follows:
18/01/14 10:54:19 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /usr/local/hadoop-2.5.2/tmp/hadoop-unjar3958497380381943575/] [] /tmp/streamjob1400075475828443108.jar tmpDir=null
18/01/14 10:54:22 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 10:54:22 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 10:54:24 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/14 10:54:25 INFO mapreduce.JobSubmitter: number of splits:2
18/01/14 10:54:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515893542122_0001
18/01/14 10:54:26 INFO impl.YarnClientImpl: Submitted application application_1515893542122_0001
18/01/14 10:54:26 INFO mapreduce.Job: The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0001/
18/01/14 10:54:26 INFO mapreduce.Job: Running job: job_1515893542122_0001
18/01/14 10:54:43 INFO mapreduce.Job: Job job_1515893542122_0001 running in uber mode : false
18/01/14 10:54:43 INFO mapreduce.Job: map 0% reduce 0%
18/01/14 10:55:16 INFO mapreduce.Job: map 33% reduce 0%
18/01/14 10:55:17 INFO mapreduce.Job: map 100% reduce 0%
18/01/14 10:55:31 INFO mapreduce.Job: map 100% reduce 100%
18/01/14 10:55:32 INFO mapreduce.Job: Job job_1515893542122_0001 completed successfully
18/01/14 10:55:32 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=398
FILE: Number of bytes written=302280
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=529
HDFS: Number of bytes written=202
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=62800
Total time spent by all reduces in occupied slots (ms)=11416
Total time spent by all map tasks (ms)=62800
Total time spent by all reduce tasks (ms)=11416
Total vcore-seconds taken by all map tasks=62800
Total vcore-seconds taken by all reduce tasks=11416
Total megabyte-seconds taken by all map tasks=64307200
Total megabyte-seconds taken by all reduce tasks=11689984
Map-Reduce Framework
Map input records=1
Map output records=44
Map output bytes=304
Map output materialized bytes=404
Input split bytes=206
Combine input records=0
Combine output records=0
Reduce input groups=30
Reduce shuffle bytes=404
Reduce input records=44
Reduce output records=30
Spilled Records=88
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=159
CPU time spent (ms)=3040
Physical memory (bytes) snapshot=571060224
Virtual memory (bytes) snapshot=2657177600
Total committed heap usage (bytes)=378011648
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=202
18/01/14 10:55:32 INFO streaming.StreamJob: Output directory: /output/word
Next, run the following command to list the output:
[root@liuyazhuang121 source]# hadoop fs -ls /output/word
Found 2 items
-rw-r--r-- 1 root supergroup 0 2018-01-14 10:55 /output/word/_SUCCESS
-rw-r--r-- 1 root supergroup 202 2018-01-14 10:55 /output/word/part-00000
[root@liuyazhuang121 source]#
The file part-00000 holds the result of the job. View its contents as follows:
[root@liuyazhuang121 source]# hadoop fs -cat /output/word/part-00000
aaa 1
ab 1
abc 1
adc 1
bar 2
bbb 2
bc 1
bec 1
by 1
ccc 2
hadoop 2
hello 2
home 2
iii 2
is 1
labs 1
liuyazhuang 2
lyz 2
me 1
ooo 2
python 2
see 1
test 2
welcome 1
where 1
xxx 2
xxyy 1
you 1
your 1
yyy 2
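As you can see, the result matches what we got in the local test.
To shorten the command for running Hadoop MapReduce jobs, we can add the path of Hadoop's hadoop-streaming-*.jar to the system environment in /etc/profile. Add the following configuration to /etc/profile: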
HADOOP_STREAM=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar
export HADOOP_STREAM
The Hadoop environment variables (including HADOOP_HOME) were already configured earlier.
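The HADOOP_STREAM value contains a wildcard, so it only works as intended if exactly one file matches. When a job is launched from a Python script instead of the shell, the same lookup can be done with the glob module. This is a sketch under that assumption. The function name find_streaming_jar is hypothetical, and the directory layout is the one used in this post.

```python
import glob
import os


def find_streaming_jar(hadoop_home):
    """Return the single hadoop-streaming jar under hadoop_home.

    Raises RuntimeError if no jar or more than one jar matches, because
    an ambiguous wildcard would make the "hadoop jar" command unreliable.
    """
    pattern = os.path.join(
        hadoop_home, "share", "hadoop", "tools", "lib",
        "hadoop-streaming-*.jar",
    )
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise RuntimeError(
            "expected exactly one streaming jar, found %d for %s"
            % (len(matches), pattern)
        )
    return matches[0]
```

A typical call would be find_streaming_jar(os.environ["HADOOP_HOME"]), which on the installation in this post should resolve to the hadoop-streaming-2.5.2.jar path used in the first command.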
Now we can run the MapReduce program with the following command:
[root@liuyazhuang121 source]# hadoop jar $HADOOP_STREAM -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word1
Again we can watch the Map and Reduce progress percentages. The log output is as follows:
18/01/14 11:04:46 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapper.py, ./reducer.py, /usr/local/hadoop-2.5.2/tmp/hadoop-unjar2463144927504143769/] [] /tmp/streamjob3106204875058057023.jar tmpDir=null
18/01/14 11:04:47 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 11:04:48 INFO client.RMProxy: Connecting to ResourceManager at liuyazhuang121/192.168.209.121:8032
18/01/14 11:04:48 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/14 11:04:48 INFO mapreduce.JobSubmitter: number of splits:2
18/01/14 11:04:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1515893542122_0002
18/01/14 11:04:49 INFO impl.YarnClientImpl: Submitted application application_1515893542122_0002
18/01/14 11:04:49 INFO mapreduce.Job: The url to track the job: http://liuyazhuang121:8088/proxy/application_1515893542122_0002/
18/01/14 11:04:49 INFO mapreduce.Job: Running job: job_1515893542122_0002
18/01/14 11:04:55 INFO mapreduce.Job: Job job_1515893542122_0002 running in uber mode : false
18/01/14 11:04:55 INFO mapreduce.Job: map 0% reduce 0%
18/01/14 11:05:05 INFO mapreduce.Job: map 100% reduce 0%
18/01/14 11:05:19 INFO mapreduce.Job: map 100% reduce 100%
18/01/14 11:05:19 INFO mapreduce.Job: Job job_1515893542122_0002 completed successfully
18/01/14 11:05:20 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=398
FILE: Number of bytes written=302283
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=529
HDFS: Number of bytes written=202
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=15700
Total time spent by all reduces in occupied slots (ms)=10749
Total time spent by all map tasks (ms)=15700
Total time spent by all reduce tasks (ms)=10749
Total vcore-seconds taken by all map tasks=15700
Total vcore-seconds taken by all reduce tasks=10749
Total megabyte-seconds taken by all map tasks=16076800
Total megabyte-seconds taken by all reduce tasks=11006976
Map-Reduce Framework
Map input records=1
Map output records=44
Map output bytes=304
Map output materialized bytes=404
Input split bytes=206
Combine input records=0
Combine output records=0
Reduce input groups=30
Reduce shuffle bytes=404
Reduce input records=44
Reduce output records=30
Spilled Records=88
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=167
CPU time spent (ms)=3260
Physical memory (bytes) snapshot=598515712
Virtual memory (bytes) snapshot=2668818432
Total committed heap usage (bytes)=429916160
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=323
File Output Format Counters
Bytes Written=202
18/01/14 11:05:20 INFO streaming.StreamJob: Output directory: /output/word1
Viewing the result at this point shows the same output as before.
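Both job logs above begin with a warning that the -file option is deprecated in favor of the generic option -files. The following sketch only builds the equivalent argument list and does not run anything. It assumes that -files takes a comma-separated list and, being a generic option, has to come before the streaming options. The function name build_streaming_command is hypothetical, and this variant was not run as part of this post.

```python
import os


def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper="./mapper.py", reducer="./reducer.py"):
    """Build a "hadoop jar" argument list that ships scripts with -files."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic options such as -files go before the streaming options
        "-files", "%s,%s" % (mapper, reducer),
        # The shipped files appear in the task's working directory
        "-mapper", "./" + os.path.basename(mapper),
        "-reducer", "./" + os.path.basename(reducer),
        "-input", input_path,
        "-output", output_path,
    ]


command = build_streaming_command(
    "/usr/local/hadoop-2.5.2/share/hadoop/tools/lib/hadoop-streaming-2.5.2.jar",
    "/user/root/word",
    "/output/word2",  # hypothetical new output directory
)
print(" ".join(command))
```

The printed line can be pasted into the shell, or the list can be passed to subprocess.call. The output directory must not exist yet, which is why the second run in this post wrote to /output/word1 and not /output/word.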