1. First, set up a working Hadoop cluster.
2. Write the Python map program, mapper.py:
#!/usr/bin/env python
#
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
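Before submitting a job, the mapper's emit logic can be sanity-checked locally. A minimal sketch (map_line is an illustrative helper, not part of mapper.py itself):

```python
def map_line(line):
    # mirror mapper.py: strip, split on whitespace,
    # emit one tab-delimited "word<TAB>1" pair per word
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]

for pair in map_line("Hello World Hello"):
    print(pair)
```

Each input word produces its own `word\t1` line; the counting happens later, in the reducer.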
Write the reduce program, reducer.py:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
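The reducer's IF-switch works only because the shuffle phase hands it keys in sorted order. The whole map-sort-reduce flow can be sketched in-process to see why (wordcount is an illustrative name, not part of the job):

```python
def wordcount(lines):
    # map phase: emit (word, 1) pairs, as mapper.py does
    pairs = []
    for line in lines:
        for word in line.strip().split():
            pairs.append((word, 1))
    # shuffle phase: Hadoop sorts map output by key before the reducer
    pairs.sort(key=lambda kv: kv[0])
    # reduce phase: the same sequential grouping logic as reducer.py
    results = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results

print(wordcount(["Hello World", "Hello Hadoop"]))
```

Without the sort step, occurrences of the same word would arrive interleaved and the running count would be flushed too early.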
3. Copy the two data files whose words are to be counted into HDFS:
hadoop dfs -mkdir input2
hadoop dfs -copyFromLocal data01 data02 input2
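The exact contents of data01 and data02 are not shown, but contents consistent with the counts in step 6 would look like this (a hypothetical reconstruction; note the curly quotes become part of the tokens, which is why words like “Hello appear in the output):

```python
# Hypothetical sample files, reconstructed to match the
# part-00000 counts shown in step 6.
data01 = "\u201cHello World\u201d Bye World\n"
data02 = "\u201cHello Hadoop\u201d Goodbye Hadoop\n"

with open("data01", "w") as f:
    f.write(data01)
with open("data02", "w") as f:
    f.write(data02)
```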
4. Run the MapReduce job. Write a script, runwork.sh:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file /home/hadoop-user/dopython/wordcount/mapper.py \
    -mapper /home/hadoop-user/dopython/wordcount/mapper.py \
    -file /home/hadoop-user/dopython/wordcount/reducer.py \
    -reducer /home/hadoop-user/dopython/wordcount/reducer.py \
    -input input2/* \
    -output pywordcount

Note the -file options: at first I left them out and the job was killed; adding them fixed it, since they ship the scripts to every task node. (Also, don't put a comment after a trailing backslash — it breaks the line continuation.)
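Hadoop Streaming's contract is just stdin/stdout: the mapper and reducer are ordinary executables, and Hadoop sorts the mapper's output by key before feeding the reducer. That contract can be simulated locally with subprocesses (a sketch; the inline MAPPER/REDUCER strings copy the two scripts above):

```python
import subprocess
import sys

MAPPER = r'''
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
'''

REDUCER = r'''
import sys
current_word, current_count, word = None, 0, None
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count, current_word = count, word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
'''

def run_pipeline(text):
    # map -> sort -> reduce, each stage talking over stdin/stdout,
    # the same way Hadoop Streaming wires the stages together
    mapped = subprocess.run([sys.executable, "-c", MAPPER],
                            input=text, capture_output=True, text=True).stdout
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run([sys.executable, "-c", REDUCER],
                          input=shuffled, capture_output=True, text=True).stdout

print(run_pipeline("Hello World\nHello Hadoop\n"))
```

Because both stages only touch stdin/stdout, the same scripts run unchanged under a plain shell pipeline or under Hadoop Streaming.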
5. Run the script:
./runwork.sh
Watch the output:
packageJobJar: [/home/hadoop-user/dopython/wordcount/mapper.py, /home/hadoop-user/dopython/wordcount/reducer.py, /home/hadoop-user/tmp/hadoop-unjar6462492368598143954/] [] /tmp/streamjob6590748737962071984.jar tmpDir=null
12/07/14 21:03:38 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/14 21:03:38 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop-user/tmp/mapred/local]
12/07/14 21:03:38 INFO streaming.StreamJob: Running job: job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: To kill this job, run:
12/07/14 21:03:38 INFO streaming.StreamJob: /home/hadoop-user/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=192.168.201.128:9001 -kill job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201207142005_0008
12/07/14 21:03:39 INFO streaming.StreamJob: map 0% reduce 0%
12/07/14 21:03:49 INFO streaming.StreamJob: map 50% reduce 0%
12/07/14 21:03:51 INFO streaming.StreamJob: map 100% reduce 0%
12/07/14 21:04:06 INFO streaming.StreamJob: map 100% reduce 100%
12/07/14 21:04:11 INFO streaming.StreamJob: Job complete: job_201207142005_0008
12/07/14 21:04:11 INFO streaming.StreamJob: Output: pywordcount
6. View the results:
hadoop dfs -cat pywordcount/part-00000
(Different input files will, of course, produce different counts.)
Bye 1
Goodbye 1
Hadoop 1
Hadoop” 1
World 1
World” 1
“Hello 2
From the ITPUB blog: http://blog.itpub.net/27202748/viewspace-738646/ (please credit the source when reposting).