http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
This article is well written, but some of the Python libraries it uses were not installed on my server, so I modified the code slightly. Python is better suited to rapid development than Java, so learning how to write a MapReduce program in Python is well worth the effort.
First, write a Python program that implements the map step:
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
This program is very simple: it reads data line by line from standard input, splits each line into words, and prints each word to standard output. Next comes the Python program implementing the reduce step. The original article sorts the keys with itemgetter, but the operator module of the Python installed on my test server does not provide itemgetter, so I simply removed the sorting. The modified code is as follows:
#!/usr/bin/env python
#from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
        #print '%s\t%s' % (word2count[word], count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
#sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in word2count.items():
    print '%s\t%s' % (word, count)
This program is equally straightforward: it builds a dictionary whose keys are the words and whose values are the number of times each word appears. Once counting is done, a loop prints out the entries of the dictionary. The article also tests the two programs individually, as follows:
1. Testing mapper.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
2. Testing reducer.py
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" | /home/henshao/python_hadoop/mapper.py | sort | /home/henshao/python_hadoop/reducer.py
labs 1
quux 2
foo 3
bar 1
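Incidentally, the sorting that was commented out does not actually require itemgetter: sorted() accepts any key function, and when sorting (word, count) pairs by word no key is needed at all, because tuples compare element by element. A minimal sketch (plain Python, not part of the original scripts; the dictionary literal just mirrors the counts above):

```python
# word counts as produced by the reducer's dictionary
word2count = {'labs': 1, 'quux': 2, 'foo': 3, 'bar': 1}

# no itemgetter needed: plain sorted() orders the
# (word, count) tuples by word first
sorted_word2count = sorted(word2count.items())

for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
# prints bar, foo, labs, quux in that order
```

This gives the lexicographically sorted output the original article aimed for, without depending on operator.itemgetter.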
Generate the test data and upload the file to HDFS.
[henshao@test208011 python_hadoop]$ echo "foo foo quux labs foo bar quux" > element.txt
[henshao@test208011 python_hadoop]$ cat element.txt
foo foo quux labs foo bar quux
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -put element.txt /home/python_test/
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python_test/element.txt
foo foo quux labs foo bar quux
The test command is as follows (the "-file" options must be included, otherwise the job fails):
~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
The run prints the following:
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop jar ~/hadoop/contrib/streaming/hadoop-0.19.2-streaming.jar -file /home/henshao/python_hadoop/mapper.py -mapper mapper.py -file /home/henshao/python_hadoop/reducer.py -reducer reducer.py -input /home/python_test/element.txt -output /home/python
packageJobJar: [/home/henshao/python_hadoop/mapper.py, /home/henshao/python_hadoop/reducer.py, /home/henshao/hadoop-datastore/hadoop-henshao/hadoop-unjar5362045099634515320/] [] /tmp/streamjob7670340198799210833.jar tmpDir=null
10/01/21 19:00:51 INFO mapred.FileInputFormat: Total input paths to process : 1
10/01/21 19:00:51 INFO streaming.StreamJob: getLocalDirs(): [/home/henshao/hadoop-datastore/hadoop-henshao/mapred/local]
10/01/21 19:00:51 INFO streaming.StreamJob: Running job: job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: To kill this job, run:
10/01/21 19:00:51 INFO streaming.StreamJob: /home/henshao/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=127.0.0.1:9001 -kill job_201001211801_0013
10/01/21 19:00:51 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201001211801_0013
10/01/21 19:00:52 INFO streaming.StreamJob: map 0% reduce 0%
10/01/21 19:00:56 INFO streaming.StreamJob: map 50% reduce 0%
10/01/21 19:00:57 INFO streaming.StreamJob: map 100% reduce 0%
10/01/21 19:01:03 INFO streaming.StreamJob: map 100% reduce 100%
10/01/21 19:01:04 INFO streaming.StreamJob: Job complete: job_201001211801_0013
10/01/21 19:01:04 INFO streaming.StreamJob: Output: /home/python
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -ls /home/python/part-00000
Found 1 items
-rw-r--r-- 1 henshao supergroup 26 2010-01-21 19:01 /home/python/part-00000
[henshao@test208011 python_hadoop]$ ~/hadoop/bin/hadoop fs -cat /home/python/part-00000
labs 1
quux 2
foo 3
bar 1
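Before submitting to Hadoop, the whole mapper-and-reducer pipeline can also be simulated in plain Python; the sketch below (function names are illustrative, not from the article) reproduces the counts shown above on the same sample input:

```python
def mapper(text):
    # emit a (word, 1) pair for every word, like mapper.py does
    for line in text.splitlines():
        for word in line.strip().split():
            yield (word, 1)

def reducer(pairs):
    # aggregate the counts in a dictionary, like reducer.py does
    word2count = {}
    for word, count in pairs:
        word2count[word] = word2count.get(word, 0) + count
    return word2count

print(reducer(mapper("foo foo quux labs foo bar quux")))
# foo -> 3, quux -> 2, labs -> 1, bar -> 1
```

Note that because the reducer aggregates into a dictionary rather than comparing adjacent keys, it works even without the sorted input that Hadoop's shuffle phase normally guarantees.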
Saved here for future study.
Reposted from http://blog.163.com/ecy_fu/blog/static/4445126201002191329467/