1. First, set up a working Hadoop cluster.
2. Write the Python map program, mapper.py:
#!/usr/bin/env python
#
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
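Before submitting a job, the mapper's emit logic can be sanity-checked locally. A minimal sketch (map_line is an illustrative helper, not part of mapper.py itself):

```python
def map_line(line):
    # mirror mapper.py: strip, split on whitespace,
    # emit one tab-delimited "word<TAB>1" pair per word
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]

for pair in map_line("Hello World Hello"):
    print(pair)
```

Each input word produces its own `word\t1` line; the counting happens later, in the reducer.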
Write the reduce program, reducer.py:
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
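The reducer's IF-switch works only because the shuffle phase hands it keys in sorted order. The whole map-sort-reduce flow can be sketched in-process to see why (wordcount is an illustrative name, not part of the job):

```python
def wordcount(lines):
    # map phase: emit (word, 1) pairs, as mapper.py does
    pairs = []
    for line in lines:
        for word in line.strip().split():
            pairs.append((word, 1))
    # shuffle phase: Hadoop sorts map output by key before the reducer
    pairs.sort(key=lambda kv: kv[0])
    # reduce phase: the same sequential grouping logic as reducer.py
    results = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if current_word == word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results

print(wordcount(["Hello World", "Hello Hadoop"]))
```

Without the sort step, occurrences of the same word would arrive interleaved and the running count would be flushed too early.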
3. Copy the two data files whose words are to be counted into HDFS:
hadoop dfs -mkdir input2
hadoop dfs -copyFromLocal data01 data02 input2
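The exact contents of data01 and data02 are not shown, but contents consistent with the counts in step 6 would look like this (a hypothetical reconstruction; note the curly quotes become part of the tokens, which is why words like “Hello appear in the output):

```python
# Hypothetical sample files, reconstructed to match the
# part-00000 counts shown in step 6.
data01 = "\u201cHello World\u201d Bye World\n"
data02 = "\u201cHello Hadoop\u201d Goodbye Hadoop\n"

with open("data01", "w") as f:
    f.write(data01)
with open("data02", "w") as f:
    f.write(data02)
```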
4. Run the MapReduce job. Write a script, runwork.sh:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.2-streaming.jar \
    -file /home/hadoop-user/dopython/wordcount/mapper.py \
    -mapper /home/hadoop-user/dopython/wordcount/mapper.py \
    -file /home/hadoop-user/dopython/wordcount/reducer.py \
    -reducer /home/hadoop-user/dopython/wordcount/reducer.py \
    -input input2/* \
    -output pywordcount

Note the -file options: at first I left them out and the job was killed; adding them fixed it, since they ship the scripts to every task node. (Also, don't put a comment after a trailing backslash — it breaks the line continuation.)
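Hadoop Streaming's contract is just stdin/stdout: the mapper and reducer are ordinary executables, and Hadoop sorts the mapper's output by key before feeding the reducer. That contract can be simulated locally with subprocesses (a sketch; the inline MAPPER/REDUCER strings copy the two scripts above):

```python
import subprocess
import sys

MAPPER = r'''
import sys
for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
'''

REDUCER = r'''
import sys
current_word, current_count, word = None, 0, None
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        current_count, current_word = count, word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
'''

def run_pipeline(text):
    # map -> sort -> reduce, each stage talking over stdin/stdout,
    # the same way Hadoop Streaming wires the stages together
    mapped = subprocess.run([sys.executable, "-c", MAPPER],
                            input=text, capture_output=True, text=True).stdout
    shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
    return subprocess.run([sys.executable, "-c", REDUCER],
                          input=shuffled, capture_output=True, text=True).stdout

print(run_pipeline("Hello World\nHello Hadoop\n"))
```

Because both stages only touch stdin/stdout, the same scripts run unchanged under a plain shell pipeline or under Hadoop Streaming.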
5. Run the script:
./runwork.sh
Watch the output:
packageJobJar: [/home/hadoop-user/dopython/wordcount/mapper.py, /home/hadoop-user/dopython/wordcount/reducer.py, /home/hadoop-user/tmp/hadoop-unjar6462492368598143954/] [] /tmp/streamjob6590748737962071984.jar tmpDir=null
12/07/14 21:03:38 INFO mapred.FileInputFormat: Total input paths to process : 2
12/07/14 21:03:38 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop-user/tmp/mapred/local]
12/07/14 21:03:38 INFO streaming.StreamJob: Running job: job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: To kill this job, run:
12/07/14 21:03:38 INFO streaming.StreamJob: /home/hadoop-user/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=192.168.201.128:9001 -kill job_201207142005_0008
12/07/14 21:03:38 INFO streaming.StreamJob: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201207142005_0008
12/07/14 21:03:39 INFO streaming.StreamJob: map 0% reduce 0%
12/07/14 21:03:49 INFO streaming.StreamJob: map 50% reduce 0%
12/07/14 21:03:51 INFO streaming.StreamJob: map 100% reduce 0%
12/07/14 21:04:06 INFO streaming.StreamJob: map 100% reduce 100%
12/07/14 21:04:11 INFO streaming.StreamJob: Job complete: job_201207142005_0008
12/07/14 21:04:11 INFO streaming.StreamJob: Output: pywordcount
6. View the results:
hadoop dfs -cat pywordcount/part-00000
(Different input files will, of course, produce different counts.)
Bye 1
Goodbye 1
Hadoop 1
Hadoop” 1
World 1
World” 1
“Hello 2
From the ITPUB blog: http://blog.itpub.net/27202748/viewspace-738646/ (please credit the source when reposting).