Writing MapReduce in Python: a most basic MapReduce program in Python

Latest recommended article published on 2024-05-27 16:07:15

weixin_39641697

Tags: writing MapReduce in Python

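A MapReduce program consists of roughly three parts: the mapper file, the reducer file, and the hadoop command that runs the job.

The problem that confused me the longest was the path to hadoop-streaming, i.e. the streaming jar, that the hadoop command needs.

The path looks roughly like this: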

cd ~

cd /usr/local/hadoop-2.7.3/share/hadoop/tools/lib

# In this directory you can find hadoop-streaming-2.7.3.jar
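If the jar is not where you expect, a small search can help. The following is only a sketch: the helper name `find_streaming_jars` is hypothetical, and it assumes the `share/hadoop/tools/lib` layout shown above, which other Hadoop distributions may not follow.

```python
import glob
import os


def find_streaming_jars(hadoop_home):
    """Return sorted paths of hadoop-streaming jars under hadoop_home.

    Assumes the share/hadoop/tools/lib layout described in this post.
    """
    pattern = os.path.join(hadoop_home, "share", "hadoop", "tools", "lib",
                           "hadoop-streaming-*.jar")
    return sorted(glob.glob(pattern))


# /usr/local/hadoop-2.7.3 is the install directory used in this post;
# substitute your own. Nothing is printed if no jar is found.
for jar in find_streaming_jars("/usr/local/hadoop-2.7.3"):
    print(jar)
```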

I took this path from here

For this most basic MapReduce program I mainly referred to three blog posts:

First, the mapper file

mapper.py

#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # emit one record per word
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

The output of the script above is one line per word, each word paired with the number 1.
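To make the mapper's output concrete, here is a minimal sketch that applies the same logic to one in-memory line instead of STDIN. The function name `map_line` is made up for this illustration; the sample line is the one used in the local test later in this post.

```python
def map_line(line):
    """Yield (word, 1) for each whitespace-separated word, as mapper.py does."""
    for word in line.strip().split():
        yield word, 1


# Prints one tab-separated "word<TAB>1" record per word, in input order.
for word, count in map_line("foo foo quux labs foo bar quux"):
    print('%s\t%s' % (word, count))
```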

The reducer file: reducer.py

#!/usr/bin/env python
import sys

current_word = None
current_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # skip blank or malformed lines that lack a tab separator
    if '\t' not in line:
        continue
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word is not None:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word (skipped when there was no input)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
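As an alternative sketch, the same "sum each run of identical keys" idea can be written with `itertools.groupby` from the standard library. This is not the reducer used in this post: the function name `reduce_sorted` is hypothetical, it works on an in-memory list rather than STDIN, and unlike reducer.py it does not skip lines whose count is not a number.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Sum the counts for each run of identical words.

    Expects tab-separated word/count lines that are already sorted by word.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


sorted_lines = ["bar\t1\n", "foo\t1\n", "foo\t1\n", "foo\t1\n"]
for word, total in reduce_sorted(sorted_lines):
    print('%s\t%s' % (word, total))
```

Like the hand-written loop, `groupby` only merges adjacent equal keys, so it relies on the input being sorted in exactly the same way.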

First, test the two scripts above locally

vim test.txt   # create test.txt containing the single line below

foo foo quux labs foo bar quux

cat test.txt|python mapper.py

cat test.txt|python mapper.py|sort|python reducer.py

## Note that after running the mapper we sort its output, so identical words end up on adjacent lines; this is what lets the reducer code stay fairly simple.
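The effect of the sort step can be seen in a small in-memory simulation. This is a sketch only: `map_words` and `reduce_adjacent` are hypothetical stand-ins for mapper.py and reducer.py, and Python's `sorted` plays the role of the shell `sort`.

```python
def map_words(text):
    """Return a (word, 1) pair for every word, like mapper.py."""
    return [(word, 1) for word in text.split()]


def reduce_adjacent(pairs):
    """Merge counts of adjacent equal words only, like reducer.py."""
    results = []
    for word, count in pairs:
        if results and results[-1][0] == word:
            results[-1] = (word, results[-1][1] + count)
        else:
            results.append((word, count))
    return results


pairs = map_words("foo foo quux labs foo bar quux")
print(reduce_adjacent(pairs))          # unsorted: 'foo' and 'quux' each appear twice
print(reduce_adjacent(sorted(pairs)))  # sorted: exactly one total per word
```

Without the sort, a word that reappears later in the stream starts a new count, which is why the local pipeline puts `sort` between the two scripts.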

Then run the code on the Hadoop cluster

First upload test.txt to the appropriate location in HDFS, using a command like this:

hadoop fs -put ./test.txt /dw_ext/weibo_bigdata_ugrowth/mds/

Then write a run.sh

HADOOP_CMD="/usr/local/hadoop-2.7.3/bin/hadoop"  # path to the hadoop binary
STREAM_JAR_PATH="/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar"  # path to the streaming jar

# The input file must be in HDFS on the Hadoop cluster, so the local file has to be uploaded first.
INPUT_FILE_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/test.txt"  # input path on the cluster (the file uploaded above)

# The output path must be a directory that does not exist yet. I had already run the job once, so I delete the directory first.
OUTPUT_PATH="/dw_ext/weibo_bigdata_ugrowth/mds/output"

"$HADOOP_CMD" fs -rm -r "$OUTPUT_PATH"

"$HADOOP_CMD" jar "$STREAM_JAR_PATH" \
    -input "$INPUT_FILE_PATH" \
    -output "$OUTPUT_PATH" \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -file ./mapper.py \
    -file ./reducer.py

# -mapper: the user's own mapper program; it can be an executable or a script
# -reducer: the user's own reducer program; it can be an executable or a script
# -file: packages a file into the submitted job; it can be any input file the mapper or reducer needs, such as a configuration file or a dictionary.
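To show how the options fit together, the sketch below assembles the same streaming command as a Python argument list. The function `build_streaming_command` is hypothetical and only builds the list; it does not contact a cluster. The paths are the ones used in this post and will differ on other systems.

```python
def build_streaming_command(hadoop_cmd, jar, input_path, output_path,
                            mapper, reducer, files):
    """Return the argument list for a Hadoop Streaming job; nothing is run."""
    cmd = [hadoop_cmd, "jar", jar,
           "-input", input_path,
           "-output", output_path,
           "-mapper", mapper,
           "-reducer", reducer]
    for path in files:
        cmd += ["-file", path]  # each -file ships one local file with the job
    return cmd


# Paths below are the ones used in this post; adjust them to your cluster.
cmd = build_streaming_command(
    "/usr/local/hadoop-2.7.3/bin/hadoop",
    "/usr/local/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar",
    "/dw_ext/weibo_bigdata_ugrowth/mds/test.txt",
    "/dw_ext/weibo_bigdata_ugrowth/mds/output",
    "python mapper.py",
    "python reducer.py",
    ["./mapper.py", "./reducer.py"],
)
print(cmd)
```

To actually submit the job from Python, the list could be passed to `subprocess.check_call(cmd)` on a machine where Hadoop is installed.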
