1. Introduction
MapReduce is a distributed programming model for processing large-scale data. The user specifies a map function that processes a key/value-pair dataset and emits an intermediate set of key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
The "trick" to writing MapReduce jobs in Python is the Hadoop Streaming API, which passes data between the map and reduce steps via STDIN (standard input) and STDOUT (standard output).
All we need to do is read input from Python's sys.stdin and write our output to sys.stdout; Hadoop Streaming takes care of everything else.
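The streaming pattern can be sketched as a minimal filter (an illustrative skeleton, not the word-count mapper itself; in a real job the loop runs over sys.stdin):

```python
def transform(lines):
    """Read records, strip them, and emit tab-separated key/value lines.
    Illustrative only: a real job puts its map or reduce logic here."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield "%s\t%d" % (line, 1)

# In a real streaming job this would iterate over sys.stdin and the
# framework would collect whatever we print to STDOUT.
for record in transform(["hello world\n", "\n", "mapreduce\n"]):
    print(record)
```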
2. The MapReduce program
(1) The user-written Python program has two parts, a mapper and a reducer; the program is then submitted to the cluster to run.
(2) The mapper's input is KV (key/value) pairs (the KV types can be customized).
(3) The mapper's output is also KV pairs (types customizable).
(4) The mapper's business logic goes in the map() method.
(5) The map() method (the maptask process) is called once for each <K,V> pair.
(6) The reducer's input types match the mapper's output types: also KV pairs.
(7) The reducer's business logic goes in the reduce() method.
(8) The reducetask process calls reduce() once for each group of <k,v> pairs that share the same k.
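Point (8) above can be sketched with itertools.groupby: given mapper output already sorted by key (as the shuffle phase delivers it), one "reduce" call runs per key group. This is a simulation of the contract, not Hadoop's actual implementation:

```python
from itertools import groupby
from operator import itemgetter

def reduce_by_key(sorted_pairs):
    """Call one reduce (here: a sum) per group of identical keys,
    mirroring how a reducetask invokes reduce() once per key group."""
    result = {}
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        result[key] = sum(count for _, count in group)
    return result

# Sorted mapper output, as the shuffle phase would deliver it.
pairs = [("bar", 1), ("foo", 1), ("foo", 1), ("foo", 1)]
print(reduce_by_key(pairs))  # {'bar': 1, 'foo': 3}
```

Note that groupby only groups adjacent equal keys, which is exactly why Hadoop sorts the mapper output before handing it to the reducer.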
2.1 map.py
#!/usr/bin/python
# The first line must point at the Python interpreter; otherwise you have to
# prefix the script with the python command when running it.
# -*- coding: utf-8 -*-
# @Time    : 2018/10/25 23:42
# @Author  : Einstein Yang!!
# @Nickname: 穿着开裆裤上大学
# @FileName: map.py
# @Software: PyCharm
# @PythonVersion: python3.5
# @Blog    : https://blog.csdn.net/weixin_41734687
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words while removing any empty strings
    words = filter(lambda word: word, line.split())
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # reduce step, i.e. the input for reduce.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
2.2 reduce.py
#!/usr/bin/python
# -*- coding: utf-8 -*-
# @Time    : 2018/10/25 23:54
# @Author  : Einstein Yang!!
# @Nickname: 穿着开裆裤上大学
# @FileName: reduce.py
# @Software: PyCharm
# @PythonVersion: python3.5
# @Blog    : https://blog.csdn.net/weixin_41734687
from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the tab-delimited input we got from map.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        pass

# sort the words lexicographically;
#
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
Local test
cat wordcount.csv | python map.py | sort -k1,1 | python reduce.py
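The local pipeline above can also be simulated in pure Python to sanity-check the logic (a sketch, not a substitute for running the real scripts):

```python
def run_pipeline(text):
    """Simulate: cat input | python map.py | sort | python reduce.py"""
    # map step: one tab-separated (word, 1) record per word
    mapped = ["%s\t1" % w for line in text.splitlines() for w in line.split()]
    # sort step: group identical words together, as `sort` does
    counts = {}
    for record in sorted(mapped):
        # reduce step: accumulate counts per word
        word, n = record.split("\t", 1)
        counts[word] = counts.get(word, 0) + int(n)
    return counts

print(run_pipeline("hello world\nhello hadoop"))
# {'hadoop': 1, 'hello': 2, 'world': 1}
```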
Run the code on the cluster
# Complete example of submitting to the cluster; below, the same job is launched from a shell script
# /root/apps/hadoop-2.6.4/bin/hadoop jar /root/apps/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar -mapper map.py -reducer reduce.py -input /data/data_coe/data_asset/bigdata/*.csv -output /data/data_coe/data_asset/bigdata/output -file /root/Desktop/map.py -file /root/Desktop/reduce.py
HADOOP_CMD="/root/apps/hadoop-2.6.4/bin/hadoop"
STREAM_JAR_PATH="/root/apps/hadoop-2.6.4/share/hadoop/tools/lib/hadoop-streaming-2.6.4.jar"
INPUT_FILE_PATH="/data/data_coe/data_asset/bigdata/*.csv"
OUTPUT_PATH="/data/data_coe/data_asset/bigdata/output"
hdfs dfs -rmr $OUTPUT_PATH
$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH \
-output $OUTPUT_PATH \
-mapper "python map.py" \
-reducer "python reduce.py" \
-file /root/Desktop/map.py \
-file /root/Desktop/reduce.py
Script explanation
HADOOP_CMD: path to the hadoop binary
STREAM_JAR_PATH: path to the streaming jar
INPUT_FILE_PATH: input path on the Hadoop cluster
OUTPUT_PATH: output path on the Hadoop cluster. (Note: this directory must not already exist, which is why the script deletes it first. **Note**: on the very first run the directory does not exist yet, so the delete will report an error; you can create the output directory manually first.)
Two notes on the options. First, if the first line of map.py names the interpreter (#!/usr/bin/python), the mapper and reducer can be passed simply as -mapper map.py -reducer reduce.py instead of "python map.py" and "python reduce.py". Second, the options follow a fixed pattern: -input and -output specify the input and output paths, -mapper and -reducer specify the mapper and reducer commands, and -file distributes our user-written mapper and reducer source files to the cluster, since the other nodes do not yet have these executable files.