I. MapReduce:
With the MapReduce model, a counting job is written as two programs: a Mapper and a Reducer.
For example, consider counting how many times each word appears in a text.
Mapper.py
import sys

for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # split the line into words on whitespace (the default separator)
    words = line.split()
    for word in words:
        # emit every word as "word<TAB>1" to serve as the Reducer's input
        print('%s\t%s' % (word, 1))
The Mapper does no counting itself; it just emits each word as it sees it, and the counting happens downstream in the Reducer.
Reduce.py
import sys

current_word = None
current_count = 0
word = None

# read standard input, i.e. the standard output of Mapper.py
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # parse the Mapper output, which uses a tab as the separator
    word, count = line.split('\t', 1)
    # convert count from string to int
    try:
        count = int(count)
    except ValueError:
        # if count is not a number, skip this line
        continue
    # this relies on the Mapper output being sorted, so that identical
    # words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        # a new word has appeared
        # emit the previous word's count to standard output
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        # start counting the new word
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Running the scripts under Unix:
cat xxx.txt | ./Mapper.py                        # the pipe (|) feeds the data into the mapper
cat xxx.txt | ./Mapper.py | sort                 # Unix sort; on the cluster, Hadoop Streaming sorts automatically
cat xxx.txt | ./Mapper.py | sort | ./Reduce.py   # feed the sorted mapper output into the reducer
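With a hypothetical two-line input file (contents invented purely for illustration), the three stages produce roughly the following:
$ cat xxx.txt
hello world
hello hadoop
$ cat xxx.txt | ./Mapper.py
hello	1
world	1
hello	1
hadoop	1
$ cat xxx.txt | ./Mapper.py | sort
hadoop	1
hello	1
hello	1
world	1
$ cat xxx.txt | ./Mapper.py | sort | ./Reduce.py
hadoop	1
hello	2
world	1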
Running under Hadoop:
hadoop fs -ls /data/data1      # inspect the input files
hadoop fs -ls /output/data1    # inspect the output directory; if it already exists, it must be deleted first
hadoop fs -rmr /output/data1   # delete it
clear                          # clear the terminal
Submit the job:
hadoop jar /usr/hadoop/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar -input /data/data1 -output /output/data1 -file Mapper.py -file Reduce.py -mapper 'Mapper.py' -reducer 'Reduce.py'
Note: the results are written to part-* files (e.g. part-00000); each reducer produces one part-* file.
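To inspect the result, cat one of those files (the path matches the output directory used above):
hadoop fs -cat /output/data1/part-00000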
II. Using mrjob to streamline Hadoop MapReduce code
1. An introduction to mrjob
1). mrjob is a Python wrapper around Hadoop Streaming.
2). mrjob documentation: https://pythonhosted.org/mrjob/
3). Installation: pip install mrjob
4). Run modes: inline (-r inline), local (-r local), Hadoop (-r hadoop), Amazon EMR (-r emr); example commands follow below.
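For illustration, the TopNWords.py job defined below can be launched in each mode like this (input paths are placeholders):
python TopNWords.py -r inline input.txt             # single process, easiest to debug
python TopNWords.py -r local input.txt              # simulates Hadoop with local subprocesses
python TopNWords.py -r hadoop hdfs:///data/data1    # submits to a real Hadoop cluster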
2. Using mrjob to extend the word-count example above into a top-N job
TopNWords.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq  # heap-based top-N selection from the standard library

class TopNWords(MRJob):

    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                yield word, 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    def top_n_reducer(self, _, word_cnts):
        # nlargest streams the iterator, which saves memory;
        # by default we keep the top 2 words
        for cnt, word in heapq.nlargest(2, word_cnts):
            yield word, cnt

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

def main():
    TopNWords.run()

if __name__ == '__main__':
    main()
Running the script under Unix:
1). Convert the file to Unix line endings in Vim: set ff=unix
2). Make it executable: chmod +x TopNWords.py
3). Run it: ./TopNWords.py -r hadoop hdfs:///data/data1 -o hdfs:///output/data3
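Run in inline mode against the hypothetical xxx.txt from section I, the job would print roughly this (mrjob JSON-encodes keys and values by default, hence the quoted words):
$ python TopNWords.py -r inline xxx.txt
"hello"	2
"world"	1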
III. Implementing a data JOIN with MapReduce
The mapper tags each record with its source table and keys it by user_id; the reducer then joins the records for each user with itertools.groupby from the Python standard library. A hypothetical data sample follows.
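To make the two record formats concrete, here is a hypothetical tab-separated sample. Table A (3 fields) holds user info as user_id, an unused field, and location; table B (4 fields) holds orders as order_id, user_id, product_id, price:
table A:
u001	alice	beijing
u002	bob	shanghai
table B:
o100	u001	p01	30
o101	u001	p02	20
o102	u002	p03	50
For this sample the final reducer output (user_id, then location:order count:order total) would be:
u001	beijing:2:50
u002	shanghai:1:50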
****mapper_opt.py****
import sys

def main():
    for line in sys.stdin:
        line = line.strip()
        if line == '':
            continue
        fields = line.split('\t')
        if len(fields) == 3:
            # table A: user_id, <unused field>, user location
            source = 'A'
            user_id, _, user_loc = fields
            print('{0}\t{1}:{2}'.format(user_id, source, user_loc))
        elif len(fields) == 4:
            # table B: order_id, user_id, product_id, price
            source = 'B'
            order_id, user_id, product_id, price = fields
            print('{0}\t{1}:{2}:{3}'.format(user_id, source, order_id, price))

if __name__ == '__main__':
    main()
****reducer_opt.py****
from itertools import groupby  # groups consecutive records that share a key
from operator import itemgetter
import sys

def read_line(file):
    for line in file:
        line = line.strip()
        if line == '':
            continue
        fields = line.split('\t')  # format: key<TAB>value
        yield fields

def main():
    data_iter = read_line(sys.stdin)
    # groupby relies on the input being sorted by key, which the
    # shuffle/sort phase guarantees on Hadoop
    for key, kviter in groupby(data_iter, itemgetter(0)):
        user_id = key
        user_loc = None
        order_cnt = 0
        order_sum = 0
        for line in kviter:
            fields = line[1].split(':')
            if len(fields) == 2:
                # record from table A: source, user location
                user_loc = fields[1]
            elif len(fields) == 3:
                # record from table B: source, order_id, price
                order_cnt += 1
                order_sum += int(fields[2])
        print('{0}\t{1}:{2}:{3}'.format(user_id, user_loc, order_cnt, order_sum))

if __name__ == '__main__':
    main()
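As with the word-count example, the join can be smoke-tested locally with a pipe; the sort in the middle plays the role of Hadoop's shuffle so that groupby sees each user_id's records consecutively (file names are placeholders):
cat a.txt b.txt | python mapper_opt.py | sort | python reducer_opt.py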