I. MapReduce:
With the MapReduce model, a counting job is written as two programs: a Mapper and a Reducer.
For example, consider counting how many times each word appears in a text.
Mapper.py
import sys

for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # split the line into words on whitespace (the default separator)
    words = line.split()
    for word in words:
        # emit every word as "word<TAB>1" to serve as the Reducer's input
        print('%s\t%s' % (word, 1))
The Mapper does no counting itself; it just emits each word as it sees it, and the counting happens downstream in the Reducer.
Reduce.py
import sys

current_word = None
current_count = 0
word = None

# read standard input, i.e. the standard output of Mapper.py
for line in sys.stdin:
    # strip leading and trailing whitespace
    line = line.strip()
    # parse the Mapper output, which uses a tab as the separator
    word, count = line.split('\t', 1)
    # convert count from string to int
    try:
        count = int(count)
    except ValueError:
        # if count is not a number, skip this line
        continue
    # this relies on the Mapper output being sorted, so that identical
    # words arrive consecutively
    if current_word == word:
        current_count += count
    else:
        # a new word has appeared
        # emit the previous word's count to standard output
        if current_word:
            print('%s\t%s' % (current_word, current_count))
        # start counting the new word
        current_count = count
        current_word = word

# emit the count for the last word
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
Running the scripts under Unix:
cat xxx.txt | ./Mapper.py                        # the pipe (|) feeds the data into the mapper
cat xxx.txt | ./Mapper.py | sort                 # Unix sort; on the cluster, Hadoop Streaming sorts automatically
cat xxx.txt | ./Mapper.py | sort | ./Reduce.py   # feed the sorted mapper output into the reducer
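With a hypothetical two-line input file (contents invented purely for illustration), the three stages produce roughly the following:
$ cat xxx.txt
hello world
hello hadoop
$ cat xxx.txt | ./Mapper.py
hello	1
world	1
hello	1
hadoop	1
$ cat xxx.txt | ./Mapper.py | sort
hadoop	1
hello	1
hello	1
world	1
$ cat xxx.txt | ./Mapper.py | sort | ./Reduce.py
hadoop	1
hello	2
world	1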
Running under Hadoop:
hadoop fs -ls /data/data1      # inspect the input files
hadoop fs -ls /output/data1    # inspect the output directory; if it already exists, it must be deleted first
hadoop fs -rmr /output/data1   # delete it
clear                          # clear the terminal
Submit the job:
hadoop jar /usr/hadoop/hadoop/hadoop-2.6.0-cdh5.8.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.8.0.jar -input /data/data1 -output /output/data1 -file Mapper.py -file Reduce.py -mapper 'Mapper.py' -reducer 'Reduce.py'
Note: the results are written to part-* files (e.g. part-00000); each reducer produces one part-* file.
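To inspect the result, cat one of those files (the path matches the output directory used above):
hadoop fs -cat /output/data1/part-00000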
II. Using mrjob to streamline Hadoop MapReduce code
1. An introduction to mrjob
1). mrjob is a Python wrapper around Hadoop Streaming.
2). mrjob documentation: https://pythonhosted.org/mrjob/
3). Installation: pip install mrjob
4). Run modes: inline (-r inline), local (-r local), Hadoop (-r hadoop), Amazon EMR (-r emr); example commands follow below.
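For illustration, the TopNWords.py job defined below can be launched in each mode like this (input paths are placeholders):
python TopNWords.py -r inline input.txt             # single process, easiest to debug
python TopNWords.py -r local input.txt              # simulates Hadoop with local subprocesses
python TopNWords.py -r hadoop hdfs:///data/data1    # submits to a real Hadoop cluster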
2. Using mrjob to extend the word-count example above into a top-N job
TopNWords.py
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq  # heap-based top-N selection from the standard library

class TopNWords(MRJob):

    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                yield word, 1

    def combiner(self, word, counts):
        yield word, sum(counts)

    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    def top_n_reducer(self, _, word_cnts):
        # nlargest streams the iterator, which saves memory;
        # by default we keep the top 2 words
        for cnt, word in heapq.nlargest(2, word_cnts):
            yield word, cnt

    def steps(self):
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

def main():
    TopNWords.run()

if __name__ == '__main__':
    main()
Running the script under Unix:
1). Convert the file to Unix line endings in Vim: set ff=unix
2). Make it executable: chmod +x TopNWords.py
3). Run it: ./TopNWords.py -r hadoop hdfs:///data/data1 -o hdfs:///output/data3
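Run in inline mode against the hypothetical xxx.txt from section I, the job would print roughly this (mrjob JSON-encodes keys and values by default, hence the quoted words):
$ python TopNWords.py -r inline xxx.txt
"hello"	2
"world"	1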
III. Implementing a data JOIN with MapReduce
The mapper tags each record with its source table and keys it by user_id; the reducer then joins the records for each user with itertools.groupby from the Python standard library. A hypothetical data sample follows.
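To make the two record formats concrete, here is a hypothetical tab-separated sample. Table A (3 fields) holds user info as user_id, an unused field, and location; table B (4 fields) holds orders as order_id, user_id, product_id, price:
table A:
u001	alice	beijing
u002	bob	shanghai
table B:
o100	u001	p01	30
o101	u001	p02	20
o102	u002	p03	50
For this sample the final reducer output (user_id, then location:order count:order total) would be:
u001	beijing:2:50
u002	shanghai:1:50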
****mapper_opt.py****
import sys

def main():
    for line in sys.stdin:
        line = line.strip()
        if line == '':
            continue
        fields = line.split('\t')
        if len(fields) == 3:
            # table A: user_id, <unused field>, user location
            source = 'A'
            user_id, _, user_loc = fields
            print('{0}\t{1}:{2}'.format(user_id, source, user_loc))
        elif len(fields) == 4:
            # table B: order_id, user_id, product_id, price
            source = 'B'
            order_id, user_id, product_id, price = fields
            print('{0}\t{1}:{2}:{3}'.format(user_id, source, order_id, price))

if __name__ == '__main__':
    main()
****reducer_opt.py****
from itertools import groupby  # groups consecutive records that share a key
from operator import itemgetter
import sys

def read_line(file):
    for line in file:
        line = line.strip()
        if line == '':
            continue
        fields = line.split('\t')  # format: key<TAB>value
        yield fields

def main():
    data_iter = read_line(sys.stdin)
    # groupby relies on the input being sorted by key, which the
    # shuffle/sort phase guarantees on Hadoop
    for key, kviter in groupby(data_iter, itemgetter(0)):
        user_id = key
        user_loc = None
        order_cnt = 0
        order_sum = 0
        for line in kviter:
            fields = line[1].split(':')
            if len(fields) == 2:
                # record from table A: source, user location
                user_loc = fields[1]
            elif len(fields) == 3:
                # record from table B: source, order_id, price
                order_cnt += 1
                order_sum += int(fields[2])
        print('{0}\t{1}:{2}:{3}'.format(user_id, user_loc, order_cnt, order_sum))

if __name__ == '__main__':
    main()
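As with the word-count example, the join can be smoke-tested locally with a pipe; the sort in the middle plays the role of Hadoop's shuffle so that groupby sees each user_id's records consecutively (file names are placeholders):
cat a.txt b.txt | python mapper_opt.py | sort | python reducer_opt.py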