Implementing MapReduce in Python on Hadoop
Writing MapReduce Code with mrjob
1.1.1 Installing mrjob
About mrjob
- A Python framework built on top of the MapReduce programming interface (Streaming) of Hadoop and EMR.
- An mrjob program can be tested locally or deployed to run on a Hadoop cluster.
Installing mrjob
pip3 install mrjob
1.1.2 Word Count with mrjob
# word_count.py @midi 2021-01-22
from mrjob.job import MRJob

class WordCount(MRJob):
    # Subclass MRJob and override two methods: mapper and reducer.

    # Each input line arrives as `line`.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Pairs with the same key are routed to the same reducer.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()
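The mapper/reducer logic above can be sanity-checked without Hadoop or even mrjob installed. The following is a minimal plain-Python sketch (the function names are illustrative, not part of mrjob) that simulates the shuffle phase by grouping mapper output by key before reducing:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word on the line, as in WordCount.mapper.
    for word in line.split():
        yield word, 1

def shuffle_and_reduce(lines):
    # Group values by key (the "shuffle"), then sum per key as in WordCount.reducer.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return {word: sum(counts) for word, counts in groups.items()}

print(shuffle_and_reduce(["hello world", "hello mrjob"]))
# → {'hello': 2, 'world': 1, 'mrjob': 1}
```

This mirrors what the framework does for you: the real shuffle happens across machines, but per-key grouping followed by a sum is the same computation.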
1.1.3 Running mrjob
python3 word_count.py input.txt
Ways to run an MRJob
1. The default is the -r inline runner, which runs everything in a single Python process:
python3 word_count.py input.txt > output.txt
python3 word_count.py -r inline input.txt > output.txt
2. The -r local runner, which simulates some Hadoop features in local subprocesses:
python3 word_count.py -r local input.txt > output.txt
3. Running on a Hadoop cluster:
python3 word_count.py -r hadoop hdfs://master/input/ -o hdfs://master/output/
mrjob example
Counting the top n most frequent words
input.txt
Almost every child will complain about their parents sometimes.
It is natural, because when people stay together for a long time, they will start to have argument.
But ignore about the unhappy time, our parents love us all the time.
No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.
# .py @midi 2021-01-22
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq

class TopNWords(MRJob):
    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                # Strip punctuation from both ends of each word.
                word = word.strip(',.?!')
                yield word, 1

    # sum(counts) goes first in the tuple so heapq can sort by count below.
    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    # Use heapq to sort the (count, word) pairs and take the 5 largest.
    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(5, word_cnts):
            yield word, cnt

    # Override steps() to wire up the custom mapper and reducer methods.
    def steps(self):
        # Two MRSteps, executed in order.
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

def main():
    TopNWords.run()

if __name__ == '__main__':
    main()
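The reason reducer_sum emits (count, word) rather than (word, count) is that heapq.nlargest compares tuples element by element, left to right, so the count must come first for the ranking to be by frequency. A quick standalone check with made-up counts:

```python
import heapq

# Hypothetical (count, word) pairs as reducer_sum would emit them.
pairs = [(3, 'the'), (1, 'child'), (2, 'parents'), (5, 'to'), (4, 'will')]

# nlargest compares tuples left-to-right, so ordering is by count first.
top3 = heapq.nlargest(3, pairs)
print(top3)
# → [(5, 'to'), (4, 'will'), (3, 'the')]
```

If two words had the same count, the word itself would break the tie (compared alphabetically, larger last).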
Running locally
Running on Hadoop
Final result
Tips:
Of course, if you don't use the mrjob framework, you can also write mapper.py and reducer.py yourself; the command below runs them on the cluster via Hadoop Streaming:
hadoop jar /home/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.6.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word
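For reference, the logic inside such a mapper.py/reducer.py pair might look like the sketch below (a hypothetical illustration, not the author's scripts). In real Streaming scripts each half lives in its own file and reads lines from sys.stdin; Hadoop sorts the mapper output by key before feeding it to the reducer, which is why equal keys arrive adjacent:

```python
# mapper.py logic: emit "word\t1" for every word on every input line.
def map_lines(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# reducer.py logic: input arrives sorted by key, so equal keys are adjacent
# and a running total per key is enough.
def reduce_lines(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Quick local check simulating the streaming pipeline: map -> sort -> reduce.
mapped = sorted(map_lines(["hello world", "hello streaming"]))
print(list(reduce_lines(mapped)))
# → ['hello\t2', 'streaming\t1', 'world\t1']
```

The sorted() call stands in for the shuffle/sort that Hadoop Streaming performs between the map and reduce phases.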