Implementing MapReduce in Python on Hadoop

Writing MapReduce code with mrjob

1.1.1 Installing mrjob


What is mrjob?

  • A Python programming framework built on top of the MapReduce interface (Hadoop Streaming) of Hadoop and EMR.
  • mrjob programs can be tested and run locally, or deployed to a Hadoop cluster.

Installing mrjob

pip3 install mrjob

1.1.2 Word count with mrjob

# word_count.py @midi 2021-01-22

from mrjob.job import MRJob

# Subclass MRJob and override two methods: mapper and reducer
class WordCount(MRJob):

    # Called once per input line; the raw text arrives in `line`
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Pairs with the same key are routed to the same reducer
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()
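Under the hood, mrjob's inline runner simulates the same map → shuffle → reduce pipeline that Hadoop performs. As a rough, stdlib-only sketch of that pipeline (the function names here are mine, not part of mrjob's API):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.split():
        yield word, 1

def shuffle_and_reduce(pairs):
    """The shuffle sorts pairs by key; the reduce sums each key's counts."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

lines = ["hello world", "hello mrjob"]
pairs = [kv for line in lines for kv in mapper(line)]
print(dict(shuffle_and_reduce(pairs)))  # → {'hello': 2, 'mrjob': 1, 'world': 1}
```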

1.1.3 Running mrjob

python3 word_count.py input.txt

Ways to run an MRJob

1. The default is the -r inline runner

python3 word_count.py input.txt > output.txt
python3 word_count.py -r inline input.txt > output.txt

2. The local -r local runner

python3 word_count.py -r local input.txt > output.txt

3. On a Hadoop cluster

python3 word_count.py -r hadoop hdfs://master/input/ -o hdfs://master/output/

An mrjob example

Find the n most frequent words across the whole input.

input.txt

Almost every child will complain about their parents sometimes.
It is natural, because when people stay together for a long time, they will start to have argument.
But ignore about the unhappy time, our parents love us all the time.
No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.

# .py @midi 2021-01-22
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq

class TopNWords(MRJob):

    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                # Strip leading/trailing punctuation from each word
                word = word.strip(',.?!')
                yield word, 1

    # sum(counts) goes first in the tuple so heapq can sort by count below
    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    # Use heapq to sort by count and keep the 5 largest
    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(5, word_cnts):
            yield word, cnt

    # Override steps() to register the custom mapper and reducer methods
    def steps(self):
        # The two MRSteps run in the order listed
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]


def main():
    TopNWords.run()


if __name__ == '__main__':
    main()
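The two-step job above boils down to counting words and keeping the n largest counts. A plain-Python sketch of the same logic (the helper name `top_n_words` is mine), handy for sanity-checking the job's output locally:

```python
import heapq

def top_n_words(lines, n=5):
    """Count words with the same punctuation handling as the mapper,
    then return the n most frequent as (word, count), largest first."""
    counts = {}
    for line in lines:
        for word in line.strip().split():
            word = word.strip(',.?!')  # same stripping as TopNWords.mapper
            if word:
                counts[word] = counts.get(word, 0) + 1
    # (count, word) tuples mirror what reducer_sum feeds top_n_reducer
    top = heapq.nlargest(n, ((c, w) for w, c in counts.items()))
    return [(w, c) for c, w in top]

print(top_n_words(["to be, to see!", "be to"], n=2))  # → [('to', 3), ('be', 2)]
```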

Run locally (screenshot omitted)

Run on Hadoop (screenshot omitted)

Final result (screenshot omitted)

Tip:
Of course, you can skip the mrjob framework entirely and write mapper.py and reducer.py by hand; the command below runs them on the cluster via Hadoop Streaming.

hadoop jar /home/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.6.jar \
    -file ./mapper.py -mapper ./mapper.py \
    -file ./reducer.py -reducer ./reducer.py \
    -input /user/root/word -output /output/word
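As a sketch of what those hand-written scripts might contain (shown as two functions in one file for brevity; a real mapper.py and reducer.py would each read sys.stdin and print their output lines). The key detail: Hadoop Streaming delivers the reducer's input sorted by key, so equal keys arrive consecutively and can be summed in a single pass:

```python
import sys

def map_lines(lines):
    """mapper.py logic: emit one tab-separated "word\t1" record per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(lines):
    """reducer.py logic: input is key-sorted, so counts for the same word
    arrive consecutively; flush a total whenever the key changes."""
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Simulate the pipeline: sorted() stands in for Hadoop's shuffle phase
mapped = sorted(map_lines(["hello world", "hello"]))
print(list(reduce_lines(mapped)))  # → ['hello\t2', 'world\t1']
```

In the real scripts, each `__main__` block would simply stream `sys.stdin` through its function and print each record.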