Implementing MapReduce in Python on Hadoop
Writing MapReduce Code with mrjob
1.1.1 Installing mrjob
About mrjob
- A Python framework built on top of the MapReduce programming interface (Streaming) of Hadoop and EMR.
- An mrjob program can be tested locally or deployed to run on a Hadoop cluster.
Installing mrjob
pip3 install mrjob
1.1.2 Word Count with mrjob
# word_count.py @midi 2021-01-22
from mrjob.job import MRJob

class WordCount(MRJob):
    # Subclass MRJob and override two methods: mapper and reducer.

    # Each input line arrives as `line`.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Pairs with the same key are routed to the same reducer.
    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()
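The mapper/reducer logic above can be sanity-checked without Hadoop or even mrjob installed. The following is a minimal plain-Python sketch (the function names are illustrative, not part of mrjob) that simulates the shuffle phase by grouping mapper output by key before reducing:

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word on the line, as in WordCount.mapper.
    for word in line.split():
        yield word, 1

def shuffle_and_reduce(lines):
    # Group values by key (the "shuffle"), then sum per key as in WordCount.reducer.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return {word: sum(counts) for word, counts in groups.items()}

print(shuffle_and_reduce(["hello world", "hello mrjob"]))
# → {'hello': 2, 'world': 1, 'mrjob': 1}
```

This mirrors what the framework does for you: the real shuffle happens across machines, but per-key grouping followed by a sum is the same computation.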
1.1.3 Running mrjob
python3 word_count.py input.txt
Ways to run an MRJob
1. The default is the -r inline runner, which runs everything in a single Python process:
python3 word_count.py input.txt > output.txt
python3 word_count.py -r inline input.txt > output.txt
2. The -r local runner, which simulates some Hadoop features in local subprocesses:
python3 word_count.py -r local input.txt > output.txt
3. Running on a Hadoop cluster:
python3 word_count.py -r hadoop hdfs://master/input/ -o hdfs://master/output/
mrjob example
Counting the top n most frequent words
input.txt
Almost every child will complain about their parents sometimes.
It is natural, because when people stay together for a long time, they will start to have argument.
But ignore about the unhappy time, our parents love us all the time.
No matter what happen to us, they will stand by our sides. We should be grateful to them and try to understand them.
# .py @midi 2021-01-22
from mrjob.job import MRJob
from mrjob.step import MRStep
import heapq

class TopNWords(MRJob):
    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                # Strip punctuation from both ends of each word.
                word = word.strip(',.?!')
                yield word, 1

    # sum(counts) goes first in the tuple so heapq can sort by count below.
    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    # Use heapq to sort the (count, word) pairs and take the 5 largest.
    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(5, word_cnts):
            yield word, cnt

    # Override steps() to wire up the custom mapper and reducer methods.
    def steps(self):
        # Two MRSteps, executed in order.
        return [
            MRStep(mapper=self.mapper,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

def main():
    TopNWords.run()

if __name__ == '__main__':
    main()
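The reason reducer_sum emits (count, word) rather than (word, count) is that heapq.nlargest compares tuples element by element, left to right, so the count must come first for the ranking to be by frequency. A quick standalone check with made-up counts:

```python
import heapq

# Hypothetical (count, word) pairs as reducer_sum would emit them.
pairs = [(3, 'the'), (1, 'child'), (2, 'parents'), (5, 'to'), (4, 'will')]

# nlargest compares tuples left-to-right, so ordering is by count first.
top3 = heapq.nlargest(3, pairs)
print(top3)
# → [(5, 'to'), (4, 'will'), (3, 'the')]
```

If two words had the same count, the word itself would break the tie (compared alphabetically, larger last).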
Running locally
Running on Hadoop
Final result
Tips:
Of course, if you don't use the mrjob framework, you can also write mapper.py and reducer.py yourself; the command below runs them on the cluster via Hadoop Streaming:
hadoop jar /home/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.6.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /user/root/word -output /output/word
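For reference, the logic inside such a mapper.py/reducer.py pair might look like the sketch below (a hypothetical illustration, not the author's scripts). In real Streaming scripts each half lives in its own file and reads lines from sys.stdin; Hadoop sorts the mapper output by key before feeding it to the reducer, which is why equal keys arrive adjacent:

```python
# mapper.py logic: emit "word\t1" for every word on every input line.
def map_lines(lines):
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

# reducer.py logic: input arrives sorted by key, so equal keys are adjacent
# and a running total per key is enough.
def reduce_lines(lines):
    current, total = None, 0
    for line in lines:
        word, count = line.rsplit("\t", 1)
        if word == current:
            total += int(count)
        else:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, int(count)
    if current is not None:
        yield f"{current}\t{total}"

# Quick local check simulating the streaming pipeline: map -> sort -> reduce.
mapped = sorted(map_lines(["hello world", "hello streaming"]))
print(list(reduce_lines(mapped)))
# → ['hello\t2', 'streaming\t1', 'world\t1']
```

The sorted() call stands in for the shuffle/sort that Hadoop Streaming performs between the map and reduce phases.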