[MapReduce] The Mapper for a Top N Task

Lesley dude

Category: MapReduce. Tags: hadoop, mapreduce, python

Article link: https://blog.csdn.net/aFeiOnePiece/article/details/47089109

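This is an exercise from Lesson 4 of the Udacity course Intro to Hadoop and MapReduce.

The goal is to find the overall Top N.

First, each Mapper computes its local Top N. Unlike word count, a Top N mapper cannot print a result for every line as it reads it. It has to read all of its input lines, measure each one, sort them, and only then output the Top N.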
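The local Top N step described above can also be sketched with `heapq.nlargest` from the standard library, which avoids sorting the whole input. This is only an illustration under assumptions: the helper name `local_top_n` and its `n` and `field` parameters are hypothetical and not part of the exercise, and the length of the field at index 4 is used as the ranking key, as in the mapper code in this post.

```python
import heapq


def local_top_n(rows, n=10, field=4):
    """Return the n rows with the longest `field`, shortest to longest.

    `local_top_n`, `n` and `field` are illustrative names, not part of
    the original exercise.
    """
    # nlargest scans the rows once and keeps only n candidates at a time.
    top = heapq.nlargest(n, rows, key=lambda row: len(row[field]))
    # nlargest returns the longest first; reverse for ascending output.
    return top[::-1]


# Small demonstration with made-up rows shaped like the test data.
rows = [["", "", "", "", body, ""] for body in ("333", "1", "4444", "22")]
for row in local_top_n(rows, n=2):
    print(row[4])
```

With these four made-up rows and `n=2`, the two longest bodies are kept and printed shortest first.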

Then the Reducer computes the global Top N from the Mappers' outputs.

The Mapper code follows.

#!/usr/bin/python
"""
Your mapper function should print out 10 lines containing longest posts, sorted in
ascending order from shortest to longest.
Please do not use global variables and do not change the "main" function.
"""
import sys
import csv


def mapper():
    reader = csv.reader(sys.stdin, delimiter='\t')
    writer = csv.writer(sys.stdout, delimiter='\t', quotechar='"', quoting=csv.QUOTE_ALL)

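    # Top N needs the whole input, so buffer every row before sorting.
    lines = []
    for line in reader:
        lines.append(line)

    # Rank rows by the length of the field at index 4, longest first.
    lines.sort(key=lambda x: len(x[4]), reverse=True)
    # Keep at most 10 rows and print them from shortest to longest.
    # Slicing avoids an IndexError when the input has fewer than 10 rows.
    for line in reversed(lines[:10]):
        writer.writerow(line)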
    



test_text = """\"\"\t\"\"\t\"\"\t\"\"\t\"333\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"88888888\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"11111111111\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"1000000000\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"22\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"4444\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"666666\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"55555\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"999999999\"\t\"\"
\"\"\t\"\"\t\"\"\t\"\"\t\"7777777\"\t\"\"
"""

# This function allows you to test the mapper with the provided test string
def main():
    try:
        from StringIO import StringIO  # Python 2
    except ImportError:
        from io import StringIO  # Python 3
    sys.stdin = StringIO(test_text)
    mapper()
    sys.stdin = sys.__stdin__

main()
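The post shows only the Mapper. As a rough sketch of the Reducer step it describes, the code below merges the Mappers' local Top N rows into a global Top N. It rests on assumptions not stated in the post: a single reducer receives all mapper output, the rows keep the same tab-delimited format, and the function name `reducer` and its stream parameters are chosen here for illustration.

```python
import csv
import heapq
import io


def reducer(stream_in, stream_out, n=10):
    """Merge local Top N rows into a global Top N (illustrative sketch)."""
    reader = csv.reader(stream_in, delimiter='\t')
    writer = csv.writer(stream_out, delimiter='\t', quotechar='"',
                        quoting=csv.QUOTE_ALL)
    # Skip rows too short to contain the field at index 4.
    rows = (row for row in reader if len(row) > 4)
    # Keep the n rows whose field at index 4 is longest.
    top = heapq.nlargest(n, rows, key=lambda row: len(row[4]))
    # Emit from shortest to longest, matching the mapper's output order.
    for row in reversed(top):
        writer.writerow(row)


# Made-up mapper output: three rows with hypothetical ids id0..id2.
mapper_output = "\n".join(
    "\t".join(["id%d" % i, "", "", "", body, ""])
    for i, body in enumerate(["333", "1", "55555"])
)
out = io.StringIO()
reducer(io.StringIO(mapper_output), out, n=2)
print(out.getvalue())
```

In a real streaming job the function would be called with `sys.stdin` and `sys.stdout` instead of the in-memory streams used in this demonstration.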
