[MapReduce] Top 10 Tags

Lesley dude

Category: MapReduce. Tags: hadoop, mapreduce, python

Article link: https://blog.csdn.net/aFeiOnePiece/article/details/47108957

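This problem comes from the Final Project of Udacity's Intro to Hadoop course.

The task: given a dataset of forum posts, find the 10 most frequently used tags. Tags belong to questions; comments and answers are stored in the same table, so the mapper has to filter them out.

Approach:

The mapper extracts the tags from every record that is a question and emits them, one tag per line.

The reducer counts each tag, stores the counts in a dict, then sorts the dict by count and prints the top 10.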
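Before looking at the streaming scripts, the approach above can be tried out in a single process. The following is only a minimal sketch: the function names (`map_tags`, `reduce_counts`, `top_tags`) and the sample records are made up for illustration, and a plain `sorted()` call stands in for Hadoop's shuffle and sort phase.

```python
from itertools import groupby


def map_tags(records):
    """Map step: yield each tag of every question record."""
    for node_type, tagnames in records:
        if node_type == "question":
            for tag in tagnames.split():
                yield tag


def reduce_counts(sorted_tags):
    """Reduce step: count runs of identical keys in sorted input."""
    for tag, run in groupby(sorted_tags):
        yield tag, sum(1 for _ in run)


def top_tags(records, n=10):
    """Run map, sort and reduce in memory and return the n most used tags."""
    shuffled = sorted(map_tags(records))  # stands in for Hadoop's shuffle/sort
    counts = dict(reduce_counts(shuffled))
    # Highest count first; ties are broken alphabetically for a stable result.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical records, reduced to (node_type, tagnames) pairs.
sample = [
    ("question", "python hadoop"),
    ("answer", "python"),
    ("question", "python"),
    ("comment", ""),
]
result = top_tags(sample, n=2)
print(result)
```

Only the two question records contribute tags here, so the answer's `python` tag is not counted.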

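#!/usr/bin/python
# mapper: emit one tag per line for every question record

import sys
import csv

# The forum data is tab-separated; csv.reader also copes with quoted fields.
reader = csv.reader(sys.stdin, delimiter="\t")

for line in reader:
    # Skip malformed records; a valid record has 19 fields.
    if len(line) != 19:
        continue

    # Answers and comments share the table, so keep questions only.
    # The header row fails this check as well, so it needs no special
    # handling (skipping the first line of every input split could drop
    # a real record).
    node_type = line[5].strip()
    if node_type == "question":
        tags = line[2].strip().split()
        for tag in tags:
            print(tag)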
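The mapper's filter can be checked without Hadoop by feeding it a few hand-built rows. This is a rough sketch, not part of the original project: `question_tags` and `make_row` are hypothetical helpers, the filler value `x` is invented, and only the column positions (19 fields, tags in column 2, node type in column 5) are taken from the mapper itself.

```python
import csv
import io

NUM_FIELDS = 19      # column count the mapper expects
TAGS_COL = 2         # tag names, separated by spaces
NODE_TYPE_COL = 5    # the mapper keeps rows where this is "question"


def question_tags(stream):
    """Yield every tag from the question rows of a tab-separated stream."""
    for row in csv.reader(stream, delimiter="\t"):
        if len(row) != NUM_FIELDS:
            continue  # malformed record
        if row[NODE_TYPE_COL].strip() == "question":
            for tag in row[TAGS_COL].strip().split():
                yield tag


def make_row(node_type, tagnames):
    """Build a hypothetical 19-field row; unused columns hold 'x'."""
    fields = ["x"] * NUM_FIELDS
    fields[TAGS_COL] = tagnames
    fields[NODE_TYPE_COL] = node_type
    return "\t".join(fields)


sample = "\n".join([
    make_row("question", "cs101 homework"),
    make_row("answer", "cs101"),
    "too\tfew\tfields",
]) + "\n"

tags = list(question_tags(io.StringIO(sample)))
print(tags)
```

The answer row and the short row are both dropped, so only the two tags of the question row come out.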

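#!/usr/bin/python
# reducer: count each tag, then print the 10 most used tags
# Note: the result is a global top 10 only if the job runs a single reducer.

import sys

tags = {}        # tag -> number of times the mapper emitted it
oldKey = None
count = 0

for line in sys.stdin:
    data = line.strip().split("\t")
    # The mapper emits exactly one non-empty field per line.
    if len(data) != 1 or not data[0]:
        continue

    thisKey = data[0]

    # Input arrives sorted by key, so a new key means the previous tag is done.
    if oldKey is not None and thisKey != oldKey:
        tags[oldKey] = count
        count = 0

    oldKey = thisKey
    count += 1

# Store the count of the last tag.
if oldKey is not None:
    tags[oldKey] = count

# Sort the tags by count, highest first, and keep the first 10.
top10 = sorted(tags, key=tags.get, reverse=True)[:10]
for tag in top10:
    print("{0}\t{1}".format(tag, tags[tag]))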
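The reducer sorts the whole dict just to keep 10 entries. As a possible alternative (not what the original script does), `heapq.nsmallest` from the standard library can pick the top entries without a full sort. The sketch below also shows how per-reducer results could be merged if the job ran with more than one reducer; this relies on the assumption that every tag is routed to exactly one reducer, so its full count appears in exactly one partial list. `top_n`, `merge_top` and the sample counts are hypothetical.

```python
import heapq


def top_n(pairs, n=10):
    """Return the n (tag, count) pairs with the highest counts.

    Ties are broken alphabetically by tag so the output is deterministic.
    """
    return heapq.nsmallest(n, pairs, key=lambda kv: (-kv[1], kv[0]))


def merge_top(partial_lists, n=10):
    """Merge the top lists of several reducers into one global top n.

    Assumes each tag was counted by exactly one reducer.
    """
    merged = [pair for part in partial_lists for pair in part]
    return top_n(merged, n)


# Hypothetical counts, as the reducer's dict would hold them.
counts = {"python": 5, "hadoop": 3, "java": 3, "sql": 1}
best = top_n(counts.items(), 3)
print(best)

# Hypothetical output of two reducers, each a local top 2.
merged = merge_top([[("python", 5), ("sql", 1)], [("hadoop", 3), ("java", 3)]], 2)
print(merged)
```

For the handful of distinct tags in a course dataset the full sort is perfectly fine; the heap-based version mainly pays off when the number of distinct keys is large.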
