hadoop-python: The WordCount Program, a Detailed Walkthrough of the Python Implementation

Latest recommended article published 2024-08-02 19:29:28

pat_datamine

Category: MapReduce Programming (Python)

Article link: https://blog.csdn.net/pat_datamine/article/details/42491963

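This article walks through a Hadoop WordCount program implemented in Python, covering both the Mapper and the Reducer. The Mapper reads the text, splits it into words, and emits key-value pairs; the Reducer receives those pairs after Hadoop has sorted them by key and adds them up to give the count for each word.

Summary generated automatically by CSDN.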

The WordCount program written in Python, as provided on the Hadoop website, is as follows:

mapper.py is as follows:

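#!/usr/bin/env python

import sys

# Read the input line by line from standard input
for line in sys.stdin:
    # Strip leading and trailing whitespace from the line
    line = line.strip()
    # Split the line into a list of words
    words = line.split()
    # Visit each word in the list
    for word in words:
        # Map output: the word is the key and 1 is the value, tab-separated.
        # The next step, the shuffle, sorts this output by key; both steps
        # belong to the map phase and run on the local node.
        # print() with a single argument works in Python 2 and Python 3.
        print('%s\t%s' % (word, 1))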
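
To make the hand-off between the two scripts concrete, here is a minimal local sketch of what the mapper emits and what the data looks like once it has been sorted by key. The names `map_words` and `simulate_shuffle` are hypothetical helpers introduced only for this illustration, and the call to `sorted()` is merely a stand-in for Hadoop's distributed shuffle and sort, not how Hadoop implements it.

```python
import io


def map_words(lines):
    """Yield (word, 1) pairs, mirroring what mapper.py prints."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def simulate_shuffle(pairs):
    """Assumption: a plain in-memory sort stands in for Hadoop's sort by key."""
    return sorted(pairs, key=lambda pair: pair[0])


# A small in-memory sample replaces sys.stdin for the demonstration.
sample = io.StringIO(u"foo bar foo\nbaz foo\n")
shuffled = simulate_shuffle(map_words(sample))

# Print in the same tab-separated format the reducer reads.
for word, count in shuffled:
    print('%s\t%s' % (word, count))
```

After sorting, all pairs for the same word sit next to each other (here the three `foo` pairs come last), which is the property the reducer relies on when it compares each word with the previous one.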

reducer.py is as follows:

#!/usr/bin/env python  
  
from operator import itemgetter  
import sys  
  
current_word = None  
current_count = 0  
word = None  
  
# input comes from STDIN  
for line in sys.stdin:  
    # remove leading and trailing whitespace  
    line = line.strip()  
  
    # parse the input we got from mapper.py  
    word, count = line.split('\t', 1)  
  
    # convert count (currently a string) to int  
    try:  
        count = int(count)  
    except ValueError:  
        # count was not a number, so silently  
        # ignore/discard this line  
        continue  
  
    # this IF-switch only works because Hadoop sorts map output  
    # by key (here: word) before it is passed to the reducer  
    if current_word == word:  
        current_count += count  
    else:  
        if current_word:  
            # write result to STDOUT  
            print('%s\t%s' % (current_word, current_count))
        current_count