MapReduce入门WordCount记录

最新推荐文章于 2023-03-31 13:08:27 发布

五河今心

最新推荐文章于 2023-03-31 13:08:27 发布

阅读量607

点赞数

分类专栏：来时路文章标签： mapreduce java 大数据

本文链接：https://blog.csdn.net/godwhjx/article/details/126848886

版权

来时路专栏收录该内容

10 篇文章 1 订阅

订阅专栏

提示：文章写完后，目录可以自动生成，如何生成可参考右边的帮助文档

在这里插入图片描述

文章目录

前言
一、WordCount是啥
- - 传统代码
  - MapReduce方式
总结

前言

提示：这里可以添加本文要记录的大概内容：
未来的我，这里知识浅薄理解，仅作参考之用

提示：以下是本篇文章正文内容，下面案例可供参考

一、WordCount是啥

是一种计数代码

传统代码

传统的串行处理方式(Java)：
设有4组原始文本数据：
Text 1: the weather is good Text 2: today is good
Text 3: good weather is good Text 4: today has good weather

String[] text = new String[] 
   { “the weather is good”,“today is good ”,“good weather is good ”,” today has good weather” ｝;
  HashTable ht = new HashTable();    
  for(i=0; i<3; ++i)
  {  StringTokenizer st = new StringTokenizer(text[i]); 
     while (st.hasMoreTokens()) 
     {  String word = st.nextToken();
        if(!ht.containsKey(word)) //没有这个单词就传入哈希表中
        {  ht.put(word, new Integer(1));  }
        else { int wc = ((Integer)ht.get(word)).intValue() +1;// 有就计数加1
               ht.put(word, new Integer(wc));//更新次数
             }
     }
  }
  for (Iterator itr=ht.KeySet().iterator();  itr.hasNext(); )
  { String word = (String)itr.next(); 
    System.out.print(word+ “: ”+ (Integer)ht.get(word)+“;   ”);
  }

其中StringTokenizer是Java中的一种用来分隔字符(去除分隔符)的类，其hasMoreTokens是用来判断传入字符串中是否还有单词的方法；nextToken()是获取下一个单词

MapReduce方式

设有4组原始文本数据：
Text 1: the weather is good Text 2: today is good
Text 3: good weather is good Text 4: today has good weather

MapReduce处理方式

使用4个map节点：
map节点1: 
	输入：(text1, “the weather is good”)
	输出：(the, 1), (weather, 1), (is, 1), (good, 1)
map节点2: 
	输入：(text2, “today is good”)
	输出：(today, 1), (is, 1), (good, 1)
map节点3: 
	输入：(text3, “good weather is good”)
	输出：(good, 1), (weather, 1), (is, 1), (good, 1)
map节点4: 
	输入：(text3, “today has good weather”)
	输出：(today, 1), (has, 1), (good, 1), (weather, 1)