Big Data Management Technology Lab: MapReduce WordCount (with punctuation removed)

Unauthorized_

Categories: Homework, Big Data Management Technology, MapReduce. Tags: big data, mapreduce, hadoop

Article link: https://blog.csdn.net/Unauthorized_/article/details/107436324

Big Data Management Technology Lab: MapReduce

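Requirements:

Run word count on New Concept English, Book 2 (or any given txt document)
Building on that, implement a version of WordCount that strips punctuation

Base code

Hadoop's bundled examples include a WordCount program; the code is as follows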

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    if(args.length!=2){
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Let's take it apart piece by piece.

1. The map part

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
}

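Here itr = new StringTokenizer(value.toString()) converts the Text value to a String and wraps it in a StringTokenizer. The class also has a two-argument constructor, StringTokenizer(String str, String delim), which builds a tokenizer for str with the given delimiter characters. When delim is omitted, as in this mapper, the default delimiters are the whitespace characters: space, tab, newline, carriage return, and form feed.

The method hasMoreTokens() checks whether any tokens (not delimiters) remain to be read.

With the default TextInputFormat, each call to map receives one line of the txt file as value, so these two operations split every line into individual words at whitespace. Each word then becomes a key with a count of 1, and the key-value pair (word, 1) is written to the context.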
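To see the tokenizing step in isolation, here is a minimal sketch that runs outside Hadoop. The class name TokenizerDemo and the sample sentence are made up for illustration; only StringTokenizer itself comes from the mapper above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original job.
class TokenizerDemo {
    // Splits one line the way the mapper does. With no delim argument the
    // tokenizer breaks on space, tab, newline, carriage return and form feed.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {   // true while another token is available
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Hello,, world!, Hello, again.]
        System.out.println(tokens("Hello, world!\tHello again."));
    }
}
```

Note that punctuation stays glued to the words ("Hello," and "Hello" are different keys), which is exactly what the improved version later in this post addresses.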

2. The Reduce part

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

This simply takes all pairs that share the same key (the word) and adds up their counts. For example, two (apple,1) pairs are merged into one (apple,2).
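The summing step can be imitated locally with a map from word to count. This is only a sketch of the idea, not how Hadoop actually shuffles and reduces; ReduceDemo and countWords are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the grouping and summing steps.
class ReduceDemo {
    // Treats every word as a (word, 1) pair and sums the ones per key.
    // A TreeMap keeps the keys sorted, similar to the sorted job output.
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prints {apple=2, pear=1}
        System.out.println(countWords(Arrays.asList("apple", "pear", "apple")));
    }
}
```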

Improved code

Starting from the original WordCount, stripping punctuation only takes a small change to the Map part

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    private Text word2;//new code
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        /*new code*/
        String s=word.toString();
        String regEx = "[`~☆★!@#$%^&*()+=|{}':;,\\[\\]》·.<>/?~！@#￥%……（）——+|{}【】‘；：”“’。\"，\\-、？]";
        String s1=s.replaceAll(regEx,"");
        word2=new Text(s1);
        context.write(word2, one);
      }
    }
  }

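The changes are: a Text field word2 is added outside the map function (for the conversion later); inside the while loop, the Text word is first converted to a String s (so that replaceAll can be used; I couldn't find an equivalent method on Text), regEx lists all the punctuation marks, and s1=s.replaceAll(regEx,"") removes them all (the space delimiter is not in the list, so it is preserved); finally the String s1 is converted back into the Text word2 and written to the context as (word2,one).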
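To check what the regular expression does to individual tokens without submitting a job, the replaceAll step can be tried on its own. A minimal sketch, assuming the same regEx string as in the mapper; PunctuationDemo and strip are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; only the regEx string is taken from the mapper.
class PunctuationDemo {
    static final String REG_EX = "[`~☆★!@#$%^&*()+=|{}':;,\\[\\]》·.<>/?~！@#￥%……（）——+|{}【】‘；：”“’。\"，\\-、？]";

    // Removes every character listed in the character class from one token.
    static String strip(String token) {
        return token.replaceAll(REG_EX, "");
    }

    // Tokenizes a line on whitespace and strips each token.
    static List<String> cleanTokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(strip(itr.nextToken()));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [Its, a, wellknown, fact, isnt, it]
        System.out.println(cleanTokens("It's a well-known fact, isn't it?"));
    }
}
```

Two side effects are visible here: apostrophes and hyphens inside words are removed too, and a token made only of punctuation (for example a lone "-") becomes an empty string, which the mapper as written would still emit as a key. Skipping empty results before context.write would be one way to avoid that.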

Running it (command-line shell)

Everything is done on the command line.

1. Start HDFS

start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver

2. Initialize / clean up (skip if there is no earlier input or output)

hdfs dfs -rm input/*
hdfs dfs -rm -r output/
hdfs dfs -put 新概念英语第二册.txt input/

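Note that for -put the first path is on the VM/local machine and the second one is in HDFS. So if your shell is at, say, alice@Master:~$, mind the relative path, otherwise you get a "No such file or directory" error. I got burned by this more than once.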

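3. Package the jar

Export process:
[Screenshot: jar export process]

Here I created a directory called myapp just for storing the program jars (this step is entirely optional).

4. Run the program

With HDFS running: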

hadoop jar WordCount.jar input output
# or, from another directory: hadoop jar ./xxxxx/WordCount.jar input output

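Note that you should first change directory with cd /usr/local/hadoop/hadoop-2.7.7/myapp (the prompt then reads alice@Master:/usr/local/hadoop/hadoop-2.7.7/myapp$) and run the hadoop command from myapp, because that is where WordCount.jar was saved. Alternatively, pass the correct path to the jar, e.g. /usr/local/hadoop/hadoop-2.7.7/myapp/WordCount.jar.

At this point the job should have run successfully. I hit a few bugs along the way; see the end of the post. First, the successful result.

Enter:

hdfs dfs -ls -h output

You should see:
[Screenshot: listing of the output directory]
output/part-r-00000 holds 25.7k of data (without punctuation removal it was, as far as I remember, 34.9k).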

Fetch it to the local machine:

hdfs dfs -get output/part-r-00000 ./
cat part-r-00000

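Without punctuation removal the result is rather messy:

[Screenshot: output without punctuation removal]

With the punctuation-stripping version of the jar the result is much cleaner:
[Screenshot: output with punctuation removed]

The result is sorted by word in dictionary order. To sort by frequency instead, use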

sort part-r-00000 -n -k2

The result is then sorted by the number in the second column (the frequency), in ascending order.
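The same ordering can be reproduced in Java if the counts are needed inside a program rather than in the shell. A minimal sketch, assuming each line of part-r-00000 has the form word, tab, count (the default separator of the text output format); SortByCountDemo and the sample lines are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper mirroring "sort -n -k2" on word<TAB>count lines.
class SortByCountDemo {
    static List<String> sortByCount(List<String> lines) {
        List<String> sorted = new ArrayList<>(lines);
        // Compare numerically on the second tab-separated field.
        sorted.sort(Comparator.comparingInt(
                (String line) -> Integer.parseInt(line.split("\t")[1].trim())));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("apple\t3", "book\t1", "cat\t2");
        // Prints book, cat, apple (each with its count), one per line
        sortByCount(sample).forEach(System.out::println);
    }
}
```

For descending order in the shell, adding -r to the sort command reverses the comparison.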

5. Some bugs

5.1 HDFS Corrupt block

I ran into the HDFS Corrupt block problem while running the job. Solution: https://blog.csdn.net/lingbo229/article/details/81128316

First check for corrupt blocks:

hdfs fsck -list-corruptfileblocks

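In my case this reported corrupt block=13. I couldn't be bothered with anything elaborate, so I simply deleted them all:

hdfs fsck -delete

Note that -delete removes the corrupted files themselves. Checking again afterwards shows corrupt block=0.

After that the jar ran normally.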

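5.2 The "-" in the regular expression

Adding a bare - to regEx throws an error; it has to be written as \\-. This is most likely because, inside a character class [...], an unescaped - between two characters denotes a range, and the pattern fails to compile when that range is invalid (its start comes after its end).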
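The behaviour is easy to reproduce with a tiny character class. A minimal sketch; HyphenDemo, compiles and the sample patterns are made up for illustration and are not the author's exact regEx.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo of the unescaped "-" problem inside a character class.
class HyphenDemo {
    // Returns true if the regex compiles, false if it is rejected.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "," (0x2C) to "!" (0x21) is read as a range that runs backwards.
        System.out.println(compiles("[,-!]"));    // false
        // Escaped, the "-" is an ordinary character in the class.
        System.out.println(compiles("[,\\-!]"));  // true
        System.out.println("a-b,c!".replaceAll("[,\\-!]", ""));  // abc
    }
}
```

In Java source the escape is written with two backslashes because the string literal consumes one of them; the regex engine then sees \-.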

5.3 retry policy is…

Another rather silly problem: I had not fully started the cluster, having forgotten start-yarn.sh, so the job printed something like "retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime". Running start-yarn.sh fixes it.

5.4 SLF4J: Class path contains multiple SLF4J bindings.

The following messages appeared when running hdfs commands

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.7.7/myapp/WordCount.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

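This happens because more than one jar on the classpath provides an SLF4J binding; in the log above, both Hadoop's slf4j-log4j12-1.7.10.jar and the exported WordCount.jar contain one. Remove the duplicates until only one is left. No idea why Java has to be this exasperating; Python seems to have similar problems.

That seems to be everything. Thanks for reading!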
