【大数据运算系统】MapReduce实验记录

最新推荐文章于 2023-06-08 07:42:04 发布

慕青Voyager

最新推荐文章于 2023-06-08 07:42:04 发布

阅读量84

点赞数

文章标签：大数据 mapreduce hadoop

本文链接：https://blog.csdn.net/weixin_44664746/article/details/131099637

版权

大数据运算系统MapReduce实验记录

2023年4月10日于中国科学院大学

一、容器环境初始化

启动SSH服务：

root@<container ID>:~/hw2/part1# service ssh start
 * Starting OpenBSD Secure Shell server sshd                             [ OK ]

启动HDFS：

root@<container ID>:~/hw2/part1# start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/bdms/setup/hadoop-2.9.2/logs/hadoop-root-namenode-36f965b5aab0.out
localhost: starting datanode, logging to /home/bdms/setup/hadoop-2.9.2/logs/hadoop-root-datanode-36f965b5aab0.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/bdms/setup/hadoop-2.9.2/logs/hadoop-root-secondarynamenode-36f965b5aab0.out

启动Yarn：

root@<container ID>:~/hw2/part1# start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /home/bdms/setup/hadoop-2.9.2/logs/yarn--resourcemanager-36f965b5aab0.out
localhost: starting nodemanager, logging to /home/bdms/setup/hadoop-2.9.2/logs/yarn-root-nodemanager-36f965b5aab0.out

二、对示例程序WordCount的解析

数据准备：

root@<container ID>:~/hw2/part1# hdfs dfs -rm -f -r /hw2/output
root@<container ID>:~/hw2/part1# hadoop fs -mkdir /hw2
root@<container ID>:~/hw2/part1# hadoop fs -put example-input.txt /hw2
root@<container ID>:~/hw2/part1# hadoop fs -ls -R /hw2
-rw-r--r--   1 root supergroup        165 2023-04-09 12:56 /hw2/example-input.txt

试运行代码：

root@<container ID>:~/hw2/part1# rm -f *.class *.jar
root@<container ID>:~/hw2/part1# javac WordCount.java
root@<container ID>:~/hw2/part1# jar cfm WordCount.jar WordCount-manifest.txt WordCount*.class
root@<container ID>:~/hw2/part1# ls
README.txt                       WordCount-manifest.txt  example-input.txt
WordCount$IntSumCombiner.class   WordCount.class         part1-input
WordCount$IntSumReducer.class    WordCount.jar
WordCount$TokenizerMapper.class  WordCount.java
root@<container ID>:~/hw2/part1# hadoop jar ./WordCount.jar /hw2/example-input.txt /hw2/output
root@<container ID>:~/hw2/part1# hdfs dfs -cat '/hw2/output/part-*'
count of 1.2.3.4 =	4
count of 10.3 =	1
count of 18811001100 =	2
count of 3 =	1
count of 30.1 =	1
count of 5.0 =	1
count of 5.2 =	1
count of 52 =	1
count of alpha =	4
count of beta =	2

核心代码分析：

1、文档输入后如何存储？在"TokenizerMapper"中的map方法参数中第一次出现文档表现形式：“value”。

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {...}

2、文档如何实现分词？在"StringTokenizer"对象创建过程中，通过指定"delim"参数的值确定分隔符实现分词。

StringTokenizer itr = new StringTokenizer(value.toString()," ");

3、分词后如何依次写入Map？通过nextToken()方法逐个写入。

while (itr.hasMoreTokens()) {
  word.set(itr.nextToken());
  context.write(word, one);
}

三、编写工程代码

1、Map函数

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
  //获取通话记录
  StringTokenizer itr = new StringTokenizer(value.toString(),"\n");
  //对每条通话记录进行处理
  while (itr.hasMoreTokens()) {
    //筛选通话双方与通话时长
    String[] record = itr.nextToken().split(" ");
    //筛查脏数据:长度为3表示通话双方+通话时长
    if(record.length == 3){
      try {
        //通话时长是否可解析,捕捉不可解析错误跳过该记录
        Float.parseFloat(record[2]);
        token.set(record[0]+" "+record[1]);
        //用来存储结果,理想状态应使用FloatWritable存储,但在Reducer中无法合并,故使用Test存储本阶段Time数据
        time.set(record[2]);
        context.write(token,time);
      } catch (NumberFormatException e) {
        
      }
    }
  }
}

2、Combiner中的Reduce函数

说明：该部分主要是对单个Node数据记录进行处理，要统计该Node下通话次数和通话总时长，从而在Reduce中进行再次合并。

public void reduce(Text token, Iterable<Text> values, Context context) throws IOException, InterruptedException {
  int count = 0;
  float alltime = 0;
  for (Text val : values) {
    count += 1;
    alltime += Float.parseFloat(val.toString());
  }
  result.set(String.valueOf(count)+" "+String.valueOf(alltime));
  //token是通话双方，result包含通话次数和通话总时长
  context.write(token, result);
}

3、Reduce模块

public void reduce(Text token, Iterable<Text> values, Context context) throws IOException, InterruptedException {
  int count = 0;
  float alltime = 0;
  for (Text val : values) {
    //解析Combiner部分通话次数和通话总时长
    String[] split = val.toString().split(" ");
    count += Integer.parseInt(split[0]);
    alltime += Float.parseFloat(split[1]);
  }
  //再次进行封装
  result_value.set(String.valueOf(count) + " " + String.valueOf(alltime/count));
  context.write(token, result_value);
}

4、模块main函数中需要更改输出输出的类型即可，此外，在主函数中要设置键和值存入时的间隔方式改为空格

job.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");

四、进行代码测试

参考"对示例程序WordCount的解析"部分。(测试数据给的过分干净了，可以自己尝试加入脏数据)

五、提交自动化测试机制

root@36f965b5aab0:~/hw2/hw2-check# ./myprepare
Deleted /hw2
Found 2 items
-rw-r--r--   1 root supergroup        259 2023-04-10 02:32 /hw2/part1-input/input_0
-rw-r--r--   1 root supergroup      36957 2023-04-10 02:32 /hw2/part1-input/input_1
root@36f965b5aab0:~/hw2/hw2-check# ./run-test-part1.pl ./score 0_xx_hw2.java 
./part1-result/result_0 has 5 lines
./part1-result/result_1 has 485 lines
------------------------------------------------------------
[Mon Apr 10 02:34:07 UTC 2023] 1 0_xx_hw2.java
------------------------------------------------------------
...
0_xx_hw2.java raw score: 2 / 2