Hadoop (MapReduce Data Cleaning and Compression) Notes 06

快长枝枝、

Published 2023-09-10 20:03:17

Tags: hadoop, mapreduce, notes

Article link: https://blog.csdn.net/2201_75649224/article/details/132793473

3. MapReduce Framework Principles

3.7 Data Cleaning (ETL)

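ETL stands for Extract-Transform-Load. It describes the process of extracting data from a source, transforming it, and loading it into a destination. The term is most common in data warehousing, but it is not limited to data warehouses.
Before running the core business MapReduce program, the data usually has to be cleaned first to remove records that do not meet the requirements. Cleaning usually needs only a Mapper; no Reducer has to run.
1) Requirement
Remove log lines that have 11 or fewer fields.
(1) Input data
web.log
(2) Expected output data
Every line has more than 11 fields.
2) Requirement analysis
The input data must be filtered in the Map stage according to the rule.
3) Implementation
(1) Write the WebLogMapper class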

package com.atguigu.mapreduce.weblog;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 Read one line
        String line = value.toString();
        // 2 Parse the log line
        boolean result = parseLog(line, context);
        // 3 Skip the line if it is invalid
        if (!result) {
            return;
        }
        // 4 Write valid lines out unchanged
        context.write(value, NullWritable.get());
    }
    // Checks whether a log line is valid
    private boolean parseLog(String line, Context context) {
        // 1 Split the line on single spaces
        String[] fields = line.split(" ");
        // 2 A line with more than 11 fields is valid
        return fields.length > 11;
    }
}
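
The filtering rule can also be tried outside Hadoop. The following is a minimal sketch in plain Java, assuming the same space-separated format; the class name FieldCountFilter and the sample lines are made up for illustration and are not part of the original project. It shows the boundary case: a line with exactly 11 fields is dropped, a line with 12 is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original project.
public class FieldCountFilter {
    // Same rule as parseLog: valid only if there are more than 11 space-separated fields.
    public static boolean isValid(String line) {
        return line.split(" ").length > 11;
    }

    // Keeps only the valid lines, preserving input order.
    public static List<String> clean(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isValid(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String eleven = "a b c d e f g h i j k";   // 11 fields, rejected
        String twelve = "a b c d e f g h i j k l"; // 12 fields, kept
        System.out.println(clean(Arrays.asList(eleven, twelve)));
    }
}
```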

(2) Write the WebLogDriver class

package com.atguigu.mapreduce.weblog;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WebLogDriver {
    public static void main(String[] args) throws Exception {
        // Set the input and output paths to match the actual paths on your own machine
        args = new String[] { "D:/input/inputlog", "D:/output1" };
        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2 Locate the jar through the driver class
        job.setJarByClass(WebLogDriver.class);
        // 3 Set the mapper
        job.setMapperClass(WebLogMapper.class);
        // 4 Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // Set the number of ReduceTasks to 0 (map-only job)
        job.setNumReduceTasks(0);
        // 5 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 6 Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

3.8 MapReduce Development Summary

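1) Input data interface: InputFormat
(1) The default implementation is TextInputFormat.
(2) TextInputFormat reads one line of text at a time and returns the line's starting byte offset as the key and the line content as the value.
(3) CombineTextInputFormat can merge multiple small files into one split, which improves processing efficiency.
2) Logic processing interface: Mapper
Users implement three methods as the business requires: map(), setup(), and cleanup().
3) Partitioner
(1) The default implementation is HashPartitioner, which returns a partition number computed from the key's hash value and the number of ReduceTasks: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
(2) If the business has special requirements, a custom partitioner can be defined.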
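
To make the default partitioning rule concrete, here is a small sketch in plain Java that applies the same formula to ordinary objects. It does not use Hadoop's Partitioner class; the class name HashPartitionSketch and the sample keys are assumptions for illustration only. The mask with Integer.MAX_VALUE clears the sign bit, so a key with a negative hash code still maps to a partition in the range 0 to numReduceTasks - 1 (in Java, -7 % 3 alone would give -1).

```java
// Hypothetical stand-alone sketch of the formula; not Hadoop's own class.
public class HashPartitionSketch {
    // Clear the sign bit of the hash code, then take the remainder.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        for (String key : new String[] { "hadoop", "spark", "flink" }) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
        // Integer -7 has hash code -7; the mask keeps the result non-negative.
        System.out.println("-7 -> partition " + partitionFor(Integer.valueOf(-7), numReduceTasks));
    }
}
```

Every occurrence of the same key lands in the same partition, which is what lets one ReduceTask see all values for that key.
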
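4) Comparable sorting
(1) When a custom object is output as the key, it must implement the WritableComparable interface and override its compareTo() method.
(2) Partial sort: each final output file is sorted internally.
(3) Total sort: all data is sorted; this usually means a single ReduceTask.
(4) Secondary sort: the ordering uses two conditions.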
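
A two-condition compareTo() can be sketched as follows. To stay runnable without Hadoop, this example implements plain java.lang.Comparable rather than WritableComparable; a real MapReduce key would also need write() and readFields(). The class OrderKey and its fields id and total are hypothetical. The key sorts by total in descending order and breaks ties by id in ascending order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a secondary-sort comparison; not from the original project.
public class SecondarySortSketch {
    public static class OrderKey implements Comparable<OrderKey> {
        public final String id;
        public final long total;

        public OrderKey(String id, long total) {
            this.id = id;
            this.total = total;
        }

        @Override
        public int compareTo(OrderKey other) {
            // First condition: larger total comes first (descending).
            int byTotal = Long.compare(other.total, this.total);
            if (byTotal != 0) {
                return byTotal;
            }
            // Second condition: ties are ordered by id (ascending).
            return this.id.compareTo(other.id);
        }
    }

    // Returns the ids in sorted key order without modifying the input list.
    public static List<String> sortedIds(List<OrderKey> keys) {
        List<OrderKey> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> ids = new ArrayList<>();
        for (OrderKey key : copy) {
            ids.add(key.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<OrderKey> keys = new ArrayList<>();
        keys.add(new OrderKey("b", 10));
        keys.add(new OrderKey("a", 10));
        keys.add(new OrderKey("c", 30));
        System.out.println(sortedIds(keys));
    }
}
```
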
5) Combiner
A Combiner can improve execution efficiency by reducing I/O transfer, but it must not change the result of the original business logic.
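
Why the result may change depends on the operation. The sketch below, in plain Java with made-up numbers, treats two arrays as the outputs of two map tasks. Summing partial sums gives the same total as summing everything at once, so a sum is safe to use as a Combiner; averaging partial averages does not give the true average, so an average is not. The class name CombinerSafetySketch is an assumption for illustration.

```java
// Hypothetical sketch; the values stand in for the outputs of two map tasks.
public class CombinerSafetySketch {
    public static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    public static double average(double[] values) {
        return sum(values) / values.length;
    }

    public static void main(String[] args) {
        double[] mapA = { 3, 5, 7 };
        double[] mapB = { 2, 6 };
        double[] all = { 3, 5, 7, 2, 6 };

        // Sum: pre-aggregating per map task does not change the final result.
        double combinedSum = sum(new double[] { sum(mapA), sum(mapB) });
        System.out.println("sum with combiner = " + combinedSum + ", without = " + sum(all));

        // Average: the mean of the per-task means (5 and 4) differs from the mean of all values (23 / 5).
        double combinedAvg = average(new double[] { average(mapA), average(mapB) });
        System.out.println("avg with combiner = " + combinedAvg + ", without = " + average(all));
    }
}
```
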
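6) Logic processing interface: Reducer
Users implement three methods as the business requires: reduce(), setup(), and cleanup().
7) Output data interface: OutputFormat
(1) The default implementation is TextOutputFormat, which writes each key-value pair as one line of the target text file.
(2) Users can also define a custom OutputFormat.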

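4. Hadoop Data Compression

4.1 Overview

1) Pros and cons of compression
Advantages: less disk I/O and less disk storage space.
Disadvantage: more CPU overhead.
2) Compression principles
(1) For compute-intensive jobs, use compression sparingly.
(2) For I/O-intensive jobs, use compression more.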
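4.2 Compression Codecs Supported by MR

1) Comparison of compression algorithms
2) Comparison of compression performance

4.3 Choosing a Compression Method

When choosing a compression method, focus on three things: compression/decompression speed, compression ratio (size after compression), and whether the compressed file still supports splitting.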

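4.3.1 Gzip

Advantage: relatively high compression ratio.
Disadvantages: no split support; average compression/decompression speed.

4.3.2 Bzip2

Advantages: high compression ratio; supports splits.
Disadvantage: slow compression/decompression.

4.3.3 LZO

Advantages: fairly fast compression/decompression; supports splits.
Disadvantages: average compression ratio; split support requires building an additional index.

4.3.4 Snappy

Advantage: fast compression and decompression.
Disadvantages: no split support; average compression ratio.

4.3.5 Choosing Where to Compress

Compression can be enabled at any stage of a MapReduce job.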

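4.4 Compression Parameter Configuration

1) To support multiple compression/decompression algorithms, Hadoop introduced codecs (encoders/decoders).
2) To enable compression in Hadoop, the following parameters can be configured.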

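4.5 Compression in Practice

4.5.1 Compressing Map Output

Even if the input and output files of a MapReduce job are uncompressed, the intermediate output of the Map tasks can still be compressed. Because that output is written to disk and transferred over the network to the Reduce nodes, compressing it can improve performance considerably. Only two properties need to be set, as the code below shows.
1) The Hadoop source code provided supports the following compression formats: BZip2Codec and DefaultCodec.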

package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        // Enable compression of the map output
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Set the codec used for the map output
        conf.setClass("mapreduce.map.output.compress.codec",
                BZip2Codec.class, CompressionCodec.class);
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

2) The Mapper remains unchanged

package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused output key and value objects
    Text k = new Text();
    IntWritable v = new IntWritable(1);
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 Read one line
        String line = value.toString();
        // 2 Split the line into words
        String[] words = line.split(" ");
        // 3 Write out each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

3) The Reducer remains unchanged

package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused output value object
    IntWritable v = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        // 1 Sum the counts for this key
        for (IntWritable value : values) {
            sum += value.get();
        }
        v.set(sum);
        // 2 Write out the word and its total
        context.write(key, v);
    }
}

4.5.2 Compressing Reduce Output

This is based on the WordCount example.
1) Modify the driver

package com.atguigu.mapreduce.compress;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.Lz4Codec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCountDriver {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Enable compression of the reduce (final) output
        FileOutputFormat.setCompressOutput(job, true);
        // Set the codec; swap in one of the commented alternatives to compare
        FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
        // FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
        // FileOutputFormat.setOutputCompressorClass(job, DefaultCodec.class);
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

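2) The Mapper and Reducer remain unchanged

5. Common Errors and Solutions

1) Imports are easy to get wrong, especially Text and CombineTextInputFormat.
2) The first input type parameter of the Mapper must be LongWritable or NullWritable; it cannot be IntWritable. Otherwise a type conversion exception (ClassCastException) is reported.
3) java.lang.Exception: java.io.IOException: Illegal partition for 13926435656 (4) means that the number of partitions and the number of ReduceTasks do not match; adjust the number of ReduceTasks.
4) If the number of partitions is not 1 but there is only one ReduceTask, is partitioning performed? The answer is no.
In the MapTask source code, partitioning is performed only after checking that the number of ReduceTasks is greater than 1. If it is not greater than 1, partitioning is skipped.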
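5) A jar compiled on Windows is copied to Linux and run:

hadoop jar wc.jar com.atguigu.mapreduce.wordcount.WordCountDriver /user/atguigu/ /user/atguigu/output

The following error is reported:

Exception in thread "main" java.lang.UnsupportedClassVersionError: com/atguigu/mapreduce/wordcount/WordCountDriver : Unsupported major.minor version 52.0

Cause: the two environments use different JDK versions. Version 52.0 is the class-file version produced by JDK 1.8, so the jar was compiled with JDK 1.8 and is being run on an older JVM such as JDK 1.7.
Solution: use the same JDK version in both environments.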
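
The version number in the error message comes from the header of the .class file: 4 bytes of magic number (0xCAFEBABE), then a 2-byte minor version and a 2-byte major version. The following plain-Java sketch decodes such a header from a byte array; the class name ClassVersionSketch and the sample header are made up for illustration. For the versions relevant here, the Java release is the major version minus 44, so 51 is Java 7 and 52 is Java 8.

```java
// Hypothetical helper for reading the version fields of a class-file header.
public class ClassVersionSketch {
    // Returns the major version stored in bytes 6-7 (big-endian) of a class-file header.
    public static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        int magic = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    // Maps a major version to the Java release number (51 -> 7, 52 -> 8).
    public static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        // Sample header: magic, minor version 0, major version 52.
        byte[] header = { (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52 };
        int major = majorVersion(header);
        System.out.println("major " + major + " = Java " + javaRelease(major));
    }
}
```
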
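6) In the example that caches the small file pd.txt, the job reports that pd.txt cannot be found.
Cause: in most cases the path is written incorrectly. Also check whether the file is actually named pd.txt.txt. On some machines a relative path cannot locate pd.txt; change it to an absolute path.
7) A type conversion exception is reported.
This is usually caused by setting the Map output types or the final output types incorrectly in the driver.
If the Map output key cannot be sorted, a type conversion exception is also reported.
8) When wc.jar runs on the cluster, the input file cannot be obtained.
Cause: the input file for the WordCount example cannot be placed in the root directory of the HDFS cluster.
9) The following exceptions appear: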

Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:356)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:371)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:364)

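Solution: copy hadoop.dll into the Windows directory C:\Windows\System32. On some machines the Hadoop source code also has to be modified.
Option 2: create a package in your project with the same name as the package of NativeIO shown in the stack trace (org.apache.hadoop.io.nativeio), and copy NativeIO.java into it.
10) When writing a custom OutputFormat, the close method of the RecordWriter must close the stream resources; otherwise the output files will be empty.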

@Override
public void close(TaskAttemptContext context) throws IOException,
        InterruptedException {
    if (atguigufos != null) {
        atguigufos.close();
    }
    if (otherfos != null) {
        otherfos.close();
    }
}
