ETL is short for Extract-Transform-Load: the process of extracting data from a source, transforming it, and loading it into a target. The term is most often used in the context of data warehouses, but it is not limited to them. Before running the core business MapReduce program, the data usually has to be cleaned first to strip out records that do not meet the requirements. Cleaning of this kind typically only needs a Mapper program; no Reducer is required.
(1) Requirement
Remove log lines whose field count is less than or equal to 11, where fields are separated by single spaces. (The threshold can be changed; the code pattern applies either way.)
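For illustration (hypothetical lines, not from the original dataset), an Apache-style access-log line such as
194.237.142.21 - - [18/Sep/2013:06:49:18 +0000] "GET /images/my.jpg HTTP/1.1" 304 0 "-" "Mozilla/4.0"
splits into 12 space-separated fields and is kept, while any truncated line with 11 or fewer fields is dropped.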
(2) Analysis
The input data needs to be filtered and cleaned in the Map phase according to the rule above.
(3) Implementation
The job is implemented as a Maven project.
Configuration file pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.hadoop</groupId>
    <artifactId>MapReduceDemo</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>1.7.30</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>
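Because the assembly plugin is bound to the package phase, running mvn clean package produces both the plain jar and target/MapReduceDemo-1.0-SNAPSHOT-jar-with-dependencies.jar, which bundles the dependencies for submitting the job to a cluster.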
(1) Mapper class
package com.hadoop.mapreduce.etl;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * @author codestart
 * @create 2023-06-22 17:41
 */
public class WebLogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {
        // Get one line
        String line = value.toString();
        // Clean the log line and mark whether it passes the rule
        boolean result = parseLog(line, context);
        if (!result) { // drop lines that fail the rule
            return;
        }
        // Write out the line that passed
        context.write(value, NullWritable.get());
    }

    // Returns true when the line has more than 11 space-separated fields,
    // so lines with 11 or fewer fields are filtered out, as required.
    private boolean parseLog(String line, Mapper<LongWritable, Text, Text, NullWritable>.Context context) {
        // Split on single spaces
        String[] split = line.split(" ");
        // Check the field count against the threshold
        return split.length > 11;
    }
}
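The context parameter of parseLog is unused above; it is typically threaded through so the method can also record custom counters while cleaning. A minimal sketch of that variant (the counter group and names are illustrative assumptions, not part of the original code):

private boolean parseLog(String line, Mapper<LongWritable, Text, Text, NullWritable>.Context context) {
    String[] split = line.split(" ");
    if (split.length > 11) {
        // hypothetical counter names; the totals appear in the job's console output
        context.getCounter("ETL", "pass").increment(1);
        return true;
    } else {
        context.getCounter("ETL", "fail").increment(1);
        return false;
    }
}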
(2) Driver class
package com.hadoop.mapreduce.etl;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author codestart
 * @create 2023-06-22 17:54
 */
public class WebLogDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Hard-coded local paths for testing; comment this out when passing paths on the command line
        args = new String[]{"D:\\data\\input\\inputlog", "D:\\data\\output\\output2"};
        // 1. Get the Job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar
        job.setJarByClass(WebLogDriver.class);
        // 3. Set the Mapper class
        job.setMapperClass(WebLogMapper.class);
        // 4. Set the Mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        // 5. Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        // No Reducer is needed for map-only cleaning
        job.setNumReduceTasks(0);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for completion
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
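To run the job on a cluster instead of locally, comment out the hard-coded args line, rebuild, and submit the assembled jar; the input and output paths below are placeholders:
hadoop jar MapReduceDemo-1.0-SNAPSHOT-jar-with-dependencies.jar com.hadoop.mapreduce.etl.WebLogDriver /input/inputlog /output/output2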
(4) Summary
Cleaning the dataset with a Mapper alone, with the business logic written in the Map phase, is sufficient for ordinary ETL.
The above is my own summary and practice from studying online. I post it partly so that I do not forget what I have learned, and partly to share the results of my learning as a blog. If any of this duplicates other work, please contact me!