Offline Data Cleaning with MapReduce

Gru杨

Article link: https://blog.csdn.net/weixin_43517453/article/details/88857949

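Although MapReduce is rarely used nowadays, its development workflow is still worth understanding. This post uses MapReduce for an ETL data-cleaning task. Because the job cleans logs, each log line is handled by one call to the map method, and no aggregation is needed afterwards, so only the Map phase is required and no Reduce task is needed.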

1. Parsing the log file

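The first step is to parse the log file: select the useful fields and transform those that need it (for example, converting a timestamp whose format does not meet the requirements). The parsing step returns a String, so the selected fields are joined back into a single tab-separated log line with a StringBuilder:

StringBuilder builder = new StringBuilder("");
builder.append(xxx).append("\t")........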
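As a rough illustration of this step, the sketch below parses one tab-separated line, keeps three fields, and rewrites the timestamp. The field positions, the input time pattern `dd/MM/yyyy:HH:mm:ss`, and the class name `LogParseSketch` are assumptions made for the example, not the layout of the real log.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class LogParseSketch {
    // Hypothetical input and output time formats
    private static final DateTimeFormatter IN = DateTimeFormatter.ofPattern("dd/MM/yyyy:HH:mm:ss");
    private static final DateTimeFormatter OUT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Returns the cleaned line, or an empty string when the line cannot be parsed
    public static String parse(String line) {
        String[] fields = line.split("\t");
        if (fields.length < 3) {
            return "";
        }
        try {
            // Assumed layout: field 0 = cdn, field 1 = time, field 2 = ip
            String time = LocalDateTime.parse(fields[1], IN).format(OUT);
            StringBuilder builder = new StringBuilder();
            builder.append(fields[0]).append("\t")
                   .append(time).append("\t")
                   .append(fields[2]);
            return builder.toString();
        } catch (DateTimeParseException e) {
            return ""; // malformed timestamp: the caller can drop the line
        }
    }
}
```

Returning an empty string for bad input lets the caller filter the line out with a blank check instead of handling an exception for every dirty record.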

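2. Developing the MapReduce job

The data is cleaned in the map phase of the MapReduce framework: each incoming record triggers one call to the map method, and once the record has been cleaned according to our parsing rules it is written out: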

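import java.io.IOException;

import org.apache.commons.lang3.StringUtils; // assumed; with commons-lang 2.x use org.apache.commons.lang.StringUtils
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Extend Mapper; the four type parameters are <input key, input value, output key, output value>
public class LogETLMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    // LogUtils is the parsing class from step 1; one instance is reused for every record
    private final LogUtils utils = new LogUtils();

    // key is the byte offset of the line, value is one log line, context collects the output
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Keep only lines that have the expected number of tab-separated fields
        int length = line.split("\t").length;
        if (length == 72) {
            String result = utils.parse(line);
            if (StringUtils.isNotBlank(result)) {
                context.write(NullWritable.get(), new Text(result));
            }
        }
    }
}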
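One detail of the field-count check is worth a small sketch: `String.split("\t")` drops trailing empty strings, so a line whose last columns are empty is counted as having fewer fields. The plain-Java example below (class and method names are made up for illustration) contrasts the default split with `split("\t", -1)`, which keeps trailing empty fields. The post does not say whether the real logs can end in empty columns, so treat this as something to check rather than a required change.

```java
public class FieldCountSketch {
    // Counts fields the way the mapper does: trailing empty fields are dropped
    public static int countDefault(String line) {
        return line.split("\t").length;
    }

    // Counts every field, including trailing empty ones
    public static int countAll(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String line = "a\tb\t\t"; // four fields, the last two are empty
        System.out.println(countDefault(line)); // prints 2
        System.out.println(countAll(line));     // prints 4
    }
}
```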

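The context parameter of the map method carries the output records as well as other runtime state. In a typical job, the key/value pairs written to context in map are passed on to the Reducer, and whatever the Reducer writes to its own context is handed back to Hadoop to be written to HDFS. When a job runs with zero reduce tasks, the map output is written to HDFS directly.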

Developing the driver

The driver is the entry point for all of the code.

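import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogETLDriver {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("please input 2 params: input output");
            System.exit(1); // non-zero exit code signals the usage error
        }

        String input = args[0];
        String output = args[1];  // e.g. output/d=20180717

        //System.setProperty("hadoop.home.dir", "D:/cdh/hadoop-2.6.0-cdh5.7.0");

        // Job.getInstance() without arguments creates its own Configuration,
        // so build one explicitly and reuse it for the FileSystem below
        Configuration configuration = new Configuration();

        // Delete the output directory if it already exists so the job can be rerun
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(output);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(LogETLDriver.class);
        job.setMapperClass(LogETLMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);
        // Map-only job: with zero reduce tasks the map output is written straight to the output path
        job.setNumReduceTasks(0);

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, outputPath);

        // Submit the job, wait for it to finish, and report the result through the exit code
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}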

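Packaging and running on the server

In IDEA, open View > Tool Windows > Maven Projects.
Then, under Lifecycle, run package to build the jar.

Upload the jar to the Linux server with rz -be.

Then run hadoop jar followed by the jar path, the fully qualified name of the main class, and the two arguments (input path, output path).

Typing this by hand every time is tedious, so the command is usually wrapped in a shell script.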
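The same automation can also be sketched in Java, the language used for the rest of this post. The example below assembles the `hadoop jar` command for a given day and runs it. The jar location, the package of the main class, and both base directories are hypothetical placeholders; only the `d=20180717` style of the output directory comes from the driver above.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

public class EtlCommandSketch {
    // Hypothetical locations; replace with the real jar, main class, and HDFS directories
    static final String JAR = "/home/hadoop/lib/log-etl.jar";
    static final String MAIN_CLASS = "com.example.LogETLDriver";
    static final String INPUT_BASE = "/g6/hadoop/access/raw";
    static final String OUTPUT_BASE = "/g6/hadoop/access/output";

    // Builds: hadoop jar <jar> <main class> <input>/<day> <output>/d=<day>
    public static List<String> buildCommand(LocalDate day) {
        String d = day.format(DateTimeFormatter.BASIC_ISO_DATE); // e.g. 20180717
        return Arrays.asList("hadoop", "jar", JAR, MAIN_CLASS,
                INPUT_BASE + "/" + d, OUTPUT_BASE + "/d=" + d);
    }

    public static void main(String[] args) throws Exception {
        // args[0] is an ISO date such as 2018-07-17
        List<String> command = buildCommand(LocalDate.parse(args[0]));
        // inheritIO() forwards the job's console output to this process
        int exitCode = new ProcessBuilder(command).inheritIO().start().waitFor();
        System.exit(exitCode);
    }
}
```

Passing the day as the only argument keeps the input and output paths consistent with each other, which is the main thing a wrapper script buys over typing the command by hand.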

Statistical analysis in Hive

create external table g6_access (
cdn string,
region string,
level string,
time string,
ip string,
domain string,
url string,
traffic bigint
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/g6/hadoop/access/clear'