Hadoop Offline Project: Data Cleaning

5xh

Article link: https://blog.csdn.net/qq_37283909/article/details/88909008

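This article outlines an enterprise project development process and then walks through the data-cleaning stage of a Hadoop offline project on a big data application platform: generating fake data with Java, parsing and cleaning the logs with MapReduce, and loading the cleaned results into a Hive partitioned table.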

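Enterprise project development process

1. Project research
Business-driven
Involves product managers, people who know the business very well, and project managers
2. Requirements analysis: what to build and what it should look like
Explicit requirements: stated by the user
Implicit requirements: the user is unclear about them, so the team has to propose them
3. Solution design
High-level design
Detailed design (down to how every feature is implemented: technologies, tables, modules, fields, how many classes, the methods in each class with their names, parameters, return types, ...)
System design (scalability, horizontal scaling, fault tolerance, customizability, monitoring and alerting, ...)
4. Feature development
Development
Testing: unit tests, CI/CD
5. Testing
Functional
Integration
Performance
User trial
6. Deployment and go-live
Trial run (one to two months, in parallel): diff the results, check stability
Official launch, with a gray release
7. Follow-up
Phases 2, 3, 4: operations support, feature development, bug fixes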

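Enterprise big data application platform:

Data analysis: in-house + commercial
Search / crawlers
Machine learning / deep learning
Artificial intelligence
Offline
Real-time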

Hadoop offline project: data cleaning

baidu	CN	A	E	[17/Jul/2018:17:07:50 +0800]	2	223.104.18.110	-	112.29.213.35:80	0	v2.go2yd.com	GET	http://v1.go2yd.com/user_upload/1531633977627104fdecdc68fe7a2c4b96b2226fd3f4c.mp4_bd.mp4	HTTP/1.1	-	bytes 13869056-13885439/25136186	TCP_HIT/206	112.29.213.35	video/mp4	17168	16384	-:0	0	0	-	-	-	11451601	-	"JSP3/2.0.14"	"-"	"-"	"-"	http	-	2	v1.go2yd.com	0.002	25136186	16384	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	1531818470104-11451601-112.29.213.66#2705261172	644514568

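This is fake data generated with Java. Each record has 72 fields separated by '\t'.
The logs are parsed and cleaned with MapReduce, and the cleaned data is then loaded into a Hive partitioned table.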
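
The article does not show the generator itself. As a rough illustration of how such fake data could be produced, here is a minimal sketch. The class name FakeLogGenerator, the method generateLine, and the choice to fill unused fields with "-" are assumptions made for this example; only the field count and the positions of the fields used later by the cleaning step (0, 1, 3, 4, 6, 10, 12, 20) are taken from the sample record and the parsing code.

```java
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.Locale;
import java.util.Random;

// Hypothetical generator, not the author's code: builds one tab-separated
// record with 72 fields, filling in only the positions the cleaning step reads.
class FakeLogGenerator {
    static final int FIELD_COUNT = 72;

    static String generateLine(Random random, Date when) {
        String[] fields = new String[FIELD_COUNT];
        Arrays.fill(fields, "-");                       // placeholder for unused fields

        // Time in the same shape as the sample: [17/Jul/2018:17:07:50 +0800]
        SimpleDateFormat fmt =
                new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH);

        fields[0] = "baidu";                            // cdn
        fields[1] = "CN";                               // region
        fields[3] = "E";                                // level
        fields[4] = "[" + fmt.format(when) + "]";       // time
        fields[6] = "223.104.18." + random.nextInt(256);// ip
        fields[10] = "v2.go2yd.com";                    // domain
        fields[12] = "http://v1.go2yd.com/user_upload/" + random.nextInt(100000) + ".mp4"; // url
        fields[20] = String.valueOf(random.nextInt(100000)); // traffic in bytes

        return String.join("\t", fields);
    }

    public static void main(String[] args) {
        Random random = new Random();
        for (int i = 0; i < 5; i++) {
            System.out.println(generateLine(random, new Date()));
        }
    }
}
```

To exercise the cleaning logic properly, a real generator would also emit some malformed records (wrong field count, broken timestamps), which this sketch leaves out.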

Step 1: Write the cleaning utility class LogUtils

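import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LogUtils {
    // SimpleDateFormat is not thread-safe, so each LogUtils instance keeps its own formatters.
    private final DateFormat sourceFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final DateFormat targetFormat = new SimpleDateFormat("yyyyMMddHHmmss");

    /**
     * Parses one log line (fields separated by \t) and keeps only the fields we need.
     * Returns an empty string if the line cannot be parsed.
     */
    public String parse(String log) {
        String result = "";
        try {
            String[] splits = log.split("\t");
            String cdn = splits[0];
            String region = splits[1];
            String level = splits[3];
            String timeStr = splits[4];
            // "[17/Jul/2018:17:07:50 +0800]" -> "17/Jul/2018:17:07:50": drop "[" and " +0800]"
            String time = timeStr.substring(1, timeStr.length() - 7);
            time = targetFormat.format(sourceFormat.parse(time));
            String ip = splits[6];
            String domain = splits[10];
            String url = splits[12];
            String traffic = splits[20];

            StringBuilder builder = new StringBuilder();
            builder.append(cdn).append("\t")
                    .append(region).append("\t")
                    .append(level).append("\t")
                    .append(time).append("\t")
                    .append(ip).append("\t")
                    .append(domain).append("\t")
                    .append(url).append("\t")
                    .append(traffic);

            result = builder.toString();
        } catch (ParseException | RuntimeException e) {
            // A short line or a malformed time field must not kill the job; skip the record instead.
            e.printStackTrace();
        }
        return result;
    }
}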
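
The time handling is the only non-trivial transformation in the parse method, so it may help to see it in isolation. The sketch below repeats the same substring-and-reformat logic on the time field of the sample record. The class name TimeFieldDemo and the method normalize are made up for this example, and unlike LogUtils it passes Locale.ENGLISH to the target format as well, so that the digits do not depend on the default locale.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical demo class: isolates the time conversion used in LogUtils.parse.
class TimeFieldDemo {

    static String normalize(String timeField) {
        DateFormat source = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
        DateFormat target = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ENGLISH);
        // Drop the leading "[" (1 char) and the trailing " +0800]" (7 chars).
        String trimmed = timeField.substring(1, timeField.length() - 7);
        try {
            return target.format(source.parse(trimmed));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable time field: " + timeField, e);
        }
    }

    public static void main(String[] args) {
        // Time field taken from the sample record above.
        System.out.println(normalize("[17/Jul/2018:17:07:50 +0800]")); // prints 20180717170750
    }
}
```

Note that the zone offset is simply cut off rather than interpreted, so the output is the wall-clock time as written in the log, which is what the original code does too.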

Step 2: Write the mapper

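import java.io.IOException;

// Assumption: commons-lang 2.x as bundled with Hadoop 2.x; use org.apache.commons.lang3 if your build provides that instead.
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogETLMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // One parser per mapper instance instead of a new one for every record.
    private final LogUtils utils = new LogUtils();

    /**
     * Cleans the data in the map phase of the MapReduce framework:
     * each incoming record is parsed with our rules and then written out.
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Drop records that do not have exactly 72 fields.
        if (line.split("\t").length == 72) {
            String result = utils.parse(line);
            if (StringUtils.isNotBlank(result)) {
                context.write(NullWritable.get(), new Text(result));
            }
        }
    }
}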
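
One detail of the field-count check is worth knowing: Java's String.split with a single argument drops trailing empty strings. In the sample record the unused fields hold "-", so the count comes out as expected, but a record whose last fields were truly empty would be counted as shorter and discarded. The sketch below shows the difference using plain Java only; the class name FieldCountDemo and its methods are invented for this example.

```java
// Hypothetical demo class: compares the two ways of counting tab-separated fields.
class FieldCountDemo {

    // Same call as in the mapper: trailing empty fields are not counted.
    static int countDefault(String line) {
        return line.split("\t").length;
    }

    // A negative limit keeps trailing empty fields.
    static int countAll(String line) {
        return line.split("\t", -1).length;
    }

    public static void main(String[] args) {
        String line = "a\tb\t\t";               // four fields, the last two empty
        System.out.println(countDefault(line)); // prints 2
        System.out.println(countAll(line));     // prints 4
    }
}
```

Whether the stricter or the more lenient count is right depends on what the real logs look like; if empty trailing fields are legitimate, split("\t", -1) is the safer choice.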

Step 3: Write the MapReduce driver class

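import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogETLDriver {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("please input 2 params: input output");
            System.exit(1);  // non-zero exit code signals the usage error to the caller
        }

        String input = args[0];
        String output = args[1];  // e.g. output/day=20180717

        // Only needed when running locally on Windows:
        //System.setProperty("hadoop.home.dir", "D:/cdh/hadoop-2.6.0-cdh5.7.0");

        Configuration configuration = new Configuration();

        // Delete the output directory if it already exists, so the job can be rerun.
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path(output);
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(LogETLDriver.class);
        job.setMapperClass(LogETLMapper.class);
        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, outputPath);

        // Propagate job success or failure through the exit code.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}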

Step 5: Test, package, and upload to Linux
Test locally first, then build a jar with Maven and upload it.

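Step 6: Prepare the data
Place the raw logs in the HDFS input directory that the job reads, e.g. /g6/hadoop/accesslog/20180717.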

Step 7: Create the Hive external table

create external table g6_access (
    cdn string,
    region string,
    level string,
    time string,
    ip string,
    domain string,
    url string,
    traffic bigint
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/g6/hadoop/access/clear';

Step 8: Write the shell script

#!/bin/bash
# Stop at the first failing command so later steps never run after a failure.
set -e

process_date=20180717

echo "step1: mapreduce etl"
hadoop jar /home/hadoop/lib/g6-hadoop-1.0.jar com.ruozedata.hadoop.mapreduce.driver.LogETLDriver /g6/hadoop/accesslog/$process_date /g6/hadoop/access/output/day=$process_date

echo "step2: move the data to the DW"
hdfs dfs -mv /g6/hadoop/access/output/day=$process_date /g6/hadoop/access/clear/

echo "step3: refresh the metadata"
hive -e "use g6hadoop; alter table g6_access add if not exists partition (day='$process_date');"

Done.
