1. What is data cleaning
Data cleaning is the process of turning raw data into usable, valuable data. It is necessary because big data has a low value density: only a small fraction of the raw records carries the information that downstream computation actually needs.
2. The basic workflow of big data development
- Collect the data (Flume, Logstash) and buffer it in a message queue (Kafka); a minimal producer sketch follows this list.
- Move the buffered data from the MQ into HDFS for persistent storage.
- Clean the data (handle the low-value-density records) and write the result back to HDFS.
- Run the computation (MapReduce) and save the results to HDFS or HBase.
- Visualize the results (ECharts, Highcharts).
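As a minimal, illustrative sketch of the first step, the snippet below pushes one access-log line into Kafka using the plain Java producer API. The broker address localhost:9092 and the topic name access-log are assumptions for the example, not values from the original setup.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AccessLogProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; replace with the real Kafka cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String line = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
            // "access-log" is a hypothetical topic name for this sketch.
            producer.send(new ProducerRecord<>("access-log", line));
        }
    }
}

In a production pipeline a collector such as Flume would play this role continuously, tailing the log file and shipping each line to the topic.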
3. Extracting data with regular expressions
Regular expressions are used mainly for string matching, extraction, and replacement.
Syntax:
Rule | Meaning |
---|---|
. | matches any single character |
\d | matches any digit |
\D | matches any non-digit |
\w | matches a word character (a-z, A-Z, 0-9, _) |
\W | matches a non-word character |
\s | matches a whitespace character |
^ | matches the start of the string |
$ | matches the end of the string |
How many times a rule matches (quantifiers)
Syntax | Meaning |
---|---|
* | matches the rule 0 to N times |
? | matches the rule 0 or 1 times |
{n} | matches the rule exactly n times |
{n,m} | matches the rule n to m times |
+ | matches the rule 1 to N times (at least once) |
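As a quick illustration of the syntax above, the following self-contained Java snippet (the class name is chosen for the example) uses Pattern and Matcher to pull two numbers out of a string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // \d{1,3} matches 1 to 3 consecutive digits; \. matches a literal dot.
        Pattern p = Pattern.compile("(\\d{1,3})\\.(\\d{1,3})");
        Matcher m = p.matcher("version 27.19 released");
        if (m.find()) {
            System.out.println(m.group(1)); // 27
            System.out.println(m.group(2)); // 19
        }
    }
}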
4. Access log case study
1. Sample data (an excerpt; the full data set contains 500,000 records)
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/faq.gif HTTP/1.1" 200 1127
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_widthauto.css?y7a HTTP/1.1" 200 1292
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_1.gif HTTP/1.1" 200 680
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/hot_2.gif HTTP/1.1" 200 682
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/filetype/common.gif HTTP/1.1" 200 90
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wsh_zk.css HTTP/1.1" 200 1482
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /data/cache/style_1_forum_index.css?y7a HTTP/1.1" 200 2331
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /source/plugin/wsh_wx/img/wx_jqr.gif HTTP/1.1" 200 1770
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/recommend_1.gif HTTP/1.1" 200 1030
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/image/common/logo.png HTTP/1.1" 200 4542
27.19.74.143 - - [30/May/2013:17:38:20 +0800] "GET /data/attachment/common/c8/common_2_verify_icon.png HTTP/1.1" 200 582
110.52.250.126 - - [30/May/2013:17:38:20 +0800] "GET /static/js/logging.js?y7a HTTP/1.1" 200 603
The downstream processing algorithm needs four fields from each record: the client IP, the request time, the requested resource, and the response status.
2. Extracting the four fields with a regular expression
Analyzing the log format yields the following regular expression:
^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(.*)\]\s"\w*\s(.*)\sHTTP/1\.1"\s(\d{3}).*$
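Before wiring the expression into a MapReduce job, it is worth verifying it in isolation. The following minimal check (the class name LogRegexTest is just for illustration) applies the expression to the first sample record and prints the four captured groups:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogRegexTest {
    public static void main(String[] args) {
        String regex = "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*\\[(.*)\\]\\s\"\\w*\\s(.*)\\sHTTP/1\\.1\"\\s(\\d{3}).*$";
        String line = "27.19.74.143 - - [30/May/2013:17:38:20 +0800] \"GET /static/image/common/faq.gif HTTP/1.1\" 200 1127";
        Matcher m = Pattern.compile(regex).matcher(line);
        if (m.find()) {
            System.out.println(m.group(1)); // 27.19.74.143
            System.out.println(m.group(2)); // 30/May/2013:17:38:20 +0800
            System.out.println(m.group(3)); // /static/image/common/faq.gif
            System.out.println(m.group(4)); // 200
        }
    }
}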
3. Cleaning the data with the MapReduce distributed computing framework
Note: because data cleaning involves no statistical computation, the MapReduce program usually has only a map task and no reduce task:
job.setNumReduceTasks(0)
Implementation
1. The data-cleaning Mapper
package com.hw.dataClearn;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.io.IOException;
/**
 * @author fql
 * @date 2019/8/19 12:40
 */
public class DateClearnMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Captures: (1) client IP, (2) request time, (3) requested resource, (4) HTTP status code.
    private static final Pattern PATTERN = Pattern.compile(
            "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}).*\\[(.*)\\]\\s\"\\w*\\s(.*)\\sHTTP/1\\.1\"\\s(\\d{3}).*$");
    // Source format: 30/May/2013:17:38:20 +0800; parse() stops after the seconds field,
    // so the trailing "+0800" zone offset is ignored.
    private final SimpleDateFormat inputFormat = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.ENGLISH);
    private final SimpleDateFormat outputFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

    /**
     * @param key     byte offset of the line within the input file
     * @param value   one record of the nginx access log
     * @param context used to emit the cleaned record
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        Matcher matcher = PATTERN.matcher(line);
        if (matcher.find()) {
            String ip = matcher.group(1);
            String accessTime = matcher.group(2);
            String resource = matcher.group(3);
            String status = matcher.group(4);
            try {
                Date date = inputFormat.parse(accessTime);
                String finalDate = outputFormat.format(date);
                // Map-only job: the cleaned record is the key; the value carries no data.
                context.write(new Text(ip + " " + finalDate + " " + resource + " " + status), NullWritable.get());
            } catch (ParseException e) {
                // A record whose timestamp cannot be parsed is simply dropped.
                e.printStackTrace();
            }
        }
    }
}
2. The job initialization (driver) class
package com.hw.dataClearn;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import java.io.IOException;
/**
 * @author fql
 * @date 2019/8/19 12:40
 */
public class DataClearnApplication {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "data clean");
        job.setJarByClass(DataClearnApplication.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // Run against the local file system for development/testing.
        TextInputFormat.setInputPaths(job, new Path("file:///d:/my/access.log"));
        TextOutputFormat.setOutputPath(job, new Path("file:///d:/my/final"));
        job.setMapperClass(DateClearnMapper.class);
        // Note: data cleaning usually needs only a map phase, so the reduce phase is disabled.
        job.setNumReduceTasks(0);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.waitForCompletion(true);
    }
}
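One caveat when running the job repeatedly: TextOutputFormat requires that the output directory (file:///d:/my/final here) not already exist; if it does, the job fails with a FileAlreadyExistsException, so delete the directory between runs.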
3. Results
- Generated files
- Sample of the results
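As a sanity check (worked out by hand from the regular expression and the date re-formatting above, not taken from an actual run), the first sample record should be emitted as:

27.19.74.143 2013-05-30 17:38:20 /static/image/common/faq.gif 200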