Log Analysis with MapReduce
1. Overview
- Goal: clean an e-commerce access log and compute the page views (PV) for each product or URL.
2. Test Data
- A sample of the test data is shown below:
```
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89923 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89944 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89944 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89923 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89933 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
```
- Other log files can also be used; click the data source link to download them.
3. Programming Approach
- Approach:
1. Follow the WordCount pattern; the difference lies in finding a field that uniquely identifies each product, such as the URL.
2. A regular expression can be used to locate the target URL.
3. Alternatively, string slicing can be used to extract the target URL.
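Both extraction approaches can be sketched in plain Java before wiring them into a mapper. This is a minimal illustration; the class and method names (`ExtractId`, `byRegex`, `bySubstring`) are ours, not part of the project:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractId {
    // Approach 1: regular expression — match '=' followed by digits and lowercase letters
    public static String byRegex(String line) {
        Matcher m = Pattern.compile("\\=[0-9a-z]*").matcher(line);
        return m.find() ? m.group(0).substring(1) : null; // drop the leading '='
    }

    // Approach 2: string slicing — cut between "id=" and the next space
    public static String bySubstring(String line) {
        int start = line.indexOf("id=") + 3;
        return line.substring(start, line.indexOf(' ', start));
    }

    public static void main(String[] args) {
        String line = "niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "
                + "\"GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0\" 200 4391";
        System.out.println(byRegex(line));     // 402857036a2831e001kshdksdsdk89912
        System.out.println(bySubstring(line)); // 402857036a2831e001kshdksdsdk89912
    }
}
```

The regex works here because the first `=` in each log line is the one in `id=`; the slicing variant is faster but assumes the `id=` parameter always appears and is terminated by a space.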
4. Implementation Steps
- Create a Maven project in IDEA or Eclipse.
- Add the Hadoop dependencies to pom.xml:

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.3</version>
</dependency>
```
- Add a log4j.properties file under the resources directory with the following content (note that `Threshold` takes a single level, not a list):

```properties
### root logger ###
log4j.rootLogger = debug,console,fileAppender

### console appender ###
log4j.appender.console = org.apache.log4j.ConsoleAppender
log4j.appender.console.Target = System.out
log4j.appender.console.layout = org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern = %d{ABSOLUTE} %5p %c:%L - %m%n

### file appender ###
log4j.appender.fileAppender = org.apache.log4j.FileAppender
log4j.appender.fileAppender.File = logs/logs.log
log4j.appender.fileAppender.Append = false
log4j.appender.fileAppender.Threshold = DEBUG
log4j.appender.fileAppender.layout = org.apache.log4j.PatternLayout
log4j.appender.fileAppender.layout.ConversionPattern = %-d{yyyy-MM-dd HH:mm:ss} [ %t:%r ] - [ %p ] %m%n
```
- Write the text-based mapper class, LogMapper:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Pattern that matches '=' followed by digits and lowercase letters
    private static final Pattern ID_PATTERN = Pattern.compile("\\=[0-9a-z]*");

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // e.g. niit110,192.168.215.131 - - [28/May/2019:18:11:44 +0800] "GET /shop/detail.html?id=402857036a2831e001kshdksdsdk89912 HTTP/1.0" 200 4391 "-" "ApacheBench/2.3" "-"
        String data = value.toString();
        // Create the matcher object for this line
        Matcher m = ID_PATTERN.matcher(data);
        if (m.find()) {
            String idStr = m.group(0);
            String id = idStr.substring(1); // drop the leading '='
            context.write(new Text(id), new IntWritable(1));
        }
    }
}
```
- Write the reducer class:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```
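What the framework does between map and reduce can be sketched without Hadoop: identical keys are grouped, and the reducer's loop sums the 1s in each group. A minimal stand-in (the class name `ReduceSketch` is illustrative only):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReduceSketch {
    // Group identical map-output keys and sum their 1s — the same arithmetic
    // the reduce() method performs for each key group after the shuffle.
    public static Map<String, Integer> shuffleAndReduce(List<String> mapKeys) {
        Map<String, Integer> counts = new HashMap<>();
        for (String key : mapKeys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("id-a", "id-b", "id-a", "id-a");
        System.out.println(shuffleAndReduce(keys)); // prints each id with its count
    }
}
```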
- Write the driver class:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(LogJob.class);

        job.setMapperClass(LogMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(job, new Path("F:\\NIIT\\access2.log"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\NIIT\\logs\\002"));

        boolean completion = job.waitForCompletion(true);
        System.exit(completion ? 0 : 1);
    }
}
```
- Run the code locally and check that the result is correct. The expected output is:

```
402857036a2831e001kshdksdsdk89912	2
402857036a2831e001kshdksdsdk89923	2
402857036a2831e001kshdksdsdk89933	4
402857036a2831e001kshdksdsdk89944	2
```
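These counts can be double-checked without Hadoop by running the same regex over the ten sample lines. The `LocalPvCheck` class below is an illustrative helper, not part of the project:

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LocalPvCheck {
    private static final Pattern ID = Pattern.compile("\\=[0-9a-z]*");

    // Count page views per id; TreeMap sorts keys like the MapReduce output.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> pv = new TreeMap<>();
        for (String line : lines) {
            Matcher m = ID.matcher(line);
            if (m.find()) {
                pv.merge(m.group(0).substring(1), 1, Integer::sum);
            }
        }
        return pv;
    }

    public static void main(String[] args) {
        // Last five characters of each id in the sample data, in order
        String[] suffixes = {"89912", "89923", "89933", "89933", "89944",
                             "89944", "89912", "89923", "89933", "89933"};
        String[] lines = new String[suffixes.length];
        for (int i = 0; i < suffixes.length; i++) {
            lines[i] = "\"GET /shop/detail.html?id=402857036a2831e001kshdksdsdk"
                    + suffixes[i] + " HTTP/1.0\"";
        }
        // Prints counts of 2, 2, 4, 2 — matching the expected output above
        count(lines).forEach((id, n) -> System.out.println(id + "\t" + n));
    }
}
```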
5. Packaging and Running on the Cluster (for reference only; adjust as needed)
- After the local test passes, modify the path-handling code in the driver class so the input and output paths come from the command-line arguments instead of being hard-coded:

```java
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
```
-
Update any other environment-specific settings in the job.
- Build the program into a jar; this requires configuring the packaging plugin in pom.xml:

```xml
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <!-- use Maven's predefined descriptor -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the package lifecycle phase -->
                    <phase>package</phase>
                    <goals>
                        <!-- run only once -->
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
Proceed as shown in the figure below.
- Submit the job to the cluster with the following command:

```shell
hadoop jar packagedemo-1.0-SNAPSHOT.jar com.niit.mr.EmpJob /datas/emp.csv /output/emp/
```
That completes all the steps. Give it a try, and good luck!