Hadoop MR Example: A Classic Hadoop Case Implemented in Spark (Part 7)

1. Requirement: compute URL access counts from Tomcat access logs; sample log lines are listed below.

Requirement: count GET and POST URL accesses separately.

Result format: access method, URL, access count.

Test data set:

196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038

182.131.89.195 - - [03/Jul/2014:23:37:43 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 301 - 0.000

196.168.2.1 - - [03/Jul/2014:23:38:27 +0800] "POST /service/notes/addViewTimes_23.htm HTTP/1.0" 200 2 0.003

196.168.2.1 - - [03/Jul/2014:23:39:03 +0800] "GET /html/notes/20140617/779.html HTTP/1.0" 200 69539 0.046

196.168.2.1 - - [03/Jul/2014:23:43:00 +0800] "GET /html/notes/20140318/24.html HTTP/1.0" 200 67171 0.049

196.168.2.1 - - [03/Jul/2014:23:43:59 +0800] "POST /service/notes/addViewTimes_779.htm HTTP/1.0" 200 1 0.003

196.168.2.1 - - [03/Jul/2014:23:45:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.060

196.168.2.1 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010

196.168.2.1 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077

196.168.2.1 - - [03/Jul/2014:23:48:31 +0800] "POST /service/notes/addViewTimes_24.htm HTTP/1.0" 200 2 0.003

196.168.2.1 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addViewTimes_542.htm HTTP/1.0" 200 2 0.003

196.168.2.1 - - [03/Jul/2014:23:49:31 +0800] "GET /notes/index-top-3.htm HTTP/1.0" 200 53494 0.041

196.168.2.1 - - [03/Jul/2014:23:50:55 +0800] "GET /html/notes/20140609/544.html HTTP/1.0" 200 183694 0.076

196.168.2.1 - - [03/Jul/2014:23:53:32 +0800] "POST /service/notes/addViewTimes_544.htm HTTP/1.0" 200 2 0.004

196.168.2.1 - - [03/Jul/2014:23:54:53 +0800] "GET /service/notes/addViewTimes_900.htm HTTP/1.0" 200 151770 0.054

196.168.2.1 - - [03/Jul/2014:23:57:42 +0800] "GET /html/notes/20140620/872.html HTTP/1.0" 200 52373 0.034

196.168.2.1 - - [03/Jul/2014:23:58:17 +0800] "POST /service/notes/addViewTimes_900.htm HTTP/1.0" 200 2 0.003

196.168.2.1 - - [03/Jul/2014:23:58:51 +0800] "GET /html/notes/20140617/888.html HTTP/1.0" 200 70044 0.057

186.76.76.76 - - [03/Jul/2014:23:48:34 +0800] "POST /service/notes/addViewTimes_542.htm HTTP/1.0" 200 2 0.003

186.76.76.76 - - [03/Jul/2014:23:46:17 +0800] "GET /course/list/73.htm HTTP/1.0" 200 12125 0.010

8.8.8.8 - - [03/Jul/2014:23:46:58 +0800] "GET /html/notes/20140609/542.html HTTP/1.0" 200 94971 0.077

Because the Tomcat log lines are not uniformly structured, the data has to be filtered and cleaned first.
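One way to do that cleaning, shown here only as a rough sketch (the LogCleaner class name and the regular expression are illustrative, not part of the original job), is to pull the request method and URL out of each line with a regular expression and discard lines that do not match:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogCleaner {
    // matches the quoted request part, e.g. "GET /course/detail/3.htm HTTP/1.0"
    private static final Pattern REQUEST =
            Pattern.compile("\"(GET|POST) (\\S+) HTTP/1\\.[01]\"");

    // returns "METHOD URL", or an empty string when the line is malformed
    public static String clean(String line) {
        Matcher m = REQUEST.matcher(line);
        return m.find() ? m.group(1) + " " + m.group(2) : "";
    }

    public static void main(String[] args) {
        String line = "196.168.2.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
        System.out.println(clean(line));  // prints: GET /course/detail/3.htm
    }
}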

2. Hadoop MapReduce implementation:

Mapper class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private IntWritable val = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        String tmp = handlerLog(line);
        if (tmp.length() > 0) {
            context.write(new Text(tmp), val);
        }
    }

    // Example line: 127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] "GET /course/detail/3.htm HTTP/1.0" 200 38435 0.038
    private String handlerLog(String line) {
        String result = "";
        try {
            if (line.length() > 20) {
                if (line.indexOf("GET") > 0) {
                    // keep the method plus URL: everything from "GET" up to "HTTP/1.0"
                    result = line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim();
                } else if (line.indexOf("POST") > 0) {
                    result = line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim();
                }
            }
        } catch (Exception e) {
            System.out.println(line);
        }
        return result;
    }

    // quick local test of handlerLog()
    public static void main(String[] args) {
        String line = "127.0.0.1 - - [03/Jul/2014:23:36:38 +0800] \"GET /course/detail/3.htm HTTP/1.0\" 200 38435 0.038";
        System.out.println(new LogMapper().handlerLog(line));
    }
}

Reducer class:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LogReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobMain {

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "log_job");
        job.setJarByClass(JobMain.class);

        job.setMapperClass(LogMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setReducerClass(LogReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));

        // delete the output path if it already exists so the job can be rerun
        Path path = new Path(args[1]);
        FileSystem fs = FileSystem.get(configuration);
        if (fs.exists(path)) {
            fs.delete(path, true);
        }
        FileOutputFormat.setOutputPath(job, path);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
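Because the reduce step is a plain sum, which is both associative and commutative, the reducer could optionally also be registered as a combiner to pre-aggregate map output and cut shuffle traffic. This is an optional tweak, not part of the original driver:

// optional: run LogReducer on each map task's local output before the shuffle
job.setCombinerClass(LogReducer.class);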

3. Spark implementation (Scala version):

// textFile() loads the data
val data = sc.textFile("/spark/seven.txt")

// filter out empty lines and lines that contain neither GET nor POST
val filtered = data.filter(_.length() > 0).filter(line => (line.indexOf("GET") > 0 || line.indexOf("POST") > 0))

// map each line to a ("METHOD URL", 1) pair, then sum the counts with reduceByKey
val res = filtered.map(line => {
  if (line.indexOf("GET") > 0) {
    // keep everything from "GET" up to "HTTP/1.0"
    (line.substring(line.indexOf("GET"), line.indexOf("HTTP/1.0")).trim, 1)
  } else {
    // keep everything from "POST" up to "HTTP/1.0"
    (line.substring(line.indexOf("POST"), line.indexOf("HTTP/1.0")).trim, 1)
  }
}).reduceByKey(_ + _)

// collect() is the action that triggers execution
res.collect()

The functional Scala code is concise and elegant, and JDK 1.8 and later offer similar features (lambdas and the Stream API).
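As a rough single-machine illustration of that point (the StreamDemo class and the hard-coded sample keys are made up for the example, not part of the original job), the same group-and-count step can be written with JDK 1.8 streams:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamDemo {
    public static void main(String[] args) {
        // a few already-cleaned "METHOD URL" keys, as produced by handlerLog()
        List<String> keys = Arrays.asList(
                "GET /course/list/73.htm",
                "GET /course/list/73.htm",
                "POST /service/notes/addViewTimes_542.htm");

        // group identical keys and count them; same shape as reduceByKey(_ + _)
        Map<String, Long> counts = keys.stream()
                .collect(Collectors.groupingBy(k -> k, Collectors.counting()));

        counts.forEach((k, v) -> System.out.println("(" + k + "," + v + ")"));
    }
}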

The output matches the MapReduce result:

(POST /service/notes/addViewTimes_779.htm,1),

(GET /service/notes/addViewTimes_900.htm,1),

(POST /service/notes/addViewTimes_900.htm,1),

(GET /notes/index-top-3.htm,1),

(GET /html/notes/20140318/24.html,1),

(GET /html/notes/20140609/544.html,1),

(POST /service/notes/addViewTimes_542.htm,2),

(POST /service/notes/addViewTimes_544.htm,1),

(GET /html/notes/20140609/542.html,2),

(POST /service/notes/addViewTimes_23.htm,1),

(GET /html/notes/20140617/888.html,3),

(POST /service/notes/addViewTimes_24.htm,1),

(GET /course/detail/3.htm,1),

(GET /course/list/73.htm,2),

(GET /html/notes/20140617/779.html,1),

(GET /html/notes/20140620/872.html,1)
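For completeness, the same job can also be expressed with Spark's Java API and JDK 1.8 lambdas. The sketch below is only illustrative (the LogCountJava class name and the setup are assumptions, not part of the original post); the logic mirrors the Scala version above:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LogCountJava {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("log_count");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> data = sc.textFile("/spark/seven.txt");

        // keep only lines that carry a GET or POST request
        JavaRDD<String> filtered = data.filter(line ->
                line.length() > 0 && (line.indexOf("GET") > 0 || line.indexOf("POST") > 0));

        // map each line to ("METHOD URL", 1) and sum the counts per key
        JavaPairRDD<String, Integer> res = filtered
                .mapToPair(line -> {
                    String method = line.indexOf("GET") > 0 ? "GET" : "POST";
                    String key = line.substring(line.indexOf(method), line.indexOf("HTTP/1.0")).trim();
                    return new Tuple2<>(key, 1);
                })
                .reduceByKey(Integer::sum);

        res.collect().forEach(System.out::println);
        sc.stop();
    }
}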
