Preface
In the previous two articles we used Spark RDD and Spark Streaming to compute, from CDN logs, the number of unique IPs, the unique IPs per video, and the CDN traffic per hour. Here we will perform the same computations on the same logs with the MapReduce framework; this assumes a working Hadoop and HDFS environment has already been set up.
Unique IP Count
Approach
- Map: extract the IP address from each log line and emit (ip, 1).
- Reduce: by the time reduce runs, identical IPs have already been grouped together, so summing the values gives the request count for that IP; the number of output keys is the number of unique IPs.
Code
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] lines = line.split(" ");
        String ip = lines[0];
        // Emit (ip, 1) for every request line
        context.write(new Text(ip), new IntWritable(1));
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the emitted 1s for this IP
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
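Off-cluster, the group-and-sum logic above can be sketched in plain Java, with a HashMap standing in for the framework's shuffle phase (a simulation only; the sample log lines below are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class UniqueIpSimulation {
    // Group-and-sum, standing in for MapReduce's shuffle + reduce
    static Map<String, Integer> countByIp(String[] logLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : logLines) {
            String ip = line.split(" ")[0];    // map: first field is the client IP
            counts.merge(ip, 1, Integer::sum); // reduce: sum the emitted 1s
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical log lines, invented for illustration
        String[] logLines = {
            "10.0.0.1 - [18/Sep/2017:10:01:02 +0800] \"GET /1.mp4\" 200 5120",
            "10.0.0.2 - [18/Sep/2017:10:05:00 +0800] \"GET /2.mp4\" 206 1024",
            "10.0.0.1 - [18/Sep/2017:11:00:00 +0800] \"GET /1.mp4\" 304 0"
        };
        Map<String, Integer> counts = countByIp(logLines);
        System.out.println("unique IPs: " + counts.size());             // 2
        System.out.println("10.0.0.1 hits: " + counts.get("10.0.0.1")); // 2
    }
}
```

The map side corresponds to emitting (ip, 1); `merge` plays the role of the reducer summing the values for one key.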
Unique IPs per Video
Approach
- Map: extract the video ID and client IP from each line and emit (id, ip).
- Reduce: records with the same video ID have been grouped together, so counting the distinct IPs among the values gives the unique-IP count for that video.
Code
// Extract the video ID from a log line
public static IntWritable getVideoId(String line) {
    Matcher matcher = Pattern.compile("([\\d]+)\\.mp4").matcher(line);
    if (matcher.find()) {
        return new IntWritable(Integer.parseInt(matcher.group(1)));
    } else {
        return new IntWritable(0); // no video in this line
    }
}

public static class Map extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] lines = line.split(" ");
        String ip = lines[0];
        // Emit (videoId, ip)
        context.write(getVideoId(line), new Text(ip));
    }
}
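The regex in getVideoId can be sanity-checked on its own; the demo below uses the same pattern, with a hypothetical request line (the path `/course/141.mp4` is made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoIdDemo {
    // Same pattern as getVideoId above: digits immediately before ".mp4"
    static int extractVideoId(String line) {
        Matcher matcher = Pattern.compile("([\\d]+)\\.mp4").matcher(line);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : 0;
    }

    public static void main(String[] args) {
        // Hypothetical request lines
        System.out.println(extractVideoId("GET /course/141.mp4 HTTP/1.1")); // 141
        System.out.println(extractVideoId("GET /index.html HTTP/1.1"));     // 0 (no match)
    }
}
```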
public static class Reduce extends Reducer<IntWritable, Text, IntWritable, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Count the distinct IPs for this video.
        // Note: Hadoop reuses the same Text instance across the values iteration,
        // so copy the content out with toString() rather than storing the object itself.
        Set<String> ips = new HashSet<String>();
        for (Text value : values) {
            ips.add(value.toString());
        }
        context.write(key, new IntWritable(ips.size()));
    }
}
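The per-video grouping can likewise be simulated off-cluster with a map from video ID to a set of IPs (the sample lines are hypothetical, shortened to just an IP and a request path):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VideoUniqueIpSimulation {
    // For each video ID, collect the distinct client IPs that requested it
    static Map<Integer, Set<String>> ipsPerVideo(String[] lines) {
        Map<Integer, Set<String>> result = new HashMap<>();
        Pattern idPattern = Pattern.compile("([\\d]+)\\.mp4");
        for (String line : lines) {
            Matcher m = idPattern.matcher(line);
            int videoId = m.find() ? Integer.parseInt(m.group(1)) : 0;
            String ip = line.split(" ")[0];
            result.computeIfAbsent(videoId, k -> new HashSet<>()).add(ip);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical log lines
        String[] lines = {
            "10.0.0.1 GET /141.mp4",
            "10.0.0.2 GET /141.mp4",
            "10.0.0.1 GET /141.mp4",
            "10.0.0.1 GET /7.mp4"
        };
        for (Map.Entry<Integer, Set<String>> e : ipsPerVideo(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue().size());
        }
    }
}
```

Here the HashSet deduplicates IPs per video, which is exactly what the reducer above does for one key at a time.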
Hourly CDN Traffic
Approach
- Map: extract the hour and the response size from each log line and emit (hour, size).
- Reduce: sum the sizes for each hour to get the total traffic per hour.
Code
// Extract the hour
public static IntWritable getHour(String line) {
    Matcher matcher = Pattern.compile("2017:(\\d{2}):").matcher(line);
    if (matcher.find()) {
        return new IntWritable(Integer.parseInt(matcher.group(1)));
    } else {
        return new IntWritable(0);
    }
}

// Extract the response size
public static LongWritable getHttpSize(String line) {
    Matcher matcher = Pattern.compile(" (200|206|304) ([0-9]+) ").matcher(line);
    if (matcher.find()) {
        return new LongWritable(Long.parseLong(matcher.group(2)));
    } else {
        return new LongWritable(0);
    }
}
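Both extractors can be checked against a sample access-log fragment. Note that the `2017:` anchor in getHour assumes the log's timestamp format, e.g. `[18/Sep/2017:10:54:59 +0800]`; the line below is invented for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HourSizeDemo {
    // Same pattern as getHour above: the two digits after "2017:"
    static int hour(String line) {
        Matcher m = Pattern.compile("2017:(\\d{2}):").matcher(line);
        return m.find() ? Integer.parseInt(m.group(1)) : 0;
    }

    // Same pattern as getHttpSize above: the byte count after a 200/206/304 status
    static long size(String line) {
        Matcher m = Pattern.compile(" (200|206|304) ([0-9]+) ").matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : 0L;
    }

    public static void main(String[] args) {
        // Hypothetical log line
        String line = "10.0.0.1 - [18/Sep/2017:10:54:59 +0800] \"GET /141.mp4\" 206 3200 \"-\"";
        System.out.println(hour(line)); // 10
        System.out.println(size(line)); // 3200
    }
}
```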
public static class Map extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        // Emit (hour, responseSize)
        context.write(getHour(line), getHttpSize(line));
    }
}
public static class Reduce extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the CDN traffic for this hour
        long size = 0L;
        for (LongWritable value : values) {
            size += value.get();
        }
        context.write(key, new LongWritable(size));
    }
}
Source Code
http://git.oschina.net/whzhaochao/spark-learning