CentOS 6.5 64-bit: Installing Hadoop 2.7.0, MapReduce Log Analysis, Hive 2.1.0, Querying Hive over JDBC (2)

Original post: 2016-09-12 14:10:12

Part 2: MapReduce Log Analysis


Before doing the log analysis, I suggest getting familiar with how MapReduce works. There are plenty of write-ups online; search around and you will find a good one.

The logs look like this:


<?php  if ( ! defined('ROOT_PATH')) exit('No direct script access allowed'); ?>

[2016-06-01 00:10:27] POST 218.82.131.157 /user/HealthCenter.php?m=submit uid=14&hash=dd16e3e4d0e8786f13166a4065f24fa0&num=13.0&type=1&time=1464711029 0(OK)
[2016-06-01 08:10:27] POST 218.82.131.157 /user/HealthCenter.php?m=submit uid=14&hash=863fbf2535639c16d885a55c78dff665&num=13.0&type=1&time=1464739829 0(OK)
[2016-06-01 09:10:28] POST 124.74.69.134 /user/HealthCenter.php?m=submit uid=14&hash=9a310b722e795e2673cdba76bea29b26&num=13.0&type=1&time=1464743429 0(OK)
[2016-06-01 09:16:05] GET 124.74.69.134 /index/Main.php?hash=eac57627d3407963dab81da2bb07e378&page_num=1&page_size=10&time=1464743769&uid=8 0(OK)
[2016-06-01 10:01:30] GET 124.74.69.134 /index/Main.php?hash=7979353ef669c61f75a5a7e9d39cd646&page_num=1&page_size=10&time=1464746494&uid=8 0(OK)
[2016-06-01 10:10:28] POST 124.74.69.134 /user/HealthCenter.php?m=submit uid=14&hash=98012832769b5a0e45f036e92032f1ef&num=13.0&type=1&time=1464747029 0(OK)
[2016-06-01 10:11:12] GET 124.74.69.134 /index/Main.php?hash=77938b0fdf1b733a9e15d9a2055767d1&page_num=1&page_size=10&time=1464747076&uid=8 0(OK)
[2016-06-01 10:48:00] GET 124.74.69.134 /index/Main.php?hash=1a1979e9fdcfca2f17bf1b287f4508aa&page_num=1&page_size=10&time=1464749284&uid=8 0(OK)
[2016-06-01 10:48:42] POST 124.74.69.134 /user/Position.php uid=9&address=undefine&latitude=4.9E-324&time=1464749394&type=1&hash=d4aebe936762be3a0420b62a77e37b00&longitude=4.9E-324 0(OK)

The fields are: timestamp, request method, IP, request URL, parameters, and return code.

One log file is generated per day, named in the Y-m-d.php format (hence the PHP guard line at the top of each file, which keeps the log from being served directly).

The goal: for each day, count the number of requests to each interface, grouped by return code.

Writing the program

/**
 * @ClassName:     LogMapReduce.java
 * @author         273030282@qq.com
 * @version        V1.0 
 * @Date           2016-7-11 10:20:11
 * @Description:   Counts each day's requests per interface, grouped by return code
 *
 */

package www.com.cn;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class LogMapReduce extends Configured implements Tool {

	public static void main(String[] args) {
		Configuration conf = new Configuration();
		try {
			int res = ToolRunner.run(conf, new LogMapReduce(), args);
			System.exit(res);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	@Override
	public int run(String[] args) throws Exception {
		// Reuse the configuration that ToolRunner already populated via getConf().
		final Job job = Job.getInstance(getConf(), "LogParseMapReduce");
		job.setJarByClass(LogMapReduce.class);
		FileInputFormat.setInputPaths(job, args[0]);
		job.setMapperClass(MyMapper.class);
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		boolean success = job.waitForCompletion(true);
		System.out.println(success ? "process success!" : "process failed!");
		// Report the job status through the process exit code.
		return success ? 0 : 1;
	}
	
	enum Counter {
		LINESKIP,
	}
	

	static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
		private final static IntWritable one = new IntWritable(1);

		@Override
		public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String line = value.toString();
			// Skip empty lines and anything that is not a log entry, such as the PHP guard line.
			if (line.length() == 0 || line.charAt(0) != '[') {
				return;
			}

			try {
				// Fields: [date time] METHOD IP URL PARAMS STATUS
				String[] lines = line.split(" ");
				String data = lines[0].replace("[", "") + "\t";
				if ("GET".equals(lines[2])) {
					// For GET requests the query string is part of the URL; keep the
					// action parameter m=... (when it comes first) as part of the interface name.
					String url;
					String[] urls = lines[4].split("[?]");
					String[] params = urls[1].split("[&]");
					if (params[0].startsWith("m=")) {
						url = urls[0] + "?" + params[0];
					} else {
						url = urls[0];
					}
					data += url + "\t" + lines[5];
				} else if ("POST".equals(lines[2])) {
					// For POST requests the body parameters occupy field 5,
					// so the URL is field 4 and the return code is field 6.
					data += lines[4] + "\t" + lines[6];
				}
				context.write(new Text(data), one);
			} catch (ArrayIndexOutOfBoundsException e) {
				// Malformed line: bump the counter and skip it.
				context.getCounter(Counter.LINESKIP).increment(1);
			}
		}
	}

	static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
		@Override
		protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
			int count = 0;
			// Sum the values instead of merely counting them, so the result
			// stays correct if a combiner is ever configured.
			for (IntWritable v : values) {
				count += v.get();
			}
			context.write(key, new IntWritable(count));
		}
	}

}
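Each line of the final output is therefore date, interface, return code, and count, separated by tabs. Running the job over the nine sample lines above should produce:

2016-06-01	/index/Main.php	0(OK)	4
2016-06-01	/user/HealthCenter.php?m=submit	0(OK)	4
2016-06-01	/user/Position.php	0(OK)	1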

Then package the program as a jar: LogMapReduce.jar

When exporting the jar, be sure to select the Main class; that way you do not have to pass the fully qualified class name on the command line, as the example below shows.
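For example, with LogMapReduce set as the Main class, a single day's log can be processed with (paths as used by run.sh below):

hadoop jar LogMapReduce.jar /api/put/2016-06-05.php /api/out/2016-06-05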


Shell scripts

There are two scripts: a worker script that processes a single log file, and a driver script that invokes it over a range of days.

The worker script, run.sh:

#! /bin/bash

d=`date "+%Y-%m-%d %H:%M:%S"`
echo "{$d} start..."

file=$1
if [ -f ${file} ]; then
        echo "${file} exists"
else
        echo "${file} does not exist"
        # Abort with a non-zero status so a caller can detect the failure.
        exit 1
fi


# Extract the file name (e.g. 2016-06-05.php) and its base name (2016-06-05)
filename=$(basename ${file})
name=${filename%%.*}
echo "hadoop put file to /api/put/${filename}"
hadoop fs -put ${file} /api/put/${filename}
echo "call LogMapReduce.jar"
hadoop jar /home/hadoop/hadoop-2.7.0/share/hadoop/mapreduce/LogMapReduce.jar /api/put/${filename} /api/out/${name}
echo "hive load into api_logs"
hive -e "load data inpath '/api/out/${name}/part-r-00000' into table apis.api_logs"
echo "delete /api/put/${filename}"
hadoop fs -rm /api/put/${filename}
echo "delete /api/out/${name}"
hadoop fs -rm -r /api/out/${name}
echo "end"

The overall flow: take the input file (with its path), extract the file name, put the file into HDFS, run LogMapReduce.jar on it, load the result into Hive, and finally delete the intermediate files from HDFS.
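Note that the load step assumes the Hive table apis.api_logs already exists and matches the job's tab-separated output. The original post does not show the DDL, so here is a minimal sketch under that assumption (column names are hypothetical):

-- Hypothetical DDL; column names are placeholders, but the layout must
-- match the MapReduce output: date <TAB> interface <TAB> status <TAB> count
CREATE DATABASE IF NOT EXISTS apis;
CREATE TABLE IF NOT EXISTS apis.api_logs (
    log_date STRING,
    api      STRING,
    status   STRING,
    cnt      INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';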


The driver script, process_2016_06.sh:

#!/bin/sh

for((i=5;i<31;i++))
do
        logdate=$(printf "%02d" $i)
        ./run.sh /home/hadoop/data/2016-06-${logdate}.php
done

You could of course also fold the two scripts into one.


With everything loaded into Hive, you can start querying, for example the top 20 interfaces by failure rate in June.
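A sketch of such a query, assuming the hypothetical api_logs columns above and counting any return code other than 0(OK) as a failure:

-- Top 20 interfaces by failure rate in June (assumed schema from the DDL sketch above)
SELECT api,
       SUM(IF(status <> '0(OK)', cnt, 0)) / SUM(cnt) AS fail_rate
FROM apis.api_logs
WHERE log_date LIKE '2016-06%'
GROUP BY api
ORDER BY fail_rate DESC
LIMIT 20;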