Summary of Hadoop Log Parsing Approaches

1. Hive regex table

    That is, specify a parsing regular expression when creating the Hive table, so the regex parses the raw log directly into a normalized table.

Log samples:

"tom"@_@123@_@"192.168.1.2"@_@"2017-02-05 12:13:06"
"jack"@_@139@_@"192.168.221.23"@_@"2017-02-05 12:13:08"
"jesse"@_@114@_@"192.168.10.82"@_@"2017-02-05 12:13:13"

1.1 Import the jar

    If hive-contrib-0.7.0.jar is not already present, import it first, as sketched below.
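
A minimal sketch of importing the SerDe jar from the Hive CLI; the jar path below is an assumed example, substitute your actual location:

-- assumed jar location, adjust as needed
ADD JAR /opt/jar/hive-contrib-0.7.0.jar;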

1.2 Create the table

Managed (internal) table:

CREATE TABLE IF NOT EXISTS hive_test
(name     string,
 age      string,
 ip       string,
 dt       string)
ROW FORMAT   
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'   
WITH SERDEPROPERTIES   
( 'input.regex' = '"(.*)"@_@(.*)@_@"(.*)"@_@"(.*)"',   
'output.format.string' = '%1$s %2$s %3$s %4$s')   
STORED AS TEXTFILE;

External table:

CREATE EXTERNAL TABLE IF NOT EXISTS hive_test
(name     string,
 age      string,
 ip       string,
 dt       string) 
ROW FORMAT   
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'   
WITH SERDEPROPERTIES   
( 'input.regex' = '"(.*)"@_@(.*)@_@"(.*)"@_@"(.*)"',   
'output.format.string' = '%1$s %2$s %3$s %4$s')    
STORED AS TEXTFILE
LOCATION '/DB/test/hive_test';

1.3 Load the data

    Load into the managed table:

        load data local inpath '/opt/test_data/data_test.txt' into table hive_test;  

    Load into the external table (put the file directly under its LOCATION):

        sudo -u hdfs hadoop fs -put /opt/test_data/data_test.txt /DB/test/hive_test
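
    To verify the load, a quick illustrative sanity query should show the regex-split columns:

        -- illustrative sanity check
        SELECT name, age, ip, dt FROM hive_test LIMIT 3;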

1.4 Parsing example

Log format:

192.168.103.230 - - [04/Jul/2017:00:00:06 +0800] "POST /mes/perfectBillAuth HTTP/1.1" 200 75 5 13919389860,个体户张国强,86191229388 223.104.27.45,180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -

Table DDL:

CREATE TABLE IF NOT EXISTS test.ishua_log (
host                 string,
visiter              string,
time                 string,
http                 string,
status_code          string,
byte                 string,
cost_time            string,
phone                string,
cust_name            string,
cust_no              string,
cust_ip              string,
ph_system            string,
ph_system_version    string,
phone_model          string,
app_name             string,
app_version          string,
location             string,
visit                string
) ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( 'input.regex' = "([^ ]*) - ([^ ]*) (\\[.*\\]) (\".*?\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) (-|[0-9]*),([^ ]*),(-|[0-9]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*) ([^ ]*)",
'output.format.string' = '%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s %10$s %11$s %12$s %13$s %14$s %15$s %16$s %17$s %18$s')
STORED AS TEXTFILE;

Parsed result:

(The parsed result is shown as a screenshot in the original post.)
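
An illustrative query against the table above would display the regex-extracted fields, e.g.:

-- illustrative query, columns chosen arbitrarily
SELECT host, time, phone, cust_name, app_version FROM test.ishua_log LIMIT 3;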

2. Shell parsing

    That is, parse the logs with shell text-processing tools such as awk, sed, and grep, and output a normalized data format. This is a single-machine approach, suitable for logs of moderate complexity and moderate volume.

2.1 Log samples:

15:25:07.958 416396277 [DubboServerHandler-192.168.108.29:28513-thread-119] INFO  c.l.c.filter.DubboInterfaceFilter - 192.168.101.67 com.lefu.cust.dubbo.CustomerInterface.findByCustomerNo RESULT:00
15:25:08.057 416396376 [DubboServerHandler-192.168.108.29:28513-thread-119] INFO  c.l.c.filter.DubboInterfaceFilter - 192.168.101.102 com.lefu.cust.dubbo.CustomerInterface.findByCustomerNo ARGS:[86191017000]
15:25:08.200 416396519 [DubboServerHandler-192.168.108.29:28513-thread-123] INFO  c.l.c.filter.DubboInterfaceFilter - 192.168.101.63 com.lefu.cust.dubbo.EnterpriseInfoInterface.findByCustomerNo ARGS:[8617931353]

2.2 Parsing script:

#!/bin/bash

source /usr/local/work/Maya/common/all.sh  # presumably defines $DD, TickCross and AddPartition used below

function cust_server_create {

tmpDir=/tmp/pu_tmp/cust_server.log
sourceDir=/RowLogs/cust_server/$DD
targetDir=/DB/logsource/cust_server_create/$DD


#remove tmpfile
    rm -f $tmpDir

#format data to tmpfile
    for i in `hdfs dfs -ls $sourceDir|awk -F ' ' '{print $8}'`;do
        # normalize both log variants to "&"-separated columns
        hdfs dfs -cat $i |grep -v portal|grep -E "ARGS:|RESULT:"| \
        awk -F ' ' '{ if (substr($5,1,3) == "com")
                          print $1 "&" $7 "&" "" "&" substr($8,1,4)
                      else
                          print $1 "&" $8 "&" $7 "&" substr($9,1,4)
                    }' >> $tmpDir
    done
    done

#mkdir and put data to HDFS
    hdfs dfs -rm -r "$targetDir";
    hdfs dfs -mkdir -p "$targetDir";

    sudo -u hdfs hadoop fs -put $tmpDir $targetDir

    TickCross

    AddPartition origin_logs cust_server_create $DD $DD

}

echo "--------------------Cust_server_create start------------------"
cust_server_create
echo "--------------------Cust_server_create end--------------------"

2.3 Parsed output:

17:38:20.506&com.lefu.cust.dubbo.ShopInterface.getByCustomerNo&192.168.108.9/192.168.108.9:55696&ARGS
17:38:20.731&com.lefu.cust.dubbo.CustomerInterface.findByCustomerNo&192.168.101.86/192.168.101.86:45533&ARGS
17:38:20.736&com.lefu.cust.dubbo.CustomerInterface.findByCustomerNo&192.168.101.86/192.168.101.86:45533&RESU

3. Spark parsing

    That is, parse the logs with Scala on the Spark platform. The program can ultimately be packaged as a jar, submitted to a Spark cluster for execution, and write normalized output to HDFS. This approach has strong processing capacity and high efficiency, and suits parsing large volumes of relatively complex log formats.

Test environment:

       IDE: IDEA 2016.1.2

       Java: jdk1.8_25

       Scala: sdk-2.10.61 (versions 2.11 and above conflict with Spark 1.6)

       Spark: spark-assembly-1.6.1-hadoop2.6.0.jar

3.1 Log samples:

192.168.103.230 - - [04/Jul/2017:00:00:06 +0800] "POST /mes/bill/perfectBillAuth HTTP/1.1" 200 75 5 13919389860,个体户张国强,86191229388 223.104.27.45,180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -
192.168.189.53 - - [04/Jul/2017:12:40:06 +0800] "POST /mes/bill/perfectBillAuth HTTP/1.1" 200 75 7 1456383406,个体户张向东,86191229372 223.104.30.30,180.97.56.27 IOS 6.0.1 apple6s LefuTong 4.4.1 T35.38862|103.85023 -
192.168.125.20 - - [04/Jul/2017:10:06:06 +0800] "POST /mes/bill/perfectBillAuth HTTP/1.1" 200 75 3 13777789239,个体户李华,86191229455 223.104.27.45,180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -

3.2 Parsing code:

Parser script: ishua_format.scala

package ishua

import java.text.SimpleDateFormat
import java.util.Locale
import org.apache.log4j.Logger

case class Ishua_log(host:String,visiter:String,time:String,http:String,status_code:String,byte:String,cost_time:String,phone:String,cust_name:String,cust_no:String,cust_ip:String,ph_system:String,ph_system_version:String, phone_model:String,app_name:String,app_version:String, location:String,visit:String) {
  // join the 18 fields with \001, Hive's default field delimiter
  override def toString: String = {
    "%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s\001%s"
      .format(host,visiter,time,http,status_code,byte,cost_time,phone,cust_name,cust_no,cust_ip,ph_system,ph_system_version,phone_model,app_name,app_version,location,visit)
  }
}

object ishua_format {
  val log = Logger.getLogger(this.getClass)
  val _null: Ishua_log = new Ishua_log("","","","","","","","","","","","","","","","","","")

  def fromStringlog(in: String): Ishua_log = {
    try {
      val text = in.split(" ")
      val host = text(0)
      val visiter = text(2)

      val date_old = new SimpleDateFormat("dd/MMM/yyyy:HH:mm:ss", Locale.US) // HH: 24-hour clock
      val date_new = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
      val time = date_new.format(date_old.parse(text(3).replaceAll("\\[", "")))

      val http = text(5).replaceAll("\"", "") + " " + text(6) + " " + text(7).replaceAll("\"", "")
      val status_code = text(8)
      val byte = text(9)
      val cost_time = text(10)
      val phone = text(11).split(",")(0)
      val cust_name = text(11).split(",")(1).replaceAll("\t","")
      val cust_no = text(11).split(",")(2)
      val cust_ip = text(12).replaceAll(",", "|")
      val ph_system = text(13)
      val ph_system_version = text(14)
      val phone_model = text(15)
      val app_name = text(16)
      val app_version = text(17)
      val location = text(18)
      val visit = text(19)
      new Ishua_log(host, visiter, time, http, status_code, byte, cost_time, phone, cust_name, cust_no, cust_ip,
        ph_system, ph_system_version, phone_model, app_name, app_version, location, visit)
    }
    catch {
       case e: Throwable =>
         log.error("failed to parse line: %s\n%s".format(e, in))
         _null
  }


  def main(args: Array[String]) {
    val data="""192.168.103.230 - - [04/Jul/2017:12:06:08 +0800] "POST /mes/bill/perfectBillAuth HTTP/1.1" 200 75 5 13919389860,个体户张国强,86191229388 223.104.27.45,180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -"""
    val model = fromStringlog(data)
    println(model)
  }
}

Driver script: ishua_parser.scala

package ishua

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.io.compress.GzipCodec

object ishua_parser {

  def main(args: Array[String]) {

    // set up the runtime environment
    val conf = new SparkConf().setAppName("ishua_log").setMaster("local")
    val sc = new SparkContext(conf)
    val input ="E:\\ishua.txt"
    val output = "E:\\ishuaout.txt"

    val formatted = sc.textFile(input)
        .filter(l => l.contains("0800") && !l.contains("com.lefu"))
        .map(ishua.ishua_format.fromStringlog(_))
        .filter(!_.host.trim.isEmpty)

    formatted.foreach(println)
    //formatted.repartition(1).saveAsTextFile(output, classOf[GzipCodec])

    sc.stop()

  }

}

Test the jar:

./spark-submit --master local --class ishua.ishua_parser /opt/jar/log_format.jar

3.3 Parsed output:

    (Fields are separated by \001, so the delimiter is not visible here.)

192.168.103.230 - 2017-07-04 00:00:06 POST /mes/bill/perfectBillAuth HTTP/1.1 200 75 5 13919389860 个体户张国强 86191229388 223.104.27.45|180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -
192.168.189.53 - 2017-07-04 12:40:06 POST /mes/bill/perfectBillAuth HTTP/1.1 200 75 7 1456383406 个体户张向东 86191229372 223.104.30.30|180.97.56.27 IOS 6.0.1 apple6s LefuTong 4.4.1 T35.38862|103.85023 -
192.168.125.20 - 2017-07-04 10:06:06 POST /mes/bill/perfectBillAuth HTTP/1.1 200 75 3 13777789239 个体户李华 86191229455 223.104.27.45|180.97.88.132 ANDROID 6.0.1 vivoX9 LefuTong 4.4.1 T35.38862|103.85023 -
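
Because the fields are \001-delimited (Hive's default), the Spark output can be queried straight from Hive. A minimal sketch; the table name and LOCATION here are assumptions, not from the original post:

-- assumed table name and LOCATION; point LOCATION at the Spark output dir
CREATE EXTERNAL TABLE IF NOT EXISTS test.ishua_parsed (
 host                 string,
 visiter              string,
 time                 string,
 http                 string,
 status_code          string,
 byte                 string,
 cost_time            string,
 phone                string,
 cust_name            string,
 cust_no              string,
 cust_ip              string,
 ph_system            string,
 ph_system_version    string,
 phone_model          string,
 app_name             string,
 app_version          string,
 location             string,
 visit                string
) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/DB/logsource/ishua_parsed';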

Reposted from: https://my.oschina.net/puwenchao/blog/1457028
