1、apache日志格式:
LogFormat "%h %l %u %t \"%r\" %>s %b" common(引用httpd.conf配置文件)
注视:每个位置代表什么含义网上都有,可以自己看。
2、apache日志内容:(第一行空格已经标出,其他行格式一样,不标了)
183.31.43.112 空格 -空格 -空格 [06/Nov/2012:00:00:17 空格+0800] 空格"GET空格 /hz HTTP/1.1" 空格302空格 -
211.103.82.175 - - [06/Nov/2012:00:00:24 +0800] "GET /hz HTTP/1.1" 302 -
221.179.37.12 - - [06/Nov/2012:00:00:33 +0800] "GET /hz?t=35245 HTTP/1.0" 302 -
211.139.145.101 - - [06/Nov/2012:00:00:34 +0800] "GET /hz HTTP/1.1" 302 -
120.196.208.79 - - [06/Nov/2012:00:00:38 +0800] "GET /hz HTTP/1.1" 302 -
117.136.12.210 - - [06/Nov/2012:00:00:38 +0800] "GET /hz HTTP/1.1" 302 -
221.179.85.10 - - [06/Nov/2012:00:00:40 +0800] "GET /hz?t=43385 HTTP/1.0" 302 -
119.139.89.179 - - [06/Nov/2012:00:00:46 +0800] "GET /hz HTTP/1.1" 302 -
。。。。。。。。。
。。。。。。。。。
。。。。。。。。。
3、试验要求:
把apache日志中每行的 “远程主机IP/主机名”,“访问时间”,“url”,“ 状态”导入到oracle数据库中。
4、oracle数据库建立的相关表及视图:
(1) create table APACHE_LOG
(
remote_host CHAR(15),
access_time VARCHAR2(30),
full_url VARCHAR2(200),
status CHAR(10)
);
(2)create table APACHE_LOG_RPT
(
remote_host CHAR(15),
access_time DATE,
real_url VARCHAR2(392),
full_url VARCHAR2(200),
status CHAR(10)
);
(3)create or replace view v_apache_log_vi as
select REMOTE_HOST , trunc(to_date(ACCESS_TIME ,'dd/mon/yyyy hh24:mi:ss')) access_time ,
case when instr(full_url, '?') > 0 then
substr(full_url, 5, instr(full_url, '?')-5 )
else
substr(full_url, 5, instr(full_url, ' ',5) -5 )
end real_url
, FULL_URL , STATUS from apache_log;
5、使用python格式化apache日志:
(1)python脚本1:processor.py
#!/bin/env python
import sys, re
class Processor:
def __init__(self, reader, writer):
self.reader = open(reader, 'r')
self.writer = open(writer, 'w')
def convert(self, data):
assert 0, 'convert needed to be defined'
def process(self):
while 1:
line = self.reader.readline()
if not line: break
data = self.convert(line)
self.writer.write(data)
class UpperCase(Processor):
def convert(self, data):
return data.upper()
class HTMLize:
def write(self, data):
print '%s' % data[:-1]
class ScanApacheLog(Processor):
def convert(self, data):
p = re.compile(r'''
(?P\S+) #remote_host
\s+ #whitespace
\S+ #remote_login
\s+ #whitespace
\S+ #remote_user
\s+ #whitespace
(?P\[.*\]) #acess_time
\s+ #whitespace
(?P".*") #first line of request
\s+ #whitespace
(?P\d+) #status
\s+ #whitespace
(-|\d+) # send_size
\s*
''', re.X)
m = p.match(data)
if m is not None:
arr = m.groupdict()
request = re.match( r'"(.*)"', arr['request']).group(1)
access_time = re.match(r'\[(\S+)\s+\S+\]', arr['time']).group(1)
line = arr['remote_host'] + ',' + access_time + ',' + request + ',' + arr['status'] + '\n'
return line
if __name__ == '__main__':
UpperCase(open('1.txt'), sys.stdout).process()
UpperCase(open('1.txt'), open('2.txt', 'w')).process()
UpperCase(open('1.txt'), HTMLize()).process()
(2)python脚本2:scan.py
#!/bin/env python
from processor import *
if __name__ == '__main__':
reader='/home/oracle/v1x-access_log_20121106'
writer='/tmp/dealed.log'
ScanApacheLog(reader, writer ).process()
(3)把上面两个脚本放在同一个目录下,执行#python scan.py 即可
reader='/home/oracle/v1x-access_log_20121106'记录的是apache日志位置;
writer='/tmp/dealed.log' 记录的是经过python脚本格式化处理的文件,也是下面步骤要使用的文件。
/tmp/dealed.log'部分内容如下:
112.95.194.245,06/Nov/2012:00:00:01,GET /hz HTTP/1.1,302
120.196.208.72,06/Nov/2012:00:00:07,GET /hz HTTP/1.1,302
113.74.6.56,06/Nov/2012:00:00:09,GET /hz HTTP/1.1,302
183.31.43.112,06/Nov/2012:00:00:17,GET /hz HTTP/1.1,302
211.103.82.175,06/Nov/2012:00:00:24,GET /hz HTTP/1.1,302
221.179.37.12,06/Nov/2012:00:00:33,GET /hz?t=35245 HTTP/1.0,302
211.139.145.101,06/Nov/2012:00:00:34,GET /hz HTTP/1.1,302
120.196.208.79,06/Nov/2012:00:00:38,GET /hz HTTP/1.1,302
。。。。。。。
。。。。。。。
6、利用 sqlldr工具把'/tmp/dealed.log导入到oracle数据库的apache_log表
首先编写控制文件,控制文件很重要
[oracle@shanxi myself]$ more apachelog2.ctl
load data
truncate into table apache_log
fields terminated by ','
trailing nullcols
(
REMOTE_HOST,
ACCESS_TIME ,
FULL_URL ,
STATUS
)
,然后把/tmp/dealed.log导入数据库:
[oracle@shanxi myself]$ sqlldr scott/### control=apachelog.ctl data=/tmp/dealed.log
7、最后一步,使用前面建立的封装处理逻辑的视图导入到apache_log_rpt表
SQL> alter session set nls_date_language='AMERICAN';
SQL > insert into apache_log_rpt select * from v_apachelog_vi;
因为涉及到日志格式转换,所以需要修改session的nls_date_language参数( 主要是“select REMOTE_HOST , trunc(to_date(ACCESS_TIME ,'dd/mon/yyyy hh24:mi:ss')) access_time”的mon部分是那种格式,与nls_date_language是CHINEASE还是AMERICAN有关系),否则如果mon格式跟nls_date_language不一致的话,查询视图v_apache_log_vi的时候会报错的。就是因为这个原因导致的。
这个处理逻辑就是上面这些。下面列出我之前接触过的这方面的脚本,功能是类似的:
试验要求是使用crond进程调用shell脚本,每天定时自动导入昨天的apache日志(v1x-access_log_***)到数据库中。
目录结构如下:
[oracle@searchdb2 apache]$pwd
/logbackup/apache
[oracle@searchdb2 apache]$ll
total 88
drwxr-xr-x 2 yangzhanli dev 9216 Oct 1 08:50 201209
drwxr-xr-x 2 yangzhanli dev 16384 Nov 1 08:50 201210
drwxr-xr-x 2 yangzhanli dev 10240 Nov 19 08:50 201211
drwxr-xr-x 2 yangzhanli dev 38912 Nov 19 08:50 md5sumfile
[oracle@searchdb2 apache]$cd
[oracle@searchdb2 apache]$cd 201211/
[oracle@searchdb2 201211]$ls -l | grep v1x | more
-rw-r--r-- 1 yangzhanli dev 820634 Nov 1 23:59 v1x-access_log_20121101
-rw-r--r-- 1 yangzhanli dev 658913 Nov 2 23:59 v1x-access_log_20121102
-rw-r--r-- 1 yangzhanli dev 1066549 Nov 3 23:59 v1x-access_log_20121103
-rw-r--r-- 1 yangzhanli dev 644101 Nov 4 23:59 v1x-access_log_20121104
-rw-r--r-- 1 yangzhanli dev 888134 Nov 5 23:59 v1x-access_log_20121105
-rw-r--r-- 1 yangzhanli dev 1336714 Nov 6 23:59 v1x-access_log_20121106
-rw-r--r-- 1 yangzhanli dev 596688 Nov 7 23:59 v1x-access_log_20121107
-rw-r--r-- 1 yangzhanli dev 497143 Nov 8 23:59 v1x-access_log_20121108
cron调用如下:
[oracle@searchdb2 201211]$ crontab -l
0 7 * * * /home/oracle/admin/tarsmsglogs/bin/tar_smslogs.sh cm
0 8 * * * /home/oracle/admin/tarsmsglogs/bin/tar_smslogs.sh ct
0 6 * 1 * /home/oracle/insert_log/insert_log.sh
4 9 * * * /home/oracle/admin/apache/import_apachelog2.sh > /dev/null 2>&1
shell脚本import_apachelog2.sh内容如下:
!/bin/bash
# Filename: import_apachelog2.sh
# Date: 2012-10-31
# Desc: import the logfile '/logbackup/apache/${month}/ v1x-access_log_**.log' into the table 'apache_log_rpt' in SPICEMOM schema
# Usage: /home/oracle/admin/apache/import_apachelog2.sh
# Version: 1.1
. /home/oracle/.bash_profile
LOG_ROOT=/logbackup/apache
CUR_MONTH=`date -d '1 day ago' +%Y%m`
YESTERDAY=`date -d '1 day ago' +%Y%m%d`
LOG_PREFIX=v1x-access_log
LOG_FILE=${LOG_ROOT}/${CUR_MONTH}/${LOG_PREFIX}_${YESTERDAY}
SCRIPT_PATH=/home/oracle/admin/apache
#echo $LOG_FILE
#echo ${SCRIPT_PATH}
#exit
cd ${SCRIPT_PATH}
if [ ! -f $LOG_FILE ]; then
now=`date '+ %Y-%m-%d %H:%M:%S'`
echo "* Failed at $now:the logfile ${LOG_FILE} needed to import doesn't exist !" >> import_history.log
exit
fi
cp ${LOG_FILE} /tmp/v1x.log
python scanlog.py
sqlldr spicemom/spicemom control=apachelog.ctl data="/tmp/scanlog.txt"
if [ "$?" -eq 0 ] ; then
now=`date '+ %Y-%m-%d %H:%M:%S'`
sqlplus spicemom/spicemom @append_log.sql
echo "At $now:the logfile ${LOG_FILE} succeeds to import the table apache_log_rpt !" >> import_history.log
rm -rf /tmp/v1x.log
rm -rf /tmp/scanlog.txt
fi
所有涉及到的脚本目录结构如下:
[oracle@searchdb2 201211]$cd /home/oracle/admin/apache
[oracle@searchdb2 apache]$ll
total 36
-rw-r--r-- 1 oracle dba 69 Sep 12 15:29 afiedt.buf
-rw-r--r-- 1 oracle dba 383 Oct 30 13:38 apachelog.ctl
-rw-r--r-- 1 oracle dba 1750 Nov 19 09:04 apachelog.log
-rw-r--r-- 1 oracle dba 135 Oct 29 20:29 append_log.sql
-rwxr--r-- 1 oracle dba 968 Nov 4 12:19 import_apachelog.sh
-rwxr--r-- 1 oracle dba 1154 Nov 4 12:19 import_apachelog2.sh
-rw-r--r-- 1 oracle dba 2788 Nov 19 09:04 import_history.log
drwxr-xr-x 2 oracle dba 4096 Oct 30 16:00 orig_source
-rw-r--r-- 1 oracle dba 938 Oct 30 13:35 scanlog.py
再看一下脚本内容:
python的脚本得注意格式要符合程序结构,否则的话可能运行不了。python的代码是不是一段代码块不是靠类似C语言的中括号{},而是靠格式对齐来判断是不是同一段代码块, 这点感觉挺恶心的。(如果有人copy可得注意了)
[oracle@searchdb2 apache]$ vi scanlog.py
#!/usr/bin/env python
import sys
import re
def ScanOneLog(sLogFile,dLogFile):
log_line_re = re.compile(r'''(?P\S+) #IP ADDRESS
\s+ #whitespace
\S+ #remote logname
\s+ #whitespace
\S+ #remote user
\s+ #whitespace
(?P\[[^\[\]]+\]) #time
\s+ #whitespace
(?P"[^"]+") #first line of request
\s+ #whitespace
(?P\d+)
\s+ #whitespace
(?P-|\d+)
\s* #whitespace
''', re.VERBOSE)
lf=open(dLogFile,'w')
for line in open(sLogFile,'r'):
m = log_line_re.match(line)
if m is not None :
groupdict = m.groupdict()
tmps=groupdict['remote_host']+','+groupdict['status']+','+groupdict['time']+','+groupdict['request']+'\n'
lf.writelines(tmps)
lf.close()
if __name__=='__main__':
ScanOneLog('/tmp/v1x.log','/tmp/scanlog.txt')
[oracle@searchdb2 apache]$vi apachelog.ctl
load data
truncate into table apache_log
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
remote_host char,
status char,
access_time char,
url char(4000)
)
[oracle@searchdb2 apache]$vi append_log.sql
alter session set nls_date_language = 'AMERICAN';
insert into apache_log_rpt select * from v_apache_log_ori;
commit;
exit
最后给列下表和视图结构定义:
create table APACHE_LOG
(
remote_host VARCHAR2(20),
url VARCHAR2(4000),
status VARCHAR2(10),
access_time VARCHAR2(30)
)
create table APACHE_LOG_RPT
(
remote_host VARCHAR2(20),
real_url VARCHAR2(4000),
url VARCHAR2(4000),
status VARCHAR2(10),
access_time DATE
)
create or replace view v_apache_log as
select "REMOTE_HOST","REAL_URL","URL","STATUS","ACCESS_TIME" from apache_log_rpt;
create or replace view v_apache_log_ori as
select remote_host,
case
when instr(url,'?') >0 then substr(url,6,instr(url,'?')-6)
else substr(url,6,instr(url,' ',6)-6)
end real_url,
url,status,
trunc(to_date(SUBSTR(ACCESS_TIME,2,LENGTH(ACCESS_TIME)-8),'dd/mon/yyyy:hh24:mi:ss') ) access_time from apache_log;
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/27042095/viewspace-749496/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/27042095/viewspace-749496/