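Nginx logs are well standardized and mature, and ready-made formats are available for reference.
1. Basic preparation
Nginx log format (sample line):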
192.168.0.20 - - [01/Aug/2021:14:53:35 +0800] "GET /demo HTTP/1.1" 404 3650 "-" "Chrome xxx" "-"
Nginx log format configuration:
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
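Grok extracts fields from log messages, using either built-in patterns or custom expressions. While debugging, a pattern can be verified in Kibana (Dev Tools > Grok Debugger).
Built-in pattern reference: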
/logstash-7.15.1/vendor/bundle/jruby/2.5.0/gems/logstash-patterns-core-4.3.1/patterns/ecs-v1/grok-patterns
USERNAME [a-zA-Z0-9._-]+
USER %{USERNAME}
EMAILLOCALPART [a-zA-Z][a-zA-Z0-9_.+-=:]+
EMAILADDRESS %{EMAILLOCALPART}@%{HOSTNAME}
INT (?:[+-]?(?:[0-9]+))
BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
NUMBER (?:%{BASE10NUM})
BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
POSINT \b(?:[1-9][0-9]*)\b
NONNEGINT \b(?:[0-9]+)\b
WORD \b\w+\b
NOTSPACE \S+
SPACE \s*
DATA .*?
GREEDYDATA .*
QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
# URN, allowing use of RFC 2141 section 2.3 reserved characters
URN urn:[0-9A-Za-z][0-9A-Za-z-]{0,31}:(?:%[0-9a-fA-F]{2}|[0-9A-Za-z()+,.:=@;$_!*'/?#-])+
...............
Grok syntax for extraction with a custom regex: (?<field_name>custom regex)
(?<remote_addr>\d+\.\d+\.\d+\.\d+)
Syntax for extraction with a built-in pattern:
%{PATTERN_NAME:field_name}
%{IP:remote_addr} - (%{WORD:remote_user}|-) \[%{HTTPDATE:time_local}\] "%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER}" %{NUMBER:status} %{NUMBER:body_bytes_sent} %{QS} %{QS:http_user_agent}
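To see which fields the pattern above pulls out of the sample line, here is a rough Python approximation. It is only a sketch: grok runs on the Oniguruma engine, and the hand-written sub-expressions below are simplified stand-ins for the built-in IP, HTTPDATE, NUMBER and QS patterns, not their real definitions. The names `NGINX_LINE`, `SAMPLE` and `parse_line` are made up for this illustration.

```python
import re

# Simplified stand-ins for the grok built-ins used in the pattern above.
NGINX_LINE = re.compile(
    r'(?P<remote_addr>\d{1,3}(?:\.\d{1,3}){3}) - (?:(?P<remote_user>\w+)|-) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<request>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d+) (?P<body_bytes_sent>\d+) '
    r'"[^"]*" (?P<http_user_agent>"[^"]*")'
)

SAMPLE = ('192.168.0.20 - - [01/Aug/2021:14:53:35 +0800] '
          '"GET /demo HTTP/1.1" 404 3650 "-" "Chrome xxx" "-"')

def parse_line(line):
    """Return the named groups as a dict, or None when the line does not match."""
    m = NGINX_LINE.search(line)
    return m.groupdict() if m else None

fields = parse_line(SAMPLE)
```

As with QS in grok, the user agent is captured together with its surrounding double quotes, and all values come back as strings.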
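Regular expression symbols
. matches any single character (except a newline by default); * means the preceding character occurs zero or more times
[abc] matches any one character inside the brackets; [^abc] matches any character not inside the brackets
[0-9] matches a digit, [a-z] a lowercase letter, [A-Z] an uppercase letter, [a-zA-Z] any letter, [a-zA-Z0-9] any letter or digit
[^0-9] matches a non-digit
^xx matches text that starts with xx; xx$ matches text that ends with xx
\s matches a whitespace character, \S a non-whitespace character, \d a digit
? means the preceding character occurs zero or one time; + means it occurs one or more times
{a} means the preceding character occurs exactly a times; {a,b} means it occurs between a and b times
{,b} means the preceding character occurs between 0 and b times; {a,} means it occurs a or more times
string1|string2 matches either string1 or string2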
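The rules above can be checked quickly with Python's `re` module. This is a sketch only: grok uses Oniguruma, which agrees with Python on these basic constructs but differs on more advanced ones. The test strings are arbitrary examples.

```python
import re

# (pattern, text, expected) triples, at least one per rule listed above
CASES = [
    (r"a.c", "abc", True),             # . is any single character
    (r"ab*", "a", True),               # * allows zero occurrences
    (r"ab*", "abbb", True),            # ... or many
    (r"[abc]", "b", True),
    (r"[^abc]", "a", False),
    (r"[a-zA-Z0-9]+", "Nginx7", True),
    (r"[^0-9]+", "demo", True),
    (r"\S+\s\S+", "Chrome xxx", True),
    (r"\d+", "3650", True),
    (r"https?", "http", True),         # ? allows zero or one
    (r"\d{3}", "404", True),
    (r"\d{2,4}", "3650", True),
    (r"\d{,2}", "", True),             # {,b} starts at zero
    (r"\d{2,}", "12345", True),
    (r"GET|POST", "POST", True),
]

def full_match(pattern, text):
    """True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

results = [full_match(p, t) == expected for p, t, expected in CASES]

# ^ and $ are anchors, so they are easier to see with search() on a longer string
starts_with_get = re.search(r"^GET", "GET /demo HTTP/1.1") is not None
ends_with_version = re.search(r"1\.1$", "GET /demo HTTP/1.1") is not None
```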
2. Nginx log file ingestion configuration
Configuration file:
input {
  file {
    path => "/var/log/nginx/access.log"
  }
}
filter {
  grok {
    match => {
      "message" => '%{IP:remote_addr} - (%{WORD:remote_user}|-) \[%{HTTPDATE:time_local}\] "%{WORD:method} %{NOTSPACE:request} HTTP/%{NUMBER}" %{NUMBER:status} %{NUMBER:body_bytes_sent} %{QS:url_path} %{QS:http_user_agent}'
    }
    remove_field => ["message"]
  }
  date {
    match => ["time_local", "dd/MMM/yyyy:HH:mm:ss Z"]
    target => "@timestamp"
  }
  mutate {
    gsub => [ "http_user_agent", '"', "" ]
    convert => {
      "status" => "integer"
      "body_bytes_sent" => "integer"
    }
    remove_field => ["time_local"]
  }
}
output {
  elasticsearch {
    hosts => ["http://192.168.0.90:9200", "http://192.168.0.92:9200"]
    user => "elastic"
    password => "password"
    index => "nginx-log-%{+YYYY.MM.dd}"
  }
}
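http_user_agent still contains its surrounding double quotes (QS captures them), so gsub removes them.
status and body_bytes_sent are converted to integers.
The time recorded in the log line is used as the event time (@timestamp).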
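The three notes above can be illustrated with a small Python sketch that mimics what the date and mutate filters do to the extracted fields. It is an approximation, not Logstash's implementation: the function name `normalize` is invented here, and `%b` in `strptime` depends on an English locale for month names such as "Aug".

```python
from datetime import datetime, timezone

def normalize(event):
    """Mimic the date and mutate steps on a dict of extracted string fields."""
    out = dict(event)
    # date filter: "dd/MMM/yyyy:HH:mm:ss Z" corresponds roughly to this strptime format
    local_time = datetime.strptime(out.pop("time_local"), "%d/%b/%Y:%H:%M:%S %z")
    # Logstash stores @timestamp in UTC
    out["@timestamp"] = local_time.astimezone(timezone.utc).isoformat()
    # mutate gsub: drop the double quotes kept by QS
    out["http_user_agent"] = out["http_user_agent"].replace('"', "")
    # mutate convert: string -> integer
    out["status"] = int(out["status"])
    out["body_bytes_sent"] = int(out["body_bytes_sent"])
    return out

event = normalize({
    "time_local": "01/Aug/2021:14:53:35 +0800",
    "http_user_agent": '"Chrome xxx"',
    "status": "404",
    "body_bytes_sent": "3650",
})
```

Note that 14:53:35 at +0800 becomes 06:53:35 in UTC, which is why timestamps viewed directly in Elasticsearch can look shifted relative to the log file.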
Routing events whose pattern match failed to a separate index:
output {
  if "_grokparsefailure" not in [tags] and "_dateparsefailure" not in [tags] {
    elasticsearch {
      hosts => ["http://192.168.0.90:9200", "http://192.168.0.92:9200"]
      user => "elastic"
      password => "password"
      index => "nginx-log-%{+YYYY.MM.dd}"
    }
  }
  else {
    # events that failed grok or date parsing go to a separate index (the name is an example)
    elasticsearch {
      hosts => ["http://192.168.0.90:9200", "http://192.168.0.92:9200"]
      user => "elastic"
      password => "password"
      index => "nginx-log-failure-%{+YYYY.MM.dd}"
    }
  }
}
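The routing decision made by this conditional can be sketched in Python as follows. This is an illustration only: `choose_index` and the `nginx-log-failure` prefix are hypothetical names, and the date suffix is assumed to follow the event's @timestamp in UTC, as `%{+YYYY.MM.dd}` does.

```python
from datetime import datetime, timezone

# Tags that the grok and date filters add when they cannot process an event
FAILURE_TAGS = {"_grokparsefailure", "_dateparsefailure"}

def choose_index(tags, timestamp, ok_prefix="nginx-log",
                 failed_prefix="nginx-log-failure"):
    """Pick the daily index name for an event, based on its tags."""
    failed = bool(FAILURE_TAGS & set(tags))
    prefix = failed_prefix if failed else ok_prefix
    day = timestamp.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{day}"

ts = datetime(2021, 8, 1, 6, 53, 35, tzinfo=timezone.utc)
good_index = choose_index([], ts)
bad_index = choose_index(["_grokparsefailure"], ts)
```

Keeping failed events in their own index makes it easy to inspect lines the pattern does not cover without polluting the main index's field mappings.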
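3. Grok pattern examples
3.1 Extracting a fixed-length digit string with a custom pattern
Extract a digit string of a chosen length from the message; this example uses 28 digits.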
22-03-03 00:00:00 INFO [PAY_SUBSIDYORDER]STEP[2][OK[183]ms]>SUBSIDY_TRANSFER_DONE{in:{amount=10.00, orderNo=1000000000202202010752324090, payAmount=10.00}}
Grok pattern:
(?<order_no>\d{28})
Output:
{
"order_no": "1000000000202202010752324090"
}
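One thing worth knowing about `\d{28}`: it also matches the first 28 digits of a longer number. The Python sketch below reproduces the extraction and shows a stricter variant that uses lookarounds to require exactly 28 digits. The stricter pattern is a suggestion, not part of the original example, and the names `LOOSE` and `STRICT` are made up here; lookarounds of this kind are also supported by grok's regex engine.

```python
import re

LINE = ("22-03-03 00:00:00 INFO [PAY_SUBSIDYORDER]STEP[2][OK[183]ms]>"
        "SUBSIDY_TRANSFER_DONE{in:{amount=10.00, "
        "orderNo=1000000000202202010752324090, payAmount=10.00}}")

# Same idea as the grok pattern above (Python spells named groups as (?P<name>...))
LOOSE = re.compile(r"(?P<order_no>\d{28})")
# Stricter: no digit immediately before or after the 28-digit run
STRICT = re.compile(r"(?<!\d)(?P<order_no>\d{28})(?!\d)")

order_no = LOOSE.search(LINE).group("order_no")
strict_order_no = STRICT.search(LINE).group("order_no")

# A 30-digit number: the loose pattern still matches, the strict one does not
too_long = "id=" + "1" * 30
loose_hit = LOOSE.search(too_long) is not None
strict_hit = STRICT.search(too_long) is not None
```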
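3.2 Extracting a mobile phone number with a custom pattern
Extract the mobile phone number (along with the order number and user key) from the string.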
125.210.239.129 - - [01/Mar/2022:03:50:02 +0800] "POST /box-cweb/api/cpn/v1/query/myCpns HTTP/1.1" 200 3144 "https://www.demo.com/qd-mall/?orderNo=1000000000202203010348123456&userKey=1415822942094924567&mobile=13708951234"
Grok pattern:
(?<orderNo>orderNo=\d{28})&(?<userKey>userKey=\d{19})&(?<mobile>mobile=\d{11})
Output:
{
"orderNo": "orderNo=1000000000202203010348123456",
"mobile": "mobile=13708951234",
"userKey": "userKey=1415822942094924567"
}
Before the event is sent to Elasticsearch, strip the fixed prefixes (orderNo=, userKey=, mobile=) from the values.
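There are two ways to get rid of the prefixes, sketched here in Python as an illustration. One keeps the literal text outside the capture group so that it never enters the field; the other removes the prefix afterwards from values captured as shown above. The helper names `extract_ids` and `strip_prefixes` are hypothetical.

```python
import re

REFERER = ("https://www.demo.com/qd-mall/?orderNo=1000000000202203010348123456"
           "&userKey=1415822942094924567&mobile=13708951234")

# Option 1: the literal "name=" sits outside each named group
QUERY = re.compile(
    r"orderNo=(?P<orderNo>\d{28})&userKey=(?P<userKey>\d{19})&mobile=(?P<mobile>\d{11})"
)

def extract_ids(text):
    """Return the three values without their prefixes, or {} when nothing matches."""
    m = QUERY.search(text)
    return m.groupdict() if m else {}

# Option 2: clean up values that were captured together with "name="
def strip_prefixes(fields):
    cleaned = {}
    for name, value in fields.items():
        prefix = name + "="
        cleaned[name] = value[len(prefix):] if value.startswith(prefix) else value
    return cleaned

ids = extract_ids(REFERER)
cleaned = strip_prefixes({
    "orderNo": "orderNo=1000000000202203010348123456",
    "mobile": "mobile=13708951234",
    "userKey": "userKey=1415822942094924567",
})
```

The first option carries over to grok directly, since a custom capture such as `orderNo=(?<orderNo>\d{28})` leaves the prefix out of the field.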