Linux: Finding the paragraphs that contain a keyword (an awk implementation)
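On AIX, the grep command's "-p" option finds the paragraphs that contain a keyword. (Here a paragraph means a record delimited by blank lines: paragraphs are separated from each other by at least one blank line.) For example, the following text has two paragraphs: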
$ cat test.txt
Hello,world

This is a file with
two paragraph.
The following command finds every paragraph in db2diag.log that records a database deactivation:
$ grep -ip 'DEACTIVATED' db2diag.log
2017-09-17-12.03.33.048373+480 E1594733A513 LEVEL: Event
PID : 19726438 TID : 3343 PROC : db2sysc 0
INSTANCE: e105q5a NODE : 000 DB : SAMPLE
APPHDL : 0-81 APPID: *LOCAL.e105q5a.170917035458
AUTHID : E105Q5A HOSTNAME: db2b
EDUID : 3343 EDUNAME: db2agent (idle) 0
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::FreeResourcesOnDBShutdown, probe:15127
STOP : DATABASE: SAMPLE : DEACTIVATED: NO

2017-09-17-12.03.58.149245+480 E1601224A513 LEVEL: Event
PID : 19726438 TID : 3343 PROC : db2sysc 0
INSTANCE: e105q5a NODE : 000 DB : SAMPLE
APPHDL : 0-109 APPID: *LOCAL.e105q5a.170917040333
AUTHID : E105Q5A HOSTNAME: db2b
EDUID : 3343 EDUNAME: db2agent (idle) 0
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::FreeResourcesOnDBShutdown, probe:15127
STOP : DATABASE: SAMPLE : DEACTIVATED: NO

2017-09-17-12.16.49.507211+480 E1609705A513 LEVEL: Event
PID : 19726438 TID : 3343 PROC : db2sysc 0
INSTANCE: e105q5a NODE : 000 DB : SAMPLE
APPHDL : 0-125 APPID: *LOCAL.e105q5a.170917040401
AUTHID : E105Q5A HOSTNAME: db2b
EDUID : 3343 EDUNAME: db2agent (idle) 0
FUNCTION: DB2 UDB, base sys utilities, sqeLocalDatabase::FreeResourcesOnDBShutdown, probe:15127
STOP : DATABASE: SAMPLE : DEACTIVATED: NO
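On Linux, GNU grep offers no equivalent: it has no lowercase -p option (uppercase -P selects Perl-compatible regular expressions, which is unrelated), and no other option prints whole paragraphs. So let's approach it differently: can a paragraph be treated as a single "line"? It can, and this is where awk shows its strength. awk has two relevant built-in variables: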
ORS terminates each record on output, initially = "\n".
RS input record separator, initially = "\n".
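RS is the input record ("line") separator, a newline '\n' by default. If each paragraph is to be one record, then the separator between records is two or more consecutive newlines, so setting RS to "\n\n+" is all it takes. The mawk manual provides an excellent example: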
12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy. Setting RS = "\n\n+",
makes one or more blank lines separate records. If FS = " " (the default), then single newlines, by
the rules for <SPACE> above, become space and single newlines are field separators.
For example, if a file is "a b\nc\n\n", RS = "\n\n+" and FS = " ", then there is one record
"a b\nc" with three fields "a", "b" and "c". Changing FS = "\n", gives two fields "a b" and
"c"; changing FS = "", gives one field identical to the record.
If you want lines with spaces or tabs to be considered blank, set RS = "\n([ \t]*\n)+". For compatibility
with other awks, setting RS = "" has the same effect as if blank lines are stripped from the
front and back of files and then records are determined as if RS = "\n\n+". Posix requires that "\n"
always separates records when RS = "" regardless of the value of FS. mawk does not support this convention,
because defining "\n" as <SPACE> makes it unnecessary.
Most of the time when you change RS for multi-line records, you will also want to change ORS to
"\n\n" so the record spacing is preserved on output.
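The manual's field-counting example can be checked directly. The following is a minimal sketch, assuming an awk that treats RS as a regular expression (mawk or gawk, for instance); count_fields is a hypothetical helper name, not part of awk.

```shell
#!/bin/sh
# count_fields FS: print NF for each record read from stdin, with paragraphs
# as records (RS = "\n\n+") and the field separator given as $1.
# Hypothetical helper, for illustration only.
count_fields() {
    awk -v fs="$1" 'BEGIN { RS = "\n\n+"; FS = fs } { print NF }'
}

# The manual's input: one record "a b\nc".
printf 'a b\nc\n\n' | count_fields ' '    # default FS: fields "a", "b", "c"
printf 'a b\nc\n\n' | count_fields '\n'   # FS = "\n": fields "a b" and "c"
```

The FS = "" case is left out on purpose because implementations differ: gawk, for instance, splits the record into individual characters instead of producing a single field.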
So on Linux the command is:
$ awk 'BEGIN {RS = "\n\n+";ORS = "\n\n"} /DEACTIVATED/ {print $0}' db2diag.log
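To try the command without a db2diag.log at hand, here is a minimal sketch on a throwaway two-paragraph file. The helper name para_grep and the temporary file are hypothetical, and an awk that accepts a regular expression as RS (such as mawk or gawk) is assumed.

```shell
#!/bin/sh
# para_grep KEYWORD FILE: print every blank-line-separated paragraph of FILE
# that matches KEYWORD (treated as a regular expression).
# Hypothetical helper name, for illustration only.
para_grep() {
    awk -v kw="$1" 'BEGIN { RS = "\n\n+"; ORS = "\n\n" } $0 ~ kw { print $0 }' "$2"
}

sample=$(mktemp)                 # throwaway demo file
trap 'rm -f "$sample"' EXIT      # clean up when the script exits
printf 'Hello,world\n\nThis is a file with\ntwo paragraph.\n\n' > "$sample"

para_grep 'file' "$sample"       # prints only the second paragraph
```

Passing the keyword through -v keeps the awk program fixed while the pattern varies; the trade-off is that awk processes backslash escapes in the value, so patterns containing backslashes need extra care.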
To invert the match, that is, to print the paragraphs that do not contain the keyword, put ! in front of the pattern:
$ awk 'BEGIN {RS = "\n\n+";ORS = "\n\n"} !/DEACTIVATED/ {print $0}' db2diag.log
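One difference from the AIX command is worth noting: grep -ip ignores case, while the awk patterns above are case-sensitive. A possible way to get the same behaviour is to lowercase each record before matching, as in this sketch. The function name and the two-paragraph input are made up for the demonstration and are not real db2diag.log content.

```shell
#!/bin/sh
# Case-insensitive paragraph match: lowercase the record with tolower(),
# compare it against a lowercase pattern, and print the original record.
# Reads stdin; the function name is hypothetical.
grep_para_deactivated() {
    awk 'BEGIN { RS = "\n\n+"; ORS = "\n\n" } tolower($0) ~ /deactivated/ { print $0 }'
}

# Made-up input: only the first paragraph mentions a deactivation.
printf 'STOP : Deactivated: NO\n\nSTART : ACTIVATED: YES\n\n' | grep_para_deactivated
```

tolower() is part of standard awk, so this does not rely on implementation-specific switches for case-insensitive matching.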
Alternatively, you can set RS to the empty string, which has the same effect:
$ awk 'BEGIN {RS = "";ORS = "\n\n"} /DEACTIVATED/ {print $0}' db2diag.log
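RS = "" is the paragraph mode that POSIX defines for awk, so it is also the safer choice on an awk that does not accept a regular expression as RS. In this mode leading blank lines do not produce empty records, which the following sketch illustrates by counting paragraphs; count_paragraphs is a hypothetical helper name.

```shell
#!/bin/sh
# count_paragraphs: number of blank-line-separated paragraphs on stdin,
# using paragraph mode (RS = ""). Hypothetical helper name.
count_paragraphs() {
    awk 'BEGIN { RS = "" } END { print NR }'
}

# Two leading blank lines, then "foo", three newlines, then "bar".
printf '\n\nfoo\n\n\nbar\n' | count_paragraphs   # counts 2 paragraphs
```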
Records can be delimited in other ways too; all it takes is setting RS to the appropriate value.
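As an illustration of a different delimiter, the sketch below treats a line consisting of "---" as the record separator. The input format, the field names in it, and the helper name split_on_dashes are all invented for the example; a multi-character RS again assumes an awk such as mawk or gawk.

```shell
#!/bin/sh
# split_on_dashes PATTERN: read stdin, use a "---" line as the record
# separator, and print the records that match PATTERN, keeping the same
# separator on output. Hypothetical helper name.
split_on_dashes() {
    awk -v pat="$1" 'BEGIN { RS = "\n---\n"; ORS = "\n---\n" } $0 ~ pat { print $0 }'
}

# Invented two-record input.
printf 'id: 1\nstate: ok\n---\nid: 2\nstate: failed\n---\n' | split_on_dashes 'failed'
```

Setting ORS to the same separator mirrors the manual's advice about "\n\n": the output keeps the layout of the input, so it can be fed through the same command again.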