Preface
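The day before yesterday, in an interview at Qiniu, I was asked some text-processing questions. One of them was to print everything after a given field in an nginx log file. The first tool that came to mind was cut: its -f option selects fields, accepts an open-ended range that runs to the end of the line, and is fast. In the interview, though, I went with awk, spent a long time debugging, and never got it working, which was rather embarrassing. Back home, this novice looked things up and put together the following ways to print every field from a given one onward.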
Implementing it with cut
The cut command extracts columns of text from a file or a text stream. Like sed and awk, cut processes its input one line at a time.
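Command format:
cut -d 'delimiter' -f fields
Options:
-d: the delimiter used to split each line; it must be a single character and is used together with -f;
-f: selects which of the fields produced by the -d split are printed;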
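The field list given to -f can be a single number or a range. The following is a minimal sketch of the common forms; the sample line and the variable names are made up for illustration and are not from the mail log.

```shell
# Hypothetical sample line; fields are separated by single spaces.
line='f1 f2 f3 f4 f5'

third=$(printf '%s\n' "$line" | cut -d ' ' -f 3)        # only field 3
middle=$(printf '%s\n' "$line" | cut -d ' ' -f 2-4)     # fields 2 through 4
tail_part=$(printf '%s\n' "$line" | cut -d ' ' -f 4-)   # field 4 to the end of the line
head_part=$(printf '%s\n' "$line" | cut -d ' ' -f -2)   # start of the line through field 2

printf '%s\n' "$third" "$middle" "$tail_part" "$head_part"
```

The open-ended form N- is the one that answers the interview question.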
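Example:
Here is a mail log. I want to extract everything from the hostname field (localhost in the first lines) onward, that is, from the fourth field. The awk examples below use the same file.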
[root@hadoop-master log]# cat maillog-20150726
……
Jul 22 21:36:24 localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
Jul 22 21:36:24 localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed
Jul 23 01:11:57 localhost postfix/postfix-script[30240]: stopping the Postfix mail system
Jul 23 01:11:58 localhost postfix/master[4839]: terminating on signal 15
Jul 23 20:45:53 hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
Jul 23 20:47:10 hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
Jul 23 20:47:10 hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
Jul 23 21:26:03 hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
Jul 23 21:26:03 hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
Jul 23 21:26:03 hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
Using cut:
[root@hadoop-master log]# cat maillog-20150726 | cut -d ' ' -f 4-
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
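One caveat worth checking before relying on cut -d ' ': cut treats every single delimiter as a field boundary, so two spaces in a row produce an empty field. Traditional syslog timestamps typically pad single-digit days with an extra space, which would shift the field numbers. The sketch below uses a made-up log line (not taken from the file above) and squeezes repeated spaces with tr -s as one possible workaround.

```shell
# Hypothetical syslog-style line with a single-digit day, padded with two spaces.
line='Jul  2 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed'

# cut counts the empty string between the two spaces as field 2,
# so "-f 4-" starts one column too early here.
naive=$(printf '%s\n' "$line" | cut -d ' ' -f 4-)

# Squeezing runs of spaces first restores the expected field numbering.
squeezed=$(printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 4-)

printf '%s\n' "$naive" "$squeezed"
```

awk with its default field splitting does not have this problem, because it treats any run of blanks as one separator.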
Implementing it with awk
The following methods are adapted from junlee's blog.
Using printf
Print everything from the fourth field onward:
[root@hadoop-master log]# awk '{for(i=4;i<=NF;i++) printf("%s ", $i)} {print ""}' maillog-20150726
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
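The space in the printf format string separates the fields in the output. Because printf does not append a newline on its own, print "" is used to end each line. The command above can also be written as: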
[root@hadoop-master log]# awk '{for(i=4;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726
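For records (lines) with fewer than 4 fields, this prints an empty line. To avoid that, add the condition NF>=4 so that only lines that actually have a fourth field are processed:
[root@hadoop-master log]# awk 'NF>=4 {for(i=4;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726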
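Every line in this log has enough fields, so the problem does not show up here. To see it, start printing from i=15 instead: many empty lines appear, and adding the condition NF>=15 at the front removes them.
[root@hadoop-master log]# awk 'NF>=15 {for(i=15;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726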
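A side effect of the printf "%s " form is that every output line ends with a trailing space. If that matters, one possible variant builds the line in a variable and puts the separator in front of each additional field. This is only a sketch; the variable names out, trailing and joined are illustrative, and the sample line is copied from the log above.

```shell
line='Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed'

# Original form: a space is printed after every field, including the last one.
trailing=$(printf '%s\n' "$line" | awk '{for(i=4;i<=NF;i++) printf("%s ", $i); print ""}')

# Variant: start with field 4, then add " " before each following field.
joined=$(printf '%s\n' "$line" | awk 'NF>=4 {
    out = $4
    for (i = 5; i <= NF; i++) out = out " " $i
    print out
}')

# Brackets make the trailing space visible.
printf '[%s]\n' "$trailing" "$joined"
```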
Using the string functions index and substr
[root@hadoop-master log]# awk '{a=index($0,$4)} {print substr($0,a)}' maillog-20150726
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
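This finds the position at which the string in $4 first occurs in $0, stores it in a, then takes the substring of $0 starting at a and prints it. The command above can also be written as: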
[root@hadoop-master log]# awk '{a=index($0,$4); print substr($0,a)}' maillog-20150726
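For records (lines) with fewer than 4 fields, this prints the entire line, because $4 is empty and substr ends up starting at the beginning of the line. As before, adding the condition NF>=4 fixes this:
[root@hadoop-master log]# awk 'NF>=4 {a=index($0,$4); print substr($0,a)}' maillog-20150726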
When testing, you can use a larger field number to see the effect.
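This method has another problem: if the value of $4 also appears earlier in the line, index returns the position of that earlier occurrence rather than the position of $4, and the output is wrong.
For example: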
X1 X2 X3 X2 X5 X4
Here $4 is X2, but index finds the first X2, which is the second field (character position 4), so the output becomes
X2 X3 X2 X5 X4
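The pitfall is easy to reproduce on the sample line itself. The snippet below is a small check; the variable names pos and out are only for illustration.

```shell
line='X1 X2 X3 X2 X5 X4'

# index() reports the first occurrence of the string "X2", not the fourth field.
pos=$(printf '%s\n' "$line" | awk '{print index($0, $4)}')

# substr() therefore starts at the second field.
out=$(printf '%s\n' "$line" | awk '{a = index($0, $4); print substr($0, a)}')

printf '%s\n' "$pos" "$out"
```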
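Workaround: prepend a marker character to $4 before searching, so that the search string no longer matches the earlier occurrence, and skip the marker when printing. This still assumes that the marker followed by $4 does not occur earlier in the line. Note also that assigning to $4 makes awk rebuild $0 with single spaces between fields.
awk 'NF>=4 {$4="z"$4;a=index($0,$4);print substr($0,a+1)}' maillog-20150726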
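An alternative that avoids searching for the field value at all is to delete the first three fields with a regular expression anchored at the start of the line. This is a sketch of a different technique, not the method from the blog this section is based on; it assumes fields are separated by spaces or tabs, and the variable name rest is illustrative.

```shell
line='X1 X2 X3 X2 X5 X4'

rest=$(printf '%s\n' "$line" | awk 'NF>=4 {
    # Remove optional leading blanks, then three "field + blanks" groups.
    sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "")
    print
}')

printf '%s\n' "$rest"
```

Because the pattern is anchored with ^, repeated values earlier in the line cannot confuse it, and the original spacing of the remaining text is preserved.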
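Replacing field values
Set the leading fields to empty strings and then print the line; this leaves some extra spaces at the start of the output.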
[root@hadoop-master log]# awk '{for(i=1;i<4;i++) $i=""; print $0}' maillog-20150726
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
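Blanking three fields leaves the three separators that followed them, so each output line starts with three spaces. One possible cleanup is to strip the leading spaces with sub before printing. The sketch below uses a line copied from the log above; the variable names raw and clean are illustrative.

```shell
line='Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed'

# As in the command above: fields 1-3 become empty, their separators remain.
raw=$(printf '%s\n' "$line" | awk '{for (i = 1; i < 4; i++) $i = ""; print}')

# Same, but remove the leading spaces from the rebuilt line before printing.
clean=$(printf '%s\n' "$line" | awk '{for (i = 1; i < 4; i++) $i = ""; sub(/^ +/, ""); print}')

# Brackets make the leading spaces visible.
printf '[%s]\n' "$raw" "$clean"
```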
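Summary
cut is the simpler option and awk is a little more involved, but awk is more powerful and flexible for text processing. Which one to use depends on what you need.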