Linux Text Processing: Printing Only the Content After a Given Field

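Preface

In an interview at Qiniu the day before yesterday, I was asked some text-processing questions. One of them was to print everything after a given field in an nginx log file. The first tool that came to mind was cut: its -f option selects fields, a range can run to the end of the line, and it is fast. In the moment, though, I went with awk, spent a long time debugging without getting it to work, and it was rather embarrassing. Back home, this newbie looked things up and put together the following ways to print everything from a given field onward.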

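Implementation with cut

The cut command extracts columns of text from a file or a text stream; it is a tool for cutting pieces out of each line. cut processes its input one line at a time, the same model that sed and awk use.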

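Command format:

cut -d 'delimiter' -f fields

Options:

-d: the delimiter used to split each line (a single character; TAB if -d is omitted). Used together with -f;
-f: the fields to print after the line has been split on the -d delimiter, for example N, N-M, or N- for field N through the end of the line;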
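The forms that a field list can take are easier to see on a tiny input. The sketch below uses a made-up record with ':' as the delimiter; the variable name line and the data are illustrative only.

```shell
# Sample record with ':' as the delimiter (made-up data, not from the log).
line='a:b:c:d:e'

printf '%s\n' "$line" | cut -d ':' -f 2      # a single field
printf '%s\n' "$line" | cut -d ':' -f 2-4    # a closed range of fields
printf '%s\n' "$line" | cut -d ':' -f 3-     # field 3 through the end of the line
printf '%s\n' "$line" | cut -d ':' -f -2     # the start of the line through field 2
```

The open-ended form N- is the one that matters for this article, since it prints a field and everything after it.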

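Example:

Here is a mail log; I want to print everything from the hostname field (localhost in the first lines) onward. The awk examples below use the same file.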

[root@hadoop-master log]# cat maillog-20150726 
……
Jul 22 21:36:24 localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
Jul 22 21:36:24 localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
Jul 22 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed
Jul 23 01:11:57 localhost postfix/postfix-script[30240]: stopping the Postfix mail system
Jul 23 01:11:58 localhost postfix/master[4839]: terminating on signal 15
Jul 23 20:45:53 hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
Jul 23 20:47:10 hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
Jul 23 20:47:10 hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
Jul 23 21:26:03 hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
Jul 23 21:26:03 hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
Jul 23 21:26:03 hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……

Doing it with cut:

[root@hadoop-master log]# cat maillog-20150726  | cut -d ' ' -f 4-
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
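One caveat worth checking on real logs: cut treats every single space as a delimiter, so a line whose day of the month is padded with an extra space would shift the field numbers. The line below is a hypothetical variant of one log entry (the day changed to a padded "2"), and squeezing spaces with tr -s is just one possible workaround, not something the original log required.

```shell
# Hypothetical syslog-style line: the day "2" is padded, so two spaces follow "Jul".
line='Jul  2 21:36:24 localhost postfix/qmgr[4846]: 25EA1340106: removed'

# Each single space is a delimiter, so an empty field 2 appears and "4-" starts at the time.
printf '%s\n' "$line" | cut -d ' ' -f 4-

# Squeezing runs of spaces first restores the expected field numbering.
printf '%s\n' "$line" | tr -s ' ' | cut -d ' ' -f 4-
```

Note that tr -s ' ' squeezes repeated spaces everywhere in the line, including inside the message text, which may or may not be acceptable.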

Implementation with awk

The methods below were learned from junlee's blog.

Using printf

Print everything from the fourth field onward:

[root@hadoop-master log]# awk '{for(i=4;i<=NF;i++) printf("%s ", $i)} {print ""}' maillog-20150726 
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain> 
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active) 
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox) 
localhost postfix/qmgr[4846]: 25EA1340106: removed 
localhost postfix/postfix-script[30240]: stopping the Postfix mail system 
localhost postfix/master[4839]: terminating on signal 15 
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128 
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system 
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix 
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root> 
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain> 
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active) 
……

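printf adds a space after each field. Because printf does not append a newline on its own, a separate print "" ends each line; every output line therefore carries a trailing space. The command above can also be written as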

[root@hadoop-master log]# awk '{for(i=4;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726

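For records (lines) with fewer than 4 fields, this prints an empty line. To avoid that, add the condition NF>=4 so that only lines with at least four fields are processed:

[root@hadoop-master log]# awk 'NF>=4 {for(i=4;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726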

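Every line in this log has plenty of fields, so the problem does not show up here. To see it, start printing from i=15: many empty lines appear, and a guard on NF in front of the action removes them. The run below uses NF>15, which also skips lines that have exactly 15 fields; NF>=15 would keep them.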

[root@hadoop-master log]# awk 'NF>15 {for(i=15;i<=NF;i++) printf("%s ", $i); print ""}' maillog-20150726 
to 
to
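One way to avoid both the trailing space and the empty lines, sketched here under the assumption that awk's default separators are in use: pick the separator per field, OFS between fields and ORS after the last one. The function name print_from_4 and the sample input are made up for illustration.

```shell
# Print field 4 onward; the separator is OFS between fields and ORS after the last one.
# Lines with fewer than 4 fields never enter the loop body, so they print nothing at all.
print_from_4() {
    awk '{ for (i = 4; i <= NF; i++) printf("%s%s", $i, (i < NF ? OFS : ORS)) }' "$@"
}

# Three sample lines: five fields, two fields, exactly four fields.
printf '%s\n' 'a b c d e' 'x y' 'p q r s' | print_from_4
```

With no file argument the function reads standard input; with one, it could be called as print_from_4 maillog-20150726.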

String functions index and substr

[root@hadoop-master log]# awk '{a=index($0,$4)} {print substr($0,a)}' maillog-20150726 
……
localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
localhost postfix/qmgr[4846]: 25EA1340106: removed
localhost postfix/postfix-script[30240]: stopping the Postfix mail system
localhost postfix/master[4839]: terminating on signal 15
hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……

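index($0,$4) returns the position at which the string in $4 first occurs in $0; this is stored in a. substr($0,a) then takes everything from position a onward, and it is printed. The command above can also be written as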

[root@hadoop-master log]# awk '{a=index($0,$4); print substr($0,a)}' maillog-20150726

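For records (lines) with fewer than 4 fields, $4 is empty and the whole line is printed. As before, the condition NF>=4 restricts the action to lines that have at least four fields:

[root@hadoop-master log]# awk 'NF>=4 {a=index($0,$4); print substr($0,a)}' maillog-20150726

To test this, use a larger field number than 4 and look at the output.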

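This method has a pitfall of its own: if the value of $4 already occurs earlier in the line, index returns that earlier position rather than the position of $4, and the output is wrong.

For example: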

X1 X2 X3 X2 X5 X4

Here $4 is X2, but index finds the first X2, the one in field 2, at character position 4, so the output becomes

X2 X3 X2 X5 X4
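The pitfall can be reproduced directly with the example line. This small check prints the value that index returns followed by the resulting substring; the a= label is added only to make the output readable.

```shell
# The example line: field 4 ("X2") also appears earlier, as field 2.
printf '%s\n' 'X1 X2 X3 X2 X5 X4' |
    awk '{ a = index($0, $4); print "a=" a; print substr($0, a) }'
# a=4
# X2 X3 X2 X5 X4
```

index works on characters, not fields, so it has no way to know which of the two X2 strings was meant.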

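Workaround: prepend a marker character to $4 before searching, then skip the marker when printing. This assumes the marker followed by the field value does not occur earlier in the line; assigning to $4 also makes awk rebuild $0 with single spaces between fields.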

awk 'NF>=4 {$4="z"$4; a=index($0,$4); print substr($0,a+1)}' maillog-20150726
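A different approach, offered as a sketch rather than as the method from the referenced blog, avoids searching for the field text altogether: delete the first three fields from the front of the line with sub(). It assumes fields are separated by spaces or tabs (awk's default splitting); the function name strip_first_3 is made up for illustration.

```shell
# Remove the first three fields together with the whitespace around them,
# leaving the rest of the line (including its original spacing) untouched.
strip_first_3() {
    awk 'NF >= 4 { for (i = 1; i < 4; i++) sub(/^[ \t]*[^ \t]+[ \t]+/, ""); print }' "$@"
}

# The problematic line from the example: the repeated X2 no longer matters.
printf '%s\n' 'X1 X2 X3 X2 X5 X4' | strip_first_3
```

Because each sub() call is anchored at the start of the line, a repeated field value cannot mislead it, and spacing inside the remaining text is preserved.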

Replacing field values

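Set the unwanted fields to empty strings and then print the line. The separators between the emptied fields remain, so each line starts with extra spaces.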

[root@hadoop-master log]# awk '{for(i=1;i<4;i++) $i=""; print $0}' maillog-20150726
……
   localhost postfix/cleanup[28928]: 25EA1340106: message-id=<20150723043624.25EA1340106@localhost.localdomain>
   localhost postfix/qmgr[4846]: 25EA1340106: from=<root@localhost.localdomain>, size=522, nrcpt=1 (queue active)
   localhost postfix/local[28930]: 25EA1340106: to=<root@localhost.localdomain>, orig_to=<root>, relay=local, delay=0.26, delays=0.1/0.15/0/0, dsn=2.0.0, status=sent (delivered to mailbox)
   localhost postfix/qmgr[4846]: 25EA1340106: removed
   localhost postfix/postfix-script[30240]: stopping the Postfix mail system
   localhost postfix/master[4839]: terminating on signal 15
   hadoop-master postfix[2087]: fatal: parameter inet_interfaces: no local interface found for 192.168.186.128
   hadoop-master postfix/postfix-script[2836]: starting the Postfix mail system
   hadoop-master postfix/master[2837]: daemon started -- version 2.6.6, configuration /etc/postfix
   hadoop-master postfix/pickup[2846]: D057D340106: uid=0 from=<root>
   hadoop-master postfix/cleanup[3414]: D057D340106: message-id=<20150724042603.D057D340106@hadoop-master.localdomain>
   hadoop-master postfix/qmgr[2847]: D057D340106: from=<root@hadoop-master.localdomain>, size=530, nrcpt=1 (queue active)
……
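If the leading spaces are unwanted, they can be stripped before printing. The sketch below assumes the default output separator (a single space) and adds an NF>=4 guard so that short lines are skipped; the function name blank_first_3 is made up for illustration.

```shell
# Empty fields 1-3, then drop the separators left behind at the start of the line.
blank_first_3() {
    awk 'NF >= 4 { for (i = 1; i < 4; i++) $i = ""; sub(/^ +/, ""); print }' "$@"
}

# One line taken from the sample log.
printf '%s\n' 'Jul 23 01:11:58 localhost postfix/master[4839]: terminating on signal 15' |
    blank_first_3
```

As with any assignment to a field, awk rebuilds the line with single spaces between fields, so runs of whitespace in the original text are not preserved.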

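Summary

cut is the simpler tool for this job; awk takes a little more work but is more powerful and flexible for text processing. Which one to use depends on what you need.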
