Linux中gsub函数,一次gawk gsub函数的实际应用

在做一个数据清洗需求的时候,需要查询两张表里几个字段相同的重复数据。大概思路就是用exists语句,类似:

select *

from a

where exists (select 1

from b

where a.col1 = b.col1

and a.col2 = a.col2);

但是这里麻烦的地方在于要匹配的列太多了:

a.INFOCODE,a.SOURCENAME,a.SOURCETYPE,a.PUBLISHTYPE,a.NOTICEDATE,a.ENDDATE,a.NOTICETITLE,a.LANGUAGE,a.IMPORTLEVEL,a.SOURCEURL,a.ATTACHTYPE,a.ATTACHNAME,a.ATTACHSIZE,a.FORM,a.ACCESSORYNUM,a.NOTICESTATE,a.PUBLISHDATE,a.FILENUMBER

用Linux文本处理的方法解决这个问题:

先将这段放到一个文本里:

root@bd-dev-mingshuo-183:/tmp#more 1

a.INFOCODE,a.SOURCENAME,a.SOURCETYPE,a.PUBLISHTYPE,a.NOTICEDATE,a.ENDDATE,a.NOTICETITLE,a.LANGUAGE,a.IMPORTLEVEL,a.SOURCEURL,a.ATTACHTYPE,a.ATTACHNAME,a.ATTACHSIZE,a.FORM,a.ACCESSORYNUM,

a.NOTICESTATE,a.PUBLISHDATE,a.FILENUMBER

这里介绍一下gawk里的gsub函数

gsub匹配所有的符合正则表达式的内容,然后替换,相当于 sed 's//g'

语法如下:

gsub(regular expression, subsitution string, target string);

处理的目标范围是第三个字段,匹配条件是第一个参数,匹配后,替换为第二个参数。

将一行文本处理为多行文本:

root@bd-dev-mingshuo-183:/tmp#more 1|gawk 'gsub(/,/,"\n",$0)'

a.INFOCODE

a.SOURCENAME

a.SOURCETYPE

a.PUBLISHTYPE

a.NOTICEDATE

a.ENDDATE

a.NOTICETITLE

a.LANGUAGE

a.IMPORTLEVEL

a.SOURCEURL

a.ATTACHTYPE

a.ATTACHNAME

a.ATTACHSIZE

a.FORM

a.ACCESSORYNUM

a.NOTICESTATE

a.PUBLISHDATE

a.FILENUMBER

复制每一列:

root@bd-dev-mingshuo-183:/tmp#more 1|gawk 'gsub(/,/,"\n",$0)'|gawk -F'\n' '{print "on",$0,"=",$0,"and"}'

on a.INFOCODE = a.INFOCODE and

on a.SOURCENAME = a.SOURCENAME and

on a.SOURCETYPE = a.SOURCETYPE and

on a.PUBLISHTYPE = a.PUBLISHTYPE and

on a.NOTICEDATE = a.NOTICEDATE and

on a.ENDDATE = a.ENDDATE and

on a.NOTICETITLE = a.NOTICETITLE and

on a.LANGUAGE = a.LANGUAGE and

on a.IMPORTLEVEL = a.IMPORTLEVEL and

on a.SOURCEURL = a.SOURCEURL and

on a.ATTACHTYPE = a.ATTACHTYPE and

on a.ATTACHNAME = a.ATTACHNAME and

on a.ATTACHSIZE = a.ATTACHSIZE and

on a.FORM = a.FORM and

on a.ACCESSORYNUM = a.ACCESSORYNUM and

on a.NOTICESTATE = a.NOTICESTATE and

on a.PUBLISHDATE = a.PUBLISHDATE and

on a.FILENUMBER = a.FILENUMBER and

替换

root@bd-dev-mingshuo-183:/tmp#more 1|gawk 'gsub(/,/,"\n",$0)'|gawk -F'\n' '{print $0,"=",$0,"and"}'|sed 's/= a/= b/g'

a.INFOCODE = b.INFOCODE and

a.SOURCENAME = b.SOURCENAME and

a.SOURCETYPE = b.SOURCETYPE and

a.PUBLISHTYPE = b.PUBLISHTYPE and

a.NOTICEDATE = b.NOTICEDATE and

a.ENDDATE = b.ENDDATE and

a.NOTICETITLE = b.NOTICETITLE and

a.LANGUAGE = b.LANGUAGE and

a.IMPORTLEVEL = b.IMPORTLEVEL and

a.SOURCEURL = b.SOURCEURL and

a.ATTACHTYPE = b.ATTACHTYPE and

a.ATTACHNAME = b.ATTACHNAME and

a.ATTACHSIZE = b.ATTACHSIZE and

a.FORM = b.FORM and

a.ACCESSORYNUM = b.ACCESSORYNUM and

a.NOTICESTATE = b.NOTICESTATE and

a.PUBLISHDATE = b.PUBLISHDATE and

a.FILENUMBER = b.FILENUMBER and

这样就得到了目标文本。黏贴到sql中,执行就好了。

处理过程比较简单,重点在于gawk里的gsub函数的应用,以及处理思路。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值