Shell Programming: The sed Editor and the gawk Program

宋奇山

Article link: https://blog.csdn.net/feixiang_song/article/details/17067107

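Original work. Reposting is permitted provided that the original source, the author information, and this notice are credited with a hyperlink; otherwise legal liability will be pursued. http://twentyfour.blog.51cto.com/945260/560372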

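Introduction to sed and gawk
Topics covered:
#Working with text files
#Exploring sed
#Exploring gawk
Shell scripts can automate routine tasks that involve all kinds of data in text files. Processing the contents of a text file with shell commands alone, however, is awkward. If you want to do any kind of data processing in a shell script, you should become familiar with the sed and gawk tools available on Linux: both command-line tools make it easy to format, insert, modify, and delete text elements.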

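1. Text processing
1.1. The sed editor
The sed editor is a stream editor. Unlike interactive editors such as vim or vi, it edits a data stream according to a set of rules that you supply before the editor processes the data. It reads one line of input at a time, matches that line against the supplied editor commands, modifies the data as the commands specify, and writes the result to STDOUT.
The sed command format is:
sed options script file
The options let you customize the behavior of sed, as shown in the table below:
***********************************************************
Option      Description
-e script   Add the commands given in script to the commands run while processing the input
-f file     Add the commands given in file to the commands run while processing the input
-n          Do not print each line automatically; produce output only when a print command is used
***********************************************************
1.1.1. Defining an editor command on the command line
By default, sed applies the specified commands to the STDIN input stream, so data can be piped straight into sed for processing, as in this example: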
[root@wzp ~]# echo "welcome to tencent" | sed 's/tencent/51cto/'
welcome to 51cto
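This example uses the s (substitute) command, which replaces text matching the first string between the slashes with the second string; here 51cto replaces tencent.
That edited only a single line of data. sed can process a data file in the same way: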
[root@wzp ~]# cat testfile
welcome to http://blog.51cto.com
welcome to http://blog.51cto.com
welcome to http://blog.51cto.com
[root@wzp ~]# sed 's/blog/twentyfour.blog/' testfile
welcome to http://twentyfour.blog.51cto.com
welcome to http://twentyfour.blog.51cto.com
welcome to http://twentyfour.blog.51cto.com
Or pipe the file in with cat:
[root@wzp ~]# cat testfile | sed 's/blog/twentyfour.blog/'
welcome to http://twentyfour.blog.51cto.com
welcome to http://twentyfour.blog.51cto.com
welcome to http://twentyfour.blog.51cto.com
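sed returns data almost as soon as it starts processing it. Note that sed does not modify the data in the text file; it only sends the modified text to STDOUT, so the original file is unchanged: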
[root@wzp ~]# cat testfile
welcome to http://blog.51cto.com
welcome to http://blog.51cto.com
welcome to http://blog.51cto.com
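Because sed only writes to STDOUT, keeping an edited copy means redirecting that output to a new file. The following is a minimal sketch, assuming a scratch directory created with mktemp; the names workdir and edited_file are made up for illustration:

```shell
# Create a scratch directory and a sample file (names are hypothetical).
workdir=$(mktemp -d)
printf 'welcome to http://blog.51cto.com\n' > "$workdir/testfile"

# Write the edited stream to a second file; the source file is not touched.
sed 's/blog/twentyfour.blog/' "$workdir/testfile" > "$workdir/edited_file"

original=$(cat "$workdir/testfile")
edited=$(cat "$workdir/edited_file")
printf 'original: %s\nedited:   %s\n' "$original" "$edited"
rm -rf "$workdir"
```

Redirect to a different file name than the input; redirecting onto the input file itself would truncate it before sed reads it.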
1.1.2. Using multiple editor commands on the command line
To run several commands from the sed command line, use the -e option:
[root@wzp ~]# sed -e 's/blog/www/;s/51cto/baidu/' testfile
welcome to http://www.baidu.com
welcome to http://www.baidu.com
welcome to http://www.baidu.com
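Both commands are applied to every line of the file. Commands placed on one line must be separated by a semicolon.
Alternatively, you can use the secondary prompt (>) and enter one command per line instead of separating them with semicolons: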
[root@wzp ~]# sed -e '
> s/blog/www/
> s/51cto/wzp/' testfile
welcome to http://www.wzp.com
welcome to http://www.wzp.com
welcome to http://www.wzp.com
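Note that the whole command must be finished on the line that contains the closing single quotation mark, because bash runs the command as soon as it sees the closing quote.
1.1.3. Reading editor commands from a file
If there are many sed commands to run, you can store them in a separate file and point sed at that file with the -f option, as in this example: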
[root@wzp ~]# cat script24
s/blog/www/
s/51cto/163/
No semicolons or single quotation marks are needed here; sed treats each line of the file as a separate command.
[root@wzp ~]# sed -f script24 testfile
welcome to http://www.163.com
welcome to http://www.163.com
welcome to http://www.163.com
Calling the prepared script24 file is enough to process the data.
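As a rough side-by-side sketch, the same two substitutions can be supplied from a command file with -f or inline with one -e option per command. The file name edits.sed and the scratch directory are assumptions made for this example:

```shell
workdir=$(mktemp -d)
printf 'welcome to http://blog.51cto.com\n' > "$workdir/testfile"

# One command per line; no quotes or semicolons are needed inside the file.
printf 's/blog/www/\ns/51cto/163/\n' > "$workdir/edits.sed"

from_file=$(sed -f "$workdir/edits.sed" "$workdir/testfile")
# The same two commands given inline, one -e option per command.
inline=$(sed -e 's/blog/www/' -e 's/51cto/163/' "$workdir/testfile")

printf '%s\n%s\n' "$from_file" "$inline"
rm -rf "$workdir"
```

Both invocations should print the same line, since the commands run in the same order either way.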

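1.2. The gawk program
Although sed is a handy tool for modifying text on the fly, it has its limitations. In those cases you can turn to a more advanced tool that offers something closer to a programming environment for modifying and reorganizing the data in a file.
The gawk program is the GNU version of the original Unix awk program. It provides a programming language rather than just editor commands.
1.2.1. The gawk command format
gawk options program file
The options available for gawk are:
***********************************************************
Option         Description
-F fs          Specify the field separator that delimits the data fields in a line
-f file        Specify the name of the file to read the program from
-v var=value   Define a variable used in the gawk program and its initial value
-mf N          Specify the maximum number of fields to process in the data file
-mr N          Specify the maximum record size in the data file
-W keyword     Specify the compatibility mode or warning level for gawk
***********************************************************
1.2.2. Reading the program script from the command line
A gawk program script is delimited by an opening and a closing brace, and the script commands must be placed between the two braces. Here is a simple gawk program script given on the command line: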
[root@wzp ~]# gawk '{print "hello world!"}'
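This program defines a single print command, which writes the text hello world! to STDOUT. When you run the command, however, nothing appears at first. No file name was given on the command line, so gawk reads its data from STDIN. Type anything and press Enter, and the print command runs: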
[root@wzp ~]# gawk '{print "hello world!"}'
try to test and enter
hello world!
test again
hello world!
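Just like sed, gawk runs the program script against every line of text in the data stream. Because this script prints a fixed string, the output is the same whatever is typed. To end the gawk program you have to signal that the data stream has finished. In bash, the Ctrl+D key combination generates the end-of-file (EOF) signal.
1.2.3. Using data field variables
One of gawk's main features is its ability to process the data in a text file. It does this by automatically assigning a variable to each data element in a line. By default, gawk assigns the following variables to the data fields it detects in a line of text:
* $0 is the whole line of text;
* $1 is the first data field in the line;
* $2 is the second data field in the line;
* $n is the nth data field in the line.
Fields are determined by the field separator, which by default is any run of whitespace (spaces or tabs). The following gawk program reads fields from a text file: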
[root@wzp ~]# cat test1
one line of test file
two line of test file
three  line of test file //(two spaces are deliberately left after "three")
[root@wzp ~]# gawk '{print $1}' test1
one
two
three
The $1 field variable displays the first field of each line.
[root@wzp ~]# gawk '{print $2}' test1
line
line
line
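Notice that the two spaces in the third line are treated as a single separator, so line is still shown as the second field.
To change the default separator, use the -F option. For example, use a colon as the separator to read the /etc/passwd file: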
[root@wzp ~]# tail -3 /etc/passwd
webalizer:x:67:67:Webalizer:/var/www/usage:/sbin/nologin
sabayon:x:86:86:Sabayon user:/home/sabayon:/sbin/nologin
mysql:x:27:27:MySQL Server:/home/mysql:/sbin/nologin
Take the last three lines as a sample. Fields in /etc/passwd are separated by colons, so to display particular fields with gawk you must use -F to change the separator to a colon:
[root@wzp ~]# tail -3 /etc/passwd | gawk -F: '{print $1,$6,$7}'
webalizer /var/www/usage /sbin/nologin
sabayon /home/sabayon /sbin/nologin
mysql /home/mysql /sbin/nologin
This displays the first, sixth, and seventh fields.
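The field-splitting behavior can be tried without touching the real /etc/passwd. This sketch feeds two sample records (copied from the listing above) through a pipe; it calls awk on the assumption that the system awk handles these basic features the same way, and gawk can be substituted where it is installed:

```shell
# Sample records in /etc/passwd format, copied from the listing above.
passwd_sample='sabayon:x:86:86:Sabayon user:/home/sabayon:/sbin/nologin
mysql:x:27:27:MySQL Server:/home/mysql:/sbin/nologin'

# -F: splits on colons; the comma in print joins the fields with a space.
fields=$(printf '%s\n' "$passwd_sample" | awk -F: '{print $1, $7}')

# With the default separator, a run of blanks counts as one separator.
second=$(printf 'three   line of test file\n' | awk '{print $2}')

printf '%s\n%s\n' "$fields" "$second"
```

Note that field 5 ("Sabayon user") contains a space but stays a single field, because only the colon separates fields once -F: is given.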
1.2.4. Using multiple commands in the program script
To use several commands in a script given on the command line, put a semicolon between the commands:
[root@wzp ~]# echo "welcome to 51cto" | gawk '{$3="bbs.51cto"; print $0}'
welcome to bbs.51cto
The first command assigns bbs.51cto to field $3; the second prints the whole line.
Here, too, you can use the secondary prompt and enter one script command per line:
[root@wzp ~]# gawk '{
> $3="bbs.51cto"
> print $0}'
welcome to 51cto
welcome to bbs.51cto
welcome to netease
welcome to bbs.51cto
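Once gawk '{ $3="bbs.51cto"; print $0}' is running, the third field of every line you enter is replaced with bbs.51cto.
To end the program, press Ctrl+D to send the end-of-input signal.
1.2.5. Reading the program from a file
As with sed, gawk lets you store the program in a file and refer to it on the command line: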
[root@wzp ~]# cat script24
{ print $5 "'s userid is " $1}
[root@wzp ~]# tail -3 /etc/passwd | gawk -F: -f script24
Webalizer's userid is webalizer
Sabayon user's userid is sabayon
MySQL Server's userid is mysql
The program file can also contain several commands. In that case put one command on each line; no semicolons are needed:
[root@wzp ~]# cat script25
{
text="'s userid is "
print $5 text $1
}
[root@wzp ~]# tail -3 /etc/passwd | gawk -F: -f script25
Webalizer's userid is webalizer
Sabayon user's userid is sabayon
MySQL Server's userid is mysql
The result is the same. One point to note: unlike a shell script, a gawk program does not use a dollar sign to reference a variable's value, so print displays text directly.
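A variable can also be handed to the program from the command line with the -v option listed in the options table. This is only a sketch: the variable name label is invented for the example, and awk stands in for gawk on the assumption that both behave alike here:

```shell
# -v sets an awk variable before any input is read; "label" is a made-up name.
line=$(printf 'mysql:x:27:27:MySQL Server:/home/mysql:/sbin/nologin\n' |
  awk -F: -v label="'s userid is " '{ print $5 label $1 }')

printf '%s\n' "$line"
```

Inside the program the variable is written as plain label, with no dollar sign; the dollar sign is reserved for field references such as $5.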
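1.2.6. Running a script before processing data
gawk also lets you specify when the script runs. By default, gawk reads a line of text from the input and then runs the program script on the data in that line. To run a script before any data is processed, use the BEGIN keyword: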
[root@wzp ~]# gawk 'BEGIN {print "hello world!"}'
hello world!
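The print command displays the text before any data is read. Because this program has only a BEGIN block, gawk exits right after displaying the text without waiting for input.
1.2.7. Running a script after processing data
Like BEGIN, the END keyword lets you specify a script that gawk runs after all the data has been read: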
[root@wzp ~]# gawk 'BEGIN {print "hello world!" }{print $0} END {print "goodbye"}'
hello world!
this is a test
this is a test
i would enter ctrl+d
i would enter ctrl+d
goodbye
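The program prints hello world! as soon as it starts, echoes every line entered through $0, and prints goodbye when Ctrl+D ends the input. This makes BEGIN and END well suited to producing data reports.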
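To make the report idea concrete, here is a small sketch that prints a header in BEGIN, one row per record, and a total in END. The input records and the count variable are invented for the example, and awk is used on the assumption that gawk behaves the same way:

```shell
report=$(printf 'root:/bin/bash\nmysql:/sbin/nologin\nwzp:/bin/bash\n' |
  awk -F: '
    BEGIN { print "user shell" }       # runs once, before any input is read
    { print $1, $2; count++ }          # runs for every input line
    END { print "total:", count }      # runs once, after the last line
  ')

printf '%s\n' "$report"
```

The counter needs no initialization: an unset awk variable starts out as zero when used as a number.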

2. sed editor basics
Using sed well depends on knowing its many commands and formats, all of which help you customize text editing.

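2.1. More substitution options
The s command has already been used to replace text in a line with new text. The substitute command has a few more options.
2.1.1. Substitution flags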
[root@wzp ~]# cat test2
this is a test of test file
This is the second test of the test file
[root@wzp ~]# sed 's/test/trial/' test2
this is a trial of test file
This is the second trial of the test file
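As shown above, by default the substitute command replaces only the first occurrence of the text in each line. To replace later occurrences as well, use a substitution flag. The format is:
s/pattern/replacement/flags
There are four flags:
* a number: which occurrence of the pattern in the line to replace
* g: replace every occurrence of the pattern
* p: print the line in which a substitution was made
* w file: write the result of the substitution to a file
So, to replace the second occurrence of the pattern in each line: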
[root@wzp ~]# sed 's/test/trial/2' test2
this is a test of trial file
This is the second test of the trial file
To replace every occurrence of the pattern:
[root@wzp ~]# sed 's/test/trial/g' test2
this is a trial of trial file
This is the second trial of the trial file
The p flag prints each line in which the substitute command found a match:
[root@wzp ~]# cat test3
this is a test line
This is a different line
[root@wzp ~]# sed 's/test/trial/p' test3
this is a trial line
this is a trial line
This is a different line
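sed prints every line by default, and the p flag prints each modified line a second time. Use the -n option to suppress the default output so that only the modified lines are displayed: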
[root@wzp ~]# sed -n 's/test/trial/p' test3
this is a trial line
That line is the output produced by the substitution. You can also use the w flag to save it to a file:
[root@wzp ~]# sed -n 's/test/trial/w writefile' test3
After the substitution, each line that was changed is saved to writefile, which is created automatically if it does not exist:
[root@wzp ~]# cat writefile
this is a trial line
Without the -n option, the normal sed output still appears on STDOUT, while the lines that matched are saved to the file.
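The flags are easiest to compare when applied to the same line. A short sketch, using the first line of test2 as sample text and an extra non-matching line added for the -n case:

```shell
sample='this is a test of test file'

first=$(printf '%s\n' "$sample" | sed 's/test/trial/')     # first match only
second=$(printf '%s\n' "$sample" | sed 's/test/trial/2')   # second match only
all=$(printf '%s\n' "$sample" | sed 's/test/trial/g')      # every match

# -n plus the p flag: only lines where a substitution happened are shown.
changed=$(printf '%s\nno match here\n' "$sample" | sed -n 's/test/trial/p')

printf '%s\n' "$first" "$second" "$all" "$changed"
```

The last command prints a single line: the non-matching line is dropped because -n turns off the default output.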
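2.1.2. Substituting characters
Some characters in a text string are awkward to use in a substitution, for example the forward slash (/).
Each one has to be escaped with a backslash, which quickly gets messy. For example, to change a user's shell: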
[root@wzp ~]# cat /etc/passwd | sed 's/\/bin\/bash/\/bin\/csh/' | more
root:x:0:0:root:/root:/bin/csh
......
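Because the forward slash is the string delimiter, any forward slash in the pattern text must be escaped with a backslash.
This is confusing and error-prone, so sed lets you choose a different character as the string delimiter in the substitute command:
[root@wzp ~]# cat /etc/passwd | sed 's!/bin/bash!/bin/csh!' | more
root:x:0:0:root:/root:/bin/csh
Here the exclamation mark is the string delimiter, which makes the path names much easier to read and understand.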
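The two spellings can be checked against each other on a single sample record rather than the live /etc/passwd. In this sketch the variable names are illustrative:

```shell
entry='root:x:0:0:root:/root:/bin/bash'

# Slash as delimiter: every slash in the paths must be escaped.
escaped=$(printf '%s\n' "$entry" | sed 's/\/bin\/bash/\/bin\/csh/')
# Exclamation mark as delimiter: the paths can be written as they are.
readable=$(printf '%s\n' "$entry" | sed 's!/bin/bash!/bin/csh!')

printf '%s\n%s\n' "$escaped" "$readable"
```

Whichever character is chosen as the delimiter must then be escaped if it appears in the pattern or the replacement, so pick one that does not occur in the text.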

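2.2. Using addresses
By default, the commands used in sed apply to every line of text. To apply a command to one particular line or to a group of lines, use line addressing.
sed has two forms of line addressing:
* a numeric range of lines
* a text pattern that filters lines
2.2.1. Numeric line addressing
With numeric addressing, you refer to a line by its position in the text stream. sed numbers the first line of the stream 1 and increases the number by one for each following line. An address can be a single line number or a range given as a starting line number, a comma, and an ending line number: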
[root@wzp ~]# cat test4
my profession is something about linux
my profession is something about linux
my profession is something about linux
[root@wzp ~]# sed '2s/linux/unix/' test4
my profession is something about linux
my profession is something about unix
my profession is something about linux
This replaces linux with unix in the second line only.
[root@wzp ~]# sed '2,3s/linux/unix/' test4
my profession is something about linux
my profession is something about unix
my profession is something about unix
This replaces the matching text in lines two through three.
[root@wzp ~]# sed '2,$s/profession/hobby/' test4
my profession is something about linux
my hobby is something about linux
my hobby is something about linux
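This replaces the matching text from the second line through the last line; the dollar sign stands for the last line.
2.2.2. Using text pattern filters
The other way to restrict the lines a command applies to is a text pattern: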
[root@wzp ~]# tail -1 /etc/passwd | sed '/mysql/s!/sbin/nologin!/bin/login!'
mysql:x:27:27:MySQL Server:/home/mysql:/bin/login
On lines that contain mysql, /sbin/nologin is changed to /bin/login.
2.2.3. Grouping commands
To run more than one command on a single line, group the commands with braces:
[root@wzp ~]# sed '2{
> s/profession/hobby/
> s/linux/unix/
> }' test4
my profession is something about linux
my hobby is something about unix
my profession is something about linux
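Braces work with a pattern address just as they do with a line number. The sketch below uses three invented sample lines to contrast a numeric range with a pattern-addressed group:

```shell
data=$(printf 'job: linux admin\njob: linux dev\njob: linux ops\n')

# Numeric range: from line 2 through the last line ($).
numeric=$(printf '%s\n' "$data" | sed '2,$s/linux/unix/')

# Pattern address with a group: both commands run only on lines matching /dev/.
pattern=$(printf '%s\n' "$data" | sed '/dev/{
s/job/hobby/
s/linux/unix/
}')

printf '%s\n---\n%s\n' "$numeric" "$pattern"
```

Writing each grouped command on its own line, with the closing brace on a line by itself, is the form most sed implementations accept.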

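2.3. Deleting lines
Substitution is not the only sed command. To delete specific lines from the text stream, use the d command. Be careful: without an address, d deletes every line in the stream, as here: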
[root@wzp ~]# cat test5
this is number 1
this is number 2
this is number 3
this is number 4
[root@wzp ~]# sed 'd' test5
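The command above produces no output because every line was deleted. The delete command is therefore normally used with an address: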
[root@wzp ~]# sed '2d' test5
this is number 1
this is number 3
this is number 4
This deletes the second line. The dollar sign can be used here too:
[root@wzp ~]# sed '3,$d' test5
this is number 1
this is number 2
This deletes everything from the third line to the last line.
Pattern matching also works with the delete command:
[root@wzp ~]# sed '/number 3/d' test5
this is number 1
this is number 2
this is number 4
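This deletes the line that contains number 3.
Note: none of these delete operations touches the original file. Lines are removed only from the output sent to STDOUT: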
[root@wzp ~]# cat test5
this is number 1
this is number 2
this is number 3
this is number 4
The test5 file is still intact.
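The three addressing styles for d can be exercised on in-memory data shaped like test5, which avoids any risk to a real file. A minimal sketch:

```shell
data=$(printf 'this is number 1\nthis is number 2\nthis is number 3\nthis is number 4\n')

by_line=$(printf '%s\n' "$data" | sed '2,3d')             # delete lines 2-3
by_pattern=$(printf '%s\n' "$data" | sed '/number 3/d')   # delete matching lines
none_left=$(printf '%s\n' "$data" | sed 'd')              # no address: delete all

printf '%s\n---\n%s\n---\n[%s]\n' "$by_line" "$by_pattern" "$none_left"
```

The last result is empty, which is why an unaddressed d is rarely what you want.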

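2.4. Inserting and appending text
sed lets you insert and append text in the data stream. The two operations differ as follows:
* the insert command (i) adds a new line before the specified line
* the append command (a) adds a new line after the specified line
An example makes this clear: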
[root@wzp ~]# cat test5
this is number 1
this is number 2
this is number 3
this is number 4
[root@wzp ~]# sed '3i\
> this is an insert number' test5
this is number 1
this is number 2
this is an insert number
this is number 3
this is number 4
A line is added before the third line.
[root@wzp ~]# sed '3a\
> this is an insert number' test5
this is number 1
this is number 2
this is number 3
this is an insert number
this is number 4
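A line is added after the third line (think of a as the first letter of "after").
To add a new line at the very top, use 1i. To add several lines, end each new line with a backslash until you reach the last line of text to add: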
[root@wzp ~]# sed '1i\
> this is first insert number\
> this is second insert number ' test5
this is first insert number
this is second insert number
this is number 1
this is number 2
this is number 3
this is number 4
Both lines are added to the data stream.
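In a script there is no secondary prompt, so the text simply continues on the next line inside the quotes. This sketch applies both commands to the same line of some invented three-line input:

```shell
data=$(printf 'line 1\nline 2\nline 3\n')

# i\ places its text before line 2, a\ places its text after line 2.
result=$(printf '%s\n' "$data" | sed '2i\
inserted before line 2
2a\
appended after line 2')

printf '%s\n' "$result"
```

Each command name is followed by a backslash and a newline, and the text to add sits on the following line.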

2.5. Changing lines
Besides adding lines, you can also change a line, using the c command:
[root@wzp ~]# sed '3c\
> this is the changed number' test5
this is number 1
this is number 2
this is the changed number
this is number 4
The third line, selected by line number, has been replaced. You can also select the line with a text pattern:
[root@wzp ~]# sed '/ber 3/c\
> this is the changed number' test5
this is number 1
this is number 2
this is the changed number
this is number 4
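ber 3 occurs in the third line, so that line is replaced.
The change command also accepts an address range, but then the whole range is replaced by a single copy of the text: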
[root@wzp ~]# sed '2,3c\
> this is the changed number' test5
this is number 1
this is the changed number
this is number 4
Note that this is the changed number appears only once, not twice.

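2.6. The transform command
The transform command (y) is the only sed command that operates on individual characters. Its format is:
y/inchars/outchars/
The transform command maps each character in inchars to the character at the same position in outchars. If inchars and outchars differ in length, sed reports an error. For example: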
[root@wzp ~]# sed 'y/12/56/' test5
this is number 5
this is number 6
this is number 3
this is number 4
1 is changed to 5 and 2 to 6, a one-to-one mapping.
[root@wzp ~]# sed 'y/123/56/' test5
sed: -e expression #1, char 9: strings for `y' command are different lengths
The error occurs because inchars and outchars have different lengths.
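Unlike s, the y command has no flags and always affects every occurrence of the listed characters in a line. A brief sketch, including a lower-to-upper-case mapping as one plausible use:

```shell
# Every 1 and every 2 is mapped, not just the first occurrence.
digits=$(printf '1 2 1 2\n' | sed 'y/12/56/')

# A longer mapping: each lowercase letter maps to its uppercase counterpart.
upper=$(printf 'this is number 1\n' |
  sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')

printf '%s\n%s\n' "$digits" "$upper"
```

Characters that are not listed in inchars, such as the digit in the second example, pass through unchanged.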

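2.7. Printing commands
The p flag was used earlier to display the lines changed by sed. Three further commands print information from the data stream:
* the p command, which prints a line of text
* the = command, which prints line numbers
* the l command, which lists a line, making non-printing characters visible
2.7.1. Printing lines
The p command is often used to print lines that contain text matching a pattern: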
[root@wzp ~]# sed -n '/number 1/p' test5
this is number 1
The -n option suppresses the output of all other lines, so only lines matching the pattern are printed; here, the line that contains number 1.
[root@wzp ~]# sed -n '2,3p' test5
this is number 2
this is number 3
This prints the second and third lines. You can also use the secondary prompt and combine printing with a substitution:
[root@wzp ~]# sed -n '/3/{
> p
> s/this/that/p
> }' test5
this is number 3
that is number 3
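The first p prints the matching line (the third) as it is; the s command then changes this to that, and its p flag prints the modified line.
2.7.2. Printing line numbers
The = command prints the number of the current line in the data stream. sed treats each newline character in the stream as the end of a line of text: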
[root@wzp ~]# sed '=' test5
1
this is number 1
2
this is number 2
3
this is number 3
4
this is number 4
So the line number can be printed before the text of the line:
[root@wzp ~]# sed -n '/ber 2/{=;p}' test5
2
this is number 2
sed first selects the lines that match the pattern ber 2, then prints the line number with = and the line itself with p.
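Combined with addresses, = gives a quick way to locate a line or count lines. A small sketch on data shaped like test5:

```shell
data=$(printf 'this is number 1\nthis is number 2\nthis is number 3\nthis is number 4\n')

# Line number followed by the line itself, for matching lines only.
numbered=$(printf '%s\n' "$data" | sed -n '/number 3/{=;p;}')

# Addressing = with $ prints the number of the last line, i.e. the line count.
last=$(printf '%s\n' "$data" | sed -n '$=')

printf '%s\n%s\n' "$numbered" "$last"
```

The number is printed on its own line, ahead of the text it belongs to.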

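2.8. Using files with sed
sed also has commands that work with files without requiring a substitution, as described below.
2.8.1. Writing to a file
The w command writes lines of text to the specified file: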
[root@wzp ~]# sed '1,2w test4' test5
this is number 1
this is number 2
this is number 3
this is number 4
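This saves the first two lines of test5 to test4. If test4 does not exist it is created; if it already exists, its previous contents are overwritten. You can add the -n option to keep the lines from being displayed on STDOUT.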
[root@wzp ~]# cat test4
this is number 1
this is number 2
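The selected lines were saved to the specified file. Both files are in the same directory here; otherwise, give a path to the file.
Writing lines selected by a text pattern works the same way: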
[root@wzp ~]# sed -n '/ber 3/w test4' test5
[root@wzp ~]# cat test4
this is number 3
The line that contains ber 3 is saved to test4, and nothing is displayed on STDOUT.
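Since w overwrites its target, a scratch directory is a safer place to experiment than a directory holding real files. In this sketch the directory comes from mktemp and the output file name matches is made up; the sed script is double-quoted so the shell can expand the path:

```shell
workdir=$(mktemp -d)
printf 'this is number 1\nthis is number 2\nthis is number 3\n' > "$workdir/test5"

# The file name after w runs to the end of the command, so it comes last.
shown=$(sed -n "/number [12]/w $workdir/matches" "$workdir/test5")
saved=$(cat "$workdir/matches")

printf 'stdout: [%s]\nsaved:\n%s\n' "$shown" "$saved"
rm -rf "$workdir"
```

With -n nothing reaches STDOUT, so the matching lines end up only in the file.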
2.8.2. Reading data from a file
The r command inserts the text of a file after the line selected by an address:
[root@wzp ~]# cat insertfile
this is an added line
this is another added line
This is the file that will be read.
[root@wzp ~]# sed '2r insertfile' test5
this is number 1
this is number 2
this is an added line
this is another added line
this is number 3
this is number 4
The text read from insertfile is inserted after the second line of test5.
A text pattern works as well:
[root@wzp ~]# sed '/ber 3/r insertfile' test5
this is number 1
this is number 2
this is number 3
this is an added line
this is another added line
this is number 4
The text read from insertfile is inserted after the line that matches ber 3.
To add the text at the end of the stream, use the dollar sign:
[root@wzp ~]# sed '$r insertfile' test5
this is number 1
this is number 2
this is number 3
this is number 4
this is an added line
this is another added line
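The same r command can be scripted end to end. This sketch builds two small files in a scratch directory (the file names main and extra are invented) and reads one into the other at two different positions:

```shell
workdir=$(mktemp -d)
printf 'header\nbody\n' > "$workdir/main"
printf 'added line\n' > "$workdir/extra"

# Read the extra file in after line 1, then after the last line.
# In double quotes the $ address is written \$ so the shell leaves it alone.
after_first=$(sed "1r $workdir/extra" "$workdir/main")
at_end=$(sed "\$r $workdir/extra" "$workdir/main")

printf '%s\n---\n%s\n' "$after_first" "$at_end"
rm -rf "$workdir"
```

As with w, the file name extends to the end of the command, so r should be the last thing in its script line.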