Shell Reading Notes 11: A First Look at sed and gawk

云端漫步的程狗子

Last modified: 2022-03-12 20:06:04

First published: 2022-03-07 22:42:12

Article link: https://blog.csdn.net/a777122/article/details/123341243

I. Text Processing

1. The sed editor

The sed command has the following format:

sed options script file

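sed command options

Option	Description
-e script	Adds the commands in script to the commands run while processing the input
-f file	Adds the commands in file to the commands run while processing the input
-n	Suppresses the automatic output of each line; output is produced only by the print command (p)

The script parameter specifies a single command to apply to the stream data. To use more than one command, either specify them on the command line with the -e option or put them in a separate file and use the -f option.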

1.1 Defining an editor command on the command line

$ echo "This is a test" | sed 's/test/big test/'
This is a big test
$

The s command replaces text that matches the first pattern between the slashes with the second text string.

$ cat data1.txt
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
$
$ sed 's/dog/cat/' data1.txt
The quick brown fox jumps over the lazy cat.
The quick brown fox jumps over the lazy cat.
The quick brown fox jumps over the lazy cat.
The quick brown fox jumps over the lazy cat.
$

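It is important to remember that the sed editor does not modify the data in the text file. It only sends the modified data to STDOUT. If you look at the original text file, it still contains the original data.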
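
One way to check this is to compare what sed prints with what stays on disk. The sketch below assumes a scratch file created with mktemp; the file and the variable names are illustrative and not part of the original notes.

```shell
# Hypothetical scratch file; mktemp picks the name.
datafile=$(mktemp)
printf 'The quick brown fox jumps over the lazy dog.\n' > "$datafile"

# Capture what sed writes to STDOUT.
edited=$(sed 's/dog/cat/' "$datafile")

# Read the file back: it still holds the original text.
original=$(cat "$datafile")

echo "sed output:   $edited"
echo "file on disk: $original"

rm -f "$datafile"
```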

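1.2 Using multiple editor commands on the command line
To run more than one command on the sed command line, use the -e option.
The commands must be separated by semicolons, with no space between the end of a command and the semicolon.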

$ sed -e 's/brown/green/; s/dog/cat/' data1.txt
The quick green fox jumps over the lazy cat.
The quick green fox jumps over the lazy cat.
The quick green fox jumps over the lazy cat.
The quick green fox jumps over the lazy cat.
$
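
The -e option can also be repeated, once per command, because each script is added to the list of commands. A minimal sketch comparing the two spellings (the variable names are illustrative):

```shell
line='The quick brown fox jumps over the lazy dog.'

# One -e with semicolon-separated commands.
with_semicolon=$(echo "$line" | sed -e 's/brown/green/; s/dog/cat/')

# One -e per command; sed runs them in the order given.
with_two_e=$(echo "$line" | sed -e 's/brown/green/' -e 's/dog/cat/')

echo "$with_semicolon"
echo "$with_two_e"
```

Both forms should print the same edited line.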

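1.3 Reading editor commands from a file
If you have a lot of sed commands to process, it is usually easier to put them in a separate file.
Use the -f option to specify the file in the sed command.
You do not need a semicolon after each command; put each command on its own line.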

$ cat script1.sed
s/brown/green/
s/fox/elephant/
s/dog/cat/
$
$ sed -f script1.sed data1.txt
The quick green elephant jumps over the lazy cat.
The quick green elephant jumps over the lazy cat.
The quick green elephant jumps over the lazy cat.
The quick green elephant jumps over the lazy cat.
$
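
The same workflow can be reproduced end to end in a script. This sketch assumes a scratch directory from mktemp and recreates script1.sed and a one-line data1.txt there; the comment line inside the sed script is an addition for illustration.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory

# One command per line; lines starting with # are comments.
cat > "$workdir/script1.sed" <<'EOF'
# recolor the fox, then swap the animals
s/brown/green/
s/fox/elephant/
s/dog/cat/
EOF

printf 'The quick brown fox jumps over the lazy dog.\n' > "$workdir/data1.txt"

result=$(sed -f "$workdir/script1.sed" "$workdir/data1.txt")
echo "$result"

rm -rf "$workdir"
```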

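2. The gawk program

gawk is the GNU version of the original Unix awk program. In the gawk programming language you can:
- define variables to store data;
- use arithmetic and string operators to work on data;
- use structured programming concepts (such as if-then statements and loops) to add logic to your data processing;
- extract data elements from a data file, rearrange or reformat them, and generate formatted reports.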
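
The features listed above can be sketched in a few lines. The input data, the variable names, and the fallback to plain awk when gawk is not installed are all assumptions made for this illustration; the END block is a standard awk feature that these notes do not cover.

```shell
# Use gawk when available; fall back to awk (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

report=$(printf 'apples 3\npears 12\nplums 7\n' | "$AWK" '
{
    total = total + $2                                     # variable and arithmetic
    if ($2 >= 10) size = "large"; else size = "small"      # if-then-else logic
    printf "%s: %d (%s)\n", $1, $2, size                   # formatted report line
}
END { printf "total: %d\n", total }')

echo "$report"
```
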
2.1 gawk command format

The basic format of the gawk program is:
```
gawk options program file
```

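gawk options

Option	Description
-F fs	Specifies the field separator that delimits data fields in a line
-f file	Reads the program from the specified file
-v var=value	Defines a variable in the gawk program and its default value
-mf N	Specifies the maximum number of fields to process in the data file
-mr N	Specifies the maximum number of records (lines) in the data file
-W keyword	Specifies the compatibility mode or warning level for gawk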

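2.2 Reading the program script from the command line
The script commands must be enclosed in a pair of braces ({}), and the whole script must be wrapped in single quotes so that the shell passes it to gawk as one text string.
The following example specifies a simple gawk program script on the command line: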
```
$ gawk '{print "Hello World!"}'
```
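
Because no file name is given, gawk reads its data from STDIN and runs the script once for every line it receives. A minimal sketch that feeds it two lines through a pipe (the input text and the fallback to awk are assumptions):

```shell
# Use gawk when available; fall back to awk (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Two input lines arrive on STDIN, so the script body runs twice.
greetings=$(printf 'first line\nsecond line\n' | "$AWK" '{print "Hello World!"}')
echo "$greetings"
```
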
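2.3 Using data field variables
One of the main features of gawk is its ability to work with the data in a text file. It automatically assigns a variable to each data element in a line. By default, gawk assigns the following variables to the data fields it finds in a line of text:
- $0 represents the entire line of text;
- $1 represents the first data field in the line;
- $2 represents the second data field in the line;
- $n represents the nth data field in the line.
Each data field in a line is delimited by a field separator. When gawk reads a line of text, it splits the line into data fields using the defined field separator. The default field separator in gawk is any whitespace character (for example, a space or a tab).
In the following example, the gawk program reads a text file and displays only the value of the first data field.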
```
$ cat data2.txt
One line of test text.
Two lines of test text.
Three lines of test text.
$
$ gawk '{print $1}' data2.txt
One
Two
Three
$
```
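This program uses the $1 field variable to display only the first data field of each line of text.
To read a file that uses a different field separator, specify it with the -F option.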
```
$ gawk -F: '{print $1}' /etc/passwd
root
bin
daemon
adm
lp
sync
shutdown
halt
mail
[...]
```
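This short program displays the first data field of the system's password file. Because the /etc/passwd file uses colons to separate its data fields, you must specify the colon as the field separator in the gawk options to split out each data element.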
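
The effect of -F is easier to see with and without the option on the same input. The records below are made-up passwd-style lines, used so the sketch does not depend on the real /etc/passwd; the fallback to awk is also an assumption.

```shell
# Use gawk when available; fall back to awk (an assumption for portability).
AWK=$(command -v gawk || command -v awk)

# Hypothetical passwd-style records.
records='root:x:0:0:root:/root:/bin/bash
rich:x:1000:1000:Rich:/home/rich:/bin/bash'

# Without -F the line has no whitespace, so the whole line is field 1.
whole_line=$(echo "$records" | "$AWK" '{print $1}' | head -n 1)

# With -F: field 1 is the login name and field 7 is the shell.
logins=$(echo "$records" | "$AWK" -F: '{print $1, $7}')

echo "$whole_line"
echo "$logins"
```
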
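2.4 Using multiple commands in the program script
To use multiple commands in a program script on the command line, place a semicolon between the commands.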
```
$ echo "My name is Rich" | gawk '{$4="Christine"; print $0}'
My name is Christine
$
```
You can also use the secondary prompt to enter the program script commands one line at a time.
```
$ gawk '{
> $4="Christine"
> print $0}'
My name is Rich
My name is Christine
$
```

2.5 Reading the program from a file
As with the sed editor, gawk lets you store a program in a file and then refer to that file on the command line.

$ cat script2.gawk
{print $1 "'s home directory is " $6}
$
$ gawk -F: -f script2.gawk /etc/passwd
root's home directory is /root
bin's home directory is /bin
daemon's home directory is /sbin
adm's home directory is /var/adm
lp's home directory is /var/spool/lpd
[...]
Christine's home directory is /home/Christine
Samantha's home directory is /home/Samantha
Timothy's home directory is /home/Timothy
$
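
A program file can span several lines and use its own variables. The sketch below is a variation on script2.gawk under stated assumptions: the scratch directory, the sample record, the text variable, and the fallback to awk are all illustrative.

```shell
# Use gawk when available; fall back to awk (an assumption for portability).
AWK=$(command -v gawk || command -v awk)
workdir=$(mktemp -d)   # hypothetical scratch directory

# A multi-line program: one command per line, no semicolons needed.
cat > "$workdir/script2.gawk" <<'EOF'
{
    text = "'s home directory is "
    print $1 text $6
}
EOF

# Hypothetical passwd-style record instead of the real /etc/passwd.
printf 'rich:x:1000:1000:Rich:/home/rich:/bin/bash\n' > "$workdir/passwd.sample"

homes=$("$AWK" -F: -f "$workdir/script2.gawk" "$workdir/passwd.sample")
echo "$homes"

rm -rf "$workdir"
```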

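II. sed Editor Basics

1. More substitution options
1.1 Substitution flags
By default, the substitute command replaces only the first occurrence of the pattern in each line.
To make it replace occurrences at other positions in a line, you must use a substitution flag.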

$ cat data4.txt
This is a test of the test script.
This is the second test of the test script.
$
$ sed 's/test/trial/' data4.txt
This is a trial of the test script.
This is the second trial of the test script.
$

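The substitution flag is placed after the substitute command string.

s/pattern/replacement/flags

Four substitution flags are available:
- a number, indicating which occurrence of the pattern the new text should replace;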

$ sed 's/test/trial/2' data4.txt
This is a test of the trial script.
This is the second test of the trial script.
$

- g, indicating that the new text should replace every occurrence of the pattern;

$ sed 's/test/trial/g' data4.txt
This is a trial of the trial script.
This is the second trial of the trial script.
$

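- p, which prints the line after a substitution has been made; it is usually combined with the sed -n option so that only the lines modified by the substitute command are output;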

$ cat data5.txt
This is a test line.
This is a different line.
$
$ sed -n 's/test/trial/p' data5.txt
This is a trial line.
$

- w file, which writes the result of the substitution (only the lines that were changed) to a file.

$ sed 's/test/trial/w test.txt' data5.txt
This is a trial line.
This is a different line.
$
$ cat test.txt
This is a trial line.
$
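
Flags can also be combined. As a small sketch with made-up input, g and p together with -n print only the lines that changed, with every match on those lines replaced:

```shell
input='This is a test of the test script.
This line has no match.'

# g replaces every match on a line; p prints the line only when a
# substitution happened; -n suppresses the default output.
changed=$(echo "$input" | sed -n 's/test/trial/gp')
echo "$changed"
```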

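1.2 Replacing characters
Sometimes the text string contains characters that are awkward to use in the substitution pattern. A common example in Linux is the forward slash (/). Because the forward slash is the string delimiter, each one in the pattern or the replacement must be escaped with a backslash: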

$ sed 's/\/bin\/bash/\/bin\/csh/' /etc/passwd
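
The escaped form is hard to read. sed also accepts a different character as the string delimiter of the substitute command, which removes the need for backslashes. A minimal sketch using an exclamation mark and a made-up passwd-style record:

```shell
# Hypothetical passwd-style record.
entry='rich:x:1000:1000:Rich:/home/rich:/bin/bash'

# Escaped slashes with the default delimiter.
escaped=$(echo "$entry" | sed 's/\/bin\/bash/\/bin\/csh/')

# The same substitution with ! as the delimiter.
readable=$(echo "$entry" | sed 's!/bin/bash!/bin/csh!')

echo "$escaped"
echo "$readable"
```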

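2. Using addresses
To apply a command only to a specific line or group of lines, you must use line addressing.
The sed editor has two forms of line addressing:
- a numeric range of lines
- a text pattern that filters for lines
Both forms use the same format for specifying the address: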

[address]command

You can also group several commands for a specific address:

address {
command1
command2
command3
}
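
As a concrete instance of the grouped form, the sketch below applies two substitutions to line 2 only. The three-line input is made up for the illustration.

```shell
data='The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.'

# Both s commands inside the braces run only on line 2.
grouped=$(echo "$data" | sed '2{
s/fox/elephant/
s/dog/cat/
}')
echo "$grouped"
```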

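2.1 Numeric line addressing
A numeric address is either a single line number or a range written as start,end; the dollar sign ($) stands for the last line.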

$ sed '2s/dog/cat/' data1.txt
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy cat
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
$

$ sed '2,3s/dog/cat/' data1.txt
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy cat
The quick brown fox jumps over the lazy cat
The quick brown fox jumps over the lazy dog
$

$ sed '2,$s/dog/cat/' data1.txt
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy cat
The quick brown fox jumps over the lazy cat
The quick brown fox jumps over the lazy cat
$
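
The second form of addressing, a text pattern, is written between slashes in front of the command. A minimal sketch with made-up passwd-style records that changes the shell for one user only:

```shell
# Hypothetical passwd-style records.
users='rich:x:1000:1000:Rich:/home/rich:/bin/bash
samantha:x:1001:1001:Samantha:/home/samantha:/bin/bash'

# Only lines that match /samantha/ are passed to the s command.
switched=$(echo "$users" | sed '/samantha/s/bash/csh/')
echo "$switched"
```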

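3. Deleting lines
The delete command (d) does what its name says: it deletes every line that matches the specified address. If you forget to include an address, every line of text in the stream is deleted.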

$ cat data1.txt
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog
$
$ sed 'd' data1.txt
$

The delete command is most useful when combined with an address. You can delete specific lines from the data stream, either by line number:

$ cat data6.txt
This is line number 1.
This is line number 2.
This is line number 3.
This is line number 4.
$
$ sed '3d' data6.txt
This is line number 1.
This is line number 2.
This is line number 4.
$

or by a specific range of lines:

$ sed '2,3d' data6.txt
This is line number 1.
This is line number 4.
$

or by using the special end-of-file character ($):

$ sed '3,$d' data6.txt
This is line number 1.
This is line number 2.
$

The pattern-matching feature of the sed editor also applies to the delete command:

$ sed '/number 1/d' data6.txt
This is line number 2.
This is line number 3.
This is line number 4.
$

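The sed editor does not modify the original file. The lines you delete only disappear from the output of the sed editor; the original file still contains the "deleted" lines.
Once you add the -i option, however, sed rewrites the file itself, and the change cannot be undone.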
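
One way to soften that risk is to give -i a backup suffix. The sketch below assumes a sed that accepts -i with an attached suffix (GNU sed and BSD sed both do; the option is not part of POSIX) and works in a scratch directory with made-up data.

```shell
workdir=$(mktemp -d)   # hypothetical scratch directory
printf 'line 1\nline 2\nline 3\n' > "$workdir/data.txt"

# -i.bak edits data.txt in place and keeps the original as data.txt.bak.
sed -i.bak '2d' "$workdir/data.txt"

after=$(cat "$workdir/data.txt")
backup=$(cat "$workdir/data.txt.bak")

echo "edited file:"; echo "$after"
echo "backup copy:"; echo "$backup"

rm -rf "$workdir"
```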

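4. Inserting and appending text
- The insert command (i) adds a new line before the specified line;
- The append command (a) adds a new line after the specified line.
The format is: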

sed '[address]command\
new line'

With the insert command, the text appears before the data stream text.

$ echo "Test Line 2" | sed 'i\Test Line 1'
Test Line 1
Test Line 2
$

With the append command, the text appears after the data stream text.

$ echo "Test Line 2" | sed 'a\Test Line 1'
Test Line 2
Test Line 1
$

The following example inserts a new line before line 3 of the data stream.

$ sed '3i\
> This is an inserted line.' data6.txt
This is line number 1.
This is line number 2.
This is an inserted line.
This is line number 3.
This is line number 4.
$

The following example appends a new line after line 3 of the data stream.

$ sed '3a\
> This is an appended line.' data6.txt
This is line number 1.
This is line number 2.
This is line number 3.
This is an appended line.
This is line number 4.
$
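
Both commands can appear in one script. As a sketch with made-up text, inserting before line 1 and appending after the last line ($) wraps the stream in a header and a footer:

```shell
body='This is line number 1.
This is line number 2.'

# 1i\ inserts before the first line; $a\ appends after the last line.
framed=$(echo "$body" | sed '1i\
Header line
$a\
Footer line')
echo "$framed"
```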

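5. Changing lines
The change command (c) replaces the entire text of the addressed line; it uses the same format as the insert and append commands.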

$ sed '3c\
> This is a changed line of text.' data6.txt
This is line number 1.
This is line number 2.
This is a changed line of text.
This is line number 4.
$
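
The change command also works with a text pattern as its address. A minimal sketch with made-up input that replaces whichever line matches the pattern:

```shell
data='This is line number 1.
This is line number 2.
This is line number 3.'

# Replace the line that matches /number 2/ with new text.
replaced=$(echo "$data" | sed '/number 2/c\
This is a changed line of text.')
echo "$replaced"
```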

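6. The transform command
The transform command (y) is the only sed editor command that operates on single characters.
It performs a one-to-one mapping of the inchars and outchars values. The first character in inchars is converted to the first character in outchars, the second character in inchars to the second character in outchars, and so on until all the specified characters have been processed. If inchars and outchars are not the same length, the sed editor produces an error message.
The transform command has the following format.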
```
[address]y/inchars/outchars/
```
```
$ sed 'y/123/789/' data8.txt
This is line number 7.
This is line number 8.
This is line number 9.
This is line number 4.
This is line number 7 again.
This is yet another line.
This is the last line in the file.
$
```
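
A common use of the one-to-one mapping is case conversion. The sketch below, with made-up input, maps every lowercase letter to its uppercase counterpart and then checks that sed rejects a mapping whose two sides differ in length.

```shell
# Every character in the first set maps to the character at the same
# position in the second set, everywhere on the line.
shouted=$(echo "hello sed, hello gawk" |
    sed 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')
echo "$shouted"

# inchars and outchars of different lengths make sed exit with an error.
if echo "abc" | sed 'y/abc/xy/' >/dev/null 2>&1; then
    mismatch="accepted"
else
    mismatch="rejected"
fi
echo "unequal lengths: $mismatch"
```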