Brilliant sed Examples, Explained

jcwKyl

Article link: https://blog.csdn.net/jcwKyl/article/details/4446447

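All of the examples below come from the sed info documentation. I recently installed LFS, and since installing the packages is pure repetition I decided to write an automated install script, which was a good excuse to study sed properly. The examples in the info documentation struck me as brilliant, so I have annotated them and recorded them here.
Example 1: centering lines of text. The script: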
1   #!/usr/bin/sed -f
2
3   # Put 80 spaces in the buffer
4   1 {
5     x
6     s/^$/          /
7     s/^.*$/&&&&&&&&/
8     x
9   }
10
11  # del leading and trailing spaces
12  y/tab/ /
13  s/^ *//
14  s/ *$//
15
16  # add a new-line and 80 spaces to end of line
17  G
18
19  # keep first 81 chars (80 + a new-line)
20  s/^\(.\{81\}\).*$/\1/
21
22  # \2 matches half of the spaces, which are moved to the beginning
23  s/^\(.*\)\n\(.*\)\2/\2\1/
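Line 1 is the script header (the shebang line).
The 1 on line 4 is an address: it selects the first line of input (input lines are numbered from 1, not from 0). The braces that follow enclose a command group; whenever a set of commands has to be applied to one address or address range, they are wrapped in braces. Here is what this group does. The x command on line 5 swaps the contents of the hold buffer and the pattern buffer. The pattern buffer is the one sed is currently working on, and the hold buffer is an auxiliary one; below, "main buffer" means the pattern buffer and "secondary buffer" means the hold buffer. The secondary buffer is empty when the script starts, so after the swap the main buffer is empty. Line 6 is a substitution that replaces the empty line with 10 spaces (in the command, ^ is the start of the line and $ is the end), so the main buffer now holds 10 spaces. Line 7 repeats the contents of the main buffer 8 times. The general form of the substitution command is s/REGEXP/REPLACEMENT/FLAGS, and & in the replacement stands for whatever REGEXP matched. Line 8 swaps again, which leaves the 80 spaces in the secondary buffer and puts the first input line back in the main buffer.
Line 12 is meant to turn tab characters into spaces. Typed literally, y/tab/ / is an error, because the two character lists of a y command must have the same length. The word "tab" is a placeholder: delete it and press the TAB key to insert a real TAB character. The info documentation prints it this way, so perhaps I am misreading its notation. Lines 13 and 14 strip the leading and the trailing spaces respectively.
The G command on line 17 appends a newline to the main buffer and then appends the contents of the secondary buffer after it. The main buffer now holds the current input line with its surrounding whitespace trimmed, followed by a newline and 80 spaces.
Line 20 is s/^\(.\{81\}\).*$/\1/. Here \1 refers to \(.\{81\}\), that is, the first 81 characters, so the command keeps the first 81 characters of the main buffer and throws away the rest.
Line 23 is the brilliant one that does the actual centering. Look at it closely: s/^\(.*\)\n\(.*\)\2/\2\1/. Recall the form s/REGEXP/REPLACEMENT/FLAGS; the REGEXP here is ^\(.*\)\n\(.*\)\2. From the G on line 17 we know that everything before \n is the input text and everything after it is spaces. Those spaces are matched by \(.*\)\2, where \2 is a back-reference to that same \(.*\) group, so the run of spaces is neatly split into two halves of equal length! The replacement \2\1 then puts one half in front of the text and drops the newline together with the other half.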
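The steps above can be tried directly from a shell. The following is a minimal sketch, assuming a sed that accepts the same syntax as the listing (GNU sed does); the function name center80 is made up for illustration, and lines 12 to 14 of the listing are swapped for the [[:space:]] character class so that no literal TAB has to be typed (tabs inside the text are therefore left alone).

```shell
# center80: hypothetical wrapper around the centering script from Example 1.
# Reads lines on stdin and writes them centered in an 80-column field.
center80() {
  sed '
    1 {
      x
      s/^$/          /
      s/^.*$/&&&&&&&&/
      x
    }
    s/^[[:space:]]*//
    s/[[:space:]]*$//
    G
    s/^\(.\{81\}\).*$/\1/
    s/^\(.*\)\n\(.*\)\2/\2\1/
  '
}

# "hi" has 2 characters, so 78 spaces survive line 20 of the listing
# and half of them (39) end up in front of the text.
printf '  hi  \n' | center80
```

When the number of remaining spaces is odd, the back-reference can only take the smaller half twice, so one space is left over after the text.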

Example 2: adding 1 to an integer read from the input. The script:
1   #!/usr/bin/sed -f
2
3   /[^0-9]/ d
4
5   # replace all leading 9s by _ (any other character except digits, could
6   # be used)
7   :d
8   s/9\(_*\)$/_\1/
9   td
10
11  # incr last digit only.  The first line adds a most-significant
12  # digit of 1 if we have to add a digit.
13  #
14  # The `tn' commands are not necessary, but make the thing
15  # faster
16
17  s/^\(_*\)$/1\1/; tn
18  s/8\(_*\)$/9\1/; tn
19  s/7\(_*\)$/8\1/; tn
20  s/6\(_*\)$/7\1/; tn
21  s/5\(_*\)$/6\1/; tn
22  s/4\(_*\)$/5\1/; tn
23  s/3\(_*\)$/4\1/; tn
24  s/2\(_*\)$/3\1/; tn
25  s/1\(_*\)$/2\1/; tn
26  s/0\(_*\)$/1\1/; tn
27
28  :n
29  y/_/0/

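Line 3 deletes every input line that contains a non-digit character.
The :d on line 7 is a label, much like the target label of a goto in C. There are two ways to jump to a label: b LABEL jumps unconditionally, while t LABEL jumps only if an s command has succeeded.
The comment above line 8 says that all leading 9s are replaced by _, but it is really the trailing 9s that get replaced. Look at the REGEXP of the substitution, 9\(_*\)$: it matches a 9 followed by a (possibly empty) run of _ characters at the end of the line, and that 9 is turned into _. Take 899: the first pass turns the last 9 into _, giving 89_; the second pass turns 9_ into __, giving 8__; on the third pass the substitution fails, so td does not jump and execution falls through to the commands that follow. With 898 the very first substitution fails and execution falls through at once.
Lines 17 to 26 are all simple s commands. As described above, the tn after each of them takes effect only when that s command succeeded. Together, lines 17 to 26 behave like a switch-case construct in which tn plays the role of break. Finally, line 29 turns every _ into 0. The code is clear and simple, and the algorithm is very nice.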
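To watch the carry propagate, the script can be wrapped in a shell function and fed a few numbers. This is a sketch under the assumption that sed accepts `; tn` on the same line as an s command, as GNU sed does; the function name incr is made up for illustration.

```shell
# incr: hypothetical wrapper around the increment script from Example 2.
# Reads one non-negative integer per line and prints each value plus 1.
incr() {
  sed '
    /[^0-9]/d
    :d
    s/9\(_*\)$/_\1/
    td
    s/^\(_*\)$/1\1/; tn
    s/8\(_*\)$/9\1/; tn
    s/7\(_*\)$/8\1/; tn
    s/6\(_*\)$/7\1/; tn
    s/5\(_*\)$/6\1/; tn
    s/4\(_*\)$/5\1/; tn
    s/3\(_*\)$/4\1/; tn
    s/2\(_*\)$/3\1/; tn
    s/1\(_*\)$/2\1/; tn
    s/0\(_*\)$/1\1/; tn
    :n
    y/_/0/
  '
}

# 899 becomes 8__, then 9__, then 900; 99 becomes __, then 1__, then 100.
printf '899\n898\n99\n12a\n' | incr
```

The line 12a contains a non-digit, so it is deleted and produces no output.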

Example 3: converting file names to lower or upper case. The code can also be found at:
http://www.gnu.org/software/sed/manual/html_node/Rename-files-to-lower-case.html
1   #! /bin/sh
2   # rename files to lower/upper case...
3   #
4   # usage:
5   #    move-to-lower *
6   #    move-to-upper *
7   # or
8   #    move-to-lower -R .
9   #    move-to-upper -R .
10  #
11
12  help()
13  {
14          cat << eof
15  Usage: $0 [-n] [-r] [-h] files...
16
17  -n      do nothing, only see what would be done
18  -R      recursive (use find)
19  -h      this message
20  files   files to remap to lower case
21
22  Examples:
23         $0 -n *        (see if everything is ok, then...)
24         $0 *
25
26         $0 -R .
27
28  eof
29  }
30
31  apply_cmd='sh'
32  finder='echo "$@" | tr " " "\n"'
33  files_only=
34
35  while :
36  do
37      case "$1" in
38          -n) apply_cmd='cat' ;;
39          -R) finder='find "$@" -type f';;
40          -h) help ; exit 1 ;;
41          *) break ;;
42      esac
43      shift
44  done
45
46  if [ -z "$1" ]; then
47          echo Usage: $0 [-h] [-n] [-r] files...
48          exit 1
49  fi
50
51  LOWER='abcdefghijklmnopqrstuvwxyz'
52  UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
53
54  case `basename $0` in
55          *upper*) TO=$UPPER; FROM=$LOWER ;;
56          *)       FROM=$UPPER; TO=$LOWER ;;
57  esac
58
59  eval $finder | sed -n '
60
61  # remove all trailing slashes
62  s/\/*$//
63
64  # add ./ if there is no path, only a filename
65  /\//! s/^/.\//
66
67  # save path+filename
68  h
69
70  # remove path
71  s/.*\///
72
73  # do conversion only on filename
74  y/'$FROM'/'$TO'/
75
76  # now line contains original path+file, while
77  # hold space contains the new filename
78  x
79
80  # add converted file name to line, which now contains
81  # path/file-name\nconverted-file-name
82  G
83
84  # check if converted file name is equal to original file name,
85  # if it is, do not print nothing
86  /^.*\/\(.*\)\n\1/b
87
88  # now, transform path/fromfile\n, into
89  # mv path/fromfile path/tofile and print it
90  s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
91
92  ' | $apply_cmd

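Note that code copied from the website has some whitespace at the start of every line. It has to be removed, otherwise the script fails with syntax errors:
[root@test root]# sed 's/^[ \t]*\(.*\)$/\1/' remap.sh > movetoupper.sh
[root@test root]# sed 's/^[ \t]*\(.*\)$/\1/' remap.sh > movetolower.sh
The help function prints the usage of the script. It uses here-document syntax, and the eof in cat << eof is not quoted, which means that variable expansion is enabled inside the here-document.
Lines 31 to 33 set default values for a few variables. In a bash script @ stands for all the positional parameters, and $@ expands to its value; the Special Parameters section of the bash manual describes this in detail. The files_only variable on line 33 appears nowhere else and serves no purpose.
The while loop on lines 35 to 44 resets the variables defined on lines 31 to 33 according to the options the user passed in. There are three possible options: -h, -n and -R. If the user gives no options and calls the script as ./movetoupper file1 file2 file3, the tr command on line 32 comes into play: it turns the space-separated file names into newline-separated ones, so that they can be handed to sed one per line. The colon after while on line 35 is a bash builtin whose return value is always 0, so the loop runs forever until the break on line 41 leaves it.
Lines 46 to 49 perform a check: if the user supplied neither options nor arguments, or gave options without the file arguments they need, the script prints a usage message and exits.
The sed part is explained below using ./.eshell/lastdir as the example input:
The comments already make the purpose of lines 62 and 65 clear. Line 65 applies its s command only to lines that contain no slash (the ! negates the address /\//) and prefixes them with ./. After lines 62 and 65 the main buffer holds ./.eshell/lastdir.
The h command on line 68 copies the main buffer into the secondary buffer, overwriting whatever was there. Both buffers now hold ./.eshell/lastdir.
Line 71 removes all the leading directory components and keeps only the file name. Regular-expression matching is greedy, and that is exactly what this line relies on. Afterwards the main buffer holds lastdir and the secondary buffer holds ./.eshell/lastdir. The contents of both buffers can be inspected with the command sequence p; x; p; x.
Line 74 does the case conversion with a single y command. Afterwards the main buffer holds LASTDIR and the secondary buffer holds ./.eshell/lastdir.
The x command on line 78 swaps the contents of the main and secondary buffers.
The G command on line 82 appends a newline to the main buffer and then appends the contents of the secondary buffer. Afterwards the main buffer holds ./.eshell/lastdir\nLASTDIR and the secondary buffer holds LASTDIR.
The case conversion that this script performs could also be done easily with the tr command.
The thing to notice on line 86 is the b command. It is an unconditional branch, and when the target label is omitted sed abandons the rest of the script for the current line and moves straight on to the next line of input. The info documentation explains it as follows: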

`b LABEL'
Unconditionally branch to LABEL. The LABEL may be omitted, in
which case the next cycle is started.

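The regular expression on line 86 is nothing special; it is easier to follow next to the contents of the main buffer after line 82. If it matches, the b command is executed; otherwise b is skipped and line 90 runs.
Line 90 is an s command whose REGEXP is ^\(.*\/\)\(.*\)\n\(.*\)$ and whose REPLACEMENT is mv "\1\2" "\1\3". Again, compare it with the contents of the main buffer after line 82. The s command also carries a p flag, which means: if the substitution was made, print the new contents of the main buffer.
The output of line 90 is piped to $apply_cmd. If the user gave the -n option, apply_cmd is cat, which simply prints the commands. Without -n the variable keeps its default value sh, which executes the mv commands that line 90 produced.
The language is not complicated, but the algorithmic idea behind it is beautiful.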
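The sed stage can be exercised on its own, without renaming anything, by leaving out the final pipe into sh. The sketch below assumes GNU sed and fixes the direction to upper case; the function name gen_mv_upper is made up for illustration, and the variables LOWER and UPPER are the ones from lines 51 and 52 of the listing.

```shell
LOWER='abcdefghijklmnopqrstuvwxyz'
UPPER='ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# gen_mv_upper: hypothetical helper that reads one path per line and
# prints the mv command the script would run, without executing it.
gen_mv_upper() {
  sed -n '
    s/\/*$//
    /\//!s/^/.\//
    h
    s/.*\///
    y/'"$LOWER"'/'"$UPPER"'/
    x
    G
    /^.*\/\(.*\)\n\1/b
    s/^\(.*\/\)\(.*\)\n\(.*\)$/mv "\1\2" "\1\3"/p
  '
}

# The second name has no directory part, so ./ is added first.
# The third name is already upper case, so nothing is printed for it.
printf './.eshell/lastdir\nnotes.txt\nREADME\n' | gen_mv_upper
```

This is the same effect as the -n option of the script: the generated commands are shown instead of being handed to sh.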

Example 4: printing the bash environment variables. The script:
#!/bin/sh

set | sed -n '
:x

# if no occurrence of "=()" print and load next line
/=()/! { p; b; }
/() $/! { p; b; }

# possible start of functions section
# save the line in case this is a var like FOO="() "
h

# if the next line has a brace, we quit because
# nothing comes after functions
n
/^{/ q

# print the old line
x; p

# work on the new line now
x; bx
'
This script is fairly short and also quite clear.
Every command it uses can be looked up in the sed info documentation.
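For readers who want to see the control flow without depending on what set prints on their system, here is a sketch that feeds the same sed commands a small, made-up input. The input layout is an assumption inferred only from the script's own patterns (a function header that ends in "=() " followed by a line that starts with "{"); the names fake_set and print_vars are hypothetical, and GNU sed is assumed.

```shell
# fake_set: hypothetical stand-in for the output of `set`, in the layout
# that the patterns of Example 4 imply (an assumption, not real output).
fake_set() {
  printf '%s\n' 'HOME=/root' 'EXPR=a=()b' 'hello=() ' '{' '    echo hi' '}'
}

# print_vars: the sed commands of Example 4 with the comments removed.
print_vars() {
  sed -n '
    :x
    /=()/! { p; b; }
    /() $/! { p; b; }
    h
    n
    /^{/ q
    x; p
    x; bx
  '
}

# Prints the two variable lines and stops at the function header.
fake_set | print_vars
```

The line EXPR=a=()b passes the first test but not the second, so it is still printed; only a line that ends in "() " and is followed by a brace makes sed quit.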

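Example 5 reverses a string; we will get to that script a little later. As you have seen by now, knowing at every moment what the main and secondary buffers contain is a good way to learn and understand sed. So is there a tool that watches sed's buffers automatically while it runs? Fortunately, there are many sed debuggers on the web that show the commands and buffers as sed executes. One of them can be downloaded from http://sedsed.sourceforge.net/; it is a sed debugger written in Python by Aurelio Jargas.
I have downloaded this debugger into the ~/sedDebug directory, and my sed scripts are all in ~/sedExample. Let's try the debugger on the following example script from the sed info documentation, which does the job of the tac command: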
     #!/usr/bin/sed -nf

     # reverse all lines of input, i.e. first line became last, ...

     # from the second line, the buffer (which contains all previous lines)
     # is *appended* to current line, so, the order will be reversed
     1! G

     # on the last line we're done -- print everything
     $ p

     # store everything on the buffer again
     h

The script has only three commands. Let's run it:
[root@test sedExample]# cat > test << "eof"
> first line
> 2line
> here , this is the third line
> ok , last
> eof
[root@test sedExample]# cat test | ../sedDebugger/sedsed-1.0 -d -f tac.sed

PATT:first line$
HOLD:$
COMM:1 !G
PATT:first line$
HOLD:$
COMM:$ p
PATT:first line$
HOLD:$
COMM:h
PATT:first line$
HOLD:first line$
first line
PATT:2line$
HOLD:first line$
COMM:1 !G
PATT:2line\nfirst line$
HOLD:first line$
COMM:$ p
PATT:2line\nfirst line$
HOLD:first line$
COMM:h
PATT:2line\nfirst line$
HOLD:2line\nfirst line$
2line
first line
PATT:here , this is the third line$
HOLD:2line\nfirst line$
COMM:1 !G
PATT:here , this is the third line\n2line\nfirst line$
HOLD:2line\nfirst line$
COMM:$ p
PATT:here , this is the third line\n2line\nfirst line$
HOLD:2line\nfirst line$
COMM:h
PATT:here , this is the third line\n2line\nfirst line$
HOLD:here , this is the third line\n2line\nfirst line$
here , this is the third line
2line
first line
PATT:ok , last$
HOLD:here , this is the third line\n2line\nfirst line$
COMM:1 !G
PATT:ok , last\nhere , this is the third line\n2line\nfirst line$
HOLD:here , this is the third line\n2line\nfirst line$
COMM:$ p
ok , last
here , this is the third line
2line
first line
PATT:ok , last\nhere , this is the third line\n2line\nfirst line$
HOLD:here , this is the third line\n2line\nfirst line$
COMM:h
PATT:ok , last\nhere , this is the third line\n2line\nfirst line$
HOLD:ok , last\nhere , this is the third line\n2line\nfirst line$
ok , last
here , this is the third line
2line
first line

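The above is the output of the sed debugger. PATT: is followed by the contents of the pattern space (the main buffer), HOLD: by the contents of the hold space (the secondary buffer), and COMM: by the command currently being executed. Everything else is output produced by the sed script itself.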
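Without the debugger, the same three commands can be run directly to confirm the final result. A minimal sketch, with the function name tac_sed made up for illustration:

```shell
# tac_sed: hypothetical wrapper around the three-command tac script.
tac_sed() {
  sed -n '
    1!G
    $p
    h
  '
}

# Every line pulls in the accumulated (already reversed) earlier lines
# from the hold space; only the last line prints the result.
printf 'first line\n2line\nok , last\n' | tac_sed
```

Because of -n, the only output comes from the $p on the last line, which at that point holds all lines in reverse order.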

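With this debugger, learning and understanding sed becomes quick and convenient. That does not make one a sed master, however. sed merely defines a set of means of expression; study sed, learn its means of expression, and you can use them to compose beautiful pieces.
Now let's review the examples above and summarize those means of expression. Two buffers, one main and one secondary, are the material foundation. Apart from the s command, nearly every operation acts on the entire contents of a buffer, for example deleting, appending to or flushing the whole buffer. Only the s command can operate on part of a buffer's contents, which also makes it the most flexible command in sed.
Below is the sed script that reverses a string. It shows one way of thinking with the s command: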
#!/usr/bin/sed -f

/../! b

# Reverse a line.  Begin embedding the line between two new-lines
s/^.*$/\
&\
/

# Move first character at the end.  The regexp matches until
# there are zero or one characters between the markers
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx

# Remove the new-line markers
s/\n//g
You can step through it in the debugger to see exactly how it works.
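For a quick check without the debugger, the script can be wrapped in a shell function. This is a sketch assuming GNU sed; the function name reverse_line is made up for illustration. The sed script is deliberately not indented, because the lines after the backslashes are part of the replacement text.

```shell
# reverse_line: hypothetical wrapper around the string-reversal script.
reverse_line() {
  sed '
/../! b
s/^.*$/\
&\
/
tx
:x
s/\(\n.\)\(.*\)\(.\n\)/\3\2\1/
tx
s/\n//g
'
}

# abcd -> \nabcd\n -> d\nbc\na -> dc\n\nba -> dcba
printf 'abcd\n' | reverse_line
```

The first tx is taken right after the wrapping substitution; its only purpose is to clear the "a substitution succeeded" state, so that the tx inside the loop reacts only to the swap on the line before it.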

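The sed info documentation contains many practical examples; the ones above are five of them. A couple of days ago I searched the web for a way to delete every newline in a file with sed and could not understand the answer at the time. Now I do, so here is that command:
sed -e :a -e '$!N;s/\n//;ta' datafile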
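Read with the earlier examples in mind, the command is a small loop: N appends the next input line to the main buffer (skipped on the last line because of $!), s/\n// removes the newline that N inserted, and ta jumps back to :a as long as that substitution succeeded. A minimal sketch for trying it on standard input, with the function name join_lines made up for illustration:

```shell
# join_lines: hypothetical wrapper around the newline-deleting command.
join_lines() {
  sed -e ':a' -e '$!N;s/\n//;ta'
}

# On the last line N is skipped, the substitution fails, ta does not
# jump, and the joined text is printed once.
printf 'a\nb\nc\n' | join_lines
```

sed still ends its output with a single newline; only the newlines between the input lines are removed.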