Shell Programming in Depth: Regular Expressions and Text Processing Tools

tanwenlong01

Article link: https://blog.csdn.net/boyuser/article/details/109702327

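This article introduces the basic concepts, uses, and categories of regular expressions, explains the basic regular expression metacharacters, and shows how to process text with sed and awk in shell programming. It also covers the common sorting and filtering tools sort, uniq, and tr.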

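1. Definition of regular expressions
A regular expression uses a single string to describe and match a set of strings that follow a given syntactic rule. Put simply, it is a way of matching strings: with a handful of special symbols you can quickly find, delete, or replace a particular string.
A regular expression is a text pattern made up of ordinary characters and metacharacters. Ordinary characters include upper- and lowercase letters, digits, punctuation, and some other symbols. Metacharacters are characters with a special meaning inside a regular expression; they specify how the preceding character (the one immediately before the metacharacter) may occur in the target text.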

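2. Uses of regular expressions
Regular expressions matter a great deal to system administrators. A running system produces a large volume of messages; some are critical, others are merely informational. An administrator who reads all of it directly cannot quickly locate the important entries, such as "user account login failed" or "service failed to start". Regular expressions make it possible to extract these problem entries quickly, which makes operations work simpler and more convenient.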

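3. Categories of regular expressions
Depending on strictness and features, regular expression syntax is divided into basic regular expressions (BRE) and extended regular expressions (ERE). Basic regular expressions are the most fundamental part of the commonly used syntax. Among the common Linux text tools, grep and sed use basic regular expressions by default, while egrep (equivalent to grep -E) and awk use extended regular expressions.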

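1. Regular expression overview

Definition

A regular expression is also called a regex or regexp.
It uses a string to describe and match a set of strings that follow a given rule.
A regular expression consists of:
	Ordinary characters: upper- and lowercase letters, digits, punctuation, and some other symbols
	Metacharacters: characters that have a special meaning inside a regular expression

Levels of regular expressions
	Basic regular expressions
	Extended regular expressions
Text processing tools on Linux
	grep
	egrep
	sed
	awk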

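2. Basic regular expression examples

Common grep options and syntax
-n  show line numbers
-i  ignore case
-v  invert the match (select the lines that do not match)
[]  bracket expression: match any one character from a set

Examples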

[root@server ~]# cat test.txt
he was short and fat.
He was wearing a blue polo shirt with black pants.
The home of Football on BBC Sport online.
the tongue is boneless but it breaks bones.12!
google is the best tools for search keyword.
The year ahead will test our political establishment to the limit.
PI=3.141592653589793238462643383249901429.
a wood cross!
Actions speak louder than words.
#woood #
#woooooood #

AxyzxyzxyzxyzC
I bet this place is really spooky late at night!
Misfortunes never come alone/single.
I shouldn't have lett so tast.

[root@server ~]# grep -n 'the' test.txt
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.

[root@server ~]# grep -in 'the' test.txt
3:The home of Football on BBC Sport online.
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword.
6:The year ahead will test our political establishment to the limit.

# To invert the selection, for example to find the lines that do not contain "the", use grep's -v option; combine it with -n to show line numbers
[root@server ~]# grep -vn 'the' test.txt
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
7:PI=3.141592653589793238462643383249901429.
8:a wood cross!
9:Actions speak louder than words.
10:#woood #
11:#woooooood #
12:
13:AxyzxyzxyzxyzC
14:I bet this place is really spooky late at night!
15:Misfortunes never come alone/single.
16:I shouldn't have lett so tast.
17:

# Use square brackets "[]" to match one character from a set
Here "[io]" matches either "i" or "o"
[root@server ~]# grep -n 'sh[io]rt' test.txt
1:he was short and fat.
2:He was wearing a blue polo shirt with black pants.

# To find lines that contain the repeated character "oo", simply run the following command
[root@server ~]# grep -n 'oo' test.txt
3:The home of Football on BBC Sport online.
5:google is the best tools for search keyword.
8:a wood cross!
10:#woood #
11:#woooooood #
14:I bet this place is really spooky late at night!

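# To find "oo" that is not preceded by "w", use the negated bracket expression "[^]"
For example, "grep -n '[^w]oo' test.txt" searches test.txt for "oo" whose preceding character is not "w". Lines 10 and 11 still match because in "ooo" the first "o" is itself a non-"w" character in front of "oo".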
[root@server ~]# grep -n '[^w]oo' test.txt
3:The home of Football on BBC Sport online.
5:google is the best tools for search keyword.
10:#woood #
11:#woooooood #
14:I bet this place is really spooky late at night!

# To exclude "oo" preceded by any lowercase letter, use "grep -n '[^a-z]oo' test.txt"
[root@server ~]# grep -n '[^a-z]oo' test.txt
3:The home of Football on BBC Sport online.
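
The negated bracket results can look surprising at first, so here is a minimal sketch with an invented three-word sample (not the article's test.txt). It assumes a POSIX shell and grep; LC_ALL=C is set so that the range a-z covers only ASCII lowercase letters.

```shell
# Sample words are invented for illustration.
words=$(printf 'wood\nwoood\nFootball\n')

# [^w]oo needs exactly one non-"w" character followed by "oo".
# "woood" matches because its first "o" fills the [^w] slot.
not_w=$(printf '%s\n' "$words" | LC_ALL=C grep '[^w]oo')

# [^a-z]oo also rejects every lowercase letter in front of "oo".
not_lower=$(printf '%s\n' "$words" | LC_ALL=C grep '[^a-z]oo')

printf 'not preceded by w:\n%s\n' "$not_w"
printf 'not preceded by a lowercase letter:\n%s\n' "$not_lower"
```

A bracket expression always consumes one character, so '[^w]oo' can never match "oo" at the very start of a line.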

# To find lines that contain digits, use "grep -n '[0-9]' test.txt"
[root@server ~]# grep -n '[0-9]' test.txt
4:the tongue is boneless but it breaks bones.12!
7:PI=3.141592653589793238462643383249901429.

# Match the start of a line with "^" and the end of a line with "$"
[root@server ~]# grep -n '^the' test.txt
4:the tongue is boneless but it breaks bones.12!
# To find lines that begin with a lowercase letter, filter with "^[a-z]"
[root@server ~]# grep -n '^[a-z]' test.txt
1:he was short and fat.
4:the tongue is boneless but it breaks bones.12!
5:google is the best tools for search keyword.
8:a wood cross!
# To find lines that begin with an uppercase letter, use "^[A-Z]"
[root@server ~]# grep -n '^[A-Z]' test.txt
2:He was wearing a blue polo shirt with black pants.
3:The home of Football on BBC Sport online.
6:The year ahead will test our political establishment to the limit.
7:PI=3.141592653589793238462643383249901429.
9:Actions speak louder than words.
13:AxyzxyzxyzxyzC
14:I bet this place is really spooky late at night!
15:Misfortunes never come alone/single.
16:I shouldn't have lett so tast.

# To find blank lines, run "grep -n '^$' test.txt"
[root@server ~]# grep -n '^$' test.txt
12:
17:
# The dot (.) is also a metacharacter; it stands for any single character
[root@server ~]# grep -n 'w..d' test.txt
5:google is the best tools for search keyword.
8:a wood cross!
9:Actions speak louder than words.

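# To find oo, ooo, oooo and so on, use the asterisk (*) metacharacter. Note that "*" repeats the preceding single character zero or more times. "o*" matches zero "o" characters (the empty string) or one or more of them, and because the empty string is allowed, "grep -n 'o*' test.txt" prints every line of the file. With "oo*" the first "o" must be present and the second is optional, so it matches one or more "o". To find two or more "o" characters, run "grep -n 'ooo*' test.txt"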
[root@server ~]# grep -n 'ooo*' test.txt
3:The home of Football on BBC Sport online.
5:google is the best tools for search keyword.
8:a wood cross!
10:#woood #
11:#woooooood #
14:I bet this place is really spooky late at night!

# Find strings that start with "w", end with "d", and have one or more "o" characters in between
[root@server ~]# grep -n 'woo*d' test.txt
8:a wood cross!
10:#woood #
11:#woooooood #

# Run the following command to find strings that start with "w" and end with "d", with any characters (or none) in between
[root@server ~]# grep -n 'w.*d' test.txt
1:he was short and fat.
5:google is the best tools for search keyword.
8:a wood cross!
9:Actions speak louder than words.
10:#woood #
11:#woooooood #

# Find lines that contain a run of one or more digits
[root@server ~]# grep -n '[0-9][0-9]*' test.txt
4:the tongue is boneless but it breaks bones.12!
7:PI=3.141592653589793238462643383249901429.

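# Match a repetition count with "{}"
In a basic regular expression a bare "{" or "}" is an ordinary character, so the interval operator is written with backslashes as "\{" and "\}". With grep -E the braces are written without backslashes.
Find two consecutive "o" characters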
[root@server ~]# grep -n 'o\{2\}' test.txt
3:The home of Football on BBC Sport online.
5:google is the best tools for search keyword.
8:a wood cross!
10:#woood #
11:#woooooood #
14:I bet this place is really spooky late at night!

Find strings that start with "w", end with "d", and contain 2 to 5 "o" characters in between
[root@server ~]# grep -n 'wo\{2,5\}d' test.txt
8:a wood cross!
10:#woood #

Find strings that start with "w", end with "d", and contain 2 or more "o" characters in between
[root@server ~]# grep -n 'wo\{2,\}d' test.txt
8:a wood cross!
10:#woood #
11:#woooooood #
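
As a compact check of the repetition operators used in these examples, the following sketch counts matching lines in a small invented word list rather than in test.txt. The helper name count is hypothetical; grep -c prints the number of matching lines.

```shell
# Invented sample: "w", then 1, 2, 3 and 7 "o" characters, then "d", plus a line without "o".
sample=$(printf 'wod\nwood\nwoood\nwoooooood\nxyz\n')

count() {
    # Count the lines of $sample that match the basic regular expression in $1.
    printf '%s\n' "$sample" | grep -c "$1"
}

star_any=$(count 'o*')             # zero or more "o": every line matches
two_plus=$(count 'ooo*')           # "oo" followed by zero or more "o"
exactly_two=$(count 'wo\{2\}d')    # exactly two "o" between w and d
two_to_five=$(count 'wo\{2,5\}d')  # two to five "o"
at_least_two=$(count 'wo\{2,\}d')  # two or more "o"

echo "$star_any $two_plus $exactly_two $two_to_five $at_least_two"
```

The first count equals the total number of lines, which is why a bare 'o*' is rarely useful as a filter.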

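3. Basic regular expression metacharacters

Basic regular expressions are the most commonly used part of the regular expression syntax.
Besides ordinary characters, the following metacharacters are common:
	\: escape character, removes the special meaning of the metacharacter that follows, e.g. \. and \*
	^: matches the start of the line, e.g. ^a, ^the, ^#
	$: matches the end of the line, e.g. word$
	.: matches any single character except \n, e.g. go.d, g..d
	*: matches the preceding subexpression zero or more times, e.g. goo*d, go.*d
	[list]: matches one character from list, e.g. go[oa]d, [abc], [a-z], [a-z0-9] (inside brackets "|" is a literal character, not an "or")
	[^list]: matches one character that is not in list, e.g. [^a-z], [^0-9], [^a-z0-9]
	\{n,m\}: matches the preceding subexpression n to m times; the three forms are \{n\}, \{n,\}, and \{n,m\}, e.g. go\{2\}d, go\{2,3\}d, go\{2,\}d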

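4. sed

sed (Stream EDitor) is a powerful yet simple tool for parsing and transforming text. It reads text, edits it according to the given conditions (delete, replace, add, move, and so on), and then outputs either all lines or only the lines it processed.

The sed workflow consists of three main steps: read, execute, and display.

Read: sed reads one line from the input stream (a file, a pipe, or standard input) and stores it in a temporary buffer called the pattern space.

Execute: by default, all sed commands run in order on the pattern space. Unless a line address is given, each command is applied to every line in turn.

Display: the modified content is sent to the output stream. After the data is sent, the pattern space is cleared.

These steps repeat until all of the input has been processed.

Note: by default sed commands operate only on the pattern space, so the input file itself is not changed. To keep the result, redirect the output to another file or use the -i option to edit the file in place.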
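
The note about the input file staying unchanged can be checked with a small sketch. It assumes mktemp is available and works on a throwaway file with invented content.

```shell
# Work on a throwaway file so nothing real is modified.
tmp=$(mktemp)
printf 'the cat\nthe dog\n' > "$tmp"

# sed edits its pattern space and writes the result to standard output.
edited=$(sed 's/the/THE/' "$tmp")

# The file on disk still holds the original text.
original=$(cat "$tmp")

printf 'sed output:\n%s\n' "$edited"
printf 'file on disk:\n%s\n' "$original"
rm -f "$tmp"
```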

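1. Common sed usage

sed  [options]  'command'  file ...
sed  [options]  -f  scriptfile  file ...
Common options:
-e  script : add the given sed editing command
-f  scriptfile : read the sed editing commands from the given file
-h : show help (--help)
-n : suppress automatic printing, so only explicitly printed lines (for example with p) are shown
-i : edit the file in place
Common commands:
a : append, add a line with the given text below the current line
c : change, replace the selected lines with the given text
d : delete the selected lines
i : insert, add a line with the given text above the selected line
p : print
s : substitute, replace the matched string
y : transliterate characters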

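Usage examples

1. Print the lines that match a condition (p prints the pattern space)
[root@server ~]# sed -n 'p' test.txt		# print everything, same as cat test.txt
[root@server ~]# sed -n '3p' test.txt		# print line 3
[root@server ~]# sed -n '3,5p' test.txt		# print lines 3-5
[root@server ~]# sed -n 'p;n' test.txt		# print all odd-numbered lines; n reads the next line of input
[root@server ~]# sed -n 'n;p' test.txt		# print all even-numbered lines; n reads the next line of input
[root@server ~]# sed -n '1,5{p;n}' test.txt	# print the odd-numbered lines among lines 1-5
[root@server ~]# sed -n '9,${n;p}' test.txt	# print the even-numbered lines from line 10 to the end of the file (the range starts at 9 because n reads the next line before p prints)

[root@server ~]# sed -n '/the/p' test.txt	# print the lines that contain "the"
[root@server ~]# sed -n '4,/the/p' test.txt	# print from line 4 through the next line that contains "the"
[root@server ~]# sed -n '/the/=' test.txt	# print the line numbers of the lines that contain "the"; the equals sign (=) prints line numbers
[root@server ~]# sed -n '/^PI/p' test.txt	# print the lines that start with PI
[root@server ~]# sed -n '/[0-9]$/p' test.txt	# print the lines that end with a digit
[root@server ~]# sed -n '/\<wood\>/p' test.txt	# print the lines that contain the whole word wood; \< and \> mark word boundaries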
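
To see how p and n interact, the sketch below runs the same commands on six numbered lines instead of test.txt. The variable names are invented for illustration.

```shell
# Six numbered lines stand in for a real file.
lines=$(printf '%s\n' 1 2 3 4 5 6)

odd=$(printf '%s\n' "$lines" | sed -n 'p;n')     # print, then read past one line
even=$(printf '%s\n' "$lines" | sed -n 'n;p')    # read past one line, then print
range=$(printf '%s\n' "$lines" | sed -n '2,4p')  # address range: lines 2-4

# Unquoted expansion joins the results onto one line for display.
echo "odd:" $odd
echo "even:" $even
echo "range:" $range
```

Because -n suppresses automatic printing, only the lines reached by an explicit p appear in the output.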

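2. Delete the lines that match a condition (d)
The nl command numbers the lines of a file (blank lines are not numbered by default)
[root@server ~]# nl test.txt | sed '3d'		# delete line 3
[root@server ~]# nl test.txt | sed '3,5d'	# delete lines 3-5
[root@server ~]# nl test.txt | sed '/cross/d'	# delete the lines that contain cross
[root@server ~]# nl test.txt | sed '/cross/!d'	# delete the lines that do not contain cross
[root@server ~]# sed '/^[a-z]/d' test.txt		# delete the lines that start with a lowercase letter
[root@server ~]# sed '/\.$/d' test.txt			# delete the lines that end with "."
[root@server ~]# sed '/^$/d' test.txt			# delete blank lines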

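3. Replace text that matches a condition
Replacement in sed uses the s (substitute a string), c (replace whole lines or blocks), and y (transliterate characters) commands
sed 's/the/THE/' test.txt		# replace the first "the" on each line with "THE"
sed 's/l/L/2' test.txt			# replace the 2nd "l" on each line with "L"
sed 's/the/THE/g' test.txt		# replace every "the" in the file with "THE"
sed 's/o//g' test.txt			# delete every "o" in the file (replace it with the empty string)
sed 's/^/#/' test.txt			# insert "#" at the start of every line
sed '/the/s/^/#/' test.txt		# insert "#" at the start of every line that contains "the"
sed 's/$/EOF/' test.txt			# append the string EOF to the end of every line
sed '3,5s/the/THE/g' test.txt	# replace every "the" in lines 3-5 with "THE"
sed '/the/s/o/O/g' test.txt		# replace every "o" with "O" on the lines that contain "the"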
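
The difference between no flag, a numeric flag, and the g flag of the s command is easiest to see on one short string. This is a minimal sketch with an invented sample sentence.

```shell
text='hello little world'

first=$(echo "$text" | sed 's/l/L/')    # no flag: first match on the line only
second=$(echo "$text" | sed 's/l/L/2')  # numeric flag: second match only
all=$(echo "$text" | sed 's/l/L/g')     # g flag: every match on the line

printf '%s\n%s\n%s\n' "$first" "$second" "$all"
```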

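4. Move text that matches a condition
The following commands are commonly used when moving text with sed
H: append the pattern space to the hold space (a clipboard-like buffer)
g, G: overwrite / append to the current line with the contents of the hold space
w: write to a file
r: read the given file
a: append the given text
sed '/the/{H;d};$G' test.txt	# move the lines that contain "the" to the end of the file; {;} groups several commands
sed '1,5{H;d};17G' test.txt		# move lines 1-5 to after line 17
sed '/the/w out.file' test.txt	# save the lines that contain "the" to the file out.file
sed '/the/r /etc/hostname' test.txt		# add the contents of /etc/hostname after every line that contains "the"
sed '3aNew' test.txt			# insert a new line with the text New after line 3
sed '/the/aNew' test.txt		# insert a new line with the text New after every line that contains "the"
sed '3aNew\nNew2' test.txt		# insert several lines after line 3; the \n in the middle is a line break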
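
The hold-space move can be traced on three invented lines. The sketch below uses the more portable spelling with a semicolon before the closing brace and separate -e expressions; that spelling is a choice made here, not the form used in the examples.

```shell
# H appends the matching line to the hold space and d drops it from the output;
# $G appends the hold space to the last line.
moved=$(printf 'a\nthe b\nc\n' | sed -e '/the/{H;d;}' -e '$G')

printf '%s\n' "$moved"
```

The output contains an empty line before the moved text. H always adds a newline and then the line, and the hold space starts out empty, so the saved block begins with a newline.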

5. Edit a file with a script
Several editing commands can be stored in a sed script file (one command per line) and invoked with the -f option
[root@server ~]# vi list
1,5H
1,5d
17G
[root@server ~]# sed -f list test.txt

6. Example: editing a file directly with sed
Write a script that adjusts the vsftpd configuration: forbid anonymous logins, but allow local users (with write access)
#!/bin/bash
# Update the vsftpd configuration file
CONFIG="/etc/vsftpd/vsftpd.conf"
# Back up the original file once, before the first edit
[ ! -e "$CONFIG.bak" ] && cp "$CONFIG" "$CONFIG.bak"
sed -i -e '/^anonymous_enable/s/YES/NO/g' "$CONFIG"
sed -i -e '/^local_enable/s/NO/YES/g' "$CONFIG"
sed -i -e '/^write_enable/s/NO/YES/g' "$CONFIG"
sed -i -e 's/^#chroot_local_user=YES/chroot_local_user=YES/g' "$CONFIG"
# Remove any existing allow_writeable_chroot setting, then add it after line 127
sed -i -e 's/allow_writeable_chroot=YES//g' "$CONFIG"
sed -i '127aallow_writeable_chroot=YES' "$CONFIG"
sed -i -e '/listen/s/NO/YES/g' "$CONFIG"
sed -i -e '/listen_ipv6/s/YES/NO/g' "$CONFIG"
systemctl start vsftpd
netstat -anpt | grep vsftpd
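
Before running edits like these against a real configuration file, they can be rehearsed on a temporary copy. The sketch below uses an invented three-line sample, not a real vsftpd.conf, and uses the -i.bak form, which asks sed to keep a backup with the given suffix, in place of a separate cp.

```shell
# Invented sample settings in a throwaway file.
conf=$(mktemp)
printf 'anonymous_enable=YES\nlocal_enable=NO\n#chroot_local_user=YES\n' > "$conf"

# Edit in place; sed keeps the previous content in "$conf.bak".
sed -i.bak \
    -e '/^anonymous_enable/s/YES/NO/' \
    -e '/^local_enable/s/NO/YES/' \
    -e 's/^#chroot_local_user=YES/chroot_local_user=YES/' \
    "$conf"

edited=$(cat "$conf")
backup=$(cat "$conf.bak")
printf 'edited:\n%s\nbackup:\n%s\n' "$edited" "$backup"
rm -f "$conf" "$conf.bak"
```

Anchoring each address with ^ keeps commented-out settings untouched unless a command targets them explicitly.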

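5. Extended regular expressions: metacharacter summary

By default grep uses basic regular expressions. To use extended regular expressions, use egrep (grep -E) or awk.
+ : one or more repetitions of the preceding character
? : zero or one occurrence of the preceding character
| : alternation (or), matches any one of several strings
( ) : groups characters into a single unit
( )+ : one or more repetitions of a group
grep, sed, and awk are the text processing tools used most often in shell programming, and are often called the "three swordsmen" of shell programming.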
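
Each of the extended metacharacters can be tried with grep -E. The sketch below counts matching lines in an invented sample; the helper name count_ere is hypothetical.

```shell
sample=$(printf 'wd\nwod\nwood\nAxyzxyzC\nAC\n')

count_ere() {
    # Count the lines of $sample that match the extended regular expression in $1.
    printf '%s\n' "$sample" | grep -E -c "$1"
}

plus=$(count_ere 'wo+d')        # one or more "o": wod, wood
optional=$(count_ere 'wo?d')    # zero or one "o": wd, wod
either=$(count_ere 'wd|AC')     # alternation: wd, AC
group=$(count_ere 'A(xyz)+C')   # repeated group: AxyzxyzC

echo "$plus $optional $either $group"
```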

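6. awk

awk is a powerful editing tool. It reads the input text line by line, searches according to the given pattern, and formats or filters the content that matches, which makes fairly complex text operations possible without any interaction.

awk  [options]  'pattern or condition  {action}'  file1  file2 ...		# filter and print the content in the files that matches the condition
awk  -f  scriptfile  file1  file2 ...			# read the actions from a script file, then filter and print the content

Example: to list the user name, user ID, and group ID from /etc/passwd, run the following awk command
awk  -F:  '{print  $1,$3,$4}'  /etc/passwd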

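awk provides several built-in variables:
FS : field separator for each line, spaces or tabs by default
NF : number of fields in the current line
NR : line number (record number) of the current line
$0 : the entire content of the current line
$n : the nth field of the current line
FILENAME : name of the file being processed
RS : record separator, \n by default, so each line is one record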
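
A short sketch shows several of these variables together. The two colon-separated input lines are invented and only resemble /etc/passwd entries.

```shell
# -F: sets FS to ":"; $NF is the last field of the current line.
fields=$(printf 'root:x:0:0\nbin:x:1:1:extra\n' | awk -F: '{ print NR, NF, $1, $NF }')

printf '%s\n' "$fields"
```

Each output line holds the record number, the field count, the first field, and the last field.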

Usage examples
1. Print text by line
awk  '{print}'  test.txt      # print everything, same as cat test.txt
awk  '{print  $0}'  test.txt     # print everything
awk  'NR==1,NR==3{print}'  test.txt   # print lines 1 to 3
awk  '(NR>=1)&&(NR<=3){print}'  test.txt    # print lines 1 to 3
awk  'NR==1 || NR==3{print}'  test.txt    # print line 1 and line 3
awk  '(NR%2)==1{print}'  test.txt      # print the odd-numbered lines
awk  '(NR%2)==0{print}'  test.txt      # print the even-numbered lines
awk  '/^root/{print}'  /etc/passwd     # print the lines that start with root
awk  '/nologin$/{print}'  /etc/passwd      # print the lines that end with nologin
awk  'BEGIN  {x=0};/\/bin\/bash$/{x++};END {print x}'  /etc/passwd   # count the lines that end with /bin/bash
awk  'BEGIN{RS=""};END{print NR}'  test.txt   # count the paragraphs separated by blank lines
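
The line-selection patterns can be tried without touching system files. This sketch feeds awk small invented inputs through printf; the user names u1 to u3 are placeholders.

```shell
letters=$(printf '%s\n' a b c d e)

range=$(printf '%s\n' "$letters" | awk 'NR==2,NR==4')  # range pattern: lines 2-4
odd=$(printf '%s\n' "$letters" | awk 'NR%2==1')        # odd-numbered lines

# Count lines ending in /bin/bash; x+0 prints 0 instead of an empty string when nothing matches.
bash_users=$(printf 'u1:/bin/bash\nu2:/sbin/nologin\nu3:/bin/bash\n' |
    awk '/\/bin\/bash$/ { x++ } END { print x+0 }')

# With RS="" each blank-line-separated paragraph is one record.
paragraphs=$(printf 'a\nb\n\nc\n' | awk 'BEGIN { RS="" } END { print NR }')

echo "range:" $range
echo "odd:" $odd
echo "bash users: $bash_users, paragraphs: $paragraphs"
```

A pattern without an action prints the matching line, so 'NR==2,NR==4' behaves the same as 'NR==2,NR==4{print}'.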

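2. Print text by field
awk  '{print  $3}'  test.txt      # print the 3rd field of each line
awk  '{print  $1,$3}'  test.txt     # print the 1st and 3rd fields of each line
awk  -F:  '$2==""{print}'  /etc/shadow    # print the users with an empty password
awk  'BEGIN  {FS=":"};$2==""{print}'  /etc/shadow   # print the users with an empty password
awk  -F:  '$7~"/bash"{print  $1}'  /etc/passwd     # split on colons and print the 1st field of the lines whose 7th field contains /bash
awk  -F:  '($7!="/bin/bash")&&($7!="/sbin/nologin"){print}'  /etc/passwd  # print all lines whose 7th field is neither /bin/bash nor /sbin/nologin
awk  '($1~"nfs")&&(NF==8){print  $1,$2}'  /etc/services     # print the 1st and 2nd fields of the lines that have 8 fields and whose 1st field contains nfs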
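
The field conditions can be rehearsed on a few passwd-style lines. The three entries below are invented sample data, not read from /etc/passwd.

```shell
passwd_sample=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'nobody:x:99:99:Nobody:/:/sbin/nologin' \
    'sync:x:5:0:sync:/sbin:/bin/sync')

# ~ tests whether the field matches a regular expression.
bash_users=$(printf '%s\n' "$passwd_sample" | awk -F: '$7 ~ /bash/ { print $1 }')

# != compares the whole field as a string.
other_shells=$(printf '%s\n' "$passwd_sample" |
    awk -F: '($7 != "/bin/bash") && ($7 != "/sbin/nologin") { print $1, $7 }')

echo "bash users: $bash_users"
echo "other shells: $other_shells"
```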

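3. Call shell commands through pipes and double quotes
awk  -F:  '/bash$/{print  |  "wc  -l"}'  /etc/passwd     # call wc -l to count the users whose shell is bash, same as grep -c "bash$" /etc/passwd
awk  'BEGIN  {while  ("w"  |  getline)n++;{print  n-2}}'    # call the w command and count the logged-in users (the first two lines of w output are headers)
awk  'BEGIN  {"hostname"  |  getline;print  $0}'    # call hostname and print the current host name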
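
The command | getline pattern can be sketched with echo as a stand-in for w or hostname, so the result does not depend on the machine. Comparing the return value with 0 is an addition made here: getline returns -1 on error, and a bare while (cmd | getline) would then loop forever.

```shell
line_count=$(awk 'BEGIN {
    cmd = "echo a; echo b; echo c"
    # getline returns 1 for each line read, 0 at end of input, -1 on error.
    while ((cmd | getline line) > 0) n++
    close(cmd)
    print n
}')

# Read a single line of command output into a variable.
first_line=$(awk 'BEGIN { "echo hello" | getline greeting; print greeting }')

echo "$line_count lines, first line: $first_line"
```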

Script exercise: create ten web pages in the Apache site directory; the file names are up to you, and the page contents are web1, web2, ...
vim  web.sh
#!/bin/bash
# Create the web pages
dir=/var/www/html
test=web
cd  "$dir"  ||  exit  1
for  ((i=1;i<=10;i++))
do
if  [  !  -e  index$i.html  ]
then  echo  $test$i > index$i.html
fi
done
systemctl  start  httpd
netstat  -anpt  |  grep  httpd
for  ((i=1;i<=10;i++))
do
   curl  http://localhost/index$i.html
done
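
The page-creation loop can be tried without root access or a running httpd by pointing it at a temporary directory. This is a sketch under that assumption; it uses a POSIX while loop in place of the bash-only for ((...)) form, and the variable names are invented.

```shell
# A temporary directory stands in for /var/www/html.
dir=$(mktemp -d)
prefix=web

i=1
while [ "$i" -le 10 ]; do
    page="$dir/index$i.html"
    # Only create pages that do not exist yet, as in the script above.
    [ -e "$page" ] || echo "$prefix$i" > "$page"
    i=$((i + 1))
done

page_count=$(ls "$dir" | wc -l | tr -d ' ')
third_page=$(cat "$dir/index3.html")
echo "$page_count pages, index3.html contains: $third_page"
rm -rf "$dir"
```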

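7. Common file sorting tools: sort, uniq, and tr

1. The sort command has the syntax sort [options] file; common options include the following

-f : ignore case
-b : ignore leading blanks on each line
-M : sort by month
-n : sort numerically
-r : reverse the order
-u : same effect as uniq, show identical lines only once
-t : set the field separator; by default fields are separated by blanks (spaces or tabs)
-o : write the sorted result to the given file
-k : set the sort key (field)

Example 1: sort the accounts in /etc/passwd
sort  /etc/passwd

Example 2: sort /etc/passwd in reverse order by the third column
sort  -t  ":"  -rk  3  /etc/passwd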
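
Whether the third column is compared as text or as numbers changes the result. The sketch below uses three invented colon-separated lines; LC_ALL=C is set so the text comparison is plain byte order.

```shell
accounts=$(printf 'a:x:10\nb:x:9\nc:x:100\n')

# Reverse text order on field 3: "9" sorts after "100" and "10" as a string.
text_order=$(printf '%s\n' "$accounts" | LC_ALL=C sort -t ':' -k 3 -r | cut -d: -f1)

# Reverse numeric order on field 3: -n compares the values as numbers.
number_order=$(printf '%s\n' "$accounts" | LC_ALL=C sort -t ':' -k 3 -n -r | cut -d: -f1)

echo "text:" $text_order
echo "numeric:" $number_order
```

For ID columns such as the third field of /etc/passwd, adding -n is usually what is wanted.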

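2. The uniq tool

-c : prefix each line with its number of occurrences
-d : show only the duplicated lines
-u : show only the lines that occur once

Example: find the duplicated lines in the file testfile
      uniq  -d  testfile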
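
uniq only compares adjacent lines, so unsorted input usually needs sort first. A minimal sketch with three invented lines:

```shell
items=$(printf 'b\na\nb\n')

adjacent_only=$(printf '%s\n' "$items" | uniq -d)   # the two "b" lines are not adjacent
dups=$(printf '%s\n' "$items" | sort | uniq -d)     # sorted first: "b" is reported
singles=$(printf '%s\n' "$items" | sort | uniq -u)  # lines that occur exactly once

echo "without sort: [$adjacent_only] with sort: [$dups] unique: [$singles]"
```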

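3. The tr tool

tr  [options]  [arguments]
-c : use the complement of the first character set (all characters not in it)
-d : delete all characters that belong to the first character set
-s : squeeze each run of a repeated character into a single character
-t : first truncate the first character set to the length of the second

Example 1: convert uppercase input to lowercase
        echo  "KGC"  |  tr  'A-Z'  'a-z'

Example 2: squeeze repeated characters in the input
        echo  "this  is  a  text  linnnnne"  |  tr  -s  'sn'

Example 3: delete certain characters from a string
        echo  'helloworld'  |  tr  -d  'od'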
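
The three tr examples can be collected into one runnable sketch. The inputs follow the examples above, with single spaces in the second string.

```shell
lower=$(echo 'KGC' | tr 'A-Z' 'a-z')                       # map uppercase to lowercase
squeezed=$(echo 'this is a text linnnnne' | tr -s 'sn')    # collapse runs of "s" or "n"
deleted=$(echo 'helloworld' | tr -d 'od')                  # remove every "o" and "d"

printf '%s\n%s\n%s\n' "$lower" "$squeezed" "$deleted"
```

tr works on single characters, so 'od' means the characters "o" and "d", not the string "od".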