7. Shell scripting text processing: awk (part 2)

jiang0615csdn
Category: shell script development collection. Tags: linux, big data operations

Article link: https://blog.csdn.net/jiang0615csdn/article/details/125619477

8. Flow control in awk

8.1. if statements

Syntax: if (condition) statement [ else statement ]

Single branch:
[root@~ test]#  seq 5 |awk '{if($0==3)print $0}' 
3

A regular-expression match can also serve as the condition, which is handy in more complex statements:
[root@~ test]# echo "123abc#456cde 789aaa#aaabbb " |xargs -n1 |awk -F# '{if($2~/[0-9]/)print $2}' 
456cde

[root@~ test]# echo "123abc#456cde 789aaa#aaabbb " |xargs -n1 |awk -F# '{if($2!~/[0-9]/)print $2}' 
aaabbb
Or, written as a pattern instead of an if:
[root@~ test]# echo "123abc#456cde 789aaa#aaabbb" |xargs -n1 |awk -F# '$2!~/[0-9]/{print $2}' 
aaabbb

Two branches:
[root@~ test]# seq 5 |awk '{if($0==3)print $0;else print "no"}' 
no
no
3
no
no

Multiple branches:
[root@cnsz92vl17661 test]# cat file 
1 2 3
4 5 6
7 8 9

[root@~ test]#  awk '{if($1==4){print "1"} else if($2==5){print "2"} else if($3==6){print "3"} else {print "no"}}' file 
no
1
no

8.2. while loops

Syntax: while (condition) statement

NF holds the number of fields in the current record.

Print every field, one per line:
[root@cnsz92vl17661 test]# cat file 
1 2 3
4 5 6
7 8 9

[root@cnsz92vl17661 test]# awk '{i=1;while(i<=NF){print $i;i++}}' file
1
2
3
4
5
6
7
8
9

Note: awk processes its input line by line. For each line it reads, the loop prints every field in turn.

8.3. C-style for loops

Syntax: for (expr1; expr2; expr3) statement

Print every field, one per line:
[root@~ test]# cat file 
1 2 3
4 5 6
7 8 9

[root@~ test]# awk '{for(i=1;i<=NF;i++)print $i}' file
1
2
3
4
5
6
7
8
9

Print the fields in reverse order:
[root@cnsz92vl17661 test]# awk '{for(i=NF;i>=1;i--)print $i}' file 
3
2
1
6
5
4
9
8
7

Every field lands on its own line, which is not what we want. How can this be improved?
[root@cnsz92vl17661 test]#  awk '{for(i=NF;i>=1;i--){printf $i" "};print ""}' file
3 2 1 
6 5 4 
9 8 7 
Or:
[root@cnsz92vl17661 test]#  awk '{for(i=NF;i>=1;i--)if(i==1)printf $i"\n";else printf $i" "}' file 
3 2 1
6 5 4
9 8 7

With the same approach, can we leave out the first or the last field? Print in forward order to check.
Skip the first field:
[root@cnsz92vl17661 test]#  awk '{for(i=2;i<=NF;i++){printf $i" "};print ""}' file 
2 3 
5 6 
8 9 
Skip the last field:
[root@cnsz92vl17661 test]# awk '{for(i=1;i<=NF-1;i++){printf $i" "};print ""}' file 
1 2 
4 5 
7 8 


Wrap each IP address in single quotes:
[root@~ test]# echo '10.10.10.1 10.10.10.2 10.10.10.3' |awk '{for(i=1;i<=NF;i++)printf "\047"$i"\047"}'
'10.10.10.1''10.10.10.2''10.10.10.3'

\047 is the octal ASCII code of the single quote. Character codes can be looked up with the showkey -a command.
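Using the data itself as the printf format string works for IP addresses, but a field containing % would be read as a format specifier. A minimal sketch, assuming the same three-address input, that passes each field as an argument and also separates the quoted values with commas (the comma separator is an addition for illustration):

```shell
# Quote every field and join the results with commas.
# \047 is the octal escape for a single quote inside an awk string.
quoted=$(echo '10.10.10.1 10.10.10.2 10.10.10.3' |
  awk '{
    for (i = 1; i <= NF; i++) {
      # The field is an argument here, so a % in the data is printed literally.
      printf "\047%s\047%s", $i, (i < NF ? "," : "\n")
    }
  }')
echo "$quoted"
```

The last argument picks the separator: a comma between fields and a newline after the final one.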

8.4. for loops over arrays

Syntax: for (var in array) statement

[root@~ ~]# seq -f "str%.g" 5 |awk '{a[NR]=$0}END{for(v in a)print v,a[v]}' 
4 str4
5 str5
1 str1
2 str2
3 str3

8.5. break and continue

break exits the innermost enclosing loop.

continue skips the rest of the current iteration and moves on to the next one.

[root@~ ~]# awk 'BEGIN{for(i=1;i<=5;i++){if(i==3){break};print i}}' 
1
2
[root@~ ~]# awk 'BEGIN{for(i=1;i<=5;i++){if(i==3){continue};print i}}' 
1
2
4
5
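To see which loop break affects when loops are nested, here is a small sketch (the loop bounds are arbitrary):

```shell
# break leaves only the inner loop; the outer loop keeps running.
out=$(awk 'BEGIN {
  for (i = 1; i <= 2; i++) {
    for (j = 1; j <= 3; j++) {
      if (j == 2) break      # stops the j loop only
      print i, j
    }
  }
}')
echo "$out"
```

Both values of i are printed, each with j equal to 1, because the outer loop continues after the inner one is abandoned.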

8.6. Deleting arrays and elements

Syntax:

delete array[index] removes a single element.

delete array removes every element of the array.

Delete the whole array (nothing is printed because the array is now empty):
[root@~ test]# seq -f "str%.g" 5 |awk '{a[NR]=$0}END{delete a;for(v in a)print v,a[v]}' 

Traverse the array without deleting anything, for comparison:
[root@~ test]# seq -f "str%.g" 5 |awk '{a[NR]=$0}END{for(v in a)print v,a[v]}' 
4 str4
5 str5
1 str1
2 str2
3 str3

Delete a single element:
[root@~ test]# seq -f "str%.g" 5 |awk '{a[NR]=$0}END{delete a[4];for(v in a)print v,a[v]}' 
5 str5
1 str1
2 str2
3 str3
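A related idiom is the in operator, which checks whether a subscript exists. The sketch below uses two made-up keys, x and y, to confirm that delete really removes an element:

```shell
# Check membership before and after delete.
out=$(awk 'BEGIN {
  a["x"] = 1; a["y"] = 2
  delete a["x"]
  # (key in array) tests membership without creating the element
  print (("x" in a) ? "x present" : "x deleted")
  print (("y" in a) ? "y present" : "y deleted")
}')
echo "$out"
```

Testing with in is safer than comparing a["x"] with an empty string, because merely referencing a["x"] would create the element again.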

8.7. The exit statement

Syntax: exit [ expression ]

exit terminates the program, much like exit in the shell.

The expression is the exit status, a number between 0 and 255.

[root@~ test]# seq 5 |awk '{if($0~/3/)exit (123)}'
[root@~ test]# echo $?
123
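One detail worth knowing: exit inside a main rule stops reading input, but the END block still runs. A minimal sketch, where the variable name code and the status value 7 are arbitrary choices:

```shell
status=0
# The || keeps the shell going and records the non-zero exit status of awk.
out=$(seq 5 | awk '$0 == 3 { code = 7; exit }
                   END { print "stopped at record", NR; exit code }') || status=$?
echo "$out (exit status $status)"
```

Calling exit code again inside END preserves the status chosen in the main rule.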

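9. Using arrays in awk

An array stores a collection of elements as key/value pairs, and a value is accessed through its subscript (key). awk arrays are associative arrays: the subscript can be a string as well as a number. Keys and values live in an internal hash table, so the order in which for (var in array) visits the elements is unspecified and usually differs from insertion order. Array syntax: array[index]=value

9.1. Defining an array by hand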

[root@~ test]# awk 'BEGIN{a[0]="test";print a[0]}'
test

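9.2. Using NR as the subscript (subscripts start at 1)

NR is the number of the current record; it increases by 1 for each record read. In the examples below, subscripts 0 and 4 are never assigned, so a[0] and a[4] print an empty line.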

[root@~ test]# tail -n3 /etc/passwd 
centos:x:1002:1002:Cloud User:/home/centos:/bin/bash
nginx:x:799:798:Nginx web server:/var/lib/nginx:/sbin/nologin
tcpdump:x:72:72::/:/sbin/nologin

[root@~ test]# tail -n3 /etc/passwd | awk -F":" '{a[NR]=$1}END{print a[1]}'
centos

[root@~ test]# tail -n3 /etc/passwd | awk -F":" '{a[NR]=$1}END{print a[0]}'


[root@~ test]# tail -n3 /etc/passwd | awk -F":" '{a[NR]=$1}END{print a[2]}'
nginx

[root@~ test]# tail -n3 /etc/passwd | awk -F":" '{a[NR]=$1}END{print a[3]}'
tcpdump

[root@~ test]# tail -n3 /etc/passwd | awk -F":" '{a[NR]=$1}END{print a[4]}'

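9.3. Traversing an array with a for loop

The v printed below is the array subscript. The first loop, for (v in a), returns the elements in no particular order because, as noted earlier, the array is stored unordered. The second loop walks the subscripts explicitly and prints them in order. When the subscripts form a numeric sequence, prefer the for (expr1; expr2; expr3) form so the order is preserved.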

[root@~ test]# head -n5 /etc/passwd | awk -F: '{a[NR]=$1}END{for(v in a)print a[v],v}'
adm 4
lp 5
root 1
bin 2
daemon 3

[root@~ test]# head -n5 /etc/passwd | awk -F: '{a[NR]=$1}END{for(v=1;v<=NR;v++)print a[v],v}'
root 1
bin 2
daemon 3
adm 4
lp 5
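When the subscripts are not a dense numeric sequence, another option is to let sort restore the order after awk has printed the pairs. A minimal sketch with made-up input in place of /etc/passwd:

```shell
# Print subscript first, then sort numerically on it.
sorted=$(printf 'root\nbin\ndaemon\n' |
  awk '{ a[NR] = $1 } END { for (v in a) print v, a[v] }' |
  sort -n)    # sort numerically on the leading subscript
echo "$sorted"
```

Putting the subscript in the first column keeps the sort command simple.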

9.4. Using ++ to generate subscripts

x is an uninitialized variable, which awk treats as 0; x++ increases it by 1 for every record.

[root@~ test]# head -n5 /etc/passwd | awk -F: '{a[x++]=$1}END{for(i=0;i<=(x-1);i++)print a[i],i}'
root 0
bin 1
daemon 2
adm 3
lp 4

9.5. Using a field as the subscript

[root@~ test]# head -n5 /etc/passwd |awk -F: '{a[$1]=$7}END{for(v in a)print a[v],v}'
/sbin/nologin bin
/sbin/nologin adm
/sbin/nologin daemon
/bin/bash root
/sbin/nologin lp

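9.6. Counting how often a field value occurs

The first field is used as the subscript. The element starts out as 0, and ++ adds 1 every time the same subscript (first field) appears, which yields the occurrence count. Deduplication then becomes easy: just print the subscripts.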

[root@~ test]# tail servers | awk '{a[$1]++}END{for(v in a)print a[v],v}'
2 com-bardac-dw
1 3gpp-cbsp
1 nimgtw
2 iqobject
2 isnetserv
2 blp5

[root@~ test]# tail servers | awk '/blp5/{a[$1]++}END{for(v in a)print a[v],v}'
2 blp5

9.7. Counting TCP connection states

[root@~ test]# netstat -antp | awk '/^tcp/{a[$6]++}END{for(v in a)print a[v],v}'
6 LISTEN
1 ESTABLISHED

9.8. Printing only values that occur at least twice

[root@~ test]# cat servers | awk '{a[$1]++}END{for(v in a) if(a[v]>=2){print a[v],v}}'
2 com-bardac-dw
2 iqobject
2 isnetserv
2 blp5

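9.9. Deduplication

First, keep in mind that awk treats 0 (and the empty string) as false and any other number as true. The rest follows from that.

Printing only repeated lines: a[$1]++ is a post-increment, so for the first record with a given key the expression evaluates to 0 (false) and nothing is printed, while the stored value becomes 1. When the same key appears again the expression is non-zero (true) and the line is printed.

Printing without duplicates: for the first record the expression is 0, the ! negates it to true, and the line is printed. On later occurrences the value is non-zero, the negation is false, and the line is skipped.

In a[$1]++?$1:"no", the ternary operator prints $1 when the key has been seen before and "no" when this is its first occurrence.

Print the repeated lines (second and later occurrences of a key):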
[root@~ test]# awk 'a[$1]++' servers 
isnetserv         48128/udp            # Image Systems Network Services 
blp5              48129/udp            # Bloomberg locator 
com-bardac-dw     48556/udp            # com-bardac-dw 
iqobject          48619/udp            # iqobject 


Print only the first occurrence of each key (deduplicated output):
[root@~ test]# awk '!a[$1]++' servers 
nimgtw            48003/udp            # Nimbus Gateway 
3gpp-cbsp         48049/tcp            # 3GPP Cell Broadcast Service Protocol 
isnetserv         48128/tcp            # Image Systems Network Services 
blp5              48129/tcp            # Bloomberg locator 
com-bardac-dw     48556/tcp            # com-bardac-dw 
iqobject          48619/tcp            # iqobject 


Ternary operator:
[root@~ test]# awk '{print a[$1]++?$1:"no"}' servers 
no
no
no
isnetserv
no
blp5
no
com-bardac-dw
no
iqobject

[root@~ test]# awk '{if(!a[$1]++)print $1}' servers 
nimgtw
3gpp-cbsp
isnetserv
blp5
com-bardac-dw
iqobject
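To make the post-increment behaviour concrete, the sketch below prints the value that a[$1]++ returns for each record. The five input lines and the variable name seen are made up for illustration:

```shell
# Show the value returned by a[$1]++ for every record.
trace=$(printf 'a\nb\na\nc\nb\n' |
  awk '{
    seen = a[$1]++      # old count: 0 the first time, 1 the second, ...
    print $1, seen, (seen ? "duplicate" : "first")
  }')
echo "$trace"
```

The lines marked first are the ones that !a[$1]++ would print, and the lines marked duplicate are the ones that a[$1]++ would print.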

9.10. Summing one field for each distinct value of another field

[root@~ test]# awk -F'[ /]+' '{a[$1]+=$2}END{for(v in a)print v,a[v]}' servers 
com-bardac-dw 97112
3gpp-cbsp 48049
nimgtw 48003
iqobject 97238
isnetserv 96256
blp5 96258
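The same pattern extends to other aggregates by keeping a second array. A minimal sketch that computes a count and an average per key, using made-up input (the names web and db and their values are hypothetical):

```shell
# Keep a running sum and a count per key, then print the average.
report=$(printf 'web 10\nweb 20\ndb 5\n' |
  awk '{ sum[$1] += $2; cnt[$1]++ }
       END { for (k in sum) printf "%s %d %.1f\n", k, cnt[k], sum[k] / cnt[k] }' |
  sort)
echo "$report"
```

The final sort only makes the output order predictable, since for (k in sum) visits keys in no particular order.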

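9.11. Multidimensional arrays

awk does not really support multidimensional arrays; it simulates two-dimensional access. A subscript list such as a["a","b"]=1 is joined into a single string with SUBSEP (default \034) as the separator, so the element is stored under the key a\034b. Because \034 is a non-printing character, the key in the output below looks like xy.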

[root@~ test]# awk 'BEGIN{a["x","y"]=123;for(v in a) print v,a[v]}' 
xy 123

SUBSEP can be reassigned to change the default subscript separator:
[root@~ test]# awk 'BEGIN{SUBSEP=":";a["x","y"]=123;for(v in a) print v,a[v]}' 
x:y 123


Count occurrences by a combination of fields:
[root@cnsz92vl17661 test]# cat kwa 
A 192.168.1.1 HTTP
B 192.168.1.2 HTTP
B 192.168.1.2 MYSQL
C 192.168.1.1 MYSQL
C 192.168.1.1 MQ
D 192.168.1.4 NGINX

[root@~ test]# awk 'BEGIN{SUBSEP="-"}{a[$1,$2]++}END{for(v in a)print a[v],v}' kwa 
1 D-192.168.1.4
1 A-192.168.1.1
2 C-192.168.1.1
2 B-192.168.1.2
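If the individual subscripts are needed again in the END block, the combined key can be split on SUBSEP. A minimal sketch with three made-up input lines in the same layout as the kwa file (the array names cnt and part are illustrative):

```shell
# Count (name, ip) pairs, then split the combined key back into its parts.
pairs=$(printf 'A 192.168.1.1\nB 192.168.1.2\nB 192.168.1.2\n' |
  awk '{ cnt[$1, $2]++ }
       END {
         for (k in cnt) {
           split(k, part, SUBSEP)   # recover the two original subscripts
           print part[1], part[2], cnt[k]
         }
       }' | sort)
echo "$pairs"
```

This keeps the default SUBSEP, which is a safer choice than a printable separator that might also occur in the data.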

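10. Built-in functions in awk

Built-in function	Description
int(expr)	Truncates to an integer
sqrt(expr)	Square root
rand()	Returns a random number N in the range 0 <= N < 1
srand([expr])	Seeds the random number generator with expr (the current time if omitted) and returns the previous seed
asort(a, b)	Sorts the values of array a and stores them in the new array b, indexed from 1 (gawk extension)
asorti(a, b)	Sorts the subscripts of array a and stores them in b, otherwise as above (gawk extension)
sub(r, s [, t])	Replaces the first match of regex r with s in t (the whole record if t is omitted)
gsub(r, s [, t])	Replaces every match of regex r with s in t (the whole record if t is omitted)
gensub(r, s, h [, t])	Returns t with matches of r replaced by s; h is "g" for all matches or a number selecting which match to replace (gawk extension)
index(s, t)	Returns the position of string t in s, or 0 if it is absent
length([s])	Returns the length of s
match(s, r [, a])	Returns the position where regex r matches in s, or 0 if there is no match
split(s, a [, r [, seps] ])	Splits s into array a on separator r; in gawk, seps receives the separators
substr(s, i [, n])	Returns the substring of s starting at position i with length n, or the rest of the string if n is omitted
tolower(str)	Converts all uppercase letters in str to lowercase
toupper(str)	Converts all lowercase letters in str to uppercase
systime()	Current timestamp (gawk extension)
strftime([format [, timestamp[, utcflag]]])	Formats a timestamp as a string (gawk extension)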

10.1. int()

int(expr) truncates to an integer. A string is converted using its leading numeric prefix, and a string without one yields 0.

[root@~ test]#  echo -e "123abc\nabc123\n123abc123" | awk '{print int($0)}' 
123
0
123

[root@~ test]# awk 'BEGIN{print int(10/3)}'
3

10.2. sqrt()

sqrt(expr) returns the square root, as in ordinary arithmetic.

[root@~ test]# awk 'BEGIN{print sqrt(9)}'
3
[root@~ test]# awk 'BEGIN{print sqrt(10)}'
3.16228

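10.3. rand() and srand()

rand() returns a random number N in the range 0 <= N < 1.

srand([expr]) seeds the random number generator with expr. If expr is omitted, the current time is used as the seed. The return value is the previous seed.

In gawk, rand() on its own does not give a new number on every run; without a seed it keeps returning the same value: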
[root@~ test]#  awk 'BEGIN{print rand()}' 
0.237788
[root@~ test]#  awk 'BEGIN{print rand()}' 
0.237788

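rand() only varies between runs after srand() has been called, so the two functions are normally used together. srand() seeds from the current time in seconds, so runs started within the same second still return the same number.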
[root@~ test]#  awk 'BEGIN{srand();print rand()}'
0.0213984
[root@~  test]#  awk 'BEGIN{srand();print rand()}'
0.414332

To get a random integer, scale the result and truncate it. The expression below yields 0 to 9; add 1 to get 1 to 10:
[root@~ test]# awk 'BEGIN{srand();print int(rand()*10)}'
9
[root@~ test]# awk 'BEGIN{srand();print int(rand()*10)}'
9
[root@~ test]# awk 'BEGIN{srand();print int(rand()*10)}'
3
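For an arbitrary inclusive range, the scaling can be wrapped in a small user-defined function. This is a sketch only; the function name randint is hypothetical and not part of awk:

```shell
# Draw five integers between 1 and 10 inclusive.
nums=$(awk 'function randint(lo, hi) {
              # rand() is in [0, 1), so the result stays within lo..hi
              return lo + int(rand() * (hi - lo + 1))
            }
            BEGIN {
              srand()                      # seed from the current time
              for (i = 1; i <= 5; i++) print randint(1, 10)
            }')
echo "$nums"
```

The values differ from run to run, so no fixed output is shown here.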

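10.4. asort() and asorti()

asort(a, b) sorts the values of array a and stores them in the new array b, which is indexed from 1.

asorti(a, b) sorts the subscripts of array a instead; otherwise it behaves the same way.

How the examples below work:

asort copies the values of array a into array b in sorted order and discards the original subscripts. It returns the number of elements, which is assigned to s. The new array b is indexed from 1 and is then traversed in order.

Sort an array:
[root@~ test]# seq -f "str%.g" 5 |awk '{a[x++]=$0}END{s=asort(a,b);for(i=1;i<=s;i++)print b[i],i}'
str1 1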
str2 2
str3 3
str4 4
str5 5

[root@~ test]#  seq -f "str%.g" 5 |awk '{a[x++]=$0}END{s=asorti(a,b);for(i=1;i<=s;i++)print b[i],i}'
0 1
1 2
2 3
3 4
4 5

10.5. sub() and gsub()

sub(r, s [, t]) replaces the first match of regex r with s. The optional t names the field or variable to change; without it the whole record is used.

gsub(r, s [, t]) works the same way but replaces every match.

Replace the text matched by a regex:
[root@~ test]# awk '/blp5/{sub(/tcp/,"icmp");print $0}' servers 
blp5              48129/icmp            # Bloomberg locator 
blp5              48129/udp            # Bloomberg locator 

[root@~ test]# awk '/blp5/{gsub(/c/,"9");print $0}' servers 
blp5              48129/t9p            # Bloomberg lo9ator 
blp5              48129/udp            # Bloomberg lo9ator 

[root@~  test]#  echo "1 2 2 3 4 5" |awk 'gsub(2,7,$2){print $0}' 
1 7 2 3 4 5
[root@~ test]# echo "1 2 3 a b c" |awk 'gsub(/[0-9]/, "0"){print $0}' 
0 0 0 a b c

Insert a line before or after a given line (& in the replacement stands for the matched text):

[root@~ test]# seq 5 | awk 'NR==2{sub(/.*/,"txt\n&")}{print}' 
1
txt
2
3
4
5

[root@~ test]# seq 5 | awk 'NR==2{sub(/.*/,"&\ntxt")}{print}' 
1
2
txt
3
4
5
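Both functions return the number of replacements made, which is why gsub() can be used as a pattern above. A minimal sketch with a made-up input line that shows the return value together with &:

```shell
# gsub() returns how many replacements it made; & is the matched text.
marked=$(echo 'port 80 and 443' |
  awk '{ n = gsub(/[0-9]+/, "[&]"); print n ": " $0 }')
echo "$marked"
```

A return value of 0 means nothing matched and the record is unchanged.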

10.6. index()

index(s, t) returns the position of string t in s, or 0 if t does not occur.

[root@~ test]# tail -n5 servers | awk '{print index($2,"tcp")}'
0
7
0
7
0
[root@~ test]# tail -n5 servers 
blp5              48129/udp            # Bloomberg locator 
com-bardac-dw     48556/tcp            # com-bardac-dw 
com-bardac-dw     48556/udp            # com-bardac-dw 
iqobject          48619/tcp            # iqobject 
iqobject          48619/udp            # iqobject

10.7. length()

length([s]) returns the length of s.

Length of a field:
[root@~ test]# tail -n5 servers | awk '{print length($2)}'
9
9
9
9
9

Number of elements in an array:
[root@~ test]# tail -n5 servers | awk '{a[$1]=$2}END{print length(a)}'
3

10.8. match()

match(s, r [, a]) tests whether string s contains a match for regex r. It returns the position of the match, or 0 if there is none.

[root@~ test]# echo "123abc#456cde 789aaa#234bbb 999aaa#aaabbb" |xargs  -n1 |awk '{print match($0,234)}'
0
8
0

If the record contains the string 234, match() returns its position; otherwise it returns 0. To print only the records that contain the string:
[root@~ test]#  echo "123abc#456cde 789aaa#234bbb 999aaa#aaabbb" |xargs  -n1 |awk '{if(match($0,234)!=0)print $0}'
789aaa#234bbb
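match() also sets the variables RSTART and RLENGTH, which can be handed to substr() to extract the matched text. A minimal sketch, assuming the regex /[0-9]+bbb/ as an example pattern:

```shell
# RSTART is the match position, RLENGTH its length.
found=$(echo '789aaa#234bbb' |
  awk '{
    if (match($0, /[0-9]+bbb/)) {
      print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
    }
  }')
echo "$found"
```

The digits at the start of the record are skipped because they are not followed by bbb.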

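10.9. split()

split(s, a [, r [, seps] ]) splits s into array a on the separator r (FS if r is omitted) and returns the number of elements. In gawk, the optional array seps receives the separator strings.

Split the record into array a with the default separator (the input contains no whitespace, so each record yields a single element):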
[root@~ test]# echo -e "123#456#789\nabc#cde#fgh" |awk '{split($0,a);for(v in a)print a[v],v}' 
123#456#789 1
abc#cde#fgh 1

Split the record into array a on #:
[root@~ test]# echo -e "123#456#789\nabc#cde#fgh" |awk '{split($0,a,"#");for(v in a)print a[v],v}' 
123 1
456 2
789 3
abc 1
cde 2
fgh 3
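Because split() returns the number of pieces and numbers them from 1, a counting loop can walk them in a defined order, unlike for (v in a). A minimal sketch that reverses the pieces of one of the sample records:

```shell
# Walk the pieces from last to first and join them with # again.
reversed=$(echo '123#456#789' |
  awk '{
    n = split($0, a, "#")            # n is the number of pieces
    for (i = n; i >= 1; i--) {
      printf "%s%s", a[i], (i > 1 ? "#" : "\n")
    }
  }')
echo "$reversed"
```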

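10.10. substr()

substr(s, i [, n]) returns the part of string s that starts at position i and is n characters long. If n is omitted, the rest of the string is returned.

From position 4 to the end: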
[root@~ test]# echo -e "123#456#789\nabc#cde#fgh" |awk '{print substr($0,4)}'
#456#789
#cde#fgh

From position 4, 5 characters long:
[root@~ test]# echo -e "123#456#789\nabc#cde#fgh" |awk '{print substr($0,4,5)}' 
#456#
#cde#
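substr() is often combined with index() when the cut position is not known in advance. A minimal sketch that splits a port/protocol value, in the format used by the servers file, at the slash:

```shell
# Find the slash, then take the text before and after it.
parts=$(echo '48129/udp' |
  awk '{
    p = index($0, "/")
    if (p > 0) print substr($0, 1, p - 1), substr($0, p + 1)
  }')
echo "$parts"
```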

10.11. tolower() and toupper()

tolower(str) converts all uppercase letters in str to lowercase.

toupper(str) converts all lowercase letters in str to uppercase.

Convert to lowercase:
[root@~ test]# echo -e "123#456#789\nABC#cde#fgh" |awk '{print tolower($0)}' 
123#456#789
abc#cde#fgh

Convert to uppercase:
[root@~ test]#  echo -e "123#456#789\nabc#cde#fgh" |awk '{print toupper($0)}' 
123#456#789
ABC#CDE#FGH
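A common use of tolower() is case-insensitive matching: normalise the record first, then compare against a lowercase regex. A minimal sketch with made-up log lines:

```shell
# Match "error" regardless of case and print the record number with the line.
hits=$(printf 'ERROR disk full\ninfo ok\nError net down\n' |
  awk 'tolower($0) ~ /error/ { print NR ": " $0 }')
echo "$hits"
```

The record itself is printed unchanged, since tolower() returns a copy and does not modify $0.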

10.12. Time functions

strftime([format [, timestamp[, utcflag]]]) formats a timestamp as a string.

Return the current timestamp:
[root@~ test]# awk 'BEGIN{print systime()}' 
1657094148

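Convert a timestamp to a date and time (strftime uses the local time zone by default; the output below comes from a UTC+8 system):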
[root@~ test]#  echo "1483297766" |awk '{print strftime("%Y-%m-%d %H:%M:%S",$0)}' 
2017-01-02 03:09:26