[Repost] A summary of the "three musketeers" of regex text processing: grep, sed and awk

tangxc10

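0 Regular expression basics

^ matches the start of a line

$ matches the end of a line

. matches any single character

? matches zero or one occurrence of the preceding character (an extended-regex operator; GNU basic regex spells it \?)

* matches zero or more occurrences of the preceding character

[1-9] matches one character in the range 1-9

[^1-9] matches one character not in the range 1-9

\< matches the start of a word

\> matches the end of a word

\( \) capture a group; groups can be referenced later as \1, \2 and so on

x\{m,n\} matches at least m and at most n occurrences of x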
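
The basic operators above can be tried directly with grep. The following is a minimal sketch, assuming a grep that accepts the \< word anchor (GNU grep does); the sample lines and variable names are made up for illustration.

```shell
# Made-up sample input, one item per line.
sample=$(printf '%s\n' 'root:x:0' 'chroot tool' 'abcabc' 'aaa' 'b')

# ^ anchors the match at the start of the line.
starts_with_root=$(printf '%s\n' "$sample" | grep '^root')

# \< anchors the match at the start of a word, so "chroot" is skipped.
word_start=$(printf '%s\n' "$sample" | grep '\<root')

# \(...\) captures a group and \1 refers back to it: "abc" twice in a row.
repeated=$(printf '%s\n' "$sample" | grep '\(abc\)\1')

# a\{2,3\} matches two or three consecutive "a" characters.
interval=$(printf '%s\n' "$sample" | grep '^a\{2,3\}$')

# [^1-9] matches one character that is not a digit from 1 to 9.
no_digit_line=$(printf '%s\n' "$sample" | grep '^[^1-9]$')

printf '%s\n' "$starts_with_root" "$word_start" "$repeated" "$interval" "$no_digit_line"
```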

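Extended regular expressions

| alternation: the line matches if any one of the alternatives matches

+ like *, but matches one or more occurrences of the preceding character

() groups several items into a single unit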
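
A short sketch of the extended operators, run through grep -E. The sample words and variable names are assumptions chosen only to make each operator visible.

```shell
# Made-up sample words, one per line.
sample=$(printf '%s\n' 'hello' 'world' 'helo' 'color' 'colour' 'abab')

# | matches either alternative.
either=$(printf '%s\n' "$sample" | grep -E 'hello|world')

# + needs at least one "l", so both "helo" and "hello" match.
one_or_more=$(printf '%s\n' "$sample" | grep -E '^hel+o$')

# ? makes the "u" optional.
optional=$(printf '%s\n' "$sample" | grep -E '^colou?r$')

# () turns "ab" into one unit that + can repeat.
grouped=$(printf '%s\n' "$sample" | grep -E '^(ab)+$')

printf '%s\n' "$either" "$one_or_more" "$optional" "$grouped"
```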

1 grep

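General form: grep [options] regex file

Options

-E use extended regular expressions

-e pattern use pattern as the regex (useful for giving several patterns, or one that starts with -)

-f file read the patterns from file, one per line

-i ignore case (this can be slow; converting the input to a single case with tr first may be faster)

-v invert the match: print the lines that do not match

-V print the version number

Output control options:

-n prefix each output line with its line number

-q quiet: print nothing; the exit status tells whether a match was found

-r search directories recursively

-l print only the names of files that contain a match

-L print only the names of files that contain no match

-c print only the count of matching lines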

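Common examples:

grep -i 'root' /etc/passwd prints the lines containing root, ignoring case

grep -v '^root' /etc/passwd prints the lines that do not start with root

grep -n 'root' /etc/passwd prints the lines containing root, each prefixed with its line number

grep -lr 'root$' /etc recursively lists the files under /etc that have a line ending in root

grep -Lr 'root' /etc recursively lists the files under /etc that do not contain root

grep -c 'root' /etc/passwd counts the lines containing root

grep -E 'hello|world' 1.cpp prints the lines containing hello or world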
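
Because /etc/passwd differs from machine to machine, the options are easier to check against a small file with known contents. The sketch below assumes mktemp is available; the three passwd-style lines and the variable names are invented for the example.

```shell
# Build a small passwd-like file with known contents.
tmp=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'daemon:x:1:1::/usr/sbin/nologin' \
    'Root2:x:1000:1000::/home/Root2:/bin/sh' > "$tmp"

count=$(grep -c 'root' "$tmp")          # lines containing lowercase root
count_i=$(grep -ci 'root' "$tmp")       # the same count, ignoring case
numbered=$(grep -n '^daemon' "$tmp")    # matching line with its line number
inverted=$(grep -vc '^root' "$tmp")     # lines that do not start with root

# -q prints nothing; only the exit status says whether a match exists.
if grep -q 'sshd' "$tmp"; then found=yes; else found=no; fi

rm -f "$tmp"
printf '%s\n' "$count" "$count_i" "$numbered" "$inverted" "$found"
```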

2 sed

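General form: sed [options] [-e script] 'address command arguments' file

The address is often a regex, as in '/regex/command arguments'

Processing model: sed reads the input one line at a time into the pattern space. If the line matches the address, the commands attached to that address are applied to it; a line that does not match is left unchanged. Unless -n is given, sed then prints the pattern space and moves on to the next line.

Common options:

-f file read the sed script from file

-e script add script to the commands to run

-n suppress the automatic printing of the pattern space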
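
The processing model is easiest to see by comparing the output with and without -n. This is a small sketch on three made-up lines; the variable names are hypothetical.

```shell
input=$(printf '%s\n' 'alpha' 'root line' 'omega')

# Without -n every line is printed once automatically,
# so the explicit p prints the matching line a second time.
with_autoprint=$(printf '%s\n' "$input" | sed '/root/p')

# With -n only the explicit p produces output.
only_match=$(printf '%s\n' "$input" | sed -n '/root/p')

# A command runs only on lines matching its address; other lines pass through.
replaced=$(printf '%s\n' "$input" | sed '/root/s/line/LINE/')

printf '%s\n' "$with_autoprint" '---' "$only_match" '---' "$replaced"
```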

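Commands used in sed scripts:

sed '/root/a\text' /etc/passwd appends a new line text after each line containing root (one-line form, as accepted by GNU sed)

sed '/root/c\text' /etc/passwd replaces each line containing root with text

sed '/root/i\text' /etc/passwd inserts text before each matching line

sed '/root/d' /etc/passwd deletes every line containing root

h/H copy/append the pattern space to the hold space

g/G copy/append the hold space to the pattern space

sed -e '/root/{h;d;}' -e '$G' /etc/passwd moves the root line to the end of the file (G appends the held line after the last line; g would overwrite the last line)

p prints the pattern space

sed -n '/root/{n;p;}' /etc/passwd prints the line that follows each root line

sed '1,3y/abcdef/ABCDEF/' /etc/passwd maps the letters a-f to A-F on lines 1-3

s/xxx/yyy/g replaces every occurrence of xxx on the line with yyy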
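
The difference between G and g on the last line decides whether a held line is added or replaces something. The sketch below uses three invented passwd-style lines and hypothetical variable names, and also shows the n;p idiom for printing the line after a match.

```shell
input=$(printf '%s\n' 'root:x:0' 'bin:x:1' 'daemon:x:2')

# h saves the root line in the hold space, d deletes it from the output,
# and $G appends the held line after the last input line.
moved=$(printf '%s\n' "$input" | sed -e '/root/{h;d;}' -e '$G')

# With $g the hold space overwrites the last line, so daemon is lost.
overwritten=$(printf '%s\n' "$input" | sed -e '/root/{h;d;}' -e '$g')

# n loads the next line into the pattern space; p then prints it.
next_line=$(printf '%s\n' "$input" | sed -n '/root/{n;p;}')

printf '%s\n' "$moved" '---' "$overwritten" '---' "$next_line"
```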

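sed addressing:

sed -n '1,3p' /etc/passwd prints lines 1-3 of the file

sed -n '/root/,/sshd/p' /etc/passwd prints from the first root line through the next sshd line, inclusive

sed -n '5,/^northeast/p' file prints from line 5 through the next line that begins with northeast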
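
The three address forms (line numbers, a regex pair, and a mix of both) can be compared on one small input. The six sample lines and the variable names below are assumptions made for the example.

```shell
input=$(printf '%s\n' 'one' 'two' 'root' 'three' 'sshd' 'four')

# Numeric range: lines 1 through 3.
by_number=$(printf '%s\n' "$input" | sed -n '1,3p')

# Regex range: from the root line through the sshd line, both included.
by_regex=$(printf '%s\n' "$input" | sed -n '/root/,/sshd/p')

# Mixed range: from line 2 through the next line starting with "three".
mixed=$(printf '%s\n' "$input" | sed -n '2,/^three/p')

printf '%s\n' "$by_number" '---' "$by_regex" '---' "$mixed"
```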

3 awk

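General form: gawk 'pattern {action}' file

cmd | gawk 'pattern {action}'

If the pattern is omitted, the action applies to every line; if the action is omitted, the matching lines are printed. Patterns can use variables such as $0, NF and NR.

How it works: awk reads one line into $0, then splits it into fields on the field separator. The default separator is whitespace (spaces and tabs); another can be set with the -F option or the FS variable. Field i is stored in $i. Older awk implementations allow at most 100 fields per record; gawk has no such fixed limit.

gawk -F : '{print $1}' /etc/passwd prints every user name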
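
Field splitting can be checked on a couple of invented passwd-style records. The sketch calls plain awk instead of gawk on the assumption that any POSIX awk is installed; the variable names are hypothetical.

```shell
records=$(printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin')

# -F: splits on colons; $1 is the user name.
users=$(printf '%s\n' "$records" | awk -F: '{print $1}')

# NR is the record number, NF the field count, $NF the last field.
summary=$(printf '%s\n' "$records" | awk -F: '{print NR, NF, $NF}')

# With the default separator, runs of spaces and tabs count as one separator
# and leading whitespace is ignored.
default_split=$(printf '  a   b\tc \n' | awk '{print NF, $2}')

printf '%s\n' "$users" "$summary" "$default_split"
```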

Formatted output:
print supports escape sequences in strings, and formats numbers according to the OFMT variable
nawk '/Sally/{print "\t\tHave a nice day, " $1, $2 "!"}' employees
nawk 'BEGIN{OFMT="%.2f"; print 1.2456789, 12E-2}'
printf accepts C-style format strings, like the C function of the same name
echo "UNIX" | nawk '{printf "|%-15s|\n", $1}'
nawk '{printf "The name is: %-15s ID is %8d\n", $1, $3}' employees
Field separators:
nawk -F'[ :\t]' '{print $1, $2, $3}' employees
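
The formatting features can be tried without the employees file. The following sketch feeds awk a few literal strings; it uses plain awk rather than nawk as an assumption about what is installed, and the variable names are made up.

```shell
# %-15s left-justifies the string in a 15-character column.
padded=$(echo "UNIX" | awk '{printf "|%-15s|\n", $1}')

# OFMT controls how print formats non-integer numbers.
rounded=$(awk 'BEGIN { OFMT = "%.2f"; print 1.2456789, 12E-2 }')

# A bracket expression as separator: split on space, colon or tab.
multi_sep=$(echo 'a b:c' | awk -F'[ :\t]' '{print $1, $2, $3}')

printf '%s\n' "$padded" "$rounded" "$multi_sep"
```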

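Pattern
pattern { action statement; action statement; etc. }
A pattern can be a regular expression or a conditional expression, and a conditional expression may include arithmetic.
~ is the match operator and !~ its negation: nawk '$1 !~ /ly$/' employees
Comparison expressions: ==, >=, <=, !=, <, >, ~ and !~ are supported, and && and || combine several comparisons.
awk '$3 * $4 > 500' filename
Range patterns:
awk '/Tom/,/Suzanne/' filename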
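
The pattern kinds above can be compared on one small table. The data (name, rate, hours) is invented, and the field numbers differ from the one-liners above because this table has only three columns.

```shell
data=$(printf '%s\n' 'Sally 10 20' 'Tom 30 20' 'Billy 5 10' 'Suzanne 40 20' 'Pat 1 1')

# Regex negation: names that do not end in "ly".
not_ly=$(printf '%s\n' "$data" | awk '$1 !~ /ly$/ {print $1}')

# Arithmetic inside a condition: rate times hours above 500.
big=$(printf '%s\n' "$data" | awk '$2 * $3 > 500 {print $1}')

# Two comparisons joined with &&.
both=$(printf '%s\n' "$data" | awk '$2 >= 10 && $3 == 20 {print $1}')

# Range pattern: from the Tom line through the Suzanne line.
range=$(printf '%s\n' "$data" | awk '/Tom/,/Suzanne/ {print $1}')

printf '%s\n' "$not_ly" '---' "$big" '---' "$both" '---' "$range"
```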

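Action
The statements inside {} closely resemble C. Blocks can be nested, and actions can use conditionals, loops, user-defined variables and functions, built-in variables and functions, system commands, and input/output redirection.
Variables: var=value. An uninitialized variable is "" as a string and 0 as a number.
nawk '$1 ~ /Tom/ {wage = $2 * $3; print wage}' filename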
Built-in variables:
ARGC                 Number of command-line arguments
ARGIND               Index in ARGV of the current file being processed from the command line (gawk only)
ARGV                 Array of command-line arguments
CONVFMT              Conversion format for numbers, %.6g by default
ENVIRON              An array containing the values of the current environment variables passed in from the shell
ERRNO                Contains a string describing a system error occurring from redirection when reading from the getline function or when using the close function (gawk only)
FIELDWIDTHS          A whitespace-separated list of fieldwidths used instead of FS when splitting records of fixed fieldwidth (gawk only)
FILENAME             Name of current input file
FNR                  Record number in current file
FS                   The input field separator, by default a space
IGNORECASE           Turns off case sensitivity in regular expressions and string operations (gawk only)
NF                   Number of fields in the current record; $NF refers to the last field
NR                   Number of the current record
OFMT                 Output format for numbers
OFS                  Output field separator
ORS                  Output record separator
RLENGTH              Length of string matched by match function
RS                   Input record separator
RSTART               Offset of string matched by match function
RT                   The record terminator; gawk sets it to the input text that matched the character or regex specified by RS
SUBSEP               Subscript separator
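BEGIN: the action after a BEGIN pattern runs before awk reads any input; it can initialize built-in variables or do other setup.
END: the action after an END pattern runs after awk has processed all the input.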
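
A minimal sketch of how BEGIN, the main rule and END fit together: BEGIN sets the separators, the main rule accumulates a sum, and END reports it. The records and the names wage, sum and total are made up for illustration.

```shell
total=$(printf '%s\n' 'Tom:10:3' 'Ann:20:2' | awk '
BEGIN { FS = ":"; OFS = "=" }          # runs once, before any input
{ wage = $2 * $3; sum += wage }        # runs for every record
END { print "total", sum, NR }         # runs once, after the last record
')
printf '%s\n' "$total"
```

The variable sum is never initialized, which works because an unset awk variable counts as 0 in arithmetic.
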
Redirection:
nawk '$4 >= 70 {print $1, $2 > "passing_file"}' filename
nawk 'BEGIN{while(("ls" | getline) > 0) print}'
Conditional statements:
{
    if ( $3 > 89 && $3 < 101 ) Agrade++
    else if ( $3 > 79 ) Bgrade++
    else if ( $3 > 69 ) Cgrade++
    else if ( $3 > 59 ) Dgrade++
    else Fgrade++
}
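
The grading fragment becomes a complete program once an END rule prints the counters. The sketch below assumes the score sits in the third field, as in the fragment; the student records are invented.

```shell
scores=$(printf '%s\n' 'amy math 95' 'bob math 85' 'cat math 72' \
    'dan math 60' 'eve math 10' 'fay math 100')

report=$(printf '%s\n' "$scores" | awk '
{
    if ($3 > 89 && $3 < 101) Agrade++
    else if ($3 > 79) Bgrade++
    else if ($3 > 69) Cgrade++
    else if ($3 > 59) Dgrade++
    else Fgrade++
}
END {
    # Adding 0 forces a number, so a counter that was never used prints as 0.
    print "A=" (Agrade + 0), "B=" (Bgrade + 0), "C=" (Cgrade + 0), \
          "D=" (Dgrade + 0), "F=" (Fgrade + 0)
}')
printf '%s\n' "$report"
```
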
Loops: the standard while and for loops are supported, along with break and continue.
{
    for ( x = 3; x <= NF; x++ )
        if ( $x == 0 ) { print "Get next item"; continue }
}
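Arrays: awk arrays are associative (maps); an index can be a number or a string. Multidimensional arrays are supported as well, by joining the subscripts with SUBSEP.
nawk '{id[NR]=$3};END{for(x = 1; x <= NR; x++) print id[x]}' employees
nawk '/^Tom/{name[NR]=$1};END{for(i in name){print name[i]}}' db
nawk '{count[$2]++}END{for(name in count)print name,count[name] }' datafile4
split(string, array, FS) splits string on the separator FS and stores the pieces in array.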
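
A short sketch of the three array uses mentioned above: counting by string key, split, and a two-subscript index. The input and the array names are assumptions; the output of for (name in count) is piped through sort because awk does not define the iteration order.

```shell
# Count how often each value of field 2 occurs.
counts=$(printf '%s\n' 'x apple' 'y pear' 'z apple' |
    awk '{ count[$2]++ } END { for (name in count) print name, count[name] }' |
    sort)

# split returns the number of pieces and fills the array from index 1.
parts=$(awk 'BEGIN { n = split("2021-12-02", piece, "-"); print n, piece[1], piece[3] }')

# Two subscripts are joined into one key, which the in operator can test.
pair=$(awk 'BEGIN { cell[1, 2] = "x"; if ((1, 2) in cell) print "found" }')

printf '%s\n' "$counts" "$parts" "$pair"
```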

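Built-in functions:
sub(regex, string, [target])    replaces the first match of regex in target (default $0) with string; gsub replaces every match
index(string, substr)           returns the position of substr in string, or 0 if it is absent
length(string)                  returns the length of string
substr(string, start, [len])    returns the substring of length len that begins at start
match(string, regex)            returns the position where regex matches in string, and sets RSTART and RLENGTH
sprintf()                       returns a formatted string
awk '{line = sprintf("%-15s %6.2f ", $1, $3); print line}' filename
sin cos exp int log rand atan2 sqrt srand and so on
substr is often used to pick apart fixed-width fields that have no separator, while gsub is typically used to replace unwanted characters so that the resulting string is more meaningful.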
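
The string functions can be exercised in a single BEGIN block, with no input file. This is a sketch on the made-up string "hello world"; the variable names are hypothetical.

```shell
out=$(awk 'BEGIN {
    s = "hello world"
    print index(s, "world")                     # position of the substring
    print length(s)                             # number of characters
    print substr(s, 1, 5)                       # first five characters
    if (match(s, /o+/)) print RSTART, RLENGTH   # where the match starts, how long it is
    t = s; sub(/o/, "0", t); print t            # only the first o is replaced
    u = s; gsub(/o/, "0", u); print u           # every o is replaced
    print sprintf("%-8s|%6.2f", "Tom", 3.14159) # padded name, rounded number
}')
printf '%s\n' "$out"
```

sub and gsub modify their third argument in place, which is why the sketch copies s into t and u first.
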
User-defined functions:
function name ( parameter, parameter, parameter, ... )
{
    statements
    return expression
}
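
As one possible instance of the template above, the sketch below defines a hypothetical max function and calls it once per input line. The function name and the sample numbers are made up.

```shell
larger=$(printf '%s\n' '3 9' '12 4' | awk '
function max(a, b) {
    # Adding 0 forces a numeric comparison rather than a string comparison.
    return (a + 0 > b + 0) ? a : b
}
{ print max($1, $2) }')
printf '%s\n' "$larger"
```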