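On Linux, the most commonly used text-processing commands are awk, sed, and grep.
awk is oriented toward columns (fields), while sed is oriented toward lines.
Let's start with grep, which searches and filters text.
Run grep --help to see grep's options and usage: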
$ grep --help
Usage: grep [OPTION]... PATTERN [FILE]...
Search for PATTERN in each FILE or standard input.
PATTERN is, by default, a basic regular expression (BRE).
Example: grep -i 'hello world' menu.h main.c
Regexp selection and interpretation:
-E, --extended-regexp PATTERN is an extended regular expression (ERE)
-F, --fixed-strings PATTERN is a set of newline-separated fixed strings
-G, --basic-regexp PATTERN is a basic regular expression (BRE)
-P, --perl-regexp PATTERN is a Perl regular expression
-e, --regexp=PATTERN use PATTERN for matching
-f, --file=FILE obtain PATTERN from FILE
-i, --ignore-case ignore case distinctions
-w, --word-regexp force PATTERN to match only whole words
-x, --line-regexp force PATTERN to match only whole lines
-z, --null-data a data line ends in 0 byte, not newline
Miscellaneous:
-s, --no-messages suppress error messages
-v, --invert-match select non-matching lines
-V, --version print version information and exit
--help display this help and exit
--mmap ignored for backwards compatibility
Output control:
-m, --max-count=NUM stop after NUM matches
-b, --byte-offset print the byte offset with output lines
-n, --line-number print line number with output lines
--line-buffered flush output on every line
-H, --with-filename print the filename for each match
-h, --no-filename suppress the prefixing filename on output
--label=LABEL print LABEL as filename for standard input
-o, --only-matching show only the part of a line matching PATTERN
-q, --quiet, --silent suppress all normal output
--binary-files=TYPE assume that binary files are TYPE;
TYPE is `binary', `text', or `without-match'
-a, --text equivalent to --binary-files=text
-I equivalent to --binary-files=without-match (binary files are treated as non-matching; printing only the names of matching files is -l)
-d, --directories=ACTION how to handle directories;
ACTION is `read', `recurse', or `skip'
-D, --devices=ACTION how to handle devices, FIFOs and sockets;
ACTION is `read' or `skip'
-R, -r, --recursive equivalent to --directories=recurse
--include=FILE_PATTERN search only files that match FILE_PATTERN
--exclude=FILE_PATTERN skip files and directories matching FILE_PATTERN
--exclude-from=FILE skip files matching any file pattern from FILE
--exclude-dir=PATTERN directories that match PATTERN will be skipped.
-L, --files-without-match print only names of FILEs containing no match
-l, --files-with-matches print only names of FILEs containing matches
-c, --count print only a count of matching lines per FILE
-T, --initial-tab make tabs line up (if needed)
-Z, --null print 0 byte after FILE name
Context control:
-B, --before-context=NUM print NUM lines of leading context
-A, --after-context=NUM print NUM lines of trailing context
-C, --context=NUM print NUM lines of output context
-NUM same as --context=NUM
--color[=WHEN],
--colour[=WHEN] use markers to highlight the matching strings;
WHEN is `always', `never', or `auto'
-U, --binary do not strip CR characters at EOL (MSDOS)
-u, --unix-byte-offsets report offsets as if CRs were not there (MSDOS)
`egrep' means `grep -E'. `fgrep' means `grep -F'.
Direct invocation as either `egrep' or `fgrep' is deprecated.
With no FILE, or when FILE is -, read standard input. If less than two FILEs
are given, assume -h. Exit status is 0 if any line was selected, 1 otherwise;
if any error occurs and -q was not given, the exit status is 2.
Report bugs to: bug-grep@gnu.org
GNU Grep home page: <http://www.gnu.org/software/grep/>
General help using GNU software: <http://www.gnu.org/gethelp/>
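The exit-status rule quoted at the end of the help text is what makes grep usable in shell conditionals. A minimal sketch, assuming a throwaway file created with mktemp; the file contents and the variable names (sample, found, miss_status, err_status) are made up for illustration:

```shell
# Illustrative sample file; the contents are invented for this sketch.
sample=$(mktemp)
printf 'root:x:0:0\nntp:x:38:38\n' > "$sample"

# -q suppresses normal output; only the exit status is used.
if grep -q 'root' "$sample"; then
    found=yes
else
    found=no
fi
echo "root present: $found"

# No line matches, so grep exits with status 1.
miss_status=0
grep -q 'mysql' "$sample" || miss_status=$?
echo "status for a pattern with no match: $miss_status"

# A nonexistent file is an error, so the status is 2 (-s hides the message).
err_status=0
grep -s 'root' "$sample.does-not-exist" || err_status=$?
echo "status for a missing file: $err_status"

rm -f "$sample"
```

Testing the status with if or || avoids parsing grep's output just to learn whether anything matched.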
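grep has many options; start with the most commonly used ones.
Common options:
-E : enable extended regular expressions (ERE).
-i : ignore case.
-v : invert the match: print only the lines that do not match.
-n : print line numbers.
-w : match whole words only, not parts of words. For example, if the text contains liker and you are searching for like, -w keeps liker from matching.
-c : print the number of matching lines instead of the lines themselves. Combined with -v (-cv), it prints the number of lines that do not match.
-o : print only the part of each line that matches the pattern.
--color : highlight the matched text in color.
-A n : print each matching line and the n lines after it (after).
-B n : print each matching line and the n lines before it (before).
-C n : print each matching line and the n lines before and after it (context).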
$ grep 'root' /etc/passwd # print the lines of /etc/passwd that contain root
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
$ grep -i 'ROOT' /etc/passwd # print the lines that contain root, ignoring case
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
$ grep -n 'root' /etc/passwd # print the lines that contain root, with line numbers
1:root:x:0:0:root:/root:/bin/bash
11:operator:x:11:0:operator:/root:/sbin/nologin
$ grep -c 'root' /etc/passwd # count the lines that contain root
2
$ grep -vc 'root' /etc/passwd # count the lines that do not contain root
33
$ grep -o 'root' /etc/passwd # print every occurrence of root, one per line
root
root
root
root
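The -c and -o runs above differ because -c counts matching lines while -o emits one line per match. A small sketch of counting occurrences rather than lines, assuming a temporary file that holds the two matching lines shown in the output above; the variable names are illustrative:

```shell
# The two matching lines from the output above, written to a temp file.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'operator:x:11:0:operator:/root:/sbin/nologin' > "$sample"

# -c counts matching lines, however many matches each line holds.
line_count=$(grep -c 'root' "$sample")

# -o prints each match on its own line, so wc -l counts occurrences.
match_count=$(grep -o 'root' "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, occurrences: $match_count"
rm -f "$sample"
```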
$ more rt_data.txt
10026 201501120030 5170
10026 201501120100 5669
10026 201501120130 2396
10026 201501120200 1498
10026 201501120230 1997
10026 201501120300 1188
10026 201501120330 598
10026 201501120400 479
10026 201501120430 1587
10026 201501120530 799
10027 201501120030 2170
10027 201501120100 1623
10027 201501120130 3397
10027 201501120200 1434
10027 201501120230 1001
10028 201501120300 1687
10028 201501120330 1298
10028 201501120400 149
10029 201501120430 2587
10029 201501120530 589
$ grep -w '10026' rt_data.txt # print the lines that contain 10026 as a whole word
10026 201501120030 5170
10026 201501120100 5669
10026 201501120130 2396
10026 201501120200 1498
10026 201501120230 1997
10026 201501120300 1188
10026 201501120330 598
10026 201501120400 479
10026 201501120430 1587
10026 201501120530 799
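In rt_data.txt a plain search for 10026 would return the same rows, so the effect of -w is easier to see on the like/liker case described in the option list. A minimal sketch with made-up sample lines; the file contents and variable names are assumptions for illustration:

```shell
# Invented sample lines: one whole word "like" and two longer words.
sample=$(mktemp)
printf '%s\n' 'I like grep' 'a liker of sed' 'alike in awk' > "$sample"

# Without -w the pattern matches as a substring: all three lines.
plain=$(grep -c 'like' "$sample")

# With -w only the line where "like" stands alone as a word matches.
whole=$(grep -w 'like' "$sample")

echo "lines matched without -w: $plain"
echo "line matched with -w: $whole"
rm -f "$sample"
```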
$ grep -A 2 '201501120030' rt_data.txt # print the lines that contain 201501120030 and the two lines after each
10026 201501120030 5170
10026 201501120100 5669
10026 201501120130 2396
--
10027 201501120030 2170
10027 201501120100 1623
10027 201501120130 3397
$ grep -B 2 '201501120030' rt_data.txt # print the lines that contain 201501120030 and the two lines before each
10026 201501120030 5170
--
10026 201501120430 1587
10026 201501120530 799
10027 201501120030 2170
$ grep -C 2 '201501120030' rt_data.txt # print the lines that contain 201501120030 and the two lines before and after each
10026 201501120030 5170
10026 201501120100 5669
10026 201501120130 2396
--
10026 201501120430 1587
10026 201501120530 799
10027 201501120030 2170
10027 201501120100 1623
10027 201501120130 3397
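The option list also mentions -v and -E, which the runs above do not exercise on their own. A short sketch, assuming a temporary file with four rows in the same three-column layout as rt_data.txt; the variable names are illustrative:

```shell
# Four rows taken from the rt_data.txt listing above.
sample=$(mktemp)
printf '%s\n' \
    '10026 201501120030 5170' \
    '10027 201501120030 2170' \
    '10028 201501120300 1687' \
    '10029 201501120430 2587' > "$sample"

# -v keeps the lines that do NOT match the pattern.
not_26=$(grep -v '^10026' "$sample" | wc -l | tr -d ' ')

# -E switches to ERE, where ( ) and | work without backslashes.
either=$(grep -E '^(10027|10029) ' "$sample")

echo "rows not starting with 10026: $not_26"
echo "$either"
rm -f "$sample"
```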
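Regular expression matching
Matching characters:
. : any single character.
[abc] : one character, which must be a, b, or c.
[a-zA-Z] : one character, which must be one of the 52 letters a-z or A-Z.
[^123] : one character, which can be anything except 1, 2, or 3.
Predefined classes exist for common character sets (the equivalences below hold in the C/POSIX locale):
[A-Za-z] is equivalent to [[:alpha:]]
[0-9] is equivalent to [[:digit:]]
[A-Za-z0-9] is equivalent to [[:alnum:]]
whitespace such as tab and space: [[:space:]]
[A-Z] is equivalent to [[:upper:]]
[a-z] is equivalent to [[:lower:]]
punctuation: [[:punct:]]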
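A quick way to get a feel for bracket expressions and the named classes is to run them against a tiny file. A minimal sketch; the four sample lines and the variable names are invented for illustration:

```shell
# Invented sample: lowercase, uppercase, digits, and a mixed line with a space.
sample=$(mktemp)
printf '%s\n' 'abc' 'ABC' '123' 'a1 b2' > "$sample"

# Lines containing at least one uppercase letter.
upper=$(grep '[[:upper:]]' "$sample")

# Lines containing at least one whitespace character.
space=$(grep '[[:space:]]' "$sample")

# Lines containing at least one character that is not a letter.
non_alpha=$(grep -c '[^[:alpha:]]' "$sample")

# Lines containing a, b, or c.
abc_lines=$(grep -c '[abc]' "$sample")

echo "upper: $upper"
echo "space: $space"
echo "non-alpha lines: $non_alpha, [abc] lines: $abc_lines"
rm -f "$sample"
```

The named classes are usually the safer choice, since what a range such as [a-z] covers can depend on the locale.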
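Repetition:
\{m,n\} : the preceding item at least m and at most n times.
\? : the preceding item zero or one time; equivalent to \{0,1\}. In basic regular expressions this is a GNU extension.
* : the preceding item any number of times, including zero; equivalent to \{0,\}. So ".*" means any character any number of times, that is, it matches anything.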
$ grep -n '/.*sh' /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
7:shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
32:sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
34:rdedu:x:500:500::/home/rdedu:/bin/bash
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
$ grep -n '/.\{0,3\}sh' /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
7:shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
32:sshd:x:74:74:Privilege-separated SSH:/var/empty/sshd:/sbin/nologin
34:rdedu:x:500:500::/home/rdedu:/bin/bash
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
$ grep -n -w '.\{0,3\}sh' /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
34:rdedu:x:500:500::/home/rdedu:/bin/bash
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
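The /etc/passwd runs above depend on what happens to be in that file, so here is a smaller, self-contained sketch of the same repetition operators. It assumes GNU grep (for \?) and uses made-up sample words; -x forces each pattern to match the whole line so the counts are easy to check by eye:

```shell
# Invented sample words.
sample=$(mktemp)
printf '%s\n' 'color' 'colour' 'colouur' 'ab' 'aab' 'aaaab' > "$sample"

# \? : the u may appear zero or one time (GNU extension in BRE).
optional_u=$(grep -x 'colou\?r' "$sample")

# \{2,3\} : two or three a's, then b. Only "aab" fits the whole line.
two_to_three=$(grep -c -x 'a\{2,3\}b' "$sample")

# With -E (ERE) the same operators are written without backslashes.
ere_optional_u=$(grep -E -x 'colou?r' "$sample")

echo "$optional_u"
echo "lines with 2-3 a's before b: $two_to_three"
rm -f "$sample"
```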
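Anchors:
^ : anchors the match at the start of the line.
$ : anchors the match at the end of the line. Tip: "^$" matches empty lines.
\b or \< : anchors the start of a word. For example, "\blike" does not match alike but does match liker.
\b or \> : anchors the end of a word. For example, "\blike\b" matches neither alike nor liker, only like. Note that \b matches at either edge of a word, while \< and \> match only the start and the end respectively.
\B : the opposite of \b: matches where there is no word boundary.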
$ grep -n --color 'sh\b' /etc/passwd
1:root:x:0:0:root:/root:/bin/bash
34:rdedu:x:500:500::/home/rdedu:/bin/bash
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
$ grep -n --color '^mysql' /etc/passwd
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
$ grep -n --color 'sync$' /etc/passwd
6:sync:x:5:0:sync:/sbin:/bin/sync
$ grep -n --color '\bmy' /etc/passwd
35:mysql:x:27:27:MySQL Server:/var/lib/mysql:/bin/bash
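The like/alike/liker cases from the anchor list can be checked directly. A minimal sketch, assuming GNU grep (\<, \>, and \B are GNU extensions) and a made-up word list that includes one empty line:

```shell
# Invented word list; the fourth line is empty on purpose.
sample=$(mktemp)
printf '%s\n' 'like' 'alike' 'liker' '' 'unlikely' > "$sample"

blank=$(grep -c '^$' "$sample")        # empty lines
starts=$(grep '\<like' "$sample")      # "like" at the start of a word
ends=$(grep 'like\>' "$sample")        # "like" at the end of a word
exact=$(grep '\<like\>' "$sample")     # "like" as a whole word
inside=$(grep '\Blike\B' "$sample")    # "like" strictly inside a word

echo "empty lines: $blank"
echo "word start: $starts"
echo "word end: $ends"
echo "whole word: $exact"
echo "inside a word: $inside"
rm -f "$sample"
```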
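Grouping and back-references:
\(string\) : groups string as a unit so that it can be referenced later.
\1 : refers to the text matched by the first left parenthesis and its matching right parenthesis.
\2 : refers to the text matched by the second left parenthesis and its matching right parenthesis.
\n : refers to the text matched by the nth left parenthesis and its matching right parenthesis (n is a digit from 1 to 9).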
$ grep '\(nt\)\([a-zA-Z]\).*\1' /etc/passwd # print lines where nt, followed by a letter, appears again later in the line
ntp:x:38:38::/etc/ntp:/sbin/nologin
$ grep '^\(n\)\([a-zA-Z]\).*\1' /etc/passwd # print lines that start with n plus a letter and contain another n later
nobody:x:99:99:Nobody:/:/sbin/nologin
nfsnobody:x:65534:65534:Anonymous NFS User:/var/lib/nfs:/sbin/nologin
ntp:x:38:38::/etc/ntp:/sbin/nologin
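A back-reference matches the same text the group captured, not the group's pattern again. A small sketch of that distinction, assuming a temporary file holding three passwd-style lines copied from the outputs above; the variable names are illustrative:

```shell
# Three passwd-style lines taken from the outputs shown earlier.
sample=$(mktemp)
printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'sync:x:5:0:sync:/sbin:/bin/sync' \
    'ntp:x:38:38::/etc/ntp:/sbin/nologin' > "$sample"

# Capture the login name, then require the line to end with "/<that name>".
# Only sync qualifies, because its shell field is /bin/sync.
same_name=$(grep '^\([a-z]*\):.*/\1$' "$sample")

# \(.\)\1 matches any character immediately followed by itself
# ("oo" in the root line, "::" in the ntp line).
doubled=$(grep -c '\(.\)\1' "$sample")

echo "login name repeated as the last path component: $same_name"
echo "lines with a doubled character: $doubled"
rm -f "$sample"
```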