Common Text-Analysis Tools on Linux

Deepmilu

Category: linux  Tags: linux shell

Article link: https://blog.csdn.net/ldy_2017/article/details/103462496

On Linux, use vim to create a text file named test to experiment with, and enter the following content:
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test

1. grep

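grep examines its input line by line and prints every line that contains the information we are looking for. The basic syntax is:
grep [-acinv] [--color=auto] 'search string' filename
Options and parameters: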

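-a: search a binary file as if it were a text file
-c: print the number of lines that contain the search string
-i: ignore case
-n: print line numbers
-v: invert the match, that is, show the lines that do NOT contain the search string
--color: highlight the matched part in color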
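
As a quick illustration of how these options combine, here is a minimal sketch. It assumes the sample file from the top of this article is recreated at a temporary path; the variable names (f, with_i, without_name) are made up for the example.

```shell
# Recreate the article's sample file at a temporary path (assumed location).
f=$(mktemp)
cat > "$f" <<'EOF'
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test
EOF

# -i ignores case and -n adds line numbers, so both "This" and "this" match.
with_i=$(grep -in 'this' "$f")
echo "$with_i"

# -v inverts the match and -c counts lines: how many lines lack "name"?
without_name=$(grep -vc 'name' "$f")
echo "$without_name"

rm -f "$f"
```

With the sample content above, the first command should list lines 5 and 7, and the second should print 5.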

[Example 1] Find the lines that contain 'test'
vagrant@vagrant-ubuntu-trusty-64:~$ grep test test
Output: this is a test file,write some words to test

[Example 2] Count the lines that contain 'name'
vagrant@vagrant-ubuntu-trusty-64:~$ grep -c name test
Output: 2

[Example 3] Match a set of characters with brackets
vagrant@vagrant-ubuntu-trusty-64:~$ grep -n 't[ae]st' test
Output:
5:This dish  tast good
7:this is a test file,write some  words to test
vagrant@vagrant-ubuntu-trusty-64:~$ grep -n '[^t]oo' test
Output: 5:This dish  tast good

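[Example 4] Line-start and line-end anchors ^ and $
vagrant@vagrant-ubuntu-trusty-64:~$ grep -n '^this' test
Output: 7:this is a test file,write some  words to test
vagrant@vagrant-ubuntu-trusty-64:~$ grep -n '^$' test
Output: the empty lines (the sample file has none, so nothing is printed)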
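
Since the sample file has no empty lines, the '^$' pattern is easier to see on a tiny made-up input. The sketch below pipes three lines (the middle one empty) into grep; the data and variable names are only for illustration.

```shell
# ^$ matches a line with nothing between its start and its end.
blank_count=$(printf 'first\n\nlast\n' | grep -c '^$')
echo "$blank_count"        # prints 1

# Combined with -v, the same pattern strips empty lines.
non_blank=$(printf 'first\n\nlast\n' | grep -v '^$')
echo "$non_blank"

# $ on its own anchors a match to the end of the line.
ends_in_t=$(printf 'first\n\nlast\n' | grep -n 't$')
echo "$ends_in_t"
```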

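2. egrep

Equivalent to grep -E. grep supports basic regular expressions by default, while egrep supports extended regular expressions, which add the following RE characters: +, ?, |, (), ()+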
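
A minimal sketch of the extended operators follows, written with grep -E (recent GNU grep releases treat the egrep name as obsolescent). It assumes the sample file is recreated at a temporary path; the -o option, which prints only the matched part, is not covered in this article and is used here as a convenience available in GNU grep.

```shell
# Recreate the article's sample file at a temporary path (assumed location).
f=$(mktemp)
cat > "$f" <<'EOF'
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test
EOF

# | is alternation in ERE: lines containing "hello" or "phone".
alt=$(grep -En 'hello|phone' "$f")
echo "$alt"

# Without -E the | is an ordinary character, so nothing matches.
bre_count=$(grep -c 'hello|phone' "$f" || true)
echo "$bre_count"          # prints 0

# + (one or more) and () (grouping): digit runs joined by hyphens.
num=$(grep -Eo '[0-9]+(-[0-9]+)+' "$f")
echo "$num"                # prints 162-8990-0988

rm -f "$f"
```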

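3. awk

awk is a capable data-processing tool. Whereas sed usually operates on whole lines, awk tends to split each line into fields and work on those. awk processes one line at a time, with the field as its smallest unit of processing, which makes it well suited to small text data sets.
awk is typically run like this:
awk 'condition1{action1} condition2{action2}' filename
awk uses $1, $2, ... to refer to the fields (columns) of each line; by default, fields are separated by spaces or tabs. As a special case, $0 refers to the whole line. The following built-in variables are also available: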

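Variable	Meaning
NF	Number of fields in the current line
NR	Number of the line awk is currently processing
FS	Current field separator; the default is whitespace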
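
The three variables can be tried out on the sample file from the top of the article. This is only a sketch: the temporary path and the variable names are assumptions, and FS is set in a BEGIN block so that it is in effect before the first line is read.

```shell
# Recreate the article's sample file at a temporary path (assumed location).
f=$(mktemp)
cat > "$f" <<'EOF'
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test
EOF

# Default separator: line 2 splits on spaces into 3 fields.
fields=$(awk 'NR==2 {print NF, $1}' "$f")
echo "$fields"             # prints: 3 what's

# FS=":" splits on colons instead; $NF is the last field of the line.
# Only lines that actually contain a colon have more than one field.
colon=$(awk 'BEGIN {FS=":"} NF>1 {print NR, NF, $NF}' "$f")
echo "$colon"

rm -f "$f"
```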

[Example 1:] Print the first column, the line number, and the number of fields of each line.
vagrant@vagrant-ubuntu-trusty-64:~$ last|awk '{print $1 "\t  lines: " NR "\t columns: " NF}'
Output:
vagrant	  lines: 1	 columns: 10
reboot	  lines: 2	 columns: 11
vagrant	  lines: 3	 columns: 10
reboot	  lines: 4	 columns: 11
	  	  lines: 5	 columns: 0
wtmp	  lines: 6	 columns: 7

awk comparison operators:

Operator	Meaning
>	Greater than
<	Less than
>=	Greater than or equal to
<=	Less than or equal to
==	Equal to
!=	Not equal to
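
A comparison can be used directly as the condition in front of an action. The sketch below applies two of the operators to the sample file (recreated at an assumed temporary path): one numeric comparison on NF and one string comparison on $1.

```shell
# Recreate the article's sample file at a temporary path (assumed location).
f=$(mktemp)
cat > "$f" <<'EOF'
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test
EOF

# Numeric comparison: lines with at least 8 space-separated fields.
long_lines=$(awk 'NF >= 8 {print NR ": " NF}' "$f")
echo "$long_lines"

# String comparison: the line whose first field is exactly "too".
too_line=$(awk '$1 == "too" {print NR}' "$f")
echo "$too_line"           # prints 6

rm -f "$f"
```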

[Example 2:] Use a conditional expression
vagrant@vagrant-ubuntu-trusty-64:~$ cat /etc/passwd| awk 'BEGIN {FS=":"} $3<10 {print $1 "\t " $3}'
Output:
root	 0
daemon	 1
bin	 	 2
sys	 	 3
sync	 4
games	 5
man	 	 6
lp	 	 7
mail	 8
news	 9

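Because the field separator in /etc/passwd is ":", FS is set in advance with the BEGIN keyword, so that it already applies when the first line is read.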
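
To see why BEGIN matters, the sketch below runs the same filter three ways on a small passwd-style file. The three entries are made-up sample data rather than a real /etc/passwd, and -F is awk's command-line option for setting the field separator, which this article does not otherwise cover.

```shell
# A tiny passwd-like file (hypothetical entries, for illustration only).
p=$(mktemp)
cat > "$p" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
vagrant:x:1000:1000::/home/vagrant:/bin/bash
EOF

# FS set in BEGIN: in effect before the first line is split.
with_begin=$(awk 'BEGIN {FS=":"} $3<10 {print $1 "\t" $3}' "$p")

# -F: on the command line has the same effect.
with_flag=$(awk -F: '$3<10 {print $1 "\t" $3}' "$p")

# FS set in an ordinary action: line 1 has already been split on
# whitespace, so $1 is the entire first line and $3 is empty.
without_begin=$(awk '{FS=":"} $3<10 {print $1 "\t" $3}' "$p")

echo "$with_begin"
echo "$without_begin"

rm -f "$p"
```

The first two variants should agree (root and daemon with their UIDs), while the third prints the unsplit first line.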

Next, create a file named salary.txt to try out awk's arithmetic. It contains the following:

Name June July August
xiaoming 3008 4999 8722
damao 788 8999 1000
meimei 8991 9011 10003

[Example 3:] Compute each employee's total salary for June, July, and August
vagrant@vagrant-ubuntu-trusty-64:~$ cat salary.txt | awk 'NR==1{printf  "%10s %10s %10s %10s %10s\n",$1,$2,$3,$4,"Total"}  NR>=2{total=$2+$4+$3 ; printf "%10s %10s %10s %10s %10.2f\n",$1,$2,$3,$4,total}'

Output:
      Name       June       July     August      Total
  xiaoming       3008       4999       8722   16729.00
     damao        788       8999       1000   10787.00
    meimei       8991       9011      10003   28005.00
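
The same idea works down the columns instead of across the rows. The sketch below sums each month over all employees; it relies on an END block, the counterpart of BEGIN that runs after the last line, which is not covered above. The temporary path and the variable names are assumptions.

```shell
# Recreate salary.txt at a temporary path (assumed location).
s=$(mktemp)
cat > "$s" <<'EOF'
Name June July August
xiaoming 3008 4999 8722
damao 788 8999 1000
meimei 8991 9011 10003
EOF

# Skip the header (NR>=2), accumulate each month's column,
# and print the three sums once all input has been read.
sums=$(awk 'NR>=2 {june+=$2; july+=$3; aug+=$4}
            END   {printf "%d %d %d\n", june, july, aug}' "$s")
echo "$sums"               # prints: 12787 23009 19725

rm -f "$s"
```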

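4. sed

sed is a handy text-processing tool that can substitute, delete, and insert data.
Usage: sed [-nefri] [action]
Options and parameters:
-n: quiet mode. Normally sed prints every line it reads from stdin to the screen; with -n, only the lines that sed has specially processed are printed.
-e: give the sed action directly on the command line
-f: read the sed actions from a file; -f filename runs the sed actions stored in filename.
-i: modify the file being read in place instead of printing to the screen
-r: use extended regular expression syntax; the default is basic regular expression syntax.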

Action format: [n1[,n2]] function
n1,n2: optional; they usually give the line numbers the action applies to. For example, to act on lines 10 through 20, write 10,20[function].

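The available functions are:
a: append; the text that follows a appears on a new line after the current line;
c: change; the text that follows c replaces the lines between n1 and n2;
d: delete; takes no argument;
i: insert; the text that follows i appears on a new line before the current line;
p: print the selected lines; p is usually used together with sed -n;
s: substitute; performs a replacement directly and is usually combined with regular expressions, for example 1,20s/old/new/g.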
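
The p and s functions are the ones most often combined with -n and a line range. Here is a minimal sketch on the sample file, recreated at an assumed temporary path; the variable names are for illustration only.

```shell
# Recreate the article's sample file at a temporary path (assumed location).
f=$(mktemp)
cat > "$f" <<'EOF'
hello:world:name:phone
what's your name
you should work hard or you will failure
phone number:162-8990-0988
This dish tast good
too many problem need solve
this is a test file,write some words to test
EOF

# -n suppresses the default output; p prints only lines 2 through 3.
p_out=$(sed -n '2,3p' "$f")
echo "$p_out"

# s restricted to lines 1-2; the trailing p prints only the lines
# where a substitution actually happened.
s_out=$(sed -n '1,2s/name/NAME/p' "$f")
echo "$s_out"

rm -f "$f"
```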

[Example 1:] Delete lines 2-6
vagrant@vagrant-ubuntu-trusty-64:~$ nl test |sed '2,6d'
Output:
     1	hello:world:name:phone
     7	this is a test file,write some  words to test

[Example 2:] Append "hello world" after the second line
vagrant@vagrant-ubuntu-trusty-64:~$ nl test |sed '2a hello world'
Output:
     1	hello:world:name:phone
     2	what's  your name
hello world
     3	you should work hard or you will failure
     4	phone number:162-8990-0988
     5	This dish  tast good
     6	too many problem need solve
     7	this is a test file,write some  words to test
To append several lines:
vagrant@vagrant-ubuntu-trusty-64:~$ nl test |sed '2a hello world\nyou name'
Output:
     1	hello:world:name:phone
     2	what's  your name
hello world
you name
     3	you should work hard or you will failure
     4	phone number:162-8990-0988
     5	This dish  tast good
     6	too many problem need solve
     7	this is a test file,write some  words to test

[Example 3:] Replace whole lines
vagrant@vagrant-ubuntu-trusty-64:~$ nl test |sed '2,3c  changed'
Output:
     1	hello:world:name:phone
changed
     4	phone number:162-8990-0988
     5	This dish  tast good
     6	too many problem need solve
     7	this is a test file,write some  words to test


[Example 4:] Find and replace part of a line
vagrant@vagrant-ubuntu-trusty-64:~$ grep 162 test | sed 's/[0-9]*-[0-9]*-[0-9]*/110/g'
Output:
phone number:110

[Example 5:] Modify the file directly
vagrant@vagrant-ubuntu-trusty-64:~$ sed -i '$a #add by sed'  test
The line "#add by sed" is appended to the end of the test file.
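
Because -i overwrites the file, it can be worth keeping a copy while experimenting. The sketch below assumes GNU sed, where a suffix attached to -i (here .bak) saves the original under that extra extension; the file content is made up for the example.

```shell
# A throwaway two-line file (hypothetical content).
f=$(mktemp)
printf 'line one\nline two\n' > "$f"

# Edit in place, keeping the untouched original as "$f.bak" (GNU sed).
sed -i.bak '$a #add by sed' "$f"

last_line=$(tail -n 1 "$f")
backup_lines=$(grep -c '' "$f.bak")
echo "$last_line"          # prints: #add by sed
echo "$backup_lines"       # prints 2: the backup still has the original lines

rm -f "$f" "$f.bak"
```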

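5. cut

cut extracts a section from each line of its input; it works line by line.
Options and parameters:
-d: the delimiter character that follows; used together with -f;
-f: split each line into fields on the -d delimiter and select which fields to output;
-c: extract a fixed range of characters;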

[Example 1] From the first line of the file, extract the second field, using ":" as the delimiter
vagrant@vagrant-ubuntu-trusty-64:~$ grep hello test | cut -d':' -f 2
Output: world

[Example 2] From the first line of the file, extract everything from the 10th character to the end
vagrant@vagrant-ubuntu-trusty-64:~$ grep hello test | cut -c 10-
Output: ld:name:phone

[Example 3] From the first line of the file, extract the second and fourth fields, using ":" as the delimiter
vagrant@vagrant-ubuntu-trusty-64:~$ grep hello test | cut -d':' -f 2,4
Output: world:phone
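
Field and character selections also accept ranges, and it helps to know what cut does with a line that lacks the delimiter. The following sketch uses the first line of the sample file plus one made-up line; the -s option (skip lines without the delimiter) is not covered above.

```shell
line='hello:world:name:phone'

# An open-ended field range: field 2 through the last field.
rest=$(printf '%s\n' "$line" | cut -d':' -f 2-)
echo "$rest"               # prints: world:name:phone

# A closed character range: characters 1 through 5.
first5=$(printf '%s\n' "$line" | cut -c 1-5)
echo "$first5"             # prints: hello

# A line without the delimiter is passed through unchanged ...
passthru=$(printf 'no delimiter here\n' | cut -d':' -f 2)
echo "$passthru"           # prints: no delimiter here

# ... unless -s tells cut to drop such lines.
suppressed=$(printf 'no delimiter here\n' | cut -s -d':' -f 2)
echo "[$suppressed]"       # prints: []
```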

*Reference: 《鸟哥的Linux私房菜基础学习篇》