Excerpted from Linux Shell Scripting Cookbook, Chapter 4: Texting and Driving
Counting word frequency in a given file
$ cat word_freq.sh
#!/bin/bash
# Filename: word_freq.sh
# Purpose: count how often each word occurs in a file
if [ $# -ne 1 ];then
echo "Usage: $0 filename";
exit 1
fi
filename=$1
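# The pattern \b[[:alpha:]]+\b matches each word and skips whitespace and punctuation
# The -o option prints only the matched words, one per line
# count is an associative array: the key is each line emitted by grep, the value is its number of occurrences
grep -E -o "\b[[:alpha:]]+\b" "$filename" | awk '{ count[$0]++ }END{ printf("%-14s%s\n","Word","Count") ;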
for(ind in count){
printf("%-14s%d\n",ind,count[ind]);
}
}'
$ ./word_freq.sh coco.sh
Word Count
USSR 1
yes 2
ssh 1
r 2
amlogic 1
expect 3
password 1
spawn 1
bash 1
null 1
OF 2
of 1
send 2
Devi 1
no 1
bin 2
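The listing above comes out in no particular order, because awk's for (ind in count) loop does not guarantee one, and OF and of are counted as different words. As a minimal sketch (not the book's script), the variant below folds case and sorts by descending count. The function name word_freq_sorted is hypothetical, and it matches plain runs of letters instead of relying on the \b extension, so a token such as abc1 is counted as abc.

```shell
#!/bin/bash
# word_freq_sorted (hypothetical helper): word_freq_sorted FILE
# Case-insensitive word counts, most frequent first, ties sorted alphabetically.
word_freq_sorted() {
    grep -oE '[[:alpha:]]+' "$1" |
        tr '[:upper:]' '[:lower:]' |
        sort | uniq -c |
        awk '{ print $2, $1 }' |
        sort -k2,2nr -k1,1 |
        awk '{ printf("%-14s%d\n", $1, $2) }'
}
```

Usage: word_freq_sorted coco.sh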
Compressing or decompressing JavaScript
$ cat sample.js
function sign_out()
{
$("#loading").show();
$.get("log_in",{logout:"True"},
function(){
window.location="";
});
}
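# tr -d '\n\t' removes newlines and tabs
# tr -s ' ' squeezes repeated spaces into one
# sed 's:/\*.*\*/::g' removes /* ... */ comments; s: uses : as the delimiter instead of /
# sed 's/ \?\([{}();,:]\) \?/\1/g' removes the spaces around braces, parentheses, colons, commas, and semicolons
# " \?" matches zero or one space; in a basic regular expression (GNU sed) the ? must be escaped
$ cat sample.js |tr -d '\n\t'|tr -s ' '|sed 's:/\*.*\*/::g' |sed 's/ \?\([{}();,:]\) \?/\1/g' > obfuscated.txt
$ cat obfuscated.txt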
function sign_out(){$("#loading").show();$.get("log_in",{logout:"True"},function(){window.location="";});}
$ cat obfuscated.txt | sed 's/;/;\n/g; s/{/{\n\n/g; s/}/\n\n}/g'
function sign_out(){
$("#loading").show();
$.get("log_in",{
logout:"True"
},function(){
window.location="";
});
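The \? escape used in the compression pipeline is a GNU sed extension to basic regular expressions. As a rough sketch, assuming a sed that accepts -E (both GNU and BSD sed do), the same steps can be wrapped in a function that uses extended syntax instead. The name minify_js is hypothetical. Two limitations carry over from the original pipeline: the /* ... */ pattern is greedy, so with several block comments everything between the first /* and the last */ is dropped, and // line comments are not handled, so they would swallow the code joined after them.

```shell
#!/bin/bash
# minify_js (hypothetical helper): reads JavaScript on stdin, writes it as one line.
# Steps: drop newlines and tabs, squeeze repeated spaces, strip /* ... */
# comments (greedy), then trim single spaces around { } ( ) ; , :
minify_js() {
    tr -d '\n\t' |
        tr -s ' ' |
        sed 's:/\*.*\*/::g' |
        sed -E 's/ ?([{}();,:]) ?/\1/g'
}
```

Usage: minify_js < sample.js > obfuscated.txt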
Merging multiple files column-wise
The paste command merges files column by column.
$ cat file1.txt
1
2
3
4
5
$ cat file2.txt
slynux
gnu
bash
hack
$ paste file1.txt file2.txt
1 slynux
2 gnu
3 bash
4 hack
5
# The default delimiter is a tab; use -d to specify a different one
$ paste file1.txt file2.txt -d ","
1,slynux
2,gnu
3,bash
4,hack
5,
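Putting -d after the file names, as in the command above, relies on GNU option reordering. A small sketch that places the option first, which other paste implementations also accept, could look like this. The function name merge_columns is hypothetical.

```shell
#!/bin/bash
# merge_columns (hypothetical helper): merge_columns DELIM FILE...
# Joins corresponding lines of the files, with DELIM between the columns.
merge_columns() {
    local delim=$1
    shift
    paste -d "$delim" "$@"
}
```

Usage: merge_columns , file1.txt file2.txt
When one file is shorter, the missing fields are left empty, which is why the last line of the comma-separated output above ends in a bare comma.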
Printing the nth word or column of a file or line
This kind of task is usually done with awk.
$ awk '{ print $5 }' filename
$ awk '{ print $1 }' file1.txt
1
2
3
4
5
You can also print several columns and insert a custom string between them.
$ ls -l | awk '{print$1": "$NF}'
total: 391720
drwxrwxr-x: 3d
drwxrwxr-x: 4k
-rw-rw-r--: [4KH264_29.970fps_11.1Mbps_8bit]2020-2160P.mp4
-rw-rw-r--: 2Mbps.mp4
-rw-rw-r--: [4KVP9_29.97fps_10Mbps_8bit]UD_Europe_by_Dominic_0725.webm
drwxrwxr-x: a
-rwxrwxr-x: build1.sh
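The column number in these commands is hard-coded inside the awk program. When it comes from a shell variable, one option is to pass it in with awk -v, as in the sketch below. The function name print_column is hypothetical.

```shell
#!/bin/bash
# print_column (hypothetical helper): print_column N [FILE...]
# Prints whitespace-separated field N of every input line.
# With no FILE, reads standard input.
print_column() {
    local n=$1
    shift
    awk -v n="$n" '{ print $n }' "$@"
}
```

Usage: ls -l | print_column 1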
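Printing text between line numbers or patterns
awk, grep, and sed can all print a subset of lines based on a condition. The simplest approach is to use grep to print the lines that match a pattern, but awk is the most versatile of the three.
Print the text from line M through line N: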
awk 'NR==M, NR==N' filename
$ awk 'NR==1, NR==3' coco.sh
#!/bin/bash
$ cat coco.sh |awk 'NR==1, NR==3'
#!/bin/bash
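M and N in the template above are placeholders. A minimal sketch that takes them as arguments, assuming they are positive integers, passes them to awk with -v and stops reading once line N has been printed. The function name print_lines is hypothetical.

```shell
#!/bin/bash
# print_lines (hypothetical helper): print_lines M N [FILE...]
# Prints lines M through N (inclusive) and exits as soon as line N is passed,
# so large inputs are not read to the end.
print_lines() {
    local first=$1 last=$2
    shift 2
    awk -v m="$first" -v n="$last" 'NR > n { exit } NR >= m' "$@"
}
```

Usage: print_lines 1 3 coco.sh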
Print the text between start_pattern and end_pattern:
$ cat section.txt
line with pattern1
line with pattern2
line with pattern3
line end with pattern4
line with pattern5
$ awk '/pa.*3/, /end/' section.txt
line with pattern3
line end with pattern4
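The range form also works when the two patterns come from shell variables. The sketch below passes them with awk -v and uses dynamic regex matching. The function name between_patterns is hypothetical. One assumption to keep in mind: awk -v interprets backslash escapes, so a pattern containing a backslash has to double it.

```shell
#!/bin/bash
# between_patterns (hypothetical helper): between_patterns START END [FILE...]
# Prints every line from the first one matching START through the next one
# matching END. If END never matches, printing continues to the end of input.
between_patterns() {
    local start=$1 end=$2
    shift 2
    awk -v s="$start" -v e="$end" '$0 ~ s, $0 ~ e' "$@"
}
```

Usage: between_patterns 'pa.*3' 'end' section.txt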
Printing lines in reverse order
The simplest way is the tac command, though awk can do it as well.
$ seq 5 |tac
5
4
3
2
1
$ echo "1,2" | tac -s,
2
1
The awk implementation looks like this:
$ seq 5| awk '{ lifo[NR]=$0 }END { for(lno=NR;lno>0;lno--){ print lifo[lno]; }}'
5
4
3
2
1
Parsing email addresses and URLs from text
A regular expression that matches email addresses:
[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}
$ cat url_email.txt
this is a line of text contains,<email> #slynux@slynux.com.
</email> and email address, blog "http://www.google.com",
test@yahoo.com dfdfdfdddfdf;cool.hacks@gmail.com<br />
<a href="http://code.google.com"><h1>Heading</h1>
$ grep -E -o '[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}' url_email.txt
slynux@slynux.com
test@yahoo.com
cool.hacks@gmail.com
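As a rough sketch, the same pattern can be wrapped in a function that also removes duplicate addresses. The name extract_emails is hypothetical. The pattern is deliberately simple: it accepts only top-level domains of 2 to 4 letters and does not allow characters such as + or - in the local part, so it will miss some valid addresses.

```shell
#!/bin/bash
# extract_emails (hypothetical helper): extract_emails [FILE...]
# Prints each distinct address matched by the simple pattern, sorted.
extract_emails() {
    grep -E -o '[A-Za-z0-9._]+@[A-Za-z0-9.]+\.[a-zA-Z]{2,4}' "$@" | sort -u
}
```

Usage: extract_emails url_email.txt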
A regular expression that matches HTTP URLs:
http://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}
$ grep -E -o "http://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}" url_email.txt
http://www.google.com
http://code.google.com
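A small sketch of a reusable version follows. Two things in it are assumptions beyond the text: the s in the scheme is made optional so that https links are also picked up, and the hyphen is placed last in the bracket expression so that it is taken literally. The function name extract_urls is hypothetical. Like the pattern above, it matches only the scheme and host, not the path.

```shell
#!/bin/bash
# extract_urls (hypothetical helper): extract_urls [FILE...]
# Prints the scheme and host of each http or https URL, in order of appearance.
extract_urls() {
    grep -E -o 'https?://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}' "$@"
}
```

Usage: extract_urls url_email.txt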
Removing sentences that contain a specific word from a file
$ cat sentence.txt
Linux refers to the family of Unix-like computer operating systems that use the Linux kernel. Linux can be installed on a wide variety of computer hardware, ranging from mobile phones, tablet computers and video game consoles, to mainframes and supercomputers. Linux is predominantly known for its use in servers.
$ cat sentence.txt | sed 's/ [^.]*mobile phones[^.]*\.//g'
Linux refers to the family of Unix-like computer operating systems that use the Linux kernel. Linux is predominantly known for its use in servers.
This does not handle sentences that wrap across lines.
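The sed expression can be parameterized on the phrase, as in the sketch below. The function name remove_sentences_with is hypothetical, and it assumes the phrase contains no characters that are special to sed (such as / or .). Because the pattern starts with a space, a sentence at the very beginning of a line is never removed.

```shell
#!/bin/bash
# remove_sentences_with (hypothetical helper): remove_sentences_with PHRASE [FILE...]
# Deletes each sentence (text up to the next period, preceded by a space)
# that contains PHRASE. PHRASE is inserted into the sed pattern unescaped.
remove_sentences_with() {
    local phrase=$1
    shift
    sed "s/ [^.]*${phrase}[^.]*\.//g" "$@"
}
```

Usage: remove_sentences_with 'mobile phones' sentence.txt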
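Replacing text in all files in a directory
Find the .sh files under the current directory, replace bin with binn, and print the changed lines (the files themselves are not modified):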
$ find . -name "*.sh" -print0 | xargs -I {} -0 sed -n 's/bin/binn/p' {}
#!/binn/bash
#!/binn/bash
/USSR/binn/expect <<-OF &>/Devi/null
#!/binn/bash
#!/binn/bash
# !/binn/bash
#!/binn/bash
#!/binn/bash
#!/binn/bash
# !/binn/bash
# !/binn/bash
# Alternatively, use -exec to pass the file names
$ find . -name "*.sh" -exec sed -n 's/bin/binn/p' {} \;
#!/binn/bash
#!/binn/bash
/USSR/binn/expect <<-OF &>/Devi/null
#!/binn/bash
#!/binn/bash
# !/binn/bash
#!/binn/bash
#!/binn/bash
#!/binn/bash
# !/binn/bash
# !/binn/bash
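Both commands above only print the substituted lines. To actually rewrite the files, sed needs its in-place option. The sketch below uses -i.bak, a form accepted by both GNU and BSD sed, which keeps a backup copy of every file it touches. The function name replace_in_tree is hypothetical, and it assumes the old and new strings contain no / or other characters special to sed.

```shell
#!/bin/bash
# replace_in_tree (hypothetical helper): replace_in_tree DIR GLOB OLD NEW
# Replaces every occurrence of OLD with NEW in all regular files under DIR
# whose names match GLOB. Each modified file keeps a backup named FILE.bak.
replace_in_tree() {
    local dir=$1 glob=$2 old=$3 new=$4
    find "$dir" -type f -name "$glob" -exec sed -i.bak "s/${old}/${new}/g" {} +
}
```

Usage: replace_in_tree . '*.sh' bin binn
Quoting the glob keeps the shell from expanding it before find sees it.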
Text slicing and parameter operations
Replacing text
$ var="This is a line of text"
$ echo ${var/line/REPLACED}
This is a REPLACED of text
The word line has been replaced with REPLACED.
Slicing strings
$ string=abcdefghijklmnopqrstuvwxyz
# Print everything from the 5th character onward (offset 4)
$ echo ${string:4}
efghijklmnopqrstuvwxyz
# Print 8 characters starting at the 5th character (offset 4, length 8)
$ echo ${string:4:8}
efghijkl
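A few related bash parameter expansions are worth trying alongside the ones above. The sketch below is illustrative only: the variable names on the left are made up for the example, and the expansions require bash rather than a plain POSIX sh.

```shell
#!/bin/bash
var="This is a line of text"
string=abcdefghijklmnopqrstuvwxyz

first_only=${var/i/I}      # single slash: replace the first match only
all_matches=${var//i/I}    # double slash: replace every match
length=${#string}          # number of characters in the string
last_three=${string: -3}   # negative offset counts from the end; the space before - is required
middle=${string:4:8}       # offset 4, length 8

printf '%s\n' "$first_only" "$all_matches" "$length" "$last_three" "$middle"
```

The second number in ${string:4:8} is a length, not an end position, which is why the result has eight characters.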