Shell (3): sort and uniq

I. Purpose

Basic usage and examples of sort and uniq.

II. sort examples

Source text

[root@hadoop shellexercise]# cat orders.csv 
3,c,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
1,B,88.25,20-May-2008
3,A,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009

1. -r: sort in reverse order.

[root@hadoop shellexercise]# sort apache_access.log    
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
[root@hadoop shellexercise]# sort -r apache_access.log 
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
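As a minimal sketch of what -r changes, the snippet below sorts three made-up lines (hypothetical data, not taken from apache_access.log) in ascending and then in reverse order. Setting LC_ALL=C is an assumption added here so that the comparison is plain byte order and the result does not depend on the system locale.

```shell
# Hypothetical sample lines; LC_ALL=C forces byte-order comparison.
lines=$(printf '%s\n' '10.0.0.42 4' '10.0.0.41 7' '10.0.0.47 5')

ascending=$(printf '%s\n' "$lines" | LC_ALL=C sort)
descending=$(printf '%s\n' "$lines" | LC_ALL=C sort -r)

printf 'ascending:\n%s\n' "$ascending"
printf 'descending:\n%s\n' "$descending"
```

The second result is simply the first one turned upside down, because without a key the whole line is compared.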


2. -t: specify the field separator

(1) By default, fields are separated by blanks (spaces or tabs)

[root@hadoop shellexercise]# sort  apache_access.log            
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
[root@hadoop shellexercise]# sort -t " " -k2n apache_access.log 
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
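The same idea on a smaller scale: a sketch that sorts hypothetical "ip count" lines by the second field as a number. It writes the key as -k2,2n, which limits the key to field 2 only; -k2n as used above lets the key run to the end of the line, which gives the same order here because n reads just the leading number. LC_ALL=C is an assumption made for reproducibility.

```shell
# Hypothetical "ip count" lines separated by a single space.
lines=$(printf '%s\n' '10.0.0.41 12' '10.0.0.41 7' '10.0.0.46 1' '10.0.0.43 43')

# -t ' '  : split fields on a space
# -k2,2n  : use only field 2 as the key and compare it numerically
by_count=$(printf '%s\n' "$lines" | LC_ALL=C sort -t ' ' -k2,2n)

printf '%s\n' "$by_count"
```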


(2) Using a comma "," as the separator, sort by the first column in reverse numeric order

[root@hadoop shellexercise]# sort orders.csv 
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
[root@hadoop shellexercise]# sort -t "," -k1nr  orders.csv     
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008

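3. -n: sort by the numeric value of the string

(1) Sort by the numeric value of the first column in ascending order (here the result is the same without n, because every value in the first column is a single digit)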

[root@hadoop shellexercise]# sort -t "," -k1n    orders.csv   
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
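To see when n actually matters, here is a small sketch with hypothetical ids of different lengths. Compared as text, "10" sorts before "2" because the comparison goes character by character; compared as numbers, it sorts last. LC_ALL=C is an assumption used to keep the text comparison predictable.

```shell
# Hypothetical rows whose first column has one or two digits.
ids=$(printf '%s\n' '10,x' '9,y' '2,z')

# Text comparison of column 1: "10" < "2" < "9".
as_text=$(printf '%s\n' "$ids" | LC_ALL=C sort -t ',' -k1,1)

# Numeric comparison of column 1: 2 < 9 < 10.
as_number=$(printf '%s\n' "$ids" | LC_ALL=C sort -t ',' -k1,1n)

printf 'text:\n%s\nnumeric:\n%s\n' "$as_text" "$as_number"
```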

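(2) Sort by the first column in reverse order (the command uses -k1r without n, so the key, which runs from the first column to the end of the line, is compared as text; that is why the rows starting with 3 also come out in reverse order)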

[root@hadoop shellexercise]# sort -t "," -k1r    orders.csv  
3,D,25.02,22-Jan-2009
3,c,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,A,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
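The difference between -k1r and a key limited to one column can be sketched as follows, using a shortened, hypothetical version of the order rows. With -k1r the key is the rest of the line from column 1, so tied rows are reversed as well; with -k1,1nr only column 1 is reversed, and tied rows fall back to an ordinary ascending comparison of the whole line. LC_ALL=C is an assumption for reproducible ordering.

```shell
# Hypothetical shortened order rows.
rows=$(printf '%s\n' '3,A,12.95' '1,B,88.25' '3,D,25.02' '2,C,32.00')

# Key = column 1 through end of line, compared as text, reversed.
whole_line_rev=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k1r)

# Key = column 1 only, compared numerically, reversed.
first_col_rev=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k1,1nr)

printf 'k1r:\n%s\nk1,1nr:\n%s\n' "$whole_line_rev" "$first_col_rev"
```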

4. -k: specify which column to sort by

[root@hadoop shellexercise]# sort orders.csv 
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009

(1) First column in descending order, then second column in ascending order

[root@hadoop shellexercise]# sort -t "," -k1nr -k2    orders.csv 
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008

(2) First column in descending order, then third column in descending order

[root@hadoop shellexercise]# sort -t "," -k1nr -k3nr    orders.csv   
3,D,25.02,22-Jan-2009
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
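A sketch of a two-key sort with explicit field bounds, on hypothetical shortened rows: the first key is column 1 descending, and rows that tie on it are ordered by column 3 descending. Writing each key as N,N is a choice made here to keep every key inside its own column; LC_ALL=C is likewise an assumption.

```shell
# Hypothetical shortened order rows: id,code,price
orders=$(printf '%s\n' '3,A,12.95' '1,B,88.25' '3,D,25.02' '2,C,32.00' '3,B,12.95')

# Primary key: column 1, numeric, descending.
# Secondary key: column 3, numeric, descending (used only when column 1 ties).
sorted=$(printf '%s\n' "$orders" | LC_ALL=C sort -t ',' -k1,1nr -k3,3nr)

printf '%s\n' "$sorted"
```

Rows that tie on both keys (the two 12.95 rows) are ordered by a final comparison of the whole line.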

(3) Sorting from a given character position within a column (plain sort output shown first for comparison)

[root@hadoop shellexercise]# sort orders.csv 
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009

(4) Reverse sort on the third column starting at its second character; the second characters come out in the order 8-5-2-2-2-2

[root@hadoop shellexercise]# sort -t "," -k3.2r    orders.csv          
1,B,88.25,20-May-2008
3,D,25.02,22-Jan-2009
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
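Note that -k3.2r starts the key at the second character of column 3 and continues to the end of the line. The sketch below, on hypothetical rows, contrasts that with -k3.2,3.2r, which restricts the key to exactly one character. LC_ALL=C is an assumption for reproducible ordering.

```shell
# Hypothetical rows: name,code,price
rows=$(printf '%s\n' 'c,x,12.00' 'd,x,32.95' 'b,x,25.02' 'a,x,88.25')

# Key runs from character 2 of column 3 to the end of the line:
# "8.25" > "5.02" > "2.95" > "2.00"
from_char=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k3.2r)

# Key is only character 2 of column 3: 8 > 5 > 2 = 2.
# The two rows with "2" tie and are then ordered by the whole line.
one_char=$(printf '%s\n' "$rows" | LC_ALL=C sort -t ',' -k3.2,3.2r)

printf 'k3.2r:\n%s\nk3.2,3.2r:\n%s\n' "$from_char" "$one_char"
```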

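5. Summary of other common options

-b Ignore leading blank characters at the start of each line.
-c Check whether the file is already sorted.
-d When sorting, consider only letters, digits, and blanks; ignore all other characters.
-f When sorting, treat lowercase letters as uppercase.
-i When sorting, ignore all characters outside the ASCII range 040 to 176 (octal).
-m Merge several files that are already sorted.
-M Sort by month, comparing the first three letters as month abbreviations.
-n Sort by numeric value.
-u Unique: duplicates are removed from the output.
-o<output file> Write the sorted result to the specified file.
-r Sort in reverse order.
-t<separator> Specify the field separator used for sorting.
+<start field>-<end field> Sort on the given fields, from the start field up to the field before the end field (obsolete syntax; -k replaces it).
--help Show help.
--version Show version information.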
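A few of these options combined in one sketch: -c to test whether a file is sorted, -m with -u and -o to merge two sorted files without duplicates into an output file, and -M to order month abbreviations. The file names and contents are hypothetical, and the temporary directory from mktemp -d is just a convenient place to put them.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'apple' 'cherry' > "$tmpdir/a.txt"    # already sorted
printf '%s\n' 'banana' 'cherry' > "$tmpdir/b.txt"   # already sorted

# -c prints nothing and exits with status 0 when the input is in order.
if LC_ALL=C sort -c "$tmpdir/a.txt"; then a_sorted=yes; else a_sorted=no; fi

# -m merges sorted inputs, -u keeps one copy of "cherry", -o names the output file.
LC_ALL=C sort -m -u -o "$tmpdir/merged.txt" "$tmpdir/a.txt" "$tmpdir/b.txt"
merged=$(cat "$tmpdir/merged.txt")

# -M compares the first three letters as month abbreviations.
months=$(printf '%s\n' 'Mar' 'Jan' 'Feb' | LC_ALL=C sort -M)

printf 'a.txt sorted: %s\n%s\n%s\n' "$a_sorted" "$merged" "$months"
rm -rf "$tmpdir"
```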

III. uniq examples

Test data

[root@hadoop shellexercise]# cat testlog.txt 
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.9

1. Without any options, uniq only removes duplicate lines that are adjacent.

[root@hadoop shellexercise]# cat testlog.txt 
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.9
[root@hadoop shellexercise]# uniq testlog.txt 
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.8
10.0.0.9
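The adjacency rule can be checked with a short sketch that feeds the same seven addresses to uniq without sorting and counts what is left. The variable names are illustrative only.

```shell
# Same addresses as testlog.txt, in the same unsorted order.
ips=$(printf '%s\n' 10.0.0.9 10.0.0.8 10.0.0.7 10.0.0.7 10.0.0.8 10.0.0.8 10.0.0.9)

# Only runs of identical neighbouring lines are collapsed.
collapsed=$(printf '%s\n' "$ips" | uniq)
line_count=$(printf '%s\n' "$collapsed" | wc -l | tr -d ' ')

printf '%s\nlines left: %s\n' "$collapsed" "$line_count"
```

Five lines remain although there are only three distinct addresses, because 10.0.0.8 and 10.0.0.9 each occur in two separate places.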

2. Full deduplication: sort first so that duplicates become adjacent, then run uniq

[root@hadoop shellexercise]# sort testlog.txt | uniq
10.0.0.7
10.0.0.8
10.0.0.9

Extension: deduplicating directly with sort
-u, --unique              with -c, check for strict ordering;
                              without -c, output only the first of an equal run

[root@hadoop shellexercise]# sort -u testlog.txt 
10.0.0.7
10.0.0.8
10.0.0.9
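A quick sketch to confirm that the two forms agree on this data: it runs both on the same addresses and compares the results as strings. LC_ALL=C is an assumption added for a reproducible order.

```shell
ips=$(printf '%s\n' 10.0.0.9 10.0.0.8 10.0.0.7 10.0.0.7 10.0.0.8 10.0.0.8 10.0.0.9)

via_pipe=$(printf '%s\n' "$ips" | LC_ALL=C sort | uniq)
via_flag=$(printf '%s\n' "$ips" | LC_ALL=C sort -u)

if [ "$via_pipe" = "$via_flag" ]; then
    echo "sort | uniq and sort -u give the same result"
fi
```

The pipe form is still needed when uniq options such as -c or -d are wanted, since sort -u cannot count.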

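3. -c: deduplicate and count [in effect: how many times each line occurs]

-c, --count           prefix lines by the number of occurrences

Two steps: (1) sort, (2) deduplicate with a count. Explanation: -c only counts duplicates that are adjacent, so the input must be sorted first to bring identical lines together.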

[root@hadoop shellexercise]# sort testlog.txt 
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.8
10.0.0.9
10.0.0.9
[root@hadoop shellexercise]# sort testlog.txt|uniq -c
      2 10.0.0.7
      3 10.0.0.8
      2 10.0.0.9
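Since the count is printed with leading padding, scripts often reshape it before further use. The sketch below does that with awk; the "ip=count" output format is an arbitrary choice for illustration, not something uniq produces.

```shell
ips=$(printf '%s\n' 10.0.0.9 10.0.0.8 10.0.0.7 10.0.0.7 10.0.0.8 10.0.0.8 10.0.0.9)

# uniq -c prints "<count> <line>"; awk swaps the two and drops the padding.
counts=$(printf '%s\n' "$ips" | LC_ALL=C sort | uniq -c | awk '{print $2 "=" $1}')

printf '%s\n' "$counts"
```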

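4. -d: find duplicated lines

Two steps in total: (1) sort; (2) find duplicates with uniq. Explanation: uniq -d still only detects duplicate lines that are adjacent.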

[root@bigdata datas]# cat orders.csv 
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
[root@bigdata datas]# sort orders.csv |uniq -d
3,D,25.02,22-Jan-2009
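A sketch of -d next to its counterpart -u, which is not covered above: -d prints one copy of each line that occurs more than once, while -u prints only the lines that occur exactly once. The rows are hypothetical shortened versions of the order data.

```shell
# Hypothetical shortened rows; one row appears twice.
rows=$(printf '%s\n' '3,D,25.02' '1,B,88.25' '3,D,25.02' '2,C,32.00')

repeated=$(printf '%s\n' "$rows" | LC_ALL=C sort | uniq -d)
singles=$(printf '%s\n' "$rows" | LC_ALL=C sort | uniq -u)

printf 'repeated:\n%s\nsingles:\n%s\n' "$repeated" "$singles"
```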

5. Example: extract the IPs and sort them by number of accesses

[root@hadoop shellexercise]# sort testlog.txt 
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.8
10.0.0.9
10.0.0.9
[root@hadoop shellexercise]# sort testlog.txt|uniq -c|sort -r
      3 10.0.0.8
      2 10.0.0.9
      2 10.0.0.7
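When the input is a full access log rather than a list of bare addresses, the IP has to be cut out first. The sketch below assumes a simplified, hypothetical log format with the IP as the first whitespace-separated field, and sorts the counts with -k1,1nr so that they are compared as numbers regardless of how uniq pads them.

```shell
# Hypothetical, simplified access-log lines; field 1 is the client IP.
log=$(printf '%s\n' \
    '10.0.0.41 - - "HEAD /checkstatus.jsp HTTP/1.0" 200' \
    '10.0.0.42 - - "HEAD /checkstatus.jsp HTTP/1.0" 200' \
    '10.0.0.41 - - "HEAD /checkstatus.jsp HTTP/1.0" 200' \
    '10.0.0.43 - - "HEAD /checkstatus.jsp HTTP/1.0" 200' \
    '10.0.0.41 - - "HEAD /checkstatus.jsp HTTP/1.0" 200' \
    '10.0.0.42 - - "HEAD /checkstatus.jsp HTTP/1.0" 200')

# 1. take the IP column, 2. group identical IPs, 3. count them,
# 4. order by count (numeric, descending), 5. tidy the padding.
ranking=$(printf '%s\n' "$log" \
    | awk '{print $1}' \
    | LC_ALL=C sort \
    | uniq -c \
    | LC_ALL=C sort -k1,1nr \
    | awk '{print $1, $2}')

printf '%s\n' "$ranking"
```

Appending `| head -n 3` would limit the list to the three busiest addresses.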


IV. References

1. https://www.cnblogs.com/wangkongming/p/5033676.html
