I. Purpose
Basic usage of the sort and uniq commands, with examples.
II. sort examples
Original data:
[root@hadoop shellexercise]# cat orders.csv
3,c,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
1,B,88.25,20-May-2008
3,A,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
3,D,25.02,22-Jan-2009
1. -r: sort in reverse order
[root@hadoop shellexercise]# sort apache_access.log
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
[root@hadoop shellexercise]# sort -r apache_access.log
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
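A minimal sketch with hypothetical three-line input: -r reverses whatever order the other options would produce, here the default ascending order.

```shell
# Default order is ascending; -r simply reverses it.
printf 'b\na\nc\n' | sort      # -> a, b, c
printf 'b\na\nc\n' | sort -r   # -> c, b, a
```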
2. -t: specify the field separator
(1) By default, fields are separated by whitespace
[root@hadoop shellexercise]# sort apache_access.log
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
[root@hadoop shellexercise]# sort -t " " -k2n apache_access.log
10.0.0.46 1 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 4 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 5 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 7 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.47 8 - - [03/Dec/2010:23:27:02 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.46 12 - - [03/Dec/2010:23:27:03 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.41 22 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.42 23 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
10.0.0.43 43 - - [03/Dec/2010:23:27:01 +0800] "HEAD /checkstatus.jsp HTTP/1.0" 200 -
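To see why the n in -k2n matters, here is a sketch with two hypothetical lines: without n the second field is compared as text, so "12" sorts before "7".

```shell
# Text comparison: '1' < '7', so "12" comes first.
printf '10.0.0.41 12\n10.0.0.41 7\n' | sort -t ' ' -k2
# Numeric comparison: 7 < 12.
printf '10.0.0.41 12\n10.0.0.41 7\n' | sort -t ' ' -k2n
```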
(2) With "," (comma) as the separator, sort by the first column in reverse
[root@hadoop shellexercise]# sort orders.csv
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
[root@hadoop shellexercise]# sort -t "," -k1nr orders.csv
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
3. -n: sort by numeric value
(1) Sort ascending by the numeric value of the first column (the result here happens to be the same without n, but only because the keys are single digits; text order and numeric order diverge once keys reach two digits)
[root@hadoop shellexercise]# sort -t "," -k1n orders.csv
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
(2) Sort descending by the first column (note that -k1r compares as text; it matches the numeric order here only because the keys are single digits)
[root@hadoop shellexercise]# sort -t "," -k1r orders.csv
3,D,25.02,22-Jan-2009
3,c,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,A,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
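The "works without n" observation above breaks down for multi-digit keys. A sketch with hypothetical input:

```shell
# Text order: "10" < "2" < "9" (compared character by character).
printf '10,x\n9,y\n2,z\n' | sort -t ',' -k1    # -> 10,x  2,z  9,y
# Numeric order: 2 < 9 < 10.
printf '10,x\n9,y\n2,z\n' | sort -t ',' -k1n   # -> 2,z  9,y  10,x
```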
4. -k: specify which column (field) to sort by
[root@hadoop shellexercise]# sort orders.csv
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
(1) Column 1 descending (numeric), column 2 ascending (lowercase c sorts between B and D here because the locale's collation order ignores case)
[root@hadoop shellexercise]# sort -t "," -k1nr -k2 orders.csv
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
(2) Column 1 descending, column 3 descending (both numeric)
[root@hadoop shellexercise]# sort -t "," -k1nr -k3nr orders.csv
3,D,25.02,22-Jan-2009
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
1,B,88.25,20-May-2008
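One subtlety worth knowing when combining keys: -k2 means "from field 2 to the end of the line", not "field 2 alone"; to restrict a key to a single field, write -k2,2. A sketch with hypothetical input where the difference is visible:

```shell
# -k2 includes field 3 in the key ("b,2" vs "b,1"), so -k3nr never runs.
printf 'a,b,2\na,b,1\n' | sort -t ',' -k2 -k3nr     # -> a,b,1 first
# -k2,2 ties on "b", so the secondary key -k3nr decides.
printf 'a,b,2\na,b,1\n' | sort -t ',' -k2,2 -k3nr   # -> a,b,2 first
```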
(3) Sorting by a character offset within a column (for reference, the plain sort output first)
[root@hadoop shellexercise]# sort orders.csv
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
(4) Sort in reverse by the second character of the third column (the key characters in the output below, top to bottom, are 8-5-2-2-2-2)
[root@hadoop shellexercise]# sort -t "," -k3.2r orders.csv
1,B,88.25,20-May-2008
3,D,25.02,22-Jan-2009
3,A,12.95,02-Jun-2008
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
2,C,32.00,30-Nov-2007
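The -kF.C syntax starts the key at character C of field F. The same mechanism ascending, sketched on hypothetical input:

```shell
# -k3.2 starts the key at the 2nd character of field 3 ('9', '5', '3').
printf 'a,a,19\nb,b,25\nc,c,13\n' | sort -t ',' -k3.2   # -> c,c,13  b,b,25  a,a,19
```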
5. Summary of other common options
-b ignore leading blanks on each line.
-c check whether the file is already sorted.
-d consider only blanks, letters, and digits when comparing (dictionary order); ignore other characters.
-f fold lowercase letters to uppercase when comparing.
-i ignore characters outside the printable ASCII range (octal 040 to 176).
-m merge files that are already sorted.
-M sort the first three letters as month abbreviations (JAN < FEB < ... < DEC).
-n sort by numeric value.
-u unique: only the first of each run of equal lines is output.
-o<output file> write the sorted result to the specified file.
-r sort in reverse order.
-t<separator> specify the field separator used when sorting.
+<start field>-<end field> sort on the range from the start field up to (but not including) the end field (obsolete syntax; -k is preferred).
--help display help.
--version display version information.
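Two of the options above deserve a quick sketch (using a hypothetical temp file): -c reports disorder through a non-zero exit status, and -o writes the result to a file, which is safe even when the output file is the input file itself (unlike shell redirection, which would truncate the input first).

```shell
f=$(mktemp)
printf '2\n1\n' > "$f"
sort -c "$f" 2>/dev/null || echo "not sorted"   # prints "not sorted"
sort -n -o "$f" "$f"                            # sort the file in place
cat "$f"                                        # -> 1, 2
rm -f "$f"
```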
III. uniq examples
Test data:
[root@hadoop shellexercise]# cat testlog.txt
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.9
1. With no options, uniq only removes adjacent duplicate lines!
[root@hadoop shellexercise]# cat testlog.txt
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.9
[root@hadoop shellexercise]# uniq testlog.txt
10.0.0.9
10.0.0.8
10.0.0.7
10.0.0.8
10.0.0.9
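The adjacency rule in miniature, on hypothetical input: the two runs of "a" are separated by "b", so both survive.

```shell
# uniq collapses each run of identical adjacent lines to one line.
printf 'a\na\nb\na\n' | uniq   # -> a, b, a
```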
2. Full deduplication: sort first so that uniq sees all duplicates as adjacent
[root@hadoop shellexercise]# sort testlog.txt | uniq
10.0.0.7
10.0.0.8
10.0.0.9
Extension: deduplicate directly with sort
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
[root@hadoop shellexercise]# sort -u testlog.txt
10.0.0.7
10.0.0.8
10.0.0.9
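A quick sketch confirming the equivalence on hypothetical input: sort -u should produce exactly what sort | uniq produces, with one fewer process.

```shell
printf 'b\na\nb\n' | sort | uniq   # -> a, b
printf 'b\na\nb\n' | sort -u       # -> a, b
```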
3. -c: count occurrences of each line (equivalently: find how often lines repeat)
-c, --count prefix lines by the number of occurrences
Two steps: (1) sort, (2) uniq -c. Explanation: -c only counts adjacent duplicates, so the input must be sorted first to bring duplicate lines together!
[root@hadoop shellexercise]# sort testlog.txt
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.8
10.0.0.9
10.0.0.9
[root@hadoop shellexercise]# sort testlog.txt|uniq -c
2 10.0.0.7
3 10.0.0.8
2 10.0.0.9
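Why the sort step matters for counting, sketched on hypothetical input: without it, the same value is counted once per run rather than once overall.

```shell
# Unsorted: "a" appears in two separate runs and gets two counts of 1.
printf 'a\nb\na\n' | uniq -c          # -> 1 a, 1 b, 1 a
# Sorted first: both "a" lines are adjacent and counted together.
printf 'a\nb\na\n' | sort | uniq -c   # -> 2 a, 1 b
```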
4. -d: print duplicated lines
Again two steps: (1) sort; (2) uniq -d. As before, uniq only recognizes adjacent duplicates.
[root@bigdata datas]# cat orders.csv
1,B,88.25,20-May-2008
2,C,32.00,30-Nov-2007
3,A,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
3,B,12.95,02-Jun-2008
3,c,12.95,02-Jun-2008
3,D,25.02,22-Jan-2009
[root@bigdata datas]# sort orders.csv |uniq -d
3,D,25.02,22-Jan-2009
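A companion sketch on hypothetical input: -d prints one copy of each line that is duplicated, while -u does the opposite and prints only the lines that are not duplicated.

```shell
printf 'a\na\nb\n' | uniq -d   # -> a
printf 'a\na\nb\n' | uniq -u   # -> b
```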
5. Case study: extract the IPs and sort them by number of accesses
[root@hadoop shellexercise]# sort testlog.txt
10.0.0.7
10.0.0.7
10.0.0.8
10.0.0.8
10.0.0.8
10.0.0.9
10.0.0.9
[root@hadoop shellexercise]# sort testlog.txt|uniq -c|sort -r
3 10.0.0.8
2 10.0.0.9
2 10.0.0.7
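A sketch of the full "top IPs" pipeline on a tiny hypothetical log whose first whitespace-separated field is the client IP (as in apache_access.log earlier). Note that the plain sort -r above compares the count lines as text, which is fine while counts stay single-digit; sort -rn compares the counts numerically and keeps working once they reach two digits.

```shell
# Extract field 1 (the IP), count occurrences, sort by count descending.
printf '10.0.0.8 a\n10.0.0.7 b\n10.0.0.8 c\n' |
  awk '{print $1}' | sort | uniq -c | sort -rn | head
```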