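On Linux, sort and awk are often used together. This post does not go into how to use awk or sort on their own; it only covers using the two commands in combination.
Examples:
1) Take the following example, from the awk write-up linked below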
key1|url1|192.80.80.1
key1|url1|192.80.80.1
key1|url2|192.80.80.2
key1|url1|192.80.80.2
key2|url1|192.80.80.1
key2|url2|192.80.80.2
key2|url1|192.80.80.2
The goal: for each keyword and URL pair, count the total number of accesses and the number of distinct IPs, and write the result to a file
awk -F"|" '{a[$1" "$2]++;b[$1" "$2" "$3]++}(b[$1" "$2" "$3]==1){++c[$1" "$2]}END{ for (i in a) print i,c[i],a[i]}' file
For an explanation of why this works, see http://blog.csdn.net/u011517841/article/details/53390810
Here we only look at combining it with sort
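For readers who find the one-liner dense, here is a minimal sketch of the same counting logic spread over several lines. The array and variable names (hits, seen, uniq_ips, pair) and the temporary file are illustrative choices, not part of the original command; the trailing plain sort is there only because awk's for-in order is unspecified.

```shell
# Sample data from the example above, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
key1|url1|192.80.80.1
key1|url1|192.80.80.1
key1|url2|192.80.80.2
key1|url1|192.80.80.2
key2|url1|192.80.80.1
key2|url2|192.80.80.2
key2|url1|192.80.80.2
EOF

result=$(awk -F'|' '
{
    pair = $1 " " $2            # keyword + URL
    hits[pair]++                # total accesses for this pair
    if (!seen[pair " " $3]++)   # first time this IP appears for this pair
        uniq_ips[pair]++
}
END {
    # Output: keyword, URL, distinct IPs, total accesses
    for (p in hits) print p, uniq_ips[p], hits[p]
}' "$tmp" | sort)
echo "$result"
rm -f "$tmp"
```

Each output line has four space-separated fields, which is why the sort step later refers to field 4 for the total access count.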
Now suppose we want the keyword and URL pairs listed by access count, from largest to smallest. Use
awk -F"|" '{a[$1" "$2]++;b[$1" "$2" "$3]++}(b[$1" "$2" "$3]==1){++c[$1" "$2]}END{ for (i in a) print i,c[i],a[i]}' file | sort -k 4,4nr
Back when I was less familiar with these tools, I actually tried
awk -F"|" '{a[$1" "$2]++;b[$1" "$2" "$3]++}(b[$1" "$2" "$3]==1){++c[$1" "$2]}END{ for (i in a) print i,c[i],a[i]}' file | sort $4
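and then realized I had mixed up awk and sort. awk refers to fields as $1, $2 and so on, while sort takes the form -k 1,1. (On a sort command line, $4 is expanded by the shell as a positional parameter before sort ever sees it.)
sort -k 4,4nr: 4,4 means the sort key starts and ends at the fourth field, n compares the key numerically, and r reverses the order so the result is descending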
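To see what the n modifier changes, the sketch below sorts the same four-field lines twice. The data is made up for illustration (the hit count of 10 does not occur in the example above); it only mimics the layout of the awk output.

```shell
# Four-field lines: keyword, URL, distinct IPs, total hits (illustrative values).
data='key1 url2 1 1
key2 url1 2 2
key1 url1 2 3
key2 url2 1 10'

# Numeric, descending, on field 4 only: 10, 3, 2, 1.
by_hits=$(printf '%s\n' "$data" | sort -k 4,4nr)
echo "$by_hits"

# Without n the key is compared as text, so 10 lands below 3 and 2.
as_text=$(printf '%s\n' "$data" | sort -k 4,4r)
echo "$as_text"
```

Writing the key as 4,4 rather than just 4 keeps the comparison from running on to the end of the line, which matters once the sorted field is not the last one.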
2) A question that comes up often in interviews
Find the IP that has visited a site the most times
2016-12-05 11:00:24.379 [qtp18029089-112] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.76 - - [05/Dec/2016:11:00:24 +0800] "GET /atfcapi/suiteCase/descriptionList?caseId=1106 HTTP/1.0" 403 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.324 [qtp18029089-28] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.78 - - [05/Dec/2016:11:00:26 +0800] "GET /atfcapi/sendRequest/getVariables?projectId=12&suiteIds=1106&offset=0&limit=20 HTTP/1.0" 405 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.359 [qtp18029089-27] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.76 - - [05/Dec/2016:11:00:26 +0800] "GET /atfcapi/database/getAllAlias HTTP/1.0" 405 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.396 [qtp18029089-25] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.75 - - [05/Dec/2016:11:00:26 +0800] "POST /atfcapi/project/mockDetail HTTP/1.0" 200 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.473 [qtp18029089-24] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.75 - - [05/Dec/2016:11:00:26 +0800] "GET /atfcapi/sendRequest/get/1106/0 HTTP/1.0" 500 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.520 [qtp18029089-25] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.76 - - [05/Dec/2016:11:00:26 +0800] "GET /atfcapi/commonConfig/getAllEvn HTTP/1.0" 200 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:26.520 [qtp18029089-112] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.80 - - [05/Dec/2016:11:00:26 +0800] "POST /atfcapi/suiteCase/getAll HTTP/1.0" 200 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
2016-12-05 11:00:33.492 [qtp18029089-27] INFO org.eclipse.jetty.server.RequestLog - 192.168.112.75 - - [05/Dec/2016:11:00:31 +0800] "POST /atfcapi/sendRequest/executeSendRequest HTTP/1.0" 200 - "https://atfcapi.alpha.elenet.me/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36"
The IP is the 7th field. First split the lines with awk and count the occurrences
awk '{a[$7]++} END{for (i in a) {print i,a[i]}}' log.txt
The result looks like this:
192.168.112.75 3
192.168.112.76 3
192.168.112.78 1
192.168.112.80 1
Then sort it
awk '{a[$7]++} END{for (i in a) {print i,a[i]}}' log.txt | sort -k 2,2nr
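For the interview question itself, only the first line of the sorted output is needed. The sketch below assumes a stand-in log: the lines are made up and keep only the layout (IP in field 7) and the IP order of the sample above. The extra -k 1,1 key is an addition that makes the order of tied IPs explicit.

```shell
# Stand-in log: made-up lines that preserve only the field layout
# (IP in field 7) and the order of IPs in the sample above.
log=$(mktemp)
for last in 76 78 76 75 75 76 80 75; do
    printf '2016-12-05 11:00:26.000 [qtp-0] INFO RequestLog - 192.168.112.%s - -\n' "$last"
done > "$log"

# Count per IP, sort by count (descending), break ties by IP text.
ranked=$(awk '{a[$7]++} END{for (i in a) print i, a[i]}' "$log" | sort -k 2,2nr -k 1,1)
echo "$ranked"

# The busiest IP is the first line of the ranked list.
top=$(printf '%s\n' "$ranked" | head -n 1)
echo "$top"
rm -f "$log"
```

In this sample two IPs tie at 3 requests, so head -n 1 reports only one of them; print more lines if ties matter.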
What if we want a header line at the top?
awk 'BEGIN{ print "Ip num"} {a[$7]++} END{for (i in a) {print i,a[i]}}' log.txt | sort -k 2,2nr
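This does not work: the header is sorted along with the data. Its second field, "num", counts as 0 in a numeric comparison, so the header sinks to the bottom. The result: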
192.168.112.75 3
192.168.112.76 3
192.168.112.78 1
192.168.112.80 1
Ip num
So the only option is to pass the sorted result through awk once more to print it
awk '{a[$7]++} END{for (i in a) {print i,a[i]}}' log.txt | sort -k 2,2nr | awk 'BEGIN{print "IP num"} {print $1, $2}'
IP num
192.168.112.75 3
192.168.112.76 3
192.168.112.78 1
192.168.112.80 1
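A second awk is not the only way to get this output. As a possible alternative, the header can be printed by the shell before sort runs, so that sort only ever sees the data lines. The sketch below starts from the counts shown earlier rather than from log.txt, and adds -k 1,1 to make the order of tied IPs explicit.

```shell
# Counts as produced by the awk step (IP, number of requests), in arbitrary order.
counts='192.168.112.78 1
192.168.112.75 3
192.168.112.80 1
192.168.112.76 3'

# Print the header first, then sort only the data lines.
report=$( { echo "IP num"; printf '%s\n' "$counts" | sort -k 2,2nr -k 1,1; } )
echo "$report"
```

The braces group the two commands so their output forms one stream, which can then be redirected to a file as a whole.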
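3) I have a pile of data in the format below, roughly 30,000 records
I want to find the records whose second field is duplicated. Among the duplicates, compare the fourth field and keep the record with the smaller value, printing its first field. If the fourth fields are equal, compare the third field and again keep the smaller one. If those are equal too, print the first record. How can this be done with a script or with awk?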
11 elex337_u000014 9 0
12 elex337_Golden214 14 0
14 elex337_u000017 9 0
15 elex337_u000019 11 0
16 elex337_u000020 9 0
17 elex337_Lokio 9 0
18 elex337_u000022 19 0
19 elex337_u000023 11 0
20 elex337_u000024 14 0
21 elex337_swordas15 9 0
22 elex337_Koann 19 0
23 elex337_Vylex 26 0
24 elex337_u000028 19 0
25 elex337_u000014 1 0
26 elex337_Golden214 35 1
27 elex337_u000016 0 0
28 elex337_u000017 22 0
29 elex337_u000019 10 0
30 elex337_u000020 11 0
31 elex337_Lokio 9 0
32 elex337_u000022 9 0
33 elex337_u000023 32 0
34 elex337_u000024 9 0
35 elex337_swordas15 22 0
36 elex337_Koann 11 0
37 elex337_Vylex 22 0
39 elex337_u000042 11 0
40 elex337_u000043 10 0
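This calls for awk's deduplication idiom. For background on it, see http://blog.csdn.net/u011517841/article/details/53406883
Approach:
Sort the file by the second field, then the fourth, then the third, and finally the first. Records that share a second field then sit next to each other with the preferred one first, so only the first record of each group needs to be printed
Sort first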
sort -k2,2 -k4,4n -k3,3n -k1,1n infile
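Then deduplicate on the second field. To print only the first field, as the question asks, append {print $1} to the awk pattern
sort -k2,2 -k4,4n -k3,3n -k1,1n infile | awk '!a[$2]++'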
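To check the approach on a small scale, the sketch below runs it on seven records taken from the data above and prints only the first field of each winner. The temporary file, the seen array name, and the LC_ALL=C setting (used here only to make the group order reproducible) are assumptions for this example.

```shell
# A small subset of the sample records.
infile=$(mktemp)
cat > "$infile" <<'EOF'
11 elex337_u000014 9 0
12 elex337_Golden214 14 0
17 elex337_Lokio 9 0
25 elex337_u000014 1 0
26 elex337_Golden214 35 1
27 elex337_u000016 0 0
31 elex337_Lokio 9 0
EOF

# Sort by field 2, then field 4, field 3 and field 1 numerically;
# awk keeps the first record of each field-2 group and prints its field 1.
winners=$(LC_ALL=C sort -k2,2 -k4,4n -k3,3n -k1,1n "$infile" | awk '!seen[$2]++ {print $1}')
echo "$winners"
rm -f "$infile"
```

For elex337_u000014 the fourth fields tie at 0, so the smaller third field (1) decides and record 25 wins; for elex337_Lokio everything ties and the smaller first field, 17, is kept.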
More to come...