Given a file named visited_url.db with the following contents:
http://www.csdn.net
http://www.csdn.net/blog
http://www.csdn.net/blog/201312
http://www.csdn.net/info
http://www.csdn.net/app
http://www.csdn.net/info
http://www.csdn.net
http://www.csdn.net/app/2131
http://www.youku.com/news/1
http://www.youku.com/news/2
http://www.youku.com/news/3
http://www.youku.com/news
http://www.youku.com/news
http://www.csdn.net
1. Remove duplicate URLs so that each appears only once, and print the result to standard output.
The command:
john@john-IdeaPad:~/c_workspace$ cat visited_url.db | sort | uniq
http://www.csdn.net
http://www.csdn.net/app
http://www.csdn.net/app/2131
http://www.csdn.net/blog
http://www.csdn.net/blog/201312
http://www.csdn.net/info
http://www.youku.com/news
http://www.youku.com/news/1
http://www.youku.com/news/2
http://www.youku.com/news/3
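The same result can be produced in one step: `sort -u` sorts and deduplicates in a single process, and `sort` can read the file directly, so the `cat` is unnecessary. A minimal sketch that recreates the sample file in /tmp (a hypothetical path chosen for the example):

```shell
# Recreate the sample visited_url.db from the listing above
printf '%s\n' \
  http://www.csdn.net \
  http://www.csdn.net/blog \
  http://www.csdn.net/blog/201312 \
  http://www.csdn.net/info \
  http://www.csdn.net/app \
  http://www.csdn.net/info \
  http://www.csdn.net \
  http://www.csdn.net/app/2131 \
  http://www.youku.com/news/1 \
  http://www.youku.com/news/2 \
  http://www.youku.com/news/3 \
  http://www.youku.com/news \
  http://www.youku.com/news \
  http://www.csdn.net > /tmp/visited_url.db

# sort -u sorts and removes duplicates in a single pass,
# printing the 10 unique URLs
sort -u /tmp/visited_url.db
```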
2. Remove duplicate URLs, count how many times each one appears, and print the result to standard output.
Use uniq's -c (--count) option:
john@john-IdeaPad:~/c_workspace$ cat visited_url.db | sort | uniq -c
3 http://www.csdn.net
1 http://www.csdn.net/app
1 http://www.csdn.net/app/2131
1 http://www.csdn.net/blog
1 http://www.csdn.net/blog/201312
2 http://www.csdn.net/info
2 http://www.youku.com/news
1 http://www.youku.com/news/1
1 http://www.youku.com/news/2
1 http://www.youku.com/news/3
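A common follow-up to counting is ranking: piping the counted output through a second, numeric, reverse sort puts the most frequent URLs first. A sketch using the same sample data (recreated in /tmp for self-containment):

```shell
# Recreate the sample file
printf '%s\n' \
  http://www.csdn.net \
  http://www.csdn.net/blog \
  http://www.csdn.net/blog/201312 \
  http://www.csdn.net/info \
  http://www.csdn.net/app \
  http://www.csdn.net/info \
  http://www.csdn.net \
  http://www.csdn.net/app/2131 \
  http://www.youku.com/news/1 \
  http://www.youku.com/news/2 \
  http://www.youku.com/news/3 \
  http://www.youku.com/news \
  http://www.youku.com/news \
  http://www.csdn.net > /tmp/visited_url.db

# Count each URL, then sort numerically (-n) in reverse (-r)
# so the most visited URL comes first
sort /tmp/visited_url.db | uniq -c | sort -rn
```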
3. Show only the URLs that are not duplicated, printing them to standard output.
Use uniq -u:
john@john-IdeaPad:~/c_workspace$ cat visited_url.db | sort | uniq -u
http://www.csdn.net/app
http://www.csdn.net/app/2131
http://www.csdn.net/blog
http://www.csdn.net/blog/201312
http://www.youku.com/news/1
http://www.youku.com/news/2
http://www.youku.com/news/3
4. Show only the duplicated URLs, each printed once, on standard output.
Use uniq -d:
john@john-IdeaPad:~/c_workspace$ cat visited_url.db | sort | uniq -d
http://www.csdn.net
http://www.csdn.net/info
http://www.youku.com/news
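Note that `uniq -d` alone prints each duplicated line once but without counts. With GNU uniq, -d can be combined with -c to show the counts of only the duplicated lines. A sketch on the same data (file recreated in /tmp):

```shell
# Recreate the sample file
printf '%s\n' \
  http://www.csdn.net \
  http://www.csdn.net/blog \
  http://www.csdn.net/blog/201312 \
  http://www.csdn.net/info \
  http://www.csdn.net/app \
  http://www.csdn.net/info \
  http://www.csdn.net \
  http://www.csdn.net/app/2131 \
  http://www.youku.com/news/1 \
  http://www.youku.com/news/2 \
  http://www.youku.com/news/3 \
  http://www.youku.com/news \
  http://www.youku.com/news \
  http://www.csdn.net > /tmp/visited_url.db

# -d keeps only duplicated lines; -c prefixes each with its count
sort /tmp/visited_url.db | uniq -d -c
```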
5. Extract the domain portion of each URL, remove duplicates, and count occurrences.
Combine awk, sort, and uniq:
john@john-IdeaPad:~/c_workspace$ cat visited_url.db | awk -F'[/:]' '{print $4}' | sort | uniq -c
9 www.csdn.net
5 www.youku.com
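The same domain tally can also be done in a single awk pass with an associative array, skipping the sort | uniq stage; sorting is then needed only if you want ordered output. A sketch using the same field-splitting trick on '/' and ':' (file recreated in /tmp for the example):

```shell
# Recreate the sample file
printf '%s\n' \
  http://www.csdn.net \
  http://www.csdn.net/blog \
  http://www.csdn.net/blog/201312 \
  http://www.csdn.net/info \
  http://www.csdn.net/app \
  http://www.csdn.net/info \
  http://www.csdn.net \
  http://www.csdn.net/app/2131 \
  http://www.youku.com/news/1 \
  http://www.youku.com/news/2 \
  http://www.youku.com/news/3 \
  http://www.youku.com/news \
  http://www.youku.com/news \
  http://www.csdn.net > /tmp/visited_url.db

# Splitting on '/' and ':' makes field 4 the host part
# (field 1 is "http", fields 2 and 3 are empty).
# Tally per domain in an associative array, print counts at the end,
# then sort numerically in descending order.
awk -F'[/:]' '{count[$4]++} END {for (d in count) print count[d], d}' \
  /tmp/visited_url.db | sort -rn
```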