The data format is as follows:
20111230000005 57375476989eea12893c0c3811607bcf 奇艺高清 1 1 http://www.qiyi.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 3 1 http://www.booksky.org/BookDetail.aspx?BookID=1050804&Level=1
20111230000007 b97920521c78de70ac38e3713f524b50 118图库 1 1 http://www.bblianmeng.com/
20111230000008 6961d0c97fe93701fc9c0d861d096cd9 华南师范大学图书馆 1 1 http://lib.scnu.edu.cn/
20111230000008 f2f5a21c764aebde1e8afcc2871e086f 满江红 2 1 http://proxyie.cn/
20111230000009 96994a0480e7e1edcaef67b20d8816b7 1 1 http://movie.douban.com/review/1128960/
20111230000009 698956eb07815439fe5f46e9a4503997 youku 1 1 http://www.youku.com/
20111230000009 599cd26984f72ee68b2b6ebefccf6aed 安徽合肥365房产网 1 1 http://hf.house365.com/
20111230000010 f577230df7b6c532837cd16ab731f874 奇艺高清 1 1 http://www.kz321.com/
20111230000010 285f88780dd0659f5fc8acc7cc4949f2 www.sogou.cn 1 1 http://www.iqshuma.com/
20111230000010 57375476989eea12893c0c3811607bcf 推荐待机时间长的手机 1 1 http://mobile.zol.com.cn/148/1487938.html
20111230000010 3d1acc7235374d531de1ca885df5e711 满江红 1 1 http://baike.baidu.com/view/6500.htm
20111230000010 dbce4101683913365648eba6a85b6273 奇艺高清 1 1 http://zhidao.baidu.com/question/38626533
20111230000011 58e7d0caec23bcb4daa7bbcc4d37f008 张国立的电视剧 2 1 http://tv.sogou.com/vertical/2xc3t6wbuk24jnphzlj35zy.html?p=40230600
20111230000011 a3b83dc38b2bbc35660dffcab4ed9da8 www.baidu.com 1 1 http://www.7183.info/
20111230000011 b89952902d7821db37e8999776b32427 满江红 1 1 http://wenwen.soso.com/z/q131927207.htm
20111230000011 7c54c43f3a8a0af0951c26d94a57d6c8 百度一下 你就知道 1 1 http://www.baidu.com/
20111230000005 66c5bb7774e31d0a22278249b26bc83a 凡人修仙传 5 1 http://www.dy241.com/
20111230000011 11097724dae8b9fdcc60bd6fa4ce4df2 118图库 2 1 http://118123.net/
20111230000012 1d374b57fbbc81aa0cc38e6f4efb88ec www.qiyi.com 1 1 http://tui.qihoo.com/28302631/article_2893190.html
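Requirements:
Access time (fdate)
User ID (UID)
Search query (topic)
Rank of the URL in the returned search results (page_num)
Sequence number of the user's click on that page (click_num)
URL the user clicked (url)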
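The six fields above map directly onto a Pig schema. A minimal sketch, assuming the file is tab-delimited (the PigStorage default) and named sogou_20.txt as in the load statement used later in these notes:

```pig
-- Load the sample log; field names and types follow the load statement used in these notes.
so = load 'sogou_20.txt' using PigStorage('\t')
     as (fdate:chararray, uid:chararray, topic:chararray,
         page_num:long, click_num:int, url:chararray);
-- Print the schema to confirm that each field has the intended type.
describe so;
-- Peek at a few rows; an empty query field is loaded as null.
few = limit so 5;
dump few;
```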
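Four main categories of requirements
Filtering valid data
1) Number of non-empty queries (records whose query text is not empty)
2) Number of non-empty, non-duplicate records (duplicates share the same time, UID, and query text)
Valid-data statistics
1) Total number of records
2) Number of distinct UIDs
UID analysis
1) Distribution of query counts per UID (group by UID, then count())
2) Average number of queries per UID (total records / distinct UIDs)
User behavior analysis
Share of queries where the user typed a URL directly as the query (URL patterns listed below; computed as (count(www.*.com) + count(www.*cn)) / total records)
1) www.*.com
2) www.*cn
Individual user behavior analysis (filter all records for a given UID)
1) Analysis of a single user's query data
Data presentation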
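The two user-behavior requirements in the list above could be sketched in Pig roughly as follows. This is one possible reading, not a definitive implementation: it assumes "total records" means the non-empty records, the two regular expressions are only an approximation of the www.*.com and www.*cn patterns, and the UID in the single-user filter is simply one taken from the sample data.

```pig
so = load 'sogou_20.txt' as (fdate:chararray, uid:chararray, topic:chararray,
                             page_num:long, click_num:int, url:chararray);
so_notnull = filter so by topic is not null;

-- Share of queries that look like a URL.
-- Assumption: these regexes approximate the www.*.com and www.*cn patterns.
flagged = foreach so_notnull generate
    ((topic matches 'www[.].*[.]com' or topic matches 'www[.].*cn') ? 1 : 0) as is_url;
flagged_all = group flagged all;
url_ratio = foreach flagged_all generate
    'url_ratio', (double)SUM(flagged.is_url) / COUNT(flagged);
dump url_ratio;

-- All records for a single user (UID taken from the sample data), in time order.
one_user = filter so by uid == '57375476989eea12893c0c3811607bcf';
one_user_sorted = order one_user by fdate;
dump one_user_sorted;
```

In the 20-record sample, three of the 19 non-empty queries have this form (www.sogou.cn, www.baidu.com, www.qiyi.com), so the ratio should work out to 3/19.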
--------Filtering valid data: statistics
so = load 'sogou_20.txt' as (fdate:chararray,uid:chararray,topic:chararray,page_num:long,click_num:int,url:chararray);
Count of records whose topic is null:
so_null = filter so by topic is null;
null_grp = group so_null all;
count_null = foreach null_grp generate 'null_count', COUNT(so_null) as count_num;
dump count_null;
(null_count,1)
Count of records whose topic is not null:
so_notnull = filter so by topic is not null;
notnull_grp = group so_notnull all;
count_notnull = foreach notnull_grp generate 'notnull_count', COUNT(so_notnull) as count_num;
dump count_notnull;
(notnull_count,19)
Number of non-empty, non-duplicate records (duplicates share the same time, UID, and query text):
Method 1:
notnull_distinct = group so_notnull by (fdate,uid,topic);
notnull_distinct_grp = group notnull_distinct all;
notnull_distinct_count = foreach notnull_distinct_grp generate group,COUNT(notnull_distinct);
dump notnull_distinct_count;
(all,18)
Method 2:
f1 = foreach so_notnull generate fdate,uid,topic;
d1 = distinct f1;
g1 = group d1 all;
d1_count = foreach g1 generate group,COUNT(d1);
dump d1_count;
(all,18)
----------Valid-data statistics
1. Total number of records = number of non-empty, non-duplicate records
2. Number of distinct UIDs
g2 = group f1 by uid;
g3 = group g2 all;
f2 = foreach g3 generate group,COUNT(g2);
dump f2;
(all,17)
--------UID analysis
1. Distribution of query counts per UID (group by UID, then count())
so_notnull = filter so by topic is not null;
not_null_grp = group so_notnull by uid;
uid_topic_count = foreach not_null_grp {
f3 = foreach so_notnull generate fdate,uid,topic;
d3 = distinct f3;
generate group,COUNT(d3);
};
dump uid_topic_count;
Result: 17 rows, one per UID; the per-UID counts sum to 18 records.
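The per-UID counts can also be condensed into a histogram showing how many UIDs issued a given number of queries. A minimal sketch, assuming the same sogou_20.txt file and the same deduplication rule (time, UID, query text); the relation names here are illustrative only:

```pig
so = load 'sogou_20.txt' as (fdate:chararray, uid:chararray, topic:chararray,
                             page_num:long, click_num:int, url:chararray);
so_notnull = filter so by topic is not null;
-- Deduplicate on (fdate, uid, topic) before counting.
f1 = foreach so_notnull generate fdate, uid, topic;
d1 = distinct f1;
-- Number of distinct queries per UID.
by_uid = group d1 by uid;
per_uid = foreach by_uid generate group as uid, COUNT(d1) as n;
-- Histogram: how many UIDs issued n queries.
by_n = group per_uid by n;
dist = foreach by_n generate group as queries, COUNT(per_uid) as uids;
dump dist;
```

In the sample, one UID (57375476989eea12893c0c3811607bcf) has two distinct queries and the other 16 have one each, so the histogram should contain two rows.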
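2. Average number of queries per UID (total records / distinct UIDs)
== number of non-empty, non-duplicate records / number of distinct UIDs
g4 = group f1 by uid;
g5 = group g4 all;
-- 18.0 is the non-empty, non-duplicate record count from above; COUNT(g4) is the number of distinct UIDs (17).
f4 = foreach g5 generate 'topic_avg_count',18.0/COUNT(g4);
dump f4;
(topic_avg_count,1.0588235294117647)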
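The record count does not have to be typed in as a constant. The following sketch derives both the numerator and the denominator from the data, assuming the same sogou_20.txt file and deduplication on (time, UID, query text); the relation names are illustrative only:

```pig
so = load 'sogou_20.txt' as (fdate:chararray, uid:chararray, topic:chararray,
                             page_num:long, click_num:int, url:chararray);
so_notnull = filter so by topic is not null;
-- Non-empty, non-duplicate records.
f1 = foreach so_notnull generate fdate, uid, topic;
d1 = distinct f1;
-- Number of distinct queries per UID.
by_uid = group d1 by uid;
per_uid = foreach by_uid generate group as uid, COUNT(d1) as n;
-- Average = total deduplicated records / number of distinct UIDs.
all_uid = group per_uid all;
avg_q = foreach all_uid generate
    'topic_avg_count', (double)SUM(per_uid.n) / COUNT(per_uid);
dump avg_q;
```

With 18 deduplicated records and 17 distinct UIDs, as counted earlier, the result should be 18/17, slightly above 1.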
--------User behavior analysis