001|2014-09-10 00-09|TKH
001|2014-09-10 09-17|TKH
003|2014-09-10 00-09|TKH
002|2014-09-10 00-09|TKH
002|2014-09-10 09-17|BEIJING ROAD
003|2014-09-10 09-17|TMALL
004|2014-09-10 00-09|TKH
001|2014-09-10 17-24|BEIJING ROAD
004|2014-09-10 00-09|TKH
001|2014-10-10 00-09|XINHUA BOOKSHOP
003|2014-10-10 00-09|TKH
004|2014-09-10 00-09|TKH
001|2014-10-10 09-17|TMALL
The lines above are the data: column 1 is the imsi (roughly a user ID), column 2 is the time window, and column 3 is the place where the user stayed.
The exercise: from this data, compute the probability that a user moves from one place to another. A transition only counts if the gap between the two records is no more than 12 hours; otherwise the pair is discarded. Finally, output the top-3 destinations by probability.
The code below walks through the solution; the main goal is to get familiar with Pig syntax.
Load the data. The path '/user/hdfs/week4pig/DATA.txt' is on HDFS, not the local filesystem.
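Before the Pig version, the whole computation can be sketched in plain Python on a subset of the sample data (hypothetical helper code; all names here are mine, not part of the Pig script): group records per imsi, pair consecutive visits, drop pairs more than 12 hours apart, then compute transition probabilities per (time, loc).

```python
from collections import defaultdict
from datetime import datetime

rows = [
    ("001", "2014-09-10 00", "TKH"),
    ("001", "2014-09-10 09", "TKH"),
    ("001", "2014-09-10 17", "BEIJING ROAD"),
    ("002", "2014-09-10 00", "TKH"),
    ("002", "2014-09-10 09", "BEIJING ROAD"),
]

def hours_between(a, b):
    fmt = "%Y-%m-%d %H"
    return (datetime.strptime(a, fmt) - datetime.strptime(b, fmt)).total_seconds() / 3600

by_user = defaultdict(list)
for imsi, time, loc in rows:
    by_user[imsi].append((time, loc))

total = defaultdict(int)   # transitions out of (time, loc)
count = defaultdict(int)   # transitions (time, loc) -> next_loc
for visits in by_user.values():
    visits.sort()  # "YYYY-MM-dd HH" sorts chronologically as a string
    for (t, loc), (nt, nloc) in zip(visits, visits[1:]):
        if hours_between(nt, t) <= 12:  # discard gaps over 12 hours
            total[(t, loc)] += 1
            count[(t, loc, nloc)] += 1

prob = {k: c / total[(k[0], k[1])] for k, c in count.items()}
```

On these five rows, user 001 leaves two valid transitions and user 002 one, so from ("2014-09-10 00", "TKH") the next place is TKH or BEIJING ROAD with probability 0.5 each.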
data = load '/user/hdfs/week4pig/DATA.txt' using PigStorage('|') as (imsi:chararray, time:chararray, loc:chararray);
Register the jars. Relative paths work here because both jars sit in the directory I launched pig from on Linux.
register piggybank.jar;
register joda-time-1.6.2.jar;
Give the UDF a short alias.
define CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
toISO = foreach data generate imsi, CustomFormatToISO(SUBSTRING(time,0,13),'YYYY-MM-dd HH') as time:chararray,loc;
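SUBSTRING(time,0,13) keeps only the start of the window (e.g. "2014-09-10 00"), which CustomFormatToISO then converts to an ISO-8601 timestamp. A rough Python equivalent (the piggybank UDF's actual output also carries milliseconds and a timezone, e.g. 2014-09-10T00:00:00.000Z):

```python
from datetime import datetime

# Parse the custom pattern 'YYYY-MM-dd HH' and re-emit ISO-8601.
iso = datetime.strptime("2014-09-10 00", "%Y-%m-%d %H").isoformat()
print(iso)  # 2014-09-10T00:00:00
```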
grp = group toISO by imsi;
register datafu-1.2.0.jar;
define MarkovPairs datafu.pig.stats.MarkovPairs();
pairs = foreach grp generate FLATTEN(MarkovPairs(toISO)) as (data:tuple(imsi:chararray, time:chararray, loc:chararray), next:tuple(imsi:chararray, time:chararray, loc:chararray));
prj = foreach pairs generate data.imsi as imsi, data.time as time, next.time as next_time, data.loc as loc, next.loc as next_loc;
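MarkovPairs emits each consecutive pair of tuples from a bag; in Python terms it is essentially zipping a list with itself shifted by one (an illustrative sketch, not the DataFu implementation):

```python
# One user's visits, in order.
visits = [("2014-09-10T00", "TKH"),
          ("2014-09-10T09", "TKH"),
          ("2014-09-10T17", "BEIJING ROAD")]

# Each pair is (current, next), matching the data/next fields projected above.
pairs = list(zip(visits, visits[1:]))
```

Three visits yield two pairs, and the second element of one pair is the first element of the next.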
define ISOHoursBetween org.apache.pig.piggybank.evaluation.datetime.diff.ISOHoursBetween();
Filter out pairs whose time gap exceeds 12 hours.
flt = filter prj by ISOHoursBetween(next_time, time) <= 12L;
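An illustrative Python stand-in for ISOHoursBetween (a hypothetical helper, not the piggybank UDF): the whole number of hours between two ISO-style timestamps.

```python
from datetime import datetime

def hours_between(later, earlier, fmt="%Y-%m-%dT%H:%M:%S"):
    # Whole hours from `earlier` to `later`.
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return int(delta.total_seconds() // 3600)

gap = hours_between("2014-09-10T17:00:00", "2014-09-10T09:00:00")
print(gap)  # 8 -> kept, since it does not exceed 12 hours
```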
total_count = foreach (group flt by (time,loc)) generate FLATTEN(group) as (time, loc), COUNT(flt) as total;
pairs_count = foreach (group flt by (time, loc,next_loc)) generate FLATTEN(group) as (time, loc, next_loc), COUNT(flt) as cnt;
Pig's join. 'replicated' requests a fragment-replicate (map-side) join: the last relation listed (total_count) is loaded into memory, so it must be small enough to fit.
jnd = JOIN pairs_count BY (time,loc), total_count BY (time, loc) USING 'replicated';
prob = foreach jnd generate pairs_count::time as time, pairs_count::loc as loc, pairs_count::next_loc as next_loc, (double) cnt / (double) total as probability;
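The join simply lines up each (time, loc, next_loc) count with the matching (time, loc) total, and the probability is their ratio. In dict form, with illustrative numbers:

```python
total_count = {("2014-09-10 00", "TKH"): 5}
pairs_count = {("2014-09-10 00", "TKH", "TKH"): 3,
               ("2014-09-10 00", "TKH", "TMALL"): 1,
               ("2014-09-10 00", "TKH", "BEIJING ROAD"): 1}

# cnt / total, keyed on the shared (time, loc) prefix.
prob = {k: c / total_count[(k[0], k[1])] for k, c in pairs_count.items()}
```

The probabilities out of each (time, loc) group sum to 1 by construction.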
Take the top 3 by probability.
top3 = foreach (group prob by (time,loc)) {
    sorted_prob = order prob by probability desc;
    top = limit sorted_prob 3;
    generate flatten(top);
};
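Picking the top 3 per group is equivalent to a sort-and-truncate; in Python, heapq.nlargest does the same thing (toy probabilities for illustration):

```python
import heapq

rows = [("TKH", 0.6), ("BEIJING ROAD", 0.2),
        ("TMALL", 0.2), ("XINHUA BOOKSHOP", 0.0)]

# Keep the 3 rows with the highest probability.
top3 = heapq.nlargest(3, rows, key=lambda r: r[1])
```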
Store the result back to HDFS.
store top3 into '/user/hdfs/week4pig/result.txt';