Pig Practice

001|2014-09-10 00-09|TKH
001|2014-09-10 09-17|TKH
003|2014-09-10 00-09|TKH
002|2014-09-10 00-09|TKH
002|2014-09-10 09-17|BEIJING ROAD
003|2014-09-10 09-17|TMALL
004|2014-09-10 00-09|TKH
001|2014-09-10 17-24|BEIJING ROAD
004|2014-09-10 00-09|TKH
001|2014-10-10 00-09|XINHUA BOOKSHOP
003|2014-10-10 00-09|TKH
004|2014-09-10 00-09|TKH
001|2014-10-10 09-17|TMALL

 

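The data is shown above. The first column is the IMSI, which works like a user ID; the second column is the time slot; the third column is the place where the user stayed.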

 

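This exercise computes the probability of a user moving from one place to another. The interval between the two stays must not exceed 12 hours; otherwise the transition is discarded and excluded from the statistics. Finally, we find the top three destinations by probability.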

 

The code follows; its main purpose is to get familiar with Pig syntax.

 

Load the data. Note that '/user/hdfs/week4pig/DATA.txt' is a path on HDFS, not on the local filesystem.

data = load '/user/hdfs/week4pig/DATA.txt' using PigStorage('|') as (imsi:chararray, time:chararray, loc:chararray);

 

Register the jars. Relative paths are used here because both jars were in the directory from which I started Pig on Linux.

register piggybank.jar;
register joda-time-1.6.2.jar;

 

Define a short alias for the function
define CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();

 

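Convert the time column to ISO format, keeping only the date and the start hour of the time slot (the first 13 characters)

toISO = foreach data generate imsi, CustomFormatToISO(SUBSTRING(time,0,13),'YYYY-MM-dd HH') as time:chararray,loc;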
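To make the conversion concrete, here is a small Python sketch (not Pig) of what this step does to one `time` value: keep the first 13 characters, that is the date plus the slot's start hour, and parse them. The helper name `slot_start` is hypothetical, and the ISO string that `CustomFormatToISO` actually emits may be formatted differently from Python's.

```python
from datetime import datetime

def slot_start(time_field: str) -> datetime:
    """Parse the date and start hour of a slot such as '2014-09-10 09-17'."""
    # time_field[:13] mirrors SUBSTRING(time, 0, 13), e.g. '2014-09-10 09'
    return datetime.strptime(time_field[:13], "%Y-%m-%d %H")

print(slot_start("2014-09-10 09-17").isoformat())
```

The end hour of the slot is dropped, so all later interval checks compare slot start times only.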

 

Group the records by user

grp = group toISO by imsi;

 

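Register DataFu and define an alias for MarkovPairs, which turns each user's ordered bag of records into pairs of consecutive records

register datafu-1.2.0.jar;
define MarkovPairs datafu.pig.stats.MarkovPairs();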

pairs = foreach grp
  {
   sorted = order toISO by imsi, time;
   pair = MarkovPairs(sorted);
   generate FLATTEN(pair) as (data:tuple(imsi,time,loc), next:tuple(imsi,time,loc));

  };
  
prj = foreach pairs generate data.imsi as imsi, data.time as time, next.time as next_time, data.loc as loc, next.loc as next_loc;
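The pairing step can be hard to picture, so here is a minimal Python sketch of the idea as I understand it: for each user, sort the visits by time and pair every visit with the one right after it. The function name `markov_pairs` and the shortened time strings are assumptions made for illustration; this is not the DataFu implementation.

```python
from collections import defaultdict

def markov_pairs(records):
    """records: iterable of (imsi, time, loc); returns (current, next) pairs per user."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[0]].append(rec)          # same role as: group toISO by imsi
    pairs = []
    for imsi in sorted(by_user):
        visits = sorted(by_user[imsi], key=lambda r: r[1])  # order by time
        pairs.extend(zip(visits, visits[1:]))                # consecutive visits
    return pairs

# Records of users 001 and 002 on 2014-09-10, deliberately out of order
records = [
    ("001", "2014-09-10T17", "BEIJING ROAD"),
    ("001", "2014-09-10T00", "TKH"),
    ("002", "2014-09-10T09", "BEIJING ROAD"),
    ("001", "2014-09-10T09", "TKH"),
    ("002", "2014-09-10T00", "TKH"),
]
for cur, nxt in markov_pairs(records):
    print(cur[0], cur[2], "->", nxt[2])
```

A user with a single record produces no pair at all, which is why such users never contribute to the probabilities.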

 

define ISOHoursBetween org.apache.pig.piggybank.evaluation.datetime.diff.ISOHoursBetween();

 

Filter out the transitions whose interval exceeds 12 hours

flt = filter prj by ISOHoursBetween(next_time, time) <= 12L;
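As a sanity check of the 12-hour rule, the following Python sketch applies it to the consecutive slot start times of user 001. The names `hours_between`, `keep`, and `MAX_GAP_HOURS` are hypothetical, the limit is treated as inclusive to follow the exercise statement ("must not exceed 12 hours"), and the rounding of partial hours is an assumption that does not matter here because all times fall on whole hours.

```python
from datetime import datetime

MAX_GAP_HOURS = 12  # limit taken from the exercise statement

def hours_between(end_iso: str, start_iso: str) -> int:
    """Whole hours from start to end, with the later time as the first argument."""
    delta = datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)
    return int(delta.total_seconds() // 3600)

def keep(time: str, next_time: str) -> bool:
    """True when the transition happened within the allowed gap."""
    return hours_between(next_time, time) <= MAX_GAP_HOURS

# (time, next_time) for user 001's consecutive records
transitions = [
    ("2014-09-10T00:00:00", "2014-09-10T09:00:00"),  # 9 hours apart
    ("2014-09-10T09:00:00", "2014-09-10T17:00:00"),  # 8 hours apart
    ("2014-09-10T17:00:00", "2014-10-10T00:00:00"),  # a month apart, dropped
]
kept = [t for t in transitions if keep(*t)]
print(len(kept), "of", len(transitions), "transitions kept")
```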

 

Count all transitions leaving each (time, loc), and the transitions for each (time, loc, next_loc)

total_count = foreach (group flt by (time,loc)) generate FLATTEN(group) as (time, loc), COUNT(flt) as total;

pairs_count = foreach (group flt by (time, loc,next_loc)) generate FLATTEN(group) as (time, loc, next_loc), COUNT(flt) as cnt;

 

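Join in Pig. The 'replicated' option asks for a map-side join that loads the last relation (total_count here) into memory, so that relation must be small.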

jnd = JOIN pairs_count BY (time,loc), total_count BY (time, loc) USING 'replicated';

prob = foreach jnd generate pairs_count::time as time, pairs_count::loc as loc, pairs_count::next_loc as next_loc, (double) cnt / (double) total as probability;
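The two counts and the join boil down to one division per (time, loc, next_loc). A Python sketch of that arithmetic, with a hypothetical function name and made-up sample transitions (they are not results from the dataset above), might look like this:

```python
from collections import Counter

def transition_probabilities(transitions):
    """transitions: iterable of (time, loc, next_loc) tuples.

    Returns {(time, loc, next_loc): cnt / total}, where total counts all
    transitions that start from the same (time, loc).
    """
    transitions = list(transitions)
    total = Counter((time, loc) for time, loc, _ in transitions)  # like total_count
    cnt = Counter(transitions)                                    # like pairs_count
    return {key: n / total[key[:2]] for key, n in cnt.items()}

# Illustrative input only
sample = [
    ("T00", "TKH", "TKH"),
    ("T00", "TKH", "TKH"),
    ("T00", "TKH", "TMALL"),
    ("T09", "TKH", "BEIJING ROAD"),
]
for key, p in sorted(transition_probabilities(sample).items()):
    print(key, round(p, 3))
```

For every (time, loc) the probabilities of its destinations sum to 1, which is a quick way to check the output of the Pig script as well.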

 

Find the top three destinations by probability

top3 = foreach (group prob by (time,loc))
  {
   sorted = order prob by probability DESC;
   top = LIMIT sorted 3;
   generate FLATTEN(top);
  };
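The nested foreach above is a "top N per group" pattern: group, sort inside each group, keep the first three rows. A rough Python equivalent, using a hypothetical helper `top_n_per_group` and invented probabilities purely for illustration:

```python
from collections import defaultdict

def top_n_per_group(probs, n=3):
    """probs: {(time, loc, next_loc): probability}.

    Returns {(time, loc): [(next_loc, probability), ...]} with at most n
    destinations per group, highest probability first.
    """
    groups = defaultdict(list)
    for (time, loc, next_loc), p in probs.items():
        groups[(time, loc)].append((next_loc, p))   # group prob by (time, loc)
    return {
        key: sorted(rows, key=lambda r: r[1], reverse=True)[:n]  # order DESC, limit n
        for key, rows in groups.items()
    }

# Invented numbers, not output of the Pig script
probs = {
    ("T00", "TKH", "A"): 0.1,
    ("T00", "TKH", "B"): 0.4,
    ("T00", "TKH", "C"): 0.2,
    ("T00", "TKH", "D"): 0.3,
    ("T09", "TKH", "A"): 1.0,
}
print(top_n_per_group(probs))
```

Groups with fewer than three destinations simply return all of them.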

Store the result in HDFS (Pig writes the output as a directory of part files at this path)
store top3 into '/user/hdfs/week4pig/result.txt';
  
