1. First, start the JobTracker on the Hadoop cluster.
2. Start Hive in remote-service mode:
nohup hive --service hiveserver &
3. User relationship table: user_relation
Fields: uid1, uid2
Sample data:
1 2
2 1
2 5
5 2
4. Run the analysis of first-degree friends for all users:
select a.uid1,a.uid2 from user_relation a join user_relation b on (a.uid2=b.uid1 and a.uid1=b.uid2)
Total data volume: 198,340,072
Hive executes a single table join as one job.
The job report is as follows:
User: hdfs
Job Name: select e.user,e.fans,f.secfans f...f.secfans(Stage-1)
Job File: hdfs://X.X.X.X:9000/home/hdfs/tmp/mapred/staging/hdfs/.staging/job_201110132010_0001/job.xml
Submit Host: XXX
Submit Host Address: X.X.X.X
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Started at: Thu Oct 13 20:20:07 CST 2011
Finished at: Thu Oct 13 21:53:31 CST 2011
Finished in: 1hrs, 33mins, 24sec
Job Cleanup: Successful
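To make the join condition in step 4 concrete, here is a minimal Python sketch that applies the same logic to the four sample rows in memory. It only illustrates what the query selects, not how Hive executes it. The function name `mutual_pairs` is made up for this example, and the sketch assumes the table holds no duplicate rows.

```python
def mutual_pairs(relations):
    """Return the (uid1, uid2) rows whose reverse row (uid2, uid1) also exists.

    Mirrors: a JOIN b ON (a.uid2 = b.uid1 AND a.uid1 = b.uid2).
    ASSUMPTION: the input contains no duplicate rows (a SQL join would
    repeat output rows for duplicates; this sketch would not).
    """
    existing = set(relations)
    return [(u1, u2) for (u1, u2) in relations if (u2, u1) in existing]


# Sample rows from step 3.
sample = [(1, 2), (2, 1), (2, 5), (5, 2)]
result = mutual_pairs(sample)
print(result)
```

Every sample row has its reverse in the table, so all four rows are kept. A one-way row such as (3, 4) with no matching (4, 3) would be dropped.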
5. Run the analysis of second-degree friends for all users:
select e.uid1,f.uid2
from
(select a.uid1,a.uid2 from user_relation a join user_relation b on (a.uid2=b.uid1 and a.uid1=b.uid2)) e join
(select a.uid1,a.uid2 from user_relation a join user_relation b on (a.uid2=b.uid1 and a.uid1=b.uid2)) f
where e.uid1<>f.uid2
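Total data volume: 198,340,072
For a join over subqueries, Hive splits the work into multiple jobs. The SQL above becomes three jobs, and note that they run serially.
The job reports are as follows: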
storage1:
Started at: Fri Oct 14 00:21:15 CST 2011
Finished at: Fri Oct 14 01:54:40 CST 2011
Finished in: 1hrs, 33mins, 25sec
storage2:
Started at: Fri Oct 14 01:54:42 CST 2011
Finished at: Fri Oct 14 03:24:58 CST 2011
Finished in: 1hrs, 30mins, 16sec
storage3:
Started at: Fri Oct 14 03:25:00 CST 2011
Finished at: Fri Oct 14 03:39:59 CST 2011
Finished in: 14mins, 58sec
Total elapsed time: about 3 hours 18 minutes (00:21:15 to 03:39:59)
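As a rough illustration of what "second-degree friends" means on the sample data, the following Python sketch pairs each user with the friends of their friends. Note the assumption: the query in step 5 does not state a join key between e and f, so this sketch assumes the intended key is e.uid2 = f.uid1 and keeps the filter e.uid1 <> f.uid2. The function name `second_degree` is hypothetical, and unlike the SQL the sketch removes duplicate pairs.

```python
def second_degree(mutual):
    """Pair each user with the friends of their friends.

    ASSUMPTION: join key e.uid2 = f.uid1 (not stated in the original query),
    filtered by e.uid1 != f.uid2 so a user is not paired with themselves.
    Duplicate pairs are collapsed, which the SQL would not do.
    """
    # Index: user -> set of that user's mutual friends.
    friends = {}
    for u1, u2 in mutual:
        friends.setdefault(u1, set()).add(u2)

    pairs = set()
    for u1, u2 in mutual:
        for u3 in friends.get(u2, ()):
            if u1 != u3:
                pairs.add((u1, u3))
    return sorted(pairs)


# Mutual pairs produced by the first-degree query on the sample data.
mutual = [(1, 2), (2, 1), (2, 5), (5, 2)]
second = second_degree(mutual)
print(second)
```

On the sample rows, users 1 and 5 are connected only through user 2, so the sketch yields the pairs (1, 5) and (5, 1).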
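Note: all Hive jobs ran serially here, whether they were jobs launched by separate programs or multiple jobs within a single query. It is worth investigating whether jobs created by multiple programs can be made to run in parallel.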
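The timing figures can be checked with a little arithmetic. The sketch below adds up the three reported durations and also gives a back-of-the-envelope estimate of what parallel execution might save. The estimate rests on assumptions not confirmed by the job reports: that the first two jobs are the two independent subqueries, that the third is the final join, and that the cluster could run both subqueries at once without slowing either down. Hive does have a `hive.exec.parallel` setting that allows independent stages of one query to run concurrently; whether it helps here would need to be measured.

```python
from datetime import datetime, timedelta

# Durations copied from the three job reports in step 5.
job1 = timedelta(hours=1, minutes=33, seconds=25)
job2 = timedelta(hours=1, minutes=30, seconds=16)
job3 = timedelta(minutes=14, seconds=58)

# Wall-clock time from the first job's start to the last job's finish.
wall_clock = datetime(2011, 10, 14, 3, 39, 59) - datetime(2011, 10, 14, 0, 21, 15)

# Sum of the individual job durations (excludes the short gaps between jobs).
serial_total = job1 + job2 + job3

# ASSUMPTION: job1 and job2 are independent and could overlap fully,
# with job3 starting only after both have finished.
parallel_estimate = max(job1, job2) + job3

print("wall clock:       ", wall_clock)
print("serial total:     ", serial_total)
print("parallel estimate:", parallel_estimate)
```

Under those assumptions the query would take roughly 1 hour 48 minutes instead of about 3 hours 18 minutes. In practice both subqueries would compete for the same map and reduce slots, so the real saving is likely to be smaller.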