hive案例——微博

最新推荐文章于 2022-11-28 19:40:01 发布

jin6872115

最新推荐文章于 2022-11-28 19:40:01 发布

阅读量1.4k

点赞数

分类专栏： hive 文章标签： hive 微博案例

本文链接：https://blog.csdn.net/jin6872115/article/details/79874103

版权

hive 专栏收录该内容

24 篇文章 0 订阅

订阅专栏

数据样例：
[{"beCommentWeiboId":"",
"beForwardWeiboId":"",
"catchTime":"1387157643",
"commentCount":"682",
"content":"喂！2014。。。2014！喂。。。",
"createTime":"1387086483",
"info1":"",
"info2":"",
"info3":"",
"mlevel":"",
"musicurl":[],
"pic_list":["http://ww1.sinaimg.cn/square/47119b17jw1ebkc9b07x9j218g0xcair.jpg","http://ww4.sinaimg.cn/square/47119b17jw1ebkc9ebakij218g0xc113.jpg","http://ww2.sinaimg.cn/square/47119b17jw1ebkc9hml7dj218g0xcgt6.jpg","http://ww3.sinaimg.cn/square/47119b17jw1ebkc9kyakyj218g0xcqb3.jpg"],
"praiseCount":"1122",
"reportCount":"671",
"source":"iPhone客户端",
"userId":"1192336151",
"videourl":[],
"weiboId":"3655768039404271",
"weiboUrl":"http://weibo.com/1192336151/AnoMrDstN"}]

2、字段描述
总共19个字段
beCommentWeiboId  是否评论
beForwardWeiboId 是否是转发微博
catchTime 抓取时间
commentCount 评论次数
content	内容
createTime 创建时间
info1 信息字段1
info2信息字段2
info3信息字段3
mlevel   no sure
musicurl	音乐链接
pic_list	照片列表（可以有多个）
praiseCount	点赞人数
reportCount	转发人数
source	数据来源
userId	用户id
videourl	视频链接	
weiboId	微博id
weiboUrl	微博网址

1、数据处理：针对数据问题，请给出对应的解决方案（15分）
数据文件过多：要合并，请给出解决方案
输入中小文件过多主要是通过参数调节
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat ：设置Inputformat。
set mapred.max.split.size=256000000：用来设置每个Map最大的输入大小。
set mapred.min.split.size.per.node=100000000：用来设置每个map中从一个node读取的最小数据量。合并的时候，会优先把一个node下的数据合并到一个split里，这样比较容易保证数据节点和计算节点本地行。
set mapred.min.split.size.per.rack=100000000：用来设置每个map中从一个机架上读取的最小数据量。如果没法保证一个split中数据尽量在一个node上，会尽量保证数据在一个机架上。

2、组织数据
（创建Hive表weibo_json(json string)，表只有一个字段，导入所有数据，并验证查询前5条数据）
create external table weibo_json(json  string) location ‘/hw/weibo/wdata/’;
（解析完weibo_json当中的json格式数据到拥有19个字段的weibo表中，写出必要的SQL语句）
给出的数据不是常规json格式，需要去除收尾的'[]'，采用substring(json,2,length(json)-2),获取json格式
然后解析，采用hive内置函数get_json_object进行解析，或者使用json_tuple(jsonStr, k1, k2, ...)
创建表：
create table weibo(beCommentWeiboId string,beForwardWeiboId string,
catchTime string,commentCount int,content string, createTime string,
info1 string,info2 string,info3 string,mlevel string,musicurl string,
pic_list string, praiseCount int, reportCount int, source string,
userId string,videourl string, weiboId string,weiboUrl string)
row format delimited fields terminated by '\t';
插入数据
insert into table weibo select json_tuple
(substring(a.json,2,length(a.json)-2),'beCommentWeiboId',
'beForwardWeiboId','catchTime','commentCount','content','createTime',
'info1','info2','info3','mlevel','musicurl','pic_list','praiseCount',
'reportCount','source','userId','videourl','weiboId','weiboUrl') 
from weibo_json a limit 2;

截取部分加载，总量为117062条

3、统计微博总量 和 独立用户数
select count (*) from weibo;
+---------+
|   _c0   |
+---------+
| 117062  |
+---------+
 select count(distinct userId) from weibo;
+--------+
|  _c0   |
+--------+
| 10336  |
+--------+

4、统计用户所有微博被转发的次数之和，输出top5用户，并给出次数

select userId,sum(reportCount) sumcount from weibo 
group by userId
order by sumcount desc
limit 5;
+-------------+-----------+
|   userid    | sumcount  |
+-------------+-----------+
| 1266321801  | 18556781  |
| 1188552450  | 16467001  |
| 1195230310  | 16318202  |
| 1195242865  | 15928934  |
| 1192329374  | 14109943  |
+-------------+-----------+

5、统计带图片的微博数
默认含有[]，长度是2

select count(*) from weibo 
where length(pic_list)>2;
+--------+
|  _c0   |
+--------+
| 67494  |
+--------+

6、统计使用iphone发微博的独立用户数
select count(distinct userId) from weibo 
where source like '%iPhone%' or source like '%ipone%';
+------+
| _c0  |
+------+
| 69   |
+------+

7、将微博的点赞人数和转发人数相加求和，
并将相加之和降序排列，取前10条记录，输出userid和总次数
select userId,praiseCount,reportCount,(praiseCount+reportCount) as total
from weibo order by total desc limit 10;
+-------------+--------------+--------------+----------+
|   userid    | praisecount  | reportcount  |  total   |
+-------------+--------------+--------------+----------+
| 2202387347  | 1            | 2692012      | 2692013  |
| 1266286555  | 537628       | 1014922      | 1552550  |
| 1266286555  | 537628       | 1014903      | 1552531  |
| 1266286555  | 537627       | 1014756      | 1552383  |
| 1266286555  | 537627       | 1014734      | 1552361  |
| 1266286555  | 537627       | 1014697      | 1552324  |
| 1793285524  | 530857       | 974092       | 1504949  |
| 1793285524  | 530857       | 974089       | 1504946  |
| 1793285524  | 530857       | 973979       | 1504836  |
| 1793285524  | 530857       | 973978       | 1504835  |
+-------------+--------------+--------------+----------+

8、统计微博中评论次数小于1000的用户ID与数据来源信息，将其放入视图，
然后统计视图中数据来源是”ipad客户端”的用户数目
create view weibo_v_8 as 
select userId,source from weibo 
where commentCount < 1000;

select count(distinct userId) from weibo_v_8
where source like '%ipad%' or source like '%iPad%';
+------+
| _c0  |
+------+
| 34   |
+------+

9、统计微博内容中出现”iphone”次数最多的用户，
最终结果输出用户id和次数（注意：该次数是”iphone”的出现次数，
不是出现”iphone”的微博数目）
select userId,sum((length(content)-length(regexp_replace(content,'iphone','')))/6)s 
from weibo group by userId
order by s desc limit 3;
+-------------+-------+
|   userid    |   s   |
+-------------+-------+
| 2120456303  | 13.0  |
| 1287708222  | 7.0   |
| 1567852087  | 6.0   |
+-------------+-------+

10、求每天发微博次数最多的那个家伙的ID和发微博的条数
create view weibo_10_t as
select userId,weiboId,from_unixtime(cast(createTime as Bigint),
'yyyy-MM-dd') dt
from weibo 
limit 3;

select w.userId,count(*) cc
from weibo_10_t w
group by w.userId,w.dt
order by cc  desc limit 3;
+-------------+------+
|  w.userid   |  cc  |
+-------------+------+
| 1638781994  | 120  |
| 1638781994  | 117  |
| 1618051664  | 95   |
+-------------+------+

11、求出所有被多次引用（同一张照片出现在多条微博中，
超过1条就算多条）的照片的数目
照片引用裂变，进行计数，数目>1就可以
create view weibo_11_t as
select explode(split(substring(pic_list,2,length(pic_list)-2),','))pic
from weibo
limit 5;

select count(*) from
(select count(*) cc from weibo_11_t
where length(pic) > 0
group by pic)a
where a.cc > 1;
+-------+
|  _c0  |
+-------+
| 4052  |
+-------+

jin6872115

关注

0
点赞
踩
6

收藏

觉得还不错? 一键收藏
1
评论
hive案例——微博

数据样例：[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387157643","commentCount":"682","content":"喂！2014。。。2014！喂。。。","createTime"
复制链接

扫一扫