首先Hive提供了三种方案(应用于不同的场景):
建表
JsonSerDe
内置函数
get_json_object
内置函数
json_tuple
1.使用JsonSerDe创建表加载JSON文件
注意此种方式需要满足的条件:
json文件必须是形如如下格式:
下面
xxx
可以是对象,也可以是数组
{
key1:xxxx,
key2:xx,
key3,xxx
}
另外这个文件必须存的是一个json对象,而不是多个,不然没法加载,例如下面的加载不报错,查询报错:
下面两个json对象
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387160257","commentCount":"33","content":"【视频直播:专家正同步解读嫦娥三号落月全程】嫦娥三号今晚九时许落月。月球软着陆是嫦娥三号任务成败关键,它碰到石头壕沟会怎么办?没空气摩擦能刹车吗?原嫦娥一号和嫦娥二号地面应用系统副总指挥刘建忠做客新浪嘉宾聊天室,详解最关键“黑色720秒”软着陆诸多悬念。直播间:http://t.cn/8kXy6Ne","createTime":"1387024649","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"149","reportCount":"94","source":"","userId":"1618051664","videourl":[],"weiboId":"3655508684241652","weiboUrl":"http://weibo.com/1618051664/Ani28hNrK"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387160257","commentCount":"33","content":"【视频直播:专家正同步解读嫦娥三号落月全程】嫦娥三号今晚九时许落月。月球软着陆是嫦娥三号任务成败关键,它碰到石头壕沟会怎么办?没空气摩擦能刹车吗?原嫦娥一号和嫦娥二号地面应用系统副总指挥刘建忠做客新浪嘉宾聊天室,详解最关键“黑色720秒”软着陆诸多悬念。直播间:http://t.cn/8kXy6Ne","createTime":"1387024649","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"149","reportCount":"94","source":"","userId":"1618051664","videourl":[],"weiboId":"3655508684241652","weiboUrl":"http://weibo.com/1618051664/Ani28hNrK"}]
甚至是这样的都不能直接解析为对应表(多了一层数组)
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387160257","commentCount":"33","content":"【视频直播:专家正同步解读嫦娥三号落月全程】嫦娥三号今晚九时许落月。月球软着陆是嫦娥三号任务成败关键,它碰到石头壕沟会怎么办?没空气摩擦能刹车吗?原嫦娥一号和嫦娥二号地面应用系统副总指挥刘建忠做客新浪嘉宾聊天室,详解最关键“黑色720秒”软着陆诸多悬念。直播间:http://t.cn/8kXy6Ne","createTime":"1387024649","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"149","reportCount":"94","source":"","userId":"1618051664","videourl":[],"weiboId":"3655508684241652","weiboUrl":"http://weibo.com/1618051664/Ani28hNrK"}]
现在我们来看个正常加载的示例:
案例文件格式
:
这个文件
3.json
内容只有下面一行
内容
{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387160257","commentCount":"33","content":"【视频直播:专> 家正同步解读嫦娥三号落月全程】嫦娥三号今晚九时许落月。月球软着陆是嫦娥三号任务成败关键,它碰到石头壕沟会怎么办?没空
气摩擦能刹车吗?原嫦娥一号和嫦娥二号地面应用系统副总指挥刘建忠做客新浪嘉宾聊天室,详解最关键“黑色720秒”软着陆诸多悬> 念。直播间:http://t.cn/8kXy6Ne","createTime":"1387024649","info1":"","info2":"","info3":"","mlevel":"","musicurl":[ ],"pic_list":[],"praiseCount":"149","reportCount":"94","source":"","userId":"1618051664","videourl":[],"weiboId":"36 55508684241652","weiboUrl":"http://weibo.com/1618051664/Ani28hNrK"}
进入hive建表:
create table weibo.weibodemo2(
beCommentWeiboId string,
beForwardWeiboId string,
catchTime string,
commentCount string,
content string,
createTime string,
info1 string,
info2 string,
info3 string,
mlevel string,
musicurl array<string>,
pic_list array<string>,
praiseCount string,
reportCount string,
source string,
userId string,
videourl array<string>,
weiboId string,
weiboUrl string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
加载数据:
load data local inpath '/home/hadoop/mkmg/weibo/3.json' into table weibodemo2;
结果:
1387160257 33 【视频直播:专家正同步解读嫦娥三号落月全程】嫦娥三号今晚九时许落月。月球软着陆是嫦娥三号任务成败关键,它碰到石头壕沟会怎么办?没空气摩擦能刹车吗?原嫦娥一号和嫦娥二号地面应用系统副总指挥刘建忠做客新浪嘉宾聊天室,详解最关键“黑色720秒”软着陆诸多悬念。直播间:http://t.cn/8kXy6Ne 1387024649 [] [] 149 94 1618051664 [] 3655508684241652 http://weibo.com/1618051664/Ani28hNrK
Time taken: 1.401 seconds, Fetched: 1 row(s)
注意:第一个字段值为空串,所以显示是空的
2.使用get_json_object内置函数加载数据
此方式原本只是用来解析json字符串串的,但是对于json文件也有自己的应用场景
例如下面格式文件的解析:
下面
1,2,3
是行号,即每行一个对象,处理行JSON数据,这种数据使用上面JsonSerDe一个表的话就处理不了。
1 [{key1:xxxx,key2:xxx,key3:xxx,....}]
2 [{key1:xxxx,key2:xxx,key3:xxx,....}]
3 [{key1:xxxx,key2:xxx,key3:xxx,....}]
但是需要明白的是,get_json_object不擅长解析多行数据的(一个json数据占多行),而且单独作为函数使用时,只能给定两个路径,而下面的tuple却可以给多个。
下面测试正常使用:
案例文件
(下面只显示几行)=====》每一行一个json数组
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387162365","commentCount":"75","content":"【@钟路明:为曾说黄绮珊没明星相红不了道歉】20多年前,在广州东山宾馆卜通100歌舞厅,经老板介绍,认识了“苏珊”,一个一线娱乐记者与一个毫无知名度的歌手,交集了,当所有人都认为她歌艺出众一定会红时,我坚信她没明星相,一定红不了。为赌这口气,我从未做过她专访……全文http://t.cn/zTiQCdh","createTime":"1366612056","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/496b2a65jw1e3ydj2dmexj209m0fojs2.jpg"],"praiseCount":"13","reportCount":"79","source":"新浪微博","userId":"1231759973","videourl":[],"weiboId":"3569892063732970","weiboUrl":"http://weibo.com/1231759973/ztkd8fF7c"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387162365","commentCount":"12","content":"【博友:雅安地震灾区所见(图)】@自游人圣地:20日下午接到英国泰晤士报驻北京办事处朋友电话,他们要去灾区采访需要我帮忙。21号早上成雅高速入口已交通管制。318多营出口是通往芦山县最近的一条路,除救护车以外车辆只能绕行。我们抵达雅安南出口,这儿是雅安雨城区枫木乡……http://t.cn/zT6LOPx","createTime":"1366610902","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww2.sinaimg.cn/thumbnail/496b2a65jw1e3ycytwq6pj20go2psaus.jpg"],"praiseCount":"7","reportCount":"56","source":"新浪微博","userId":"1231759973","videourl":[],"weiboId":"3569887223525984","weiboUrl":"http://weibo.com/1231759973/ztk5keNgI"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387162365","commentCount":"20","content":"【作家@韩松落:明星慈善与人们的“作秀猜疑”】因为明星的工作性质,总有人认为,明星的慈善行动,是表演……世界上不存在绝对的、毫无瑕疵的爱。在明星慈善表现面前,我们所要做的是相信,哪怕那只是表象,这种相信,是对自己的最好抚慰,也是对来自他人的抚慰最好的回应。全文http://t.cn/zT6ATq6","createTime":"1366609936","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":[],"praiseCount":"8","reportCount":"64","source":"新浪微博","userId":"1231759973","videourl":[],"weiboId":"3569883172260875","weiboUrl":"http://weibo.com/1231759973/ztjYN9u9J"}]
[{"beCommentWeiboId":"","beForwardWeiboId":"","catchTime":"1387162365","commentCount":"21","content":"【日本毕业生求职面试会遇到哪些问题】@日本百合_hzi:四月份的日本,毕业生们穿着西服走上工作岗位,成为社会人。我翻译了一份日本毕业生面试时需要准备的问题集送给感兴趣的人。你日常是怎么保持自己健康的?你打算做什么样的社会人?你如果落选了怎么办?你的爱好是什么?全文http://t.cn/zTJlB99","createTime":"1366609173","info1":"","info2":"","info3":"","mlevel":"","musicurl":[],"pic_list":["http://ww3.sinaimg.cn/thumbnail/496b2a65jw1e3yc5u8c7lj206y094aaj.jpg"],"praiseCount":"4","reportCount":"86","source":"新浪微博","userId":"1231759973","videourl":[],"weiboId":"3569879971907526","weiboUrl":"http://weibo.com/1231759973/ztjTD80ey"}]
....
1)创建原始表(一个字段状态所有数据)
CREATE TABLE `weibo_json`(`linestr` string);
加载数据:
load data local inpath '/home/hadoop/weibo/1.json' into table weibo_json;
2) 创建最终表
get_json_object(json_txt,path) -- 参数1:json串 参数2:json文件路径
参数2
$ :根节点
. : 子节点
[] : 通过数组元素下表去对应元素
* :取所有的
create table weibo as select get_json_object(linestr,'$[0].beCommentWeiboId') beCommentWeiboId,
get_json_object(linestr,'$[0].beForwardWeiboId') beForwardWeiboId,
get_json_object(linestr,'$[0].catchTime') catchTime,
get_json_object(linestr,'$[0].commentCount') commentCount,
get_json_object(linestr,'$[0].content') content,
get_json_object(linestr,'$[0].createTime') createTime,
get_json_object(linestr,'$[0].info1') info1,
get_json_object(linestr,'$[0].info2') info2,
get_json_object(linestr,'$[0].info3') info3,
get_json_object(linestr,'$[0].mlevel') mlevel,
get_json_object(linestr,'$[0].musicurl') musicurl,
split(regexp_extract(regexp_replace(get_json_object(linestr,'$[0].pic_list'),'\"',''),'^\\[(.+)\\]$',1),',') pic_list,
cast(get_json_object(linestr,'$[0].praiseCount') as int) praiseCount,
cast(get_json_object(linestr,'$[0].reportCount') as int) reportCount,
get_json_object(linestr,'$[0].source') source,
get_json_object(linestr,'$[0].userId') userId,
get_json_object(linestr,'$[0].videourl') videourl,
get_json_object(linestr,'$[0].weiboId') weiboId,
get_json_object(linestr,'$[0].weiboUrl') weiboUrl
from weibo_json;
结果展示:
hive> select * from weibo limit 3;
OK
1387160257 0
1387160257 3
1387160257 15
Time taken: 0.208 seconds, Fetched: 3 row(s)
3.json_tuple
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.
这里就不举加载数据的案例了
hive> select json_tuple('{"name":"jack","server":"www.qq.com","age":23}','server','name','age');
OK
www.qq.com jack 23
Time taken: 0.084 seconds, Fetched: 1 row(s)
第一个参数原生json字符串,第二个,第三个,第四个都是需要解析的key,返回值为一个tuple。