Common Hive built-in functions: supplement and examples

json_tuple

What it does: parses multiple fields out of a JSON string in a single call.
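
For contrast, the single-field alternative get_json_object parses the JSON string once per call, so extracting several fields means several parses; json_tuple parses once and emits all requested fields, which is why it is preferred here. A minimal sketch against the rating_json table loaded below:

	select get_json_object(json, '$.movie') as movie,
	       get_json_object(json, '$.rate')  as rate
	from rating_json limit 3;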

hive (default)> create table rating_json(json string);
hive (default)> load data local inpath '/home/hadoop/data/rating.json' overwrite into table rating_json; //load the data
hive (default)> select * from rating_json limit 10;
OK
rating_json.json
{"movie":"1193","rate":"5","time":"978300760","userid":"1"}
{"movie":"661","rate":"3","time":"978302109","userid":"1"}
{"movie":"914","rate":"3","time":"978301968","userid":"1"}
{"movie":"3408","rate":"4","time":"978300275","userid":"1"}
{"movie":"2355","rate":"5","time":"978824291","userid":"1"}
{"movie":"1197","rate":"3","time":"978302268","userid":"1"}
{"movie":"1287","rate":"5","time":"978302039","userid":"1"}
{"movie":"2804","rate":"5","time":"978300719","userid":"1"}
{"movie":"594","rate":"4","time":"978302268","userid":"1"}
{"movie":"919","rate":"4","time":"978301368","userid":"1"}
Time taken: 0.274 seconds, Fetched: 10 row(s)
hive (default)> select json_tuple(json,"movie","rate","time","userid") from rating_json limit 10; //without aliases the columns come back named c0..c3
hive (default)> select json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid) from rating_json limit 10; //json_tuple parses the JSON fields and renames them
OK
moveid  rate    time    userid
1193    5       978300760       1
661     3       978302109       1
914     3       978301968       1
3408    4       978300275       1
2355    5       978824291       1
1197    3       978302268       1
1287    5       978302039       1
2804    5       978300719       1
594     4       978302268       1
919     4       978301368       1
Time taken: 0.046 seconds, Fetched: 10 row(s)
hive (default)> 

Next, clean the data:
raw (source data) ==> wide table: every field that downstream queries will need is prepared up front.
For example: userid, movie, rate, time, year, month, day, hour, minute, ts (yyyy-MM-dd HH:mm:ss). The first four fields already exist; all the remaining ones still have to be derived. Three conversion functions do the work:
cast(time as bigint) //type conversion; here string is converted to bigint
unix_timestamp('2019-07-21 12:21:21.645') //converts a string timestamp to a Unix timestamp in seconds (bigint)
from_unixtime(1563682881) //converts a Unix timestamp back to a string: 2019-07-21 12:21:21
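
A quick sanity check of the three conversions side by side (a sketch; the literal values are only illustrative):

	select cast('978300760' as bigint)           as ts_bigint, -- string to bigint
	       unix_timestamp('2019-07-21 12:21:21') as to_unix,   -- string to seconds since epoch
	       from_unixtime(1563682881)             as to_string; -- seconds since epoch to yyyy-MM-dd HH:mm:ss
	-- note: a FROM-less SELECT requires Hive 0.13 or later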

      hive (default)> select moveid,rate,time,userid,
              > year(from_unixtime(cast(time as bigint))) as year,
              > month(from_unixtime(cast(time as bigint))) as month,
              > day(from_unixtime(cast(time as bigint))) as day,
              > hour(from_unixtime(cast(time as bigint))) as hour,
              > minute(from_unixtime(cast(time as bigint))) as minute,
              > from_unixtime(cast(time as bigint)) as ts
              > from
              > (select  json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid) 
              > from rating_json 
              > ) tmp 
              > limit 10;
OK
moveid  rate    time    userid  year    month   day     hour    minute  ts
1193    5       978300760       1       2001    1       1       6       12      2001-01-01 06:12:40
661     3       978302109       1       2001    1       1       6       35      2001-01-01 06:35:09
914     3       978301968       1       2001    1       1       6       32      2001-01-01 06:32:48
3408    4       978300275       1       2001    1       1       6       4       2001-01-01 06:04:35
2355    5       978824291       1       2001    1       7       7       38      2001-01-07 07:38:11
1197    3       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
1287    5       978302039       1       2001    1       1       6       33      2001-01-01 06:33:59
2804    5       978300719       1       2001    1       1       6       11      2001-01-01 06:11:59
594     4       978302268       1       2001    1       1       6       37      2001-01-01 06:37:48
919     4       978301368       1       2001    1       1       6       22      2001-01-01 06:22:48
Time taken: 0.726 seconds, Fetched: 10 row(s)
hive (default)> 
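
The same wide query can also be written with LATERAL VIEW instead of a subquery; a sketch that should produce the same rows as above:

	select t.moveid, t.rate, t.time, t.userid,
	       from_unixtime(cast(t.time as bigint)) as ts
	from rating_json
	lateral view json_tuple(json, 'movie', 'rate', 'time', 'userid') t as moveid, rate, time, userid
	limit 10;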

Finally, create the wide table rating_width; all subsequent statistics and analysis run against this rating_width table.

	hive (default)> create table rating_width as
              > select moveid,rate,time,userid,
              > year(from_unixtime(cast(time as bigint))) as year,
              > month(from_unixtime(cast(time as bigint))) as month,
              > day(from_unixtime(cast(time as bigint))) as day,
              > hour(from_unixtime(cast(time as bigint))) as hour,
              > minute(from_unixtime(cast(time as bigint))) as minute,
              > from_unixtime(cast(time as bigint)) as ts
              > from
              > (select  json_tuple(json,"movie","rate","time","userid") as (moveid,rate,time,userid) 
              > from rating_json 
              > ) tmp 
              > ;
Query ID = hadoop_20190721123434_a60f5f15-7479-444d-94a9-dac0a7f61a99
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1562553101223_0024, Tracking URL = http://hadoop001:8078/proxy/application_1562553101223_0024/
Kill Command = /home/hadoop/app/hadoop/bin/hadoop job  -kill job_1562553101223_0024
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2019-07-21 12:39:31,312 Stage-1 map = 0%,  reduce = 0%
2019-07-21 12:39:46,743 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 15.98 sec
MapReduce Total cumulative CPU time: 15 seconds 980 msec
Ended Job = job_1562553101223_0024
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/.hive-staging_hive_2019-07-21_12-39-26_522_1513098922642793536-1/-ext-10001
Moving data to: hdfs://hadoop001:9000/user/hive/warehouse/rating_width
Table default.rating_width stats: [numFiles=1, numRows=1000209, totalSize=57005699, rawDataSize=56005490]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 15.98 sec   HDFS Read: 63606728 HDFS Write: 57005786 SUCCESS
Total MapReduce CPU Time Spent: 15 seconds 980 msec
OK
moveid  rate    time    userid  year    month   day     hour    minute  ts
Time taken: 21.583 seconds
hive (default)> select * from rating_width limit 5;
OK
rating_width.moveid     rating_width.rate       rating_width.time       rating_width.userid     rating_width.year       rating_width.month      rating_width.day        rating_width.hour       rating_width.minute rating_width.ts
1193    5       978300760       1       2001    1       1       6       12      2001-01-01 06:12:40
661     3       978302109       1       2001    1       1       6       35      2001-01-01 06:35:09
914     3       978301968       1       2001    1       1       6       32      2001-01-01 06:32:48
3408    4       978300275       1       2001    1       1       6       4       2001-01-01 06:04:35
2355    5       978824291       1       2001    1       7       7       38      2001-01-07 07:38:11
Time taken: 0.038 seconds, Fetched: 5 row(s)
hive (default)> 
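
As an illustration of what the wide table buys you (this aggregation is not part of the original walkthrough), counting ratings per hour of day is now a plain GROUP BY with no JSON parsing or timestamp math:

	select `hour`, count(*) as cnt
	from rating_width
	group by `hour`
	order by cnt desc
	limit 5;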

parse_url_tuple

parse_url_tuple: a Hive built-in function for parsing URL strings.
Usage: parse_url_tuple(url, partname1, partname2, ..., partnameN)
For example, http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d
can be parsed into HOST, PATH, and QUERY fields, plus the value of the cookieid query parameter (QUERY:cookieid). (The dual table used below is a one-row helper table created beforehand; Hive does not ship one.)
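
Its scalar cousin parse_url extracts a single part per call; a sketch of the equivalent single-field lookup:

	select parse_url('http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d', 'QUERY', 'cookieid') from dual;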

	hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:cookieid") as (host,path,query ,cookieid) from dual;
OK
host    path    query   cookieid
www.ruozedata.com       /d7/xxx.html    cookieid=1234567&a=b&c=d        1234567
Time taken: 0.04 seconds, Fetched: 1 row(s)
hive (default)> 

If you want the value of parameter a instead of cookieid, the same pattern works:

	hive (default)> select parse_url_tuple("http://www.ruozedata.com/d7/xxx.html?cookieid=1234567&a=b&c=d","HOST","PATH","QUERY","QUERY:a") as (host,path,query,a) from dual;
OK
host    path    query   a
www.ruozedata.com       /d7/xxx.html    cookieid=1234567&a=b&c=d        b
Time taken: 0.046 seconds, Fetched: 1 row(s)
hive (default)> 	
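
In practice the URL usually comes from a table column rather than a literal; since parse_url_tuple is a UDTF, it pairs naturally with LATERAL VIEW. A sketch, assuming a hypothetical page_views(url string) table:

	-- page_views(url string) is a hypothetical table, not part of this walkthrough
	select t.host, t.path, t.cookieid
	from page_views
	lateral view parse_url_tuple(url, 'HOST', 'PATH', 'QUERY:cookieid') t as host, path, cookieid;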