Embedding Python scripts in Hive statements (running map and reduce, with a left outer join)


longshenlmj

Article link: https://blog.csdn.net/longshenlmj/article/details/24380041

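To run map and reduce steps through scripts (such as Python or shell) inside a Hive statement, use the TRANSFORM clause (or the MAP and REDUCE keywords) together with add file, which registers the script file for the query.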

See: http://www.coder4.com/archives/4052

The AS keyword in front of a table alias can be omitted; a space before the alias is enough, for example the tables app_stats t1, app_data t2;

Here is a small example first:

add file ${python_script_path}/lanch_interval_count.py;
drop table temp_lanch_interval2;
create table temp_lanch_interval2 as
select reportdate, appid, channelname, app_version, deviceid, ts, sameday
from
(
  from
  (
    from
    (
      -- left outer join: keep every launch record; channelname is NULL when no new-user row matches
      select fl.reportdate, fl.appid, 1 as app_version, fn.channelname, fl.deviceid, fl.linux_time
      from (select reportdate, appid, app_version, deviceid, linux_time
            from factloglanch where dt >= ? and dt <= ?) fl
      left outer join factnewuser_nodimid fn
        on (fl.deviceid = fn.deviceid and fl.appid = fn.appid)
    ) a
    -- map step: /bin/cat passes rows through unchanged; cluster by groups them for the reducer
    map reportdate, appid, channelname, app_version, deviceid, linux_time using '/bin/cat'
    as reportdate, appid, channelname, app_version, deviceid, linux_time
    cluster by appid, channelname, deviceid
  ) b
  -- reduce step: the Python script reads the clustered rows from stdin
  reduce reportdate, appid, channelname, app_version, deviceid, linux_time using 'lanch_interval_count.py'
  as reportdate, appid, app_version, channelname, deviceid, ts, sameday
) c;
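
The post does not show lanch_interval_count.py itself, so the following is only a hypothetical sketch of what such a reducer could look like. It assumes that ts means the number of seconds since the previous launch of the same device, that sameday is 1 when both launches share a reportdate, and that linux_time is always a non-NULL integer; none of these meanings are confirmed by the original. The function name launch_intervals and the sample rows are made up for illustration.

```python
import sys
from itertools import groupby


def cluster_key(row):
    # Same columns as in "cluster by appid, channelname, deviceid".
    return (row[1], row[2], row[4])


def launch_intervals(lines):
    """Yield output rows: reportdate, appid, app_version, channelname, deviceid, ts, sameday.

    Input columns follow the REDUCE clause:
    reportdate, appid, channelname, app_version, deviceid, linux_time.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # CLUSTER BY delivers rows with equal keys next to each other,
    # so groupby can collect one group at a time.
    for _, group in groupby(rows, key=cluster_key):
        # linux_time is not part of the cluster key, so sort inside the group.
        launches = sorted(group, key=lambda r: int(r[5]))
        for prev, cur in zip(launches, launches[1:]):
            ts = int(cur[5]) - int(prev[5])          # assumed meaning of ts
            sameday = 1 if cur[0] == prev[0] else 0  # assumed meaning of sameday
            yield [cur[0], cur[1], cur[3], cur[2], cur[4], str(ts), str(sameday)]


# Made-up sample rows; under Hive the script would iterate over sys.stdin instead.
sample = [
    "2014-04-01\tapp1\tchanA\t1\tdev1\t1396310400\n",
    "2014-04-01\tapp1\tchanA\t1\tdev1\t1396310460\n",
    "2014-04-02\tapp1\tchanA\t1\tdev1\t1396396800\n",
]
for out in launch_intervals(sample):
    sys.stdout.write("\t".join(out) + "\n")
```

Sorting inside each group matters because cluster by only guarantees the order of the key columns, not of linux_time.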

For a detailed explanation, here is a blog post that covers it very well: http://www.coder4.com/archives/4052

TRANSFORM in Hive: using scripts to do Map/Reduce

hive> select * from test;
OK
1       3
2       2
3       1

Suppose we want to output the MD5 value of every column. The Hive version used here has no built-in UDF for this, so we use Python code:

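#!/home/tops/bin/python
# Reads tab-separated rows from stdin and prints the MD5 of every column.
import sys
import hashlib

for line in sys.stdin:
    # Hive separates columns with tabs; strip only the trailing newline
    arr = line.rstrip("\n").split("\t")
    md5_arr = []
    for a in arr:
        # hashlib.md5 needs bytes on Python 3, so encode the text first
        md5_arr.append(hashlib.md5(a.encode("utf-8")).hexdigest())
    print("\t".join(md5_arr))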
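
Before registering a script like this with Hive, it can help to check the per-row logic locally. The sketch below is an illustration only: the helper name md5_row is made up, and the rows are the three rows of the test table, written the way Hive would feed them to the script (columns separated by tabs).

```python
import hashlib


def md5_row(line):
    """Return the tab-joined MD5 hex digests of the tab-separated fields in line."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(hashlib.md5(f.encode("utf-8")).hexdigest() for f in fields)


# The rows of the test table, as tab-separated text lines.
rows = ["1\t3\n", "2\t2\n", "3\t1\n"]
for row in rows:
    print(md5_row(row))
```

Each printed line has as many tab-separated digests as the input row has columns, which is what the AS clause of TRANSFORM expects.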

To use a script (such as Python or shell) in Hive, first add it to the session:
add file /xxxx/test.py;

Then call it in the query with the TRANSFORM syntax:
SELECT
TRANSFORM (col1, col2) USING './test.py' AS (new1, new2)
FROM test;
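Here AS names the output columns, in order. If it is omitted, Hive treats everything before the first tab as the key and the rest of the line as the value.
Note: by default the TRANSFORM delimiter is \t, both for rows passed to the script and for rows read back from it, unless a ROW FORMAT clause on the TRANSFORM says otherwise.
According to the referenced post, this can cause problems when TRANSFORM is combined with INSERT [OVERWRITE] TABLE and the target table's delimiter is not \t but something else such as ';'.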
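
The tab convention can be sketched in Python as follows. This is a minimal illustration, not part of the original post: the helper names parse_row, format_row and split_key_value are made up. It relies on one documented Hive behaviour beyond the text above, namely that NULL values are passed to and from the script as the literal string \N.

```python
NULL = "\\N"  # Hive's text form of NULL in TRANSFORM input and output


def parse_row(line):
    """Split one input line into fields, mapping the NULL marker to None."""
    return [None if f == NULL else f for f in line.rstrip("\n").split("\t")]


def format_row(values):
    """Join output values with tabs, writing None back as the NULL marker."""
    return "\t".join(NULL if v is None else str(v) for v in values)


def split_key_value(line):
    """Mimic the no-AS case: text before the first tab is the key, the rest the value."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


print(parse_row("1\t\\N\t3\n"))
print(format_row([1, None, "x"]))
print(split_key_value("a\tb\tc\n"))
```

Splitting on "\t" explicitly, rather than on any whitespace, keeps column values that contain spaces intact.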

Using the MAP and REDUCE keywords directly:

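The syntax is SELECT MAP (…) USING 'xx.py'.
MAP and REDUCE are merely aliases for TRANSFORM; Hive does not guarantee that the script is actually invoked in the map or reduce phase. Here is what the official documentation says: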
Formally, MAP ... and REDUCE ... are syntactic transformations of SELECT TRANSFORM ( ... ). In
other words, they serve as comments or notes to the reader of the query.
BEWARE: Use of these keywords may be dangerous as (e.g.) typing "REDUCE" does not force a reduce phase
to occur and typing "MAP" does not force a new map phase!
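Mixing the MAP and REDUCE keywords can therefore be confusing, so it is recommended to use TRANSFORM everywhere.
If the command is not a script file but a system command such as awk or sed, it can be used directly (no add file needed), for example: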
map reportdate, appid, channelname,app_version, deviceid,linux_time using '/bin/cat'
     as reportdate, appid, channelname,app_version, deviceid,linux_time
     cluster by appid, channelname,deviceid

If the table contains complex types such as MAP or ARRAY, for example:
CREATE TABLE features
(
    id BIGINT,
    norm_features MAP<STRING, FLOAT>
);
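then to work with them through TRANSFORM, the script's output has to be written in the matching format. In Python this means printing the value in that format, using the delimiters that belong to the complex type, for example the key-value delimiter of the MAP type.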
SELECT TRANSFORM(stuff)
USING 'script'
AS (thing1 INT, thing2 MAP<STRING, FLOAT>)
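
A script that produces the thing2 column could build its output line roughly as below. This is a sketch under an assumption: it uses Hive's default text delimiters for nested types, Ctrl-B ("\x02") between collection items and Ctrl-C ("\x03") between a map key and its value. Check these against the ROW FORMAT actually in effect before relying on them. The helper names and the sample values are made up.

```python
ITEM_SEP = "\x02"  # assumed default collection-item delimiter (Ctrl-B)
KV_SEP = "\x03"    # assumed default map-key delimiter (Ctrl-C)


def format_map(mapping):
    """Serialize a dict as key<KV_SEP>value pairs joined by ITEM_SEP."""
    return ITEM_SEP.join("%s%s%s" % (k, KV_SEP, v) for k, v in mapping.items())


def format_output_row(thing1, thing2):
    """Build one output line: an INT column and a MAP<STRING, FLOAT> column, tab-separated."""
    return "%d\t%s" % (thing1, format_map(thing2))


row = format_output_row(7, {"height": 1.5, "width": 0.25})
print(repr(row))  # repr makes the control characters visible
```

Under this assumption the top-level columns are still separated by tabs; only the values inside the MAP column use the nested delimiters.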
