Custom Functions in Hive Explained

张凯生

Article link: https://blog.csdn.net/weixin_45721467/article/details/108286774

Built-in functions

-- List Hive's built-in functions
show functions;
-- Show the description of a function
desc function max;

User-defined functions (UDF)

User-defined function: UDF

user-defined function

Operates on a single row and produces a single row as output. Most functions (for example, math and string functions) fall into this category.

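User-defined table-generating function: UDTF

user-defined table-generating function

Operates on a single row and produces multiple rows (that is, a table) as output. A typical example is explode() used together with lateral view.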

User-defined aggregate function: UDAF

user-defined aggregate function

Accepts multiple input rows and produces a single output row. Functions such as COUNT and MAX are aggregate functions.

In short:

UDF: returns the corresponding value, one-to-one

UDAF: returns an aggregated value, many-to-one

UDTF: returns split values, one-to-many
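
The three shapes can be illustrated with ordinary Java methods. This is only a sketch of the input/output cardinalities: the class and method names (FunctionShapes, upper, count, splitCsv) are made up for illustration and none of this is Hive API.

```java
import java.util.Arrays;
import java.util.List;

final class FunctionShapes {
    // UDF shape: one input value in, one output value out.
    static String upper(String value) {
        return value.toUpperCase();
    }

    // UDAF shape: many input rows in, one output value out.
    static long count(List<String> rows) {
        return rows.size();
    }

    // UDTF shape: one input value in, many output rows out.
    static List<String> splitCsv(String commaSeparated) {
        return Arrays.asList(commaSeparated.split(","));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));                       // one-to-one
        System.out.println(count(Arrays.asList("a", "b", "c"))); // many-to-one
        System.out.println(splitCsv("a,b,c"));                   // one-to-many
    }
}
```

Running main prints HIVE, then 3, then [a, b, c].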

# 0. Add the Hive dependency

<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>1.2.1</version>
</dependency>

# 1. Define a class that extends UDF

1. The class must extend UDF
2. The method must be named evaluate

package function;

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;

// Write the description in English: Chinese text may show up garbled when viewed later.
@Description(
        name = "hello",
        value = "hello(str1,str2) - returns 'Hello, str1,str2, are there any pretty girls?'"
)
public class HelloUDF extends UDF {
    // The method must be named evaluate
    public String evaluate(String s1, String s2) {
        return "Hello, " + s1 + "," + s2 + ", are there any pretty girls?";
    }
}
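
The naming rule exists because the method is found by name at run time rather than through an interface. The sketch below imitates that idea with plain Java reflection. It is a simplified illustration, not Hive's actual resolver (which also deals with overloads and type conversion), and the classes HelloLogic and MisnamedLogic are hypothetical stand-ins that do not extend Hive's UDF class, so the code runs without hive-exec.

```java
import java.lang.reflect.Method;

final class EvaluateLookupSketch {
    // Hypothetical stand-in for HelloUDF, without the Hive base class.
    static final class HelloLogic {
        public String evaluate(String s1, String s2) {
            return "Hello, " + s1 + "," + s2;
        }
    }

    // Same logic, but the method is not called evaluate.
    static final class MisnamedLogic {
        public String greet(String s1, String s2) {
            return "Hello, " + s1 + "," + s2;
        }
    }

    // Simplified name-based lookup: find a public method called "evaluate"
    // taking two Strings and invoke it. Returns null when no such method exists.
    static String callEvaluate(Object target, String a, String b) {
        try {
            Method m = target.getClass().getMethod("evaluate", String.class, String.class);
            return (String) m.invoke(target, a, b);
        } catch (NoSuchMethodException e) {
            return null;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callEvaluate(new HelloLogic(), "u1", "Beijing"));
        System.out.println(callEvaluate(new MisnamedLogic(), "u1", "Beijing"));
    }
}
```

The first call finds the method and prints the greeting; the second finds nothing to call and prints null, which is the situation a misnamed method would put a real UDF in.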

# 2. Configure the Maven build and package the jar

<properties>
    <!-- Avoids GBK encoding problems -->
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
<build>
    <finalName>funcHello</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <version>2.4</version>
            <configuration>
                <includes>
                    <!-- Package every class in the function package and its subpackages -->
                    <include>**/function/**</include>
                </includes>
            </configuration>
        </plugin>
    </plugins>
</build>

# Build the jar
mvn package

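# 3. Upload the jar to Linux and register the function

-- Run these in the Hive CLI
-- add jar only lasts for the current session
add jar /opt/doc/funcHello.jar;
-- If you rebuild the jar, remember to delete the old one first
delete jar /opt/doc/funcHello.jar;
-- temporary makes the function session-scoped
create [temporary] function hello as "function.HelloUDF";
-- Drop the registered function
drop [temporary] function hello;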

# 4. Inspect and use the function

-- 1. Inspect the function
desc function hello;
desc function extended hello;
-- 2. Use the function in a query
select hello(userid,cityname) from logs;

Installing the troublesome pentaho dependency by hand

# Download
https://public.nexus.pentaho.org/repository/proxied-pentaho-public-repos-group/org/pentaho/pentaho-aggdesigner-algorithm/5.1.5-jhyde/pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
# Put it in a local directory whose path contains only English (ASCII) characters
D:\work\pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar
# Run the mvn command that installs it into the local repository
D:\work> mvn install:install-file -DgroupId=org.pentaho -DartifactId=pentaho-aggdesigner-algorithm  -Dversion=5.1.5-jhyde  -Dpackaging=jar  -Dfile=pentaho-aggdesigner-algorithm-5.1.5-jhyde-javadoc.jar

Examples

Auto-incrementing column (a non-deterministic function)

# Goal: define a function get_num() that can be used like this
select get_num() num,id,name,salary from t_person;

# 1. Define a Java class that extends UDF and write the evaluate method

package function;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

// deterministic means the same input always gives the same output.
// It is false here because the function takes no input and its result changes on every call.
@UDFType(deterministic = false)
public class NumberUDF extends UDF {
    private long index = 0;

    public long evaluate() {
        index++;
        return index;
    }
}

# 2. Build the jar
mvn clean package

# 3. Upload the jar to Linux

# 4. Add the jar to Hive
add jar /opt/doc/myhive1.2.jar;

# 5. Create the function
create temporary function get_num as 'function.NumberUDF';

# 6. Use it
select get_num() num,id,name,salary from t_person;
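
One consequence of keeping the counter in an instance field is that every instance of NumberUDF counts on its own. The plain-Java sketch below mirrors that state without the Hive base class (Counter is a hypothetical stand-in). It assumes, as an illustration rather than something stated above, that a query may be executed by more than one instance, for example one per parallel task. In that case the numbers would be unique only within each instance, not across the whole result.

```java
final class CounterStateSketch {
    // Hypothetical stand-in for NumberUDF: same field, same evaluate logic.
    static final class Counter {
        private long index = 0;

        long evaluate() {
            index++;
            return index;
        }
    }

    public static void main(String[] args) {
        // Two instances, as if two tasks each created their own copy.
        Counter taskA = new Counter();
        Counter taskB = new Counter();

        System.out.println("A: " + taskA.evaluate()); // first row seen by A
        System.out.println("A: " + taskA.evaluate()); // second row seen by A
        System.out.println("B: " + taskB.evaluate()); // first row seen by B, numbering starts over
    }
}
```

The sequence printed is 1, 2 for the first instance and 1 again for the second, so a single-task query is the case where the column behaves like a true row number.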

Converting rows to columns

# Sample table and data

--## Table (movie viewing log)
create table t_visit_video (
    username string,
    video_name string,
    video_date date
)row format delimited fields terminated by ',';
--## Data: Douban viewing-log data (stored by day, one log file per day)
张三,大唐双龙传,2020-03-21
李四,天下无贼,2020-03-21
张三,神探狄仁杰,2020-03-21
李四,霸王别姬,2020-03-21
李四,霸王别姬,2020-03-21
王五,机器人总动员,2020-03-21
王五,放牛班的春天,2020-03-21
王五,盗梦空间,2020-03-21

# collect_list (aggregate function)
Purpose: after grouping, collects the values of a column within each group.
Syntax: select collect_list(column) from table_name group by group_column;

select username,collect_list(video_name) from t_visit_video group by username;

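# collect_set (aggregate function)
Purpose: after grouping, collects the values of a column within each group and removes duplicates.
Syntax: select collect_set(column) from table_name group by group_column;

select username,collect_set(video_name) from t_visit_video group by username;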

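# concat_ws (single-row function)
Purpose: if a column holds an array, joins its elements with the given separator.
select id,name,concat_ws(',',hobbies) from t_person;

--# Transform the t_visit_video data into one row per user
-- List the movies each person watched on 2020-03-21.
select username,concat_ws(',',collect_set(video_name)) from t_visit_video group by username;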
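
To make the expected shape of that result concrete, here is a plain-Java sketch that performs the same three steps (group by username, collect distinct video names, join with a comma) on the sample rows. The class and method names are made up for illustration, and the first-seen ordering inside each group is a choice of this sketch, not something the query above promises.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RowsToColumnsSketch {
    // The sample rows from t_visit_video as {username, video_name} pairs.
    static List<String[]> sampleRows() {
        return Arrays.asList(
                new String[]{"张三", "大唐双龙传"},
                new String[]{"李四", "天下无贼"},
                new String[]{"张三", "神探狄仁杰"},
                new String[]{"李四", "霸王别姬"},
                new String[]{"李四", "霸王别姬"},
                new String[]{"王五", "机器人总动员"},
                new String[]{"王五", "放牛班的春天"},
                new String[]{"王五", "盗梦空间"});
    }

    static Map<String, String> moviesPerUser(List<String[]> rows) {
        // group by username + collect_set: a set per user drops duplicates.
        Map<String, Set<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new LinkedHashSet<>()).add(row[1]);
        }
        // concat_ws(','): join each user's set into one string.
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            result.put(e.getKey(), String.join(",", e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        moviesPerUser(sampleRows())
                .forEach((user, movies) -> System.out.println(user + "\t" + movies));
    }
}
```

Each user ends up on a single line, and the duplicate 霸王别姬 entry for 李四 appears only once.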

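Loading and saving table data

# 1. Load a file into a Hive table
load data local inpath 'path to the file' overwrite into table table_name;
# 2. Put a query result directly into a newly created table (create the table from the query)
    create table table_name as select statement...
        1. Runs the select statement
        2. Creates a new table and stores the query result in it
# 3. Insert a query result into an existing table (not recommended)
    insert into table_name
    select statement...
# 4. Point a newly created Hive table at files that already exist in HDFS
    create table Xxx(
        ...
    )location 'HDFS directory that holds the table data'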
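Inserting the result of a SQL query into another table

create table table_name as select statement

--## Example:
-- List the movies each person watched on 2020-03-21 and store the result in the Hive table t_video_log_20200321
create table t_video_log_20200321 as select ...;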
