Hive01

Author: 正在进阶的程序员

Category: Big Data

Article link: https://blog.csdn.net/weixin_46667646/article/details/117190215


hive

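Hive architecture

[Figure: Hive architecture diagram; image not included]

Execution flow

1. The user submits a SQL query, which is sent to the driver.
2. The driver passes the SQL to the compiler to obtain an execution plan.
3. The compiler fetches metadata from the metastore (which HDFS files back the tables named in the query).
4. The compiler returns the physical execution plan to the driver.
5. The driver hands the physical plan to the execution engine.
6. The execution engine submits the MapReduce job to YARN, which runs it.
7. The job writes its results to the nodes and reports success.
8. The results are returned to the execution engine and then to the driver.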

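Hive deployment modes and setup

Local single-user mode (Derby)

This is the simplest way to store the metastore; only the following hive-site.xml configuration is needed.
The settings below are optional; if they are omitted, the default values apply. Note: Ctrl-V followed by Ctrl-A (in vi this key sequence types the ^A control character, which is Hive's default field delimiter).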

<?xml version="1.0"?>  

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<name>javax.jdo.option.ConnectionURL</name> <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<property>
 <name>javax.jdo.option.ConnectionDriverName</name> <value>org.apache.derby.jdbc.EmbeddedDriver</value>
</property>

<!-- Where Hive stores the data it generates -->
<property>
 <name>hive.metastore.warehouse.dir</name> 
<value>/user/hive/warehouse</value>
</property>
</configuration>

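Replace jline-0.94.0.jar in hadoop/share/hadoop/yarn/lib with Hive's jline-2.12.jar, on all four Hadoop nodes.

Note: with Derby storage, running hive creates a derby.log file and a metastore_db directory in the current directory. The drawback of this mode is that only one Hive client at a time can use the database from a given directory.

Local multi-user mode (MySQL)

This mode requires a MySQL server running locally, configured as shown below. (For both MySQL-based modes described here, copy the MySQL JDBC driver jar into the $HIVE_HOME/lib directory.)

hive-site.xml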

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
<!-- Location where Hive stores its data files -->
 <name>hive.metastore.warehouse.dir</name>
 <value>/user/hive_mysql/warehouse</value>
</property>

<!-- Connect to the local MySQL database -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true</value>
</property>

<!-- Load the MySQL driver -->
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>

<!-- Database user name -->
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>

<!-- Database password -->
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hive</value>
</property>

</configuration>

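Every hive-site.xml in this article has the same shape: a configuration element holding name/value property pairs. As a minimal sketch, assuming you would rather generate the file than edit it by hand, the Python below builds that structure from a dictionary. The helper name build_hive_site is hypothetical, and the property values are simply the ones from the local MySQL example.

```python
import xml.etree.ElementTree as ET


def build_hive_site(props):
    """Return hive-site.xml text for a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


# Values copied from the local multi-user (MySQL) example.
mysql_props = {
    "hive.metastore.warehouse.dir": "/user/hive_mysql/warehouse",
    "javax.jdo.option.ConnectionURL":
        "jdbc:mysql://localhost/hive_remote?createDatabaseIfNotExist=true",
    "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
    "javax.jdo.option.ConnectionUserName": "hive",
    "javax.jdo.option.ConnectionPassword": "hive",
}

xml_text = build_hive_site(mysql_props)
print(xml_text)
```

Generating the file this way keeps comments out of the XML body, which avoids the invalid // comment style that is easy to paste into a config by mistake.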
Appendix:
Install MySQL

yum install mysql-server -y

Grant MySQL privileges:

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123' WITH GRANT OPTION;

Refresh privileges: flush privileges;

[ERROR] Terminal initialization failed; falling back to unsupported

java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

at jline.TerminalFactory.create(TerminalFactory.java:101)

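Cause of the error: the jline version shipped with Hadoop does not match the one shipped with Hive.

Multi-user mode

Remote, combined (metastore service and client on the same node)

This mode requires a MySQL server running on a remote machine, and the metastore service must be started on the Hive server.

Here the MySQL server on node4 is used, with a newly created hive_remote database whose character set is UTF8.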

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>
 <name>hive.metastore.warehouse.dir</name>
 <value>/user/hive_rone/warehouse</value>
</property>

<property>
 <name>javax.jdo.option.ConnectionURL</name>
 <value>jdbc:mysql://node4/hive_rone?createDatabaseIfNotExist=true</value>
</property>

<property>
 <name>javax.jdo.option.ConnectionDriverName</name>
 <value>com.mysql.jdbc.Driver</value>
</property>

<property>
 <name>javax.jdo.option.ConnectionUserName</name>
 <value>hivehive</value>
</property>

<property>
 <name>javax.jdo.option.ConnectionPassword</name>
 <value>hive</value>
</property>

<property>
 <name>hive.metastore.uris</name>
 <value>thrift://node2:9083</value>
</property>

</configuration>

Note: here the Hive server and client are placed on the same machine. They can also be split apart. When starting up, start the metastore service first.

bin/hive --service metastore

bin/hive

./hive --help

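Remote, separated (server and client split)

Split the hive-site.xml configuration file into the following two parts.

1) Server-side configuration file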

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

 <name>hive.metastore.warehouse.dir</name>

 <value>/user/hive/warehouse</value>

</property>

<property>

 <name>javax.jdo.option.ConnectionURL</name> <value>jdbc:mysql://node4/hive?createDatabaseIfNotExist=true</value>

</property>

<property>

 <name>javax.jdo.option.ConnectionDriverName</name>

 <value>com.mysql.jdbc.Driver</value>

</property>

<property>

 <name>javax.jdo.option.ConnectionUserName</name>

 <value>hivehive</value>

</property>

<property>

 <name>javax.jdo.option.ConnectionPassword</name>

 <value>hive</value>

</property>

</configuration>

2) Client-side configuration file

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<property>

 <name>hive.metastore.uris</name>

 <value>thrift://node2:9083</value>

</property>

</configuration>

Start the Hive server-side program:

hive --service metastore

On the client, simply run the hive command:

root@node2:~# hive

Hive history file=/tmp/root/hive_job_log_root_201301301416_955801255.txt

hive> show tables;

test_hive

Time taken: 0.736 seconds

hive>

Watch for the following when starting the client:

[ERROR] Terminal initialization failed; falling back to unsupported

java.lang.IncompatibleClassChangeError: Found class jline.Terminal, but interface was expected

at jline.TerminalFactory.create(TerminalFactory.java:101)

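Cause of the error: the jline version shipped with Hadoop does not match the one shipped with Hive.

HiveSQL (HQL)

Hive data types

Numeric types

tinyint (1-byte signed integer, from -128 to 127)
smallint (2-byte signed integer, from -32,768 to 32,767)
int/integer (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
bigint (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
float (4-byte single-precision floating point number)
double (8-byte double-precision floating point number)

decimal: supported since Hive 0.11.0 with a maximum precision of 38 digits; since Hive 0.13.0 the user can specify the precision and scale. For example, decimal(5,3) can hold 12.345.

Date/time types

timestamp (since Hive 0.8.0): year, month, day, hour, minute, second, and fractional seconds
date (since Hive 0.12.0): year, month, day
interval (since Hive 1.2.0)

String types

string
varchar (since Hive 0.12.0)
char (since Hive 0.13.0)

Other types

boolean
binary

Complex types

arrays: array<data_type>
maps: map<primitive_type, data_type>
structs: struct<col_name : data_type>

Integer types

Strings

A string is a sequence of characters enclosed in single or double quotes.

Dates

Date conversions

cast(date as date): returns the same date value.

cast(timestamp as date): the year/month/day of the timestamp is determined in the current time zone and returned as a date.

cast(string as date): if the string has the form 'YYYY-MM-DD', returns that year/month/day as a date; otherwise returns NULL.

cast(date as timestamp): returns the timestamp for midnight (00:00:00) of the given date.

cast(date as string): converts the year/month/day to a string of the form 'YYYY-MM-DD'.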
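The string/date conversion rules above can be illustrated with a small sketch. This is a rough Python emulation of the rules as stated in this article, not Hive's actual parser (which may accept other input forms); the function names are hypothetical, and None stands in for SQL NULL.

```python
import re
from datetime import date


def cast_string_as_date(s):
    """Emulate cast(string as date): a date for 'YYYY-MM-DD', otherwise None (NULL)."""
    if s is None or not re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        return None
    year, month, day = map(int, s.split("-"))
    try:
        return date(year, month, day)
    except ValueError:  # right shape but not a real calendar date, e.g. month 13
        return None


def cast_date_as_string(d):
    """Emulate cast(date as string): always 'YYYY-MM-DD'."""
    return d.isoformat()


print(cast_string_as_date("2019-04-23"))
print(cast_string_as_date("23/04/2019"))
print(cast_date_as_string(date(2019, 4, 23)))
```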

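Hive database operations

Clearing the screen in the Hive command line: hive> !clear;

Creating a database

create (database|schema) [if not exists] database_name
[comment database_comment]      -- add a comment to the database
[location hdfs_path]            -- specify where the database is stored
[with dbproperties (property_name = property_value)];

Viewing database information

desc database database_name;           -- show the database description
desc database extended database_name;  -- show the database details

Dropping a database

drop (database|schema) [if exists] database_name [restrict|cascade];
restrict: the default mode; the drop fails if the database still contains tables
cascade: drops the tables in the database first, then the database itself

Altering a database

alter (database|schema) database_name set dbproperties (property_name = property_value);

alter (database|schema) database_name set owner [USER|ROLE] user_or_role;

Using a database

use database_name;
use default;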

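Hive table operations

Creating tables

Managed (internal) tables and external tables

Managed tables

By default, a managed table is stored under the path given by the hive.metastore.warehouse.dir property, that is, in the /user/hive/warehouse/databasename.db/tablename/ directory. The default path can be overridden with the location clause when the table is created. When a managed table or partition is dropped, the data and metadata associated with it are deleted together. If the PURGE option is not specified, the data is first moved to the trash and kept there for a configured period.
Use a managed table when you want Hive to manage the table's life cycle, or for temporary tables.

managed table

Dropping a managed table also deletes its data files on HDFS.

External tables

An external table describes the metadata and schema of files that live outside Hive's control (the metadata itself is kept in the metastore database, for example MySQL). External table files can be managed and accessed by external processes. External tables can access data stored in Azure Storage Volumes (ASV) or on a remote HDFS. If the structure or partitioning of an external table changes, use the MSCK REPAIR TABLE table_name statement to repair and refresh the metadata.
Use an external table when the data already exists or is stored remotely, or when the data should not be deleted when the table is dropped.

external table

Dropping an external table leaves the data files on HDFS in place; only the metadata stored in the DBMS is deleted.

Describing a table

desc [extended | formatted] table_name;

Creating tables with like and as

create table new_table like existing_table;

create table tablename1 as select f1, f2, … from tablename2;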

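Hive partitions

Creating a partitioned table: partitioned by

The partition columns must be declared when the table is defined.

Single partition column
create table day_table(
id int,
content string
)
partitioned by (dt string);

Note:
The data is split into one directory per dt value.

Two partition columns

create table day_table(
id int,
content string
)
partitioned by (dt string, hour string);
Note:
The data is split first into dt directories, then into hour subdirectories.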
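Hive names each partition directory column=value and nests the directories in the order the partition columns were declared. The Python sketch below only computes such paths to make the layout concrete; the warehouse path and partition values are assumptions for illustration, and partition_path is a hypothetical helper.

```python
def partition_path(table_dir, partition_spec):
    """Build the directory for one partition; partition_spec keeps declaration order."""
    parts = [f"{col}={val}" for col, val in partition_spec.items()]
    return "/".join([table_dir.rstrip("/")] + parts)


table_dir = "/user/hive/warehouse/day_table"  # assumed default warehouse location

single = partition_path(table_dir, {"dt": "2020-05-23"})
double = partition_path(table_dir, {"dt": "2020-05-23", "hour": "10"})

print(single)  # /user/hive/warehouse/day_table/dt=2020-05-23
print(double)  # /user/hive/warehouse/day_table/dt=2020-05-23/hour=10
```

Because the partition value is encoded in the path, a query that filters on dt only has to read the matching directories.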

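Syntax for adding a partition

This only creates the directory; it contains no data files yet.

alter table table_name add [if not exists] partition partition_spec [location 'location'];
(The table already exists; this adds a partition to it.)

Example:
alter table tb_user1 add partition (age = 10) location '/user/hive';

Dropping a partition

For a managed table, both the metadata and the data of the partition are deleted.

alter table table_name drop [if exists] partition (partcol = val);

Example:
alter table tb_user1 drop partition (age = 10);

Syntax for loading data into a specific partition

load data [local] inpath 'filepath' [overwrite] into table tableName [partition (partcol1 = val1)];

Example:
load data inpath '/user/pv.txt' into table day_hour partition (dt = '2020-05-23');

When data is loaded into a table, it is not transformed in any way. The load operation simply places the file in the location that corresponds to the Hive table (a copy from the local file system, or a move within HDFS). The partition directory is created under the table automatically during the load.

Querying by partition

SELECT day_table.* FROM day_table WHERE day_table.dt >= '2008-08-08';
The point of a partitioned table is to speed up queries, so filter on the partition columns whenever possible. Without a filter on a partition column, the whole table is scanned.

What if partition data was placed in the table directory in advance but Hive does not recognize it?

Option 1: msck repair table tablename;
Option 2: add the partitions directly with alter table ... add partition.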

Hive DML

Loading data from a file into a table
load data [local] inpath 'filepath' [overwrite] into table tablename [partition (partcol1 = val1)];

With local, the file is uploaded from the local file system to HDFS.
Without local, the file is already on HDFS; load only changes the file's path (a move), and no blocks are copied.

Standard syntax:

Overwrite

insert overwrite table tablename1
[partition (partcol1 = val1) [if not exists]]
select_statement from from_statement;

Append

insert into table tablename1
[partition (partcol1 = val1)]
select_statement from from_statement;

Hive extension (multiple inserts from one scan)

from from_statement
insert (overwrite | into) table tablename1
[partition (partcol1 = val1) [if not exists]]   -- if not exists applies to overwrite only
select_statement1

insert (overwrite | into) table tablename2
[partition (partcol1 = val1) [if not exists]]
select_statement2;

Hive regular-expression matching

CREATE TABLE logtbl (
host STRING,
identity STRING,
t_user STRING,
time STRING,
request STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*)\" (-|[0-9]*) (-|[0-9]*)"
)
STORED AS TEXTFILE;
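To see how such a pattern maps one log line onto the seven columns of logtbl, here is a minimal Python sketch. It assumes the intended input.regex is the seven-group expression with * quantifiers, written here as a plain Python regex without the extra escaping a Hive string literal needs. The sample log line is made up, and Python's re module only approximates the Java regex engine Hive uses.

```python
import re

# Assumed pattern: seven capture groups, one per table column.
LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) \[(.*)\] "(.*)" (-|[0-9]*) (-|[0-9]*)'
)
COLUMNS = ["host", "identity", "t_user", "time", "request", "referer", "agent"]


def parse_log_line(line):
    """Return a column -> value dict, or None when the whole line does not match."""
    match = LOG_PATTERN.fullmatch(line)
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Hypothetical access-log line in the format the pattern expects.
sample = '192.168.57.4 - - [29/Feb/2016:18:14:35 +0800] "GET /bg-upper.png HTTP/1.1" 304 -'
row = parse_log_line(sample)
print(row)
```

Each capture group fills exactly one column, in order, which is why the number of groups has to equal the number of columns in the table definition.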

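Note: in the Hive version used here, limit only caps the number of rows returned; it cannot be used directly for paging the way it can in MySQL:
MySQL: select * from tb_user limit 10, 5;
Hive: select * from tb_user limit 5;

Hive beeline

1. beeline is used together with hiveserver2.
2. Start hiveserver2 on the server, that is, the node where you started the metastore:
bin/hive --service hiveserver2

3. The client can connect to Hive through beeline in two ways:

a) Start beeline and connect to the given database in one step:
beeline -u jdbc:hive2://nodex:10000/dbname -n root
b) Start beeline first, then connect to the database:
!connect jdbc:hive2://nodex:10000/dbname root abc

4. By default, the user name and password are not verified.
5. In beeline, use !close to close the connection to hiveserver2.
6. In beeline, use !quit to exit beeline.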

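Hive functions

1. Built-in operators

1.1 Relational operators
Operator (operand types): description
A = B (all primitive types): TRUE if A equals B, otherwise FALSE.
A == B (none): invalid syntax in older Hive releases; newer releases accept it as a synonym for =.
A <> B (all primitive types): NULL if A or B is NULL; otherwise TRUE if A is not equal to B, else FALSE.
A < B (all primitive types): NULL if A or B is NULL; otherwise TRUE if A is less than B, else FALSE.
A <= B (all primitive types): NULL if A or B is NULL; otherwise TRUE if A is less than or equal to B, else FALSE.
A > B (all primitive types): NULL if A or B is NULL; otherwise TRUE if A is greater than B, else FALSE.
A >= B (all primitive types): NULL if A or B is NULL; otherwise TRUE if A is greater than or equal to B, else FALSE.
A IS NULL (all types): TRUE if A is NULL, otherwise FALSE.
A IS NOT NULL (all types): TRUE if A is not NULL, otherwise FALSE.
A LIKE B (strings): NULL if A or B is NULL. A is matched as a whole against the SQL pattern B, in which _ matches any single character and % matches any number of characters. For example, 'foobar' like 'foo' is FALSE, while 'foobar' like 'foo___' and 'foobar' like 'foo%' are TRUE.
A RLIKE B (strings): NULL if A or B is NULL. TRUE if any substring of A matches the Java regular expression B, otherwise FALSE. For example, 'foobar' rlike 'foo' is TRUE, and so is 'foobar' rlike '^f.*r$'.
A REGEXP B (strings): same as RLIKE.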
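The difference between LIKE (whole-string match against a SQL pattern) and RLIKE (regular-expression search) is easy to get wrong, so here is a small Python sketch of both. It assumes RLIKE succeeds when any substring matches, as the current Hive documentation describes. Python's re module stands in for Java regular expressions, escape characters in LIKE patterns are ignored, and None stands in for NULL.

```python
import re


def sql_like(value, pattern):
    """Emulate A LIKE B: _ is one character, % is any run, whole string must match."""
    if value is None or pattern is None:
        return None
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def rlike(value, pattern):
    """Emulate A RLIKE B: true when the regex matches anywhere in the string."""
    if value is None or pattern is None:
        return None
    return re.search(pattern, value) is not None


print(sql_like("foobar", "foo"))      # False: 'foo' does not cover the whole string
print(sql_like("foobar", "foo___"))   # True: three underscores match 'bar'
print(sql_like("foobar", "foo%"))     # True
print(rlike("foobar", "foo"))         # True: substring match is enough
print(rlike("foobar", "^f.*r$"))      # True
```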

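1.2 Arithmetic operators

Operator (operand types): description
A + B (all numeric types): adds A and B. The result type is the common parent type of the operands. For example, every integer is also a float, so adding a float and an integer yields a float.
A - B (all numeric types): subtracts B from A. The result type is the common parent type of the operands.
A * B (all numeric types): multiplies A and B. The result type is the common parent type of the operands. If the multiplication overflows, cast one of the operands to a type higher in the type hierarchy.
A / B (all numeric types): divides A by B. The result is a double.
A % B (all numeric types): the remainder of A divided by B. The result type is the common parent type of the operands.
A & B (all numeric types): bitwise AND of the binary representations of A and B. A result bit is 1 only if the corresponding bit is 1 in both operands; otherwise it is 0.
A | B (all numeric types): bitwise OR of the binary representations of A and B. A result bit is 1 if the corresponding bit is 1 in either operand; otherwise it is 0.
A ^ B (all numeric types): bitwise XOR of the binary representations of A and B. A result bit is 1 if and only if the corresponding bit is 1 in exactly one operand; otherwise it is 0.
~A (all numeric types): bitwise NOT (complement) of A.

1.3 Logical operators
Operator (operand types): description
A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE. NULL if A or B is NULL.
A && B (boolean): same as A AND B.
A OR B (boolean): TRUE if A or B or both are TRUE, otherwise FALSE. NULL if both A and B are NULL.
A || B (boolean): same as A OR B.
NOT A (boolean): TRUE if A is FALSE, NULL if A is NULL, otherwise FALSE.
! A (boolean): same as NOT A.

1.4 Complex type constructors
Function: description
map(key1, value1, key2, value2, …): creates a map from the given key/value pairs.
struct(val1, val2, val3, …): creates a struct from the given field values. The struct fields are named col1, col2, …
array(val1, val2, …): creates an array from the given elements.

1.5 Operators on complex types

Operator (operand types): description
A[n] (A is an array, n is an int): returns the nth element of array A, where the first element has index 0. For example, if A is ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
M[key] (M is a Map<K, V>, key has type K): returns the value stored under key. For example, if M is {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
S.x (S is a struct): returns the x field of S. For example, for the struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field.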
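2. Built-in functions

2.1 Mathematical functions

Return type, function: description
BIGINT round(double a): rounds a to the nearest integer.
DOUBLE round(double a, int d): rounds a to d decimal places; for example, round(21.263, 2) returns 21.26.
BIGINT floor(double a): rounds down to the nearest integer; for example, floor(21.2) returns 21.
BIGINT ceil(double a), ceiling(double a): rounds up to the nearest integer; for example, ceil(21.2) returns 22.
double rand(), rand(int seed): returns a uniformly distributed random number greater than or equal to 0 and less than 1 (the value changes on each evaluation).
double exp(double a): returns e raised to the power a.
double ln(double a): returns the natural logarithm of a.
double log10(double a): returns the base-10 logarithm of a.
double log2(double a): returns the base-2 logarithm of a.
double log(double base, double a): returns the logarithm of a in the given base.
double pow(double a, double p), power(double a, double p): returns a raised to the power p.
double sqrt(double a): returns the square root of a.
string bin(BIGINT a): returns a in binary format.
string hex(BIGINT a), hex(string a): converts an integer or a string to hexadecimal format.
string unhex(string a): the inverse of hex; converts hexadecimal digits back to the characters they encode.
string conv(BIGINT num, int from_base, int to_base): converts a number from one base to another. For example, conv('a', 16, 2) returns '1010', the binary representation of hexadecimal a.
double abs(double a): returns the absolute value.
int/double pmod(int a, int b), pmod(double a, double b): returns the positive value of a mod b. This is not the same as the absolute value of the remainder.
double sin(double a): returns the sine of the given angle.
double asin(double a): returns the arcsine of a if -1 <= a <= 1, otherwise NULL.
double cos(double a): returns the cosine.
double acos(double a): returns the arccosine of a if -1 <= a <= 1, otherwise NULL.
int/double negative(int a), negative(double a): returns the negation of a; for example, negative(2) returns -2.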
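Two of these functions are easiest to understand by example: pmod, which differs from a plain remainder for negative inputs, and conv, which changes number bases. The Python sketch below is an emulation under stated assumptions: it handles integers only, conv handles non-negative values only, and java_rem is a hypothetical helper that mimics the sign behaviour of Java's % operator (Python's own % behaves differently for negative numbers).

```python
def java_rem(a, b):
    """Integer remainder that takes the sign of the dividend, like Java's %."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r


def pmod(a, b):
    """Positive modulus, computed as ((a % b) + b) % b with Java-style %."""
    return java_rem(java_rem(a, b) + b, b)


def conv(num, from_base, to_base):
    """Convert a non-negative number, given as text or int, between bases 2..36."""
    digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    n = int(str(num), from_base)
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, d = divmod(n, to_base)
        out.append(digits[d])
    return "".join(reversed(out))


print(java_rem(-7, 3))   # -1: plain remainder keeps the sign of -7
print(pmod(-7, 3))       # 2: not abs(-1)
print(conv("a", 16, 2))  # 1010
```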

2.2 Collection functions
Return type, function: description
int size(Map<K,V>): returns the number of elements in the map.
int size(Array<T>): returns the number of elements in the array.

select size(likes), size(addrs) from tb_user1 where id=3;

2.3 Type conversion functions
Return type, function: description
<type> cast(expr as <type>): converts expr to the given type. For example, cast('1' as bigint) converts the string '1' to an integer. Returns NULL if the conversion fails.

hive> select "1"+1 from tb_user1 limit 1;
OK
2.0
Time taken: 0.131 seconds, Fetched: 1 row(s)
hive> select cast("1" as int)+1 from tb_user1 limit 1;
OK
2
Time taken: 0.171 seconds, Fetched: 1 row(s)
hive> select cast("1" as double)+1 from tb_user1 limit 1;
OK
2.0
Time taken: 0.156 seconds, Fetched: 1 row(s)

hive> select cast('2019-04-23' as date) from tb_user1 limit 1;
OK
2019-04-23
Time taken: 0.152 seconds, Fetched: 1 row(s)
hive> select year(cast('2019-04-23' as date)) from tb_user1 limit 1;
OK
2019
Time taken: 0.166 seconds, Fetched: 1 row(s)

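2.4 Date functions
Return type, function: description
string from_unixtime(bigint unixtime[, string format]): converts a number of seconds since the Unix epoch (1970-01-01 00:00:00 UTC) to a string that represents that moment in the current time zone, in 'yyyy-MM-dd HH:mm:ss' format unless another format is given.
bigint unix_timestamp(): called without arguments, returns the current Unix timestamp (seconds elapsed since 1970-01-01 00:00:00 UTC).
bigint unix_timestamp(string date): converts a 'yyyy-MM-dd HH:mm:ss' string, interpreted in the current time zone, to the number of seconds since 1970-01-01 00:00:00 UTC.
bigint unix_timestamp(string date, string pattern): parses date with the given pattern and returns the number of seconds since 1970; for example, unix_timestamp('2009-03-20', 'yyyy-MM-dd') = 1237532400 (the exact value depends on the time zone).
string to_date(string timestamp): returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01".
string to_dates(string date): given a date, returns a day count (the number of days since year 0).
int year(string date): returns the year of the given date or timestamp.
int month(string date): returns the month of the given date or timestamp, from 1 to 12.
int day(string date), dayofmonth(date): returns the day of the month.
int hour(string date): returns the hour, from 0 to 23.
int minute(string date): returns the minute, from 0 to 59.
int second(string date): returns the second, from 0 to 59.
int weekofyear(string date): returns the week number within the year for the given date.
int datediff(string enddate, string startdate): returns the number of days from startdate to enddate.
string date_add(string startdate, int days): adds the given number of days to startdate.
string date_sub(string startdate, int days): subtracts the given number of days from startdate.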

hive> select unix_timestamp() from tb_user1 limit 1;
OK
1569201586
Time taken: 0.143 seconds, Fetched: 1 row(s)
hive> select unix_timestamp('1970-01-01 12:10:11') from tb_user1 limit 1;
OK
15011
Time taken: 0.124 seconds, Fetched: 1 row(s)
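The value 15011 in the session above may look surprising, since 12:10:11 is 43811 seconds after midnight. It is consistent with the string being interpreted in a UTC+8 session time zone, which is an assumption about the author's environment. The following Python sketch reproduces the arithmetic; the function names mirror the Hive functions, but the explicit utc_offset_hours parameter is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unix_timestamp(text, utc_offset_hours):
    """Seconds since the epoch for a 'yyyy-MM-dd HH:mm:ss' string in a fixed-offset zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(text, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)
    return int(local.timestamp())


def from_unixtime(seconds, utc_offset_hours):
    """Inverse direction: format epoch seconds as local time in the same zone."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz).strftime("%Y-%m-%d %H:%M:%S")


print(unix_timestamp("1970-01-01 12:10:11", 0))  # 43811 when read as UTC
print(unix_timestamp("1970-01-01 12:10:11", 8))  # 15011 when read as UTC+8
print(from_unixtime(15011, 8))                   # 1970-01-01 12:10:11
```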

2.5 Conditional functions

Return type, function: description
T if(boolean testCondition, T valueTrue, T valueFalseOrNull): returns valueTrue when the condition holds, otherwise valueFalseOrNull.
T COALESCE(T v1, T v2, …): returns the first value that is not NULL, or NULL if all values are NULL.
T CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END: returns c when a = b, e when a = d, otherwise f.
T CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END: returns b when condition a is true, d when condition c is true, otherwise e.

hive> select if(1>2,'ok','no ok') from tb_user1 limit 1;
OK
no ok
Time taken: 0.202 seconds, Fetched: 1 row(s)
hive> select if(1<2,'ok','no ok') from tb_user1 limit 1;
OK
ok
Time taken: 0.165 seconds, Fetched: 1 row(s)
select
case id
when 1 then "Xiaoming 1"
when 2 then 'Xiaoming 2'
else "anything"
end
from tb_user1;

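2.6 String functions

Return type, function: description
int length(string A): returns the length of the string.
string reverse(string A): returns the string reversed.
string concat(string A, string B…): concatenates the strings into one; accepts any number of input strings.
string concat_ws(string SEP, string A, string B…): concatenates the strings, separated by the given separator.
string substr(string A, int start), substring(string A, int start): returns the part of A from position start to the end.
string substr(string A, int start, int len), substring(string A, int start, int len): returns the part of A that starts at position start and has length len.
string upper(string A), ucase(string A): converts the string to upper case.
string lower(string A), lcase(string A): converts the string to lower case.
string trim(string A): removes spaces from both ends of the string; spaces between characters are kept.
string ltrim(string A): removes spaces from the left end; other spaces are kept.
string rtrim(string A): removes spaces from the right end; other spaces are kept.
string regexp_replace(string A, string B, string C): replaces every part of A that matches the regular expression B with C.
string regexp_extract(string subject, string pattern, int index): returns the capture group with the given index from the regular-expression match. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'.
string parse_url(string urlString, string partToExtract [, string keyToExtract]): returns the requested part of a URL. parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') returns 'facebook.com'.
string get_json_object(string json_string, string path): extracts the JSON object at the given path. Example: select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname') from log a;
string space(int n): returns a string of n spaces.
string repeat(string str, int n): repeats the string n times.
int ascii(string str): returns the numeric value of the first character of the string.
string lpad(string str, int len, string pad): returns a string of length len; if str is shorter, it is padded on the left with pad.
string rpad(string str, int len, string pad): returns a string of length len; if str is shorter, it is padded on the right with pad.
array<string> split(string str, string pat): splits str around matches of the regular expression pat and returns an array.
int find_in_set(string str, string strList): returns the position of the first occurrence of str in the comma-separated list strList. Returns NULL if either argument is NULL, and 0 if the first argument contains a comma.
array<array<string>> sentences(string str, string lang, string locale): splits the text into sentences, each of which is an array of words. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).
array<struct<string,double>> ngrams(array<array<string>>, int N, int K, int pf): returns the top K N-grams from tokenized sentences. Example: SELECT ngrams(sentences(lower(tweet)), 2, 100 [, 1000]) FROM twitter;
array<struct<string,double>> context_ngrams(array<array<string>>, array<string>, int K, int pf): returns the top K contextual N-grams. Example: SELECT context_ngrams(sentences(lower(tweet)), array(null,null), 100 [, 1000]) FROM twitter;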
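The two documented examples for regexp_extract and parse_url can be checked with a rough Python emulation. This is only a sketch: Python's re and urllib.parse stand in for Hive's Java implementations, only a few partToExtract values are handled, and what Hive returns when nothing matches is not modelled here (None is used as a placeholder).

```python
import re
from urllib.parse import urlparse, parse_qs


def regexp_extract(subject, pattern, index):
    """Return capture group `index` of the first match, or None if there is no match."""
    match = re.search(pattern, subject)
    return match.group(index) if match else None


def parse_url(url, part, key=None):
    """Return one part of a URL; handles HOST, PATH, QUERY (optionally one key), REF."""
    parsed = urlparse(url)
    if part == "HOST":
        return parsed.hostname
    if part == "PATH":
        return parsed.path
    if part == "QUERY":
        if key is None:
            return parsed.query
        values = parse_qs(parsed.query).get(key)
        return values[0] if values else None
    if part == "REF":
        return parsed.fragment
    return None


url = "http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1"
print(regexp_extract("foothebar", "foo(.*?)(bar)", 2))  # bar
print(parse_url(url, "HOST"))                            # facebook.com
print(parse_url(url, "QUERY", "k1"))                     # v1
```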

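3. Built-in aggregate functions (UDAF)

Return type, function: description
bigint count(*), count(expr), count(DISTINCT expr[, expr, …]): returns the number of rows.
double sum(col), sum(DISTINCT col): returns the sum.
double avg(col), avg(DISTINCT col): returns the average.
double min(col): returns the minimum value of the column.
double max(col): returns the maximum value of the column.
double var_pop(col): returns the population variance of the column.
double var_samp(col): returns the sample variance of the column.
double stddev_pop(col): returns the population standard deviation of the column.
double stddev_samp(col): returns the sample standard deviation of the column.
double covar_pop(col1, col2): returns the population covariance of the two columns.
double covar_samp(col1, col2): returns the sample covariance of the two columns.
double corr(col1, col2): returns the correlation coefficient of the two columns.
double percentile(col, p): returns the exact p-th percentile of the column, where 0 <= p <= 1 (otherwise NULL). Floating point columns are not supported.
array<double> percentile(col, array(p1 [, p2]…)): returns the exact percentiles p1, p2, … of the column, where 0 <= p <= 1 (otherwise NULL). Floating point columns are not supported.
double percentile_approx(col, p [, B]): returns an approximate p-th percentile of a numeric column (including floating point types) in the group. The B parameter controls the approximation accuracy at the cost of memory; higher values give better approximations, and the default is 10,000. When the number of distinct values in col is smaller than B, the result is the exact percentile.
array<double> percentile_approx(col, array(p1 [, p2]…) [, B]): same as above, but accepts and returns an array of percentile values instead of a single one.
array<struct {'x','y'}> histogram_numeric(col, b): computes a histogram of a numeric column in the group using b non-uniformly spaced bins. The output is an array of size b of double-valued (x, y) coordinates that represent the bin centers and heights.
array collect_set(col): returns the values with duplicates removed.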
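The _pop and _samp variants differ only in the divisor: the population forms divide by n, the sample forms by n - 1. As an illustration, Python's statistics module exposes the same four quantities; the input values below are made up, and the set() call is only a rough analogue of collect_set.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]  # hypothetical column values; mean is 2.5

var_pop = statistics.pvariance(values)    # sum of squared deviations / n
var_samp = statistics.variance(values)    # sum of squared deviations / (n - 1)
stddev_pop = statistics.pstdev(values)    # square root of var_pop
stddev_samp = statistics.stdev(values)    # square root of var_samp

# Rough analogue of collect_set: distinct values only (sorted here for display).
distinct = sorted(set([3, 1, 3, 2, 1]))

print(var_pop, var_samp)
print(stddev_pop, stddev_samp)
print(distinct)
```

With these four values the squared deviations sum to 5, so var_pop is 5/4 and var_samp is 5/3.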

4. Built-in table-generating functions (UDTF)

a,b,c,d,e

Return type, function: description
array element type, explode(array<T> a): takes an array with several elements in one record and outputs the elements as separate rows, one row per element.
json_tuple compared with get_json_object
get_json_object statement:
select a.timestamp,
get_json_object(a.appevents, '$.eventid'), get_json_object(a.appevents, '$.eventname')
from log a;
json_tuple statement:
select a.timestamp, b.* from log a
lateral view
json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

Hive wordcount
create table tb_lines(line string) row format delimited lines terminated by '\n';

hello.txt
select * from tb_lines limit 5;

select split(line, " ") from tb_lines;

select explode(split(line, " ")) from tb_lines limit 3;

from (select explode(split(line, " ")) word from tb_lines) lines
select word, count(word) group by lines.word;

create table tb_count(
word string,
wordnum BIGINT
);

from (select explode(split(line, " ")) word from tb_lines) lines
insert into tb_count
select word, count(word) cwd group by lines.word order by cwd desc;

select * from tb_count limit 5;
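The HiveQL steps above (split each line, explode the array into rows, group by word, order by count) map onto a few lines of ordinary code. This Python sketch mirrors those steps on made-up input; the contents of hello.txt are not given in the article, so the three lines below are an assumption.

```python
from collections import Counter

# Hypothetical contents of hello.txt, one table row per line.
lines = ["hello hive", "hello hadoop", "hive on yarn"]

# split(line, " ") followed by explode(...): one output row per word.
words = [word for line in lines for word in line.split(" ")]

# group by word with count(word).
counts = Counter(words)

# order by cwd desc, as in the insert into tb_count statement.
tb_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

for word, wordnum in tb_count:
    print(word, wordnum)
```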

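5. User-defined functions
There are three kinds of user-defined functions: UDF, UDAF, and UDTF.
UDF (User-Defined Function): one row in, one row out.
UDAF (User-Defined Aggregation Function): an aggregate function, many rows in, one row out, like count/max/min.
UDTF (User-Defined Table-Generating Function): one row in, many rows out, as with lateral view explode().
Usage: in a Hive session, add the jar file that contains the function, create the function, then call it.

5.1 Developing a UDF
1. A UDF can be used directly in a select statement to format the query result before it is output.
2. Points to note when writing a UDF:
a) The class must extend org.apache.hadoop.hive.ql.exec.UDF.
b) It must implement an evaluate method; evaluate can be overloaded.

3. Steps
a) Package the program and copy it to the target machine.
b) In the Hive client, add the jar: hive> add jar /run/jar/udf_test.jar;

c) Create a temporary function: hive> CREATE TEMPORARY FUNCTION add_example AS 'hive.udf.Add';
d) Run HQL queries:
SELECT add_example(8, 9) FROM scores;
SELECT add_example(scores.math, scores.art) FROM scores;
SELECT add_example(6, 7, 8, 6.8) FROM scores;
e) Drop the temporary function: hive> DROP TEMPORARY FUNCTION add_example;

add jar /root/myhello.jar;
create function hello as 'com.bjsxt.hive.demo.MyHello';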
select hello(name) from tb_user1;

drop function hello;

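5.2 UDAF: user-defined aggregate functions
Many rows in, one row out, like sum() and min(); used together with group by.
1. Required base types:
org.apache.hadoop.hive.ql.exec.UDAF (the function class extends it)
org.apache.hadoop.hive.ql.exec.UDAFEvaluator (the inner Evaluator class implements this interface)
2. The Evaluator must implement the init, iterate, terminatePartial, merge, and terminate methods:
 init(): similar to a constructor; initializes the UDAF state.
 iterate(): receives the input arguments and updates the internal state; returns boolean.
 terminatePartial(): takes no arguments; called after iterate has finished, it returns the partial aggregation state, similar to a Hadoop Combiner.
 merge(): receives the result of terminatePartial and merges it into the current state; returns boolean.
 terminate(): returns the final result of the aggregate function.

Exercise: develop a function that works like:
Oracle's wm_concat() function
MySQL's group_concat()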
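A real UDAF is written in Java against the Hive classes named above. To show only the order in which the five methods are called, here is a plain Python simulation of a group_concat-style evaluator. The class name, the separator parameter, and the two simulated map tasks are all hypothetical, and nothing here uses the Hive API.

```python
class GroupConcatEvaluator:
    """Simulates the UDAF evaluator life cycle for a group_concat-like aggregate."""

    def __init__(self, sep=","):
        self.sep = sep
        self.init()

    def init(self):
        # Reset the aggregation state.
        self.parts = []

    def iterate(self, value):
        # Called once per input row; NULL (None) values are skipped.
        if value is not None:
            self.parts.append(str(value))
        return True

    def terminate_partial(self):
        # Hand the partial state to the next stage, like a combiner output.
        return list(self.parts)

    def merge(self, partial):
        # Fold another evaluator's partial state into this one.
        if partial:
            self.parts.extend(partial)
        return True

    def terminate(self):
        # Produce the final value for the group.
        return self.sep.join(self.parts) if self.parts else None


# Two simulated map-side evaluators each see part of one group.
map1, map2 = GroupConcatEvaluator(), GroupConcatEvaluator()
for value in ["a", "b"]:
    map1.iterate(value)
for value in ["c", None]:
    map2.iterate(value)

# The reduce-side evaluator merges the partial results.
reducer = GroupConcatEvaluator()
reducer.merge(map1.terminate_partial())
reducer.merge(map2.terminate_partial())
result = reducer.terminate()
print(result)  # a,b,c
```

Keeping the partial state as a list rather than an already-joined string means merge does not have to parse anything, which is one reasonable design choice for this kind of aggregate.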

Data types in Hive UDFs:
