Hive Personal Notes

JXCypress

Article link: https://blog.csdn.net/JXCypress/article/details/51209275

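Hive Knowledge Overview
Data warehouse: a database that holds a subject-oriented, integrated, non-updatable collection of data that is not changed once loaded. It supports decision analysis in an enterprise or organization.
It is built for a specific need, draws on scattered data sources, is not modified over time, and is used only for queries.

Building a data warehouse:
Data sources -> data extraction and transformation (ETL) -> data warehouse engine -> front-end presentation

Data sources: business data systems, documents, other data
ETL: extract, transform, load
Data warehouse engine: server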

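2. Metadata is stored in a database (the metastore); MySQL and Derby are supported.
Metadata includes: table names, each table's columns and partitions and their properties, table properties (for example, whether the table is external), and the directory that holds the table's data.
3. How HQL is executed
Parser (lexical analysis) -> compiler (generates the HQL execution plan) -> optimizer (optimizes the plan produced by the compiler) -> finally a MapReduce program is generated.

For comparison, in Oracle an execution plan is viewed with: explain plan for select * from emp where deptno=10
Create an index: create index myindex on emp(deptno)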

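Hive Architecture
Clients: CLI, JDBC/ODBC, web console (the web console can only run queries)

  Hive driver: parses, compiles, optimizes, and executes statements

Hadoop: NameNode; JobTracker (job scheduling)
Worker nodes

Linux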

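Hive installation modes:
Embedded: uses Hive's built-in Derby database. Drawback: only one connection is allowed at a time.
Local: metadata is stored in a MySQL database, and MySQL runs on the same physical machine as Hive.
Remote: metadata is stored on a different physical machine or operating system.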

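hive --service cli
1 Command-line mode
show tables;        -- list tables
show functions;     -- list functions
Common CLI commands
desc table_name;    -- describe a table
dfs -ls directory;  -- list files on HDFS
!command;           -- run an operating system command: exclamation mark followed by the command
select * from emp;  -- run an HQL statement
source sql_script;  -- run a SQL script

hive -S    -- start the CLI in silent mode, which suppresses debug output and MapReduce progress messages

// Run a statement directly without entering interactive mode (-e for execute)
hive -e 'show tables'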

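2 Hive web management interface (HWI): query statements only
Configuration: follow the official documentation to configure the web interface in hive-site.xml
Copy the JDK's tools.jar into Hive's lib directory, or configure it in hive-env.sh
Start it with
hive --service hwi &
Port: 9999
Browser URL: http://IP:9999/hwi/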

3 Hive remote service
Port: 10000
hive --service hiveserver &

4 Data types
Primitive types
Same as in relational databases, for example:
string
char(20): fixed length of 20
varchar(20): maximum length of 20

Complex types
    array
    map
    struct
        create table student(
            sid int,
            aname string,
            score array<float>
        )
        Sample row: (1, 'Tom', [70, 78, 80])

        create table student1(
            sid int,
            aname string,
            score map<string,float>
        )
        Sample row: (1, 'Tom', {'Chinese': 85})

        create table student2(
            sid int,
            aname string,
            scores array<map<string,float>>  -- an array of maps, one map per course score
        )
        Sample row: (1, 'Tom', [{'Chinese': 70}, {'Math': 60}])
        // struct: the fields may have different data types
        create table student3(
            sid int,
            info struct<name:string,age:int,sex:string>
        )
        Sample row: (1, {'Tom', 10, 'male'})


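Date and time types
    date: supported since 0.12.0, format yyyy-MM-dd
    timestamp: supported since 0.8.0; interpreted as a time-zone-independent number (a long integer)
    Convert between the two with the cast function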

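5 Hive data storage
Based on HDFS
No dedicated storage format
The storage structure mainly consists of databases, files, tables, and views
Text files can be loaded directly
Tables
table: internal (managed) table
Similar to a table in a relational database. Each table has its own directory in Hive, and all of the table's data is stored in that directory. Dropping the table deletes both the metadata and the data.
eg: create table t1(id int, name string, age int)
create table t2(id int, name string, age int) location 'hdfs path'
create table t3(id int, name string, age int) row format delimited fields terminated by ','  // specify the field delimiter when creating the table
create table t4 as select * from sample_table  // create a table from a query; with no delimiter specified, Hive uses its default delimiter Ctrl-A (\001), which is invisible when the file is viewed as text
create table t5 row format delimited fields terminated by ',' as select * from sample_table  // specify the delimiter
alter table t1 add columns(english int);
drop table t1  // the data is moved to the HDFS trash (.Trash)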
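partition: partitioned table. Improves scan efficiency by avoiding full-table scans. Use explain to view the execution plan, reading it from bottom to top and left to right.
Partition: comparable to a dense index on the partition column in a relational database.
In Hive, each partition of a table corresponds to a subdirectory under the table's directory, and all of a partition's data is stored in that subdirectory.
eg:
create table partition_table (id int, name string, age int)
partitioned by (gender string)
row format delimited fields terminated by ','

insert into table partition_table partition(gender='M') select sid,ename,age from sample_table where gender='M'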
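To make the one-directory-per-partition layout concrete, here is a small plain-Java sketch (JDK only, no Hive libraries) that builds the path a partition's data would sit under. The warehouse root /user/hive/warehouse and the helper name partitionDir are assumptions for illustration; the real root depends on the hive.metastore.warehouse.dir setting.

```java
final class PartitionPathSketch {
    // Build "<warehouse>/<table>/<column>=<value>", the directory naming
    // pattern Hive uses for a single-column partition.
    static String partitionDir(String warehouseDir, String table, String column, String value) {
        return warehouseDir + "/" + table + "/" + column + "=" + value;
    }

    public static void main(String[] args) {
        // Assumed warehouse root; adjust to your own configuration.
        String warehouse = "/user/hive/warehouse";
        System.out.println(partitionDir(warehouse, "partition_table", "gender", "M"));
        System.out.println(partitionDir(warehouse, "partition_table", "gender", "F"));
    }
}
```

A query with where gender='M' only needs to read the first of these two directories, which is why partitioning avoids a full-table scan.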
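        external table
            Points to data that already exists in HDFS; partitions can be created.
            Its metadata is organized the same way as for an internal table, but the actual storage differs considerably.
            For an external table, creating the table and loading the data happen in a single step. The data is not moved into the warehouse directory; the table only links to the external data, and dropping an external table removes only that link.
            The location keyword points to the external data.
            eg: create external table external_student (sid int, sname string, age int) row format delimited fields terminated by ','  location '/hdfs/hive/input'
            To remove the data itself, delete the directory in HDFS.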
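        bucket table
            A bucket table hashes the value of a column and stores rows in different files according to the hash. For example, each sname value is hashed and the row is placed in the corresponding bucket.
            create table bucket_table(sid int, sname string, age int) clustered by (sname) into 5 buckets;    // clustered by (sname) hashes the sname column and distributes rows across 5 buckets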
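Bucket assignment can be pictured as hash-then-modulo. The sketch below is plain Java and uses String.hashCode as a stand-in for Hive's own hash function, which is an assumption, as are the sample names. It only illustrates that equal sname values always land in the same one of the 5 buckets.

```java
import java.util.ArrayList;
import java.util.List;

final class BucketSketch {
    // Map a key to a bucket index in [0, numBuckets).
    static int bucketFor(String key, int numBuckets) {
        // Clear the sign bit so a negative hash cannot produce a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 5;
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        // Hypothetical sname values, used only to show the distribution.
        for (String sname : new String[] {"tom", "mary", "mike", "king", "scott"}) {
            buckets.get(bucketFor(sname, numBuckets)).add(sname);
        }
        for (int i = 0; i < numBuckets; i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```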
    Views

        A view is a virtual table that can span multiple tables. As in a relational database, it stores no data and can simplify queries.
        create view emp_info as
            select e.empno, e.ename, e.sal*12 annlsal, d.dname
            from emp e, dept d
            where e.deptno=d.deptno

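Hive is a data warehouse built on Hadoop; its data is stored in HDFS, and statements are translated into MapReduce jobs.
Tables are managed through metadata.
Three metastore modes are provided:
embedded, local, and remote.
Programming interfaces are provided:
JDBC, clients, and user-defined functions.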

create table hive_t1(
a int,
b string
)
load data local inpath '/student1' overwrite into table tb2;

Load all files in a directory

load data local inpath '/student/' overwrite into table tb2;
load data inpath 'hdfs-path' overwrite into table tb2;

load data local inpath '/root/data/data1' into table partition_table_name partition (col_name='man')

Using Sqoop

export HADOOP_COMMON_HOME=$HADOOP_HOME_PATH   # HADOOP_HOME_PATH stands for the Hadoop installation directory
export HADOOP_MAPRED_HOME=$HADOOP_HOME_PATH
cd sqoop home  # Sqoop works over JDBC

Oracle -> HDFS

Upload the Oracle JDBC driver to Sqoop's lib directory

sqoop import --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger --table emp --columns 'empno,ename,job' -m 1 --target-dir '/sqoop/emp'

Import into Hive without specifying a table name
sqoop import --hive-import --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger --table emp --columns 'empno,ename,job' -m 1

Specify the Hive table name

sqoop import --hive-import --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger --table emp --columns 'empno,ename,job' -m 1 --hive-table emp1  # emp1 is created if it does not exist

Import with a condition

sqoop import --hive-import --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger --table emp --columns 'empno,ename,job' -m 1 --hive-table emp1 --where 'deptno=2'

sqoop import --hive-import --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger -m 1 --query 'select * from emp where $CONDITIONS' --target-dir 'sqoop/emp' --hive-table emp5

Export from Hive to Oracle
sqoop export --connect jdbc:oracle:thin:@...:1521:orcl --username scott --password tiger -m 1 --table oracle_table_emp --export-dir '/sqoop/emp'

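Hive queries

desc emp

Query all employee information
select * from emp

Query selected employee columns
select empno, ename, sal from emp

Arithmetic expressions are supported, as in relational databases
select empno, ename, sal, sal*12 from emp

Convert null to 0
select empno, ename, sal, nvl(comm, 0), sal*12 + nvl(comm, 0) from emp
Find rows whose bonus is null using the is keyword
select * from emp where comm is null
Remove duplicate rows with distinct
select distinct deptno, job from emp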

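Fetch task for simple queries (queries with no functions or sorting)
set hive.fetch.task.conversion=more;  -- such queries are then not converted into MapReduce jobs

Or set it at startup: hive --hiveconf hive.fetch.task.conversion=more
To make the setting permanent, add it to
conf/hive-site.xml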

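Filtering in queries

select * from emp where deptno=2
select * from emp where ename='king'  // string literals go in single quotes; string comparison is case-sensitive
select * from emp where deptno=10 and sal<2000
View the execution plan
explain select * from emp where deptno=10 and sal<2000

Pattern matching
select * from emp where ename like 's%'
The underscore has a special meaning (any single character), so escape it to match a literal underscore
select * from emp where ename like '%\\_%'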

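Sorting

Query employee number, name, and monthly salary, sorted by monthly salary. Sorting is a costly operation in Hive and requires a MapReduce job.
select empno, ename, sal from emp order by sal desc

order by accepts a column, an expression, an alias, or a column position
select empno, ename, sal, sal*12 from emp order by sal*12  // expression
select empno, ename, sal, sal*12 annulsal from emp order by annulsal  // alias
select empno, ename, sal, sal*12 annulsal from emp order by 4  // sort by the fourth column; requires set hive.orderby.position.alias=true (hive.groupby.orderby.position.alias on Hive releases before 2.2.0)
Sort by bonus (comm): null values sort first by default
select empno, ename, sal, comm from emp order by comm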
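The nulls-first behavior of an ascending order by can be mimicked in plain Java, which may help when checking query results by hand. This is only an analogy using the JDK's Comparator.nullsFirst, not Hive code, and the comm values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class NullOrderSketch {
    // Return a sorted copy: nulls first, then ascending values,
    // matching the default order described for "order by comm".
    static List<Integer> sortNullsFirst(List<Integer> comm) {
        List<Integer> sorted = new ArrayList<>(comm);
        sorted.sort(Comparator.nullsFirst(Comparator.naturalOrder()));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical bonus column with two missing values.
        System.out.println(sortNullsFirst(Arrays.asList(300, null, 500, null, 0)));
        // prints [null, null, 0, 300, 500]
    }
}
```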

Hive functions

Math functions
round: rounds to the nearest value, select round(number, decimal_places)
ceil: rounds up, ceil(49.9)
floor: rounds down

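String functions
lower('string'), upper('string'), length('string') (number of characters)
concat('a', 'b'): concatenates strings
substr(a, b): returns the characters of a from position b to the end
substr(a, b, c): returns c characters of a starting at position b
trim: removes leading and trailing spaces
rpad: pads on the right
lpad: pads on the left; lpad('abc', 10, '*') pads on the left to a total length of 10 using '*'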
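Because substr positions start at 1 rather than 0, the semantics are easy to get wrong. The following plain-Java sketch re-implements substr and lpad as described above so the results can be checked outside Hive. The class and method names are hypothetical, inputs are assumed to be non-null with start >= 1, and edge cases (negative positions, empty pad) are not modeled on Hive's behavior.

```java
final class StringFuncSketch {
    // substr(a, b): characters of a from 1-based position b to the end.
    static String substr(String s, int start) {
        return s.substring(start - 1);
    }

    // substr(a, b, c): at most c characters of a starting at 1-based position b.
    static String substr(String s, int start, int len) {
        int from = start - 1;
        return s.substring(from, Math.min(s.length(), from + len));
    }

    // lpad(s, len, pad): pad s on the left with pad until it is len long;
    // a longer s is cut down to len characters.
    static String lpad(String s, int len, String pad) {
        if (pad.isEmpty()) {
            throw new IllegalArgumentException("pad must not be empty");
        }
        if (s.length() >= len) {
            return s.substring(0, len);
        }
        StringBuilder sb = new StringBuilder();
        while (sb.length() < len - s.length()) {
            sb.append(pad);
        }
        sb.setLength(len - s.length()); // trim overshoot from multi-character pads
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(substr("hello", 2));     // ello
        System.out.println(substr("hello", 2, 3));  // ell
        System.out.println(lpad("abc", 10, "*"));   // *******abc
    }
}
```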
Collection functions
size returns the number of entries in a map
size(map(key, value, key, value, ...))
size(map(1, 'tom', 2, 'cat')) returns 2

Type conversion
cast(1 as float)
cast('2016-05-05' as date)

Date functions

to_date('2016-05-04 11:13:14');
year|month|day('2016-05-04 11:13:14')
weekofyear('2016-05-04 11:13:14')
datediff('2016-05-04 11:13:14', '2016-05-05 11:13:14')  // number of days between the two dates (first argument minus second)
date_add|date_sub('2016-05-04 11:13:14', 2)
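The argument order of datediff is a common source of sign errors, so here is a plain-Java sketch using java.time that mirrors to_date, datediff, and date_add on the sample timestamps above. It is an illustration of the arithmetic, not Hive's implementation; the class and method names are hypothetical, and it assumes inputs in the yyyy-MM-dd HH:mm:ss form.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class DateFuncSketch {
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // to_date: keep only the date part of a timestamp string.
    static LocalDate toDate(String timestamp) {
        return LocalDateTime.parse(timestamp, TS).toLocalDate();
    }

    // datediff(end, start): whole days from start to end; time of day is ignored.
    static long dateDiff(String end, String start) {
        return ChronoUnit.DAYS.between(toDate(start), toDate(end));
    }

    // date_add: the date part shifted forward by the given number of days.
    static LocalDate dateAdd(String timestamp, int days) {
        return toDate(timestamp).plusDays(days);
    }

    public static void main(String[] args) {
        String a = "2016-05-04 11:13:14";
        String b = "2016-05-05 11:13:14";
        System.out.println(toDate(a));       // 2016-05-04
        System.out.println(dateDiff(a, b));  // -1, because a is one day before b
        System.out.println(dateAdd(a, 2));   // 2016-05-06
    }
}
```

Swapping the two arguments of dateDiff gives 1 instead of -1, which is the result most people expect when they ask how many days apart two dates are.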

Conditional functions
select comm, sal, coalesce(comm, sal) from emp  -- returns the first non-null value, scanning left to right
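To show why null handling matters in expressions such as sal*12 + comm, the sketch below models nvl, coalesce, and SQL-style null propagation in plain Java. It is an analogy with hypothetical names and made-up salary values, not Hive code.

```java
final class NullFuncSketch {
    // coalesce: first non-null argument, scanning left to right; null if none.
    @SafeVarargs
    static <T> T coalesce(T... values) {
        for (T v : values) {
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    // nvl: value if it is not null, otherwise the fallback.
    static <T> T nvl(T value, T fallback) {
        return value != null ? value : fallback;
    }

    // SQL-style arithmetic: any null operand makes the whole result null.
    static Integer annualIncome(Integer sal, Integer comm) {
        if (sal == null || comm == null) {
            return null;
        }
        return sal * 12 + comm;
    }

    public static void main(String[] args) {
        Integer sal = 1000;
        Integer comm = null; // employee without a bonus
        System.out.println(annualIncome(sal, comm));          // null
        System.out.println(annualIncome(sal, nvl(comm, 0)));  // 12000
        System.out.println(coalesce(comm, sal));              // 1000
    }
}
```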

case ... when ...: conditional expression
case a
when b then c
when d then e
else f
end

select ename,job,sal,
case job when 'manager' then sal+1000
when 'employee' then sal+800
end
from emp

Aggregate functions
count
sum
min
max
avg
select count(*), sum(sal), max(sal), min(sal), avg(sal) from emp

Table-generating functions: turn each element of a collection or array into its own row
explode(map(1, 'tom', 2, 'cat', 3, 'johe'))
Output
1 tom
2 cat
3 johe
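The effect of explode on a map can be pictured as a loop that emits one row per entry. The plain-Java sketch below reproduces the output above for the same sample map; the class name, the sample() helper, and the tab separator are choices made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ExplodeSketch {
    // The sample map from the notes, in insertion order.
    static Map<Integer, String> sample() {
        Map<Integer, String> m = new LinkedHashMap<>();
        m.put(1, "tom");
        m.put(2, "cat");
        m.put(3, "johe");
        return m;
    }

    // One output row ("key<TAB>value") per map entry.
    static List<String> explode(Map<Integer, String> map) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> e : map.entrySet()) {
            rows.add(e.getKey() + "\t" + e.getValue());
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : explode(sample())) {
            System.out.println(row);
        }
    }
}
```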

User-defined functions handle more complex logic

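Hive table joins
Equi-join
select e.ename, d.deptno from emp e, dept d where e.deptno=d.deptno
Non-equi join
select * from emp e, salgrade s where e.sal between s.losal and s.hisal  // losal and hisal are the lower and upper bounds of a salary grade
Outer join
select d.deptno, d.dname, count(e.empno)
from emp e right outer join dept d
on (e.deptno=d.deptno)
group by d.deptno, d.dname
Self join: give one table several aliases so it can be treated as multiple tables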

Hive subqueries
Hive supports subqueries only in the from and where clauses
select e.ename from emp e where e.deptno in (select d.deptno from dept d where d.dname='sal' or d.dname='accounting')
Null values in subqueries: add where xx is not null inside the subquery

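Java client: JDBC operations
Start the warehouse server
hive --service hiveserver

JDBC client operations
Steps:
Get a connection
Create a statement
Execute HQL
Process the results
Release resources
Create a Java project
Add the Hive driver to the project's classpath: hive-jdbc.jar and all the jars under Hive's lib directory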

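Create a utility class:
private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver"; // fully qualified name of the driver class in the driver jar
private static String url = "jdbc:hive://ip:port/default";
// register the driver
static {
    try {
        Class.forName(driver);
    } catch (ClassNotFoundException e) {
        throw new ExceptionInInitializerError(e);
    }
}

// get a connection
public static Connection getConn() throws SQLException {
    return DriverManager.getConnection(url);
}

// release resources; each close is attempted even if an earlier one fails
static void release(Connection conn, Statement st, ResultSet rs) {
    if (rs != null) {
        try { rs.close(); } catch (SQLException e) { e.printStackTrace(); }
    }
    if (st != null) {
        try { st.close(); } catch (SQLException e) { e.printStackTrace(); }
    }
    if (conn != null) {
        try { conn.close(); } catch (SQLException e) { e.printStackTrace(); }
    }
}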

Test
main method:
1. Get a connection from the utility class
2. st = conn.createStatement()
3. rs = st.executeQuery(sql)
4. // process the data
while (rs.next()) {

}

finally {
// release resources
}
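The five JDBC steps can also be written with try-with-resources, which closes the result set, statement, and connection automatically. The sketch below uses only java.sql from the JDK. The host, port, and query are placeholders, the class and method names are hypothetical, and it assumes the Hive JDBC driver jar is on the classpath. Without a driver or a reachable server, the main method simply prints the error.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

final class HiveJdbcSketch {
    // Build a URL in the jdbc:hive://host:port/database form used in these notes.
    static String buildUrl(String host, int port, String database) {
        return "jdbc:hive://" + host + ":" + port + "/" + database;
    }

    // Run a query and collect the first column of every row as text.
    static List<String> firstColumn(String url, String hql) throws SQLException {
        List<String> values = new ArrayList<>();
        // Resources are closed in reverse order when the block exits.
        try (Connection conn = DriverManager.getConnection(url);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(hql)) {
            while (rs.next()) {
                values.add(rs.getString(1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Placeholder connection details; replace with your own server.
        String url = buildUrl("localhost", 10000, "default");
        try {
            for (String ename : firstColumn(url, "select ename from emp")) {
                System.out.println(ename);
            }
        } catch (SQLException e) {
            System.err.println("Could not query Hive at " + url + ": " + e.getMessage());
        }
    }
}
```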

——————————————————————
Via the Thrift client
Create a socket connection
final TSocket tSocket = new TSocket("ip", port);
// create a protocol
final TProtocol tProtocol = new TBinaryProtocol(tSocket);
// create the Hive client
final HiveClient client = new HiveClient(tProtocol);

// open the socket
tSocket.open();

// execute a statement

client.execute("desc emp");

// fetch the results
client.fetchAll();

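Hive user-defined functions: write your own when the built-in functions do not meet your needs
UDF
1. Extend org.apache.hadoop.hive.ql.exec.UDF
2. Implement the evaluate method; evaluate can be overloaded
3. Package the program and copy it to the target machine, start the Hive client, and add the jar with: add jar /root/test/udfjar/testudf.jar
4. Create a temporary function that points to the class of your UDF
create temporary function function_name as 'java class name of the UDF'

Usage:
select function_name(args) from emp
Drop the temporary function:
drop temporary function function_name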

Example:
1. Extend UDF
2. Implement the evaluate method
public Text evaluate(Text a, Text b) {
    return new Text(a.toString() + "*");
}

Call the function