hive入门操作

最新推荐文章于 2022-04-10 09:50:24 发布

throws-Exception

最新推荐文章于 2022-04-10 09:50:24 发布

阅读量214

点赞数

分类专栏： hive 文章标签： hive

本文链接：https://blog.csdn.net/Shemon_zjw/article/details/108619737

版权

hive 专栏收录该内容

9 篇文章 0 订阅

订阅专栏

本文介绍Hive作为基于Hadoop的数据仓库解决方案，提供类SQL查询语言HQL，支持用户自定义函数，成熟JDBC和ODBC驱动，适用于ETL和BI。文章详细讲解Hive元数据管理，包括数据类型、元数据结构、连接数据库方法，以及Hive的基本命令如创建数据库、表、分区，数据导入和查询等。

摘要由CSDN通过智能技术生成

hive 入门

什么是Hive
- 基于Hadoop的数据仓库解决发难
  - 将结构话的数据文件映射为数据库表
  - 提供类sql的查询语言HQL
Hive的优势和特点
- 提供了一个简单的优化模型
- HQL类SQL语法,简化MR开发
- 支持在不同的计算框架上运行
- 支持在HDFS和HBase上临时查询数据
- 支持用户自定义函数,格式
- 成熟的JDBC和ODBC驱动程序,用于ETL和BI
- 稳定可靠的批处理
Hive元数据管理
- 记录数据仓库中模型的定义,各层级间的映射关系
- 存储在关系数据库中
  - 默认Derby,轻量级内嵌sql数据库
  - 存储在.metastore_db目录中
- 实际生产一般存储在MySql中
  - 修改配置文件hive-site.xml
- Hcatalog
  - 将Hive元数据共享给其他应用程序

Hive数据类型

类似sql数据类型

类型	示例	类型	示例
TINYINT	10Y	SMALLINT	10S
INT	10	BIGINT	100L
FLOAT	1.333	DOUBLE	1.234
DECIMAL	3.15	BINARY	1010
BOOLEAN	TRUE	STRING	“BOOK”
CHAR	“YES”	VARCHAR	“BOOK”
DATE	‘2010-12-24’	TIMESTAMP	‘2013-01-31 00:13:00.345’

复杂数据类型
- Array:存储的数据为相同类型
- Map:具有相同类型的键值对
- STRUCT:封装了一组字段
  
  类型格式定义示例
  array [1,2,3,5] Array a[0]=1
  Map {‘a’:‘apple’} Map<String,String> b[‘a’]=‘apple’
  struct {‘apple’,2} STRUCT <fruit:String, weight:Int> c.weight=2

类型	格式	定义	示例
array	[1,2,3,5]	Array	a[0]=1
Map	{‘a’:‘apple’}	Map<String,String>	b[‘a’]=‘apple’
struct	{‘apple’,2}	STRUCT <fruit:String, weight:Int>	c.weight=2

Hive元数据结构

数据结构	描述	逻辑关系	物理存储
Database	数据库	表的集合	文件夹
Table	表	行数据的集合	文件夹
Partiton	分区	用于分割数据	文件夹
Buckets	分桶	用于分布数据	文件
Row	行	行记录	文件中的行
Columns	列	列记录	每行中指定的位置
Views	视图	逻辑概念	不存储数据
Index	索引	记录统计数据信息	文件夹

hive连接数据库
需要将mysql-connector-jar-5.1.38.jar复制到hive根目录下的lib文件夹下
```
schematool -dbType mysql initSchema
```
启动hive黑界面

beeline黑界面–jdbc黑界面

hiveserver2/hive --service metastore

beeline -u jdbc:hive2://192.168.56.100:10000 -n root

mysql黑界面
```
hive --service metastore
```
```
hive
```
hive基本命令
创建数据库
```
create database mydemo;
```

创建内部表

create table userinfos(
	user int,
	username string
);

创建外部表

create external table customs(
	custid string,
	cusname string,
	age int
)
row format delimited fields terminated by ','
location '/data'
;

     create external table if not exists transaction_details (
         transaction_id string,
         customer_id string,
         store_id string,
         price string,
         product string,
         date string,
         time string
     )
     
     -- 解决csv中有逗号的情况
     row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
     WITH SERDEPROPERTIES (
         "separatorChar" = ",",
         "quoteChar" = "\"",
         "escapeChar" = "\\"
     )
     STORED AS TEXTFILE
     --location '/transaction_details.csv'
     tblproperties ("skip.header.line.count"="1")
     load data local inpath '/opt/shop/transaction_details.csv' overwrite into table transaction_details

hive修改数据表

ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

CTAS-WITH

建表

create table example as select * from .... where 1=1(复制全部数据)
create table example as select * from .... where 1=2(复制表结构)

with语句

with
a1 as (select class,sum(score)sum from scores group by class),
a2 as (select class ,score from scores)
select a1.class,(a2.score/a1.sum)sss from a1 inner join a2 on a1.class = a2.class;

hive 数据分区

分区主要用于提高性能
分区列的值将表划分为segments(文件夹)
查询时使用分区列和常规列类似
查询hive自动过滤不用于提高性能的分区
分为静态分区和动态分区

分区建表

CREATE TABLE employee_partitioned(
    name string,
    work_place ARRAY<string>,
    sex_age STRUCT<sex:string,age:int>,
    skills_score MAP<string,int>,
    depart_title MAP<STRING,ARRAY<STRING>> )
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
stored as textfile 
tblproperties ("skip.header.line.count"="1")

MAP KEYS TERMINATED BY ‘:’;
```

静态分区操作

ALTER TABLE employee_partitioned ADD 
PARTITION (year=2019,month=3) PARTITION (year=2019,month=4);

ALTER TABLE employee_partitioned DROP PARTITION (year=2019, month=4);
```

- 添加静态分区的数据

  ```sql
  insert into 追加
insert overwrite into覆盖 拉链表 全量表
  ```

  ```sql
  塞值
  insert into mypart partition(gender='male') values(1,'zs');
  塞表 静态塞值的时候不需要塞分区字段名
  insert overwrite table mypart partition(gender='female')
  select userid,username from userinfos;
  如果塞的表和分区的分区字段不一致，会强行把表的分区字段变为一致
  就是到这个分区，这个分区的字段都变为一致。
  ```

动态分区操作

使用动态分区设定属性–开启动态分区

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

添加动态分区

insert into table employee_partitioned partition(year, month)
select name,array('Toronto') as work_place,
named_struct("sex","male","age",30) as sex_age,
map("python",90) as skills_score,
map("r&d", array('developer')) as depart_title,
year(start_date) as year,month(start_date) as month
from employee_hr eh;

设置动态分区的个数上限

set hive.exec.max.created.files=600000;

Hive 查询 - select基础
- select用于映射符合指定查询条件的行
- Hive select是数据库标准sql的子集
- 使用方法类似于mysql
```
```sql
select * from ....;
```
```

Hive查询-CTE和嵌套查询

with t1 as (select .... from ....) select * from t1;

嵌套查询

select * from (select * from ...)a;

Hive查询 - 进阶语句

列匹配正则表达式

set hive.support.quoted.identifiers=none;

select `o.*`from offers;

exists关键字

select  a.customer_id from customers  a 
where not exists(
select  order_customer_id from orders where a.customer_id= order_customer_id
)

虚拟列
两个连续下划线,用于数据验证

```sql
input__file__name: mapper task的输入文件名称
block__offset__inside__file:当前全局文件位置
```

hive join - 关联
- 指对多表进行联合查询
- join用于将两个或者多个表中的行组合在一起查询
- 类似sql join
内连接 : inner join
- 外连接 : outer join
- right join,left join,full outer join
- 交叉连接 : cross join
- 隐式连接 : implicit join
hive join- mapjoin
- mapjoin操作在map端完成
小表关联大表
- 可进行不等值连接
开启join操作
set hive.auto.convert.join =true
运行时自动将连接转换为mapjoin
mapjoin操作不支持
在union all ,lateral view, group by/join/sort by/cluster by/distribute by等操作后面
在union,join以及其他mapjoin之前

hive 导入数据

移动本地数据文件到hive数据库表

load data local inpath '/opt/aa.csv' overwrite into table mydemo.customs2;

将一张表的数据导入到另一种表

```sql
给一张表的对应分区里插入另一张表的数据，静态塞值的时候不需要塞分区字段名
insert into table mypart partition(gender) 
select userid,username,gender from userinfos;
insert overwrite table ...............
```

使用insert语句将数据插入/导出到文件

从同一数据源插入本地文件,hdfs文件,表
from ctas_employee
insert overwrite local directory '/tmp/out1'  select *
insert overwrite directory '/tmp/out1' select *
insert overwrite table employee_internal select *;
以指定格式插入数据
insert overwrite directory '/tmp/out3'
row format delimited fields terminated by ','
select * from ctas_employee;
其他方式从表获取文件
hdfs dfs -getmerge <table_file_path>

hive 数据交换

import和export用于数据导入和导出
常用于数据迁移场景
除数据库,可导入导出所有数据和元数据
使用export导出数据

EXPORT TABLE employee TO '/tmp/output3';
EXPORT TABLE employee_partitioned partition (year=2014, month=11) TO '/opt/...'

使用import导入数据

IMPORT TABLE employee FROM '/tmp/output3';
IMPORT TABLE employee_partitioned partition (year=2014, month=11) FROM '/tmp/output5';

侧视图
- 常与表生成函数结合使用，将函数的输入和输出连接
- outer关键字:即使output为空也会产生结果
```
select name,work_place,loc from employee lateral view outer explode(split(work_place,',')) a as loc;
```
- 支持多层级
select name,wps,skill,score from employee
lateral view explode(work_place) work_place_single as wps
lateral view explode(skills_score) sks as skill,score;
```
- 常用于规范化行或者解析json
```
排序
- order by
  - 只使用一个Reducer执行全局数据排序
  - 速度慢,应提前做好数据过滤
  - 支持使用CASE WHEN或表达式
  - 支持按位置编号排序
  - set hive.groupby.orderby.position.alias=true;
- sort by/distribute by
  - SORT BY对每个Reducer中的数据进行排序
  - 当Reducer数量设置为1时，等于order by
  - 排序列必须出现在select column列表中
  - distribute by类似于标准sql中的group by
  - 确保具有匹配列值的行被分区到相同的reducer
  - 不会对每个reducer的输出进行排序
  - 通常使用在sort by语句之前
- cluster by
  - cluster by = distribute by + sort by
  - 不支持asc|desc
  - 排序列必须出现在select column列表中
  - 为了充分利用所有的reducer来执行全局排序，可以先使用cluster by，然后使用order by
分组
- group by
  - Hive基本内置聚合函数与GROUP BY一起使用
  - 如果没有指定GROUP BY子句，则默认聚合整个表
  - 除聚合函数外，所选的其他列也必须包含在GROUP BY中
  - GROUP BY支持使用CASE WHEN或表达式
  - 支持按位置编号分组
  - set hive.groupby.orderby.position.alias=true;
  - select的列必须包含在GROUP BY中
  - 对NULL的聚合为0
- grouping sets
  - 实现对同一数据集进行多重GROUP BY操作
  - 本质是多个GROUP BY进行UNION ALL操作
- 2、hive语句的执行顺序：
  (1)from
  (2)on
  (3)join
  (4)where
  (5)group by
  (6)having
  (7)select
  (8)distinct
  (9)distribute by /cluster by
  (10)sort by
  (11) order by
  (12) limit
  (13) union /union all