Hive的DDL语句

最新推荐文章于 2024-07-30 15:56:26 发布

余淮~

最新推荐文章于 2024-07-30 15:56:26 发布

阅读量836

点赞数 24

文章标签： hive

本文链接：https://blog.csdn.net/2301_79020421/article/details/135392461

版权

day02 hive DDL语句

create\drop\alter\use

1、数据库

1、创建数据库

create database ods;

2、查看所有数据库

show databases;

3、进入数据库

use ods;

2、创建表CREATE

1、创建表语句

模板：

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
 
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE 
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

案例：

create table if not exists 表名(
  id int,
  name string
);

1、查看所有表;

show tables;

2、查看具体的表

desc biaoming;

3、查看表详细信息

desc formatted biaoming;

4、表的存储格式

ORC和PARQUET都是列式存储，采用了一些压缩技术和编码方式来提高数据读取速度和存储效率

create table biaoming(
id int
)stored as orc;

5、表中每行的解析方式

# 表示每行的数据以什么为界限，后面跟上说明
row format delimited
# 表示每个字段是用逗号，结尾的，即以逗号分隔的
fields terminated by ','
# 表示每个集合之间用- 结尾
collection items terminated by '-'
# 表示每个map集合中的key以：结尾
map keys terminated by ':'

案例：使用文件给表中load添加数据时，文件中的数据格式按照每行数据解析的格式来写的，导入文件，才能被table识别并成功塞值

create table test_table(
  id int,
  name map<string,string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
map keys terminated by ':';

6、创建表的字段类型为引用数据类型

create table complex_table(
  ar array<int> comment 'array_type',
  mp map<string,string> comment 'map_type',
  struct<id:int,name:string> comment 'struct_type'
)
stored as orc;

内部表和外部表区别：内部表删表之后数据也清空；外部表删表之后，数据依然存在

1、外部表

表名前加external，后加location，表示表中的数据保存的地址（hdfs中）

create external table ext_table(
  id int,
  name string
)
location '/demo';

1、添加数据

insert into table ext_table values(1,'jack'),(2,'tom');

2、查看表数据

select * from ext_table;

而且在删除外部表后，在看hdfs中/demo文件夹下数据还在。

3、内部表转外部表

alter table inner_table set TBLPROPERTIES('EXTERNAL'='TRUE');#内部表转外部表
alter table table_name set TBLPROPERTIES('EXTERNAL'='FALSE');#外部表转内部表

案例1：

1、建外部表

create external table ext_table(
  id int,
  name string
)
row format delimited
fields terminated by ','
location '/demo';

表的数据格式为逗号分隔的文本文件，每行记录表示一个数据行，数据保存在hdfs根目录下的demo文件夹中。

2、添加数据，overwrite是覆盖之前的数据

insert overwrite table ext_table values(1,'jack'),(2,'tom');

案例2：

1、建每一行数据都是json的表

create external table ext_json_tab(
  id int,
  name string
)
#每一行数据都是用Json序列化和反序列化来协助我们解析每一行数据
row format serde 'org.apache.hadoop.hive.serde2.JsonSerDe'
stored as textfile
#数据保存在/demo/json文件夹下
location '/demo/json';

2、添加数据

insert into table ext_json_tab values(1,'jack11'),(2,'tom22'),(3,'luck');

3、在hdfs中的/demo/json文件夹下查看保存的数据，都是json格式的数据

hdfs dfs -cat /demo/json/*


结果为:
{"id":1,"name":"jack11"}
{"id":2,"name":"tom22"}
{"id":3,"name":"luck"}

2、分区表

1、后面加上partitioned by（指定字段，一般是日期），括号内不要填写分区信息。

create table par_table(
  id int,
  name string
)
partitioned by(y int,m int,d int)
stored as orc;

2、添加数据

insert into table par_table values(1,'apple',2023,11,01),(2,'jack',2023,11,01),(3,'tom',2023,11,02);

3、查看表的所有分区

show partitions 表名;

查看par_table,发现表下面创建了递归文件夹

/user/hive/a/par_table/y=2023/m=11/下面有两个文件夹分别是/d=01和/d=02;

4、删除分区

alter table par_table drop if exists partition (y=2023,m=11,d=1);

5、添加分区

alter table par_table add partition(y=2023,m=12,d=1);

3、CTAS(Create Table As Select)

后面加上as select 字段 from 表名；相当于从一个源表中拿出一些字段来建立一个表格，源表中的数据也会复制过来

create table ctas_table
as select id,name from inner_table;

查看数据，发现数据也过来了

select * from ctas_table;

4、Create Table Like

后面加like 表名；创建一个和源表一样结构的表

create table like_table like inner_table;

5、Temporary Tables

表名前加temporary，表示临时表，不存在于hdfs中，只存在于当前会话，会话结束，表自动销毁

create temporary table tmp_table(
  id int,
  name string
);

6、Bucketed Sorted Tables分桶表

1、格式：后面加上CLUSTERED BY(列名) SORTED BY(列名) INTO 32 BUCKETS

2、定义：分桶表（Bucketed Table）是一种通过对表的数据进行分桶来提高查询性能和效率的技术。根据（分桶字段的hashcode%桶数）来进行分组，查找时只需要

需要指定用于分桶的列，同时指定桶的个数；分桶后，保存形式为表的目录下有相应的文件个数（桶的个数）；分桶时还指定分区，则表下先分区，在每个区文件夹下再分桶；

3、优势：进一步缩小数据查询的范围，查找时，只需要

create table buck_table(
  id int,
  age int,
  name string
)
#根据id分为5个桶,每个桶中根据age排序
clustered by(id) sorted by(age) into 5 buckets;

添加数据

insert into table buck_table values(1,12,'jack'),(2,12,'tom'),(3,13,'luck'),(4,14,'xiasi'),(5,15,'xiaowu'),(6,16,'xiaoliu');

7、Skewed Tables倾斜表

1、定义：该表中某些列的某些值出现频率非常高，而其他列的值则相对较少，在处理时，这些列的节点会处理大量数据，而其它列的节点处理较少，效率降低。此时使用倾斜表，将频率高的数据先声明倾斜出来，处理时，它会自动将倾斜的键值对分配给一个单独的桶或分区

2、用处：解决少数key占据大量数据

3、创建：创建表时，可以使用 SKEWED BY 子句指定倾斜列，并使用 STORED AS DIRECTORIES 子句创建存储目录

create table ske_table(
  id int,
  age int
)
skewed by (age) on (20)
[STORED AS DIRECTORIES];
#age为20 的数据很多，所以对它倾斜，程序会将它放在一个reduce中处理

8、约束表

约束表（Constraint Table）是指在数据库中对表的数据进行限制和保护的一种机制。通过定义约束条件，可以确保表中的数据满足特定的规则和要求，从而保证数据的完整性和一致性。

常见的约束类型包括：

主键约束（Primary Key Constraint）：用于标识表中的唯一记录，并且不允许为空。每个表只能有一个主键约束。
唯一约束（Unique Constraint）：确保表中某列的值是唯一的，可以有多个唯一约束。
外键约束（Foreign Key Constraint）：用于建立表与表之间的关系，确保引用表中的外键与被引用表的主键匹配。
非空约束（Not Null Constraint）：确保某列的值不能为空。
检查约束（Check Constraint）：通过定义表达式或函数，对表中的数据进行验证，确保满足特定的条件。

通过使用约束表，可以有效地保护数据库中的数据完整性，防止插入、更新或删除操作违反事先定义的规则。当违反约束条件时，数据库系统会拒绝对表进行相应的操作，并返回错误信息。这样可以防止不符合要求的数据进入到数据库中，保持数据的一致性和可靠性。

需要注意的是，约束是在数据库层面进行强制执行的，而不是应用程序层面。因此，无论通过何种方式操作数据库（如命令行、图形界面工具或应用程序），约束都会被自动应用和验证。

9、视图

1、定义：

在 Hive 中，视图是一种虚拟表，它只包含从现有表中选择的数据的查询结果。与物理表不同，视图本身并不存储数据，而是在查询时动态生成。Hive 视图的语法与 SQL 中的视图非常相似

2、优点：

1、将真实表中特定的类数据提供给用户，保护数据隐私；

2、降低查询复杂度，优化查询语句（复杂的业务逻辑，使用复杂的语句提取出结果，放到视图中，供他人方便使用）

create view v1 as select id,sum(age) as ages from inner_table group by id;

如果删除或修改源表，则视图将失效。

3、查看视图信息

desc v1;

4、查看视图数据

select id,ages from v1;

结果为：
id  ages
1    18
2    20

10、物化视图Materialized Views

1、定义

物化视图（Materialized View）是一种数据库对象，它类似于视图，但与视图不同的是，物化视图实际上存储了基于查询结果的数据。物化视图在数据库中以表的形式存在，并且与表一样，具有独立的存储空间。通过预计算和缓存查询结果

2、注意

源表数据更新后，物化视图需手动重建

1、源数据仅是insert，增量更新

2、有update和delete，为完全重建rebulid。

3、创建语句

create materialized view wu_view
tblproperties(...)
as select id,count(*) as id_cnt from inner_table group by id;

4、重建语句

alter materialized view wu_view rebuild;

5、删除语句

删除源表之前，需要先删除物化视图。

11、Transactional Tables事务表

1、用处

针对流式传输数据，如使用Flume、Kafka工具将数据流式传输到hadoop集群中，通常传输到已有分区上，这就会可能造成脏读。

事务是一组相关的操作，这些操作要么全部成功执行，要么全部回滚，以保持数据的一致性和完整性。

2、格式：

末尾加上

TBLPROPERTIES('TRANSACTIONAL'='TRUE')

3、建表

--Hive中事务表的创建使用
--1、开启事务配置（可以使用set设置当前session生效 也可以配置在hive-site.xml中）
set hive.support.concurrency = true; --Hive是否支持并发
set hive.enforce.bucketing = true; --从Hive2.0开始不再需要  是否开启分桶功能
set hive.exec.dynamic.partition.mode = nonstrict; --动态分区模式  非严格
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; --
set hive.compactor.initiator.on = true; --是否在Metastore实例上运行启动线程和清理线程
set hive.compactor.worker.threads = 1; --在此metastore实例上运行多少个压缩程序工作线程。

--2、创建Hive事务表
create table trans_student(
    id int,
    name String,
    age int
) clustered by (id) into 2 buckets
stored as orc 
TBLPROPERTIES('transactional'='true');

--3、针对事务表进行insert update delete操作
insert into trans_student (id, name, age) values (1,"allen",18);
update trans_student set age = 20 where id = 1;
delete from trans_student where id =1;
select * from trans_student;

4、注意

1、文件格式必须设置为ORC；

2、默认事务配置为关闭，需配置参数开启；

3、表必须是分桶表，才可以使用事务功能；

4、表参数transactional必须为true；

5、外部表不能称为ACID表；

3、修改表ALTER

1、修改表

# 修改表名
alter table inner_table rename to inner_tab;

# 修改表的属性
alter table inner_table set TBLPROPERTIES('EXTERNAL'='TRUE');

# 修改表注释
alter table inner_table set TBLPROPERTIES('comment'='inner table');

# 修改表存储格式---序列化反序列化属性
alter table table1 set serde orc;

2、修改列

#网址
https://cwiki.apache.org/confluence/display/Hive/LanguageManual%20DDL#LanguageManualDDL-RulesforColumnNames

# 模板
ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
  [COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];

# 修改列名
alter table inner_table change age nianling int comment 'age';

3、修改分区表

#网址
https://cwiki.apache.org/confluence/display/Hive/LanguageManual DDL#LanguageManualDDL-RenamePartition

# 删除分区
alter table par_table drop if exists partition (y=2023,m=11,d=1);

# 增加分区
alter table par_table add partition(y=2023,m=12,d=1);

# 修改分区名
alter table par_table partition(y=2023,m=12,d=1) rename to partition(y=2000,m=1,d=1);

# 移动分区
# 将一个分区表的分区移动到另一张分区表下
alter table par_table exchange partition(y=2000,m=1,d=1) with table par_table1;

4、查看数据

1、查看orc压缩文件

hive --service orcfiledump -d /user/hive/a/par_table/y=2023/m=11/d=1/000000_0

5、插入数据

1、insert

into表示追加，overwrite表示覆盖

insert into|overwrite table inner_table values(1,'j'),(2,'k');

2、load

添加linux本地的文本数据到表中；如果添加hdfs中的文件数据，则去掉local

load data local inpath '/home/../pro.data' into table 表名；

3、put

直接将文本数据放到hdfs的表的文件夹下，表就会自动读取，注意分区表不能这么操作；

6、分析ANALYZE、EXPLAIN

1、explain

使用EXPLAIN关键字来获取查询的执行计划，执行计划可以帮助我们了解Hive是如何执行查询的，包括数据读取的顺序、使用的join方式、是否使用了索引等等

explain select * from buck_table;

2、analyze

统计表大小

analyze table buck_table compute statistics;

统计单个分区的大小

analyze table par_table partition (y=2023,m=11,d=1) compute statistics [for columns];

ANALYZE操作可能会消耗大量的时间和资源，因此应该在非高峰期执行

7、Hive新的推荐的客户端连接方式

1、查看10000端口有没有被占用

netstat -nl | grep 10000

2、开启hiveserver2进程

#启动hiveserver2服务
nohup /home/hive/apache-hive-3.1.3-bin/bin/hive --service hiveserver2 > /home/hadoop/apache-hive-3.1.3-bin/logs/hiveserver2.log 2>&1 &

3、使用beenline连接

# hadoop000表示服务在哪里启动就填写那个机器
./beeline -u jdbc:hive2://hadoop000:10000

4、查看hive3.1.3内置的函数

show functions;

5、查看某个函数的描述

describe function extended upper;

6、使用函数

1、upper：转为大写

select upper('abc');

结果为：
ABC

2、array：转为数组

# 别名
select `array`(1,2,3) as arr;

结果为:
arr
[1,2,3]

余淮~

关注

24
点赞
踩
17

收藏

觉得还不错? 一键收藏
1
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫