Hive Shell Commands

Database

Create Database

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES
(property_name=property_value, ...)];

Example CREATE DATABASE statements

create database if not exists dtinone;
create database if not exists test03 location "/a/test03.db";

Drop Database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
## RESTRICT is the default and may be omitted
## CASCADE performs a cascading delete, dropping the database even if it is not empty
drop database test01 cascade;

Alter Database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;

Altering a database can only change its properties and its owner.

Use Database

Use (switch to) a database.

USE database_name;
USE DEFAULT;

Table

Create Table

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)
  [(col_name data_type [column_constraint_specification] [COMMENT col_comment], ... [constraint_specification])]
  [COMMENT table_comment]
  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
  [SKEWED BY (col_name, col_name, ...)                  -- (Note: Available in Hive 0.10.0 and later)]
     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)
     [STORED AS DIRECTORIES]
  [
   [ROW FORMAT row_format] 
   [STORED AS file_format]
     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)
  ]
  [LOCATION hdfs_path]
  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)
  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
 
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  LIKE existing_table_or_view_name
  [LOCATION hdfs_path];
 
data_type
  : primitive_type
  | array_type
  | map_type
  | struct_type
  | union_type  -- (Note: Available in Hive 0.7.0 and later)
 
primitive_type
  : TINYINT
  | SMALLINT
  | INT
  | BIGINT
  | BOOLEAN
  | FLOAT
  | DOUBLE
  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)
  | STRING
  | BINARY      -- (Note: Available in Hive 0.8.0 and later)
  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)
  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)
  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)
  | DATE        -- (Note: Available in Hive 0.12.0 and later)
  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)
  | CHAR        -- (Note: Available in Hive 0.13.0 and later)
 
array_type
  : ARRAY < data_type >
 
map_type
  : MAP < primitive_type, data_type >
 
struct_type
  : STRUCT < col_name : data_type [COMMENT col_comment], ...>
 
union_type
   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)
 
row_format
  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]
        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)
  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]
 
file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | JSONFILE    -- (Note: Available in Hive 4.0.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
 
column_constraint_specification:
  : [ PRIMARY KEY|UNIQUE|NOT NULL|DEFAULT [default_value]|CHECK  [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]
 
default_value:
  : [ LITERAL|CURRENT_USER()|CURRENT_DATE()|CURRENT_TIMESTAMP()|NULL ] 
 
constraint_specification:
  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE ]
    [, CONSTRAINT constraint_name UNIQUE (col_name, ...) DISABLE NOVALIDATE RELY/NORELY ]
    [, CONSTRAINT constraint_name CHECK [check_expression] ENABLE|DISABLE NOVALIDATE RELY/NORELY ]

Example CREATE TABLE statement
create table dtinone.t_student(
sid int,
name string,
age int,
address string
)row format delimited fields terminated by ',';

After the table is created, a t_student directory is generated on HDFS under /user/hive/warehouse/dtinone.db/. Once a local file is uploaded into the t_student directory, the data in the t_student table can be queried.

Contents of the local file student.txt:

1,zhangsan,20,chengdu
2,lisi,18,beijing
3,wangwu,22,shanghai
4,zhao,21,guangzhou

Query result:

0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student;
OK
+----------------+-----------------+----------------+--------------------+--+
| t_student.sid  | t_student.name  | t_student.age  | t_student.address  |
+----------------+-----------------+----------------+--------------------+--+
| 1              | zhangsan        | 20             | chengdu            |
| 2              | lisi            | 18             | beijing            |
| 3              | wangwu          | 22             | shanghai           |
| 4              | zhao            | 21             | guangzhou          |
+----------------+-----------------+----------------+--------------------+--+
4 rows selected (0.124 seconds)
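Hive applies the schema at read time: each line of the file is split on the declared field delimiter, and each piece is cast to its column's type. A minimal Python sketch of that behavior for the table above (an illustration only, not Hive's actual LazySimpleSerDe; the helper name is made up):

```python
# Illustrative sketch of Hive's schema-on-read for a text table declared with
# "row format delimited fields terminated by ','" (not Hive's real SerDe code).

SCHEMA = [("sid", int), ("name", str), ("age", int), ("address", str)]

def parse_line(line, field_delim=","):
    """Split one line of student.txt and cast each field to its column type."""
    row = {}
    for (col, typ), raw in zip(SCHEMA, line.rstrip("\n").split(field_delim)):
        try:
            row[col] = typ(raw)
        except ValueError:
            row[col] = None  # Hive yields NULL when a value cannot be cast
    return row

print(parse_line("1,zhangsan,20,chengdu"))
# {'sid': 1, 'name': 'zhangsan', 'age': 20, 'address': 'chengdu'}
```

Because the cast happens at read time, a malformed field simply becomes NULL instead of failing the load, which matches Hive's permissive behavior.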

Hive Data Types

Primitive Types

Types are associated with the columns in the tables. The following primitive types are supported:

  • Integers
    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type
    • BOOLEAN—TRUE/FALSE
  • Floating point numbers
    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers
    • DECIMAL—a fixed point value of user defined scale and precision
  • String types
    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP — A date and time without a timezone (“LocalDateTime” semantics)
    • TIMESTAMP WITH LOCAL TIME ZONE — A point in time measured down to nanoseconds (“Instant” semantics)
    • DATE—a date
  • Binary types
    • BINARY—a sequence of bytes

Complex Types

Complex types can be built up from primitive types and other composite types:

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  • Maps (key-value tuples): The elements are accessed using [‘element name’] notation. For example in a map M comprising of a mapping from ‘group’ -> gid the gid value can be accessed using M[‘group’]
  • Arrays (indexable lists): The elements in the array have to be of the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements [‘a’, ‘b’, ‘c’], A[1] returns ‘b’.

Array Type Demo

Contents of student_array.txt:

1,zhangsan,20,beijing,10 50 80
2,lisi,22,chengdu,40 55 80
3,wangwu,21,shanghai,70 90 80
4,zhaoliu,20,hangzhou,30 52 80

create table if not exists dtinone.t_student_arr(
sid int,
name string,
age int,
address string,
scores array<int>
)row format delimited fields terminated by ',' 
collection items terminated by ' ';

Query result (the data shown differs slightly, but the structure is the same):

0: jdbc:hive2://hadoop101:10000> select * from dtinone.t_student_arr;
OK
+----------------+-----------------+----------------+--------------------+-------------------+--+
| t_student.sid  | t_student.name  | t_student.age  | t_student.address  | t_student.scores  |
+----------------+-----------------+----------------+--------------------+-------------------+--+
| 1              | zhangsan        | 20             | chengdu            | [70,75,90]        |
| 2              | lisi            | 22             | beijing            | [88,96,74]        |
| 3              | wangwu          | 21             | shanghai           | [77,65,87]        |
| 4              | zhaoliu         | 20             | guangzhou          | [90,96,94]        |
+----------------+-----------------+----------------+--------------------+-------------------+--+
4 rows selected (0.124 seconds)

Map Type Demo

Contents of student_map.txt:

1,zhangsan,20,beijing,father:laozhang mother:baby
2,lisi,22,chengdu,father:laoli mother:lily
3,wangwu,21,shanghai,father:laozhang mother:baby
4,zhaoliu,20,hangzhou,father:laozhang mother:baby

create table if not exists dtinone.t_student_map(
sid int,
name string,
age int,
address string,
family map<string,string>
)row format delimited fields terminated by ','
collection items terminated by ' '
map keys terminated by ':';

Query result (the data shown differs slightly, but the structure is the same):

0: jdbc:hive2://hadoop101:10000> select * from test.t_student_map;
OK
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| t_student_map.sid  | t_student_map.name  | t_student_map.age  | t_student_map.address  |       t_student_map.family       |
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
| 1                  | zhangsan            | 20                 | chengdu                | {"father":"aaa","mother":"bbb"}  |
| 2                  | lisi                | 18                 | beijing                | {"father":"ccc","mother":"ddd"}  |
| 3                  | wangwu              | 22                 | shanghai               | {"father":"eee","mother":"fff"}  |
| 4                  | zhao                | 21                 | guangzhou              | {"father":"aaa","mother":"bbb"}  |
+--------------------+---------------------+--------------------+------------------------+----------------------------------+--+
4 rows selected (0.135 seconds)
## a map element is accessed as map_column['key'], e.g. family['father']
select * from test.t_student_map where family['father'] = 'aaa';

Struct Type Demo

Contents of student_struct.txt:

1,zhangsan,20,sichuan chengdu gaoxin
2,lisi,22,beijing beijing haidian
3,wangwu,21,sichuan chengdu jinjiang
4,zhaoliu,20,sichuan chengdu jinjiang

create table if not exists dtinone.t_student_struct(
sid int,
name string,
age int,
address struct<province:string,city:string,area:string>
)row format delimited fields terminated by ',' 
collection items terminated by ' ';

Using Multiple Complex Types Together

The logical structure of one record:

{
sid:'1',
name:'zhangsan',
age:20,
scores:[88,99,100],
family:{
    father:'laozhang',
    mother:'lily'
    },
address:{
    province:'四川',
    city:'成都',
    area:'高新区'
    }
}

1,zhangsan,20,88 99 100,father:laozhang mother:lily,四川 成都 高新区
2,zhaosi,21,78 90 100,father:laozhao mother:lily,四川 成都 高新区

create table if not exists dtinone.t_student_mix(
sid int,
name string,
age int,
scores array<int>,
family map<string,string>,
address struct<province:string,city:string,area:string>
)
row format delimited fields terminated by ',' 
collection items terminated by ' ' 
map keys terminated by ':';
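The three delimiters nest: the field delimiter splits columns; the collection delimiter splits array elements, map entries, and struct fields; and the map-key delimiter splits each map entry into key and value. A Python sketch of how one t_student_mix line decomposes (illustrative only; the function name is made up, and the sample line uses ASCII place names like the student_struct data):

```python
# Decompose one t_student_mix line using Hive's three delimiter levels:
# fields ',', collection items ' ', map keys ':'.

def parse_mix(line, fd=",", cd=" ", kd=":"):
    sid, name, age, scores, family, address = line.rstrip("\n").split(fd)
    return {
        "sid": int(sid),
        "name": name,
        "age": int(age),
        # array<int>: the collection delimiter separates elements
        "scores": [int(s) for s in scores.split(cd)],
        # map<string,string>: the collection delimiter separates entries,
        # the map-key delimiter separates each key from its value
        "family": dict(e.split(kd, 1) for e in family.split(cd)),
        # struct: the collection delimiter separates fields, in declared order
        "address": dict(zip(["province", "city", "area"], address.split(cd))),
    }

row = parse_mix("1,zhangsan,20,88 99 100,father:laozhang mother:lily,sichuan chengdu gaoxin")
print(row["scores"][1], row["family"]["father"], row["address"]["city"])
# 99 laozhang chengdu
```

Note that map entries and struct fields share the same collection delimiter, which is why a single line can carry all three complex types without ambiguity: the column position decides how each piece is interpreted.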

Internal and External Tables

Internal Tables (Managed Tables)

By default, Hive creates internal (managed) tables, whose files, metadata, and statistics are managed by internal Hive processes.

An internal table is stored under the path given by the hive.metastore.warehouse.dir property, by default in a folder path like /user/hive/warehouse/databasename.db/tablename/.

The LOCATION clause at table creation time can override this default location.

If an internal table or partition is dropped, the data and the metadata associated with that table or partition are both deleted.

Use internal tables when Hive should manage the lifecycle of the table, or when generating temporary tables.

create table dtinone.t1(
id int,
name string,
age int
)row format delimited fields terminated by ','
location '/xxx';
## /xxx overrides the default /user/hive/warehouse/dtinone.db/t1

External Tables

An external table describes the metadata/schema of files stored externally.
An external table can access data stored at its source location.
If the structure or partitioning of an external table changes, the metadata can be refreshed with an MSCK REPAIR TABLE table_name statement.
Dropping an external table deletes only the metadata; the data is not deleted.
Source (raw) data is usually stored in external tables.

Create an external table

create external table dtinone.t2(
id int,
name string,
age int
)row format delimited fields terminated by ',';

Dropping the external table removes only the metadata; the data remains.

Converting Between Internal and External Tables

View a table's description:

desc table_name;           ## show the table's column information
desc extended table_name;  ## show the table's extended information
desc formatted table_name; ## show the extended information, formatted

Convert an internal table to an external table:

ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='TRUE');

Convert an external table to an internal table:

ALTER TABLE dtinone.t3 SET TBLPROPERTIES ('EXTERNAL'='FALSE');

Alter Table Properties

ALTER TABLE table_name SET TBLPROPERTIES table_properties;

table_properties:
: (property_name = property_value, property_name = property_value, ... )

Copy Table Structure and Data

create table student as select * from dtinone.t_student;

Copy Table Structure Only

create table student01 like dtinone.t_student;

Partitioned Tables

Hive processes offline (historical) data, typically organized in units of a day or a week. If every day's data were kept in a single table, processing a single day would require loading all of the data, which is clearly unworkable. Partitioned tables solve this problem.

In Hive, a database is a directory and a table is also a directory; a partition is a subdirectory of its table's directory.

A partition directory is named in the format partition_field=value.
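Because a partition is just a field=value subdirectory, the warehouse layout amounts to simple path construction, and multi-level partitions simply nest such subdirectories. A small Python illustration (paths follow the default-warehouse convention used in this section; the helper name is made up):

```python
# A partition is a subdirectory of the table directory named field=value;
# multi-level partitions nest one such subdirectory per partition column.

WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, spec, db=None):
    """Build the HDFS directory for a partition spec like {'access_time': '2020-12-10'}."""
    base = f"{WAREHOUSE}/{db}.db/{table}" if db else f"{WAREHOUSE}/{table}"
    return base + "".join(f"/{k}={v}" for k, v in spec.items())

print(partition_path("t_access", {"access_time": "2020-12-10"}))
# /user/hive/warehouse/t_access/access_time=2020-12-10

print(partition_path("t_access01", {"access_time": "2020-12-10", "hours": 10}))
# /user/hive/warehouse/t_access01/access_time=2020-12-10/hours=10
```

Querying with a predicate on a partition column lets Hive read only the matching directories instead of scanning the whole table.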

  1. Create the access_20201210 data file

    1 1001 www.baidu.com 2020-12-10
    2 1001 www.126.com 2020-12-10
    3 1002 www.baidu.com 2020-12-10
    4 1003 www.baidu.com 2020-12-10
    
  2. Create the partitioned table

    create table t_access(
    id int,
    uid int,
    website string,
    time string
    )
    partitioned by (access_time string)
    row format delimited fields terminated by ' ';
    
  3. Create the partition directory

    hdfs dfs -mkdir /user/hive/warehouse/t_access/access_time=2020-12-10
    
  4. Upload the data into the partition directory

    hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access/access_time=2020-12-10
    
  5. Query the data

    select * from t_access;
    

Note: no rows are returned at this point, because the partition directory was created manually with HDFS commands. Hive's metastore has no information about this partition, so the data cannot be queried.

  • Fix 1: use the metadata repair command MSCK (metastore check)

    MSCK REPAIR table t_access;
    
  • Fix 2: add the partition with Hive's ALTER TABLE ... ADD PARTITION command (when adding multiple partitions, separate them with spaces; the syntax shown on the official site is inaccurate)

    ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec
    [LOCATION 'location'][ PARTITION partition_spec [LOCATION 'location'], ...];
    
    partition_spec:
    : (partition_column = partition_col_value, partition_column = partition_col_value, ...)
    
    ALTER TABLE t_access ADD IF NOT EXISTS PARTITION (access_time='2020-12-12');
    
    ALTER TABLE page_view ADD PARTITION (dt='2008-08-08', country='us') location '/path/to/us/part080808'
                              PARTITION (dt='2008-08-09', country='us') location '/path/to/us/part080809';
    

Multi-Level Partitions

Two or three levels are generally sufficient.

  • Create a multi-level partitioned table

    create table t_access01(
    id int,
    uid int,
    website string,
    time string
    )
    partitioned by (access_time string,hours int) 
    row format delimited fields terminated by ' ';
    
  • Creating multi-level partitions requires creating the corresponding multi-level directory

    hdfs dfs -mkdir /user/hive/warehouse/t_access01/access_time=2020-12-10/hours=10
    
  • Upload the data file into the multi-level directory

    hdfs dfs -put /home/hadoop/data/access_20201210 /user/hive/warehouse/t_access01/access_time=2020-12-10/hours=10
    

Drop Partitions

ALTER TABLE table_name DROP [IF EXISTS] PARTITION partition_spec[, PARTITION partition_spec, ...]
  [IGNORE PROTECTION] [PURGE]; 
  
partition_spec:
: (partition_column = partition_col_value, partition_column = partition_col_value, ...)

Note: when adding partitions, separate multiple PARTITION clauses with spaces; when dropping partitions, separate them with commas.

alter table t_access drop if exists partition (access_time='2020-12-10'),partition (access_time='2020-12-11');

Hive Table Storage Formats

File Formats

Hive supports several file formats:

  • Text File
  • SequenceFile
  • RCFile
  • Avro Files
  • ORC Files
  • Parquet
  • Custom INPUTFORMAT and OUTPUTFORMAT

file_format:
  : SEQUENCEFILE
  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)
  | ORC         -- (Note: Available in Hive 0.11.0 and later)
  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)
  | AVRO        -- (Note: Available in Hive 0.14.0 and later)
  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

Parquet

https://zhuanlan.zhihu.com/p/141908285

A brief introduction:

  • Parquet is a columnar storage format that supports nested structures
  • It is very well suited to OLAP scenarios, with column-oriented storage and scans

ORC

Besides Parquet, the other common columnar storage format is ORC (Optimized RC File). Before ORC, Apache Hive already had a columnar format called RCFile (Record Columnar File); ORC is an improvement on RCFile, optimized mainly in compression encoding and query performance. Both ORC and RCFile thus originated in Hive and are mainly used to speed up Hive queries and reduce Hadoop storage space.

ORC has the following advantages:

  1. ORC is a columnar format, supports multiple file compression codecs, and achieves very high compression ratios.
  2. ORC files are splittable. Using ORC as a table's storage format in Hive therefore not only saves HDFS storage, but also reduces a query's input data volume and hence the number of MapTasks needed.
  3. ORC provides several kinds of indexes, such as the row group index and the bloom filter index.
  4. ORC supports complex data structures (such as Map).

The differences between Parquet and ORC can be summarized as follows:

  • Nested structure support: Parquet supports nested structures very well, whereas ORC's support is poor: nesting is awkward to express and costs significantly in both performance and space.
  • Updates and ACID: ORC supports UPDATE operations and ACID; Parquet does not.
  • Compression and query performance: in compression ratio and query performance the two are roughly comparable, with ORC perhaps slightly ahead.
  • Query engine support: Parquet may have the edge here, being supported by Hive, Impala, Presto, and other engines, while ORC is closely tied to Hive and poorly supported by Impala. Impala did not support ORC until CDH 6.1.x (Impala 3.x), and then only as an experimental feature.