1. Data Types
1.1 Numeric types
Type | Size | Notes |
---|---|---|
TINYINT | 1 byte | Signed integer |
SMALLINT | 2 bytes | Signed integer |
INT | 4 bytes | Signed integer |
BIGINT | 8 bytes | Signed integer |
FLOAT | 4 bytes | Signed single-precision floating point |
DOUBLE | 8 bytes | Signed double-precision floating point |
DECIMAL | – | Exact, arbitrary-precision signed decimal number |
1.2 Date and time types
Type | Size | Notes |
---|---|---|
TIMESTAMP | – | Timestamp in the format yyyy-mm-dd hh:mm:ss[.f…] |
DATE | – | Date in the format YYYY-MM-DD |
INTERVAL | – | – |
1.3 String types
Type | Size | Notes |
---|---|---|
STRING | – | String |
VARCHAR | 1 – 65535 characters | Variable-length string |
CHAR | up to 255 characters | Fixed-length string |
1.4 Miscellaneous types
Type | Size | Notes |
---|---|---|
BOOLEAN | – | Boolean, TRUE/FALSE |
BINARY | – | Byte sequence |
1.5 Complex types
Type | Size | Notes |
---|---|---|
ARRAY | – | Array of same-typed elements, 0-based index; ARRAY<data_type> |
MAP | – | Key/value map; MAP<primitive_type, data_type> |
STRUCT | – | Struct; STRUCT<col_name : data_type [COMMENT col_comment], …> |
UNIONTYPE | – | Tagged union; UNIONTYPE<data_type, data_type, …> |
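DECIMAL accepts an optional precision and scale. A minimal sketch (the table name decimal_demo is made up for illustration):

```sql
-- DECIMAL(p, s): at most p total digits, s of them after the decimal point.
-- Hive defaults to DECIMAL(10,0); the maximum precision is 38.
CREATE TABLE decimal_demo (amount DECIMAL(16,2));

-- Casting rounds to the declared scale:
SELECT CAST('3.14159' AS DECIMAL(10,2));  -- 3.14
```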
1.5.1 Testing the complex types
1.5.1.1 Create the table
drop table if exists complex_table;
create table complex_table(
    c_int int,
    c_array array<string>,
    c_map map<string, string>,
    c_struct struct<name:string, age:int>
)
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
1.5.1.2 Insert data
insert into complex_table values(
1,
array("array_value1","array_value2"),
str_to_map("key1:value1,key2:value2"),
named_struct("name","Alice", "age",18)
);
1.5.1.3 Query the data
1.5.1.3.1 array
- Syntax: A[n]
Operand types: A is an array, n is an int.
Returns the nth element of array A. Indexing starts at 0; an index beyond the last element returns NULL.
hive> select c_array[0], c_array[1], c_array[2] from complex_table;
OK
array_value1 array_value2 NULL
- size() returns the number of elements in an array.
hive> select size(c_array) from complex_table;
OK
2
- array_contains() tests whether an array contains a given element:
array_contains(array, value)
It returns true or false.
hive> select array_contains(c_array,'array_value1') v1, array_contains(c_array,'array_value3') from complex_table;
OK
true false
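Beyond indexed access, explode() together with LATERAL VIEW flattens an array into one row per element. A sketch against the complex_table above (the aliases t and elem are arbitrary):

```sql
SELECT c_int, elem
FROM complex_table
LATERAL VIEW explode(c_array) t AS elem;
-- 1  array_value1
-- 1  array_value2
```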
1.5.1.3.2 map
- Element access
Syntax: M[key]
Operand types: M is a map; key must match the map's key type.
Returns the value stored under key in map M, or NULL if the key is absent.
hive> select c_map['key1'], c_map['key3'] from complex_table;
OK
value1 NULL
- Get the keys and the values of a map:
map_keys()
map_values()
hive> select map_keys(c_map), map_values(c_map) from complex_table;
OK
["key1","key2"] ["value1","value2"]
- size() returns the number of key/value pairs in a map.
hive> select size(c_map) from complex_table;
OK
2
- To test whether a map contains a given key or value, combine array_contains() with map_keys() or map_values():
array_contains(map_keys(col), key)
array_contains(map_values(col), value)
hive> select
array_contains(map_keys(c_map), 'key1'),
array_contains(map_keys(c_map), 'key3'),
array_contains(map_values(c_map), 'value1'),
array_contains(map_values(c_map), 'value3')
from complex_table;
OK
true false true false
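explode() also works on maps, producing one row per key/value pair. A sketch against the complex_table above (the aliases t, k, and v are arbitrary):

```sql
SELECT c_int, k, v
FROM complex_table
LATERAL VIEW explode(c_map) t AS k, v;
-- 1  key1  value1
-- 1  key2  value2
```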
1.5.1.3.3 struct
Syntax: S.x
Operand type: S is a struct.
Returns field x of struct S.
hive> select c_struct.name, c_struct.age from complex_table;
OK
Alice 18
1.5.2 Testing uniontype
1.5.2.1 Create the table
create table union_testnew(
foo uniontype<int, double, string, array<string>, map<string, string>>
)
row format delimited
collection items terminated by ','
map keys terminated by ':'
lines terminated by '\n'
stored as textfile;
1.5.2.2 Prepare the data
Each line starts with the 0-based tag of the union branch, followed by the value. ^D denotes a literal Ctrl-D (\004) character.
vim union_test.log
0,1
1,3.0
2,world
3,wade:tom:polly
4,k1^Dv1:k2^Dv2
1.5.2.3 Load the data
hive> load data local inpath './union_test.log' overwrite into table union_testnew;
1.5.2.4 Query the data
hive> select * from union_testnew;
OK
union_testnew.foo
{0:1}
{1:3.0}
{2:"world"}
{3:["wade","tom","polly"]}
{4:{"k1":"v1","k2":"v2"}}
Time taken: 0.225 seconds, Fetched: 5 row(s)
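A union value can also be produced inside a query with the create_union() UDF, which takes the tag followed by one candidate value per branch; a hedged sketch (the literal values here are made up):

```sql
-- Returns the branch whose 0-based position matches the tag (here 0, the int).
SELECT create_union(0, 123, CAST(3.0 AS DOUBLE), 'world') FROM union_testnew LIMIT 1;
```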
Hive official documentation: LanguageManual Types
2. Creating Tables
All of the following operations run in the mydb database.
CREATE DATABASE IF NOT EXISTS mydb;
use mydb;
2.1 CREATE TABLE
Specify the file format and the field delimiters.
CREATE TABLE IF NOT EXISTS employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
COLLECTION ITEMS TERMINATED BY '\002'
MAP KEYS TERMINATED BY '\003'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
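When the data should be copied along with the schema, CREATE TABLE AS SELECT (CTAS) is an option; a sketch (the target table name employees_copy is made up):

```sql
-- CTAS derives the new table's schema from the SELECT and copies the rows.
-- Note: a CTAS target cannot be an external or partitioned table.
CREATE TABLE employees_copy AS
SELECT name, salary FROM employees;
```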
2.2 CREATE TABLE LIKE
Creates a new table with the same schema as an existing one (the schema only, no data).
CREATE TABLE IF NOT EXISTS mydb.employee2 LIKE mydb.employees;
SHOW TABLES
- List all tables in the current database.
hive> SHOW TABLES;
OK
employee2
employees
Time taken: 0.063 seconds, Fetched: 2 row(s)
- List all tables in a specific database.
hive> SHOW TABLES IN mydb;
OK
employee2
employees
Time taken: 0.05 seconds, Fetched: 2 row(s)
- A pattern may be used; it follows regular-expression semantics rather than shell globs: '.' matches any single character and '*' matches zero or more occurrences of the preceding pattern.
hive> SHOW TABLES 'empl.*';
OK
employee2
employees
Time taken: 0.049 seconds, Fetched: 2 row(s)
DESC – show information about a table
DESC
Plain DESC lists only the table's columns.
hive> DESC employees;
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
Time taken: 0.078 seconds, Fetched: 5 row(s)
DESC EXTENDED
With the EXTENDED keyword, detailed table information is shown: the database the table belongs to, the owner, the creation time, the SerDe information, and statistics (number of files, total size, row count, and so on). The details are packed onto a single line, which is hard to read; the FORMATTED keyword prints them in a more readable layout.
hive> DESC EXTENDED employees;
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
Detailed Table Information Table(tableName:employees, dbName:mydb, owner:houzhizhen, createTime:1635924848, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:salary, type:float, comment:null), FieldSchema(name:subordinates, type:array<string>, comment:null), FieldSchema(name:deductions, type:map<string,float>, comment:null), FieldSchema(name:address, type:struct<street:string,city:string,state:string,zip:int>, comment:null)], location:file:/user/hive/warehouse/mydb.db/employees, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{mapkey.delim=, collection.delim=, serialization.format=, line.delim=\n, field.delim=}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{totalSize=0, numRows=0, rawDataSize=0, COLUMN_STATS_ACCURATE={\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"address\":\"true\",\"deductions\":\"true\",\"name\":\"true\",\"salary\":\"true\",\"subordinates\":\"true\"}}, numFiles=0, transient_lastDdlTime=1635924848, bucketing_version=2}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE, rewriteEnabled:false, catName:hive, ownerType:USER)
Time taken: 0.178 seconds, Fetched: 7 row(s)
DESC FORMATTED
With the FORMATTED keyword, the output is much easier to read.
hive> DESC FORMATTED employees;
OK
# col_name data_type comment
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
# Detailed Table Information
Database: mydb
OwnerType: USER
Owner: houzhizhen
CreateTime: Wed Nov 03 15:34:08 CST 2021
LastAccessTime: UNKNOWN
Retention: 0
Location: file:/user/hive/warehouse/mydb.db/employees
Table Type: MANAGED_TABLE
Table Parameters:
COLUMN_STATS_ACCURATE {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"address\":\"true\",\"deductions\":\"true\",\"name\":\"true\",\"salary\":\"true\",\"subordinates\":\"true\"}}
bucketing_version 2
numFiles 0
numRows 0
rawDataSize 0
totalSize 0
transient_lastDdlTime 1635924848
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
collection.delim \u0002
field.delim \u0001
line.delim \n
mapkey.delim \u0003
serialization.format \u0001
Time taken: 0.099 seconds, Fetched: 39 row(s)
Managed Tables and External Tables
Managed Tables
Tables managed by Hive, created without the EXTERNAL keyword. Dropping a managed table removes both the metadata and the underlying data files.
External Tables
The counterpart of managed tables: Hive does not manage the data. Dropping an external table removes only the metadata; the data files are left in place.
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
c_exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks';
Converting an external table to a managed table:
alter table stocks set tblproperties('EXTERNAL'='FALSE');
Deleting the data when an external table is dropped is supported only in Hive 4.0 and later (HIVE-19753):
alter table table_name set tblproperties('external.table.purge'='TRUE');
Partitioned Managed Tables
A partitioned table is created by adding the PARTITIONED BY clause.
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);
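Rows are written into a specific partition with a partition spec in the INSERT statement; a minimal sketch, assuming a hypothetical unpartitioned staging table staging_employees with matching columns:

```sql
-- Static partition insert: the partition spec names the target directly.
INSERT INTO TABLE employees PARTITION (country='US', state='CA')
SELECT name, salary, subordinates, deductions, address
FROM staging_employees;
```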
SHOW PARTITIONS
List all partitions of a table.
hive> SHOW PARTITIONS employees;
OK
country=US/state=AB
country=US/state=AC
country=US/state=CA
SHOW PARTITIONS with a partial partition spec
Restrict the output with a partial partition specification.
hive> SHOW PARTITIONS employees partition(country='US');
OK
country=US/state=AB
country=US/state=AC
country=US/state=CA
DESC PARTITION
Show the column information of a partition.
hive> DESC employees PARTITION(country='US',state='AB');
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
country string
state string
# Partition Information
# col_name data_type comment
country string
state string
Time taken: 0.162 seconds, Fetched: 12 row(s)
DESC EXTENDED PARTITION
Like DESC EXTENDED on a table, this prints the partition's detailed information.
hive> DESC EXTENDED employees PARTITION(country='US',state='AB');
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
country string
state string
# Partition Information
# col_name data_type comment
country string
state string
Detailed Partition Information Partition(values:[US, AB], dbName:mydb, tableName:employees, createTime:1635928495, lastAccessTime:0, sd:StorageDescriptor(cols:[FieldSchema(name:name, type:string, comment:null), FieldSchema(name:salary, type:float, comment:null), FieldSchema(name:subordinates, type:array<string>, comment:null), FieldSchema(name:deductions, type:map<string,float>, comment:null), FieldSchema(name:address, type:struct<street:string,city:string,state:string,zip:int>, comment:null), FieldSchema(name:country, type:string, comment:null), FieldSchema(name:state, type:string, comment:null)], location:file:/user/hive/warehouse/mydb.db/employees/country=US/state=AB, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), parameters:{transient_lastDdlTime=1635928495, totalSize=1, numRows=0, rawDataSize=0, numFiles=1}, catName:hive)
Time taken: 0.192 seconds, Fetched: 14 row(s)
DESC FORMATTED PARTITION
Output similar to DESC FORMATTED for a table.
ALTER TABLE – modify a table
PARTITION – partition-related changes
ADD PARTITION – add a partition
ALTER TABLE employees ADD IF NOT EXISTS PARTITION(country='US',state='AD') LOCATION 'file:/user/hive/warehouse/mydb.db/employees/country=US/state=AD';
SET LOCATION OF PARTITION – set a partition's location
ALTER TABLE employees PARTITION(country='US',state='AB') SET LOCATION 'file:/user/hive/warehouse/mydb.db/employees/country=US/state=AC';
DROP PARTITION – drop a partition
ALTER TABLE employees DROP IF EXISTS PARTITION(country='US',state='AB');
CHANGE COLUMNS – modify columns
hive> DESC employees;
OK
name string
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
country string
state string
# Partition Information
# col_name data_type comment
country string
state string
Time taken: 0.069 seconds, Fetched: 12 row(s)
CHANGE COLUMN NAME AND TYPE – rename a column and change its type
ALTER TABLE employees CHANGE COLUMN name full_name VARCHAR(255);
hive> DESC employees;
OK
full_name varchar(255)
salary float
subordinates array<string>
deductions map<string,float>
address struct<street:string,city:string,state:string,zip:int>
country string
state string
# Partition Information
# col_name data_type comment
country string
state string
Time taken: 0.056 seconds, Fetched: 12 row(s)
CHANGE COLUMN POSITION – move a column
AFTER places the column after the specified column:
hive> ALTER TABLE employees CHANGE COLUMN full_name name String COMMENT 'comment' AFTER salary;
OK
Time taken: 0.119 seconds
FIRST moves the column to the first position:
hive> ALTER TABLE employees CHANGE COLUMN name name String COMMENT 'comment' FIRST;
OK
Time taken: 0.119 seconds
Adding Columns – add columns
Adds the new columns after the existing columns.
ALTER TABLE employees ADD COLUMNS (
age INT,
sex boolean);
Replacing Columns – drop or replace all columns
Removes all the existing columns and replaces them with the new columns specified.
ALTER TABLE employees REPLACE COLUMNS (
`name` string COMMENT 'comment',
`salary` varchar(255),
`subordinates1` array<string>,
`deductions` map<string,float>,
`address` struct<street:string,city:string,state:string,zip:int>,
`age` int,
`sex` boolean,
married boolean);
If this fails because the new columns no longer match the existing data layout, the only remedy is to drop and recreate the table.
Alter Table Properties – modify table properties
ALTER TABLE employees SET TBLPROPERTIES(
'notes'=' The employees of the whole company');
Alter Storage Properties – modify storage properties
Suppose table employees has 3 partitions.
hive> show partitions employees;
OK
country=US/state=AC
country=US/state=AD
country=US/state=CA
Set the table's default file format:
alter table employees set fileformat textfile;
The file format can be TEXTFILE, ORC, PARQUET, or SEQUENCEFILE. For a partitioned table, only newly created partitions pick up the table's default format; existing partitions keep theirs.
Set a partition's file format:
ALTER TABLE employees PARTITION(country='US',state='AC') SET FILEFORMAT SEQUENCEFILE;
SET SERDE – set the serializer/deserializer
ALTER TABLE table_name SET SERDE 'com.example.JSONSerDe'
WITH SERDEPROPERTIES(
'prop1' = 'value1',
'prop2' = 'value2'
);
SET SERDEPROPERTIES – set properties used by the SerDe
ALTER TABLE table_name
SET SERDEPROPERTIES(
'prop1' = 'value1',
'prop2' = 'value2'
);
CLUSTERED BY, SORTED BY, BUCKETS – set bucketing
ALTER TABLE employees CLUSTERED BY (name)
SORTED BY (age)
INTO 48 BUCKETS;
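One payoff of bucketing is cheap sampling: TABLESAMPLE can scan a single bucket instead of the whole table. A sketch against the bucketed employees table above:

```sql
-- Reads roughly 1/48 of the data: only the rows hashed into bucket 1 by name.
SELECT * FROM employees TABLESAMPLE (BUCKET 1 OUT OF 48 ON name);
```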
Modify the table comment:
ALTER TABLE test_t SET TBLPROPERTIES ('comment' = '中文表-comment');
Modify a column comment (note that the column name is written twice):
ALTER TABLE test_t CHANGE c1 c1 double COMMENT '中文字段-c1-comment';
TOUCH – re-fire the hooks that run when a partition is added
The ALTER TABLE ... TOUCH statement is used to trigger these hooks.
ALTER TABLE employees TOUCH PARTITION(country='US',state='AC');
NO_DROP (disabled in Hive 3)
ALTER TABLE employees PARTITION(country='US',state='AC') ENABLE NO_DROP;
OFFLINE (disabled in Hive 3)
ALTER TABLE employees PARTITION(country='US',state='AC') ENABLE OFFLINE;
ARCHIVE and UNARCHIVE (disabled in Hive 3)
ARCHIVE packs a partition's files into a Hadoop Archive (HAR) file.
ALTER TABLE employees ARCHIVE PARTITION(country='US',state='AC');