Hive, the Data Warehouse Tool: Data Types & File Encoding Formats

努力转行的任同学...

Last modified: 2022-12-14 17:28:57

Tags: hive, data warehouse, hadoop

First published: 2021-05-17 20:04:10

Article link: https://blog.csdn.net/qq_43408367/article/details/116946847

hive -help: viewing the hive command-line options

[root@linux123 ~]# hive -help
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hive-2.3.7/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
usage: hive
 -d,--define <key=value>          Variable substitution to apply to Hive
                                  commands. e.g. -d A=B or --define A=B
    --database <databasename>     Specify the database to use
 -e <quoted-query-string>         SQL from command line
 -f <filename>                    SQL from files
 -H,--help                        Print help information
    --hiveconf <property=value>   Use value for given property
    --hivevar <key=value>         Variable substitution to apply to Hive
                                  commands. e.g. --hivevar A=B
 -i <filename>                    Initialization SQL file
 -S,--silent                      Silent mode in interactive shell
 -v,--verbose                     Verbose mode (echo executed SQL to the
                                  console)

To run a SQL statement without entering the Hive interactive shell, use hive -e followed by the SQL statement:

[root@linux123 ~]# hive -e "select * from t1";
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hive-2.3.7/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/opt/lagou/servers/hive-2.3.7/conf/hive-log4j2.properties Async: true
OK
t1.team	t1.year
活塞	1990
公牛	1991

To run the SQL statements stored in a script file, use hive -f followed by the script:

# Create the file hqlfile.sql with the content: select * from t1
# Run the SQL statements in the file
[root@linux123 ~]# hive -f hqlfile.sql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hive-2.3.7/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/opt/lagou/servers/hive-2.3.7/conf/hive-log4j2.properties Async: true
OK
t1.team	t1.year
活塞	1990

Run the SQL statements in the file and append the results to a file:

[root@linux123 ~]# hive -f hqlfile.sql >> result.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hive-2.3.7/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/lagou/servers/hadoop-2.9.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

Logging initialized using configuration in file:/opt/lagou/servers/hive-2.3.7/conf/hive-log4j2.properties Async: true
OK
Time taken: 4.156 seconds, Fetched: 22 row(s)
[root@linux123 ~]#

Exit the Hive command line: exit or quit
Run shell commands / dfs commands from the Hive command line:

hive> ! ls;
hive> ! clear;
hive> dfs -ls / ;

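Hive Data Types and File Formats

Hive supports most of the primitive data types found in relational databases, and it also supports collection data types.

Primitive data types

Like Java, Hive supports integer and floating-point types of several different lengths, as well as Boolean, string, timestamp, and binary array types.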

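Category                          Type

Integers                          TINYINT -- 1-byte signed integer
                                  SMALLINT -- 2-byte signed integer
                                  INT -- 4-byte signed integer
                                  BIGINT -- 8-byte signed integer

Floating point numbers            FLOAT -- single-precision floating-point number
                                  DOUBLE -- double-precision floating-point number

Fixed point numbers               DECIMAL -- 17 bytes, arbitrary-precision number; usually declared with a user-defined precision and scale, e.g. decimal(12, 6)

String                            STRING -- variable-length string; a character set can be specified
                                  VARCHAR -- variable-length string with a length of 1 to 65535
                                  CHAR -- fixed-length string with a length of 1 to 255

Datetime                          TIMESTAMP -- timestamp (nanosecond precision)
                                  DATE -- calendar date

Boolean                           BOOLEAN -- TRUE / FALSE

Binary types                      BINARY -- byte sequence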

These type names are all reserved words in Hive. The primitive types are implemented in Java, so they largely match the corresponding Java data types:

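Hive data type	Java data type	Length	Example
TINYINT	byte	1-byte signed integer	20
SMALLINT	short	2-byte signed integer	20
INT	int	4-byte signed integer	20
BIGINT	long	8-byte signed integer	20
BOOLEAN	boolean	Boolean, true or false	true or false
FLOAT	float	single-precision floating-point number	3.14159
DOUBLE	double	double-precision floating-point number	3.14159
STRING	String	Character sequence. A character set can be specified. Single or double quotes may be used.	'good', "good"
TIMESTAMP		time type
BINARY		byte array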

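Hive's STRING type is comparable to the varchar type in a relational database. It is a variable-length string, but you cannot declare a maximum number of characters for it; in theory it can hold up to 2 GB of character data.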

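Implicit conversion of Hive data types

Hive data types can be converted implicitly, much like type conversion in Java. For example, when a query compares a value of one floating-point type with a value of another, Hive converts both to the wider of the two types, that is, FLOAT is converted to DOUBLE. Likewise, any integer type is converted to DOUBLE when needed.
Hive's primitive types are organized in a type hierarchy, and implicit conversion is allowed from a subtype to any of its ancestor types.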
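
To make the widening rule concrete, here is a small Python sketch. It is not Hive code: the list NUMERIC_CHAIN and both function names are my own, and it models only a simplified numeric chain (TINYINT up to DOUBLE), leaving out STRING, DECIMAL, and the other types that Hive's full rules cover.

```python
# Simplified numeric widening chain, narrowest to widest.
# Assumption for illustration only: Hive's real rules cover more types.
NUMERIC_CHAIN = ["TINYINT", "SMALLINT", "INT", "BIGINT", "FLOAT", "DOUBLE"]


def can_implicitly_convert(source: str, target: str) -> bool:
    """Return True if source can be widened to target within this chain."""
    return NUMERIC_CHAIN.index(source) <= NUMERIC_CHAIN.index(target)


def common_type(left: str, right: str) -> str:
    """Return the wider of two numeric types, e.g. for a comparison."""
    return max(left, right, key=NUMERIC_CHAIN.index)


if __name__ == "__main__":
    print(common_type("FLOAT", "DOUBLE"))           # DOUBLE
    print(can_implicitly_convert("INT", "DOUBLE"))  # True
    print(can_implicitly_convert("DOUBLE", "INT"))  # False
```

Conversion in the narrowing direction (for example DOUBLE to INT) is not implicit and needs an explicit cast.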

Explicit conversion of Hive data types

Hive uses the cast function for explicit type conversion; if the cast fails, it returns NULL:

hive> select cast('1111s' as int);
OK
NULL
hive> select cast('1111' as int);
OK
1111

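Hive collection data types

Hive supports collection data types, including array, map, struct, and union. As with the primitive types, these type names are reserved words.
ARRAY and MAP are similar to Array and Map in Java.

STRUCT is similar to a struct in C: it wraps a set of named fields. Complex data types can be nested to any depth.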

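Type	Description	Example
ARRAY	An ordered collection of elements of the same data type	array(1,2)
MAP	Key-value pairs. The key must be a primitive type; the value can be any type	map('a', 1, 'b', 2)
STRUCT	A collection of fields of different types, similar to a C struct	struct('1', 1, 1.0), named_struct('col1', '1', 'col2', 1, 'col3', 1.0)
UNION	Elements of different types stored in the same field across different rows	create_union(1, 'a', 63)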

hive> select array(1,2,3);
OK
[1,2,3]

-- Use [] to access array elements
hive> select arr[0] from (select array(1,2,3) arr) tmp;

hive> select map('a', 1, 'b', 2, 'c', 3);
OK
{"a":1,"b":2,"c":3}

-- Use [] to access map elements
hive> select mymap["a"] from (select map('a', 1, 'b', 2, 'c',3) as mymap) tmp;

-- Use [] to access map elements. A key that does not exist returns NULL
hive> select mymap["x"] from (select map('a', 1, 'b', 2, 'c',3) as mymap) tmp;
NULL

hive> select struct('username1', 7, 1288.68);
OK
{"col1":"username1","col2":7,"col3":1288.68}

-- Give names to the fields of a struct
hive> select named_struct("name", "username1", "id", 7, "salary", 12880.68);
OK
{"name":"username1","id":7,"salary":12880.68}

-- Use column_name.field_name to access a specific field
hive> select userinfo.id
> from (select named_struct("name", "username1", "id",7, "salary", 12880.68) userinfo) tmp;

-- union data type
hive> select create_union(0, "zhansan", 19, 8000.88) uinfo;

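Hive: text file data encoding

The data in Hive tables is stored on the file system. Hive defines a default storage format and also supports user-defined file storage formats.
By default, Hive uses a few control characters that rarely appear in field values as its delimiters.

Hive default delimiters

id name age hobby(array) score(map)
Between fields: ^A
Between collection elements: ^B
Between a key and its value: ^C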
666^Alisi^A18^Aread^Bgame^Ajava^C97^Bhadoop^C87
create table s1(
	id int,
	name string,
	age int,
	hobby array<string>,
	score map<string, int>
);
load data local inpath '/home/hadoop/data/s1.dat' into table s1;
select * from s1;
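
As a rough illustration of how the sample row above breaks down under the default delimiters, here is a Python sketch. It is not how Hive itself parses the file; the function parse_s1_line and the constant names are hypothetical, and the column layout is taken from the s1 table definition.

```python
FIELD_SEP = "\x01"       # ^A, between fields
COLLECTION_SEP = "\x02"  # ^B, between array elements / map entries
MAP_KEY_SEP = "\x03"     # ^C, between a map key and its value


def parse_s1_line(line: str) -> dict:
    """Split one text row of table s1 into typed Python values."""
    id_, name, age, hobby, score = line.rstrip("\n").split(FIELD_SEP)
    return {
        "id": int(id_),
        "name": name,
        "age": int(age),
        "hobby": hobby.split(COLLECTION_SEP),
        "score": {
            key: int(value)
            for key, value in (
                entry.split(MAP_KEY_SEP) for entry in score.split(COLLECTION_SEP)
            )
        },
    }


if __name__ == "__main__":
    # Rebuild the sample row 666^Alisi^A18^Aread^Bgame^Ajava^C97^Bhadoop^C87
    raw = FIELD_SEP.join([
        "666",
        "lisi",
        "18",
        COLLECTION_SEP.join(["read", "game"]),
        COLLECTION_SEP.join(["java" + MAP_KEY_SEP + "97",
                             "hadoop" + MAP_KEY_SEP + "87"]),
    ])
    print(parse_s1_line(raw))
```

The same three levels of splitting (field, collection element, map key/value) are what the ^A, ^B, and ^C delimiters encode.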

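Hive delimiters

Delimiter	Name	Description
\n	Newline	Separates rows. Each line is one record, and records are split on the newline character
^A	Ctrl+A	Separates fields. Written as the octal code \001 in a CREATE TABLE statement
^B	Ctrl+B	Separates the elements of an ARRAY, MAP, or STRUCT. Written as the octal code \002 in a CREATE TABLE statement
^C	Ctrl+C	Separates the key and value in a MAP. Written as the octal code \003 in a CREATE TABLE statement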

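Hive does not define a dedicated data format; the format can be specified by the user. A user-defined format needs three properties: the column delimiter (usually a space, "\t", or "\001"), the row delimiter ("\n"), and the method used to read the file data.
While loading data, Hive does not modify the data itself; it only copies or moves the data into the corresponding HDFS directory.
When Hive data is exported to the local file system, the default delimiters are the special characters ^A, ^B, and ^C, which commands such as cat do not display.
To type these special characters in vi: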
1. (Ctrl + v) + (Ctrl + a) => ^A
2. (Ctrl + v) + (Ctrl + b) => ^B
3. (Ctrl + v) + (Ctrl + c) => ^C
^A / ^B / ^C are control characters and are invisible in the output of more and cat; to display them, use cat -A file.dat
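
If cat -A is not at hand, the same caret notation can be approximated in a few lines of Python. This is only a sketch for ASCII input (bytes above 0x7F are not handled the way cat handles them), and the function name show_control_chars is my own.

```python
def show_control_chars(data: bytes) -> str:
    """Render ASCII bytes roughly the way `cat -A` would display them."""
    out = []
    for byte in data:
        if byte == 0x0A:        # newline: mark the end of the line with '$'
            out.append("$\n")
        elif byte < 0x20:       # other control characters: ^@ .. ^_
            out.append("^" + chr(byte + 0x40))
        elif byte == 0x7F:      # DEL
            out.append("^?")
        else:                   # printable ASCII is passed through
            out.append(chr(byte))
    return "".join(out)


if __name__ == "__main__":
    sample = b"666" + b"\x01" + b"lisi" + b"\x01" + b"18" + b"\n"
    print(show_control_chars(sample), end="")
```

Each control byte is shown as a caret followed by the letter 0x40 positions higher, which is why byte 0x01 appears as ^A.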

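Schema on read

In a traditional database, data that does not conform to the table definition is rejected at load time. The data is checked against the table schema as it is written to the database; this approach is called "schema on write".
1. Schema on write -> data is checked when written -> RDBMS;
Hive loads data using "schema on read": the data format is not validated at load time, and any value that turns out to be invalid when read is shown as NULL. The advantage of this approach is that loading is fast.
2. Schema on read -> data is checked when read -> Hive; benefit: fast loading; drawback: invalid data shows up as NULL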
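
The idea can be mimicked outside Hive with a short Python sketch: the raw text is stored untouched, and types are applied only when a row is read, with None standing in for NULL. The functions read_field and read_row and the three-column schema are hypothetical, and Python's int() and float() are more lenient than Hive's parsing in some edge cases, so treat this as an approximation.

```python
def read_field(text: str, hive_type: str):
    """Parse one text field at read time; return None (NULL) if it does not fit."""
    try:
        if hive_type == "int":
            return int(text)
        if hive_type == "double":
            return float(text)
        return text  # string: accepted as-is
    except ValueError:
        return None


def read_row(line: str, schema: list, sep: str = "\x01") -> list:
    """Apply the schema to one stored line; nothing was validated at load time."""
    fields = line.rstrip("\n").split(sep)
    # Missing trailing fields are read as NULL as well.
    fields += [None] * (len(schema) - len(fields))
    return [
        None if text is None else read_field(text, hive_type)
        for text, (_, hive_type) in zip(fields, schema)
    ]


if __name__ == "__main__":
    schema = [("id", "int"), ("name", "string"), ("age", "int")]
    good = "\x01".join(["1", "tom", "18"])
    bad = "\x01".join(["2", "jerry", "abc"])  # age is not a valid int
    print(read_row(good, schema))
    print(read_row(bad, schema))
```

The bad row still loads; the problem only surfaces as a None in the age column when the row is read.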