Hive Key Concepts Explained

倒吃甘蔗

Categories: hive, big data. Tags: hive, introductory key concepts, hadoop, hdfs

Article link: https://blog.csdn.net/qq_38190111/article/details/94719342

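Common data types and file formats supported by Hive

Hive is a data warehouse built on top of Hadoop. It translates SQL queries into a series of MapReduce jobs that run on the Hadoop cluster. It is a higher-level abstraction over MapReduce, so you do not have to write the MapReduce code yourself. Hive organizes data into tables, which gives structure to the data on HDFS. The metadata, that is, the table schemas, is stored in a database called the metastore.

Hive syntax is case-insensitive.

All of the data type names below are reserved words in Hive.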

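Integer types
- tinyint: 1 byte, signed, -128 to 127, i.e. -2⁷ to 2⁷-1
- smallint: 2 bytes, -32768 to 32767, i.e. -2¹⁵ to 2¹⁵-1
- int: 4 bytes, -2147483648 to 2147483647, i.e. -2³¹ to 2³¹-1
- bigint: 8 bytes, -2⁶³ to 2⁶³-1
Boolean type
- boolean: TRUE/FALSE
Floating-point types
- float: single-precision floating-point number, 4 bytes
- double: double-precision floating-point number, 8 bytes
These behave the same as in C.
Fixed-point type
- decimal: fixed-point value with user-defined precision and scale
String types
- string: a sequence of characters. A character set can be specified. Literals may use single or double quotes
- varchar: a character sequence with a declared maximum length
- char: a character sequence with a declared fixed length
NULL: the null value
Date and time types:
- timestamp (v0.8+): can be given as an integer, a floating-point number, or a string. An integer is the number of seconds since the Unix epoch (midnight, January 1, 1970). A floating-point number is the number of seconds since the epoch with nanosecond precision (9 digits after the decimal point). A string must follow the JDBC time string convention YYYY-MM-DD hh:mm:ss.fffffffff, for example '2012-02-03 12:34:56.123456789' (the JDBC-compatible java.sql.Timestamp format)
- date: a calendar date
binary (v0.8+): a byte array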
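
The integer ranges follow directly from the byte widths, because Hive's integer types are signed two's-complement values. Here is a minimal Python sketch that derives each range from its size. The helper name signed_range and the dictionary HIVE_INT_BYTES are my own, not part of Hive.

```python
def signed_range(num_bytes):
    """Return (min, max) for a signed two's-complement integer of num_bytes bytes."""
    bits = 8 * num_bytes
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


# Byte widths of Hive's integer types, as listed in this article.
HIVE_INT_BYTES = {"tinyint": 1, "smallint": 2, "int": 4, "bigint": 8}

for name, size in HIVE_INT_BYTES.items():
    low, high = signed_range(size)
    print(f"{name}: {low} .. {high}")
```

Note that tinyint tops out at 127, not 255, since one bit is used for the sign.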

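The types are organized by rank (range) into the following hierarchy. Hive allows a type to be implicitly converted to any of its ancestors: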

Type
- Primitive Type
  - Number
    - DOUBLE
      - FLOAT
        - BIGINT
          - INT
            - SMALLINT
              - TINYINT
      - STRING
  - BOOLEAN
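
One way to read this hierarchy is that a value can be implicitly converted to any type that is its ancestor in the tree. The Python sketch below encodes the tree as child-to-parent links and checks that rule. The PARENT table and both helper names are illustrative only. It models just the hierarchy listed above, not Hive's full conversion rules.

```python
# Child -> parent links for the type hierarchy listed in this article.
PARENT = {
    "TINYINT": "SMALLINT",
    "SMALLINT": "INT",
    "INT": "BIGINT",
    "BIGINT": "FLOAT",
    "FLOAT": "DOUBLE",
    "STRING": "DOUBLE",
    "DOUBLE": "Number",
    "Number": "Primitive Type",
    "BOOLEAN": "Primitive Type",
    "Primitive Type": "Type",
}


def ancestors(type_name):
    """Return the ancestors of type_name, nearest first."""
    chain = []
    current = PARENT.get(type_name)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def can_convert_implicitly(source, target):
    """True when target is source itself or one of its ancestors in the tree."""
    return source == target or target in ancestors(source)


print(can_convert_implicitly("TINYINT", "DOUBLE"))
print(can_convert_implicitly("DOUBLE", "INT"))
```

Under this model an int column can be compared with a double without a cast, while going the other way (double to int) needs an explicit cast.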

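Operations written as hadoop fs … can be run inside Hive as dfs … (Hive's dfs command performs Hadoop file system operations, but you cannot edit files with vi from inside Hive, and the author could not upload files to HDFS this way either.)

Complex data types

Structs: a group of named fields whose types may differ. For example, if a column is declared as info struct<name:string,age:int>, its elements are accessed as info.name and info.age.
- Create: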
```
hive> create table student(id int,info struct<name:string,age:int>);
```
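- Insert:
I have not found a way to insert a struct built directly from literal values. The only approach that worked is a select statement that wraps data from another table with named_struct.

Hive supports insert into … values … (since 0.14).
Unfortunately, Hive does not yet support UDFs in the VALUES clause. A dummy table works around this. It is a little ugly, but it works.

Reason: Hive traditionally has no row-level insert, update, or delete operations.
The INSERT statement lets you insert data into a target table from a query.
Update and delete can be emulated with INSERT OVERWRITE and DROP.

Hive can insert data into a table with insert, but each insert launches a job, so it is slow. For that reason load is generally used instead, with the syntax below. (The delimiters must be declared when the table is created. They can be customized.)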
```
hive> create table student_test(id INT, info struct<name:STRING, age:INT>)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','  -- delimiter between fields
> COLLECTION ITEMS TERMINATED BY ':';  -- delimiter between the items inside one field
# file contents
$ cat test5.txt
1,zhou:30
2,yan:30
3,chen:20
4,li:80
hive> LOAD DATA LOCAL INPATH '/home/work/data/test5.txt' INTO TABLE student_test;
Copying data from file:/home/work/data/test5.txt
Copying file: file:/home/work/data/test5.txt
Loading data to table default.student_test
OK
Time taken: 0.35 seconds
```
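
To make the two delimiters concrete, here is a rough Python sketch of how one line of test5.txt maps onto the columns of student_test. It only illustrates the delimiter rules and is not Hive's actual parsing code. parse_student_row is a made-up helper. Unlike Hive, which returns NULL for values it cannot parse, this sketch simply raises an error on malformed input.

```python
def parse_student_row(line, field_delim=",", item_delim=":"):
    """Split one text line into (id, info); info mimics struct<name:string,age:int>."""
    # FIELDS TERMINATED BY ',' separates the id column from the info column.
    id_text, info_text = line.rstrip("\n").split(field_delim)
    # COLLECTION ITEMS TERMINATED BY ':' separates the struct's fields.
    name, age_text = info_text.split(item_delim)
    return int(id_text), {"name": name, "age": int(age_text)}


for row in ["1,zhou:30", "2,yan:30", "3,chen:20", "4,li:80"]:
    print(parse_student_row(row))
```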
- The following inserts table data manually (the data of the people table is inserted into student)
The student table:
```
hive> describe student;
OK
id      int
info    struct<name:string,age:int>

hive> select * from student;
OK
-- (no rows yet)
```
The people table:
```
hive> describe people;
    OK
    name  	string              
    age   	int            
hive> select * from people;
    OK
    weimeng	21
    sdf	22
```
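Combine the name and age columns of people into a struct and insert it into student. The 1 is an arbitrary constant unrelated to the people table:
- Insert with an insert statement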
```
hive> insert into student select 1,named_struct('name',name,'age',age) from people;
```
Result:
```
hive> select * from student;
OK
1	{"name":"weimeng","age":21}
1	{"name":"sdf","age":22}
```
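- Multiple operations from one FROM
  
  A single FROM clause can be followed by several INSERT ... SELECT clauses. Normally each SELECT has its own FROM, but sharing one FROM lets Hive read the source data once for all the inserts, which saves data access cost.
- Update with overwrite (this deletes the existing data)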
```
hive> insert overwrite table student select 1,named_struct('name',name,'age',age) from people; 
```
- Delete with drop (this removes the whole table, not individual rows)
  drop table student
- Query:
```
hive> select info.age from student; 
```
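Maps: an unordered set of key-value pairs. Keys must be of a primitive type and values can be of any type. Within one map, all keys must share one type and all values must share one type.
- Create: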
```
hive> create table employee(id int,pref map<string,string>);
```
- Insert:
  
  Use the str_to_map function to insert a map. This only works for map<string,string>.
```
insert into employee select 1,str_to_map('weimeng:yes,xxx:no');
```
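  For maps with non-string keys or values, such as map<string,int>, either load the data with load (with the delimiters defined when the table is created) or insert it with a select clause.
An example from Stack Overflow: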
```
CREATE TABLE complex1(c1 array<string>, c2 map<int,string>)
```
(The data loading step is omitted.)
```
select * from complex1;

OK

["Mohammad","Tariq"] {7:"Bond"}

Time taken: 0.062 seconds
```
A second table, complex2:
```
CREATE TABLE complex2(c1 map<int,string>);
```
Copy the map column of complex1 into complex2:
```
insert into table complex2 select c2 from complex1;

select * from complex2;
OK
{7:"Bond"}
Time taken: 0.062 seconds
```
This completes a manual insert into a map column that contains non-string types such as int.
- Query:
```
hive> select pref['person'] from employee where pref['person'] is not null; 
```
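
The str_to_map call used in the insert above splits its input twice: first into pairs, then each pair into a key and a value. Its documented default delimiters are ',' between pairs and ':' between key and value. The Python function below is only a sketch that mimics that behaviour, not Hive's implementation. It also shows why the result is always a string-to-string map.

```python
def str_to_map(text, pair_delim=",", kv_delim=":"):
    """Mimic Hive's str_to_map: split text into pairs, then each pair into key and value."""
    result = {}
    for pair in text.split(pair_delim):
        key, sep, value = pair.partition(kv_delim)
        # A pair without the key/value delimiter gets a null (None) value.
        result[key] = value if sep else None
    return result


print(str_to_map("weimeng:yes,xxx:no"))
print(str_to_map("a:1"))  # the value stays the string "1", it is not turned into an int
```

Because both keys and values come out as strings, a column such as map<string,int> needs a different route, such as the load or select approaches described above.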

Arrays: an ordered collection of elements, all of the same type.

Create:

hive> create table class(classname string,students array<string>);

Insert:

Use the array function:

hive> insert into class select 'computer163',array('weimeng','xxx');

Query:
```
hive> select students[0] from class; 
```

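A complex data type must be declared with angle brackets that specify the types of its elements. The example below defines three columns, one for each complex data type.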

To summarize

Create a Hive table that contains collection data types

hive> create table test_set(
> id int,
> name string,
> hobby array<string>,  -- array elements are of type string
> friend map<string,string>,  -- map keys and values are both of type string
> mark struct<math:int,english:int>  -- struct fields are of type int
> )
> row format delimited fields terminated by ','  -- fields are separated by ','
> collection items terminated by '_'  -- elements of a collection are separated by '_'
> map keys terminated by ':'  -- a map key and its value are separated by ':'
> lines terminated by '\n'   -- rows are separated by '\n'
> ;
OK
Time taken: 0.327 seconds

Load data into the test_set table

For large amounts of data, the usual approach is a bulk import from a file. Here the data in the text file wm.txt is loaded into the table:

1,xiaoming,basketball_game,xiaohong:yes_xiaohua:no,99_75
1,xiaohong,watch_study,xiaoming:no_xiaohua:not,95_95

Upload /wm.txt to the / directory of HDFS and check that it is there:
$ hadoop fs -put /wm.txt /
$ hadoop fs -ls /

Then load it with the following statement:

load data inpath '/wm.txt' overwrite into table test_set;

To insert just a few rows, you can use an insert statement instead.

The following statement wraps the three collection values with array, str_to_map, and named_struct respectively:

INSERT INTO test_set SELECT 2,'xiaohua',array('basketball','read'),str_to_map('xiaoming:no,xiaohong:no'),named_struct('math',90,'english',90);

To see all the data in the table, query it directly:

hive> select * from test_set;
OK
NULL	basketball_game	["xiaohong:yes","xiaohua:no"]	{"99":null,"75":null}	NULL
NULL	NULL	NULL	NULL	NULL
1	xiaohong	["watch","study"]	{"xiaoming":"no","xiaohua":"not"}	{"math":95,"english":95}
Time taken: 0.102 seconds, Fetched: 3 row(s)
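
The first two rows of that output look odd. In the first row every column appears shifted by one, which suggests that the first line of the file actually loaded was missing a comma between its first two fields. The all-NULL row is what a blank line in the file would produce. The Python sketch below imitates the delimiter rules of the table definition to show both effects. It is an approximation of the behaviour, not Hive's parsing code. The helper names are made up, and the malformed input line is a hypothetical reconstruction, since the article does not show the exact file contents.

```python
def to_int(text):
    """Return int(text), or None (standing in for NULL) if text is missing or not an integer."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


def parse_test_set_row(line, field=",", item="_", key=":"):
    """Parse one line into (id, name, hobby, friend, mark) using the table's three delimiters."""
    parts = line.rstrip("\n").split(field)
    parts += [None] * (5 - len(parts))  # missing trailing fields become NULL
    id_text, name, hobby_text, friend_text, mark_text = parts[:5]

    hobby = hobby_text.split(item) if hobby_text is not None else None

    friend = None
    if friend_text is not None:
        friend = {}
        for entry in friend_text.split(item):
            k, sep, v = entry.partition(key)
            friend[k] = v if sep else None  # no ':' in the entry -> NULL value

    mark = None
    if mark_text is not None:
        scores = mark_text.split(item) + [None, None]
        mark = {"math": to_int(scores[0]), "english": to_int(scores[1])}

    return to_int(id_text), name, hobby, friend, mark


good = "1,xiaohong,watch_study,xiaoming:no_xiaohua:not,95_95"
bad = "1.xiaoming,basketball_game,xiaohong:yes_xiaohua:no,99_75"  # hypothetical: '.' instead of ','
for line in (good, bad, ""):
    print(parse_test_set_row(line))
```

With the hypothetical malformed line, the name column receives basketball_game, the array receives the map text, the map receives 99 and 75 as keys with null values, and the struct is left NULL, which matches the shape of the first output row.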

Another common way to query collection types is to select individual elements:

select id,name,hobby[0],      -- the first hobby
friend['xiaohong'],       -- the value whose map key is xiaohong
mark.math       -- the math field of the struct
from test_set where name = 'xiaoming';  -- output omitted

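Hive file formats
- textfile: plain text, the default.
- sequencefile: binary sequence file.
- rcfile: columnar storage format, supported since Hive 0.6.
- orc: columnar storage format, supported since Hive 0.11.
These formats differ in physical storage layout and compression algorithms.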
```
-- The last clause of CREATE TABLE sets the storage file format:
create table xxxx  
stored as textfile;  -- or sequencefile, rcfile, orc
```

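All Hive data is stored on Hadoop's HDFS. Hive has no dedicated storage format of its own (it supports Text, SequenceFile, ParquetFile, RCFILE, and others).

The storage structure mainly consists of databases, files, tables, views, and indexes.

DDL statements: creating and dropping tables, importing and exporting data

DDL (Data Definition Language)
HDFS: the distributed file system

HDFS uses a master/slave architecture. An HDFS cluster consists of one NameNode and a number of DataNodes. The NameNode is the master server: it manages the file system namespace and client access to files. The DataNodes manage the stored data.

Key concepts: database, table, partition, bucket

A database appears in HDFS as a directory under ${hive.metastore.warehouse.dir} (tables of the default database sit directly in that directory).

You can inspect it with hadoop: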

$ hadoop fs -ls /user/hive/warehouse/

Found 10 items
drwxr-xr-x   - root supergroup          0 2019-07-02 02:21 /user/hive/warehouse/class
drwxr-xr-x   - root supergroup          0 2019-07-01 09:15 /user/hive/warehouse/dummy

Create
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
Drop
drop database database_name;
Show
show databases;

A table appears in HDFS as a directory under the directory of the database it belongs to.

Inspect it through HDFS with hadoop:

$ hadoop fs -ls /user/hive/warehouse/class

Found 1 items
-rwxr-xr-x   1 root supergroup         24 2019-07-02 02:21 /user/hive/warehouse/class/000000_0
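
The directory layout can be summed up in a few lines of Python. This is a sketch under stated assumptions: it covers only managed tables created without a LOCATION clause, it relies on Hive's default convention of storing a non-default database in a directory named database_name.db, and the function name table_location as well as the sales and orders names are hypothetical.

```python
import posixpath


def table_location(warehouse_dir, database, table):
    """Return the default HDFS directory of a managed table (no LOCATION clause)."""
    if database == "default":
        # Tables of the default database sit directly under the warehouse directory.
        return posixpath.join(warehouse_dir, table)
    # Other databases get their own "<name>.db" directory.
    return posixpath.join(warehouse_dir, database + ".db", table)


print(table_location("/user/hive/warehouse", "default", "class"))
print(table_location("/user/hive/warehouse", "sales", "orders"))
```

The first call reproduces the /user/hive/warehouse/class path from the listing above. The data files of the table, such as 000000_0, live inside that directory.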

Create

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type
[COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)]
INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]

Drop
drop table table_name;
Update
insert overwrite table table_name ...
Show

show tables;

show tables 'page.*';  -- lists all tables whose names start with page

show table extended like 'table_name' [partition(xx='xx')];  -- shows all properties of the table (or partition)
Describe
describe table_name;

Alter

Rename a table (the old table name must exist).

ALTER TABLE old_table_name RENAME TO new_table_name;

Rename columns (REPLACE COLUMNS redefines the table's whole column list).

ALTER TABLE old_table_name REPLACE COLUMNS (col1 TYPE, ...);

Change a column's type (keeping the column name unchanged)

ALTER TABLE tableName CHANGE COLUMN columnName columnName VARCHAR(10);

Add columns to an existing table

ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');

Drop a column

ALTER TABLE tableName DROP COLUMN columnNa
