Big Data: Hive Basics

张聪聪

Last modified 2023-07-20 15:51:25

Tags: big data, hive, hadoop

First published 2023-07-14 10:52:21

Article link: https://blog.csdn.net/xuehentian/article/details/131665142

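I. Introduction to Hive
(I) What Hive Is

Hive was developed and open-sourced by Facebook.
It is a data warehouse tool built on Hadoop.
It maps structured data onto database tables
and provides HQL (Hive SQL) for querying.
The underlying data is stored on HDFS.
In essence, Hive converts SQL statements into MapReduce jobs,
so users unfamiliar with MapReduce can conveniently use HQL to process and compute over structured data on HDFS. It is suited to offline batch computation.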

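A data warehouse tool built on Hadoop:
   ① HDFS stores Hive's data
   ② MapReduce provides Hive's compute engine
   ③ YARN provides Hive's resource scheduling

   Hive maps structured data onto a database table.

   Two-dimensional: two conditions (a row and a column) are enough to pin down a single value.

It provides HQL (Hive SQL) for querying.

The underlying data is stored on HDFS.

OLAP: Online Analytical Processing (queries)

OLTP: Online Transaction Processing (inserts, deletes, updates)

In essence, Hive converts SQL statements into MapReduce jobs:
   select count(distinct department) from student;

   select * from student;

This lets users unfamiliar with MapReduce process and compute over structured data on HDFS with HQL, and it is suited to offline batch computation.

Hive relies on HDFS to store data and translates HQL into MapReduce for execution.
Typical HQL clauses: select, from, join, group by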
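To make the "SQL becomes MapReduce" idea concrete, here is a minimal Python sketch of how `select count(distinct department) from student` could be expressed as map, shuffle, and reduce steps. The sample rows and function names are made up for illustration, and this is not the plan Hive actually generates.

```python
from collections import defaultdict

# Hypothetical sample rows: (id, name, sex, age, department)
STUDENTS = [
    (1, "li", "M", 20, "CS"),
    (2, "wang", "F", 21, "IS"),
    (3, "zhao", "M", 19, "CS"),
    (4, "sun", "F", 22, "MA"),
]


def map_phase(rows):
    """Emit (department, 1) for every row, as a mapper might."""
    for row in rows:
        yield row[4], 1


def shuffle(pairs):
    """Group the mapper output by key, as the shuffle step does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Each distinct key arrives once, so counting keys gives count(distinct)."""
    return len(groups)


def count_distinct_department(rows):
    return reduce_phase(shuffle(map_phase(rows)))


print(count_distinct_department(STUDENTS))
```

The point of the sketch is only that a declarative query decomposes into these three phases; writing them by hand for every query is the cost that HQL removes.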

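(II) Why Use Hive
Problems with using MapReduce directly:
The learning cost for staff is too high
Project schedules are too short
Implementing complex query logic in MapReduce is too difficult
Why use Hive:
A friendlier interface: the SQL-like syntax enables rapid development
Lower learning cost: developers avoid writing MapReduce
Better scalability: the cluster can be scaled freely without restarting the service, and user-defined functions are supported.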

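(III) Characteristics of Hive
Hive is a tool, not a distributed cluster. Users write SQL statements, Hive translates them into MapReduce, and the jobs are submitted to the Hadoop cluster.
Advantages: see the course slides.
Disadvantages:
1. Hive traditionally does not support record-level insert, update, or delete, but users can generate a new table from a query or write query results to files.
2. Hive queries have high latency because starting a MapReduce job takes a long time, so Hive is not suitable for interactive query systems.
3. Hive traditionally does not support transactions (newer versions add limited ACID support for transactional tables).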

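(IV) Hive Architecture

Basic components

1. User interfaces

CLI: the shell command line (Command Line Interface). Users interact with Hive through the hive command line. This is the most commonly used interface (learning, debugging, production).

JDBC/ODBC: client access to Hive over JDBC. Users (developers, operations staff) connect through it to the Hive server service.

Web UI: access Hive through a browser.

2. Thrift Server

Thrift is a software framework developed at Facebook for building scalable, cross-language services. Hive integrates it so that programs written in different languages can call Hive's interfaces.

3. Metadata storage

Metadata is, put simply, the descriptive information about the data stored in Hive.

Hive metadata typically includes table names, each table's columns and partitions with their properties, table properties (internal or external), and the directory holding the table's data.

By default the Metastore lives in the bundled Derby database. The drawbacks are that it is unsuitable for multi-user access and that its storage directory is not fixed: the database follows wherever Hive is started, which makes it very inconvenient to manage.

Solution: usually store the metadata in a MySQL database we create ourselves (local or remote).

Hive and MySQL interact through the MetaStore service.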

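4. Driver: compiler, optimizer, executor

The Driver takes an HQL query through lexical analysis, parsing, compilation, and optimization, and generates the execution plan. The generated plan is stored on HDFS and then invoked and executed by MapReduce.

The core of Hive is the driver engine, which consists of four parts:

(1) Interpreter: converts a HiveSQL statement into an abstract syntax tree (AST)

(2) Compiler: compiles the syntax tree into a logical execution plan

(3) Optimizer: optimizes the logical execution plan

(4) Executor: calls the underlying runtime framework to execute the logical execution plan

5. Execution flow

HiveQL is submitted through the command line or a client and passes through the Compiler, which uses metadata from the MetaStore for type checking and syntax analysis and generates a logical plan. After optimization, the plan yields a MapReduce job.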

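(V) Hive's Storage Data Model
   Databases, tables, partitions and buckets, table data

   Database, table, partition: each corresponds to a directory on HDFS
   Bucket, table data: each corresponds to a file on HDFS

   Data storage systems: databases and data warehouses

   Database (schema on write): supports CRUD and stores comparatively little data
   Data warehouse (schema on read): supports queries

   Two schema models:
   1. Schema on read
   2. Schema on write

   In a relational database, a transactional operation is checked against the table definition as it is performed. If it does not conform, the execution engine raises an error.
   In a data warehouse, Hive's engine does not check whether data conforms to the schema when the data is added.

   Hive checks data when it is read. A value that fails the check is reported at that point (with the default delimited text format it typically shows up as NULL).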
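   A minimal Python sketch of schema on read, assuming a comma-delimited text file and the `student` columns used later in this article. The helper names are hypothetical, and returning `None` for a field that does not parse mirrors the NULL behaviour of Hive's default delimited text format rather than reproducing Hive's actual reader.

```python
def to_int(text):
    """Convert text to int, or return None when it does not parse."""
    try:
        return int(text)
    except ValueError:
        return None


# Assumed schema: column name plus a converter applied at read time.
SCHEMA = [
    ("id", to_int),
    ("name", str),
    ("sex", str),
    ("age", to_int),
    ("department", str),
]


def read_row(line, schema=SCHEMA, delimiter=","):
    """Apply the schema to one raw line only when the line is read."""
    fields = line.rstrip("\n").split(delimiter)
    row = {}
    for index, (column, convert) in enumerate(schema):
        # Missing trailing fields become None instead of failing the load.
        row[column] = convert(fields[index]) if index < len(fields) else None
    return row


print(read_row("1,li,M,20,CS"))
print(read_row("x,li,M,abc,CS"))
```

   Nothing is rejected when the file is written. Problems only become visible when a query reads the row, which is the trade-off that makes bulk loading cheap.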

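   Data: stored as files

   Metadata: Derby database

   Hive tables come in four kinds: internal tables, external tables, partitioned tables, and bucketed tables

   Internal and external tables:

   Partitioned and bucketed tables:
       Default warehouse directory: /user/hive/warehouse
   Suppose we create a database named myhive:
       HDFS directory for the myhive database: /user/hive/warehouse/myhive.db
   Suppose we create a table student in myhive:
       HDFS directory for the table: /user/hive/warehouse/myhive.db/student/
   Load data into it:
       load data local inpath "/home/data/student.txt" into table student;
   After the load, the data file is at:
       /user/hive/warehouse/myhive.db/student/student.txt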
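   The directory mapping above can be sketched in a few lines of Python. This assumes the default warehouse root shown above and a non-default database; the function names are hypothetical and nothing here talks to HDFS.

```python
import posixpath

WAREHOUSE_ROOT = "/user/hive/warehouse"  # default warehouse directory from the text


def database_dir(database):
    """A database maps to <warehouse>/<database>.db."""
    return posixpath.join(WAREHOUSE_ROOT, database + ".db")


def table_dir(database, table):
    """A table maps to a directory under its database directory."""
    return posixpath.join(database_dir(database), table)


def loaded_file_path(database, table, source_file):
    """A loaded file keeps its base name inside the table directory."""
    return posixpath.join(table_dir(database, table), posixpath.basename(source_file))


print(database_dir("myhive"))
print(table_dir("myhive", "student"))
print(loaded_file_path("myhive", "student", "/home/data/student.txt"))
```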

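   Partitioned tables:
       1. Concept: a finer-grained way of managing a table
       2. A partition corresponds to a directory on HDFS

   student_ptn
       Partition 1: sex = M
       The directory for this partition is:
           /user/hive/warehouse/myhive.db/student_ptn/sex=M/

       Partition 2: sex = F
       The directory for this partition is:
           /user/hive/warehouse/myhive.db/student_ptn/sex=F/

       A file placed directly at /user/hive/warehouse/myhive.db/student_ptn/student.txt is not visible to queries

       Load data into the partitioned table:
           load data local inpath "/home/data/student.txt" into table student_ptn partition(sex='M');
       After this load, the file on HDFS is at:
           /user/hive/warehouse/myhive.db/student_ptn/sex=M/student.txt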
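       As a rough illustration of "a partition is a directory", the Python helper below builds the partition directory from a table directory and an ordered partition spec. The function name is hypothetical, and the escaping Hive applies to special characters in partition values is deliberately left out.

```python
import posixpath


def partition_dir(table_directory, partition_spec):
    """Build the directory for a partition.

    partition_spec is an ordered list of (column, value) pairs; each pair
    becomes one "column=value" path segment, in order.
    """
    segments = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_directory, *segments)


TABLE = "/user/hive/warehouse/myhive.db/student_ptn"
print(partition_dir(TABLE, [("sex", "M")]))
print(partition_dir(TABLE, [("city", "beijing"), ("dt", "2012-12-12")]))
```

       With more than one partition column the segments nest, which is why a multi-column partition needs a value for every column.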


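   Bucketed tables:
       student_bucket
       Bucket data: corresponds to one file under a directory on HDFS
       HashPartitioner (MapReduce):
           hash of the key modulo the number of partitions
       Bucketing: hash of the column value modulo the number of buckets

       The same value always produces the same hash.

       1. If one value of the column falls in a bucket, every record with that value is in the same bucket.
       hello,1
       hello,1
       world,1
       2. A bucket may hold several distinct values, or none at all.
       Bucket 1: hello; bucket 2: world
       Bucket 1: hello, world; bucket 2: empty
       Bucket 1: empty; bucket 2: hello, world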
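       The hash-modulo rule can be tried out with a small Python sketch. The hash below is a toy string hash in the style of Java's `String.hashCode`, chosen only because it is deterministic. It is an assumption for illustration, and real Hive versions may hash differently.

```python
def toy_hash(value):
    """Deterministic toy hash (Java String.hashCode style, kept non-negative)."""
    h = 0
    for ch in str(value):
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def bucket_for(value, num_buckets):
    """Bucket index = hash of the column value modulo the number of buckets."""
    return toy_hash(value) % num_buckets


def assign_buckets(values, num_buckets):
    """Distribute values into num_buckets lists by the rule above."""
    buckets = [[] for _ in range(num_buckets)]
    for value in values:
        buckets[bucket_for(value, num_buckets)].append(value)
    return buckets


for index, bucket in enumerate(assign_buckets(["hello", "hello", "world"], 2)):
    print("bucket", index, bucket)
```

       Because the same value always hashes the same way, both `hello` records land together. Whether `world` joins them or goes elsewhere depends only on the hash and the bucket count, so some buckets can end up empty.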

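   Difference between internal and external tables:
       Dropping an internal table deletes both the metadata and the actual data.
       Dropping an external table deletes only the metadata; the actual data is kept.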

II. Hive in Practice

Hands-on:

-- Create a database
create database myhive;

-- Switch to the new database
use myhive;

-- Show the database currently in use
select current_database();

-- Create a table
create table student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";

-- List tables
show tables;

-- Load data into the table
load data local inpath "/home/data/student.txt" into table student;

-- Query the data
select id,name,sex,age,department from student;

-- Show the table structure
desc student;
desc extended student;
desc formatted student;

DDL
Official documentation: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select

   Data Definition Language
   (I) Databases
       Create, list, drop, alter, and switch databases

   1. Create a database
       create database myhive;
       create database if not exists myhive;

   CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
   Parentheses mark a required choice, the vertical bar inside them means "or", and square brackets mark optional parts.
       create database myhive_1 comment "myhive_1 database 1" location "/myhive_1";
       location is a directory on HDFS

   2. List databases
       show databases; -- list all databases
       select current_database(); -- show the database currently in use

   3. Drop a database
       Default:
       drop database myhive_1;   is equivalent to    drop database myhive_1 RESTRICT;
       drop database myhive_1 cascade; -- forced cascading delete, use with care!

   4. Alter a database
       Rarely used.

   5. Switch databases
       use myhive;


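   (II) Tables
       Create, drop, alter, and list tables, and view the details of a table.
   1. List tables
       show tables;
       show tables in myhive; -- list the tables of another database from the current one
       show tables "stu*";

       test_crm_student
       temp_student
       t_student

   2. View the details of a table
       desc student;
       desc formatted student;
       desc extended student;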

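   3. Create tables

       Bucketed table
       Example: CLUSTERED BY (department) SORTED BY (age ASC, id DESC) INTO 3 BUCKETS

       Create Table ... As Select, abbreviated CTAS

       create table studentss like student; -- copies the table structure, not the data

       There are six ways to create a table:
       (1) Create an internal table
           create table student(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";

           Table type of an internal table: MANAGED_TABLE

       (2) Create an external table
       create external table student_ext_1(id int, name string, sex string, age int, department string) row format delimited fields terminated by ",";

       Difference between internal and external tables:
           When the table is dropped, an internal table loses everything, while an external table loses only its metadata.
       Internal or external: which to choose?
           1. If the data already sits on HDFS, needs to be analyzed with Hive, and may also be analyzed by other compute engines, use an external table.
           2. If the data is used only by Hive for analysis, an internal table is fine.

       -- Point to an external path that does not exist yet: the table directory is created automatically when the table is created
       create external table student_ext_2(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_2";

       -- Point to a directory that already exists and contains data
       # Run in the Linux shell
       hadoop fs -mkdir -p /student_ext_3
       hadoop fs -put /home/data/student.txt /student_ext_3
       -- Run in the Hive command line
       create external table student_ext_3(id int, name string, sex string, age int, department string) row format delimited fields terminated by "," location "/student_ext_3";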


       (3) Create a partitioned table

       -- Create a partitioned table with a single partition column:
       create table student_ptn(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field") row format delimited fields terminated by ",";

       load data local inpath "/home/data/student.txt" into table student_ptn;

       -- Loading into a partition that does not exist creates the partition automatically
       load data local inpath "/home/data/student.txt" into table student_ptn partition(city="beijing");

       The partition column is a virtual column: its value is stored in the metastore, not in the data files.
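       A small Python sketch of what "virtual column" means in practice: the file lines carry only the regular columns, and the partition value is attached from the partition's name when rows are read. The function name and the sample line are hypothetical, and this models the idea rather than Hive's reader.

```python
def read_partition(partition_name, file_lines, delimiter=","):
    """Return (partition_column, rows) for one partition.

    partition_name looks like "city=beijing". The value after "=" is
    appended to every row, because it is not present in the file itself.
    """
    column, _, value = partition_name.partition("=")
    rows = []
    for line in file_lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append(fields + [value])
    return column, rows


column, rows = read_partition("city=beijing", ["1,li,M,20,CS\n"])
print(column, rows)
```

       The same data file loaded into a different partition would yield the same five fields with a different sixth value, which is why one file can be loaded into several partitions unchanged.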

       -- Load data into a partition that already exists
       alter table student_ptn add partition (city="chongqing");
       load data local inpath "/home/data/student.txt" into table student_ptn partition(city="chongqing");

       -- Create a partitioned table with several partition columns:
       create table student_ptn_date(id int, name string, sex string, age int, department string) partitioned by (city string comment "partitioned field", dt string) row format delimited fields terminated by ",";

       -- Load data into a partition:
       load data local inpath "/home/data/student.txt" into table student_ptn_date partition(city="beijing"); -- error: every partition column needs a value

       load data local inpath "/home/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-12'); -- correct

       -- A single load cannot name more than one partition (invalid):
       load data local inpath "/home/data/student.txt" into table student_ptn_date partition(city="beijing", dt='2012-12-14') partition (city="beijing" , dt='2012-12-13');

       -- Add partitions (several partitions in one statement is valid here)
       alter table student_ptn_date add partition(city="beijing", dt='2012-12-14') partition (city="beijing" , dt='2012-12-13');
       alter table student_ptn_date add partition(city="chongqing", dt='2012-12-14') partition (city="chongqing" , dt='2012-12-13');

       -- List the partitions of a partitioned table
       show partitions student_ptn;
       show partitions student_ptn_date;
       show partitions student; -- student is not a partitioned table

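       (4) Create a bucketed table

       -- Create a bucketed table
       create table student_bucket (id int, name string, sex string, age int, department string) clustered by (department) sorted by (age desc, id asc) into 3 buckets row format delimited fields terminated by ",";

       Num Buckets:     3
       Bucket Columns:    [department]
       Sort Columns:    [Order(col:age, order:0), Order(col:id, order:1)]

       Data has to be loaded into a bucketed table through a query-based insert.

Loading a bucketed table directly with a LOAD statement does succeed, but the data is not bucketed.

This is because bucketing works by hashing the chosen column and writing each row to the corresponding file, so inserting into a bucketed table has to go through MapReduce, with the number of Reducers equal to the number of buckets. For that reason, data is normally put into a bucketed table with a query-based insert (INSERT ... SELECT), which triggers MapReduce. The steps are:

1. Enforce bucketing

set hive.enforce.bucketing = true; -- not needed in Hive 2.x
In Hive 0.x and 1.x, hive.enforce.bucketing = true must be set. It enforces bucketing and lets Hive pick the correct number of Reducers and the cluster-by column from the table definition.

2. Load data with INSERT ... SELECT

INSERT INTO TABLE emp_bucket SELECT * FROM emp; -- emp here is an ordinary employee table
The execution log shows that the insert triggers a MapReduce job and that the number of Reducers matches the bucket count given when the table was created.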

       (5) Create a new table from the result of a query
       Use the following command:
           create table ... as select ....
       Example query:
           select department, count(*) as total from student group by department;
       Complete CTAS statement:
           create table dpt_count as select department, count(*) as total from student group by department;

       (6) Create a new table by copying an existing table's structure with like
       create table student_like like student;

   4. Drop a table
   drop table student;
   drop table if exists student;

   5. Alter a table

       1) Rename a table
       alter table student_like rename to studentss;

       2) Alter columns

           Add columns:
           alter table student2 add columns (city string, dt string);
           Drop a column:
           alter table student2 drop columns (city); -- error: Hive has no DROP COLUMNS, use REPLACE COLUMNS instead
           Replace columns:
           alter table student2 replace columns (id int, name string, sex string, age int);
           Change a column definition:
           alter table student2 change id newid string comment "new id";
           Change column order:
           alter table student2 change sex sex string first;
           alter table student2 change name name string after sex;

       3) Alter partitions

           Add partitions:
           alter table student_ptn add partition(city='tianjin') partition(city='shanghai');

           Drop partitions:
           alter table student_ptn drop partition(city='tianjin');
           alter table student_ptn drop partition(city='tianjin'),partition(city='shanghai');

           Change a partition's data directory:
           alter table student_ptn partition(city="beijing") set location "/stu_beijing"; -- error
           alter table student_ptn partition(city="beijing") set location "hdfs://hadoop0:8020/stu_beijing"; -- correct: a full HDFS URI is required


   6. Truncate a table
   truncate table student;
   # Or, from the Linux shell, delete the table's data files directly:
   hadoop fs -rm -r /user/hive/warehouse/myhive.db/student/*

III. DML
   Data Manipulation Language
   1. Import data
   LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
   (1) LOAD is a copy or move operation: it copies or moves the data file into the path Hive has set up for the table.
   With LOCAL the file is copied from the local filesystem. Without LOCAL the file is moved within HDFS.
   load
   load data local inpath "/home/data/student.txt" into table student;
   load data inpath "/student.txt" into table student;
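   The copy-versus-move difference is easy to see in a small Python sketch. Here the local filesystem stands in for both sides, and `load_data` is a hypothetical stand-in for the LOAD statement, not a Hive API.

```python
import os
import shutil
import tempfile


def load_data(source_path, table_directory, local):
    """Copy the file when local is True, move it otherwise."""
    os.makedirs(table_directory, exist_ok=True)
    target = os.path.join(table_directory, os.path.basename(source_path))
    if local:
        shutil.copy(source_path, target)   # LOCAL: the source file stays in place
    else:
        shutil.move(source_path, target)   # no LOCAL: the source file is gone afterwards
    return target


def demo():
    """Return whether the source file survives a copy load and a move load."""
    with tempfile.TemporaryDirectory() as root:
        source = os.path.join(root, "student.txt")
        with open(source, "w") as handle:
            handle.write("1,li,M,20,CS\n")
        table_directory = os.path.join(root, "warehouse", "student")

        load_data(source, table_directory, local=True)
        kept_after_copy = os.path.exists(source)

        shutil.rmtree(table_directory)     # reset the table directory
        load_data(source, table_directory, local=False)
        kept_after_move = os.path.exists(source)
        return kept_after_copy, kept_after_move


print(demo())
```

   The practical consequence is that after a load without LOCAL, the original HDFS path no longer holds the file, so it cannot be loaded a second time from there.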

   insert
   insert into table student (id, name, sex, age, department) values (101,"huangbo","M",222,"IT");

   Create a partitioned table:

   create table student_ptn (id int, name string, sex string, age int) partitioned by (department string) row format delimited fields terminated by ",";

   Single insert:
   insert into table student_ptn partition (department = 'IS') select id,name,sex,age from student where department = 'IS';
   insert into table student_ptn partition (department = 'CS') select id,name,sex,age from student where department = 'CS';
   insert into table student_ptn partition (department = 'MA') select id,name,sex,age from student where department = 'MA';

   Multi-insert:
   from student
   insert into table student_ptn partition (department = 'IS') select id,name,sex,age where department = 'IS'
   insert into table student_ptn partition (department = 'CS') select id,name,sex,age where department = 'CS'
   insert into table student_ptn partition (department = 'MA') select id,name,sex,age where department = 'MA';

   The biggest benefit of multi-insert is that many similarly structured statements are combined into one, which improves overall HQL efficiency: the resulting MapReduce job only has to read the source data once.
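   To illustrate why reading the source once matters, the Python sketch below compares one scan per target partition with a single shared scan. The row layout and function names are assumptions for illustration; it counts rows read and does not model Hive's execution.

```python
def single_inserts(rows, departments):
    """One full scan of the source per target partition."""
    partitions = {department: [] for department in departments}
    rows_read = 0
    for department in departments:
        for row in rows:
            rows_read += 1
            if row["department"] == department:
                partitions[department].append(row)
    return partitions, rows_read


def multi_insert(rows, departments):
    """One shared scan that routes each row to its target partition."""
    partitions = {department: [] for department in departments}
    rows_read = 0
    for row in rows:
        rows_read += 1
        if row["department"] in partitions:
            partitions[row["department"]].append(row)
    return partitions, rows_read


# Hypothetical sample data
ROWS = [
    {"id": 1, "department": "IS"},
    {"id": 2, "department": "CS"},
    {"id": 3, "department": "MA"},
    {"id": 4, "department": "CS"},
]
DEPARTMENTS = ["IS", "CS", "MA"]

print(single_inserts(ROWS, DEPARTMENTS)[1], multi_insert(ROWS, DEPARTMENTS)[1])
```

   Both approaches produce the same partitions. Only the number of rows read differs, and it grows with the number of target partitions in the single-insert case.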


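   2. Export data
   -- With local: write the result to a directory on the local filesystem
   insert overwrite local directory "/home/data/cs_student" select * from student where department = 'CS';

   -- Without local: write the result to a directory on HDFS
   insert overwrite directory "/home/data/cs_student" select * from student where department = 'CS';


   3. Query data
   The basic SELECT syntax in Hive is largely the same as standard SQL and supports WHERE, DISTINCT, GROUP BY, ORDER BY, HAVING, LIMIT, subqueries, and so on.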