Hive Basics

巷子里的猫X

Last modified: 2022-10-11 17:43:46


First published: 2022-10-11 16:16:56

Article link: https://blog.csdn.net/qq_52421831/article/details/127265593


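This article introduces Hive, a data warehouse tool on Hadoop that provides SQL-style querying and is suited to offline processing and analysis of large-scale data. It covers Hive's basic concepts, architecture, relationship to Hadoop, data storage, table types and their characteristics, and three methods of data sampling.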


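1. Basic Concepts of Hive

1.1 Introduction to Hive

What is Hive?

1) Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying.

2) In essence, Hive translates SQL statements into MapReduce or Spark jobs for execution, while the underlying data is stored in HDFS, the Hadoop distributed file system.

3) Hive can be thought of as a SQL client for MapReduce or Spark.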
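To make point 2 more concrete, here is a minimal Python sketch of how a grouped count such as `select sex, count(*) from person group by sex` breaks down into map, shuffle, and reduce phases. It only models the idea and is not the code Hive generates; the sample rows and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows of a person table: (id, name, sex)
ROWS = [(1, "Ann", "F"), (2, "Bob", "M"), (3, "Cid", "M")]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    return [(sex, 1) for _id, _name, sex in rows]


def shuffle_phase(pairs):
    """Group the emitted values by key, as the shuffle step does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Aggregate the values of each key; summing the 1s gives count(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def group_count(rows):
    """Run the three phases in order."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


print(group_count(ROWS))  # {'F': 1, 'M': 2}
```

In a real cluster the map and reduce phases run in parallel on many nodes; the sketch runs them in one process purely to show the data flow.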

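Why use Hive?

Writing MapReduce directly has a steep learning curve, and project schedules are often too tight for it, so implementing complex query logic that way is difficult.

Hive exposes a SQL-like interface instead, which speeds up development, and it can also be extended with custom functionality.

What are Hive's main features?

1. Scalability: the cluster can be scaled out freely.
2. Extensibility: user-defined functions are supported, so development is flexible.
3. Fault tolerance: a SQL query can still complete when a node fails.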

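1.2 Hive architecture

User interfaces

CLI (command line interface): the shell command line

JDBC (Java Database Connectivity): database connections from Java

ODBC (Open Database Connectivity): open database connections

WebGUI (Web Graphical User Interface): a web-based graphical interface

Metastore

Metadata is the information that describes tables, such as table names, column names, partition attributes, and the directory where the data lives.

Metadata is usually stored in a relational database such as MySQL or Derby (the default).

Driver

The driver parses, compiles, and optimizes HiveSQL, generates an execution plan, and then invokes the underlying MapReduce compute framework.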

1.3 How Hive relates to Hadoop

Hive stores its data in Hadoop's HDFS.

Hive queries and analyzes data with Hadoop's MapReduce.

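1.4 Hive compared with relational databases

On the surface Hive resembles a relational database, but its use cases are entirely different: Hive is only suited to offline processing and analysis of large-scale data.

Aspect	Hive	RDBMS
Query language	HQL	SQL
Data storage	HDFS distributed file system	Raw devices or local files
Execution engine	MapReduce	Executor
Execution latency	High	Low
Data scale	Large	Small
Indexes	No indexes	Rich index support

Updates and deletes: a data warehouse is read-heavy and write-light, so ordinary Hive tables do not support UPDATE or DELETE; those statements work only on ACID transactional tables (available since Hive 0.14).

Execution latency: there are no indexes, so a query has to scan the whole table, and MapReduce itself has high startup latency. Hive's advantages therefore only show on large-scale data.

Data scale: Hive runs on a cluster and uses MapReduce for parallel computation, so it can handle large-scale data.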

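1.5 Hive data storage

1.5.1 Hive data storage formats

1.5.2 Hive data storage model

Partitioning: data is stored in separate directories

Bucketing: data is stored in separate files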

1.6 Hive compute engines

MapReduce

Tez

Spark

set hive.execution.engine=mr;

set hive.execution.engine=spark;

set hive.execution.engine=tez;

1.7 Hive data types

2. Hive Table Types

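-- vi editing notes (for editing data and configuration files):
-- i (or the Insert key): enter insert mode and type before the cursor; press Esc when done
-- Delete key: delete characters at the cursor, from left to right
-- o: open a new line below the cursor and enter insert mode; press Esc when done
-- dd: delete the line the cursor is on (a normal-mode command)
-- Save and quit: type :wq (Shift+; for the colon), then press Enter

-- Dangerous: force-deletes every directory under / without prompting; never run this on a system you care about
-- sudo rm -rf /*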

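2.1 Hive internal tables

Unless a table is external, its data is stored under the hive.metastore.warehouse.dir directory configured in hive-site.xml.

-- An internal table resembles a table in a relational database; each table has its own storage directory
-- Internal table data is stored under the hive.metastore.warehouse.dir directory configured in hive-site.xml
-- Create a database; a test.db directory is added in HDFS
create database test;
-- Drop a database; the test.db directory is removed from HDFS
-- (re-create the database before running the table examples that follow)
drop database test;

-- Create a directory in HDFS (shell command; -p also creates missing parent directories)
hadoop fs -mkdir -p /user/hive/warehouse/test.db/person
-- Upload a data file from the local file system to HDFS (shell command)
hadoop fs -put /data/test1/person.txt /user/hive/warehouse/test.db/person
-- Change the owner of the HDFS directory and its files (shell command)
hadoop fs -chown -R root /user/hive/warehouse

-- Create an internal table
create table if not exists test.person(
   id int comment 'employee id',
   name string comment 'name',
   sex string comment 'sex')
row format delimited
fields terminated by ','
-- Specify the HDFS storage location
location '/user/hive/warehouse/test.db/person';
select * from test.person;

-- Drop the internal table; both the metadata and the data files are deleted
drop table test.person;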

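2.2 Hive external tables

-- Create an external table with the external keyword
create external table if not exists test.person(
   id int comment 'employee id',
   name string comment 'name',
   sex string comment 'sex')
row format delimited
fields terminated by ','
-- Specify the HDFS storage location
location '/user/hive/warehouse/test.db/person';

-- Drop the external table; only the metadata is deleted and the data files stay in place
-- In other respects an external table behaves like an internal table
drop table test.person;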

2.3 Hive partitioned tables

-- Create a partitioned table
create table if not exists test.person_p(
   id int comment 'employee id',
   name string comment 'name',
   sex string comment 'sex')
partitioned by(
   province string comment 'province',
   city string comment 'city')
row format delimited
fields terminated by ',';
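Since a partition is just a directory, it can help to see the path that a row of test.person_p ends up under. The Python sketch below builds that path from the partition columns as nested key=value segments. The warehouse root and the sample partition values are assumptions for illustration (the real root comes from hive.metastore.warehouse.dir), and the helper name is made up.

```python
import posixpath

# Assumed warehouse root; the real value comes from hive.metastore.warehouse.dir
WAREHOUSE = "/user/hive/warehouse"


def partition_dir(database, table, partition_values):
    """Build the HDFS directory of one partition as nested key=value segments.

    partition_values is a list of (column, value) pairs in the same order
    as the columns listed in PARTITIONED BY.
    """
    segments = [f"{column}={value}" for column, value in partition_values]
    return posixpath.join(WAREHOUSE, f"{database}.db", table, *segments)


path = partition_dir("test", "person_p", [("province", "guangdong"), ("city", "shenzhen")])
print(path)  # /user/hive/warehouse/test.db/person_p/province=guangdong/city=shenzhen
```

A query that filters on province and city only needs to read the matching directories, which is why partitioning on commonly filtered columns cuts down the amount of data scanned.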

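2.4 Hive bucketed tables

A bucketed table hashes the specified column and splits the rows by hash value: each row goes to the bucket given by its hash modulo the number of buckets, and each bucket is written to its own file.

-- Partitioning stores data in separate directories
-- Bucketing stores data in separate files
-- Put simply, rows are split into different files according to the values of the chosen column
-- The table is named person_b here so that it does not clash with the partitioned table test.person_p
create table if not exists test.person_b(
   id int comment 'employee id',
   name string comment 'name',
   sex string comment 'sex')
partitioned by(
   province string comment 'province',
   city string comment 'city')
clustered by (sex) sorted by (id) into 2 buckets
row format delimited
fields terminated by ',';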
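The bucketing rule described above (hash the column, then take the remainder by the number of buckets) can be sketched in a few lines of Python. The hash below is a Java-style 31-based string hash chosen as a stand-in; treat it as an assumption, because the exact hash function Hive applies depends on the column type and the Hive version. The sample rows are made up.

```python
def java_style_hash(text):
    """31-based polynomial hash over characters, wrapped to a signed 32-bit int.

    A stand-in chosen for illustration; it is an assumption, not Hive's exact function.
    """
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_for(value, num_buckets):
    """Map a value to a bucket index in the range [0, num_buckets)."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets


# Hypothetical rows: (id, name, sex), bucketed by sex into 2 buckets
rows = [(1, "Ann", "F"), (2, "Bob", "M"), (3, "Cid", "M"), (4, "Dee", "F")]
buckets = {0: [], 1: []}
for row in rows:
    buckets[bucket_for(row[2], 2)].append(row)

print(buckets)
```

Whatever the hash function, the property that matters is that equal values always land in the same bucket, which is what makes bucket sampling and bucket-wise joins possible.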

2.5 Hive views

-- Create a view
-- A view simplifies queries by hiding complex logic (joins, subqueries, grouping, aggregation, filtering, window functions, and so on)
-- A view can also be used to hide sensitive columns
create view test.sm_customer_info_view as
select
   ci.id
   , ci.customer_id
   , ci.customer_type
   , ci.gender
   , ci.age
from sm.sm_customer_info ci;

-- Query the view
select * from test.sm_customer_info_view;

-- Show the statement that created the view
show create table test.sm_customer_info_view;

-- Redefine the view (the view must already exist)
alter view test.sm_customer_info_view as
select
   ci.id
   , ci.customer_id
   , ci.customer_type
   , ci.age
   , ci.gender
from sm.sm_customer_info ci;

-- Drop the view (no data files are deleted)
drop view test.sm_customer_info_view;

-- Re-create the view after it has been dropped
create view test.sm_customer_info_view as
select
   ci.id
   , ci.customer_id
   , ci.customer_type
   , ci.age
from sm.sm_customer_info ci;

-- Show basic information
describe test.sm_customer_info_view;
-- Show detailed information
describe formatted test.sm_customer_info_view;

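-- NULL cannot be tested with comparison operators (a common pitfall); use is null or is not null
-- Without the "or ci.age is null" part, rows whose age is NULL would be silently dropped
select *
from sm.sm_customer_info ci
where ci.age <> 19 or ci.age is null;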
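The reason for this pitfall is SQL's three-valued logic: comparing anything with NULL yields "unknown" rather than true or false, and a where clause keeps only the rows that evaluate to true. The Python sketch below models this with None standing in for NULL; the helper names and the sample ages are made up for illustration.

```python
def not_equal(a, b):
    """SQL-style <>: the result is unknown (None) when either side is NULL (None)."""
    if a is None or b is None:
        return None
    return a != b


def sql_or(p, q):
    """SQL three-valued OR: true wins, then unknown, then false."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False


def where(rows, predicate):
    """Keep only the rows whose predicate is exactly true; false and unknown are dropped."""
    return [row for row in rows if predicate(row) is True]


ages = [18, 19, None, 25]

# where age <> 19
print(where(ages, lambda age: not_equal(age, 19)))  # [18, 25]

# where age <> 19 or age is null
print(where(ages, lambda age: sql_or(not_equal(age, 19), age is None)))  # [18, None, 25]
```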

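3. Hive Data Sampling

On large data sets, sampling makes analysis faster. Hive offers three ways to sample data: random sampling, block sampling, and bucket sampling.

3.1 Random sampling

Keywords: distribute by + sort by

-- sort by sorts locally, within each reducer
-- order by sorts globally

-- Set the number of reduce tasks
set mapreduce.job.reduces=3;
-- Running "set mapreduce.job.reduces;" without a value shows the current setting
-- Distribute rows randomly, then sort by the given column in ascending order within each reducer
-- (append "limit n" to any of these queries to keep only n rows)
select * from edu.score
distribute by rand()
sort by s_score;

-- Distribute rows randomly and sort them randomly
select *
from edu.score
distribute by rand()
sort by rand();
-- Shorthand
-- cluster by x is short for distribute by x sort by x,
-- so distribute by rand() sort by rand() can be written as cluster by rand()
select *
from edu.score
cluster by rand();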

Keyword: order by

-- order by sorts globally (use with caution on large data sets)
select * from edu.score
order by rand();
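The difference between sort by and order by is easier to see in a small simulation. The Python sketch below models `distribute by rand() sort by <col>` as sending each row to a random reducer and sorting inside each reducer only, and models `order by <col>` as one global sort. It is a simplified model, not Hive's implementation; the scores, the seed, and the function names are assumptions for illustration.

```python
import random


def distribute_and_sort(rows, num_reducers, sort_key, rng):
    """Model `distribute by rand() sort by <col>`: random reducer per row, local sort only."""
    partitions = [[] for _ in range(num_reducers)]
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return [sorted(part, key=sort_key) for part in partitions]


def order_by(rows, sort_key):
    """Model `order by <col>`: a single global sort over all rows."""
    return sorted(rows, key=sort_key)


scores = [72, 95, 60, 88, 41, 79, 53, 100, 67]
rng = random.Random(0)  # fixed seed so that the run is repeatable

parts = distribute_and_sort(scores, 3, lambda s: s, rng)
for i, part in enumerate(parts):
    print(f"reducer {i}: {part}")

print("order by:", order_by(scores, lambda s: s))
```

Each reducer's output is sorted, but concatenating the reducers does not generally give a sorted result. Only order by guarantees a total order, at the cost of funnelling all rows through one final sort.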

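3.2 Block sampling

Keyword: the tablesample clause

tablesample(n percent): sample by percentage

select * from edu.score tablesample(10 percent);   -- sample ten percent

tablesample(nM): sample by data size; the suffix gives the unit (B for bytes, K, M, or G)

select * from edu.score tablesample(200B);    -- sample 200 bytes of data

tablesample(n rows): sample by row count

select * from edu.score tablesample(10 rows);    -- sample ten rows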

3.3 Bucket sampling

Keyword: tablesample (bucket x out of y [on colname])

-- Randomly split the table's rows into ten groups and take one of them
select * from edu.score tablesample(bucket 1 out of 10 on rand());
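As a rough mental model of `bucket x out of y on rand()`, every row is given a random bucket number from 1 to y and only the rows that fall into bucket x are returned, so about one row in y survives. The Python sketch below simulates that; it is a simplified model rather than Hive's implementation, and the row data, the seed, and the function name are assumptions for illustration.

```python
import random


def bucket_sample(rows, x, y, rng):
    """Model `tablesample(bucket x out of y on rand())`.

    Each row gets a random bucket number in 1..y and rows in bucket x are kept,
    so roughly len(rows) / y rows come back.
    """
    if not 1 <= x <= y:
        raise ValueError("x must satisfy 1 <= x <= y")
    return [row for row in rows if rng.randrange(y) + 1 == x]


rows = list(range(10_000))  # stand-in for the rows of a table
sample = bucket_sample(rows, 1, 10, random.Random(42))
print(len(sample))  # close to 1000, the exact count depends on the random draws
```

When the table is sampled on a real column instead of rand(), the bucket number comes from hashing that column, so the same rows are returned on every run.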
