Big Data Essentials Explained: Hive (Part 1)

Helltaker

Category: Big Data. Tags: database, big data, hive, java, mysql

Article link: https://blog.csdn.net/helltaker/article/details/108626472

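Big Data Essentials Explained: Hive

Introduction to Hive
Hive architecture
Relationship between Hive and Hadoop
Installing Hive
Ways to interact with Hive
Hive data types
Hive and Beeline
Basic Hive operations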

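Introduction to Hive

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides SQL-like querying.

In essence, Hive translates SQL into MapReduce jobs for execution, while HDFS provides the underlying data storage.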
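To make the SQL-to-MapReduce idea concrete, here is a minimal Python sketch of how a query such as `select word, count(*) from t group by word` could be broken into map, shuffle, and reduce steps. The function names and the sample rows are hypothetical, and this only models the idea; it is not how Hive's query planner is implemented.

```python
from collections import defaultdict

def map_phase(rows):
    """Map: emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["word"], 1

def shuffle(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate each key's values (count(*) becomes a sum of ones)."""
    return {key: sum(values) for key, values in grouped.items()}

def run_group_by_count(rows):
    """Chain the three phases, as a MapReduce job would."""
    return reduce_phase(shuffle(map_phase(rows)))

# Hypothetical input table with a single column named "word".
rows = [{"word": "a"}, {"word": "b"}, {"word": "a"}]
result = run_group_by_count(rows)
```

In a real cluster the map and reduce steps run in parallel on different nodes, which is why Hive queries have high latency but scale to large data.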

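Why use Hive?

It uses SQL-like syntax to work with data, which enables rapid development
It avoids writing MapReduce by hand, lowering the learning cost for developers
It is easy to extend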

Hive architecture

[Figure: Hive architecture diagram (image missing from the source)]

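Relationship between Hive and Hadoop

Hive stores its data in HDFS and uses MapReduce to query and analyze it. It is intended for offline analysis of massive data sets.

Hive compared with a traditional database:

	Hive	RDBMS
Query language	HQL	SQL
Data storage	HDFS	Raw device or local FS
Execution	MapReduce	Executor
Execution latency	High	Low
Data scale	Large	Small
Indexes	Added after version 0.8 (index support was removed in Hive 3.0)	Rich, complex indexes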

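Installing Hive

Ways to interact with Hive

Option 1: bin/hive

cd /opt/hive
bin/hive
-- create a database
create database if not exists mytest;

Option 2: run SQL statements or SQL scripts

Execute a Hive HQL statement directly, without entering the Hive client:

cd /opt/hive
bin/hive -e "create database if not exists mytest"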

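Hive data types

Hive primitive data types

Hive's data types are similar to those of SQL.
[Figure: table of Hive primitive data types; the recommended types are shown in bold (image missing from the source)]

Hive complex data types

ARRAY: an ordered collection whose elements all have the same type
MAP: key-value pairs in which all keys share one type and all values share one type
STRUCT: a group of named fields packaged together

[Figure: examples of the complex data types (image missing from the source)]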
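As an illustration of the three complex types, the following Python sketch parses one line of a delimited text file into an ARRAY, a MAP, and a STRUCT. It assumes a hypothetical table whose row format uses tab between fields, a comma between collection items, and a colon between map keys and values; the column names and the sample line are made up for the example.

```python
from typing import NamedTuple

class Address(NamedTuple):
    """Stands in for STRUCT<street:STRING, city:STRING> (hypothetical columns)."""
    street: str
    city: str

def parse_row(line, field_sep="\t", item_sep=",", key_sep=":"):
    """Split one text line into (name, ARRAY, MAP, STRUCT) values."""
    name, cities, family, addr = line.split(field_sep)
    city_array = cities.split(item_sep)                  # ARRAY<STRING>
    family_map = dict(pair.split(key_sep, 1)             # MAP<STRING,STRING>
                      for pair in family.split(item_sep))
    address = Address(*addr.split(item_sep))             # STRUCT fields in order
    return name, city_array, family_map, address

line = "zhangsan\tbeijing,shanghai\tfather:zhang,mother:li\troad1,beijing"
name, city_array, family_map, address = parse_row(line)
```

In HiveQL these values would be read as `city_array[0]`, `family_map['father']`, and `address.city` respectively.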

Hive metadata structure

[Figure: Hive metadata structure (image missing from the source)]

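Hive and Beeline

Differences between the two:

hive does not need a service to be started before you connect (a single command starts both the client and the server)
beeline requires the server (HiveServer2) to be started first, and then connects to it as a client
beeline is more efficient than hive for queries; beeline does not support update or delete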

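Basic Hive operations

Database and table operations

1. Create a database

create database if not exists myhive;

Note: the default location where Hive stores its tables is set by a property in hive-site.xml:

<property>
	<name>hive.metastore.warehouse.dir</name>
	<value>/opt/hive/warehouse</value>
</property>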
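The sketch below shows, in Python, how the default storage path of a table is commonly derived from the warehouse directory: a database gets a `<name>.db` directory under the warehouse, and tables in the `default` database sit directly under the warehouse directory. The function name is hypothetical and the code only mirrors that naming convention; it does not talk to Hive.

```python
import posixpath

def default_table_path(warehouse_dir, database, table):
    """Return the conventional HDFS directory for a managed table."""
    if database == "default":
        # Tables in the default database live directly under the warehouse.
        return posixpath.join(warehouse_dir, table)
    # Other databases get their own "<database>.db" directory.
    return posixpath.join(warehouse_dir, database + ".db", table)

path = default_table_path("/opt/hive/warehouse", "myhive", "stu")
```

A database created with an explicit `location` clause does not follow this convention; its tables are stored under the path given at creation time.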

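2. Create a database at a specified path

create database if not exists myhive location '/opt/hive/myhive';

3. Set database key-value properties
A database can carry descriptive key-value properties, added at creation time:

create database foo with dbproperties ('owner'='itcast','date'='20200916');

View the database's key-value properties:

describe database extended foo;

Modify the database's key-value properties:

alter database foo set dbproperties ('owner'='iteima');

4. View more information about a database (such as its location, owner, and properties)

desc database extended foo;

5. Delete a database

Force-delete a database together with the tables it contains:

drop database foo cascade;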

6. Syntax for creating a table

create [external] table [if not exists] table_name(
	col_name data_type [comment ''],
	col_name data_type)
[comment '']
[partitioned by (col_name data_type,...)]
[clustered by (col_name, col_name,...)
	[sorted by (col_name [asc|desc],...)] into num_buckets buckets]
[row format row_format]
[stored as ...]
[location '']

Managed (internal) table operations

A first table:

create table stu(id int, name string);
insert into stu values(1, "zhangsan");
select * from stu;

Create a table and specify the delimiter between fields:

create table if not exists stu(id int, name string) row format delimited fields terminated by '\t';
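To show what a tab-delimited row format means when the data is read, here is a small Python sketch that applies a schema to one line of text. It models the commonly described schema-on-read behavior for delimited text tables: a value that cannot be converted to the column type becomes NULL (None here), and a missing trailing field also becomes NULL. The helper name and the sample lines are hypothetical, and the conversion rules are simplified.

```python
def parse_line(line, schema, sep="\t"):
    """Apply a (column name, converter) schema to one delimited text line."""
    parts = line.rstrip("\n").split(sep)
    row = {}
    for index, (column, convert) in enumerate(schema):
        if index >= len(parts):
            row[column] = None          # missing field is read as NULL
            continue
        try:
            row[column] = convert(parts[index])
        except ValueError:
            row[column] = None          # unparsable value is read as NULL
    return row

# Mirrors stu(id int, name string) with fields terminated by '\t'.
schema = [("id", int), ("name", str)]

good = parse_line("1\tzhangsan", schema)
bad_id = parse_line("abc\tlisi", schema)
short = parse_line("2", schema)
```

This is why a wrong delimiter in the table definition usually shows up as columns full of NULL values rather than as a load error.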

Create a table and specify where its files are stored:

create table if not exists stu2(id int, name string) row format delimited fields terminated by '\t' location '/usr/data';

Create a table from a query result:

create table stu3 as select * from stu2;

Create a table from the structure of an existing table:

create table stu4 like stu;

View a table's details:

desc stu2;

Drop a table:

drop table stu4;

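External table operations

An external table loads data that lives at some other HDFS path, so Hive does not consider itself the sole owner of that data. When an external table is dropped, the data stays in HDFS and is not deleted.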
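The difference in drop behavior can be modeled with a few lines of Python. The sketch below uses a temporary local directory in place of HDFS and a hypothetical `ToyMetastore` class in place of the Hive metastore; it only illustrates the rule that dropping a managed table deletes its data while dropping an external table removes only the metadata.

```python
import shutil
import tempfile
from pathlib import Path

class ToyMetastore:
    """Hypothetical stand-in for the metastore: table name -> (data dir, external?)."""

    def __init__(self):
        self.tables = {}

    def create_table(self, name, data_dir, external=False):
        Path(data_dir).mkdir(parents=True, exist_ok=True)
        self.tables[name] = (Path(data_dir), external)

    def drop_table(self, name):
        data_dir, external = self.tables.pop(name)   # metadata is always removed
        if not external:
            shutil.rmtree(data_dir)                  # managed table: data goes too

root = Path(tempfile.mkdtemp())
metastore = ToyMetastore()
metastore.create_table("stu", root / "warehouse" / "stu")
metastore.create_table("teacher", root / "hivedatas" / "teacher", external=True)
(root / "hivedatas" / "teacher" / "teacher.csv").write_text("01\tzhang\n")

metastore.drop_table("stu")
metastore.drop_table("teacher")

managed_data_exists = (root / "warehouse" / "stu").exists()
external_data_exists = (root / "hivedatas" / "teacher" / "teacher.csv").exists()
shutil.rmtree(root)  # clean up the temporary directory
```

After both drops, the managed table's directory is gone while the external table's file is still present until the final cleanup line removes the whole sandbox.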

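When should you use managed tables and when external tables?
Example: website logs are collected every day and flow into HDFS text files on a schedule. Heavy statistical analysis is done on top of an external table (the raw log table), while the intermediate and result tables are stored as managed tables, populated via select + insert.

Create the teacher table:

create external table teacher (t_id string, t_name string) row format delimited fields terminated by '\t';

Create the student table:

create external table student (s_id string, s_name string, s_birth string, s_sex string) row format delimited fields terminated by '\t';

Load data:

load data local inpath 'file path' into table student;

Load data and overwrite what is already there:

load data local inpath 'file path' overwrite into table student;

Load data into a table from HDFS (the data must be uploaded to HDFS beforehand):

hdfs dfs -mkdir -p /hivedatas
hdfs dfs -put teacher.csv /hivedatas/

# in hive
load data inpath '/hivedatas/teacher.csv' into table teacher;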

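Partitioned table operations

One of the most common ideas in big data is divide and conquer: split a large file into many small files, so that each operation only has to touch a small one. Hive supports the same idea. Large data sets can be split by month or by day, for example, with each slice stored in its own directory.

Syntax for creating a partitioned table:

create table score(s_id string, c_id string, s_score int) partitioned by (month string) row format delimited fields terminated by '\t';

Create a table with multiple partition columns:

create table score(s_id string, c_id string, s_score int) partitioned by (year string, month string) row format delimited fields terminated by '\t';

Load data into a partitioned table:

load data local inpath 'file path' into table score partition(month='09');

Load data into a table with multiple partition columns:

load data local inpath 'file path' into table score partition(year='2020',month='09');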
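The following Python sketch illustrates how partitions map to directories and why filtering on a partition column is cheap. Each partition is stored in a `column=value` subdirectory of the table, so a filter on the partition columns can skip whole directories. The table directory used here is a hypothetical path, and the helper names are made up for the example.

```python
import posixpath

def partition_dir(table_dir, partition_spec):
    """Build the directory for one partition from (column, value) pairs."""
    parts = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir, *parts)

def prune(partitions, **wanted):
    """Keep only the partitions whose columns match the requested values."""
    return [spec for spec in partitions
            if all(dict(spec).get(col) == val for col, val in wanted.items())]

# Hypothetical table directory and two loaded partitions.
table_dir = "/opt/hive/warehouse/score"
partitions = [
    (("year", "2020"), ("month", "09")),
    (("year", "2020"), ("month", "10")),
]

first_dir = partition_dir(table_dir, partitions[0])
september_only = prune(partitions, month="09")
```

A query with `where month='09'` only needs to read the files under the directories that survive this kind of pruning.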

Query several partitions together (using union all):

select * from score where month='09' union all select * from score where month='10';

List partitions:

show partitions score;

Add a partition:

alter table ... add partition(month='11');

Drop a partition:

alter table ... drop partition(month='11');

Bucketed table operations

Bucketing splits the data into multiple files according to a specified column. It corresponds to partitioning in MapReduce.
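As a rough illustration of bucketing, the Python sketch below assigns each row to a bucket by hashing the bucketing column and taking the remainder modulo the number of buckets. The hash function is a simple Java-style string hash used as a stand-in; it is an assumption for the example, not necessarily the exact hash Hive applies to every type.

```python
def java_style_hash(text):
    """Stand-in 32-bit hash (h = 31*h + byte); an assumption, not Hive's exact code."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (31 * h + byte) & 0xFFFFFFFF
    return h

def bucket_for(value, num_buckets):
    """Clear the sign bit, then take the remainder to pick a bucket."""
    return (java_style_hash(value) & 0x7FFFFFFF) % num_buckets

def split_into_buckets(rows, column, num_buckets):
    """Distribute rows into num_buckets lists, one list per output file."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[bucket_for(row[column], num_buckets)].append(row)
    return buckets

# Hypothetical course rows, bucketed on c_id into 3 buckets.
rows = [{"c_id": "01"}, {"c_id": "02"}, {"c_id": "03"}, {"c_id": "01"}]
buckets = split_into_buckets(rows, "c_id", 3)
```

Rows with the same `c_id` always land in the same bucket, which is what makes bucketed joins and sampling on that column efficient.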

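Enable bucketing in Hive (needed on Hive 1.x; from Hive 2.0 on, bucketing is always enforced):

set hive.enforce.bucketing=true;

Set the number of reducers:

set mapreduce.job.reduces=3;

Create a bucketed table:

create table course (c_id string, c_name string, t_id string) clustered by (c_id) into 3 buckets row format delimited fields terminated by '\t';

Loading data into a bucketed table with hdfs dfs -put or load data does not work properly, because the rows would not be distributed into buckets. The data has to be loaded with insert overwrite.
Create a regular table --> load data into the regular table --> load the bucketed table with insert overwrite:

insert overwrite table course select * from coursecommon cluster by (c_id);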

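Altering table structure

Rename a table:

alter table old_table_name rename to new_table_name;

Add or modify columns:

Add columns:

alter table ... add columns(column_name type, ...);

Change a column:

alter table ... change column old_column_name new_column_name type;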
