Getting Started with Hive

库库林_沙琪马

Last modified: 2024-09-08 12:20:32

Category: Big Data Technology Fundamentals. Tags: hive, hadoop, data warehouse

First published: 2024-09-08 12:15:45

Article link: https://blog.csdn.net/iku_n/article/details/142024761

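I. MySQL Basics

8088: Hadoop

Open a command prompt, type mysql -uroot -p, and press Enter.

List databases

show databases;

Switch to a database

use database_name;

Create a database

create database database_name [charset utf8];

Drop a database

drop database database_name;

Show the database currently in use

select database();

The Hive equivalent of the statement above (current_database() is a Hive function, not a MySQL one)

select current_database();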

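1. DDL: Table Management

List the tables with show tables; Note: a database must be selected first.

1) Create a table

create table table_name(

	column_name column_type,

	column_name column_type,

	...

);

-- Common column types

int	  -- integer

float       -- floating-point number

varchar(length)	-- text; length is a number that caps the maximum length

date	-- date

timestamp	-- timestamp

2) Drop a table

drop table table_name;

drop table if exists table_name;

3) Alter a table

Add a column

alter table table_name add column_name column_type;

Rename a column or change its type and attributes

alter table table_name change old_column_name new_column_name new_column_type [constraints];

Drop a column

alter table table_name drop column column_name;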
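
As a rough illustration of the DDL statements above, the following MySQL sketch creates, alters, and drops a table. The database demo_db, the table student, and its columns are made up for this example and do not come from the original notes.

```sql
-- Hypothetical database and table names, for illustration only.
create database if not exists demo_db charset utf8;
use demo_db;

create table student(
    id int,
    name varchar(20),
    birthday date
);

-- Add a column, rename and retype it, then drop it again.
alter table student add score float;
alter table student change score grade int;
alter table student drop column grade;

-- Clean up.
drop table if exists student;
```

Running show tables; between the steps is an easy way to confirm what each statement did.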

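2. DML: Data Manipulation

1) Insert data

Plain insert

insert into table_name[(col1, col2, ..., colN)] values(val1, val2, ..., valN);

Insert the result of a query

insert into table_name(col, ...)
select col, ...
from source_table;

Copy the contents of a table into a new table

create table New_Table as
select * from table_name;

Update (modify existing rows)

update table_name 
set  col=value [where condition];

2) Delete data

delete from table_name 
where condition;

Empty a table

truncate table table_name;

3) Update data

update table_name set col=value [where condition];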
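
The DML statements above can be tried end to end with a small MySQL sketch like the one below. The student table and its rows are hypothetical sample data.

```sql
-- Hypothetical table; names and values are illustrative only.
create table student(id int, name varchar(20), age int);

insert into student(id, name, age) values(1, 'Alice', 20);
insert into student values(2, 'Bob', 22), (3, 'Carol', 19);

-- Copy rows into a new table created on the fly.
create table student_backup as
select * from student;

-- Without a where clause, update and delete affect every row.
update student set age = 21 where id = 1;
delete from student where age < 20;

-- truncate removes all rows but keeps the table definition.
truncate table student_backup;
```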

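3. DQL: Data Queries

(1) Basic queries

select column_list|* from table_name;

Meaning:

From (FROM) the table, select (SELECT) certain columns to display.

select column_list|* from table_name where condition;

AND and OR combine multiple filter conditions. AND is evaluated before OR, so when an expression mixes several AND and OR operators, use () to make the intended precedence explicit.

The IN operator matches against a set of values; it can also be followed by a SELECT clause, in which case it matches against the values returned by the subquery.

The NOT operator negates a condition.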
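
To make the precedence rule concrete, here is a small sketch. The tables student(id, name, age, city) and blacklist(student_id) are assumed to exist purely for illustration.

```sql
-- Hypothetical tables: student(id, name, age, city), blacklist(student_id).

-- AND binds tighter than OR, so this reads as:
--   city = 'Beijing' OR (city = 'Shanghai' AND age > 20)
select * from student
where city = 'Beijing' or city = 'Shanghai' and age > 20;

-- Parentheses force the OR to be evaluated first.
select * from student
where (city = 'Beijing' or city = 'Shanghai') and age > 20;

-- IN with a literal list, and NOT IN with a subquery.
select * from student where city in ('Beijing', 'Shanghai');
select * from student
where id not in (select student_id from blacklist);
```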

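(2) Grouping and aggregation (group by)

select column|aggregate_function from table_name [where condition] group by column;

Aggregate functions:

	- sum(col): sum
	- avg(col): average
	- min(col): minimum
	- max(col): maximum
	- count(col|*): row count

Note: outside of aggregate functions, the select list may only contain columns that appear in group by.

having filters groups
- rows are grouped first, then the groups are filtered (where, by contrast, filters rows before grouping)

select col, count(*) as num
from table_name
where condition
group by col
having condition;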
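
The difference between where and having is easiest to see in one query. This sketch assumes a hypothetical table student(id, name, gender, age, city).

```sql
-- Hypothetical table: student(id, name, gender, age, city).
select city, count(*) as num, avg(age) as avg_age
from student
where age >= 18          -- row filter, applied before grouping
group by city
having count(*) > 10;    -- group filter, applied after aggregation
```

Only adult students are counted, and only cities with more than ten of them appear in the result.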

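(3) Sorting and pagination (order by, limit)

select column|aggregate_function|* from table_name

where ...

group by ...

order by ... [ASC | DESC]

ASC sorts in ascending order (the default)

DESC sorts in descending order

LIMIT caps the number of rows returned or pages through the result:

select column|aggregate_function|* from table_name

where ...

group by ...

order by ... [ASC | DESC]

limit n[, m]

– limit n returns at most n rows, e.g. limit 5 returns five rows

– limit 10, 5 skips the first 10 rows and returns the next 5 (rows 11 to 15)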
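
Since the first argument of the two-argument limit is a zero-based offset, paging can be sketched as follows. The student table and the page size of 5 are assumptions for the example.

```sql
-- Hypothetical table: student(id, name, age).
-- Page size 5; page k (1-based) starts at offset (k - 1) * 5.
select * from student order by id limit 0, 5;   -- page 1: rows 1-5
select * from student order by id limit 5, 5;   -- page 2: rows 6-10
select * from student order by id limit 10, 5;  -- page 3: rows 11-15

-- Top three oldest students.
select name, age from student order by age desc limit 3;
```

An explicit order by matters here: without it the row order is not guaranteed, so pages could overlap.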

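(4) Multi-table queries (from, inner join)

Multiple tables in FROM
- select … from table1 [as alias1], table2 [as alias2], …, tableN [where join_condition];
  
  List several tables directly in FROM; as assigns each one an alias.
INNER JOIN
- select … from table1 [as alias1] [inner] join table2 [as alias2] on join_condition;
  
  inner join: only rows with a match in both tables (the intersection)
  
  left join: left outer join, all rows of the left table plus the matching rows of the right table
  
  right join: right outer join, all rows of the right table plus the matching rows of the left table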
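
The join styles above can be compared side by side. The tables student(id, name, class_id) and classroom(id, class_name) are hypothetical.

```sql
-- Hypothetical tables: student(id, name, class_id), classroom(id, class_name).

-- Implicit join: tables listed in FROM, join condition in WHERE.
select s.name, c.class_name
from student as s, classroom as c
where s.class_id = c.id;

-- The same result written as an explicit inner join.
select s.name, c.class_name
from student as s
inner join classroom as c on s.class_id = c.id;

-- Left join keeps students without a classroom; class_name is NULL for them.
select s.name, c.class_name
from student as s
left join classroom as c on s.class_id = c.id;
```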

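(5) Wildcards (like)

% matches zero or more arbitrary characters;
_ matches exactly one arbitrary character
[] matches any single character from a set, e.g. [ab] matches a or b, and a leading ^ negates the set. This bracket syntax belongs to SQL Server's like; MySQL's like only understands % and _, so in MySQL the same filter is written with regexp.

select *
from table_name
where column_name regexp '^[^AB]'; -- any text that does not start with A or B

(6) Deduplicated queries

distinct

select distinct col1, col2
from table_name;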

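4. Constraints

create table [if not exists] table_name(
	column1 type[(width)] [constraint] [comment 'column description'],
	column2 type[(width)] [constraint] [comment 'column description'],
	column3 type[(width)] [constraint] [comment 'column description']
) [table options];

Concept:

A constraint is a rule that restricts the data a table may contain.

(1) Primary key constraint (primary key)

– Purpose: the primary key uniquely identifies each row

– Columns covered by a primary key are non-null and unique

Note:

A table can have only one primary key; a composite primary key still counts as one.

Create a primary key

create table table_name(
	...
	<column_name> <data_type> primary key,
	...
);

create table table_name(
	...
	[constraint <constraint_name>] primary key (column_name)
);

Composite primary key

Note: a composite primary key cannot be declared inline after a column name.

No column of a composite primary key may be null.
```
create table table_name(
	...
	[constraint <constraint_name>] primary key(column1, column2, ..., columnN)
);
```

Add a primary key by altering the table

create table table_name(
	...
);
alter table <table_name> add primary key(column_list);

create table table_name(
	...
);
alter table <table_name> add primary key(column1, column2);

Drop the primary key

alter table table_name drop primary key;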
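
The three ways of declaring a primary key can be sketched in MySQL as follows. All table and column names are invented for the example.

```sql
-- Hypothetical tables, for illustration only.

-- Single-column primary key declared inline.
create table student(
    id int primary key,
    name varchar(20)
);

-- Composite primary key: must be declared as a separate clause.
create table score(
    student_id int,
    course_id int,
    score float,
    primary key(student_id, course_id)
);

-- Adding a primary key after the table exists, then removing it.
create table course(id int, title varchar(50));
alter table course add primary key(id);
alter table course drop primary key;
```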

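(2) Auto-increment constraint (auto_increment)

(3) Not-null constraint (not null)

(4) Unique constraint (unique)

(5) Default constraint (default)

(6) Zero-fill constraint (zerofill)

(7) Foreign key constraint (foreign key)

– Add a foreign key

create table table_name(
	column_name data_type,
	...
	[constraint <fk_name>] foreign key(fk_column) references parent_table(parent_column)
);

alter table table_name add constraint fk_name foreign key(fk_column) references parent_table(parent_column);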
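
Several of the constraints listed above can be combined in one sketch. The dept and emp tables are hypothetical, and the example assumes the InnoDB storage engine, which is the MySQL engine that enforces foreign keys.

```sql
-- Hypothetical parent and child tables (InnoDB assumed).
create table dept(
    id int primary key auto_increment,
    name varchar(20) not null unique
);

create table emp(
    id int primary key auto_increment,
    name varchar(20) not null,
    salary int default 0,
    dept_id int,
    constraint fk_emp_dept foreign key(dept_id) references dept(id)
);

-- On a freshly created table the first auto-increment value is 1.
insert into dept(name) values('sales');
insert into emp(name, dept_id) values('Tom', 1);  -- accepted: dept 1 exists
-- insert into emp(name, dept_id) values('Ann', 99);  -- would be rejected: no dept 99
```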

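II. Hive

Hive consists of two parts: the metadata management service (metastore) and the SQL interpreter.

Start the metastore service (required; Hive cannot work without it)
Leave HDFS safe mode first
```
hdfs dfsadmin -safemode leave
```
Note: run the following commands from the Hive installation directory.

Foreground: bin/hive --service metastore

Background: nohup bin/hive --service metastore >> logs/metastore.log 2>&1 &
Start a client
- Hive Shell (SQL can be typed directly)
  - bin/hive
- Hive ThriftServer (SQL cannot be typed directly; an external client has to connect to it)
  - bin/hive --service hiveserver2
  - Run it in the background:
```
nohup bin/hive --service hiveserver2 >> logs/hiveserver2.log 2>&1 &
```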

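1. Trying Out Hive

Run bin/hive to enter the Hive Shell, where SQL statements can be executed directly.

Create a table

create table test(id int, name string, gender string);

Insert data

insert into test values(1, '王力红','男');

Query data

select gender,count(*) as cnt from test group by gender;

The data of databases and tables created in Hive is stored in HDFS, by default under:

hdfs://node1:8020/user/hive/warehouse

Built-in client (connects to hiveserver2)

bin/beeline

!connect jdbc:hive2://node1:10000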

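2. Hive Database Operation Syntax

(1) Database operations

(1.1) Create a database

create database if not exists myhive;

use myhive;

(1.2) Show detailed information about a database

desc database myhive;

(1.3) Create a database and specify its HDFS storage location

create database myhive2 location '/myhive2';

(1.4) Drop a database; this fails if the database still contains tables

drop database myhive;

(1.5) Force-drop a database together with the tables it contains

drop database myhive2 cascade;

A Hive database is essentially a directory on HDFS.
By default it is stored under /user/hive/warehouse.
The location keyword specifies a different storage directory at creation time.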

(2) Table operations

(2.1) Create a table

create [external] table [if not exists] table_name

	[(col_name data_type [comment col_comment], ...)]

	[comment table_comment]

	[partitioned by (col_name data_type [comment col_comment], ...)]

	[clustered by (col_name, col_name, ...)

	[sorted by (col_name [asc|desc], ...)] into num_buckets BUCKETS]

	[row format row_format]

	[stored as file_format]

	[location hdfs_path]

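(2.2) Internal and external tables

Internal table (create table table_name …)

Dropping an internal table deletes both the metadata and the stored data, so internal tables are not suitable for sharing data with other tools.
External table (create external table table_name … location)

The data of an external table can live anywhere; the path is given with the location keyword. Because the data is stored elsewhere, the table is not managed by Hive itself and can be attached to external data as needed.

Therefore, dropping an external table deletes only the metadata (the table definition), not the data itself.
- Create an external table
```
create external table test_ext1(id int,name string) row format delimited fields terminated by '\t' location 'path';
```
Specify a custom delimiter
- row format delimited fields terminated by '\t' splits fields on \t
Show detailed table information, including the table type
```
desc formatted stu;
```

Convert between table types

alter table table_name set tblproperties('EXTERNAL'='TRUE');  -- internal to external; use 'FALSE' for external to internal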
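
The practical consequence of the internal/external distinction shows up when a table is dropped. The sketch below uses a made-up table name test_ext2 and a made-up HDFS path /tmp/test_ext2.

```sql
-- Hypothetical table name and HDFS path, for illustration only.
create external table test_ext2(id int, name string)
row format delimited fields terminated by '\t'
location '/tmp/test_ext2';

-- The "Table Type" row of the output should read EXTERNAL_TABLE.
desc formatted test_ext2;

-- Dropping the table removes only its metadata;
-- the files under /tmp/test_ext2 stay on HDFS.
drop table test_ext2;

-- Re-creating the table over the same location makes the old files queryable again.
create external table test_ext2(id int, name string)
row format delimited fields terminated by '\t'
location '/tmp/test_ext2';
select * from test_ext2;
```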

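(2.3) Loading data

Create the table

create table table_name(

        dt string comment 'time (HH:mm:ss)',

        user_id string comment 'user ID',

        word string comment 'search term',

        url string comment 'URL visited by the user'

) comment 'search engine log table' row format delimited fields terminated by '\t';

Two ways to load

Load from the local file system (Linux); the file is copied

load data local inpath 'path' into table table_name;

Load from HDFS; the file is moved into the table's directory

load data inpath 'path' into table table_name;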

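(2.4) Loading data with insert select

insert [overwrite|into] table tablename1 [partition (partcol1=val1, ...) [if not exists]] select_statement1 from from_statement;

* overwrite replaces the existing data
* into appends to the existing data

(2.5) Exporting data

Export to the local file system

insert overwrite local directory 'path' select * from table_name;

With a custom delimiter

insert overwrite local directory 'path' row format delimited fields terminated by '\t' select * from table_name;

Export to HDFS

insert overwrite directory 'path' row format delimited fields terminated by '\t' select * from table_name;

Difference

– with local: written to the local file system

– without local: written to HDFS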

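(2.6) Partitioned tables

A partition is simply a separate directory on HDFS.

Partitioned tables can greatly improve Hive's performance in specific scenarios, such as queries that filter on the partition column.

Basic syntax

create table table_name(...) partitioned by (partition_col col_type, ...) row format delimited fields terminated by '\t';

Load data

load data [local] inpath 'path' into table table_name partition(partition_col='value', ...);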
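
A minimal end-to-end sketch of a partitioned table might look like this. The table score, the partition column dt, and the local file paths are all assumptions made for the example.

```sql
-- Hypothetical table, partition values, and local file paths.
create table score(id string, course string, score int)
partitioned by (dt string)
row format delimited fields terminated by '\t';

-- Each load lands in its own directory, e.g. .../score/dt=2024-01-01
load data local inpath '/tmp/score_20240101.txt'
into table score partition(dt='2024-01-01');
load data local inpath '/tmp/score_20240102.txt'
into table score partition(dt='2024-01-02');

-- List the partitions Hive knows about.
show partitions score;

-- Filtering on the partition column lets Hive read only the matching directory.
select * from score where dt = '2024-01-01';
```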

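(2.7) Bucketed tables

Enable automatic bucketing enforcement
```
set hive.enforce.bucketing=true;
```

Create a bucketed table

create table table_name(col_name col_type, ...) clustered by(col_name) into 3 buckets row format delimited fields terminated by '\t';

Data can be loaded into a bucketed table only through insert select
- Create a temporary staging table
- load data into the staging table
- insert select from the staging table into the bucketed table
```
insert overwrite table bucketed_table select * from staging_table cluster by(col_name);
```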
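
The three-step loading procedure above can be sketched as one script. The tables course and course_staging and the file /tmp/course.txt are hypothetical.

```sql
-- Hypothetical table names and file path, for illustration only.
set hive.enforce.bucketing=true;

-- 1. Bucketed target table: rows are assigned to a bucket by hashing c_id.
create table course(c_id string, c_name string, t_id string)
clustered by(c_id) into 3 buckets
row format delimited fields terminated by '\t';

-- 2. Plain staging table with the same columns.
create table course_staging(c_id string, c_name string, t_id string)
row format delimited fields terminated by '\t';

-- 3. load data only copies or moves files as-is, so it targets the staging table.
load data local inpath '/tmp/course.txt' into table course_staging;

-- 4. insert select runs a job that hashes each row into its bucket file.
insert overwrite table course
select * from course_staging cluster by(c_id);
```

The staging step is needed because load data does no computation, whereas splitting rows into buckets requires hashing every row.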

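(2.8) Altering tables

Rename a table

alter table old_table_name rename to new_table_name;

Change a table property

alter table table_name set tblproperties('property_name'='property_value');

Add a partition

alter table table_name add partition(partition_col='value');

Change a partition value

alter table table_name partition(partition_col='old_value') rename to partition(partition_col='new_value');

Add new columns

alter table table_name add columns(col_name col_type, ...);

Rename a column

alter table table_name change old_col_name new_col_name col_type;

Drop a table
```
drop table table_name;
```
Empty a table (external tables cannot be truncated)
```
truncate table table_name;
```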

(2.9) array type

Create-table statement

create table table_name(name string, work_locations array<string>) row format delimited fields terminated by '\t'
collection items terminated by ',';

– Count the elements in an array column

select name, size(array_col) from table_name;

– Find rows whose array contains a given value

select * from table_name where array_contains(array_col, value);
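
Here is a small sketch of how an array column is declared and queried. The table emp_locations and the sample row are made up for illustration.

```sql
-- Hypothetical table and sample data. A line of the input file would look like
-- (fields separated by a tab, array items by commas):
--   zhangsan<TAB>beijing,shanghai,tianjin
create table emp_locations(name string, work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';

-- Array indexes start at 0: first work location of each person.
select name, work_locations[0] from emp_locations;

-- Number of work locations per person.
select name, size(work_locations) as location_cnt from emp_locations;

-- People who have worked in tianjin.
select * from emp_locations where array_contains(work_locations, 'tianjin');
```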

(2.10) map type

Create-table statements

create table table_name(members map<string,string>) row format delimited fields terminated by ',' 
collection items terminated by '#'
map keys terminated by ':';

create table table_name(
	id int,
	name string,
	members map<string, string>,
	age int
) row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

-- collection items terminated by '#': the separator between key-value pairs
-- map keys terminated by ':': the separator between k and v inside a single pair

Get all keys of a map; the return type is array

select map_keys(map_col) from table_name;

Get all values of a map; the return type is array

select map_values(map_col) from table_name;

size returns the number of elements (K-V pairs) in a map

select size(map_col) from table_name;
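
The map functions above are easier to follow with a concrete row in mind. The table family and its sample line are hypothetical.

```sql
-- Hypothetical table and sample data. A line of the input file would look like:
--   1,lin,father:lin_sr#mother:wang#brother:lin_jr,28
create table family(
    id int,
    name string,
    members map<string, string>,
    age int
) row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

-- Look up one value by key; a missing key yields NULL.
select id, name, members['father'] as father from family;

-- Keys and values as arrays, plus the number of pairs.
select map_keys(members), map_values(members), size(members) from family;

-- Rows whose map has a 'brother' key.
select * from family where array_contains(map_keys(members), 'brother');
```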

(2.11) struct type

– struct is a composite type: a single column holds several sub-columns, each with its own name and type

Create-table statement

create table table_name(
	id string,
	info struct<name:string,age:int>
)
row format delimited
fields terminated by '#'
collection items terminated by ':';
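
Once a struct column exists, its sub-columns are read with dot notation. The sketch below assumes a hypothetical table stu_struct and one sample input line.

```sql
-- Hypothetical table and sample data. A line of the input file would look like:
--   1#zhoujielun:11
create table stu_struct(
    id string,
    info struct<name:string, age:int>
)
row format delimited
fields terminated by '#'
collection items terminated by ':';

-- Sub-columns are addressed with dot notation.
select id, info.name, info.age from stu_struct;

-- They can be used in filters like ordinary columns.
select * from stu_struct where info.age > 10;
```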