Big Data Training Notes, Day 07

旧城^以西

Tags: java hive big data database mysql

Article link: https://blog.csdn.net/linhan123321/article/details/119121053

Hive

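Introduction

Points to note

In Hive, every database corresponds to a directory on HDFS
Hive does not enforce primary keys, so a table definition does not need one
The field delimiter should be specified when the table is created (if omitted, the default is the non-printing character \001); it can be changed afterwards with alter table ... set serdeproperties, but that only changes how the existing files are read and does not rewrite them
When inserting data, insert into appends to the existing data, while insert overwrite clears the existing data before writing the new data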

Basic operations

SQL	Description
show databases;	List all databases
create database hivedemo;	Create a database
drop database demo;	Drop a database
use hivedemo;	Switch to a database
create table person (id int, name string, age int, gender string);	Create the person table
insert into person values(1, 'Sam', 19, 'male');	Insert a row
select * from person;	Query the data
load data local inpath '/opt/hivedemo/person' into table person;	Load data from a local file
drop table person;	Drop the table
create table person (id int, name string, age int, gender string) row format delimited fields terminated by ' ';	Specify the field delimiter when creating the table
desc person;	Describe the table structure
create table p2 like person;	Create table p2 with the same structure as person
insert into table p2 select * from person where age >= 18;	Select the rows of person with age >= 18 and append them to p2
from person insert overwrite table p2 select * where gender = 'male' insert into table p3 select * where age < 18;	In one pass over person, overwrite p2 with the male rows and append the rows with age < 18 to p3 (p3 must already exist)
insert overwrite local directory '/opt/hivedata' row format delimited fields terminated by '\t' select * from person where age >= 18;	Write the rows of person with age >= 18 to a local directory
insert overwrite directory '/person' row format delimited fields terminated by ',' select * from person where gender = 'female';	Write the female rows of person to the /person directory on HDFS
alter table person rename to p1;	Rename the table
alter table p1 add columns(height double);	Add a column to an existing table
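
The multi-insert statement in the table above (from person insert overwrite ... insert into ...) reads the source once and feeds two targets. As a rough illustration only, the Python sketch below models tables as lists of dicts; the sample rows other than Sam are made up, and nothing here is Hive code.

```python
# Toy model of Hive's multi-insert: one scan of `person` feeds two targets.
# Tables are plain lists of dicts; all rows except Sam are hypothetical.

def multi_insert(person, p2, p3):
    """Return the new contents of p2 and p3 after one pass over person."""
    new_p2 = []          # insert overwrite: the previous rows of p2 are discarded
    new_p3 = list(p3)    # insert into: the previous rows of p3 are kept
    for row in person:   # the source table is read only once
        if row["gender"] == "male":
            new_p2.append(row)
        if row["age"] < 18:
            new_p3.append(row)
    return new_p2, new_p3


person = [
    {"id": 1, "name": "Sam", "age": 19, "gender": "male"},
    {"id": 2, "name": "Amy", "age": 16, "gender": "female"},
    {"id": 3, "name": "Bob", "age": 17, "gender": "male"},
]
old_p2 = [{"id": 9, "name": "Old", "age": 40, "gender": "male"}]
old_p3 = [{"id": 8, "name": "Kid", "age": 10, "gender": "female"}]

p2, p3 = multi_insert(person, old_p2, old_p3)
print([r["name"] for r in p2])
print([r["name"] for r in p3])
```

A row can land in both targets (Bob here), because each insert clause applies its own where condition independently.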

Data types

Overview

Hive provides a fairly rich set of data types, which fall into two groups: basic types and complex types
Basic types

Hive type	Java type
tinyint	byte
smallint	short
int	int
bigint	long
float	float
double	double
boolean	boolean
string	String
timestamp	Timestamp
binary	byte[]
Complex types: array, map, struct

Complex types

array: an array type, corresponding to a Java array or collection

Raw data

1 lucy,lily david,evan
2 adair,bruce,lee simon,tony,tom,rose
3 bob,alex,cindy frank,fred
4 henry,william kite,job,thomas

Create the table

create table battles (id int, groupa array<string>, groupb array<string>) row format delimited fields terminated by ' ' collection items terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/battles' into table battles;

Select elements that are not null

select groupa[2] from battles where groupa[2] is not null;
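
To make the two delimiters concrete, here is a small Python sketch of how a raw line maps onto the battles columns and why groupa[2] is null for short arrays. The helper names parse_battle and element_at are invented for this illustration; Hive does this work internally.

```python
# Sketch: split a raw line by ' ' into fields, then by ',' into array items.

def parse_battle(line):
    """Parse one raw line according to the battles table definition."""
    id_text, groupa_text, groupb_text = line.split(" ")
    return {
        "id": int(id_text),
        "groupa": groupa_text.split(","),
        "groupb": groupb_text.split(","),
    }


def element_at(array, index):
    """Mimic Hive indexing: an out-of-range index gives NULL (None here)."""
    return array[index] if 0 <= index < len(array) else None


raw = [
    "1 lucy,lily david,evan",
    "2 adair,bruce,lee simon,tony,tom,rose",
    "3 bob,alex,cindy frank,fred",
    "4 henry,william kite,job,thomas",
]
rows = [parse_battle(line) for line in raw]

# Equivalent of: select groupa[2] from battles where groupa[2] is not null;
result = [element_at(r["groupa"], 2) for r in rows
          if element_at(r["groupa"], 2) is not None]
print(result)
```

Only rows 2 and 3 have a third element in groupa, so only they pass the filter.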

map: a mapping type, corresponding to Java's Map

Raw data

1 tom,15 sam,17
2 lily,16 lucy,16
3 david,14 danny,15
4 frank,19 fred,19
5 henry,17 hack,18

Create the table

create table groups (groupid int, membera map<string,int>, memberb map<string,int>) row format delimited fields terminated by ' ' map keys terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/groups' into table groups;

Query the data

select membera['frank'] from groups where membera['frank'] is not null;
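
The following Python sketch shows how each field of the sample data becomes a one-entry map and how a lookup on a missing key yields null. It assumes, as in the sample data, that every field holds a single key,value pair; parse_group is a name made up for this sketch.

```python
# Sketch: each space-separated field is "key,value" and becomes a one-entry map.

def parse_group(line):
    """Parse one raw line according to the groups table definition."""
    id_text, a_text, b_text = line.split(" ")

    def to_map(field):
        key, value = field.split(",")   # map keys terminated by ','
        return {key: int(value)}

    return {"groupid": int(id_text),
            "membera": to_map(a_text),
            "memberb": to_map(b_text)}


raw = [
    "1 tom,15 sam,17",
    "2 lily,16 lucy,16",
    "3 david,14 danny,15",
    "4 frank,19 fred,19",
    "5 henry,17 hack,18",
]
rows = [parse_group(line) for line in raw]

# Equivalent of: select membera['frank'] from groups where membera['frank'] is not null;
# dict.get returns None for a missing key, standing in for Hive's NULL.
result = [r["membera"].get("frank") for r in rows
          if r["membera"].get("frank") is not None]
print(result)
```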

struct: a structure type, corresponding to a Java object

Raw data

1 tom,19,male,182.5,68.7
2 tony,18,male,181.3,70.2
3 thomas,18,male,183.6,79.1
4 vincent,17,female,165.9,50.1

Create the table

create table infos (id int, info struct<name:string, age:int, gender:string, height:double, weight:double>) row format delimited fields terminated by ' ' collection items terminated by ',';

Load the data

load data local inpath '/opt/hivedemo/infos' into table infos;

Query the data

select info.age from infos where info.name = 'vincent';
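
A struct differs from an array in that each position has a name and its own type. The Python sketch below mirrors the infos table with a NamedTuple; the class and function names are illustrative choices, not anything Hive defines.

```python
# Sketch: a struct is a fixed sequence of named, typed fields.
from typing import NamedTuple


class Info(NamedTuple):
    name: str
    age: int
    gender: str
    height: float
    weight: float


def parse_info(line):
    """Parse one raw line according to the infos table definition."""
    id_text, info_text = line.split(" ")
    name, age, gender, height, weight = info_text.split(",")
    return int(id_text), Info(name, int(age), gender, float(height), float(weight))


raw = [
    "1 tom,19,male,182.5,68.7",
    "2 tony,18,male,181.3,70.2",
    "3 thomas,18,male,183.6,79.1",
    "4 vincent,17,female,165.9,50.1",
]
rows = [parse_info(line) for line in raw]

# Equivalent of: select info.age from infos where info.name = 'vincent';
result = [info.age for _, info in rows if info.name == "vincent"]
print(result)
```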

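Table structure

Managed (internal) and external tables

A table that is created in Hive and populated through Hive (with insert or load) is called a managed (internal) table
A table that is created in Hive to manage data that already exists on HDFS is called an external table

Creating an external table

create external table orders (orderid int, orderdate string, productid int, num int) row format delimited fields terminated by ' ' location '/orders';

The following commands show whether a table is managed or external
```
desc extended p1;
-- or
desc formatted p1;
```
If Table Type is MANAGED_TABLE, the table is managed; if Table Type is EXTERNAL_TABLE, the table is external
When a managed table is dropped, its directory on HDFS is deleted with it; when an external table is dropped, its directory on HDFS is kept
In production, external tables are typically used for the initial collection and management of data, while managed tables are used for most of the later processing and analysis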
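
The difference on drop can be modelled in a few lines of Python. This is a toy stand-in for the metastore and HDFS, not how Hive is implemented; the class ToyWarehouse and the p1 directory path are assumptions made for the example.

```python
# Toy model: dropping a table always removes its metadata,
# but removes the data directory only for managed tables.

class ToyWarehouse:
    def __init__(self):
        self.tables = {}    # table name -> (table type, directory)
        self.dirs = set()   # directories that exist on the toy file system

    def create_table(self, name, directory, external=False):
        table_type = "EXTERNAL_TABLE" if external else "MANAGED_TABLE"
        self.tables[name] = (table_type, directory)
        self.dirs.add(directory)

    def drop_table(self, name):
        table_type, directory = self.tables.pop(name)
        if table_type == "MANAGED_TABLE":
            self.dirs.discard(directory)   # data goes away with the table
        # for EXTERNAL_TABLE the directory is left untouched


wh = ToyWarehouse()
wh.create_table("p1", "/user/hive/warehouse/hivedemo.db/p1")
wh.create_table("orders", "/orders", external=True)

wh.drop_table("p1")
wh.drop_table("orders")
print(sorted(wh.dirs))
```

After both drops no table metadata is left, yet /orders still exists, which is why external tables suit data that other tools also depend on.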

Partitioned tables

Partitioned tables are used to categorize data

Creating a partitioned table

create table cities (id int, name string) partitioned by (province string) row format delimited fields terminated by ' ';

Load the data

load data local inpath '/opt/hivedemo/hebei' into table cities partition(province = 'hebei');
load data local inpath '/opt/hivedemo/henan' into table cities partition(province = 'henan');

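In Hive, each partition forms its own directory on HDFS
When a query on a partitioned table filters on the partition column, Hive reads only the matching partition directories, so it runs faster than the same query on an unpartitioned table; when a query spans partitions, the unpartitioned table can be the faster of the two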
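
The speed difference comes from how many directories have to be read. The sketch below is a simplified Python model of that choice; the function names are invented, and the base path reuses the warehouse location that appears elsewhere in these notes.

```python
# Sketch: a filter on the partition column narrows the set of directories to read.

def partition_dir(table_dir, province):
    """Directory name Hive uses for one partition: <column>=<value>."""
    return f"{table_dir}/province={province}"


def dirs_to_scan(table_dir, partitions, province=None):
    """Return the directories a query must read.

    province=None models a query without a partition filter.
    """
    if province is None:
        return [partition_dir(table_dir, p) for p in partitions]
    return [partition_dir(table_dir, p) for p in partitions if p == province]


base = "/user/hive/warehouse/hivedemo.db/cities"
partitions = ["hebei", "henan"]

print(dirs_to_scan(base, partitions, province="hebei"))  # one directory
print(dirs_to_scan(base, partitions))                    # every directory
```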

Add a partition manually

alter table cities add partition(province = 'guangdong') location '/user/hive/warehouse/hivedemo.db/cities/province=guangdong';

Repair the table

msck repair table cities; -- this command may fail

Rename a partition

alter table cities partition(province = 'shanxi') rename to partition(province = 'test');

Drop a partition

alter table cities drop partition(province = 'test');

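Hive requires that the partition column not appear among the fields of the raw data; its value comes from the partition directory name instead

Dynamic partitioning

Raw data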

1 hebei 邢台
2 hebei 承德
3 shanxi 太原
4 shanxi 大同
5 liaoning 沈阳
6 liaoning 大连
7 jilin 长春
8 liaoning 鞍山
9 shanxi 阳泉
10 liaoning 抚顺

Create a staging table in Hive to hold the raw data

create table cities_tmp (tid int, tprovince string, tname string) row format delimited fields terminated by ' ';

Load the data into the staging table

load data local inpath '/opt/hivedemo/cities' into table cities_tmp;

Turn off strict mode

set hive.exec.dynamic.partition.mode=nonstrict;

Select from the unpartitioned table and insert into the partitioned table

insert into table cities partition(province) select tid, tname, tprovince from cities_tmp distribute by tprovince;

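Hive supports partitioning on multiple fields. The directory created for an earlier field contains the directories created for the next field, which gives a multi-level directory tree. In practice this is used for multi-level classification, for example grade and class, or province, city and county

Raw data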

1 1 1 tom
1 1 2 sam
1 1 3 bob
1 1 4 alex
1 2 1 bruce
1 2 2 cindy
1 2 3 jack
1 2 4 john
2 1 1 tex
2 1 2 helen
2 1 3 charles
2 1 4 frank
2 2 1 david
2 2 2 simon
2 2 3 lucy
2 2 4 lily

Create a table in Hive to manage the data

create table students_tmp (tgrade int, tclass int, tid int, tname string) row format delimited fields terminated by ' ';

Load the data

load data local inpath '/opt/hivedemo/students' into table students_tmp;

Sample a few rows to confirm the data loaded correctly

select * from students_tmp tablesample(5 rows);

Create the partitioned table

create table students (id int, name string) partitioned by (grade int, class int) row format delimited fields terminated by '\t';

Turn off strict mode

set hive.exec.dynamic.partition.mode=nonstrict;

Dynamic partitioning, this time on multiple fields

insert into table students partition(grade, class) select tid, tname, tgrade, tclass from students_tmp distribute by tgrade, tclass;
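
In a dynamic-partition insert, the trailing columns of the select list supply the partition values by position, which is why tgrade and tclass come last. The Python sketch below imitates the resulting directory layout for a few of the sample rows; dynamic_partition is a hypothetical helper written for this note.

```python
# Sketch: route each row to a nested directory built from its trailing columns.
from collections import defaultdict


def dynamic_partition(rows, partition_names):
    """Group rows by partition path.

    The last len(partition_names) values of each row are the partition
    values; the remaining values are the data stored in the partition.
    """
    n = len(partition_names)
    layout = defaultdict(list)
    for row in rows:
        data, part_values = row[:-n], row[-n:]
        path = "/".join(f"{k}={v}" for k, v in zip(partition_names, part_values))
        layout[path].append(data)
    return dict(layout)


# A few lines of the raw data, in file order: grade class id name
raw = ["1 1 1 tom", "1 1 2 sam", "1 2 1 bruce", "2 1 1 tex", "2 2 4 lily"]

# Reorder to match: select tid, tname, tgrade, tclass
rows = []
for line in raw:
    grade, klass, sid, name = line.split(" ")
    rows.append((int(sid), name, int(grade), int(klass)))

layout = dynamic_partition(rows, ["grade", "class"])
for path in sorted(layout):
    print(path, layout[path])
```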

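Bucketed tables

Bucketed tables are used to sample data
The more buckets the data is split into, the more memory a job uses
In the Hive version used in these notes, bucketing is not enforced by default and has to be switched on (newer releases, from 2.0 onwards, dropped this setting and always enforce bucketing)
```
set hive.enforce.bucketing = true;
```

Create a bucketed table

create table cities_bucket (id int, name string) clustered by (name) into 3 buckets row format delimited fields terminated by '\t';

Add data to the bucketed table. Note that data can only be added with insert, not with load, because load moves files into place without distributing rows into buckets
```
insert overwrite table cities_bucket select id, name from cities;
```

Sample data from the buckets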

select * from cities_bucket tablesample(bucket 1 out of 2 on name);
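
Bucketing assigns each row to a bucket from a hash of the clustered column, and tablesample(bucket x out of y on name) keeps the rows whose hash falls in slot x of y. The Python sketch below assumes the classic Java-style 31-based string hash; newer Hive versions may use a different hash function, so the exact bucket numbers are illustrative only. The function names and the ASCII city names are made up for the example.

```python
# Sketch of hash bucketing and bucket sampling.
# ASSUMPTION: a Java-style 31-based hash over the UTF-8 bytes of the value.

def java_style_hash(text):
    """31-based rolling hash, wrapped to a signed 32-bit integer."""
    h = 0
    for byte in text.encode("utf-8"):
        signed = byte - 256 if byte > 127 else byte   # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_of(text, num_buckets):
    """Clear the sign bit, then take the remainder to pick a bucket."""
    return (java_style_hash(text) & 0x7FFFFFFF) % num_buckets


def tablesample(names, x, y):
    """Keep the rows that fall in slot x (1-based) when hashed into y slots."""
    return [n for n in names if bucket_of(n, y) == x - 1]


names = ["xingtai", "chengde", "taiyuan", "datong", "shenyang", "dalian"]

for n in names:
    print(n, "-> bucket", bucket_of(n, 3))      # layout for: into 3 buckets

print(tablesample(names, 1, 2))                 # bucket 1 out of 2 on name
```

Because the bucket depends only on the hashed value, equal names always land in the same bucket, which is what makes sampling by bucket repeatable.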

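Functions

Overview

Hive is built for analysing data, so it provides a very rich set of functions. Use
```
show functions;
```
to list every function in Hive
To see how a particular function is used, run
```
desc function xxx;
```
which prints a description of that function
Functions cannot be run on their own in Hive; they must be called inside a statement such as select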
