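Hive practice notes: static partitions, dynamic partitions, ways to load data, strict mode, and creating databases and tables with related operations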

C_time

Category: Hive. Tags: Hive practice notes

Article link: https://blog.csdn.net/C_time/article/details/100673158


1. First, Hive follows the usual SQL statement structure:

set ...;
with tmp as (...)
select
    ...
from (
    select
    a.id id, -- second-level id
    a.name name
    from test a
    left join test1 b
    on ...
    join ...
    where ...
    group by ...
    having ...
    order by/sort by ...

    union / union all
    ...
) t

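2. Creating a database:
What creating a database really does: it creates a directory named <database name>.db under Hive's warehouse directory.
Hive database names and table names are case-insensitive.
Do not use reserved keywords, or strings that start with a digit, as database or table names.
A column name can be a reserved keyword if it is wrapped in backticks, but this is not recommended. Examples of such keywords:
user order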

create database if not exists qf24;

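3. Switching databases:

use qf24;

Dropping a database:

drop database qf24; -- only an empty database can be dropped
drop database qf24 cascade; -- force drop, including the tables it contains

Show the database currently in use in the prompt: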

hive> set hive.cli.print.current.db=true;

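4. Creating a table: in essence this creates a directory and maps it in the metastore.

If nothing is specified, the defaults (delimiters and so on) apply.

create table qf24.t_user(id int,name string);

create table t_user(id int,name string); -- works with or without the database name prefix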

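5. Hive's default delimiter between columns is ^A. It is a non-printing control character, so it cannot be typed as an ordinary key. In vi/vim insert mode it is entered with Ctrl+V followed by Ctrl+A (not Ctrl+A then Ctrl+V).
Default field delimiter: ^A, \u0001, \001

^B \002 (default collection-item delimiter)

^C \003 (default map-key delimiter)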
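
Because ^A is awkward to type, test data for a table that uses the default delimiter can also be produced by a script. The following is a minimal Python sketch of that idea. The sample rows and the helper names to_hive_line and from_hive_line are made up for illustration and are not part of the original notes.

```python
# Hive's default field delimiter is the control character ^A (\001, \u0001).
FIELD_DELIM = "\x01"

def to_hive_line(fields):
    """Join one row's values with ^A, as Hive's default text format expects."""
    return FIELD_DELIM.join(str(f) for f in fields)

def from_hive_line(line):
    """Split one ^A-delimited line back into its string fields."""
    return line.rstrip("\n").split(FIELD_DELIM)

# Hypothetical sample rows shaped like t_user(id int, name string).
rows = [(1, "zs"), (2, "ls")]
lines = [to_hive_line(r) for r in rows]

print(repr(lines[0]))            # repr() makes the \x01 between the fields visible
print(from_hive_line(lines[1]))  # the two fields come back as strings
```

Writing such lines to a file, one per row, gives something that can be uploaded or loaded into a table created without an explicit row format.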
6. Creating a table with explicit options:

create table if not exists t_user1(
id int comment "this is userid",
name string
)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile -- storage format
location '/user/hive/warehouse/qf24.db/t_user'  -- must be a directory in HDFS; the data already there becomes the table's data
;

clustered by (bucketing) is covered later.

7. CREATE TABLE syntax:

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] TABLENAME(
	[COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...] )

	[COMMENT 'TABLE COMMENT']
	[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]
	[CLUSTERED BY (COLUMNNAME,...) [SORTED BY (COLUMNNAME [ASC|DESC],...)] INTO NUM_BUCKETS BUCKETS]
	[ROW FORMAT ROW_FORMAT]
	[STORED AS FILEFORMAT]
	[LOCATION HDFS_PATH]

;

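Most important is item 3 below: what happens when managed (internal) and external tables are dropped.

1. Differences between managed and external tables:

1. Tables are managed (internal) by default; creating an external table requires the external keyword.
2. External tables are the usual choice (long-lived tables, large data volumes, data whose files should not be deleted). Managed tables suit temporary tables, or cases where it is certain that all of the data (both the files and the metadata) can be discarded after use.
3. Dropping a managed table deletes both the metadata and the table's directory in HDFS. Dropping an external table deletes only the metadata; the data directory in HDFS is kept.

1. Creating an external table just means adding external:

create external table t_user2(id int,name string)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile -- storage format
;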

Settings made with set apply only to the current session.

Enable local mode:

set hive.exec.mode.local.auto=true;

Loading data into Hive tables:

1. Upload the data file straight into the table's directory in HDFS with an hdfs command.

[root@hadoop01 hivedata]# hdfs dfs -put ./t1 /user/hive/warehouse/qf24.db/t_user/

2. When creating the table, use location to point it at the directory that holds the data.

create table if not exists qf24.t_user2(
id int comment "this is userid",
name string
)
row format delimited fields terminated by ' '
lines terminated by '\n'
stored as textfile -- storage format
location '/user/hive/warehouse/qf24.db/t_user'  -- must be a directory in HDFS; the data already there becomes the table's data
;

3. Load data with load data:

load data local inpath '/home/hivedata/t1' into table qf24.t_user; -- local file: copied by default
load data local inpath '/home/hivedata/t' overwrite into table qf24.t_user;
load data inpath '/t1' overwrite into table qf24.t_user;   -- HDFS file: /t1 is moved into the table's directory

4. Use insert into:

Method 1:
set hive.exec.mode.local.auto=true;
insert into table t_user2
select id,name from t_user;

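Method 2 (one scan of the source, multiple inserts; each insert needs its own select):
-- t_user3 is assumed to already exist with columns (id, name)
from t_user
insert into table t_user2
select
id,
name
insert into table t_user3
select
id,
name
;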

from t_user
insert into table t_user2
select
id,
name
where id > 2
;

Method 3:
set hive.exec.mode.local.auto=true;
with tmp as(
select
id,
name
from t_user
)
insert into table t_user2
select * from tmp;

5. Use CTAS (create table ... as select):

create table t_user3
as
select
name
from t_user
;

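6. Use like (clones the table structure only, not the data):

create table t_user4 like t_user2;
-- or, instead of the statement above, clone the structure and point the new table at an existing data directory
create table t_user4 like t_user2 location '/user/hive/warehouse/qf24.db/t_user2';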

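1. Viewing a table's description:

desc t_user2;
desc extended t_user2;
show create table t_user2;  -- same as in MySQL
describe t_user2;

2. Exercise: group by website and compute each site's upstream traffic, downstream traffic, and total traffic.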

CREATE TABLE qf24.log1(
id string COMMENT 'this is id column',
phonenumber bigint,
mac string,
ip string,
url string,
status1 string,
status2 string,
upflow int,
downflow int,
status3 string,
dt string
)
COMMENT 'this is log table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
stored as textfile;

load data local inpath '/home/data.log' into table qf24.log1; -- local file: copied by default


set hive.exec.mode.local.auto=true;
select
l.url url,
round(sum(l.upflow)/1024.0,2) upflows,
round(sum(l.downflow)/1024.0,2) downflows,
round(sum(l.upflow+l.downflow)/1024.0,2) total_flows
from qf24.log1 l
group by l.url
order by total_flows desc
limit 3
;
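
To make the logic of the query easier to check by hand, here is a rough Python sketch of the same steps: group by url, sum the two flow columns, convert to KB, sort by total in descending order, and keep the top 3. The records are invented sample data, not rows from data.log, and top_sites is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up (url, upflow, downflow) records standing in for rows of log1.
records = [
    ("a.example", 1024, 2048),
    ("b.example", 512, 512),
    ("a.example", 1024, 1024),
]

def top_sites(rows, n=3):
    """Group by url, sum up/down flow, convert to KB, sort by total descending."""
    totals = defaultdict(lambda: [0, 0])
    for url, up, down in rows:
        totals[url][0] += up
        totals[url][1] += down
    result = [
        (url,
         round(up / 1024.0, 2),
         round(down / 1024.0, 2),
         round((up + down) / 1024.0, 2))
        for url, (up, down) in totals.items()
    ]
    # Same role as "order by total_flows desc limit n" in the query.
    result.sort(key=lambda r: r[3], reverse=True)
    return result[:n]

for row in top_sites(records):
    print(row)
```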

3. Altering tables:

1. Rename a table: rename to

alter table log1 rename to log3;

Change a column's name, type, or position: change column

There is no before or last; only after and first are supported.

alter table log3 change column id myid int after mac;
alter table log3 change column myid id string first;

Add columns: add columns

alter table log3 add columns(dt1 int,dt2 string);

Drop columns: replace columns (list only the columns to keep)

alter table log3 replace columns(
id string,
phonenumber bigint,
mac string,
ip string,
url string,
status1 string,
status2 string,
upflow int,
downflow int,
status3 string,
dt string
);

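1. Converting between managed and external tables: the key EXTERNAL and the value TRUE must be uppercase; only false may be written in lowercase.

Managed to external:
alter table log3 set tblproperties('EXTERNAL'='TRUE');  -- managed to external; TRUE must be uppercase
External to managed:
alter table log3 set tblproperties('EXTERNAL'='false'); -- the case of false does not matter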

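The point of a partitioned table is query optimization. Filter on the partition columns whenever possible; without them, the query scans everything.

Hive partitioned tables:

Why partition:
To avoid full-table scans and so speed up queries. Without a partition filter, a full-table scan is the default.

What to partition by?
Date, region, or anything else that spreads the data out.

Partitioning syntax:
[PARTITIONED BY (COLUMNNAME COLUMNTYPE [COMMENT 'COLUMN COMMENT'],...)]

Note that partition names are case-sensitive (my guess is that this is because a partition is just a directory in HDFS, and HDFS paths are case-sensitive).
Also, a partition column does not change the table's own columns. It is a pseudo-column, but it can be used in a where clause.

1. Hive partition names are case-sensitive.
2. A Hive partition column is a pseudo-column, but it can be used in queries.
3. A table can have one or more partitions, and a partition can itself contain one or more sub-partitions.
4. Partition columns must be columns that are not in the table's own column list.

So in essence:
A partition is another directory created under the table's directory (or under a parent partition's directory), named column=value (for example dt=2019-09-09).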
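
The directory naming rule can be sketched in a few lines of Python. This only illustrates the column=value layout described above; partition_dir is a hypothetical helper, and the warehouse paths assume the qf24 database and the part1 and part2 example tables used in these notes.

```python
import posixpath

def partition_dir(table_dir, partition_spec):
    """Build a partition's HDFS directory from ordered (column, value) pairs."""
    parts = [f"{col}={val}" for col, val in partition_spec]
    return posixpath.join(table_dir, *parts)

# One partition level: a single dt=... directory under the table directory.
print(partition_dir("/user/hive/warehouse/qf24.db/part1", [("dt", "2019-09-09")]))

# Two partition levels: month=... is nested inside year=...
print(partition_dir("/user/hive/warehouse/qf24.db/part2", [("year", 2019), ("month", 9)]))
```

The nesting order follows the order of the columns in partitioned by, which is why a two-level table ends up with year directories that each contain month directories.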

Example:
Create a table with one partition level:
create table if not exists part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ' '
;

Load data:
load data local inpath '/home/hivedata/t1' overwrite into  table part1 partition(dt='2019-09-09');
load data local inpath '/hivedata/user.txt' into table part1 partition(dt='2018-03-20');

Create a table with two partition levels:
create table if not exists part2(
id int,
name string
)
partitioned by (year int,month int)
row format delimited fields terminated by ' '
;

Load data:
load data local inpath '/home/hivedata/t1' overwrite into  table part2 partition(year=2019,month=9);
load data local inpath '/home/hivedata/t' overwrite into  table part2 partition(year=2019,month=10);


select * from part2 where year=2019 and month=10;

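The data has only two fields, so make up a few rows to try it out. Just check whether the delimiter in the file is a space or \t.

Data cannot be loaded into dynamic partitions with load:

load data local inpath '/hivedata/user.txt' into table dy_part1 partition(dt);  -- this does not work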

Modifying partitions:

1. List partitions
show partitions table_name;


2. Add partitions
alter table part1 add partition(dt='2019-09-10');
alter table part1 add partition(dt='2019-09-13') partition(dt='2019-09-12');
alter table part1 add partition(dt='2019-09-11') location  '/user/hive/warehouse/qf1704.db/part1/dt=2019-09-10';

3. Rename a partition
alter table part1 partition(dt='2019-09-10') rename to partition(dt='2019-09-14');

4. Change a partition's location
alter table part1 partition(dt='2019-09-14') set location '/user/hive/warehouse/qf24.db/part1/dt=2019-09-09';    -- wrong usage
alter table part1 partition(dt='2019-09-14') set location 'hdfs://hadoop01:9000/user/hive/warehouse/qf24.db/part1/dt=2019-09-09';  -- correct: absolute path including the hdfs:// prefix

5. Drop partitions
alter table part1 drop partition(dt='2019-09-14');
alter table part1 drop partition(dt='2019-09-12'),partition(dt='2019-09-13');


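Static partitioning: data is loaded into a partition whose value is given explicitly.
Dynamic partitioning: the partition values are not known in advance; the partitions to create are determined from the values in the data.
Mixed partitioning: static and dynamic partition columns are used together.

Dynamic partition settings (the mode is either strict or nonstrict):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.max.dynamic.partitions=1000;
set hive.exec.max.dynamic.partitions.pernode=100;

strict: at least one partition column must be static.
nonstrict: all partition columns may be dynamic, but the number of dynamic partitions should be estimated beforehand.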
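
The strict and nonstrict rules can be written down as a small check. The Python sketch below is a model of the rule as described in these notes, not Hive's actual implementation; check_partition_spec is a hypothetical helper. It also encodes the ordering rule that static partition columns come before dynamic ones.

```python
def check_partition_spec(spec, mode="strict"):
    """Validate an insert's partition spec.

    spec is an ordered list of (column, value) pairs. A value of None marks
    a dynamic partition column, so partition(year=2019, month) becomes
    [("year", 2019), ("month", None)].
    """
    seen_dynamic = False
    for col, val in spec:
        if val is None:
            seen_dynamic = True
        elif seen_dynamic:
            # A static column may not come after a dynamic one.
            return False, f"static column {col} follows a dynamic column"
    has_static = any(val is not None for _, val in spec)
    if mode == "strict" and not has_static:
        return False, "strict mode needs at least one static partition column"
    return True, "ok"

print(check_partition_spec([("year", 2019), ("month", None)]))  # mixed spec
print(check_partition_spec([("dt", None)]))                     # all dynamic, strict
print(check_partition_spec([("dt", None)], mode="nonstrict"))   # all dynamic, nonstrict
```

The mixed spec passes in strict mode, while the fully dynamic partition(dt) spec only passes once the mode is nonstrict.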

Example:
create table dy_part1(
id int,
name string
)
partitioned by (dt string)
row format delimited fields terminated by ' '
;

load data local inpath '/home/hivedata/t1' overwrite into  table dy_part1 partition(dt='2019-09-09');

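set hive.exec.mode.local.auto=true;
-- every partition column in this insert is dynamic, so nonstrict mode is required
set hive.exec.dynamic.partition.mode=nonstrict;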
insert into table dy_part1 partition(dt)
select
id,
name,
dt
from part1
;

Mixed partitioning:
create table if not exists dy_part2(
id int,
name string
)
partitioned by (year int,month int)
row format delimited fields terminated by ' '
;

set hive.exec.mode.local.auto=true;
set hive.exec.dynamic.partition.mode=strict;
insert into table dy_part2 partition(year=2019,month)
select
id,
name,
month
from part2
where year=2019
;

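Notes on partitioned tables:

1. Hive partitions use columns outside the table's own column list. A partition column is a pseudo-column, but it can be used to filter queries.
2. Chinese text is not recommended for partition columns.
3. Dynamic partitioning is generally not recommended, because it runs MapReduce jobs to query the data, and too many partitions can turn the NameNode and ResourceManager into performance bottlenecks. Estimate the number of partitions as well as possible before using it.
4. Partition properties can all be changed by modifying the metadata and the data in HDFS.

The strict mode for dynamic partitions is not the same thing as Hive's strict mode.

Hive's strict mode:

It can be changed permanently in the configuration file; here the default is nonstrict.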

 <property>
    <name>hive.mapred.mode</name>
    <value>nonstrict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>

Strict mode blocks 5 kinds of queries:

1. Cartesian products
set hive.mapred.mode=strict;
select
*
from dy_part1 d1
join dy_part2 d2
;

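2. Queries on a partitioned table with no partition-column filter (the first query below filters on dt, so it is allowed)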
set hive.mapred.mode=strict;
select
*
from dy_part1 d1
where d1.dt='2019-09-09'
;

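Not allowed, because there is no filter on a partition column:
select
*
from dy_part1 d1
where d1.id > 2
;

-- allowed: year is a partition column of dy_part2
select
*
from dy_part2 d2
where d2.year >= 2019
;

3. order by without limit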
select
*
from log3
order by id desc
;

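4. Comparing bigints and strings.

5. Comparing bigints and doubles.

Hive's read/write model:

Hive is strictly schema-on-read. Data is not validated when it is written; at read time, values that do not match the schema are replaced with NULL.
MySQL is schema-on-write. Data is checked when it is written, and an error is raised if it is not acceptable.

load data local inpath '/home/hivedata/t' into  table t_user; -- Hive: loaded as-is, nothing is validated on write
insert into stu(id,sex) value(1,abc); -- MySQL: checked on write, so this fails with an error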
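
Schema-on-read can be imitated in a few lines. The Python sketch below is only a model of the behavior described above, not how Hive is implemented: each line is split on the delimiter, and any field that is missing or cannot be converted to the declared type becomes None, which stands in for NULL. read_row and the sample lines are hypothetical.

```python
def read_row(line, schema, delim=" "):
    """Parse one text line against (name, caster) pairs; bad or missing values become None."""
    raw = line.rstrip("\n").split(delim)
    row = {}
    for i, (name, cast) in enumerate(schema):
        try:
            row[name] = cast(raw[i])
        except (ValueError, IndexError):
            row[name] = None  # plays the role of NULL
    return row

# Shaped like t_user(id int, name string) with a space delimiter.
schema = [("id", int), ("name", str)]

print(read_row("1 zs", schema))    # well-formed line
print(read_row("abc ls", schema))  # id is not an int, so it reads back as None
print(read_row("3", schema))       # name is missing, so it reads back as None
```

Nothing fails while the file is being stored; the mismatch only shows up as None values when the rows are read.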
