Hive Basic Operations (Part 2)

weizhouck

Article link: https://blog.csdn.net/WandaZw/article/details/82773347

DDL Operations

Table creation syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
   [(col_name data_type [COMMENT col_comment], ...)]
   [COMMENT table_comment]
   [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
   [CLUSTERED BY (col_name, col_name, ...)
   [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
   [ROW FORMAT row_format]
   [STORED AS file_format]
   [LOCATION hdfs_path]
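
To make the grammar above concrete, here is a minimal sketch of one statement that uses most of the optional clauses together. The table name t_access_log, its columns, and the bucket count are hypothetical and chosen only for illustration; they do not appear elsewhere in this article.

```sql
-- Hypothetical table: names, types and the bucket count are illustrative only.
create table if not exists t_access_log (
  id int comment 'user id',
  url string comment 'requested url'
)
comment 'example table using most optional clauses'
partitioned by (dt string comment 'day the row belongs to')
clustered by (id) sorted by (id asc) into 4 buckets
row format delimited fields terminated by ','
stored as textfile;
```

LOCATION is left out here, so the table directory would be created under the warehouse directory; the external table example later in this article shows LOCATION in use.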

1: Create a managed (internal) table

create table if not exists t_user (id int,name string,age int)
row format delimited
fields terminated by ',';

On Linux, upload a local file into the table's directory on HDFS; its rows then show up in the table

hadoop fs -put t_user.txt /user/hive/warehouse/t_user

From the Hive command line, load a file from the local Linux filesystem into the table

 load data local inpath '/root/t_user2.txt' into table t_user;

From the Hive command line, load a file that is already on HDFS into the table

load data inpath '/t_user3.txt' into table t_user;

2: Create an external table

create external table if not exists t_ext (id int,name string,age int)
row format delimited
fields terminated by ','
location "/hivedata";

Load data

hadoop fs -put t_user.txt /hivedata
load data local inpath '/root/t_user2.txt' into table t_ext;
load data inpath '/t_user4.txt' into table t_ext;

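Difference: dropping either kind of table removes its metadata from Hive. For a managed table, Hive also deletes the table directory and the data files on HDFS; for an external table, only the metadata is removed and the data files on HDFS are kept.

Note: load data inpath *** into table *** actually moves the HDFS file into the table's directory.

Summary: from a data-safety standpoint, creating or dropping an external table in production does not cause data files to be lost, so external tables are the safer and more reliable choice.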
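
The difference can be checked from the Hive command line. The following is a minimal sketch that reuses the t_ext table and the /hivedata location from the example above; it assumes data files have already been placed in /hivedata.

```sql
-- "Table Type" in the output tells a managed table from an external one.
describe formatted t_ext;

-- Dropping the external table removes only its metadata.
drop table t_ext;

-- The data files should still be listed on HDFS (dfs runs from the Hive CLI).
dfs -ls /hivedata;

-- Re-creating the table over the same location makes the rows queryable again.
create external table if not exists t_ext (id int,name string,age int)
row format delimited
fields terminated by ','
location '/hivedata';
select * from t_ext limit 10;
```

Running the same drop against the managed table t_user would remove its directory under the warehouse as well, which is why the sketch uses t_ext.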

3: Create a partitioned table

create table if not exists t_partition (ip string,duration int)
partitioned by (country string)
row format delimited
fields terminated by ',';

-- Show the table structure
desc t_partition ;

Load data

load data local inpath '/root/t_part' into table t_partition partition(country="China");
select * from t_partition where country="China";

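Summary: a partitioned table stores its data in partition subdirectories under the table directory, one level per partition column. Including the partition column in the query condition avoids a full table scan and makes the query faster.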
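
The directory layout described above can be inspected directly. This sketch assumes the default warehouse path /user/hive/warehouse used earlier in this article and that the country="China" partition has been loaded.

```sql
-- Each partition value gets its own subdirectory, named like country=China.
dfs -ls /user/hive/warehouse/t_partition;

-- Filtering on the partition column lets Hive read only that subdirectory.
select ip, duration from t_partition where country="China";

-- Without a partition filter, every partition directory has to be read.
select ip, duration from t_partition where duration > 10;
```

The partition column country is not stored inside the data files; its value comes from the directory name, yet it can be selected and filtered like an ordinary column.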

4: stored as [ textfile | sequencefile | rcfile ]

create table if not exists t_user3 (id int,name string,age int)
row format delimited
fields terminated by ','
stored as sequencefile;

-- A sequencefile table cannot be filled with load data or by moving text files into its directory; insert the rows with a query instead

insert overwrite table t_user3 select * from t_user;

Summary: the default storage format is textfile. For plain-text data, use stored as textfile; if the data needs to be compressed, stored as sequencefile can be used.
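
The article does not show how compression is switched on for the sequencefile table. One possible setup is sketched below; the property names follow the classic mapred.* naming and the Gzip codec is an assumption, so check them against the Hadoop and Hive versions actually in use.

```sql
-- Assumed session settings; adjust the codec to what the cluster provides.
set hive.exec.compress.output=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rows written by this query into the sequencefile table are then compressed.
insert overwrite table t_user3 select * from t_user;
```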

DML and DDL Operations

1: Add a new partition

alter table t_partition add partition(country="Japan");
load data local inpath "/root/t_part2" into table t_partition partition(country="Japan");
select * from t_partition where country="Japan";

2: List a table's partitions

 show partitions t_partition;

3: Drop a partition

alter table t_partition drop partition(country="Japan");
show partitions t_partition;

4: Rename a table

alter table t_partition rename to t_partition_new;
show tables;

5: Add or replace columns

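-- Add a column
alter table t_partition add columns (city string);
desc t_partition;
-- Change a column type; replace columns redefines the whole column list, so list every column that should remain
alter table t_partition replace columns (ip string, duration string, city string);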
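
When only one column's type needs to change, change column is an alternative to replace columns. This is a minimal sketch on the same t_partition table; the old and new column names are kept identical so that only the type changes.

```sql
-- change column takes: old name, new name, new type.
alter table t_partition change column duration duration string;
desc t_partition;
```

Both statements change only the table metadata; the data files already stored on HDFS are not rewritten.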

6: Load data with overwrite (the existing data is replaced)

insert overwrite table t_user3 select * from t_user;
load data local inpath '/root/t_user.txt' overwrite into table t_user;
load data inpath '/t_user.txt' overwrite into table t_user;

6: Copy an existing table's structure and fill it with existing data

create table t_user4 like t_user;
insert overwrite table t_user4 select * from t_user;

7: Export Hive table data to the local filesystem or to HDFS

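-- Export to the local Linux filesystem (the path is used as a directory)
insert overwrite local directory '/root/t_user5.txt' select * from t_user;
-- Export to HDFS (the path is used as a directory)
insert overwrite directory '/root/t_user5.txt' select * from t_user;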
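
By default the exported files separate fields with Hive's default delimiter (the \001 control character) rather than commas. Newer Hive versions accept a row format clause on the export; the sketch below assumes such a version, and the directory name /root/t_user_export is hypothetical.

```sql
-- Hypothetical target directory; its existing contents are overwritten.
insert overwrite local directory '/root/t_user_export'
row format delimited
fields terminated by ','
select * from t_user;
```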

8: Insert with dynamic partitioning

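-- Switch dynamic partitioning to non-strict mode
set hive.exec.dynamic.partition.mode=nonstrict;
-- The target table must be partitioned by city; the last select column supplies the partition value
insert overwrite table t_user partition(city) select id,name,city from t_user2 where city='USA';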
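
A self-contained variant of the same idea, using the t_partition table defined earlier, might look like the following. The staging table t_partition_src is hypothetical; it holds unpartitioned rows whose last column is the country.

```sql
-- Hypothetical staging table with the partition value as an ordinary column.
create table if not exists t_partition_src (ip string,duration int,country string)
row format delimited
fields terminated by ',';

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Hive creates one partition per distinct country found in the select result.
insert overwrite table t_partition partition(country)
select ip, duration, country from t_partition_src;

show partitions t_partition;
```

The dynamic partition column has to come last in the select list, because Hive matches it by position rather than by name.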

9: Sorting

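order by sorts the entire input globally, so it runs with a single reducer; on large inputs this can take a long time.

sort by is not a global sort; rows are sorted before they are fed to each reducer. If sort by is used with more than one reducer (for example mapred.reduce.tasks > 1), each reducer's output is sorted, but the overall result is not globally ordered.

distribute by (column) sends rows to different reducers according to the given column; the distribution is done by hashing.

cluster by (column) does what distribute by does and additionally sorts by that column.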
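
The four clauses can be compared side by side on the t_user table from earlier. This is a minimal sketch; the reducer count of 2 is an arbitrary choice made so that the difference between per-reducer and global ordering becomes visible.

```sql
-- Assumed setting: ask for two reducers where Hive allows it.
set mapred.reduce.tasks=2;

-- Global order, one reducer regardless of the setting above.
select * from t_user order by age desc;

-- Each reducer's output is ordered by age, the combined result is not.
select * from t_user sort by age desc;

-- Rows with the same age go to the same reducer, then are sorted there.
select * from t_user distribute by age sort by age desc, id;

-- Shorthand for: distribute by age sort by age.
select * from t_user cluster by age;
```

cluster by always sorts ascending on the clustering column, so the distribute by plus sort by form is needed when a descending order or a different sort column is wanted.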
