Hive Basics, Part 1

西门吹水之城

Category: hadoop. Tags: Java, big data

Article link: https://blog.csdn.net/sinat_32692653/article/details/85126953

Creating tables

Regular table

create table ip_table1(ip string, region string, country string,province string,city string,area string,company string)

row format delimited fields terminated by '|';

Load data: load data local inpath '/home/adminuser/app/mydata/ip.txt' into table ip_table1;
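To make `fields terminated by '|'` concrete, here is a rough Python sketch that splits one line of text on `|` and pairs the pieces with the column names of ip_table1. The sample line is made up (the contents of ip.txt are not shown in this article), and Hive's real row parsing handles details such as escaping and type conversion that this ignores.

```python
# Column names taken from the ip_table1 definition above.
COLUMNS = ["ip", "region", "country", "province", "city", "area", "company"]

def parse_line(line, delimiter="|"):
    """Split one text line into a dict keyed by column name."""
    values = line.rstrip("\n").split(delimiter)
    # Pad missing trailing fields with None, loosely mirroring Hive's NULLs.
    values += [None] * (len(COLUMNS) - len(values))
    return dict(zip(COLUMNS, values))

# Hypothetical sample line, for illustration only.
row = parse_line("1.2.3.4|Asia|China|Zhejiang|Hangzhou|Xihu|ExampleTelecom")
print(row["ip"], row["city"])
```

If the delimiter in the file does not match the one declared in the table, the whole line ends up in the first column and the rest come back empty, which is a common cause of unexpected NULLs after a load.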

Partitioned table

create table ip_table2(ip string, region string, country string,province string,city string,area string)

partitioned by(company string)

row format delimited fields terminated by '|';

The partition column must not also appear in the table's regular column list; otherwise Hive reports an error.

Load data: load data local inpath '/home/adminuser/app/mydata/ip1.txt' into table ip_table2 partition(company='1');
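One consequence worth spelling out: the file loaded into a partition holds only the six regular columns, and every row in it takes its company value from the partition(...) clause rather than from the file. A small Python sketch of that idea, using a made-up sample line and a hypothetical helper name `read_partition`:

```python
# Regular (non-partition) columns of ip_table2, as declared above.
DATA_COLUMNS = ["ip", "region", "country", "province", "city", "area"]

def read_partition(lines, partition_spec, delimiter="|"):
    """Parse the lines of one partition and attach the partition values to each row."""
    rows = []
    for line in lines:
        row = dict(zip(DATA_COLUMNS, line.rstrip("\n").split(delimiter)))
        # The partition column is not stored in the file; it comes from the spec.
        row.update(partition_spec)
        rows.append(row)
    return rows

# Hypothetical sample line, for illustration only.
rows = read_partition(["1.2.3.4|Asia|China|Zhejiang|Hangzhou|Xihu"], {"company": "1"})
print(rows[0]["company"])
```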

Creating a table with multi-level partitions:

create table t10(

id int

,name string

,hobby array<string>

,add map<String,string>

)

partitioned by (pt_d string,sex string)

row format delimited fields terminated by ','

collection items terminated by '-'

map keys terminated by ':';
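The three delimiters in this definition nest: `,` separates fields, `-` separates the items of an array or the entries of a map, and `:` separates a map key from its value. The Python sketch below parses one line under those rules; the sample line and the function name `parse_t10_line` are made up for illustration.

```python
def parse_t10_line(line):
    """Parse one line using ',' for fields, '-' for collection items, ':' for map keys."""
    id_text, name, hobby_text, add_text = line.rstrip("\n").split(",")
    hobby = hobby_text.split("-")                      # array<string>
    add = dict(entry.split(":", 1) for entry in add_text.split("-"))  # map<string,string>
    return {"id": int(id_text), "name": name, "hobby": hobby, "add": add}

# Hypothetical sample line, for illustration only.
row = parse_t10_line("1,Tom,reading-music,home:Beijing-work:Shanghai")
print(row["hobby"])
print(row["add"]["work"])
```

Note that the partition columns pt_d and sex do not appear in the line at all; they are supplied when the data is loaded.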

Loading data into a multi-level partition:

load data local inpath '/home/hadoop/Desktop/data' overwrite into table t10 partition ( pt_d = '0',sex='male');

In essence, multi-level partitioning creates a subdirectory inside the directory of the first-level partition.
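The directory nesting can be sketched in a few lines of Python. The warehouse path below is an assumption (it is Hive's default location for tables in the default database, and your installation may use a different one); `partition_dir` is a hypothetical helper name.

```python
import posixpath

def partition_dir(table_dir, partition_spec):
    """Join a table directory with one `column=value` folder per partition level."""
    levels = [f"{column}={value}" for column, value in partition_spec]
    return posixpath.join(table_dir, *levels)

# Assumed location: /user/hive/warehouse is Hive's default warehouse directory.
table_dir = "/user/hive/warehouse/t10"
path = partition_dir(table_dir, [("pt_d", "0"), ("sex", "male")])
print(path)  # /user/hive/warehouse/t10/pt_d=0/sex=male
```

The order of the levels follows the order of the columns in the partitioned by clause, so pt_d is always the outer directory and sex the inner one.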

Adding and dropping partitions:

Add one partition: alter table testljb add partition (age=2);

Add several partitions:

alter table testljb add partition(age=3) partition(age=4);

Add a partition to a table with multi-level partitions:

alter table testljb add partition(age=1,sex='male');

Drop a partition:

alter table testljb drop partition(age=1);

Dynamic and static partitioning

Consider this scenario: we want to load data into another table, created with the following statement:

create table ip_table3(ip string, country string,province string,city string,area string)

partitioned by(company string,region string)

row format delimited fields terminated by '|';

To load it with data from another table, first enable dynamic partitioning:

set hive.exec.dynamic.partition=true;

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table ip_table3 partition (company,region) select ip,country,province,city,area,company,region from ip_table1;

Here Hive takes the last two columns of the select list and maps them, by position, to the columns named in the partition clause.
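The positional matching can be modeled in Python as follows. This is only a sketch of the idea, not of Hive's implementation: each selected row is split into its leading data columns and its trailing partition values, and rows with the same partition values end up in the same partition. The sample rows are invented.

```python
from collections import defaultdict

# Partition columns of ip_table3, in declaration order.
PARTITION_COLUMNS = ["company", "region"]

def route_rows(rows):
    """Group select-list tuples by their trailing partition values."""
    n = len(PARTITION_COLUMNS)
    partitions = defaultdict(list)
    for row in rows:
        data, part_values = row[:-n], row[-n:]
        spec = "/".join(f"{c}={v}" for c, v in zip(PARTITION_COLUMNS, part_values))
        partitions[spec].append(data)
    return dict(partitions)

# Hypothetical rows in select-list order: five data columns, then company, region.
rows = [
    ("1.1.1.1", "China", "Zhejiang", "Hangzhou", "Xihu", "telecom", "east"),
    ("2.2.2.2", "China", "Guangdong", "Shenzhen", "Nanshan", "unicom", "south"),
    ("3.3.3.3", "China", "Zhejiang", "Ningbo", "Haishu", "telecom", "east"),
]
result = route_rows(rows)
for spec, data in sorted(result.items()):
    print(spec, len(data))
```

Because the match is by position and not by name, putting the select columns in the wrong order silently writes values into the wrong columns or partitions.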

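Pitfall: avoid Chinese text in partition columns where possible, because it causes a lot of trouble. (Hive can be configured to support it.)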

Querying data:

1. order by syntax

select * from ip_table3 order by company desc;  -- or asc, which is the default

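2. sort by syntax

select * from ip_table3 sort by company desc;  -- or asc, which is the default

order by guarantees a global ordering of the whole result, while sort by only guarantees that the rows within each partition (each reducer's output) are ordered.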
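The difference is easy to see in a toy Python model, where each inner list stands for the rows that happened to reach one reducer. The function names and numbers are illustrative only.

```python
def order_by(rows, key):
    """Global sort: all rows pass through a single sorted sequence."""
    return sorted(rows, key=key)

def sort_by(reducer_inputs, key):
    """Per-reducer sort: each reducer orders only the rows it received."""
    return [sorted(part, key=key) for part in reducer_inputs]

# Two reducers, each holding an arbitrary share of the rows.
reducer_inputs = [[5, 1, 9], [4, 8, 2]]
all_rows = [x for part in reducer_inputs for x in part]

globally_sorted = order_by(all_rows, key=lambda x: x)
per_reducer = sort_by(reducer_inputs, key=lambda x: x)
print(globally_sorted)  # [1, 2, 4, 5, 8, 9]
print(per_reducer)      # [[1, 5, 9], [2, 4, 8]]
```

Concatenating the per-reducer outputs gives 1, 5, 9, 2, 4, 8, which is not globally sorted. With a single reducer the two clauses produce the same result.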

3. distribute by

select * from ip_table3 distribute by company sort by ip;

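distribute by controls how the map output is divided among the reducers; in the statement above, rows are divided by company. Note that distribute by must be written before sort by, because sort by then sorts within each of those partitions.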

set mapred.reduce.tasks=2;

select * from student distribute by sex sort by sage;

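distribute by sends rows to different reducers according to the specified column.
In the first statement, we only sort by sex and age.
In the second, we set 2 reducers, distribute the rows by sex, and sort by sage within each reducer.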
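A rough Python model of `distribute by sex sort by sage` with two reducers is given below. The hash function is a deterministic stand-in chosen for this sketch (Hive uses its own hash), and the student rows, including the `sname` column, are invented.

```python
def stable_hash(value):
    """Stand-in hash for this sketch; Hive's real hash function is different."""
    return sum(str(value).encode("utf-8"))

def distribute_by(rows, column, num_reducers):
    """Send every row to the reducer chosen by hashing the distribute column."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[stable_hash(row[column]) % num_reducers].append(row)
    return buckets

# Hypothetical student rows, for illustration only.
students = [
    {"sname": "A", "sex": "male", "sage": 22},
    {"sname": "B", "sex": "female", "sage": 19},
    {"sname": "C", "sex": "male", "sage": 18},
    {"sname": "D", "sex": "female", "sage": 21},
]

buckets = distribute_by(students, "sex", num_reducers=2)
# sort by sage: each reducer sorts only its own rows.
sorted_buckets = [sorted(b, key=lambda r: r["sage"]) for b in buckets]
for i, bucket in enumerate(sorted_buckets):
    print(i, [(r["sex"], r["sage"]) for r in bucket])
```

All rows with the same sex always reach the same reducer. The reverse is not guaranteed in general: with more distinct values than reducers, several values share one reducer.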

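4. cluster by

select * from ip_table3 cluster by company;

cluster by combines distribute by and sort by on the same column, so the statement above is equivalent to select * from ip_table3 distribute by company sort by company; Note that cluster by always sorts in ascending order.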
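The equivalence can be checked with a small Python model in which cluster by is literally defined as distribute by followed by sort by on the same column. The hash is a deterministic stand-in for this sketch, not Hive's own, and the rows are invented.

```python
def stable_hash(value):
    """Stand-in hash for this sketch; Hive's real hash function is different."""
    return sum(str(value).encode("utf-8"))

def distribute_by(rows, column, num_reducers):
    """Assign each row to a reducer based on the hash of one column."""
    buckets = [[] for _ in range(num_reducers)]
    for row in rows:
        buckets[stable_hash(row[column]) % num_reducers].append(row)
    return buckets

def sort_by(buckets, column):
    """Sort the rows inside each reducer, ascending."""
    return [sorted(bucket, key=lambda row: row[column]) for bucket in buckets]

def cluster_by(rows, column, num_reducers):
    """cluster by col == distribute by col, then sort by col."""
    return sort_by(distribute_by(rows, column, num_reducers), column)

# Hypothetical rows, for illustration only.
rows = [
    {"ip": "3.3.3.3", "company": "unicom"},
    {"ip": "1.1.1.1", "company": "telecom"},
    {"ip": "2.2.2.2", "company": "mobile"},
    {"ip": "4.4.4.4", "company": "telecom"},
]
clustered = cluster_by(rows, "company", num_reducers=2)
for i, bucket in enumerate(clustered):
    print(i, [r["company"] for r in bucket])
```

When the two columns differ, or a descending order is needed, write distribute by and sort by separately.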
