Hive SQL: built-in functions, bucketed tables, and sampling queries

一把秀儿

Article link: https://blog.csdn.net/m0_52106226/article/details/110498793


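array: building an array in a select

select array(1,2,3,4);
Usage examples
select array(id,url,ct) from tb_log;      -- id, url and ct are columns of the table tb_log; the three columns are combined into one array column
select array(id,url,ct)[0] from tb_log;   -- array indexes start at 0, so this returns the id value of each row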

array_contains

array_contains checks whether an array contains a given element. Usage:
select array_contains(array(1,2,3),2);    -- returns true because 2 is in the array

upper, lower: case conversion

upper converts lowercase to uppercase
select upper('abc');    -- returns ABC
lower converts uppercase to lowercase
select lower('ABC');    -- returns abc

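split: splitting a string

select split('hello_tom_jim','_');      -- the first argument is the string, the second is the separator; returns an array of three elements
select split('hello_tom_jim','_')[0];   -- the result is an array, so it can be indexed; index 0 returns hello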
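
One detail worth illustrating: Hive treats the separator passed to split as a regular expression, so characters such as . or | need to be escaped. A minimal sketch follows; the IP-style literal is only an illustrative value and does not come from the original post.

```sql
-- '.' is a regex metacharacter, so it is escaped to split on a literal dot
select split('192.168.1.1', '\\.');

-- the array returned by split can be passed to other array functions
select array_contains(split('hello_tom_jim', '_'), 'tom');
```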

trim(str): removing leading and trailing spaces

select trim('     hello      ');    -- returns hello

uuid: generating a random string

select uuid();       -- returns a random string such as 229e4f73-614d-4940-8043-00ac454588df

replace: string replacement

replace(string, substring to replace, replacement string) replaces every occurrence of the substring
select replace('a_f_j_l_o','_','');   -- returns afjlo; every _ is removed

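substr: extracting a substring

select substr(string, start position, length)     -- positions start at 1; the length is optional
select substr('hello',1,3);     -- returns hel
substring works the same way as substr
select substring_index(string, delimiter, count)     -- returns everything before the count-th occurrence of the delimiter
select substring_index('a-b-c','-',1);   -- returns a, the part before the first -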
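
A few more calls may help show how the count and length arguments behave. This is a small sketch reusing the same literals as above, not part of the original post.

```sql
select substring_index('a-b-c', '-', 2);    -- a-b : everything before the 2nd '-'
select substring_index('a-b-c', '-', -1);   -- c   : a negative count works from the right
select substr('hello', 2);                  -- ello : length omitted, so it reads to the end
```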

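Bucketed tables

A partitioned table manages data in separate directories, so a query reads only the matching directory and scans fewer files.
A bucketed table splits the data into files according to a rule applied to a chosen column, which helps optimize joins and queries.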

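1 Create a regular table and load data into it
2 Create the bucketed table
3 Enable bucketing
set hive.enforce.bucketing=true;     -- enable bucketing
set mapreduce.job.reduces=-1;        -- -1 lets Hive decide the number of reducers
4 Load the data into the bucketed table with insert into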

1 Create a regular table and load data into it
create table tb_stu(
id int, 
name string)
row format delimited fields terminated by '\t';
load data local inpath "/data/stu/" into  table tb_stu ;

2 Create the bucketed table
create table buck_stu(
id int, 
name string)
clustered by(id)   -- the column used for bucketing
into 3 buckets     -- the number of buckets
row format delimited fields terminated by '\t';

3 Enable bucketing
set hive.enforce.bucketing=true;     -- enable bucketing; this is a fixed setting that is always written the same way
set mapreduce.job.reduces=-1;

4 Load the data into the bucketed table with insert into
insert into table buck_stu     -- the bucketed table to insert into
select id, name from tb_stu;   -- the regular table and columns that hold the source data
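
To get a feel for how rows end up in the three buckets, the query below estimates a bucket number per row from the hash of id. It is only an illustrative sketch: the column alias bucket_no is made up here, and it assumes the legacy bucketing hash, so newer Hive versions that hash differently may place rows in other buckets.

```sql
-- Illustrative check only: estimates which of the 3 buckets each row falls into.
-- Assumes the legacy bucketing hash; newer Hive versions may hash differently.
select id,
       name,
       pmod(hash(id), 3) as bucket_no   -- 0, 1 or 2
from tb_stu;
```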

Sampling queries

-- sampling queries
 select * from  buck_stu tablesample(bucket 1 out of 3 on id);   -- split the rows into 3 parts and return one of them
 select * from  buck_stu tablesample(bucket 2 out of 3 on id);   -- the three parts together make up the whole table
 select * from  buck_stu tablesample(bucket 3 out of 3 on id);
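 Note: the sample does not follow the order of the rows. Rows are assigned to the three parts by hashing id, so each query returns the rows of one bucket, and the three queries together cover all three buckets.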
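
Two variations on the same tablesample syntax can be sketched against buck_stu: a denominator larger than the bucket count, and sampling on rand() rather than on a column. The fractions in the comments are approximate and depend on how the id values are distributed.

```sql
-- a denominator larger than the bucket count returns roughly one sixth of the rows
select * from buck_stu tablesample(bucket 1 out of 6 on id);

-- sampling on rand() instead of a column picks a different random third on each run
select * from buck_stu tablesample(bucket 1 out of 3 on rand());
```

Sampling on the clustered column lets Hive read only the matching bucket files, while sampling on rand() has to scan the whole table.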