Hive Functions & Compression

hsiehchou

Article link: https://blog.csdn.net/xzddfgj/article/details/88198286

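1. Sorting

Order By: global sort
1) Sort the employee table by commission in ascending order
select * from emptable order by emptable.comm asc;
asc is the default and can be omitted.

2) Sort the employee table by commission in descending order
select * from emptable order by emptable.comm desc;

3) Sort by department number, then by commission, both ascending
select * from emptable order by deptno, comm;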
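
To make the two-key ordering concrete, here is a minimal plain-Java sketch (not Hive code) of what order by deptno, comm does conceptually. The Emp class and the sample rows are made up for illustration, and putting NULL commissions first assumes Hive's default null ordering for ascending sorts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OrderBySketch {
    // Hypothetical row type: only the two sort columns are modelled.
    static final class Emp {
        final int deptno;
        final Integer comm; // nullable, like a NULL commission

        Emp(int deptno, Integer comm) {
            this.deptno = deptno;
            this.comm = comm;
        }

        @Override
        public String toString() {
            return deptno + ":" + comm;
        }
    }

    // Conceptual equivalent of: order by deptno, comm (both ascending).
    static List<Emp> orderByDeptnoComm(List<Emp> rows) {
        List<Emp> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator
                .comparingInt((Emp e) -> e.deptno)
                // Assumption: NULLs sort first in ascending order.
                .thenComparing(e -> e.comm,
                        Comparator.nullsFirst(Comparator.<Integer>naturalOrder())));
        return sorted;
    }

    public static void main(String[] args) {
        List<Emp> rows = Arrays.asList(
                new Emp(30, 500), new Emp(10, null), new Emp(30, 300),
                new Emp(20, null), new Emp(30, null));
        System.out.println(orderByDeptnoComm(rows));
    }
}
```

The second key only matters for rows that tie on the first one: here the three rows of department 30 are ordered among themselves by commission.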

Sort By: per-reducer sort (each reducer's output is sorted, but the overall result is not)
Set the number of reducers: set mapreduce.job.reduces = 3;
select * from dept_partitions sort by deptno desc;

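Distribute By: partitioning (decides which reducer each row is sent to; usually combined with Sort By)
1) Distribute rows by department number, then sort by location code in descending order within each reducer.
select * from dept_partitions distribute by deptno sort by loc desc;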

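Cluster By: distribute and sort on the same column
1) Distribute and sort by department number
select * from dept_partitions cluster by deptno;

Note: when Distribute By and Sort By use the same column, Cluster By can replace them. Cluster By only sorts in ascending order; asc or desc cannot be specified.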
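
The difference between a global sort and distribute by plus sort by is easier to see in a small simulation. The following is a plain-Java sketch, not Hive code: the Dept class and sample rows are hypothetical, and choosing the reducer as hash(deptno) mod the reducer count, with an int hashing to itself, is an assumption about the partitioning that may not match Hive's actual hash function.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DistributeSortSketch {
    // Hypothetical row type with the two columns used in the query.
    static final class Dept {
        final int deptno;
        final int loc;

        Dept(int deptno, int loc) {
            this.deptno = deptno;
            this.loc = loc;
        }

        @Override
        public String toString() {
            return deptno + "/" + loc;
        }
    }

    // Assumption: reducer = hash(deptno) mod numReducers, an int hashing to itself.
    static int reducerFor(int deptno, int numReducers) {
        return (Integer.hashCode(deptno) & Integer.MAX_VALUE) % numReducers;
    }

    // Conceptual equivalent of: distribute by deptno sort by loc desc.
    static List<List<Dept>> distributeAndSort(List<Dept> rows, int numReducers) {
        List<List<Dept>> reducers = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            reducers.add(new ArrayList<>());
        }
        for (Dept row : rows) {
            // distribute by: rows with the same deptno always reach the same reducer
            reducers.get(reducerFor(row.deptno, numReducers)).add(row);
        }
        for (List<Dept> part : reducers) {
            // sort by ... desc: ordering is applied inside each reducer only
            part.sort(Comparator.comparingInt((Dept d) -> d.loc).reversed());
        }
        return reducers;
    }

    public static void main(String[] args) {
        List<Dept> rows = Arrays.asList(
                new Dept(10, 1700), new Dept(20, 1800), new Dept(30, 1900),
                new Dept(40, 1700), new Dept(10, 1900), new Dept(30, 1800));
        List<List<Dept>> out = distributeAndSort(rows, 3);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("reducer " + i + ": " + out.get(i));
        }
    }
}
```

Each reducer's list is in descending loc order, but concatenating the three lists does not give a globally sorted result, which is the "sorted within, unsorted overall" behaviour described above.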

2. Bucketing

Bucketing splits a table's data into separate files.
1) Create a bucketed table
clustered by(id) into 4 buckets

hive> set mapreduce.job.reduces=4;
hive> create table emptable_buck(id int, name string)
    > clustered by(id) into 4 buckets
    > row format
    > delimited fields
    > terminated by '\t';
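
As a rough illustration of how rows could end up in the 4 bucket files, the plain-Java sketch below assigns ids to buckets. It is not Hive code. The rule used, bucket = hash(id) mod number of buckets with an int hashing to itself, is an assumption; the hash Hive actually applies depends on the Hive version and table settings.

```java
import java.util.ArrayList;
import java.util.List;

class BucketSketch {
    // Assumption: bucket index = hash(id) mod numBuckets, an int hashing to itself.
    static int bucketFor(int id, int numBuckets) {
        return (Integer.hashCode(id) & Integer.MAX_VALUE) % numBuckets;
    }

    // Group ids by bucket index; list position i holds the ids of bucket i (0-based).
    static List<List<Integer>> assign(int[] ids, int numBuckets) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < numBuckets; i++) {
            buckets.add(new ArrayList<>());
        }
        for (int id : ids) {
            buckets.get(bucketFor(id, numBuckets)).add(id);
        }
        return buckets;
    }

    public static void main(String[] args) {
        int[] ids = {1, 2, 3, 4, 5, 6, 7, 8};
        List<List<Integer>> buckets = assign(ids, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i));
        }
    }
}
```

Under this assumption, ids that differ by a multiple of 4 share a bucket file, so each file holds a roughly even slice of the table.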

View the table's description

hive> desc formatted emptable_buck;

Load data

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_buck;

Create a plain (non-bucketed) table with the same columns

hive> create table emptable_b(id int, name string)
    > row format
    > delimited fields
    > terminated by '\t';

Truncate the bucketed table

hive> truncate table emptable_buck;

Load data into the plain table (the source for bucketing)

hive> load data local inpath '/root/hsiehchou.txt' into table emptable_b;

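Enable bucketing enforcement (rows are bucketed on insert; if it is not enabled, all data ends up in a single bucket by default). Hive 2.0 and later removed this property and always enforce bucketing.

hive> set hive.enforce.bucketing=true;
hive> truncate table emptable_buck;

To get bucketed files, populate emptable_buck with an insert ... select from emptable_b rather than load data, so that each row is hashed into one of the 4 buckets.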

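When a representative result is enough and the full result is not needed, sample the data.
(bucket 1 out of 2 on id)
1: start from the first bucket
2: the table's 4 buckets are divided by 2, so 4 / 2 = 2 buckets are sampled (the 1st and the 3rd)

hive> select * from emptable_buck tablesample(bucket 1 out of 2 on id);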
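
The sampling rule can be sketched in plain Java as follows. This is an illustration, not Hive code: it assumes that bucket x out of y keeps the rows whose hash(id) mod y equals x - 1, and that an int id hashes to itself, which may differ from the hash a given Hive version uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableSampleSketch {
    // Assumption: an int id hashes to itself.
    static int hash(int id) {
        return Integer.hashCode(id) & Integer.MAX_VALUE;
    }

    // tablesample(bucket x out of y on id): keep rows whose hash(id) mod y equals x - 1.
    static List<Integer> sample(List<Integer> ids, int x, int y) {
        List<Integer> kept = new ArrayList<>();
        for (int id : ids) {
            if (hash(id) % y == x - 1) {
                kept.add(id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        for (int id : sample(ids, 1, 2)) {
            // hash(id) % 4 + 1 is the 1-based bucket the row would sit in for a 4-bucket table.
            System.out.println("id " + id + " from bucket " + (hash(id) % 4 + 1) + " of 4");
        }
    }
}
```

With ids 1 to 8, bucket 1 out of 2 keeps the even ids, and those all come from the 1st and 3rd of the 4 buckets, which is why y = 2 on a 4-bucket table reads two bucket files.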

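3. User-Defined Functions (UDF)

List the built-in functions
show functions;
Show a function's detailed description
desc function extended upper;

UDF: one row in, one row out
UDAF: aggregate function, many rows in, one row out (count / max / avg)
UDTF: one row in, many rows out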
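
The three shapes can be pictured with ordinary Java methods. The sketch below deliberately avoids the Hive API; the method names are hypothetical and only mirror the input and output cardinality of each function kind.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

class FunctionKindsSketch {
    // UDF shape: one value in, one value out (like upper).
    static String upper(String s) {
        return s == null ? null : s.toUpperCase(Locale.ROOT);
    }

    // UDAF shape: many values in, one value out (like count(col), which skips NULLs).
    static long count(List<String> values) {
        long n = 0;
        for (String v : values) {
            if (v != null) {
                n++;
            }
        }
        return n;
    }

    // UDTF shape: one value in, many values out (like exploding a split string).
    static List<String> explodeSplit(String s, String separator) {
        return Arrays.asList(s.split(Pattern.quote(separator)));
    }

    public static void main(String[] args) {
        System.out.println(upper("hive"));
        System.out.println(count(Arrays.asList("a", null, "b")));
        System.out.println(explodeSplit("a,b,c", ","));
    }
}
```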

Java
Add all the jars under Hive's lib directory to the project
Write the Java code

package com.hsiehchou;

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyConcat extends UDF {

    // Concatenate a and b, separated by "******"
    public String evaluate(String a, String b) {
        return a + "******" + String.valueOf(b);
    }
}
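
One detail of this evaluate method is how it treats missing values. The sketch below copies the method body into a standalone class (MyConcatSketch is a hypothetical name, and it does not extend UDF) so it can be run without the Hive jars. It assumes that a NULL column value reaches evaluate as a Java null.

```java
class MyConcatSketch {
    // Same body as MyConcat.evaluate, without the Hive base class.
    static String evaluate(String a, String b) {
        return a + "******" + String.valueOf(b);
    }

    public static void main(String[] args) {
        System.out.println(evaluate("SMITH", "CLERK"));
        // A null argument is rendered as the text "null", not propagated as NULL.
        System.out.println(evaluate("SMITH", null));
    }
}
```

If NULL inputs should yield a NULL result instead, the method would need an explicit null check that returns null before concatenating.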

Export this class, package it as a jar, and copy it to hsiehchou121.

Register as a temporary function:
add jar /root/Myconcat.jar;
create temporary function my_cat as "com.hsiehchou.MyConcat";

<!-- Permanent registration: hive-site.xml -->
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///root/hd/hive/lib/hive.jar</value>
</property>

4. Hive Compression

Storage: HDFS
Compute: MapReduce

Compressing the map output
Enable compression of Hive's intermediate data
set hive.exec.compress.intermediate=true;

Enable map output compression
set mapreduce.map.output.compress=true;

Use Snappy as the codec
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Compressing the reduce output
Enable compression of Hive's final output
set hive.exec.compress.output=true;

Enable compression of the MapReduce job output
set mapreduce.output.fileoutputformat.compress=true;

Specify the compression codec
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Use block compression as the compression type
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test the result
insert overwrite local directory '/root/datas/rs' select * from emptable order by sal desc;
