8. Hive Series: Functions, Compression, and Storage

沈健_算法小生

Last modified 2024-06-20 21:42:45


Category: Big Data. Tags: hive, hadoop, data warehouse

First published 2023-07-22 22:15:47

Article link: https://blog.csdn.net/SJshenjian/article/details/131873442


1. Functions

-- List the built-in functions
show functions;
-- Show brief usage of a built-in function
desc function upper;
-- Show detailed usage of a built-in function
desc function extended upper;
-- NVL: if an employee's comm is NULL, return -1 instead
select comm, nvl(comm, -1) from emp;
-- CASE WHEN THEN ELSE END: count male ('男') and female ('女') employees per department
select dept_id,
 sum(case sex when '男' then 1 else 0 end) male_count,
 sum(case sex when '女' then 1 else 0 end) female_count
from emp_sex group by dept_id;
-- Rows to columns. CONCAT_WS(separator, str1, str2, ...) is a special form of CONCAT():
-- the first argument is the separator placed between the remaining arguments,
-- and it can be any string, just like the remaining arguments.
SELECT t1.c_b, CONCAT_WS("|", collect_set(t1.name))
FROM (SELECT name, CONCAT_WS(',', constellation, blood_type) c_b FROM person_info) t1 GROUP BY t1.c_b;
-- Columns to rows. EXPLODE(col) splits a complex Array or Map column into multiple rows.
-- LATERAL VIEW syntax: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
-- It is used with UDTFs such as explode (often fed by split) to turn one column value
-- into multiple rows, which can then be aggregated.
SELECT movie, category_name FROM
movie_info
lateral VIEW explode(split(category,",")) movie_info_tmp AS category_name;
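
The two patterns above can be tried without any existing table. The following is a minimal sketch, not taken from the original post: the CTE names movie_info_demo and exploded and the sample rows are invented for illustration, and it assumes a Hive version that supports CTEs and SELECT without a FROM clause. It first explodes a comma-separated column into rows, then groups the rows back into one delimited string per category.

```sql
-- Hypothetical inline data standing in for the movie_info table.
WITH movie_info_demo AS (
  SELECT 'Movie A' AS movie, 'drama,action' AS category
  UNION ALL
  SELECT 'Movie B' AS movie, 'action' AS category
),
-- Columns to rows: one output row per (movie, category_name) pair.
exploded AS (
  SELECT movie, category_name
  FROM movie_info_demo
  LATERAL VIEW explode(split(category, ',')) tmp AS category_name
)
-- Rows to columns: collect the movies of each category into one '|'-joined string.
SELECT category_name,
       concat_ws('|', collect_set(movie)) AS movies
FROM exploded
GROUP BY category_name;
```

Note that collect_set removes duplicates and does not guarantee element order, so the order of names inside each joined string may vary between runs.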

User-defined functions

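Hive ships with built-in functions such as max and min, but their number is limited. When the built-in functions cannot meet your business needs, you can extend Hive with user-defined functions (UDFs).
User-defined functions fall into three categories:
- UDF (User-Defined Function)
  One row in, one row out
- UDAF (User-Defined Aggregation Function)
  Aggregate function, many rows in, one row out, similar to count/max/min
- UDTF (User-Defined Table-Generating Function)
  One row in, many rows out, such as explode() used with lateral view
Implementation details are omitted here; look them up when you need them.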
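
Although the Java side is omitted, the HiveQL side of using a custom function is short. The following is a rough sketch under stated assumptions: the jar path, the function name my_upper, and the class com.example.hive.MyUpper are all hypothetical placeholders, and the class is assumed to be a UDF you have already compiled and packaged yourself.

```sql
-- Hypothetical jar path; replace with the location of your own packaged UDF.
ADD JAR /path/to/my-udfs.jar;

-- Register the (hypothetical) class under a function name for this session only.
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpper';

-- Call it like any built-in function.
SELECT my_upper(name) FROM person_info;

-- Remove the session-scoped function when it is no longer needed.
DROP TEMPORARY FUNCTION IF EXISTS my_upper;
```

A temporary function disappears when the session ends, which makes it a low-risk way to test a UDF before deciding whether to register it permanently.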

2. Compression and Storage

2.1 Create an ORC table with ZLIB compression

create table log_orc_zlib(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="ZLIB");

-- Check the data size after inserting
dfs -du -h /user/hive/warehouse/log_orc_zlib/ ;
2.78 M /user/hive/warehouse/log_orc_zlib/000000_0
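
The size check above presupposes that data has been inserted, a step the listing does not show. One common way to do it is sketched below, assuming a plain-text staging table; the table name log_text and the file path are hypothetical and do not come from the original post.

```sql
-- Hypothetical staging table in plain-text format with the same seven columns.
create table log_text(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as textfile;

-- Hypothetical local path; point it at your own tab-separated log file.
load data local inpath '/path/to/log.data' into table log_text;

-- Rewrite the rows into the ORC table; Hive converts and compresses them on write.
insert into table log_orc_zlib select * from log_text;
```

The same insert statement, with only the target table name changed, would populate the other tables in this section, so that all size comparisons use identical input data.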

2.2 Create an ORC table with SNAPPY compression

create table log_orc_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as orc
tblproperties("orc.compress"="SNAPPY");

-- Check the data size after inserting
dfs -du -h /user/hive/warehouse/log_orc_snappy/;
3.75 M /user/hive/warehouse/log_orc_snappy/000000_0

2.3 Create a Parquet table with SNAPPY compression

create table log_parquet_snappy(
track_time string,
url string,
session_id string,
referer string,
ip string,
end_user_id string,
city_id string
)
row format delimited fields terminated by '\t'
stored as parquet
tblproperties("parquet.compression"="SNAPPY");

dfs -du -h /user/hive/warehouse/log_parquet_snappy/;
6.39 M /user/hive/warehouse/log_parquet_snappy/000000_0

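2.4 Summary of storage formats and compression

In the measurements above, ORC with ZLIB produced the smallest file (2.78 M), followed by ORC with SNAPPY (3.75 M) and Parquet with SNAPPY (6.39 M). In real projects, Hive tables are usually stored as ORC or Parquet, and the compression codec is usually Snappy or LZO.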
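
When comparing formats it helps to confirm which format and codec a table was actually created with. A small sketch using the tables created in this section (the statements only read metadata and do not change any data):

```sql
-- Print a single table property: the codec declared at creation time.
show tblproperties log_orc_zlib("orc.compress");
show tblproperties log_parquet_snappy("parquet.compression");

-- Print full table metadata, including the SerDe, the input/output formats,
-- and all table parameters.
desc formatted log_orc_snappy;
```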
