Some notes on Hive data processing

另一个世界Azure

Article link: https://blog.csdn.net/huang358468/article/details/117395417

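This article discusses key principles of table design: keeping the source tables' fields and adding derived fields so that the data can be restored, plus details to watch for in sorting and JOIN operations. It covers using the COALESCE function to handle NULL values, building a zipper table for joins against full snapshot tables, and the importance of selecting the latest partition and restricting rows by time. It also covers adding a tie-breaking field to the sort to keep results consistent.

Summary generated automatically by CSDN.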

Table design principles

Keep the source tables' fields and add the derived fields next to them, as shown below:

with t1 as (
select id
       ,name
       ,address
  from t11 
),
t2 as (
select id
      ,name1
      ,address1
  from t22
)
select t1.id
       -- keep the original values so the data can be restored later
      ,t1.name
      ,t1.address
      ,t2.name1
      ,t2.address1
       -- record the final result so downstream jobs can reference it directly
      ,coalesce(t1.name,t2.name1)        as finally_name
      ,coalesce(t1.address,t2.address1)  as finally_address
  from t1 
left join t2 on t1.id = t2.id;

-- table that stores the result of the query above
create table finally_result(
    id        string
    ,name     string comment 'xxx'
    ,address  string comment 'xxx'
    ,name1    string comment 'xxx'
    ,address1 string comment 'xxx'
    ,finally_name string comment 'xxx'
    ,finally_address string comment 'xxx'
);
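
As a rough sketch of how the query and the table fit together, the result can be written into finally_result and then read in two ways: downstream jobs take only the final columns, while a check query uses the preserved source columns to see where each final value came from. The two trailing queries are illustrative assumptions, not part of the original design.

```sql
-- Load the combined result into the table defined above.
insert overwrite table finally_result
select t1.id
      ,t1.name
      ,t1.address
      ,t2.name1
      ,t2.address1
      ,coalesce(t1.name,t2.name1)        as finally_name
      ,coalesce(t1.address,t2.address1)  as finally_address
  from t11 t1
left join t22 t2 on t1.id = t2.id;

-- Downstream usage: only the final columns are needed.
select id
      ,finally_name
      ,finally_address
  from finally_result;

-- Check query (illustrative): rows whose final name was filled in from t22.
-- This is only possible because the original columns were kept.
select id
      ,name
      ,name1
      ,finally_name
  from finally_result
 where name is null
   and name1 is not null;
```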

Sorting

select id
      ,name
      ,age
  from t1 
order by age
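The ordering produced by the query above is unreliable. With data like the following, the two rows share the same age, so their relative order is not guaranteed and can differ between runs: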
id    name    age
1     aa      18
2     aa      18

Add a tie-breaking field to the sort so that the rows always come out in one fixed order:
select id
      ,name
      ,age
  from t1 
order by age
        ,id
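
The same concern applies when a row number is assigned with a window function. The sketch below assumes id is unique in t1; if it is not, another column would have to be added to the ordering to break the remaining ties.

```sql
-- Ordering by age alone would let the two age-18 rows swap their
-- row numbers between runs; adding id makes the numbering repeatable.
select id
      ,name
      ,age
      ,row_number() over (order by age, id) as rn
  from t1;
```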

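Inner join between tables

An inner join decides which rows are kept and which are dropped.
If the small table in the join is a full snapshot table and carries no timestamps marking when rows were added or deleted, it should be designed as a zipper table (拉链表, a history table in which each version of a row has a validity date range).
The reason is that this makes it possible to reconstruct data that has already been processed.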
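
A minimal sketch of what such a zipper table and the join against it might look like. All names here (user_zip, orders, order_id, order_date, start_date, end_date) are hypothetical, and the sketch assumes the validity ranges of one id never overlap.

```sql
-- Hypothetical zipper table: one row per version of each id.
create table user_zip(
    id          string comment 'business key'
    ,name       string comment 'value of this version'
    ,start_date string comment 'first day this version is valid, yyyy-MM-dd'
    ,end_date   string comment 'last day this version is valid, 9999-12-31 while current'
);

-- Join each order to the version that was valid on the order date.
-- The range condition sits in the where clause so that the on clause
-- stays a plain equality join.
select o.order_id
      ,o.id
      ,z.name
  from orders o
  join user_zip z on o.id = z.id
 where o.order_date >= z.start_date
   and o.order_date <= z.end_date;
```

Rerunning the query for a past order date returns the name that was valid at that time, which is what allows processed data to be reconstructed.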

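Joining against a full snapshot table

Read only the latest (maximum) partition of the table, and also restrict its rows to those whose time is less than or equal to the current day.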
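
One possible way to write this is sketched below. The names user_full, orders, order_id, dt, create_time and the variable biz_date are assumptions made for illustration: dt is taken to be the partition column, create_time the row's timestamp, and biz_date the processing day passed in as a Hive variable in yyyy-MM-dd form.

```sql
-- Find the latest partition that is not later than the processing day.
with latest as (
select max(dt) as max_dt
  from user_full
 where dt <= '${hivevar:biz_date}'
),
-- Keep only rows from that partition whose time is on or before the processing day.
u as (
select f.id
      ,f.name
  from user_full f
  join latest l on f.dt = l.max_dt
 where to_date(f.create_time) <= '${hivevar:biz_date}'
)
select o.order_id
      ,o.id
      ,u.name
  from orders o
  join u on o.id = u.id;
```

Because the partition value is computed at run time here, Hive may not be able to prune partitions statically. Passing the partition value in as a variable is an alternative when the scheduler already knows it.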
