Reposted from: https://blog.csdn.net/qq_26803795/article/details/79466592
This post continues the Hive series and works through everything related to Hive optimization in depth:
Hive optimization:
1) Map optimization
• Increase the number of map tasks:
set mapred.map.tasks=10;
• Decrease the number of map tasks (merge small files):
set mapred.max.split.size=100000000; -- max bytes per input split (~100 MB here)
set mapred.min.split.size.per.node=100000000; -- min bytes combined per node
set mapred.min.split.size.per.rack=100000000; -- min bytes combined per rack
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- combine small files into one split
• Map-side aggregation (combiner):
set hive.map.aggr=true;
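With the combiner enabled, each map task pre-aggregates its own output before the shuffle, so far less data reaches the reducers. A minimal sketch, assuming a hypothetical table log with a user_id column:
set hive.map.aggr=true;
-- each mapper emits partial per-user_id counts instead of raw rows
select user_id, count(*) as pv from log group by user_id;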
2) Reduce optimization
• Set the number of reduce tasks:
set mapred.reduce.tasks=10;
• Amount of data each reduce task processes:
set hive.exec.reducers.bytes.per.reducer=100000; -- reducer count ≈ total input bytes / this value
• Avoid constructs that trigger extra MapReduce stages:
1) group by
2) order by (use distribute by plus sort by instead: a global order by funnels all rows through a single reducer; see the sketch after this list)
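A minimal sketch of the order by workaround, assuming a hypothetical table log with columns dt and ts: distribute by picks the reducer per row, and sort by orders rows within each reducer, so no single reducer has to sort everything.
-- a global sort forces one reducer: select * from log order by ts;
-- per-reducer sorting scales out instead:
select * from log distribute by dt sort by dt, ts;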
3) Join optimization
• Join ON conditions:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (a.key = c.key1)
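Hive merges joins that share the same join key into a single MapReduce job; in the query above both joins use a.key, so it runs as one job. For contrast, per the Hive join manual, joining on different keys of b needs two jobs:
-- b.key1 and b.key2 differ, so this is executed as two MapReduce jobs:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)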
• Join order:
/*+ STREAMTABLE(a) */ : a is treated as the big (streamed) table
/*+ MAPJOIN(b) */ : b is treated as the small table
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val
FROM a
JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key1);
4) Data skew optimization
• The general-purpose fix:
set hive.groupby.skewindata=true;
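With this switch on, Hive splits a skewed group by into two MR jobs: the first distributes rows randomly to reducers and partially aggregates, the second aggregates the partial results by the real key. A minimal sketch, assuming a hypothetical table log:
set hive.groupby.skewindata=true;
-- job 1: random distribution + partial counts; job 2: final counts per user_id
select user_id, count(*) as pv from log group by user_id;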
• A small table joined with a big table:
small_table join big_table (map-join the small table; see the sketch below)
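A minimal map-join sketch, with hypothetical tables big_table(id, val) and small_table(id, name): the small table is broadcast into every mapper's memory, so the big table is never shuffled to reducers:
select /*+ MAPJOIN(s) */ b.val, s.name
from big_table b
join small_table s on b.id = s.id;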
• Lots of 0 or NULL values in the join key:
Scatter the troublesome keys with a random value so they stop piling onto one reducer:
on case when (x.uid = '-' or x.uid = '0' or x.uid is null) then concat('dp_hive_search', rand()) else x.uid end = f.user_id;
• A big table joined with another big table:
select /*+ MAPJOIN(t12) */ *
from dw_log t11
join (
  select /*+ MAPJOIN(t) */ t1.*
  from ( select user_id from dw_log group by user_id ) t
  join dw_user t1 on t.user_id = t1.user_id
) t12 on t11.user_id = t12.user_id;
• Lots of special values when doing count distinct:
Exclude the special values from the distinct count, then add 1 back to account for them as a single group:
select cast(count(distinct user_id) + 1 as bigint) as user_cnt
from tab_a
where user_id is not null and user_id <> '';
• Trading space for time:
Replace several count(distinct ...) scans with a union all expansion plus a pre-aggregating group by:
select day,
       count(case when type = 'session' then 1 else null end) as session_cnt,
       count(case when type = 'user' then 1 else null end) as user_cnt
from (
  select day, session_id, type
  from (
    select day, session_id, 'session' as type from log
    union all
    select day, user_id, 'user' as type from log
  ) tmp
  group by day, session_id, type
) t1
group by day;
5) Other optimizations
• Partition pruning:
Partition predicates in the WHERE clause are applied early, so there is no need for a dedicated filtering subquery; Join and Group By can be written directly.
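A minimal sketch, assuming a hypothetical table log partitioned by dt: the dt predicate in WHERE prunes partitions before the join executes, so there is no need to pre-filter log in a subquery:
select u.name, count(*) as pv
from log l
join users u on l.user_id = u.id
where l.dt = '20170301' -- partition filter, applied at input pruning time
group by u.name;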
• Cartesian products:
A join without an ON condition (or with an invalid one) degenerates into a Cartesian product, which Hive can only execute with a single reducer.
• Union all:
Doing the union all first and the join or group by afterwards cuts down the MR stages: several selects still need only one MR job.
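A minimal sketch of the pattern, with hypothetical tables a and b: aggregate once over the UNION ALL instead of grouping each table separately and combining the results:
select t.key, count(*) as cnt
from (
  select key from a
  union all
  select key from b
) t
group by t.key;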
• Multi-insert & multi-group by:
Scan one base table once and produce several differently aggregated result sets in a single pass:
FROM from_statement
INSERT OVERWRITE TABLE table1 [PARTITION (partcol1=val1)] select_statement1 group by key1
INSERT OVERWRITE TABLE table2 [PARTITION (partcol2=val2)] select_statement2 group by key2
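A concrete sketch of the template above, with hypothetical tables pv_users, pv_gender_sum, and pv_age_sum: pv_users is scanned once and feeds both aggregations:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT gender, count(distinct userid) GROUP BY gender
INSERT OVERWRITE TABLE pv_age_sum
  SELECT age, count(distinct userid) GROUP BY age;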
• Automatic merge:
When output files are smaller than the threshold, Hive launches an extra MR job to merge them:
hive.merge.mapfiles = true -- whether to merge map output files, default true
hive.merge.mapredfiles = false -- whether to merge reduce output files, default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
• Multi-Count Distinct:
Count several distinct columns over one table (requires: set hive.groupby.skewindata=true;):
select dt, count(distinct uniq_id), count(distinct ip)
from ods_log
where dt = 20170301
group by dt;
• Parallel execution:
Enable it in Hive with: set hive.exec.parallel=true;
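When enabled, Hive runs stages with no dependency on each other concurrently; the branches of a UNION ALL are the typical case. A minimal sketch with hypothetical tables a and b (hive.exec.parallel.thread.number caps how many stages run at once):
set hive.exec.parallel=true;
-- the two count(*) stages are independent and can run concurrently
select * from (
  select count(*) as cnt from a
  union all
  select count(*) as cnt from b
) t;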