An Analysis of a Hive Job

A developer runs the same job every day, but one day he noticed that the same job, processing roughly the same amount of data (about 400 million rows in both cases, according to hive count), took three times as long: one run finished in under an hour, the other ran for three hours. The analysis steps are recorded below.
1. First, run explain on the Hive statement to see its execution steps, then go look for the corresponding jobs. The plan shows that this SQL statement runs as four stages (a sketch of the explain invocation follows the stage list):
Stage-1: performs the join.
Stage-2: performs the count aggregation on the output of Stage-1.
Stage-0: moves the data into the target table.
Stage-3: gathers statistics for the loaded table/partition (Stats-Aggr).
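A minimal sketch of how the plan below can be obtained, assuming the daily statement is kept in a single-statement script file (daily_cookie_labels.hql is a hypothetical name; the post does not reproduce the original SQL):

    # Print the execution plan without actually running the statement.
    # daily_cookie_labels.hql is assumed to hold exactly one INSERT OVERWRITE ... SELECT,
    # the statement whose plan is shown below.
    hive -e "EXPLAIN $(cat daily_cookie_labels.hql)"
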
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        m
          TableScan
            alias: m
            Reduce Output Operator
              key expressions:
                    expr: rtb_label
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: rtb_label
                    type: string
              tag: 0
              value expressions:
                    expr: label_id
                    type: int
                    expr: extract
                    type: string
        t:rtb_requests
          TableScan
            alias: rtb_requests
            Filter Operator
              predicate:
                  expr: (tid <> '')
                  type: boolean
              Lateral View Forward
                Select Operator
                  expressions:
                        expr: tid
                        type: string
                        expr: opxpid
                        type: string
                        expr: request_date_i
                        type: string
                  outputColumnNames: tid, opxpid, request_date_i
                  Lateral View Join Operator
                    outputColumnNames: _col12, _col16, _col40, _col42
                    Select Operator
                      expressions:
                            expr: _col12
                            type: string
                            expr: _col16
                            type: string
                            expr: _col42
                            type: string
                      outputColumnNames: _col0, _col1, _col2
                      Reduce Output Operator
                        key expressions:
                              expr: _col2
                              type: string
                        sort order: +
                        Map-reduce partition columns:
                              expr: _col2
                              type: string
                        tag: 1
                        value expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: string
                Select Operator
                  expressions:
                        expr: array(detected_content_labels[0],CASE WHEN ((user_agent is null or (user_agent = ''))) THEN ('device_Undefined') WHEN (((user_agent rlike 'iP(a|ro)d') or (user_agent rlike 'Tablet'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'Kindle') or (user_agent rlike 'Mac.OS')) and (user_agent rlike 'Silk'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'HTC(.Flyer|\_Flyer)|ViewPad7|Dell.Streak.7|Next2|nook') and (user_agent rlike 'RUTEM/')) and (user_agent rlike 'Silk'))) THEN ('device_Tablet') WHEN ((((user_agent rlike 'Opera') and (user_agent rlike 'Windows.NT.5')) and (user_agent rlike 'HTC|Xda|Mini|Vario|SAMSUNG-GT-i8000|SAMSUNG-SGH-i9'))) THEN ('device_Mobile') WHEN ((((user_agent rlike 'Windows') and (not (user_agent rlike 'Phone'))) or (user_agent rlike 'Win(9|.9|NT)'))) THEN ('device_Computer') WHEN (((user_agent rlike 'Macintosh|PowerPC') and (not (user_agent rlike 'Silk')))) THEN ('device_Computer') WHEN ((user_agent rlike 'Android|Phone|Mobile')) THEN ('device_Mobile') WHEN (((user_agent rlike 'Linux') and (user_agent rlike 'X11'))) THEN ('device_Computer') WHEN ((user_agent rlike 'Solaris|SunOS|OpenBSD|FreeBSD')) THEN ('device_Computer') ELSE ('device_Other') END,CASE WHEN (((user_agent = '') or user_agent is null)) THEN ('browser_Undefined') WHEN ((user_agent rlike 'Chrome/[0-9]{1,2}')) THEN ('browser_Chrome') WHEN ((user_agent rlike 'MSIE [0-9]{1,2}.[0-9]+')) THEN ('browser_Internet Explorer') WHEN ((user_agent rlike 'Firefox/[0-9]{1,2}')) THEN ('browser_Firefox') WHEN ((user_agent rlike 'Opera Mini/[0-9]{1,2}')) THEN ('browser_Opera Mini') WHEN ((user_agent rlike 'Opera/[0-9]{1,2}')) THEN ('browser_Opera') WHEN (((user_agent rlike 'Version/[0-9]{1,2}') and (user_agent rlike 'Safari/[0-9]+'))) THEN ('browser_Safari') ELSE ('browser_Other') END,CASE WHEN ((user_agent is null or (user_agent = ''))) THEN ('os_Undefined') WHEN (((user_agent rlike 'iPad') or (user_agent rlike 'iPhone'))) THEN ('os_iOS') WHEN ((user_agent rlike 'Android')) THEN ('os_Android') WHEN (((user_agent rlike 'Windows') or (user_agent rlike 'Win(9|.9|NT)'))) THEN ('os_Windows') WHEN (((user_agent rlike 'Macintosh|PowerPC') and (user_agent rlike 'Mac OS'))) THEN ('os_Mac OS') WHEN (((user_agent rlike 'Linux') and (user_agent rlike 'X11'))) THEN ('os_Linux') WHEN ((user_agent rlike 'Solaris|SunOS|OpenBSD|FreeBSD')) THEN ('os_Other') ELSE ('os_Other') END,CASE WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0059'))) THEN ('00_01') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0159'))) THEN ('01_02') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0259'))) THEN ('02_03') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0359'))) THEN ('03_04') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0400') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0459'))) THEN ('04_05') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0500') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0559'))) THEN ('05_06') WHEN 
(((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0600') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0659'))) THEN ('06_07') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0700') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0759'))) THEN ('07_08') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0800') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0859'))) THEN ('08_09') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '0900') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '0959'))) THEN ('09_10') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1059'))) THEN ('10_11') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1159'))) THEN ('11_12') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1259'))) THEN ('12_13') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1359'))) THEN ('13_14') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1400') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1459'))) THEN ('14_15') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1500') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1559'))) THEN ('15_16') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1600') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1659'))) THEN ('16_17') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1700') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1759'))) THEN ('17_18') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1800') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1859'))) THEN ('18_19') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '1900') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '1959'))) THEN ('19_20') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2000') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2059'))) THEN ('20_21') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2100') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2159'))) THEN ('21_22') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2200') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2259'))) THEN ('22_23') WHEN (((from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') >= '2300') and (from_unixtime(unix_timestamp(UDFToString(request_time)), 'HHmm') <= '2359'))) THEN ('23_00') END,CASE WHEN (((country = 'China') and (city is not null and (city <> '')))) THEN (city) WHEN (((country = 'China') and ((city = '') or city is null))) THEN ('China') ELSE (country) END)
                        type: array<string>
                  outputColumnNames: _col0
                  UDTF Operator
                    function name: explode
                    Lateral View Join Operator
                      outputColumnNames: _col12, _col16, _col40, _col42
                      Select Operator
                        expressions:
                              expr: _col12
                              type: string
                              expr: _col16
                              type: string
                              expr: _col42
                              type: string
                        outputColumnNames: _col0, _col1, _col2
                        Reduce Output Operator
                          key expressions:
                                expr: _col2
                                type: string
                          sort order: +
                          Map-reduce partition columns:
                                expr: _col2
                                type: string
                          tag: 1
                          value expressions:
                                expr: _col0
                                type: string
                                expr: _col1
                                type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col1} {VALUE._col2}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col1, _col2, _col5, _col6
          Select Operator
            expressions:
                  expr: _col5
                  type: string
                  expr: _col6
                  type: string
                  expr: _col1
                  type: int
                  expr: _col2
                  type: string
            outputColumnNames: _col5, _col6, _col1, _col2
            Group By Operator
              aggregations:
                    expr: count(_col1)
              bucketGroup: false
              keys:
                    expr: _col5
                    type: string
                    expr: _col6
                    type: string
                    expr: _col1
                    type: int
                    expr: _col2
                    type: string
              mode: hash
              outputColumnNames: _col0, _col1, _col2, _col3, _col4
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        maprfs:/user/hdbatch/tmp/hive/hive_2013-10-12_15-25-43_516_746550569391015319/-mr-10002
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
                    expr: _col2
                    type: int
                    expr: _col3
                    type: string
              sort order: ++++
              Map-reduce partition columns:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
                    expr: _col2
                    type: int
                    expr: _col3
                    type: string
              tag: -1
              value expressions:
                    expr: _col4
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
                expr: KEY._col1
                type: string
                expr: KEY._col2
                type: int
                expr: KEY._col3
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col2
                  type: int
                  expr: _col4
                  type: bigint
                  expr: _col3
                  type: string
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2, _col3, _col4
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
                    expr: _col3
                    type: string
                    expr: _col4
                    type: string
              outputColumnNames: _col0, _col1, _col2, _col3, _col4
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: tanx_rtb.cookie_labels

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            request_date_i 20131010
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: tanx_rtb.cookie_labels

  Stage: Stage-3
    Stats-Aggr Operator
2. Go to the Hadoop monitoring page and find the two jobs. It shows that their run times differ by more than a factor of three (a CLI sketch follows the screenshots):
Stage2: (job screenshot from the monitoring page, not preserved)

Stage6: (job screenshot from the monitoring page, not preserved)

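If the screenshots are unavailable, the same information can also be pulled from the MRv1 command line; a rough sketch (the job IDs are placeholders for the two runs being compared):

    # List all jobs known to the JobTracker, with their state and start time:
    hadoop job -list all
    # Show completion and status for each of the two jobs (placeholder job IDs):
    hadoop job -status job_201310xx_0001
    hadoop job -status job_201310xx_0002
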
3. Compare the Stage6 and Stage2 jobs: their map input sizes are completely different, again by about a factor of three. This is odd, because the hive count results give essentially the same number of rows, so why does the map input differ so much? (A sketch of reading the counters follows the screenshots.)
Job 1: (counter screenshot, not preserved)

Job 2: (counter screenshot, not preserved)

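The counters shown in the screenshots can also be read from the command line; a sketch (job IDs are placeholders, and the exact counter names vary by Hadoop version and filesystem):

    # hadoop job -status prints the job's counters; grep the input-related ones
    # ("Map input records", "Map input bytes", *_BYTES_READ) for both jobs:
    hadoop job -status job_201310xx_0001 | grep -iE 'map input|bytes_read'
    hadoop job -status job_201310xx_0002 | grep -iE 'map input|bytes_read'
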
4. Next, look up the map input paths of the two jobs: job 1 reads noticeably more paths, while job 2 reads noticeably fewer. Checking HDFS shows that the MapReduce input paths do not match the data currently in HDFS (some of the data arrived late). In other words, the faster run most likely started before the late-arriving data had landed, so it read far less input, while hive count, run later after all the data had arrived, reported roughly equal row counts for both days.
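A sketch of the comparison, assuming a typical partitioned-table layout (the warehouse path below is hypothetical; the paths the job actually read can be found in its job.xml, where MRv1 records them in mapred.input.dir):

    # What is in HDFS now: list, count, and size the files under the source partition directory
    # (hypothetical path and partition layout for the rtb_requests table):
    hadoop fs -ls    /user/hive/warehouse/rtb_requests/request_date_i=20131010
    hadoop fs -count /user/hive/warehouse/rtb_requests/request_date_i=20131010
    hadoop fs -du    /user/hive/warehouse/rtb_requests/request_date_i=20131010

If the file list recorded in mapred.input.dir is only a subset of what hadoop fs -ls shows today, the job ran before part of that day's data had landed.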

