Hive 2.3 decimal精度损失问题

1.准备测试数据

    使用如下建表语句,并插入测试数据:

CREATE TABLE IF NOT EXISTS test_decimal(

    md5 string,

    id int,

    ty int,

    amount decimal(38, 12)

) stored as orc ;

insert into table test_decimal values

('9F99855A44BD41FE592B69E0D36BF3E8', 4591, 2, 188593.210890000000),

('9F99855A44BD41FE592B69E0D36BF3E8', 4592, 2, 177918.123481132049),

('9F99855A44BD41FE592B69E0D36BF3E8', 4593, 2, 10675.087408867951);

2.使用测试sql测试(在2.3.x版本中执行的)

    使用测试sql,发现测试的结果有精度损失:

hive> select id, sum(amount) from test_decimal group by id;

OK

4591 188593.210890000000

4592 177918.123481132049

4593 10675.087408867951

Time taken: 28.013 seconds, Fetched: 3 row(s)

 

hive>  select id, sum(amount) * -1 from test_decimal group by id;

OK

4591 -188593.2108900000

4592 -177918.1234811320

4593 -10675.0874088680

Time taken: 26.016 seconds, Fetched: 3 row(s)

    通过比较测试结果发现,在sum函数之后乘以 -1 导致精度损失了2位。

3.通过分析执行计划查找两条sql的执行计划的区别,查找原因(在2.3.x版本中执行的)

    直接输出sum的sql的执行计划:

hive> explain select id, sum(amount) from test_decimal group by id;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: test_decimal

            Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

            Select Operator

              expressions: id (type: int), amount (type: decimal(38,12))

              outputColumnNames: id, amount

              Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

              Group By Operator

                aggregations: sum(amount)

                keys: id (type: int)

                mode: hash

                outputColumnNames: _col0, _col1

                Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

                Reduce Output Operator

                  key expressions: _col0 (type: int)

                  sort order: +

                  Map-reduce partition columns: _col0 (type: int)

                  Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

                  value expressions: _col1 (type: decimal(38,12))

      Reduce Operator Tree:

        Group By Operator

          aggregations: sum(VALUE._col0)

          keys: KEY._col0 (type: int)

          mode: mergepartial

          outputColumnNames: _col0, _col1

          Statistics: Num rows: 1 Data size: 236 Basic stats: COMPLETE Column stats: NONE

          File Output Operator

            compressed: false

            Statistics: Num rows: 1 Data size: 236 Basic stats: COMPLETE Column stats: NONE

            table:

                input format: org.apache.hadoop.mapred.SequenceFileInputFormat

                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink

Time taken: 0.16 seconds, Fetched: 48 row(s)

sum后乘以 -1 的sql的执行计划:

hive> explain select id, sum(amount)*-1 from test_decimal group by id;

OK

STAGE DEPENDENCIES:

  Stage-1 is a root stage

  Stage-0 depends on stages: Stage-1

STAGE PLANS:

  Stage: Stage-1

    Map Reduce

      Map Operator Tree:

          TableScan

            alias: test_decimal

            Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

            Select Operator

              expressions: id (type: int), amount (type: decimal(38,12))

              outputColumnNames: id, amount

              Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

              Group By Operator

                aggregations: sum(amount)

                keys: id (type: int)

                mode: hash

                outputColumnNames: _col0, _col1

                Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

                Reduce Output Operator

                  key expressions: _col0 (type: int)

                  sort order: +

                  Map-reduce partition columns: _col0 (type: int)

                  Statistics: Num rows: 3 Data size: 708 Basic stats: COMPLETE Column stats: NONE

                  value expressions: _col1 (type: decimal(38,12))

      Reduce Operator Tree:

        Group By Operator

          aggregations: sum(VALUE._col0)

          keys: KEY._col0 (type: int)

          mode: mergepartial

          outputColumnNames: _col0, _col1

          Statistics: Num rows: 1 Data size: 236 Basic stats: COMPLETE Column stats: NONE

          Select Operator

            expressions: _col0 (type: int), (_col1 * -1) (type: decimal(38,10))

            outputColumnNames: _col0, _col1

            Statistics: Num rows: 1 Data size: 236 Basic stats: COMPLETE Column stats: NONE

            File Output Operator

              compressed: false

              Statistics: Num rows: 1 Data size: 236 Basic stats: COMPLETE Column stats: NONE

              table:

                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat

                  output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0

    Fetch Operator

      limit: -1

      Processor Tree:

        ListSink

Time taken: 4.656 seconds, Fetched: 52 row(s)

    通过查看两条sql(两条sql的差别是在sum函数后面有没有乘以 -1)的执行计划发现,直接输出sum的结果是 (value expressions: _col1 (type:decimal(38,12)))类型,输出乘以 -1 的结果是 ( expressions: _col0 (type: int), (_col1 * -1) (type:decimal(38,10)) )类型,说明乘以 -1 之后,精度损失了2位。

4.分析源码,找原因

    通过分析sum后乘以 -1 的代码,其中关键的代码如下:

    2.3.x 版本的GenericUDFOPMultiply 类的关键代码

 

    prec1 和 scale1 代表的是 decimal(38,12)

    prec2 和 scale2 代表的是 -1被转换成decimal后的类型为 decimal(1, 0)

    其中 adjustPrecScale()方法的代码在其父类GenericUDFBaseNumeric中,代码如下:

 

    decimal支持的最大精度为38,而通过上面的计算,发现精度precision字段的值已经达到了40,超过的最大精度,因此,需要重新计算精度,计算后的结果是,将小数的精度减少了2位为10,精度使用最大精度值38。

    在hive2.3.x中,算术运算的

  • 1
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值