hive decimal bug， nvl(decimal,1)=0

cclovezbf

已于 2023-10-12 16:10:24 修改

阅读量1.4k

点赞数 7

分类专栏： hive 文章标签： hive decimal vectorized

于 2022-05-25 10:52:41 首次发布

本文链接：https://blog.csdn.net/cclovezbf/article/details/124959845

版权

hive 专栏收录该内容

49 篇文章 10 订阅

订阅专栏

话不多说直接上图 hive版本为2.1.2-cdh6.3.2

一个nvl (decimal,1)结果=0 怎么就出现bug了呢？？？

按道理不是两列值相等么，真是服了，怎么这么多bug，直接百度再说，结果是我的文章，，难道要我自己借鉴自己？

——————————————————————————————————————————

将decimal转string

继续研究，根据我上次研究的问题，decimal容易出现精度丢失等问题，我记得我当时转了string就好了。那么测试下，结果ok，但是decimal报错的问题还是没解决。

继续测试

这次测试发现了一个很神奇的东西。

nvl +了一个值居然就正确了，但是和原先的值又不一样。所以这个值是一个错误的正确值。

这个和上面的基本一样只是，没有把t.pay_standard_rate转为decimal 因为本身就是decimal.

结果值完全不一样。。

测试中最有意思的事来了

注意看这个sql和之前的sql一模一样就是多了个判断这个值是否是=0.

然后就出现了正确的结果！！！！！！！！！！

这时候就有疑问了，我查另外一列和我查nvl(T.PAY_STANDARD_RATE , 1) 有关系吗？

我抓周树人和鲁迅有关系么？你说瞎子坑，又不是说我李青打得不好？老子叫盖伦又不叫德玛.

大杀招

还是要依靠explain.既然你说周树人和鲁迅不一样，那我就把你们两个好好检查下

explain select
T.PAY_STANDARD_RATE,
nvl(T.PAY_STANDARD_RATE , 1),
T.PAY_STANDARD_RATE=1
from ODSCOSTDATA.UPAY_PAYMENT T

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: hive_20220525095616_c2aec1ed-f5ba-4625-a587-e323d8ac3e1c:3954
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: t
                  Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: pay_standard_rate (type: decimal(38,18)), NVL(pay_standard_rate,1) (type: decimal(38,18)), (pay_standard_rate = 1) (type: boolean)
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

explain
select T.PAY_STANDARD_RATE,
nvl(T.PAY_STANDARD_RATE , 1)
from ODSCOSTDATA.UPAY_PAYMENT T

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: hive_20220525095658_abf11a23-3e89-47fa-ad1b-8e84b3988910:3955
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: t
                  Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: pay_standard_rate (type: decimal(38,18)), NVL(pay_standard_rate,1) (type: decimal(38,18))
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 600 Data size: 67284 Basic stats: COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                          output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

注意看下面的多了一行 Execution mode: vectorized .

这个又是啥？因为我之前研究过这个 vectorized，所以我知道这是啥。。。。

给小伙伴一个办法。set 列出所有属性值

结论

经过测试 hive.vectorized.execution.enabled=true改为false不使用矢量化模式就可解决bug ！！!

扩展学习

这个矢量化是什么？为什么他自然开启，开启了这么多bug，开启有什么好处？

Vectorized Query Execution - Apache Hive - Apache Software Foundationhttps://cwiki.apache.org/confluence/display/Hive/Vectorized+Query+Execution

Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time. This involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time. Within the block, each column is stored as a vector (an array of a primitive data type). Simple operations like arithmetic and comparisons are done by quickly iterating through the vectors in a tight loop, with no or very few function calls or conditional branches inside the loop. These loops compile in a streamlined way that uses relatively few instructions and finishes each instruction in fewer clock cycles, on average, by effectively using the processor pipeline and cache memory. A detailed design document is attached to the vectorized query execution JIRA, at [HIVE-4160] Vectorized Query Execution in Hive - ASF JIRA.

简单的来说这个减少cpu使用率，批量查询，加快查询速度

开启方式1. set hive.vectorized.execution.enabled = true

2.hive表格式必须是orc

Supported data types and operations

The following data types are currently supported for vectorized execution:

tinyint
smallint
int
bigint
boolean
float
double
decimal
date
timestamp (see Limitations below)
string

Using other data types will cause your query to execute using standard, row-at-a-time execution.

使用其他类型会不执行vectorized模式，变成一次一条...

The following expressions can be vectorized when used on supported types:

arithmetic: +, -, *, /, %
AND, OR, NOT
comparisons <, >, <=, >=, =, !=, BETWEEN, IN ( list-of-constants ) as filters
Boolean-valued expressions (non-filters) using AND, OR, NOT, <, >, <=, >=, =, !=
IS [NOT] NULL
all math functions (SIN, LOG, etc.)
string functions SUBSTR, CONCAT, TRIM, LTRIM, RTRIM, LOWER, UPPER, LENGTH
type casts
Hive user-defined functions, including standard and generic UDFs
date functions (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, UNIX_TIMESTAMP)
the IF conditional expression

注意啊这上面说的都是在where里比较计算还是用 vectorized mode。

但是我之前是在select 语句里的select 用到了 column =1 这种就会退出矢量化模式

最后说下如果有相同的小伙伴遇到这个问题。

set hive.vectorized.execution.enabled = false
或者在select 里加一列 column=1 随便等于几都行，主要是退出矢量化模式。

也可以通过explain，看下当前sql是否在矢量化模式。

如果帮到你，点个赞是对我最大的支持

cclovezbf

关注

7
点赞
踩
13

收藏

觉得还不错? 一键收藏
2
评论
hive decimal bug， nvl(decimal,1)=0

话不多说直接上图 hive版本为2.1.2-cdh6.3.2一个nvl (decimal,1)结果=0 怎么就出现bug了呢？？？按道理不是两列值相等么，真是服了，怎么这么多bug，直接百度再说，结果是我的文章，，难道要我自己借鉴自己？——————————————————————————————————————————将decimal转string继续研究，根据我上次研究的问题，decimal容易出现精度丢失等问题，我记得我当时转了string就好了。那么测试下，结果ok，但是d
复制链接

扫一扫

专栏目录