Spark-on-Hive job fails with parquet.io.ParquetDecodingException: Can not read value at 0 in block

Writing this down after solving a problem:

A Spark job unexpectedly died with "job aborted" and could not continue. Tracing the failure showed that a Spark SQL query against one particular table was breaking while reading Parquet. It puzzled me for a long time, and I reworked the program over and over before finding a post online that explained it; I hope this writeup helps others. My cluster sits on an isolated network, so the error trace below is the equivalent one borrowed from that post. The Spark version involved is the ancient 1.5.1.

The post I referenced is quoted below:

ERROR: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1521667682013_4868_1_00, diagnostics=[Task failed, taskId=task_1521667682013_4868_1_00_000082, diagnostics=[TaskAttempt 0 failed, info=[Error: Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: java.io.IOException: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://shastina/sys/datalake_dev/venmo/data/managed_zone/integration/ACCOUNT_20180305/part-r-00082-bc0c080c-4080-4f6b-9b94-f5bafb5234db.snappy.parquet
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
    at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
    at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:194)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:185)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:185)
    at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:181)
    at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The cause, as the post explains:

Root Cause:

This issue is caused because of different parquet conventions used in Hive and Spark. In Hive, the decimal datatype is represented as fixed bytes (INT 32). In Spark 1.4 or later the default convention is to use the Standard Parquet representation for decimal data type. As per the Standard Parquet representation based on the precision of the column datatype, the underlying representation changes.

e.g.:
DECIMAL can be used to annotate the following types:
int32: for 1 <= precision <= 9
int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Hence this issue happens only with the usage of datatypes which have different representations in the different Parquet conventions. If the datatype is DECIMAL (10,3), both the conventions represent it as INT32, hence we won't face an issue. If you are not aware of the internal representation of the datatypes it is safe to use the same convention used for writing while reading. With Hive, you do not have the flexibility to choose the Parquet convention. But with Spark, you do.

Solution:
The convention used by Spark to write Parquet data is configurable. This is determined by the property spark.sql.parquet.writeLegacyFormat
The default value is false. If set to "true", Spark will use the same convention as Hive for writing the Parquet data. This will help to solve the issue. 
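Before reaching for a fix, it can help to confirm which convention actually wrote the suspect files. Below is a minimal sketch using the Parquet footer API; the HDFS path is a placeholder, and on older Parquet builds the package is parquet.hadoop rather than org.apache.parquet.hadoop:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader

    // Print the physical schema of one part file. Legacy (Hive-style) decimals
    // show up as fixed_len_byte_array; standard-convention decimals with small
    // precision show up as int32/int64.
    val footer = ParquetFileReader.readFooter(
      new Configuration(),
      new Path("hdfs:///path/to/table/part-r-00000.snappy.parquet")) // placeholder path
    println(footer.getFileMetaData.getSchema)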

So at bottom this is the Hive-versus-Spark mismatch in how DECIMAL values are encoded in Parquet. The suggested fix:

    --conf "spark.sql.parquet.writeLegacyFormat=true"

On my old Spark 1.5.1, however, the parameter did not take effect; it appears to have been introduced only in later releases.
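On Spark versions that do have the property, it can also be set programmatically before any Parquet write. A minimal sketch, with placeholder table names:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("legacy-parquet-write"))
    val hiveContext = new HiveContext(sc)

    // Write decimals in the legacy, Hive-readable layout from here on.
    hiveContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")

    // Any Parquet written afterwards uses the legacy convention, e.g.:
    hiveContext.sql("INSERT OVERWRITE TABLE account_full SELECT * FROM account_delta")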

Since the program could not read or write this Parquet table at all, I worked around it by recreating the problematic table as a separate ORC table:

ROW FORMAT SERDE 
   'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

That solved it, if by a bit of a detour.
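For completeness, here is a sketch of the whole workaround with hypothetical table names, reusing the hiveContext from the sketch above; if Spark itself cannot read the source table, the same statement can be run directly in Hive:

    // Hypothetical names: account_parquet is the unreadable table,
    // account_orc is its ORC replacement.
    hiveContext.sql(
      """CREATE TABLE account_orc
        |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
        |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
        |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
        |AS SELECT * FROM account_parquet""".stripMargin)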

ORC does fix the problem, but it comes with a downside: most of my tables are Parquet, so the Parquet-to-ORC conversion during the incremental-to-full merge adds noticeable time.

In the end I changed the decimal columns to double, which solved the problem without breaking the existing schema consistency, and the job's runtime did not change at all.
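A sketch of that change with hypothetical names, again via hiveContext; note that DOUBLE trades exact decimal semantics for compatibility, so it only suits columns where that loss is acceptable:

    // Rebuild the table with the decimal column recast as DOUBLE.
    hiveContext.sql(
      """CREATE TABLE account_dbl STORED AS PARQUET AS
        |SELECT account_id, CAST(balance AS DOUBLE) AS balance
        |FROM account""".stripMargin)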

One more problem came up afterwards:

    Unscaled value too large for precision

This means the table schema is inconsistent between what was written and what is being read, so check the schema on both sides. It mostly comes from STRING-to-DECIMAL conversions where a value does not fit the declared precision.
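A quick way to locate the offending values, with hypothetical names: for a target type of DECIMAL(10,3) there are only 10 - 3 = 7 digits available before the decimal point, so anything at or above 1e7 in absolute value will overflow. Surface those rows before the cast:

    // Find string values whose integer part cannot fit in DECIMAL(10,3).
    // Malformed strings cast to NULL and drop out of the filter.
    hiveContext.sql(
      """SELECT amount_str FROM staging_account
        |WHERE ABS(CAST(amount_str AS DOUBLE)) >= 1e7""".stripMargin).show()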

java.lang.ClassNotFoundException is the Java exception for a class that cannot be found; here the full message is java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.io.parquet.mapreduce.ParquetInputFormat. It typically shows up when processing data with Hadoop and Hive, and it means the JVM failed to load that class. The usual causes:

1. Missing dependencies: working with Parquet through Hive requires the proper libraries to be configured and pulled in. A missing jar or a version mismatch produces this exception; make sure every dependency is present and the versions are mutually compatible.

2. A misconfigured classpath: the JVM resolves classes from the classpath at runtime, so if the classpath is set up incorrectly the target class cannot be found. Check that it includes the required jars.

3. A misspelled class name: occasionally the class name is simply typed incorrectly; fixing the spelling resolves it.

In short, verify the dependency jars, the classpath configuration, and the class name spelling.
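A minimal sketch for checking the first two causes from the driver side: probe for the class directly, and if it is missing, supply the jar that provides it (typically hive-exec), for example through spark-submit's --jars option or the extraClassPath settings:

    // Probe whether the class named in the exception is visible to this JVM.
    try {
      Class.forName("org.apache.hadoop.hive.ql.io.parquet.mapreduce.ParquetInputFormat")
      println("class found on the classpath")
    } catch {
      case _: ClassNotFoundException =>
        println("class missing: add the jar that provides it to the classpath")
    }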