Investigating the SparkSQL error when querying a Hive table: codec com.hadoop.compression.lzo.LzoCodec not found

1. Problem Description

During some testing I wanted to reuse an existing table's schema, so I copied part of the original table's CREATE TABLE statement and created a new table. Inserting data into this table from the command line worked fine, but querying it through SparkSQL failed with the following error:

Exception in thread "main" java.lang.RuntimeException: Error in configuring object
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:113)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
	at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:191)
	at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.UnionRDD.$anonfun$getPartitions$1(UnionRDD.scala:85)
	at org.apache.spark.rdd.UnionRDD.$anonfun$getPartitions$1$adapted(UnionRDD.scala:85)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.immutable.List.map(List.scala:298)
	at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:85)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
	at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:442)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:825)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:784)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:793)
	at org.apache.spark.sql.TestMain$.main(TestMain.scala:23)
	at org.apache.spark.sql.TestMain.main(TestMain.scala)
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:110)
	... 59 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
	at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
	at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
	at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
	... 64 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
	at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
	... 66 more
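For context, here is a minimal sketch of the kind of query that triggers the error (this is not the actual TestMain from the stack trace; it assumes a SparkSession with Hive support and uses the database and table names from the DDL shown below):

import org.apache.spark.sql.SparkSession

object LzoCodecRepro {
  def main(args: Array[String]): Unit = {
    // Hive support is required so Spark resolves the table through the Hive metastore
    val spark = SparkSession.builder()
      .appName("lzo-codec-repro")
      .master("local[*]") // local master for a quick test; drop this when submitting to a cluster
      .enableHiveSupport()
      .getOrCreate()

    // Reading the text-format table fails with the LzoCodec error above
    spark.sql("select * from wyt.test_fact_de2 limit 10").show()

    spark.stop()
  }
}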

2. Initial Analysis

When querying through SparkSQL, the table I had copied from could still be queried normally, yet the new table with the same schema failed, asking for an LZO compression class. My guess was that the two tables stored their files in different formats, so I compared the two tables' full definitions using SHOW CREATE TABLE (the DDL returned by this command contains more information than the original CREATE TABLE statement):
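Concretely, the comparison is just two statements in the Hive CLI or beeline (a sketch, assuming the current database is wyt, as in the LOCATION paths below):

show create table test_fact_de;
show create table test_fact_de2;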

The definition of the original (copied-from) table:
CREATE TABLE `test_fact_de`(
  `etl_date` string COMMENT 'YYYY-MM-DD hh:mm:ss', 
  `name` string COMMENT 'name', 
  `type` string COMMENT 'type', 
  `area` string COMMENT 'area', 
  `price` bigint COMMENT 'price', 
  `num` bigint COMMENT 'num', 
  `id` bigint COMMENT 'id')
COMMENT 'olap'
PARTITIONED BY ( 
  `dt` string COMMENT 'dt')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://bigdata.node1:9000/opt/data/hive/warehouse/wyt.db/test_fact_de'
TBLPROPERTIES (
  'transient_lastDdlTime'='1653146615')


The definition of the newly created table:
CREATE TABLE `test_fact_de2`(
  `etl_date` string COMMENT 'YYYY-MM-DD hh:mm:ss', 
  `name` string COMMENT 'name', 
  `type` string COMMENT 'type', 
  `area` string COMMENT 'area', 
  `price` bigint COMMENT 'price', 
  `num` bigint COMMENT 'num', 
  `id` bigint COMMENT 'id')
COMMENT 'olap'
PARTITIONED BY ( 
  `dt` string COMMENT 'dt')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://bigdata.node1:9000/opt/data/hive/warehouse/wyt.db/test_fact_de2'
TBLPROPERTIES (
  'transient_lastDdlTime'='1653146629')

The two tables' INPUTFORMAT and OUTPUTFORMAT are completely different: one table is stored as ORC, the other as plain text. (For the relationship between INPUTFORMAT, OUTPUTFORMAT and file formats, see this article: hive FileFormate 和SerDe的关联关系_Interest1_wyt的博客-CSDN博客.)

3. Investigation

The comparison made it clear that reading ORC-format files does not require the LZO compression classes, while reading text-format files does. So I went to the official documentation to look up the differences between the two storage formats:

TEXTFILE: the default file storage format. If no storage format is specified explicitly when creating or altering a table, and the default has not been changed through the hive.default.fileformat setting, Hive stores the data as plain text files.

ORC (Optimized Row Columnar): introduced in Hive 0.11.0, this format provides an efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats and makes both reads and writes more efficient.
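Incidentally, this also explains how the new table ended up as TEXTFILE: the copied portion of the DDL evidently did not include the ROW FORMAT/STORED AS clauses, so the default applied. To pin the format explicitly when recreating a table, either of the following works (a sketch; the table name and abbreviated column list are illustrative):

-- Declare the storage format explicitly in the DDL (column list abbreviated)
CREATE TABLE test_fact_de2_orc (
  etl_date string COMMENT 'YYYY-MM-DD hh:mm:ss',
  name     string COMMENT 'name'
)
PARTITIONED BY (dt string COMMENT 'dt')
STORED AS ORC;

-- Or change the session default so tables created without STORED AS become ORC
SET hive.default.fileformat=ORC;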

In any case, both formats can involve compression and decompression on read and write, and in the end the official documentation did not give me the answer I was looking for, so I decided to look at the source code to find the cause of the problem.

Since I was querying the Hive table through SparkSQL, and the way SparkSQL parses and executes a query is fairly involved, I won't walk through the whole process here and will go straight to the conclusions drawn from the source code:

1. Querying an ORC-format Hive table ends up going through the FileScanRDD path, which picks the matching compression codec directly from the type marker at the end of the Hive file.

2. Querying a text-format Hive table ends up going through the HadoopRDD path, which at compute time first loads all compression codec classes present in the project, plus every codec class declared in the io.compression.codecs configuration property (this is the root cause of the problem; see the sketch below).
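The mechanism in point 2 can be demonstrated outside of Spark: Hadoop's CompressionCodecFactory eagerly loads every codec class declared in io.compression.codecs and fails exactly as in the stack trace above when one of them is not on the classpath. A standalone sketch (not code from Spark or from the original project):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecLoadCheck {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Declare a codec class that is not on the classpath, mimicking the problematic setting
    conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzoCodec")
    // The factory loads every declared codec class up front and throws
    // IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    new CompressionCodecFactory(conf)
  }
}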

Checking my configuration, this property was indeed present:

[screenshot of the io.compression.codecs entry in the configuration file]

After commenting out this property, queries against the text-format table worked normally.
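For reference, such a declaration usually lives in a Hadoop-side configuration file such as core-site.xml (or whatever configuration the Spark application picks up). A typical entry looks roughly like the following; this is purely illustrative, not the exact value from the screenshot above:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>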

4. Solution

Commenting out the io.compression.codecs property, as above, only treats the symptom, not the cause.
First, the fact that io.compression.codecs was configured at all means the property is useful to us or to other users, and it can declare multiple codec classes, so it cannot simply be commented out.
Second, in my tests, adding io.compression.codecs to Hadoop's configuration files produced the same error.

So in practice this workaround is hard to adopt. The better solution is to add the corresponding compression codec classes to the project. (In my tests the jar below does contain the LZO codec classes; whether it covers other codec types I have not tested.)

<!-- https://mvnrepository.com/artifact/com.hadoop.compression/hadoop-gpl-compression -->
<dependency>
    <groupId>com.hadoop.compression</groupId>
    <artifactId>hadoop-gpl-compression</artifactId>
    <version>0.1.0</version>
</dependency>
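When the application is submitted with spark-submit rather than packaged together with its dependencies, the codec jar also has to reach the driver and executors at runtime. One common way (the path, class name, and application jar below are illustrative) is the --jars option:

spark-submit \
  --class com.example.TestMain \
  --jars /path/to/hadoop-gpl-compression-0.1.0.jar \
  my-spark-app.jar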
