1. Problem description
During a round of testing I wanted to reuse an existing table structure, so I copied part of the original table's CREATE TABLE statement to create a new table. Inserting data into the new table from the command line worked fine, but querying it through Spark SQL failed with:
Exception in thread "main" java.lang.RuntimeException: Error in configuring object
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:113)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:79)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:137)
at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:191)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:205)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.UnionRDD.$anonfun$getPartitions$1(UnionRDD.scala:85)
at org.apache.spark.rdd.UnionRDD.$anonfun$getPartitions$1$adapted(UnionRDD.scala:85)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:85)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:442)
at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
at org.apache.spark.sql.Dataset.show(Dataset.scala:825)
at org.apache.spark.sql.Dataset.show(Dataset.scala:784)
at org.apache.spark.sql.Dataset.show(Dataset.scala:793)
at org.apache.spark.sql.TestMain$.main(TestMain.scala:23)
at org.apache.spark.sql.TestMain.main(TestMain.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:110)
... 59 more
Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
... 64 more
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2499)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
... 66 more
2. Initial analysis
When querying through Spark SQL, the original table worked fine, yet the new table with the same column structure failed, asking for an LZO-compression-related class. My guess was that the two tables stored their files in different formats, so I compared the full DDL of both tables with SHOW CREATE TABLE (the statement this command returns carries more information than the CREATE TABLE statement I originally wrote):
The structure of the original (copied-from) table:
CREATE TABLE `test_fact_de`(
`etl_date` string COMMENT 'YYYY-MM-DD hh:mm:ss',
`name` string COMMENT 'name',
`type` string COMMENT 'type',
`area` string COMMENT 'area',
`price` bigint COMMENT 'price',
`num` bigint COMMENT 'num',
`id` bigint COMMENT 'id')
COMMENT 'olap'
PARTITIONED BY (
`dt` string COMMENT 'dt')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'hdfs://bigdata.node1:9000/opt/data/hive/warehouse/wyt.db/test_fact_de'
TBLPROPERTIES (
'transient_lastDdlTime'='1653146615')
The structure of the newly created table:
CREATE TABLE `test_fact_de2`(
`etl_date` string COMMENT 'YYYY-MM-DD hh:mm:ss',
`name` string COMMENT 'name',
`type` string COMMENT 'type',
`area` string COMMENT 'area',
`price` bigint COMMENT 'price',
`num` bigint COMMENT 'num',
`id` bigint COMMENT 'id')
COMMENT 'olap'
PARTITIONED BY (
`dt` string COMMENT 'dt')
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://bigdata.node1:9000/opt/data/hive/warehouse/wyt.db/test_fact_de2'
TBLPROPERTIES (
'transient_lastDdlTime'='1653146629')
The two tables have completely different INPUTFORMAT and OUTPUTFORMAT settings: one uses the ORC storage format, the other plain-text storage. (For the relationship between INPUTFORMAT/OUTPUTFORMAT and file formats, see the CSDN article "hive FileFormate 和SerDe的关联关系".)
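The mismatch almost certainly arose because the copied CREATE TABLE statement omitted the STORED AS clause, so the new table fell back to the TEXTFILE default. A hedged sketch of a DDL that would have produced matching formats (column list abbreviated; `STORED AS ORC` is shorthand for the OrcSerde / OrcInputFormat / OrcOutputFormat triple shown in the first DDL):

```sql
-- Declaring the storage format explicitly avoids the TEXTFILE default.
CREATE TABLE `test_fact_de2`(
  `etl_date` string COMMENT 'YYYY-MM-DD hh:mm:ss',
  `name` string COMMENT 'name',
  -- remaining columns as in the original DDL
  `id` bigint COMMENT 'id')
COMMENT 'olap'
PARTITIONED BY (`dt` string COMMENT 'dt')
STORED AS ORC;
```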
3. Investigation
The comparison established that reading ORC-format files does not require the LZO compression classes, while reading text-format files does. I first looked at the official documentation for the differences between the two storage formats:
TEXTFILE: the default file storage format. If no format is specified explicitly when creating or altering a table, and the default has not been changed via the hive.default.fileformat setting, Hive stores the data as plain text files.
ORC: short for Optimized Row Columnar, introduced in Hive 0.11.0. It provides an efficient way to store Hive data and was designed to overcome limitations of the other Hive file formats, making both reads and writes more efficient.
Both formats can involve compression and decompression on read and write, however, and the official documentation did not answer my question, so I decided to look for the cause in the source code.
Since I query the Hive table through Spark SQL, and Spark SQL's parsing and execution pipeline is fairly involved, I will skip the details and state the conclusions drawn from the source code:
1. Querying an ORC-format Hive table ultimately goes through the FileScanRDD path, which selects the matching compression codec directly from the type marker at the end of the file.
2. Querying a text-format Hive table ultimately goes through the HadoopRDD path, which at compute time first loads all compression codec classes present in the project, plus every codec class declared in the io.compression.codecs configuration property (this is the root cause of the problem).
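The eager loading on the text path can be illustrated with a minimal sketch. This is an analogy, not Hadoop's actual code: `loadCodecs` stands in for `CompressionCodecFactory.getCodecClasses`, which resolves every class name in `io.compression.codecs` via reflection and fails fast if any single one is missing, even when the files being read never use that codec.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of why one missing codec class breaks the whole text-read
// path: every class named in io.compression.codecs is resolved eagerly.
public class CodecLoadingSketch {
    public static List<Class<?>> loadCodecs(String codecsProperty) {
        List<Class<?>> codecs = new ArrayList<>();
        for (String name : codecsProperty.split(",")) {
            try {
                codecs.add(Class.forName(name.trim()));
            } catch (ClassNotFoundException e) {
                // Mirrors "Compression codec ... not found." in the stack trace.
                throw new IllegalArgumentException(
                        "Compression codec " + name.trim() + " not found.", e);
            }
        }
        return codecs;
    }

    public static void main(String[] args) {
        // A class on the classpath loads fine...
        System.out.println(loadCodecs("java.lang.String").size()); // prints 1
        // ...but one missing entry fails the entire list, as in the error above.
        try {
            loadCodecs("java.lang.String,com.hadoop.compression.lzo.LzoCodec");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

This also explains why the ORC table is unaffected: its read path never consults the configured codec list.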
Checking my configuration file confirmed that the io.compression.codecs property was indeed set there, with the LZO codec among the declared classes.
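The entry looked roughly like the following (a hedged reconstruction, since the exact codec list depends on the cluster; the non-LZO entries here are illustrative):

```xml
<!-- core-site.xml: every class listed here is loaded eagerly by
     CompressionCodecFactory when a text-format table is read. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
```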
After commenting out this property, queries against the text-format table worked again.
4. Solution
Commenting out io.compression.codecs, as above, treats the symptom rather than the cause.
First, the property was configured for a reason: it is useful to us or to other users, and it can declare multiple codec classes, so it should not be casually removed.
Second, I tested adding io.compression.codecs only to the Hadoop configuration files, and the error still occurs.
In practice, then, removing the configuration is rarely viable. The best fix is to bring in the corresponding codec classes. (I have verified that the jar below contains the LZO-related codec classes; whether it contains codecs of other types I have not yet tested.)
<!-- https://mvnrepository.com/artifact/com.hadoop.compression/hadoop-gpl-compression -->
<dependency>
    <groupId>com.hadoop.compression</groupId>
    <artifactId>hadoop-gpl-compression</artifactId>
    <version>0.1.0</version>
</dependency>