Exceptions when Spark reads CSV, ORC, and other files
1 Symptoms
Spark throws the following exceptions when reading CSV and ORC files:
java.lang.IllegalArgumentException: Illegal pattern component: XXX
java.lang.NoSuchFieldError: KRYO_SARG_BUFFER
1.1 The exception when reading a CSV file:
Exception in thread "main" java.lang.IllegalArgumentException: Illegal pattern component: XXX
at org.apache.commons.lang3.time.FastDateFormat.parsePattern(FastDateFormat.java:577)
at org.apache.commons.lang3.time.FastDateFormat.init(FastDateFormat.java:444)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:437)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:110)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:109)
at org.apache.commons.lang3.time.FormatCache.getInstance(FormatCache.java:82)
at org.apache.commons.lang3.time.FastDateFormat.getInstance(FastDateFormat.java:205)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:140)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:42)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:187)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:187)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:466)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
1.2 The exception when reading an ORC file:
java.lang.NoSuchFieldError: KRYO_SARG_BUFFER
at org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:95)
at org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:160)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:159)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:339)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:331)
at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:381)
at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:310)
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
at org.apache.spark.sql.execution.BaseLimitExec$class.inputRDDs(limit.scala:62)
at org.apache.spark.sql.execution.LocalLimitExec.inputRDDs(limit.scala:98)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:70)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.eagerExecute(ShuffleExchangeExec.scala:112)
at org.apache.spark.sql.execution.adaptive.ShuffleQueryStage.executeStage(QueryStage.scala:249)
at org.apache.spark.sql.execution.adaptive.QueryStage.doExecute(QueryStage.scala:201)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:137)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:161)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:158)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:133)
at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$7$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:77)
at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$7$$anonfun$apply$2$$anonfun$apply$3.apply(QueryStage.scala:77)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionIdAndJobDesc(SQLExecution.scala:162)
at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$7$$anonfun$apply$2.apply(QueryStage.scala:76)
at org.apache.spark.sql.execution.adaptive.QueryStage$$anonfun$7$$anonfun$apply$2.apply(QueryStage.scala:76)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
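2 Troubleshooting approach and process
-
The same code had been running fine in this environment and then suddenly started failing. Running the same code locally in IDEA worked without errors.
-
Looking at the source of org.apache.commons.lang3.time.FastDateFormat (commons-lang3), the line numbers in the stack trace (577, 444, and so on) did not match the code.
-
After switching to commons-lang3 3.10, the same error as on the server appeared locally: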
Exception in thread "main" java.lang.IllegalArgumentException: Illegal pattern component: XXX
at org.apache.commons.lang3.time.FastDateFormat.parsePattern(FastDateFormat.java:577)
at org.apache.commons.lang3.time.FastDateFormat.init(FastDateFormat.java:444)
at org.apache.commons.lang3.time.FastDateFormat.<init>(FastDateFormat.java:437)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:110)
at org.apache.commons.lang3.time.FastDateFormat$1.createInstance(FastDateFormat.java:109)
at org.apache.commons.lang3.time.FormatCache.getInstance(FormatCache.java:82)
at org.apache.commons.lang3.time.FastDateFormat.getInstance(FastDateFormat.java:205)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:139)
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:41)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:58)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$6.apply(DataSource.scala:180)
at scala.Option.orElse(Option.scala:289)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:179)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
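When stack-trace line numbers do not match the source you are reading, it can help to ask the JVM which jar it actually loaded the class from. The following is a minimal sketch using only the JDK; the class name ClassOrigin and the method locationOf are made up for illustration, and the default class name in main only resolves if that class is on the classpath.

```java
import java.security.CodeSource;

/** Reports where the JVM actually loaded a class from. */
final class ClassOrigin {

    /** Returns the jar or directory a class was loaded from, or a note for JDK classes. */
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            // Classes loaded by the bootstrap loader (java.lang.String, ...) have no code source.
            return "(bootstrap / JDK runtime)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Default matches the class in the stack trace; pass another class name as the first argument.
        String name = args.length > 0 ? args[0] : "org.apache.commons.lang3.time.FastDateFormat";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the failing job, the printed location should name the jar that wins the conflict.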
-
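Check whether a changed jar version in production was the cause
The jar version in production was 3.5, and switching IDEA to 3.5 ran fine, so a jar version change was ruled out.
-
According to https://blog.csdn.net/qq1226317595/article/details/100540091, adding option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ") works around the problem. But since reading ORC files also failed and the job had run fine before, I kept digging for the root cause.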
-
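Checking exactly where the error was raised in IDEA showed two FastDateFormat classes, one in the commons-lang3-3.10 jar and one in the hive-exec 2.3.5 jar.
So I suspected that the FastDateFormat class in hive-exec was conflicting with the one in commons-lang3.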
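A suspicion like this can also be checked without an IDE by asking the class loader for every copy of the class it can see. The sketch below uses only the JDK; DuplicateClassFinder, toResourcePath, and findCopies are illustrative names, not part of Spark or commons-lang3.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

/** Lists every location on the classpath that provides a given class. */
final class DuplicateClassFinder {

    /** Converts a class name such as "a.b.C" into the resource path "a/b/C.class". */
    static String toResourcePath(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns one URL per copy of the class visible to the given class loader. */
    static List<URL> findCopies(String className, ClassLoader loader) {
        try {
            return Collections.list(loader.getResources(toResourcePath(className)));
        } catch (IOException e) {
            throw new UncheckedIOException("cannot enumerate " + className, e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.commons.lang3.time.FastDateFormat";
        List<URL> copies = findCopies(name, ClassLoader.getSystemClassLoader());
        System.out.println(copies.size() + " copy/copies of " + name);
        for (URL url : copies) {
            // Each URL names the jar (or directory) that contains one copy.
            System.out.println("  " + url);
        }
    }
}
```

More than one URL means several jars ship the same class, and which copy is used depends on classpath order.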
-
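I removed the hive-exec dependency from the pom and used the spark-hive dependency instead. After rerunning, the exception disappeared and the program ran normally.
Note: I have no idea why I naively added the hive-exec dependency in the first place, but that coincidence is exactly what made the problem reproducible locally and made troubleshooting easier.
-
Verify whether production also pulls in the hive-exec jar
I confirmed that production loads the same dependency. Because the YARN UI only shows jobs from the current day, I could not tell whether hive-exec was absent back when the job ran normally. At this point it felt like the problem would be solved soon.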
-
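Find out how the hive-exec jar gets pulled in; removing it should restore normal behavior
Searching for /opt/apps/extra-jars/hive-exec-2.3.5.jar returned only one result.
Searching for /opt/apps/extra-jars/ returned 10 results, and showed that hive-exec is submitted via spark.driver.extraClassPath and spark.executor.extraClassPath.
-
Set spark.executor.extraClassPath and spark.driver.extraClassPath explicitly so that the default configuration, which loads hive-exec, is not used and the conflict is avoided. The added configuration is shown below; any jar will do, as long as the default configuration is not loaded.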
--conf spark.executor.extraClassPath=commons-lang3-3.5.jar \
--conf spark.driver.extraClassPath=commons-lang3-3.5.jar \
-
Verify the result
The job was submitted and ran to completion. The ORC read problem was resolved as well.
spark-submit \
--master yarn \
--deploy-mode client \
--conf spark.executor.extraClassPath=commons-lang3-3.5.jar \
--conf spark.driver.extraClassPath=commons-lang3-3.5.jar \
--conf spark.rdd.compress=true \
--conf spark.sql.sources.default=csv \
--conf spark.scheduler.listenerbus.eventqueue.capacity=100000 \
--queue default \
--class com.sinoiov.bigdata.track.ParseTrack \
original-parse-vehicle-track-1.0.0.jar
Checking the page confirmed that the hive-exec jar is no longer among the dependencies.
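The same check can be done from inside the JVM by filtering its classpath for a jar name. This is a sketch only: ClasspathCheck and entriesMatching are hypothetical names, and it assumes the extra class path entries end up in the java.class.path system property of the driver, which is my understanding of how spark-submit launches it but is worth verifying on your cluster.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Filters a classpath string for entries whose file name contains a fragment. */
final class ClasspathCheck {

    /** Returns the classpath entries whose file name contains the given fragment. */
    static List<String> entriesMatching(String classPath, String fragment) {
        List<String> hits = new ArrayList<>();
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        for (String entry : classPath.split(Pattern.quote(File.pathSeparator))) {
            if (new File(entry).getName().contains(fragment)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Default fragment is the jar this write-up wants to exclude.
        String fragment = args.length > 0 ? args[0] : "hive-exec";
        List<String> hits = entriesMatching(System.getProperty("java.class.path"), fragment);
        System.out.println(hits.isEmpty()
                ? "no classpath entry matches " + fragment
                : "matching entries: " + hits);
    }
}
```

An empty result only covers the JVM it runs in; executors have their own classpath and would need the same check there.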
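3 Summary
1. When the class information in an error does not match the location of the class you are actually using, suspect a class conflict.
2. Sloppy dependency declarations planted the risk of a class conflict.
3. The production environment cannot be fully trusted; today's environment may well differ from yesterday's.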