Common Spark errors
When I first started working with Spark I kept running into problems; I will keep adding to this list.
- Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT OUTER join between logical plans LocalLimit 21
Fix: add spark.conf.set("spark.sql.crossJoin.enabled", "true") (untested; a solution found online).
- Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Fix: add config("spark.debug.maxToStringFields", 1000) when building the SparkSession.
- Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Fix: the root cause was that the textFile call below it read no data, which then triggered this error. Check the file path, and check whether HDFS is running.
- ERROR: org.apache.spark.sql.AnalysisException: resolved attribute(s) id#426
Fix: recreate the DataFrame (a common workaround for stale or duplicate attribute references, e.g. after a self-join): bf_dec_sharp = bf_dec_sharp.toDF(bf_dec_sharp.columns: _*)
- Fixing the error "A master URL must be set in your configuration"
When running a Spark program, this error appeared. The message shows that the master the program should run against is not set, so it has to be configured.
The master URL passed to Spark can take the following forms:
local: run locally with a single thread
local[K]: run locally with K threads (K cores)
local[*]: run locally with one thread per available core
spark://HOST:PORT: connect to the given Spark standalone cluster master; the port must be specified.
mesos://HOST:PORT: connect to the given Mesos cluster; the port must be specified.
yarn-client: client mode; connect to a YARN cluster. HADOOP_CONF_DIR must be configured.
yarn-cluster: cluster mode; connect to a YARN cluster. HADOOP_CONF_DIR must be configured.
In the IDE, click Run => Edit Configurations, select the project on the left, and enter "-Dspark.master=local" under VM options on the right. This tells the program to run locally with a single thread; run it again and the error is gone.
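As an alternative to the IDE's VM option, the master can also be set in code when the SparkSession is built. A minimal sketch (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Minimal local session; "local-example" is a placeholder app name.
val spark = SparkSession.builder()
  .appName("local-example")
  .master("local[*]")  // one worker thread per available core; local or local[K] also work
  .getOrCreate()
```

Hard-coding the master makes the jar less portable, so for cluster runs it is usually left out of the code and passed via spark-submit's --master flag instead.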
- Problem: org.apache.spark.sql.AnalysisException: Table or view not found: `gpsdb`.`t_ana_ds_dq_conf`; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation `gpsdb`.`t_ana_ds_dq_conf`
Fix:
Check whether hive-site.xml is on the project's classpath (this was exactly my mistake).
Where do you find it? On a cluster node, run: find -name hive-site.xml
Once found, copy it into the project's resources directory. When the project is packaged, hive-site.xml sits in the root of the jar and is loaded automatically.
Why is it needed? For Spark to query data stored in Hive, it needs this file, which contains Hive's connection information.
(This fix still needs verification.)
Problem solved: the hive-site.xml configuration was missing.
(Note: at first, adding the config file alone did not help; a colleague took a look and found that the call to enableHiveSupport() was also missing.)
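Putting both parts of the fix together, a minimal sketch of a Hive-enabled session (assuming hive-site.xml is already in src/main/resources; the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// hive-site.xml must be on the classpath (e.g. copied into src/main/resources)
// so Spark knows how to reach the Hive metastore.
val spark = SparkSession.builder()
  .appName("hive-example")  // placeholder app name
  .enableHiveSupport()      // without this call, Hive tables fail with "Table or view not found"
  .getOrCreate()

spark.sql("SELECT * FROM gpsdb.t_ana_ds_dq_conf").show()
```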
- Error: Warning: Local jar /var/lib/hadoop-hdfs/2018-06-01 does not exist, skipping.
Warning: Local jar /var/lib/hadoop-hdfs/2018-06-01 does not exist, skipping.
java.lang.ClassNotFoundException: DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
This is the command that was run:
spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar "2018-03-01" "2018-03-31"
It turned out I had simply forgotten a space in the middle: the --class value (DATA_QUALITY.DQ_MAIN) and the application jar (/home/DS_DQ_ANA.jar) had run together into one token, so Spark looked for a class by that combined name (speechless...).
spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN /home/DS_DQ_ANA.jar "2018-06-01" "2018-06-30"
And with that, the problem was solved.
8. Problem:
ERROR scheduler.TaskSetManager: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/08/30 11:21:35 WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 152.0 (TID 24031, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:210)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:310)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
at cal_corp_monthly_report(<console>:70)
... 52 elided
WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 228.0 (TID 36119, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Fix: spark.driver.maxResultSize (default 1g) limits the total size of the serialized results of all partitions for each Spark action (such as collect). In short, the results the executors return to the driver are too large; this error means you should either raise the limit or avoid operations that pull large results back to the driver, such as countByValue and countByKey.
Raise the value:
spark.driver.maxResultSize 2g
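spark.driver.maxResultSize has to be set before the driver starts, so it belongs on the session builder (or the spark-submit command line) rather than in a runtime spark.conf.set call. A sketch using the 2g value from above (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("big-result-example")               // placeholder app name
  .config("spark.driver.maxResultSize", "2g")  // default 1g; 0 means unlimited (risky: the driver can OOM)
  .getOrCreate()
```

The command-line equivalent is --conf spark.driver.maxResultSize=2g. Raising the limit only treats the symptom; where possible, avoid collecting huge results to the driver at all.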
9. Problem: ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(13,WrappedArray())
Fix:
- "cannot resolve symbol XXX" in the IDE
Fix:
File => Invalidate Caches / Restart
Reimport the Maven project, so that Maven is loaded into IDEA again.