Common Spark Errors

When you first start working with Spark you keep running into problems; I will continue to add to this list.

  1. Exception in thread "main" org.apache.spark.sql.AnalysisException: Detected cartesian product for LEFT OUTER join between logical plans LocalLimit 21
    When this error appears, add
    spark.conf.set("spark.sql.crossJoin.enabled", "true") (a fix found online; it still needs to be verified)
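    A minimal sketch of where the setting goes, assuming Spark 2.x (where cartesian products are rejected by default) and a made-up app name:

      import org.apache.spark.sql.SparkSession

      // Allow the optimizer to plan cartesian products instead of failing the join.
      val spark = SparkSession.builder()
        .appName("CrossJoinExample")
        .config("spark.sql.crossJoin.enabled", "true")
        .getOrCreate()

      // The same flag can also be set on an already-created session:
      spark.conf.set("spark.sql.crossJoin.enabled", "true")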
  2. Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting
    'spark.debug.maxToStringFields' in SparkEnv.conf.
    When this warning appears, add config("spark.debug.maxToStringFields", 1000) when the session is built.
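    A minimal sketch, assuming the SparkSession is built in code (the app name is made up):

      import org.apache.spark.sql.SparkSession

      // Raise the field limit so large plans are no longer truncated in log output.
      val spark = SparkSession.builder()
        .appName("MaxToStringFieldsExample")
        .config("spark.debug.maxToStringFields", "1000")
        .getOrCreate()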
  3. Exception in thread "main" org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
    Solution:
    The root cause was that the textFile call further down read no data at all, which then triggered the error above. Check the file path and whether HDFS is actually running.
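    A minimal sketch of a check that catches this early (the input path and app name are placeholders):

      import org.apache.spark.sql.SparkSession

      // Make sure textFile actually returns data before the rest of the job runs.
      val spark = SparkSession.builder().appName("TextFileCheck").getOrCreate()
      val lines = spark.sparkContext.textFile("hdfs:///path/to/input.txt")

      if (lines.isEmpty()) {
        sys.error("No input read: check the path and whether HDFS is running.")
      }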
  4. ERROR: org.apache.spark.sql.AnalysisException: resolved attribute(s) id#426
    Rebuilding the DataFrame over its own column names resolves the ambiguous attribute reference:
    bf_dec_sharp = bf_dec_sharp.toDF(bf_dec_sharp.columns: _*)
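    The same fix written as a small helper (the function name is made up):

      import org.apache.spark.sql.DataFrame

      // Rebuild the DataFrame over its own column names so the columns get fresh
      // attribute IDs and the ambiguous reference disappears.
      def refreshAttributes(df: DataFrame): DataFrame =
        df.toDF(df.columns: _*)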
  5. Fixing the "A master URL must be set in your configuration" error
    This error appeared while running a Spark program.
    The message means Spark cannot find the master it should run against, so the master has to be configured.
    The master URL passed to Spark can take any of the following forms:

local               run locally with a single worker thread
local[K]            run locally with K worker threads (K cores)
local[*]            run locally with as many worker threads as available cores
spark://HOST:PORT   connect to the given Spark standalone cluster master; the port must be specified
mesos://HOST:PORT   connect to the given Mesos cluster; the port must be specified
yarn-client         connect to a YARN cluster in client mode; HADOOP_CONF_DIR must be configured
yarn-cluster        connect to a YARN cluster in cluster mode; HADOOP_CONF_DIR must be configured

Click Run => Edit Configurations, select the project on the left, and enter "-Dspark.master=local" in the VM options field on the right. This tells the program to run locally with a single thread; run it again and the error goes away.
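The master can also be set directly in code; a minimal sketch (the app name is made up, and "local" is just the single-threaded option from the table above):

  import org.apache.spark.sql.SparkSession

  // Setting the master here removes the need for the -Dspark.master VM option.
  val spark = SparkSession.builder()
    .appName("MasterUrlExample")
    .master("local")
    .getOrCreate()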

  6. Problem: org.apache.spark.sql.AnalysisException: Table or view not found: gpsdb.t_ana_ds_dq_conf; line 1 pos 14;
    'Project [*]
    +- 'UnresolvedRelation gpsdb.t_ana_ds_dq_conf
    Solution:
    Check whether hive-site.xml is present in your project (important, this was exactly my mistake).
    Where do you find it? On one of the cluster machines, run: find -name hive-site.xml
    Copy the file into the project's resources directory; when the project is packaged it ends up in the root of the jar, from where it is loaded automatically.
    Why it is needed: to look up data stored in Hive, Spark needs this configuration file, which carries the Hive connection information.
    The problem was indeed the missing hive-site.xml configuration.
    (Note: at first, adding the configuration file alone did not help; a colleague pointed out that enableHiveSupport() was also missing, as in the sketch below.)
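    A minimal sketch combining both parts of the fix (the app name is made up; the table is the one from the error above, and hive-site.xml is assumed to be on the classpath, e.g. under src/main/resources):

      import org.apache.spark.sql.SparkSession

      // enableHiveSupport makes the Hive metastore tables visible to Spark SQL.
      val spark = SparkSession.builder()
        .appName("HiveTableExample")
        .enableHiveSupport()
        .getOrCreate()

      spark.sql("SELECT * FROM gpsdb.t_ana_ds_dq_conf").show()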
  7. Error:
Warning: Local jar /var/lib/hadoop-hdfs/2018-06-01 does not exist, skipping.
java.lang.ClassNotFoundException: DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:695)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

This is the command that was run:

  spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN/home/DS_DQ_ANA.jar "2018-03-01" "2018-03-31"  

It turned out I had simply forgotten a space between the main class and the jar path (facepalm...).

spark2-submit --master yarn --deploy-mode client --num-executors 20 --executor-memory 6G --executor-cores 3 --driver-memory 2G --conf spark.default.parallelism=500 --class DATA_QUALITY.DQ_MAIN /home/DS_DQ_ANA.jar "2018-06-01" "2018-06-30"

With the space added, the problem was solved.

8. Problem:

ERROR scheduler.TaskSetManager: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
18/08/30 11:21:35 WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 152.0 (TID 24031, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:210)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:310)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
  at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:592)
  at cal_corp_monthly_report(<console>:70)
  ... 52 elided

WARN scheduler.TaskSetManager: Lost task 11870.0 in stage 228.0 (TID 36119, localhost, executor driver): TaskKilled (killed intentionally)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 11870 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

Solution: spark.driver.maxResultSize (default 1g) limits the total size of the serialized results of all partitions for each Spark action (such as collect). In short, the results the executors return to the driver are too large. When this error appears, either raise the limit or avoid operations that pull large results back to the driver, such as countByValue, countByKey, and so on.

Just increase the value, for example in spark-defaults.conf:

spark.driver.maxResultSize 2g
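The same setting can also be passed on the command line with --conf spark.driver.maxResultSize=2g, or set when the session is built; a minimal sketch (the app name is made up, and 2g is only an example value that the driver must actually be able to hold in memory):

  import org.apache.spark.sql.SparkSession

  // Raise the cap on the total size of serialized task results sent to the driver.
  val spark = SparkSession.builder()
    .appName("MaxResultSizeExample")
    .config("spark.driver.maxResultSize", "2g")
    .getOrCreate()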

9. Problem:

ERROR scheduler.LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate(13,WrappedArray())

Solution:

  10. "cannot resolve symbol XXX" in the IDE
    Solution:
    File -> Invalidate Caches
    Reimport Maven, i.e. re-import the Maven project into IDEA.