FAQ

Latest recommended article published on 2021-03-04 05:00:33

weixin_34345560

Tags: ui, java, big data

Original link: https://my.oschina.net/sunmin/blog/3024532

Runtime environment

EMR version: EMR-3.14.0
Cluster type: HADOOP
Software: HDFS2.7.2 / YARN2.7.2 / Hive2.3.3 / Ganglia3.7.2 / Zookeeper3.4.13 / Spark2.3.1 / HBase1.1.1 / HUE4.1.0 / Zeppelin0.8.0 / Tez0.9.1 / Presto0.208 / Sqoop1.4.7 / Pig0.14.0 / Storm1.1.2 / Ranger1.0.0 / Impala2.10.0 / Flink1.4.0 / Knox0.13.0 / ApacheDS2.0.0

FAQ

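Q: No such file direction 'oss:/xxx/qmc'_da_coinorder/parttag=2016-10-06-000001_0'
A: Refresh the metadata promptly after the data has been refreshed. Command: msck repair table, followed by the table name.
Q: Caused by: org.apache.spark.SparkException: Failed to get broadcast_199_piece0 of broadcast_199
A: In a streaming job, the sparkSession must not be hoisted out and shared from outside. Create it inside the function, and have each run end with a stop.
Q: Exception in thread "main" java.lang.NoSuchMethodError: org.slf4j.MDC.getCopyOfContextMap()Ljava/util/Map
A: This is caused by a JAR version conflict.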
Q: Starting Spark SQL fails with: Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver ") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
A: Configure the following in $SPARK_HOME/conf/spark-env.sh: export SPARK_CLASSPATH=$HIVE_HOME/lib/mysql-connector-java-5.1.6-bin.jar
Q: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
A: Too few cores were allocated. Allocate more CPU cores.
Q: Starting a compute job keeps reporting:
status.SparkJobMonitor: 2017-01-04 11:53:51,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:54,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:55,564 Stage-0_0: 0(+1)/1
status.SparkJobMonitor: 2017-01-04 11:53:56,564 Stage-0_0: 0(+1)/1
A: Resources are insufficient. Allocate more memory; the default is 512MB.
Q: All runtime data of the Spark cluster is lost when the Master restarts.
Solution: Set the spark.deploy.recoveryMode option to ZOOKEEPER.
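On a standalone cluster, the recovery options are usually handed to the Master as JVM system properties through SPARK_DAEMON_JAVA_OPTS in spark-env.sh. The sketch below only renders that line. The ZooKeeper host names and the /spark directory are placeholder assumptions, and render_recovery_opts is a hypothetical helper, not part of Spark.

```python
def render_recovery_opts(zk_hosts, zk_dir="/spark"):
    """Render the spark-env.sh line enabling ZooKeeper-based Master recovery.

    zk_hosts: list of "host:port" strings (placeholders, not from the article).
    zk_dir:   ZooKeeper directory that stores recovery state (assumed value).
    """
    props = {
        "spark.deploy.recoveryMode": "ZOOKEEPER",
        "spark.deploy.zookeeper.url": ",".join(zk_hosts),
        "spark.deploy.zookeeper.dir": zk_dir,
    }
    # Each property becomes a -Dkey=value JVM option.
    java_opts = " ".join(f"-D{key}={value}" for key, value in props.items())
    return f'export SPARK_DAEMON_JAVA_OPTS="{java_opts}"'


if __name__ == "__main__":
    # Hypothetical three-node ZooKeeper ensemble.
    print(render_recovery_opts(["zk1:2181", "zk2:2181", "zk3:2181"]))
```

The rendered line would go into spark-env.sh on every Master node, and the Masters need a restart to pick it up.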
Q: Submitting a Spark compute job fails with: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 0 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 1 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 2 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSchedulerImpl]-[ERROR] Lost executor 3 on 192.168.10.38: remote Rpc client disassociated
[org.apache.spark.scheduler.TaskSetManager]-[ERROR] Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" org.apache.spark.SparkException : Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 14, 192.168.10.38): ExecutorLostFailure (executor 3 lost)
Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

Solution: The main cause here is that the source data is too large for the machines' memory. Execution runs for a long time, then times out and disconnects, so the data cannot be exchanged and computed effectively. More memory is needed.

Q: Executor Lost caused by insufficient memory or data skew (submitted with spark-submit)
ERROR TaskSchedulerImpl: Lost executor 6 on 192.168.10.37: remote Rpc client disassociated
INFO TaskSetManager: Re-queueing tasks for 6 from TaskSet 6.0
WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@192.168.10.37:42250] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
WARN TaskSetManager: Lost task 3.0 in stage 6.0 (TID 102, 192.168.10.37): ExecutorLostFailure (executor 6 lost)
INFO DAGScheduler: Executor lost: 6 (epoch 8)
INFO BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, 192.168.10.37, 57139)
INFO BlockManagerMaster: Removed 6 successfully in removeExecutor
INFO AppClient$ClientEndpoint: Executor updated: app-20160115142128-0001/6 is now EXITED (Command exited with code 52)
INFO SparkDeploySchedulerBackend: Executor app-20160115142128-0001/6 removed: Command exited with code 52
INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 6
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 142, 192.168.10.36): ExecutorLostFailure (executor 4 lost)
WARN TaskSetManager: Lost task 4.1 in stage 6.0 (TID 137, 192.168.10.38): java.lang.OutOfMemoryError: GC overhead limit exceeded
Solution: The source data read by the Spark job is too large, so the tasks assigned to the Worker do not have enough memory to process it and run out of memory. The Worker's memory therefore has to be increased to meet the program's needs. Spark Streaming and other Spark jobs commonly hit problems related to Executor Lost (shuffle fetch failures, task failures and retries, and so on). These indicate insufficient memory or data skew. Consider the following points when looking for a fix:
A. With the same resources, increasing the number of partitions reduces memory problems. With more partitions, each task processes less data, so the tasks running at any one time process far less data in total and every Executor uses less memory. This eases the pressure from both data skew and insufficient memory.
B. Watch the parallelism of the shuffle read stage. Functions of the reduce and group family (for example reduceByKey and groupByKey) take a second parameter, the parallelism (number of partitions), which people usually leave unset. Setting it once a problem appears is a reasonable fix.
C. Giving one Executor too many cores means more memory pressure on that Executor at any given moment and more frequent GC. I usually keep it at about 3 and raise the number of Executors so that the total resources stay the same.
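Points A and C can be put into rough numbers. The sketch below estimates how much input data one Executor holds in flight at a time, under the simplifying assumptions that partitions are evenly sized and that every core runs one task. The function name and the figures are illustrative only, and real memory use also depends on serialization, caching and shuffle buffers.

```python
def in_flight_mb_per_executor(total_data_mb, num_partitions, cores_per_executor):
    """Rough estimate of the input data one executor processes concurrently.

    Assumes evenly sized partitions and one running task per core.
    """
    if num_partitions <= 0 or cores_per_executor <= 0:
        raise ValueError("num_partitions and cores_per_executor must be positive")
    per_task_mb = total_data_mb / num_partitions   # point A: more partitions, smaller tasks
    return per_task_mb * cores_per_executor        # point C: fewer cores, fewer concurrent tasks


if __name__ == "__main__":
    total_mb = 100 * 1024  # hypothetical 100 GB input
    print(in_flight_mb_per_executor(total_mb, 200, 8))  # few partitions, many cores
    print(in_flight_mb_per_executor(total_mb, 800, 3))  # more partitions, about 3 cores
```

With these hypothetical figures the estimate drops from 4096 MB to 384 MB per Executor, which is the direction both recommendations aim for.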
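Q: How to locate data skew in Spark
Solution: In the Spark Web UI, check the data volume and execution time of each task in the current stage, then use the stage-division rules to locate the shuffle operator in the code.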
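Once the Web UI points to a skewed stage, counting records per key shows which keys are responsible. The sketch below does that count on a plain Python list. In a real job the counts would typically come from a sampled countByKey on the RDD. The helper name and the factor of 5 are assumptions made for illustration.

```python
from collections import Counter


def find_hot_keys(keys, factor=5.0):
    """Return {key: count} for keys whose count exceeds factor times the mean per-key count."""
    counts = Counter(keys)
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {key: count for key, count in counts.items() if count > factor * mean}


if __name__ == "__main__":
    # Hypothetical sample: one key carries 90 of 100 records.
    sample = ["hot"] * 90 + [f"k{i}" for i in range(10)]
    print(find_hot_keys(sample))
```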
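Q: How to fix data skew in Spark
Solution:
A. Filter out the few keys that cause the skew (only when dropping those keys has little effect on the job).
B. Raise the parallelism of the shuffle operation (the improvement is limited).
C. Aggregate in two stages (local aggregation plus global aggregation): first add a prefix to identical keys so they become several keys, run a local shuffle, then remove the prefix and run a global shuffle (only applies to aggregation-type shuffles, where the effect is significant; it does not help join-type shuffles).
D. Turn the reduce join into a map join: broadcast the small table, run a map over the large table and iterate over the small table's data (only applies when joining a large table or RDD with a small one).
E. Join using random prefixes and an expanded RDD: tag every record of one RDD with a random prefix below n, use the flatMap operator to expand the other RDD n times, tagging the copies of each record with the prefixes 0 to n-1 in turn, then join the two RDDs on the modified keys (greatly eases join-type skew, but consumes a very large amount of memory).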
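The two-stage aggregation and the random-prefix join can be sketched on plain Python lists to show what happens to the keys. This is an illustration of the key manipulation only, not Spark code: the function names, the default of 4 salts and the fixed seed are assumptions, and in Spark each stage would be a shuffle (for example via reduceByKey or join).

```python
import random
from collections import defaultdict


def two_stage_sum(records, n_salts=4, seed=0):
    """Sum values per key in two stages using a random prefix (salt)."""
    rng = random.Random(seed)
    # Stage 1 (local aggregation): salt each key so a hot key spreads over up to n_salts groups.
    partial = defaultdict(int)
    for key, value in records:
        partial[(rng.randrange(n_salts), key)] += value
    # Stage 2 (global aggregation): drop the salt and combine the partial sums.
    total = defaultdict(int)
    for (_salt, key), value in partial.items():
        total[key] += value
    return dict(total)


def salted_join(skewed, other, n_salts=4, seed=0):
    """Inner-join two lists of (key, value) pairs on salted keys."""
    rng = random.Random(seed)
    # Skewed side: tag every record with one random salt in 0..n_salts-1.
    salted = [((rng.randrange(n_salts), key), value) for key, value in skewed]
    # Other side: replicate every record once per salt (n_salts-fold expansion).
    expanded = defaultdict(list)
    for key, value in other:
        for salt in range(n_salts):
            expanded[(salt, key)].append(value)
    # Join on the salted key, then strip the salt from the result.
    return [
        (key, (left, right))
        for (salt, key), left in salted
        for right in expanded.get((salt, key), [])
    ]


if __name__ == "__main__":
    print(two_stage_sum([("hot", 1)] * 100 + [("cold", 2)] * 3))
    print(salted_join([("hot", 1), ("hot", 2)], [("hot", "x")]))
```

The n_salts-fold expansion in salted_join is where the extra memory cost of this approach comes from.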
Once the presto process has started, the JVM server keeps holding on to memory.
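If Maven downloads are very slow, the connection is most likely being blocked by the GFW. Add a domestic mirror under the mirrors tag of the conf/settings.xml file in the Maven installation directory, for example: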

<mirror>
  <id>nexus-aliyun</id>
  <mirrorOf>*</mirrorOf>
  <name>Nexus aliyun</name>
  <url>http://maven.aliyun.com/nexus/content/groups/public</url>
</mirror>

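Data skew only occurs during a shuffle. Operators that may trigger a shuffle include distinct, groupByKey, reduceByKey, aggregateByKey, join, cogroup and repartition.
The Spark Driver receives results only when an action is executed.
When Spark needs a globally aggregated variable, use an accumulator (Accumulator).
All custom classes must implement the Serializable interface, otherwise they will not work on the cluster. Alternatively, use built-in types such as List.
Do not format HDFS casually. It causes many problems such as inconsistent data versions. Clear the data directories before formatting.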
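Every file smaller than 128M still takes up a block of its own. The block only uses the file's actual size on disk, but each block adds metadata overhead on the NameNode, so merge or delete small files to save space.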
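The effect of merging small files is easiest to see in the number of blocks HDFS has to track. The sketch below is plain arithmetic, assuming a 128 MB block size; the file sizes are made up for illustration and block_count is a hypothetical helper.

```python
import math


def block_count(file_sizes_mb, block_size_mb=128):
    """Number of HDFS blocks needed: every non-empty file needs at least one block of its own."""
    return sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)


if __name__ == "__main__":
    small_files = [1] * 1000          # hypothetical: 1000 files of 1 MB each
    merged = [sum(small_files)]       # the same data merged into one 1000 MB file
    print(block_count(small_files), block_count(merged))
```

The same 1000 MB of data needs 1000 blocks as separate small files but only 8 blocks after merging.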
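Q: The job waits for a long time with no response. The web UI on the server shows memory and cores, but none are allocated, as shown below.
A: Yarn does not have enough resources.
[Stage 0:>(0 + 0) / 42]
Or the log shows:
16/01/15 14:18:56 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources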