Testing Spark loop-iterative jobs and passing results between jobs

package com.fw.sparktest

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TestDAGsBC {

  def main(args: Array[String]): Unit = {

    val sparkConf: SparkConf = new SparkConf()
      .setAppName("test_spark")
      .setMaster("spark://master:7077")
      .set("spark.local.dir", "./tmp")

    val sc = new SparkContext(sparkConf)

    val rangeData = sc.range(1, 100)
    rangeData.cache()

    val rangeDataCount = rangeData.count()
    println("rangeDataCount: " + rangeDataCount)

    val i = job_1(rangeData)
    println("job_1 i: " + i)

    val sum = job_2(rangeData, i)
    println("job_2 sum: " + sum)

  }

  /**
    * Loop-job test: the result of each iteration is fed back as the input of the next.
    */
  def job_1(rangeData: RDD[Long]): Long = {
    var i = 1L
    while (i < 1000000) {
      val bcI = rangeData.context.broadcast(i) // broadcast the previous job's result as this iteration's parameter
      println("bcI.id: " + bcI.id)
      // each sum() here triggers a new Spark job inside the loop
      val rangeDataSum = rangeData.map(_ => {
        println("bcI.value: " + bcI.value) // runs on the executors
        bcI.value
      }).sum()
      println("rangeDataSum: " + rangeDataSum)
      i += rangeDataSum.toLong // this iteration's result becomes the next i
      bcI.unpersist(blocking = true)
    }
    i
  }

  /**
    * Second consecutive job test: the previous job's result is the input of this job.
    */
  def job_2(rangeData: RDD[Long], i: Long): Long = {
    val bcI = rangeData.context.broadcast(i)
    val dataSum = rangeData.mapPartitions { iter =>
      val w = bcI.value
      iter.map(_ => w)
    }.sum()
    println("dataSum: " + dataSum)
    dataSum.toLong
  }

}
[hadoop@master ~]$ spark-submit --class com.fw.sparktest.TestDAGsBC spark_practise-1.0-jar-with-dependencies.jar
20/04/19 09:33:34 WARN Utils: Your hostname, master resolves to a loopback address: 127.0.0.1; using 192.168.0.200 instead (on interface ens33)
20/04/19 09:33:34 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/04/19 09:33:34 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/04/19 09:33:35 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
rangeDataCount: 99
bcI.id: 1
rangeDataSum: 99.0
bcI.id: 3
rangeDataSum: 9900.0
bcI.id: 5
rangeDataSum: 990000.0
job_1 i: 1000000
dataSum: 9.9E7
job_2 sum: 99000000
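The numbers in the log follow directly from the loop arithmetic: every iteration sums the broadcast value i over the 99 elements of range(1, 100), so rangeDataSum = 99 * i and the next i is i + 99 * i = 100 * i. Below is a minimal local sketch of that arithmetic (no Spark needed; the object name LoopArithmeticSketch is made up for illustration):

object LoopArithmeticSketch {

  def main(args: Array[String]): Unit = {
    val n = 99L                        // rangeData.count()
    var i = 1L
    while (i < 1000000) {
      val rangeDataSum = n * i         // 99, 9900, 990000
      println("rangeDataSum: " + rangeDataSum)
      i += rangeDataSum                // 100, 10000, 1000000
    }
    println("job_1 i: " + i)           // 1000000
    println("job_2 sum: " + (n * i))   // 99000000
  }

}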

Note: 1. If the program is written in the object ... extends App {} style, the second job fails to read the broadcast variable.
      Error log:
	ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job
	Exception in thread "main" org.apache.spark.SparkException: Job aborted
	due to stage failure: Task 0 in stage 1.0 failed 4 times,
	most recent failure: Lost task 0.3 in stage 1.0 (TID 8, 192.168.0.200, executor 0):
	 java.lang.NullPointerException
      Fix: rewrite it in the object ... { def main(...) } style, as in the code above.
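For reference, a minimal sketch of the two program shapes the note is contrasting. The object names BroadcastWithApp / BroadcastWithMain and the tiny job are made up for illustration; the commonly cited cause (not stated in the note) is that scala.App initializes its body via DelayedInit, so object fields such as the broadcast handle can still be null when a task closure referencing them is deserialized on an executor, and exactly which job fails can vary with the Spark/Scala version:

import org.apache.spark.{SparkConf, SparkContext}

// Shape that triggers the failure described above: the broadcast handle is a
// field of an App singleton, and App's delayed initialization means the field
// may still be null when the task closure is deserialized on an executor.
object BroadcastWithApp extends App {
  val sc = new SparkContext(
    new SparkConf().setAppName("bc_app_test").setMaster("spark://master:7077"))
  val bcFactor = sc.broadcast(10L)
  // may throw java.lang.NullPointerException on the executors
  println("sum: " + sc.range(1, 100).map(_ * bcFactor.value).sum())
  sc.stop()
}

// Shape that works (the one used in this post): the same logic inside def main(),
// where bcFactor is a local variable captured directly by the closure.
object BroadcastWithMain {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("bc_main_test").setMaster("spark://master:7077"))
    val bcFactor = sc.broadcast(10L)
    println("sum: " + sc.range(1, 100).map(_ * bcFactor.value).sum())
    sc.stop()
  }
}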