Spark Official Documentation Reading Notes (1)


Initializing the SparkContext

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

When running locally, master is "local" (or "local[N]" to use N threads); when running on a standalone cluster it is the master URL of the form spark://host:port, not just the master host's IP address. Question: what should it be when running on YARN? Refer to the deployment documentation.
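As a quick illustration (a minimal sketch; the application name, host, and port below are placeholders):

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")             // local mode, using all available cores
  // .setMaster("spark://host:7077") // standalone cluster: the master URL, not just an IP
  // .setMaster("yarn")              // YARN: the cluster location comes from the Hadoop configuration
val sc = new SparkContext(conf)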

Initializing RDDs

  • From an in-memory collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
  • From a file (both creation methods also accept an optional partition count; see the sketch after this list)
val distFile = sc.textFile("data.txt")
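A minimal sketch of the partition-count parameters (the numbers are arbitrary):

val distData = sc.parallelize(data, 10)     // split the collection into 10 partitions
val distFile = sc.textFile("data.txt", 4)   // ask for at least 4 partitions when reading the file
distFile.map(_.length).reduce(_ + _)        // e.g. total number of characters across all lines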

Understanding closures

An example of incorrect code:
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

The reason, roughly: Spark splits the data into partitions and ships the computation to executors on different nodes, so each executor works on its own copy of counter captured in the serialized closure, and none of those updates are propagated back to the driver; the counter on the driver therefore stays 0. In single-machine local mode the value might happen to come out right (not tested here), but relying on that is still wrong. As shown later, Spark provides shared variables for this kind of need, even though the pattern itself is discouraged.
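A minimal sketch of the idiomatic fix for this particular case: instead of mutating a driver-side variable inside foreach, let Spark do the aggregation and return the result to the driver (the accumulator-based fix appears in the Accumulators section below).

// Safe: the aggregation runs inside Spark and the result is returned to the driver
val total = rdd.reduce((a, b) => a + b)
println("Counter value: " + total)
// for numeric RDDs, rdd.sum() does the same thing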

Transformations and actions on RDDs

The classic word count:

Scala version:

val lines = sc.textFile("data.txt")
val words = lines.flatMap(line => line.split(" "))   // split lines into words (the docs' pairs example maps whole lines)
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

Java version:

JavaPairRDD<String, Integer> wordCount = logData
        .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b)
        .sortByKey();
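All of the calls above (flatMap, map/mapToPair, reduceByKey, sortByKey) are lazy transformations: Spark only records the lineage until an action runs. A minimal sketch, continuing the Scala example (the output path is a placeholder):

counts.collect().foreach(println)         // action: bring the (word, count) pairs back to the driver and print them
counts.saveAsTextFile("word_counts_out")  // action: write the result out to the given (placeholder) path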

Shuffle operations

As with MapReduce, shuffle operations are expensive: they involve disk I/O, data serialization, and network transfer, so in practice unnecessary shuffles should be avoided or minimized.
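For per-key aggregation, for example, reduceByKey combines values on each partition before the shuffle, so much less data crosses the network than with groupByKey followed by a local sum (a minimal sketch, reusing the pairs RDD from the word-count example):

// Preferred: values are pre-aggregated on each partition before being shuffled
val wordSums = pairs.reduceByKey(_ + _)

// Works, but shuffles every (word, 1) pair before summing
val wordSumsViaGroup = pairs.groupByKey().mapValues(_.sum)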

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. (quoted from the official docs)
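A minimal sketch of marking an RDD for persistence, reusing the counts RDD from the word-count example (StorageLevel is only needed for the explicit persist call):

import org.apache.spark.storage.StorageLevel

counts.persist(StorageLevel.MEMORY_AND_DISK)  // or simply counts.cache(), i.e. persist(StorageLevel.MEMORY_ONLY)
counts.count()     // the first action computes the RDD and caches its partitions on the nodes
counts.collect()   // later actions reuse the cached partitions instead of recomputing them
// counts.unpersist()  // drop it from the cache once it is no longer needed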

Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Example code:

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Broadcast variables should be treated as read-only; typically they hold shared context such as lookup tables or configuration that every task needs.
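A minimal sketch of the typical pattern, using a hypothetical lookup table countryNames that every task needs: broadcast it once and read .value inside the transformation, instead of capturing the map in each task's closure.

// Hypothetical read-only lookup table shared by all tasks
val countryNames = Map("CN" -> "China", "DE" -> "Germany", "US" -> "United States")
val countryNamesBc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "FR"))
val resolved = codes.map(code => countryNamesBc.value.getOrElse(code, "unknown"))
resolved.collect()   // Array(China, United States, unknown)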

Accumulators

  • Accumulators address the problem raised in the Understanding closures section above; the correct version of that example is:
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
  • Custom accumulators. Example code:
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {

  private val myVector: MyVector = MyVector.createZeroVector

  def reset(): Unit = {
    myVector.reset()
  }

  def add(v: MyVector): Unit = {
    myVector.add(v)
  }
  ...
}

// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into spark context:
sc.register(myVectorAcc, "MyVectorAcc1")

A custom accumulator must extend AccumulatorV2 and implement all of its abstract methods: isZero, copy, reset, add, merge, and value (the snippet above only shows reset and add).

Note that accumulator updates are only guaranteed to be applied when an action executes; inside lazy transformations an update may never run, or may run more than once if a task is re-executed. Question: how is this used in practice? A sketch follows below.
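A minimal, self-contained sketch (simpler than the abstract MyVector example above): a custom accumulator that collects distinct strings into a Set, showing all six methods plus how an action drives it.

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Collects the distinct String values seen across all tasks
class SetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val items = mutable.Set[String]()

  def isZero: Boolean = items.isEmpty
  def copy(): SetAccumulator = {
    val acc = new SetAccumulator
    acc.items ++= items
    acc
  }
  def reset(): Unit = items.clear()
  def add(v: String): Unit = items += v
  def merge(other: AccumulatorV2[String, Set[String]]): Unit = items ++= other.value
  def value: Set[String] = items.toSet
}

val errorCodes = new SetAccumulator
sc.register(errorCodes, "errorCodes")

sc.parallelize(Seq("E1", "E2", "E1")).foreach(v => errorCodes.add(v))  // the foreach action triggers the updates
errorCodes.value   // Set(E1, E2) back on the driver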
