Spark Official Documentation Reading Notes (1)


Initializing the SparkContext

val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

When running locally, master is "local" (or "local[N]" to use N threads); when running on a standalone cluster it is the master URL of the form spark://host:port, not just the master host's IP address. Question: what should it be when running on YARN? Refer to the deployment documentation.
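As a quick illustration (a minimal sketch; the application name, host, and port below are placeholders):

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[*]")             // local mode, using all available cores
  // .setMaster("spark://host:7077") // standalone cluster: the master URL, not just an IP
  // .setMaster("yarn")              // YARN: the cluster location comes from the Hadoop configuration
val sc = new SparkContext(conf)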

Initializing RDDs

  • From an in-memory collection
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
  • From a file (both creation methods also accept an optional partition count; see the sketch after this list)
val distFile = sc.textFile("data.txt")
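A minimal sketch of the partition-count parameters (the numbers are arbitrary):

val distData = sc.parallelize(data, 10)     // split the collection into 10 partitions
val distFile = sc.textFile("data.txt", 4)   // ask for at least 4 partitions when reading the file
distFile.map(_.length).reduce(_ + _)        // e.g. total number of characters across all lines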

Understanding closures

An example of incorrect code:
var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

The reason, roughly: Spark splits the data into partitions and ships the computation to executors on different nodes, so each executor works on its own copy of counter captured in the serialized closure, and none of those updates are propagated back to the driver; the counter on the driver therefore stays 0. In single-machine local mode the value might happen to come out right (not tested here), but relying on that is still wrong. As shown later, Spark provides shared variables for this kind of need, even though the pattern itself is discouraged.
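A minimal sketch of the idiomatic fix for this particular case: instead of mutating a driver-side variable inside foreach, let Spark do the aggregation and return the result to the driver (the accumulator-based fix appears in the Accumulators section below).

// Safe: the aggregation runs inside Spark and the result is returned to the driver
val total = rdd.reduce((a, b) => a + b)
println("Counter value: " + total)
// for numeric RDDs, rdd.sum() does the same thing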

Transformations and actions on RDDs

The classic word count:

Scala version:

val lines = sc.textFile("data.txt")
val words = lines.flatMap(line => line.split(" "))   // split lines into words (the docs' pairs example maps whole lines)
val pairs = words.map(w => (w, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

Java version:

JavaPairRDD<String, Integer> wordCount = logData
        .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b)
        .sortByKey();
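All of the calls above (flatMap, map/mapToPair, reduceByKey, sortByKey) are lazy transformations: Spark only records the lineage until an action runs. A minimal sketch, continuing the Scala example (the output path is a placeholder):

counts.collect().foreach(println)         // action: bring the (word, count) pairs back to the driver and print them
counts.saveAsTextFile("word_counts_out")  // action: write the result out to the given (placeholder) path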

Shuffle operations

As with MapReduce, shuffle operations are expensive: they involve disk I/O, data serialization, and network transfer, so in practice unnecessary shuffles should be avoided or minimized.
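For per-key aggregation, for example, reduceByKey combines values on each partition before the shuffle, so much less data crosses the network than with groupByKey followed by a local sum (a minimal sketch, reusing the pairs RDD from the word-count example):

// Preferred: values are pre-aggregated on each partition before being shuffled
val wordSums = pairs.reduceByKey(_ + _)

// Works, but shuffles every (word, 1) pair before summing
val wordSumsViaGroup = pairs.groupByKey().mapValues(_.sum)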

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. (quoted from the official docs)
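A minimal sketch of marking an RDD for persistence, reusing the counts RDD from the word-count example (StorageLevel is only needed for the explicit persist call):

import org.apache.spark.storage.StorageLevel

counts.persist(StorageLevel.MEMORY_AND_DISK)  // or simply counts.cache(), i.e. persist(StorageLevel.MEMORY_ONLY)
counts.count()     // the first action computes the RDD and caches its partitions on the nodes
counts.collect()   // later actions reuse the cached partitions instead of recomputing them
// counts.unpersist()  // drop it from the cache once it is no longer needed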

Shared Variables

Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. Example code:

val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar.value

Broadcast variables should be treated as read-only; typically they hold shared context such as lookup tables or configuration that every task needs.
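A minimal sketch of the typical pattern, using a hypothetical lookup table countryNames that every task needs: broadcast it once and read .value inside the transformation, instead of capturing the map in each task's closure.

// Hypothetical read-only lookup table shared by all tasks
val countryNames = Map("CN" -> "China", "DE" -> "Germany", "US" -> "United States")
val countryNamesBc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("CN", "US", "FR"))
val resolved = codes.map(code => countryNamesBc.value.getOrElse(code, "unknown"))
resolved.collect()   // Array(China, United States, unknown)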

Accumulators

  • Accumulators address the problem raised in the Understanding closures section above; the correct version of that example is:
scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10
  • Custom accumulators. Example code:
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {

  private val myVector: MyVector = MyVector.createZeroVector

  def reset(): Unit = {
    myVector.reset()
  }

  def add(v: MyVector): Unit = {
    myVector.add(v)
  }
  ...
}

// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into spark context:
sc.register(myVectorAcc, "MyVectorAcc1")

A custom accumulator must extend AccumulatorV2 and implement all of its abstract methods: isZero, copy, reset, add, merge, and value (the snippet above only shows reset and add).

Note that accumulator updates are only guaranteed to be applied when an action executes; inside lazy transformations an update may never run, or may run more than once if a task is re-executed. Question: how is this used in practice? A sketch follows below.
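A minimal, self-contained sketch (simpler than the abstract MyVector example above): a custom accumulator that collects distinct strings into a Set, showing all six methods plus how an action drives it.

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Collects the distinct String values seen across all tasks
class SetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val items = mutable.Set[String]()

  def isZero: Boolean = items.isEmpty
  def copy(): SetAccumulator = {
    val acc = new SetAccumulator
    acc.items ++= items
    acc
  }
  def reset(): Unit = items.clear()
  def add(v: String): Unit = items += v
  def merge(other: AccumulatorV2[String, Set[String]]): Unit = items ++= other.value
  def value: Set[String] = items.toSet
}

val errorCodes = new SetAccumulator
sc.register(errorCodes, "errorCodes")

sc.parallelize(Seq("E1", "E2", "E1")).foreach(v => errorCodes.add(v))  // the foreach action triggers the updates
errorCodes.value   // Set(E1, E2) back on the driver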
