Spark study notes: chapters 1-3 of Learning Spark: Lightning-Fast Big Data Analysis

银灯玉箫

Article link: https://blog.csdn.net/lilele12211104/article/details/97795838

About RDDs

Spark computes RDDs lazily: a transformation is only recorded, and nothing runs until an action needs the result.
By default, Spark's RDDs are recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().
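The effect of lazy evaluation and persist() can be observed by counting how often the function passed to map() actually runs. The following is a minimal sketch, assuming a local Spark 2.x or later installation; the object name LazyPersistSketch, the helper run, and the accumulator names are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyPersistSketch {
  // Returns (calls before any action,
  //          calls after two actions without persist,
  //          calls after two actions with persist).
  def run(sc: SparkContext): (Long, Long, Long) = {
    val nums = sc.parallelize(1 to 4)

    // Accumulator that counts invocations of the map function.
    val calls = sc.longAccumulator("calls without persist")
    val squares = nums.map { x => calls.add(1L); x * x }
    val beforeAction = calls.sum // nothing has run yet: map() is lazy

    squares.count()   // first action: computes the RDD
    squares.collect() // second action: computes it again
    val withoutPersist = calls.sum

    val cachedCalls = sc.longAccumulator("calls with persist")
    val cachedSquares = nums.map { x => cachedCalls.add(1L); x * x }.persist()
    cachedSquares.count()   // computes the RDD and caches its partitions
    cachedSquares.collect() // can be served from the cache
    val withPersist = cachedCalls.sum
    cachedSquares.unpersist()

    (beforeAction, withoutPersist, withPersist)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running on a single machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("LazyPersistSketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

With four elements, the unpersisted RDD runs the map function once per element per action, while the persisted one should only need the first pass as long as the cached partitions stay in memory.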

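Creating RDDs

There are two ways to create an RDD:
loading an external dataset
parallelizing a collection in your driver program

// parallelize an in-memory collection from the driver program
val parallelize_lines = sc.parallelize(List("pandas", "i like you"))
// load an external dataset: one element per line of the file
val lines = sc.textFile("README.md")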

RDD operations

RDDs support two types of operations: transformations and actions.
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first().

Transformations return RDDs, whereas actions return some other data type.

RDDs have a collect() function that retrieves the entire RDD to the driver program. It is useful when your program has filtered an RDD down to a very small size and you would like to work with it locally, since the whole result must fit in the driver's memory.
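The difference between a transformation and an action, and the typical use of collect() on a small filtered result, might look like the sketch below. It assumes a local Spark installation; the object name FilterCollectSketch, the helper run, and the sample log lines are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FilterCollectSketch {
  // Returns the number of error lines and the error lines themselves.
  def run(sc: SparkContext): (Long, Seq[String]) = {
    // Hypothetical sample data standing in for a log file.
    val lines: RDD[String] = sc.parallelize(Seq(
      "INFO start", "ERROR disk full", "WARN slow", "ERROR timeout"))

    // Transformation: returns a new RDD, nothing is computed yet.
    val errors: RDD[String] = lines.filter(_.contains("ERROR"))

    // Actions: return plain values to the driver and trigger computation.
    val howMany: Long = errors.count()
    val local: Array[String] = errors.collect() // safe only because the result is tiny

    (howMany, local.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running on a single machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("FilterCollectSketch")
    val sc = new SparkContext(conf)
    try {
      val (howMany, local) = run(sc)
      println(s"$howMany error lines")
      local.foreach(println)
    } finally sc.stop()
  }
}
```

Note the types: filter() yields an RDD[String], while count() and collect() yield a Long and an Array[String].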

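Element-wise transformations

map() takes a function, applies it to each element of the RDD, and returns an RDD of the results.

filter() takes a predicate and returns an RDD containing only the elements for which it returns true.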
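As an illustration of the two element-wise transformations, here is a small sketch that squares a few numbers with map() and drops one value with filter(). It assumes a local Spark installation; the object name MapFilterSketch and the helper run are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapFilterSketch {
  // Returns (squares of the input, input without the value 1).
  def run(sc: SparkContext): (Seq[Int], Seq[Int]) = {
    val nums = sc.parallelize(List(1, 2, 3, 4))

    // map(): one output element per input element.
    val squared = nums.map(x => x * x)

    // filter(): keeps only the elements that satisfy the predicate.
    val notOne = nums.filter(x => x != 1)

    (squared.collect().toSeq, notOne.collect().toSeq)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running on a single machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MapFilterSketch")
    val sc = new SparkContext(conf)
    try {
      val (squared, notOne) = run(sc)
      println(squared.mkString(","))
      println(notOne.mkString(","))
    } finally sc.stop()
  }
}
```

The result type of map() does not have to match the input type; for example, mapping each Int to its String form would give an RDD[String].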

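Some simple set operations

RDD1.distinct() returns the unique elements of RDD1. It is very expensive because it requires shuffling all the data over the network to ensure that we receive only one copy of each element.
RDD1.union(RDD2) returns all elements from both RDDs and keeps duplicates.
RDD1.intersection(RDD2) returns only the elements present in both RDDs and removes duplicates, so it also requires a shuffle.
RDD1.subtract(RDD2) returns the elements of RDD1 that are not present in RDD2, which again requires a shuffle.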
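The four set operations can be compared side by side on two small RDDs. This is a sketch under the assumption of a local Spark installation; the object name SetOpsSketch, the helper run, and the sample words are hypothetical. Results are sorted on the driver because element order after a shuffle should not be relied on.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOpsSketch {
  // Returns the sorted results of distinct, union, intersection and subtract.
  def run(sc: SparkContext): Map[String, Seq[String]] = {
    val rdd1 = sc.parallelize(Seq("coffee", "coffee", "panda", "monkey", "tea"))
    val rdd2 = sc.parallelize(Seq("coffee", "monkey", "kitten"))

    Map(
      "distinct"     -> rdd1.distinct().collect().sorted.toSeq,         // shuffle, no duplicates
      "union"        -> rdd1.union(rdd2).collect().sorted.toSeq,        // no shuffle, keeps duplicates
      "intersection" -> rdd1.intersection(rdd2).collect().sorted.toSeq, // shuffle, no duplicates
      "subtract"     -> rdd1.subtract(rdd2).collect().sorted.toSeq      // shuffle
    )
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running on a single machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("SetOpsSketch")
    val sc = new SparkContext(conf)
    try run(sc).foreach { case (op, values) => println(s"$op: ${values.mkString(", ")}") }
    finally sc.stop()
  }
}
```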

Actions

reduce() takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. A simple example of such a function is +, which we can use to sum our RDD.

val sum = rdd.reduce((x,y) => x+y)

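The aggregate() function frees us from the constraint of having the return type be the same as the element type of the RDD we are working on. It takes an initial "zero" value, a function that folds one element into an accumulator, and a function that merges two accumulators. The session below computes an average by accumulating a (sum, count) pair; it assumes input is an RDD[Int] such as sc.parallelize(List(1, 2, 3, 4)).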

scala> val result = input.aggregate((0,0))((acc,value) => (acc._1 + value, acc._2 + 1), (acc1, acc2) => (acc1._1 + acc2._1, acc1._2+acc2._2))
result: (Int, Int) = (10,4)

scala> val avg = result._1 / result._2.toDouble
avg: Double = 2.5

Persistence

In Scala and Java, the default persist() stores the data in the JVM heap as unserialized objects.

Finally, RDDs come with a method called unpersist() that lets you manually remove them from the cache.
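The persistence lifecycle can be inspected through an RDD's storage level. The sketch below assumes a local Spark installation; the object name StorageLevelSketch and the helper run are hypothetical. It relies on getStorageLevel and on the default level of persist() being StorageLevel.MEMORY_ONLY, which corresponds to unserialized objects on the JVM heap.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Returns the storage level after persist() and after unpersist().
  def run(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val squares = sc.parallelize(1 to 4).map(x => x * x)

    squares.persist()                      // default level: MEMORY_ONLY
    squares.count()                        // first action materializes the cache
    val afterPersist = squares.getStorageLevel

    squares.unpersist()                    // manually remove it from the cache
    val afterUnpersist = squares.getStorageLevel

    (afterPersist, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    // local[2] is an assumption for running on a single machine.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StorageLevelSketch")
    val sc = new SparkContext(conf)
    try {
      val (afterPersist, afterUnpersist) = run(sc)
      println(afterPersist.description)
      println(afterUnpersist == StorageLevel.NONE)
    } finally sc.stop()
  }
}
```

Other levels, such as StorageLevel.MEMORY_ONLY_SER or StorageLevel.MEMORY_AND_DISK, can be passed to persist() explicitly when the default trade-off between memory use and CPU cost is not suitable.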
