Chapters 1-3, Learning Spark: Lightning-Fast Big Data Analysis
RDD basics
- Spark computes RDDs only in a lazy fashion, that is, the first time they are used in an action.
- RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().
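The two points above (lazy evaluation and recomputation on every action) can be observed with a small sketch. It assumes Spark 2.0 or later (for `longAccumulator`) running in local mode; the object name `LazyEvaluationSketch`, the helper `countMapCalls`, and the `local[2]` master are illustrative choices, not something taken from the book.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns how many times the map function ran:
  // (before any action, after two actions without persist, after two actions with persist).
  def countMapCalls(sc: SparkContext): (Long, Long, Long) = {
    val calls = sc.longAccumulator("map calls")
    val squares = sc.parallelize(1 to 4).map { x => calls.add(1); x * x }

    val beforeAction = calls.sum // only a transformation so far: nothing has run

    squares.count()
    squares.count()
    val withoutPersist = calls.sum // each action recomputed the map

    calls.reset()
    squares.persist()
    squares.count() // computes the partitions and caches them
    squares.count() // served from the cache
    val withPersist = calls.sum

    squares.unpersist()
    (beforeAction, withoutPersist, withPersist)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; in spark-shell, `sc` already exists.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("lazy-sketch"))
    try println(countMapCalls(sc)) finally sc.stop()
  }
}
```

In a local run with no task retries this should print (0,8,4): no calls before the first action, four calls per action without persist(), and four calls in total once the RDD is persisted.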
Creating RDDs
- loading an external dataset
val lines = sc.textFile("README.md")
- parallelizing a collection in your driver program
val parallelize_lines = sc.parallelize(List("pandas", "i like you"))
RDD operations
RDDs support two kinds of operations: transformations and actions.
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first().
Transformations return RDDs, whereas actions return some other data type.
RDDs have a collect() function that retrieves the entire RDD. It is useful if your program filters an RDD down to a very small size and you'd like to deal with it locally; the whole dataset must fit in memory on a single machine.
Element-wise transformations
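- map() applies a function to each element and returns an RDD of the results.
- filter() returns an RDD containing only the elements that pass the predicate.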
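A minimal sketch of both element-wise transformations, reusing the sample strings from the parallelize() line earlier in these notes. The object name `ElementWiseSketch` and the helpers `upperCased` and `mentioning` are hypothetical names chosen for illustration; collect() is used only because the data is tiny.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ElementWiseSketch {
  // map(): produces exactly one output element per input element.
  def upperCased(sc: SparkContext, lines: Seq[String]): Array[String] =
    sc.parallelize(lines).map(_.toUpperCase).collect()

  // filter(): keeps only the elements for which the predicate is true.
  def mentioning(sc: SparkContext, lines: Seq[String], word: String): Array[String] =
    sc.parallelize(lines).filter(_.contains(word)).collect()

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; in spark-shell, `sc` already exists.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("elementwise-sketch"))
    try {
      val lines = List("pandas", "i like you")
      println(upperCased(sc, lines).mkString(", "))
      println(mentioning(sc, lines, "pandas").mkString(", "))
    } finally sc.stop()
  }
}
```

Neither map() nor filter() changes the input RDD; each returns a new RDD, and nothing is computed until collect() is called.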
Some simple set operations
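- RDD1.distinct() removes duplicates. It is expensive because it requires shuffling all the data over the network to ensure that we receive only one copy of each element.
- RDD1.union(RDD2) returns all elements from both RDDs, keeping duplicates.
- RDD1.intersection(RDD2) returns only the elements present in both RDDs and removes duplicates; it also requires a shuffle.
- RDD1.subtract(RDD2) returns the elements of RDD1 that are not in RDD2; it also requires a shuffle.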
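The four set operations can be compared side by side on two small RDDs. This is a sketch under assumptions: the object name `SetOperationsSketch`, the helper `run`, and the sample integers are made up for illustration, and results are sorted on the driver only because the order of elements after a shuffle is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SetOperationsSketch {
  // Applies each set operation to two small RDDs and returns the sorted results.
  def run(sc: SparkContext): Map[String, List[Int]] = {
    val rdd1 = sc.parallelize(List(1, 2, 2, 3))
    val rdd2 = sc.parallelize(List(3, 4))
    Map(
      "distinct"     -> rdd1.distinct().collect().sorted.toList,
      "union"        -> rdd1.union(rdd2).collect().sorted.toList,
      "intersection" -> rdd1.intersection(rdd2).collect().sorted.toList,
      "subtract"     -> rdd1.subtract(rdd2).collect().sorted.toList
    )
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; in spark-shell, `sc` already exists.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("setops-sketch"))
    try run(sc).foreach { case (name, values) => println(s"$name: $values") }
    finally sc.stop()
  }
}
```

With this input, union keeps the repeated 2 and both copies of 3, while subtract keeps the repeated 2 because only elements that also appear in the second RDD are dropped.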
Actions
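reduce() takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. A simple example of such a function is +, which we can use to sum our RDD.
val sum = rdd.reduce((x, y) => x + y)
The aggregate() function frees us from the constraint of having the return type be the same as the type of the RDD we are working on. It takes an initial zero value, a function that folds each element into an accumulator, and a function that merges two accumulators. The session below assumes input is an RDD[Int] holding 1, 2, 3, 4, which is consistent with the output shown.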
scala> val result = input.aggregate((0,0))((acc,value) => (acc._1 + value, acc._2 + 1), (acc1, acc2) => (acc1._1 + acc2._1, acc1._2+acc2._2))
result: (Int, Int) = (10,4)
scala> val avg = result._1 / result._2.toDouble
avg: Double = 2.5
Persistence
In Scala and Java, the default persist() stores the data in the JVM heap as unserialized objects.
Finally, RDDs come with a method called unpersist() that lets you manually remove them from the cache.
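The persist/unpersist life cycle can be followed through the RDD's storage level. A minimal sketch, assuming local mode; the object name `PersistenceSketch` and the helper `storageLevels` are hypothetical, while `getStorageLevel`, `persist()`, `unpersist()` and `StorageLevel` are part of the Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistenceSketch {
  // Returns the storage level before persist(), after persist(), and after unpersist().
  def storageLevels(sc: SparkContext): (StorageLevel, StorageLevel, StorageLevel) = {
    val doubled = sc.parallelize(1 to 100).map(_ * 2)
    val initial = doubled.getStorageLevel // not persisted yet

    doubled.persist() // no argument: equivalent to persist(StorageLevel.MEMORY_ONLY)
    doubled.count()   // persist() is lazy too; this action fills the cache
    val cached = doubled.getStorageLevel

    doubled.unpersist() // manually remove the RDD from the cache
    val released = doubled.getStorageLevel

    (initial, cached, released)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical local setup; in spark-shell, `sc` already exists.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("persist-sketch"))
    try println(storageLevels(sc)) finally sc.stop()
  }
}
```

Other levels, such as StorageLevel.MEMORY_AND_DISK or StorageLevel.MEMORY_ONLY_SER, can be passed to persist() when the data does not fit comfortably on the heap as unserialized objects.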