Spark study notes: chapters 1-3 of Learning Spark: Lightning-Fast Big Data Analysis

About RDDs

  1. Spark computes RDDs only in a lazy fashion, that is, the first time they are used in an action.

  2. RDDs are by default recomputed each time you run an action on them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().
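
The two points above can be sketched as follows. This is a minimal illustration, not code from the book: it assumes a local SparkContext (in spark-shell, `sc` already exists), and the sample strings are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: run locally; getOrCreate reuses an existing context if there is one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

// Made-up sample data for the illustration.
val lines = sc.parallelize(Seq("pandas", "i like pandas", "spark"))

// Transformation: only recorded here, nothing is computed yet.
val pandaLines = lines.filter(_.contains("pandas"))

// Ask Spark to keep the RDD around once it has been computed.
pandaLines.persist()

val n = pandaLines.count()     // first action: triggers the computation
val head = pandaLines.first()  // second action: can reuse the persisted data
```

Without the persist() call, the filter would be evaluated again for the second action.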

Creating RDDs
  1. loading an external dataset

  2. parallelizing a collection in your driver program

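    // 1. load an external dataset (one element per line of the file)
    val lines = sc.textFile("README.md")
    // 2. parallelize a collection in the driver program
    val parallelize_lines = sc.parallelize(List("pandas", "i like you"))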

RDD operations

Transformations and actions
Transformations are operations on RDDs that return a new RDD, such as map() and filter().
Actions are operations that return a result to the driver program or write it to storage, and kick off a computation, such as count() and first().

Transformations return RDDs, whereas actions return some other data type.

RDDs have a collect() function that retrieves the entire RDD to the driver. It is useful if your program filters an RDD down to a very small size and you’d like to deal with it locally; the whole result must fit in memory on a single machine.
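
The difference in return types can be made visible with explicit type annotations. The sketch below assumes a local SparkContext, and the log lines are hypothetical sample data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Assumption: run locally; in spark-shell `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("ops-demo").setMaster("local[*]"))

// Hypothetical log lines, made up for this example.
val logs: RDD[String] = sc.parallelize(Seq(
  "INFO start", "ERROR disk full", "INFO ok", "ERROR timeout"))

// Transformation: the result is another RDD.
val errors: RDD[String] = logs.filter(_.startsWith("ERROR"))

// Actions: the results are ordinary values in the driver program.
val howMany: Long = errors.count()
val local: Array[String] = errors.collect() // fine here because the result is tiny
```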

Element-wise transformations
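map() applies a function to each element of the RDD; the results become the elements of the new RDD.

filter() returns a new RDD that contains only the elements that pass the given predicate.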
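
A small sketch of both element-wise transformations, assuming a local SparkContext and a made-up RDD of four integers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: run locally; in spark-shell `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("map-filter-demo").setMaster("local[*]"))

val nums = sc.parallelize(Seq(1, 2, 3, 4))

// map(): one output element per input element.
val squares = nums.map(x => x * x)

// filter(): keeps only the elements for which the predicate is true.
val evenSquares = squares.filter(_ % 2 == 0)

// collect() is an action, so the two transformations run here.
val out = evenSquares.collect()
println(out.mkString(","))
```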
Some simple set operations
  1. RDD1.distinct() is very expensive because it requires shuffling all the data over the network to ensure that we receive only one copy of each element.
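  2. RDD1.union(RDD2) returns an RDD with all the elements of both RDDs; duplicates are kept.
  3. RDD1.intersection(RDD2) returns only the elements present in both RDDs and removes duplicates, so it also needs a shuffle.
  4. RDD1.subtract(RDD2) returns the elements of RDD1 that are not in RDD2; like intersection(), it performs a shuffle.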
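
The four set operations can be tried on two tiny RDDs. This sketch assumes a local SparkContext; the results of the shuffling operations are sorted only because their element order is not guaranteed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: run locally; in spark-shell `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("set-ops-demo").setMaster("local[*]"))

val rdd1 = sc.parallelize(Seq("coffee", "coffee", "panda", "monkey", "tea"))
val rdd2 = sc.parallelize(Seq("coffee", "monkey", "kitty"))

val uniq = rdd1.distinct().collect().sorted            // duplicates removed
val both = rdd1.union(rdd2).collect()                  // duplicates kept
val common = rdd1.intersection(rdd2).collect().sorted  // in both, deduplicated
val onlyIn1 = rdd1.subtract(rdd2).collect().sorted     // in rdd1 but not in rdd2
```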
Actions

reduce() takes a function that operates on two elements of the type in your RDD and returns a new element of the same type. A simple example of such a function is +, which we can use to sum our RDD.

val sum = rdd.reduce((x, y) => x + y)

The aggregate() function frees us from the constraint of having the return value be the same type as the RDD we are working on.

scala> val result = input.aggregate((0,0))((acc,value) => (acc._1 + value, acc._2 + 1), (acc1, acc2) => (acc1._1 + acc2._1, acc1._2+acc2._2))
result: (Int, Int) = (10,4)

scala> val avg = result._1 / result._2.toDouble
avg: Double = 2.5
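
The session above relies on an RDD named input that is not defined in these notes; its output (10,4) is consistent with input holding 1, 2, 3, 4. A self-contained sketch under that assumption, with a local SparkContext:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: run locally; in spark-shell `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("aggregate-demo").setMaster("local[*]"))

// Assumption: the contents of `input` are inferred from the output shown above.
val input = sc.parallelize(Seq(1, 2, 3, 4))

// reduce(): the result has the element type of the RDD (Int).
val sum = input.reduce((x, y) => x + y)

// aggregate(): the result type (Int, Int) differs from the element type.
val result = input.aggregate((0, 0))(
  (acc, value) => (acc._1 + value, acc._2 + 1),           // fold one element into (sum, count)
  (acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2)) // merge two partial (sum, count) pairs

val avg = result._1 / result._2.toDouble
```

The first function is applied within each partition and the second merges the per-partition accumulators, which is why two functions are needed.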

Persistence

In Scala and Java, the default persist() stores the data in the JVM heap as unserialized objects.

Finally, RDDs come with a method called unpersist() that lets you manually remove them from the cache.
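
A short sketch of the persist/unpersist life cycle, assuming a local SparkContext and a made-up RDD. StorageLevel.MEMORY_ONLY is passed explicitly here; it is the level that persist() uses for RDDs when called without an argument.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Assumption: run locally; in spark-shell `sc` is already defined.
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("persist-demo").setMaster("local[*]"))

val squares = sc.parallelize(Seq(1, 2, 3, 4)).map(x => x * x)

// Unserialized objects on the JVM heap.
squares.persist(StorageLevel.MEMORY_ONLY)
val levelWhileCached = squares.getStorageLevel

val total = squares.reduce(_ + _) // first action: computes and caches
val count = squares.count()       // second action: can read from the cache

// Manually remove the RDD from the cache.
squares.unpersist()
val levelAfter = squares.getStorageLevel
```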
