Notes: MIT 6.824 Lecture 15: Big Data: Spark

Preface

This lecture introduces Spark, an evolution of MapReduce, and covers its programming model, execution strategy, and fault tolerance.


1. Programming model


val lines = spark.read.textFile("in").rdd
– reads the file; this only builds the computation graph, and the data is not read until an action such as collect() runs

lines.collect()
– lines yields a list of strings, one per line of input
– if we run lines.collect() again, it re-reads file “in”
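
The re-reading behaviour can be imitated without Spark. The following is a minimal sketch in plain Scala, not Spark code: `FakeFile`, `reads` and `cachedLines` are hypothetical names, and a `lazy val` stands in for what `cache()` does on an RDD.

```scala
// Stand-in for an input file whose reads we can count.
final class FakeFile {
  // Counts how often the "file" is read.
  var reads = 0

  // Stand-in for an uncached RDD: a recipe that is re-run by every action.
  def lines(): Seq[String] = {
    reads += 1
    Seq("u1 u3", "u2 u3") // hypothetical file contents
  }

  // Stand-in for cache(): evaluated on first use, then kept in memory.
  lazy val cachedLines: Seq[String] = lines()
}

object LazySketch {
  def main(args: Array[String]): Unit = {
    val f = new FakeFile
    f.lines()
    f.lines() // two "collect()" calls cause two reads
    val uncachedReads = f.reads
    val n = f.cachedLines.size + f.cachedLines.size // two uses, one extra read
    println(s"uncached reads: $uncachedReads, total reads: ${f.reads}, lines seen: $n")
  }
}
```

Each call to `lines()` repeats the work, while `cachedLines` pays for it once. This is why data that is reused across iterations is worth keeping in memory.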

val links1 = lines.map{ s => val parts = s.split("\\s+"); (parts(0), parts(1)) }
links1.collect()

– map, split, tuple – acts on each line in turn
– parses each string “x y” into tuple ( “x”, “y” )

val links2 = links1.distinct()
– distinct() brings identical records together so that duplicates are dropped

val links3 = links2.groupByKey()
– groupByKey() sorts or hashes to bring instances of each key together
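After this step each key is paired with the collection of its values, i.e. each page with the list of pages it links to.
[Figure missing in the source]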
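
To make the shape of the data after map, distinct and groupByKey concrete, here is a minimal sketch that runs the same three steps on local Scala collections instead of RDDs. The sample input and the names `LinksSketch`, `parse` and `group` are assumptions for illustration. Real `groupByKey()` returns an `Iterable` per key and does not promise any ordering.

```scala
object LinksSketch {
  // Hypothetical sample input: one "from to" pair per line, with one duplicate.
  val lines: Seq[String] =
    Seq("u1 u3", "u1 u1", "u2 u3", "u2 u2", "u3 u1", "u1 u3")

  // Same parsing as links1: split on whitespace, keep the first two fields.
  def parse(s: String): (String, String) = {
    val parts = s.split("\\s+")
    (parts(0), parts(1))
  }

  // distinct() followed by groupByKey(), emulated with local collections.
  def group(ls: Seq[String]): Map[String, Seq[String]] =
    ls.map(parse)
      .distinct
      .groupBy(_._1)
      .map { case (from, pairs) => from -> pairs.map(_._2) }

  def main(args: Array[String]): Unit =
    group(lines).toSeq.sortBy(_._1).foreach(println)
}
```

With this input, `u1` ends up with the two pages it links to (`u3` and `u1`), and the duplicated "u1 u3" line appears only once.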

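links4 and ranks are used below but not defined in these notes; in the standard Spark PageRank example they are set up like this:
val links4 = links3.cache()
– keeps the grouped links in memory, because every iteration reuses them
var ranks = links4.mapValues(v => 1.0)
– every page starts with rank 1.0

Start of the iteration
val jj = links4.join(ranks)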
– the join brings each page’s link list and current rank together

The MapReduce-style logic
val contribs = jj.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
– for each link, the “from” page’s rank divided by number of its links
ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
– sum up the links that lead to each page
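
One iteration can be checked by hand on a small graph. The sketch below mirrors join, flatMap, reduceByKey and mapValues with plain Scala collections; it is an illustration under assumptions, not the Spark implementation. The three-page graph and the names `RankStepSketch`, `step` and `initial` are hypothetical.

```scala
object RankStepSketch {
  type Links = Map[String, Seq[String]]
  type Ranks = Map[String, Double]

  // One iteration, mirroring join -> flatMap -> reduceByKey -> mapValues.
  def step(links: Links, ranks: Ranks): Ranks = {
    // join + flatMap: each page splits its rank evenly over its outgoing links.
    val contribs: Seq[(String, Double)] = for {
      (page, urls) <- links.toSeq
      rank <- ranks.get(page).toSeq // inner join: pages without a rank are skipped
      url <- urls
    } yield (url, rank / urls.size)

    // reduceByKey(_ + _) followed by mapValues(0.15 + 0.85 * _).
    contribs
      .groupBy(_._1)
      .map { case (url, cs) => url -> (0.15 + 0.85 * cs.map(_._2).sum) }
  }

  // Hypothetical three-page graph; every page starts with rank 1.0.
  val links: Links = Map(
    "u1" -> Seq("u3", "u1"),
    "u2" -> Seq("u3", "u2"),
    "u3" -> Seq("u1")
  )
  val initial: Ranks = links.map { case (page, _) => page -> 1.0 }

  def main(args: Array[String]): Unit =
    step(links, initial).toSeq.sortBy(_._1).foreach(println)
}
```

On this graph, `u1` receives 0.5 from itself and 1.0 from `u3`, so its new rank is 0.15 + 0.85 * 1.5 = 1.425. Feeding the result back into `step` gives the next iteration.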

Second iteration
val jj2 = links4.join(ranks)
– join() brings together equal keys; must sort or hash
val contribs2 = jj2.values.flatMap{ case (urls, rank) => urls.map(url => (url, rank / urls.size)) }
ranks = contribs2.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)

– reduceByKey() brings together equal keys

Written with MapReduce, this computation would take many separate MapReduce programs.


2. Execution strategy

2.1 Computation process

[Figure missing in the source]

2.2 Execution

[Figure missing in the source]


3. Fault tolerance

  1. In general, if a worker fails, its computation is repeated; the recomputation can be spread across many workers, so it runs in parallel

  2. Checkpointing reduces the amount of computation needed for recovery
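
The two points above can be illustrated with a toy model of one lost partition. This is a rough sketch in plain Scala, not how Spark is implemented. It assumes the lineage is a simple chain of per-partition transformations, and `Stage`, `stages`, `input` and `replay` are hypothetical names.

```scala
object LineageSketch {
  // One step of the lineage, applied to a single partition.
  type Stage = Seq[Int] => Seq[Int]

  // Hypothetical three-stage lineage.
  val stages: Seq[Stage] = Seq[Stage](
    _.map(_ + 1),
    _.filter(_ % 2 == 0),
    _.map(_ * 10)
  )

  // Source partition, assumed to be still readable after the failure.
  val input: Seq[Int] = Seq(1, 2, 3, 4, 5)

  // Replays `todo` on `start` and reports how many stages had to run.
  def replay(start: Seq[Int], todo: Seq[Stage]): (Seq[Int], Int) =
    (todo.foldLeft(start)((data, f) => f(data)), todo.size)

  def main(args: Array[String]): Unit = {
    // No checkpoint: rebuild the lost partition from the original input.
    val (full, fullCost) = replay(input, stages)
    // Checkpoint taken after stage 2: only the remaining stage is replayed.
    val (checkpoint, _) = replay(input, stages.take(2))
    val (fast, fastCost) = replay(checkpoint, stages.drop(2))
    println(s"full replay: $full in $fullCost stages")
    println(s"from checkpoint: $fast in $fastCost stage(s)")
  }
}
```

Both paths rebuild the same partition, but the checkpointed path replays fewer stages. Because each partition is rebuilt independently in this model, different lost partitions could be handed to different workers.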


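Summary

Limitations
– All records are treated the same way
– Transformations are functional: they turn input into output
– No notion of modifying data

Advantages
– Improves on MapReduce for multi-step, iterative jobs such as the PageRank example above
– Introduces an intuitive dataflow view of the computation
– Can improve performance by keeping data in memory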
