Spark Learning

offcially manual : https://spark.apache.org/docs/latest/rdd-programming-guide.html

一: Spark versus Hadoop

Spark is faster than Hadoop cause hadoop execute disk io to retain failure tolerant function,whereas Spark through its functional programming .

二 :spark RDD(resilient distributed datasets)

TRANSFORMATION: LAZY to execute,like filter(),map(),flatMap() and so forth,spark could optimize chain operations,never execute intermediate process.

ACTION: EAGER to execute. like count(), foreach()  countByKey and so forth

三:spark job execution

141202_tFON_2942603.png

141424_j0Ti_2942603.png

四:COMMON USED API

transformation: groupBy   groupByKey  reduceBy reduceByKey  mapValues keys不会立即计算结果(lazy)

093218_fjKQ_2942603.png

111633_JF5O_2942603.png

112213_b4PI_2942603.png

WikipediaRanking assignment:使用inverted index配合reduceByKey排序 比传统的遍历行查找内容aggregate速度快上一倍

package wikipedia

import java.util.stream.Collectors

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

case class WikipediaArticle(title: String, text: String) {
  /**
    * @return Whether the text of this article mentions `lang` or not
    * @param lang Language to look for (e.g. "Scala")
    */
  def mentionsLanguage(lang: String): Boolean = text.split(' ').contains(lang)
}

object WikipediaRanking {

  val langs = List(
    "JavaScript", "Java", "PHP", "Python", "C#", "C++", "Ruby", "CSS",
    "Objective-C", "Perl", "Scala", "Haskell", "MATLAB", "Clojure", "Groovy")

  val conf: SparkConf = new SparkConf().setAppName("Spark RDD").setMaster("local[*]").set("spark.executor.memory", "2g")
  val sc: SparkContext = new SparkContext(conf)

  // Hint: use a combination of `sc.textFile`, `WikipediaData.filePath` and `WikipediaData.parse`
  val wikiRdd: RDD[WikipediaArticle] = sc.textFile(WikipediaData.filePath).flatMap(lines => lines.split("\n")).map(x=>WikipediaData.parse(x))
  /** Returns the number of articles on which the language `lang` occurs.
   *  Hint1: consider using method `aggregate` on RDD[T].
   *  Hint2: consider using method `mentionsLanguage` on `WikipediaArticle`
   */
  def occurrencesOfLang(lang: String, rdd: RDD[WikipediaArticle]): Int = rdd.aggregate(0)((acc,article)=>
    if(article.mentionsLanguage(lang)) acc+1 else acc,(acc1, acc2) => (acc1 + acc2))

  /* (1) Use `occurrencesOfLang` to compute the ranking of the languages
   *     (`val langs`) by determining the number of Wikipedia articles that
   *     mention each language at least once. Don't forget to sort the
   *     languages by their occurrence, in decreasing order!
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  //Result��List(("Scala", 999999), ("JavaScript", 1278), ("LOLCODE", 982), ("Java", 42))
  def rankLangs(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    langs.map((lang)=>(lang,occurrencesOfLang(lang,rdd))).sortWith((x,y)=>x._2>y._2)



  /* Compute an inverted index of the set of articles, mapping each language
   * to the Wikipedia pages in which it occurs.
   */
  def makeIndex(langs: List[String], rdd: RDD[WikipediaArticle]): RDD[(String, Iterable[WikipediaArticle])] = {
    rdd.map((w)=>(w,langs.filter((o)=>w.mentionsLanguage(o)).toList)).map(x=>x._2.map((ls)=>(ls,x._1))).flatMap(x=>x).groupByKey()
  }




  /* (2) Compute the language ranking again, but now using the inverted index. Can you notice
   *     a performance improvement?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsUsingIndex(index: RDD[(String, Iterable[WikipediaArticle])]): List[(String, Int)] =
  index.map((o)=>(o._1,o._2.size)).sortBy(_._2,false).collect().toList

  /* (3) Use `reduceByKey` so that the computation of the index and the ranking are combined.
   *     Can you notice an improvement in performance compared to measuring *both* the computation of the index
   *     and the computation of the ranking? If so, can you think of a reason?
   *
   *   Note: this operation is long-running. It can potentially run for
   *   several seconds.
   */
  def rankLangsReduceByKey(langs: List[String], rdd: RDD[WikipediaArticle]): List[(String, Int)] =
    rdd.map((w)=>(w,langs.filter((o)=>w.mentionsLanguage(o)).toList)).map(x=>x._2.map((ls)=>(ls,x._1))).flatMap(x=>x).map((m)=>(m._1,1)).reduceByKey(_+_).sortBy(_._2,false).collect().toList


  def main(args: Array[String]) {

    /* Languages ranked according to (1) */
    val langsRanked: List[(String, Int)] = timed("Part 1: naive ranking", rankLangs(langs, wikiRdd))

    /* An inverted index mapping languages to wikipedia pages on which they appear */
    def index: RDD[(String, Iterable[WikipediaArticle])] = makeIndex(langs, wikiRdd)

    /* Languages ranked according to (2), using the inverted index */
    val langsRanked2: List[(String, Int)] = timed("Part 2: ranking using inverted index", rankLangsUsingIndex(index))

    /* Languages ranked according to (3) */
    val langsRanked3: List[(String, Int)] = timed("Part 3: ranking using reduceByKey", rankLangsReduceByKey(langs, wikiRdd))

    /* Output the speed of each ranking */
    println(timing)
    sc.stop()
  }

  val timing = new StringBuffer
  def timed[T](label: String, code: => T): T = {
    val start = System.currentTimeMillis()
    val result = code
    val stop = System.currentTimeMillis()
    timing.append(s"Processing $label took ${stop - start} ms.\n")
    result
  }
}

pair RDDs

114036_2eTQ_2942603.png

转载于:https://my.oschina.net/odetteisgorgeous/blog/1530050

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
经导师精心指导并认可、获 98 分的毕业设计项目!【项目资源】:微信小程序。【项目说明】:聚焦计算机相关专业毕设及实战操练,可作课程设计与期末大作业,含全部源码,能直用于毕设,经严格调试,运行有保障!【项目服务】:有任何使用上的问题,欢迎随时与博主沟通,博主会及时解答。 经导师精心指导并认可、获 98 分的毕业设计项目!【项目资源】:微信小程序。【项目说明】:聚焦计算机相关专业毕设及实战操练,可作课程设计与期末大作业,含全部源码,能直用于毕设,经严格调试,运行有保障!【项目服务】:有任何使用上的问题,欢迎随时与博主沟通,博主会及时解答。 经导师精心指导并认可、获 98 分的毕业设计项目!【项目资源】:微信小程序。【项目说明】:聚焦计算机相关专业毕设及实战操练,可作课程设计与期末大作业,含全部源码,能直用于毕设,经严格调试,运行有保障!【项目服务】:有任何使用上的问题,欢迎随时与博主沟通,博主会及时解答。 经导师精心指导并认可、获 98 分的毕业设计项目!【项目资源】:微信小程序。【项目说明】:聚焦计算机相关专业毕设及实战操练,可作课程设计与期末大作业,含全部源码,能直用于毕设,经严格调试,运行有保障!【项目服务】:有任何使用上的问题,欢迎随时与博主沟通,博主会及时解答。
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值