Learning Spark

Original post, 2016-08-29 14:09:57

Spark basic concepts:

    1, RDD (resilient distributed dataset)
    2, Task: ShuffleMapTask and ResultTask (similar to map and reduce)
    3, Job: a job is triggered by an action and is made up of multiple tasks
    4, Stage: a job can have multiple stages
    5, Partition: an RDD is split into partitions, which can be placed on different machines
    6, NarrowDependency: base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
    7, ShuffleDependency: also called a wide dependency; a partition of the child RDD may depend on all partitions of the parent RDD.
    8, DAG: directed acyclic graph; RDD dependencies never form a cycle, so no parent RDD depends on its child RDD
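
The concepts above can be seen together in a small word count. This is a minimal sketch, assuming Spark is on the classpath and using an in-process `local[2]` master; the object name `DependencySketch` and the helper `countWords` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DependencySketch {
  // Hypothetical helper: counts occurrences of each word with one narrow and one wide step.
  def countWords(sc: SparkContext, words: Seq[String]): Map[String, Int] = {
    // An RDD split into 2 partitions.
    val rdd = sc.parallelize(words, numSlices = 2)

    // map is a narrow dependency: each child partition reads exactly one parent partition,
    // so it is pipelined inside a single stage.
    val pairs = rdd.map(w => (w, 1))

    // reduceByKey introduces a ShuffleDependency, which splits the job into two stages.
    val counts = pairs.reduceByKey(_ + _)

    // toDebugString prints the lineage (the DAG) of this RDD.
    println(counts.toDebugString)

    // collect() is an action: it submits one job, which runs as tasks in each stage.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DependencySketch")
    val sc = new SparkContext(conf)
    try {
      println(countWords(sc, Seq("a", "b", "a", "c", "b", "a")))
    } finally {
      sc.stop()
    }
  }
}
```

The tasks of the first stage are ShuffleMapTasks (they write shuffle output), and the tasks of the final stage are ResultTasks (they return results to the driver).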

Spark Core functions:

    1, SparkContext: the entry point for driver application execution and output; SparkContext must be initialized before submitting Spark jobs
        SparkContext has:
        1) communication
        2) distributed deployment
        3) messaging
        4) storage
        5) computation
        6) cache
        7) measurement system
        8) file service
        9) web service
        An application uses the SparkContext API to create jobs:
            DAGScheduler plans the RDDs in the DAG into stages and submits the stages.
            TaskScheduler requests resources, submits tasks, and asks the cluster to schedule them.
    2, Storage System
        1) Spark prefers memory; if memory is insufficient, it falls back to disk. Tachyon (a distributed in-memory file system) can also be used.
    3, Computation Engine:
    4, Deployment
        1) Standalone
        2) Yarn
        3) Mesos
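
As an illustration of the points above, the sketch below initializes a SparkContext and caches an RDD with a storage level that prefers memory and falls back to disk. It assumes Spark is on the classpath and uses `local[*]` so it runs without a cluster; on a cluster the master URL would instead take a form such as `spark://host:port` (standalone) or `mesos://host:port` (Mesos), with host and port specific to the deployment. The names `ContextSketch` and `sumOfSquares` are made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ContextSketch {
  // Hypothetical helper: sums the squares of 1..n (n must be at least 1, since reduce fails on an empty RDD).
  def sumOfSquares(sc: SparkContext, n: Int): Long = {
    val squares = sc.parallelize(1 to n).map(i => i.toLong * i)

    // MEMORY_AND_DISK keeps partitions in memory and writes those that do not fit to disk.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action computes the RDD and fills the cache.
    squares.count()

    // The second action can read the cached partitions instead of recomputing them.
    val total = squares.reduce(_ + _)

    squares.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    // SparkContext must exist before any job can be submitted.
    val conf = new SparkConf().setMaster("local[*]").setAppName("ContextSketch")
    val sc = new SparkContext(conf)
    try {
      println(sumOfSquares(sc, 1000))
    } finally {
      sc.stop()
    }
  }
}
```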

Tuning Spark:

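    1, Data Serialization:
        1) Java serialization (object --> bytes --> object): the default; works with any class that implements java.io.Serializable, but is often slow and produces large output
        2) Kryo serialization: significantly faster and more compact than Java serialization (often as much as 10x), but custom classes should be registered for best performance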
            val conf = new SparkConf().setMaster(...).setAppName(...)
            conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
            val sc = new SparkContext(conf)
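    2, Memory Tuning:
        1) Object header: each distinct Java object has a header of about 16 bytes
        2) String overhead: a Java String has about 40 bytes of overhead on top of the raw string data
        3) Common collection classes such as HashMap or LinkedList use a wrapper object per entry, which has its own header plus pointers (typically 8 bytes each) to the next object
        4) Collections of primitive types often store them as "boxed" objects such as java.lang.Integer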
    3, Memory management overview
        1) Memory usage in Spark largely falls under one of two categories: execution and storage.
            a) Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations.
            b) Storage memory refers to that used for caching and propagating internal data across the cluster.
        2) Execution and storage share a unified region (M), with a protected storage subregion (R):
            a) When no execution memory is used, storage can acquire all the available memory and vice versa.
            b) Execution may evict storage if necessary, but only until total storage usage falls to R; R describes a subregion within M where cached blocks are never evicted.
        3) This design ensures several desirable properties:
            a) First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills.
            b) Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted.
            c) Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
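
The sizes of M and R can be worked out with simple arithmetic. The sketch below is plain Scala with no Spark dependency. It assumes the defaults documented for Spark 2.x and later: `spark.memory.fraction` = 0.6 sizes M as a fraction of (JVM heap minus 300 MiB), and `spark.memory.storageFraction` = 0.5 sizes R as a fraction of M. Other versions may use different defaults, and the names `UnifiedMemorySketch` and `regions` are made up for illustration.

```scala
object UnifiedMemorySketch {
  // Memory set aside before M is computed (300 MiB, per the Spark configuration docs).
  val ReservedBytes: Long = 300L * 1024 * 1024

  // m: unified execution + storage region; r: storage subregion that is never evicted.
  final case class Regions(m: Long, r: Long)

  // The default fractions are assumptions taken from the Spark 2.x+ documentation.
  def regions(heapBytes: Long,
              memoryFraction: Double = 0.6,
              storageFraction: Double = 0.5): Regions = {
    val usable = math.max(heapBytes - ReservedBytes, 0L)
    val m = (usable * memoryFraction).toLong
    val r = (m * storageFraction).toLong
    Regions(m, r)
  }

  def main(args: Array[String]): Unit = {
    val heap = 4L * 1024 * 1024 * 1024 // a hypothetical 4 GiB executor heap
    val Regions(m, r) = regions(heap)
    val mib = 1024L * 1024
    println(s"M = ${m / mib} MiB, R = ${r / mib} MiB")
  }
}
```

With these assumptions, an application that caches nothing can use all of M for execution, while cached blocks that fit within R stay in memory even when execution needs more space.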


