Learning Spark

Original post, 2016-08-29 14:09:57

Spark basic concepts:

    1, RDD (Resilient Distributed Dataset)
    2, Task: ShuffleMapTask and ResultTask (similar to map and reduce)
    3, Job: a job is made up of multiple tasks
    4, Stage: a job can have multiple stages
    5, Partition: an RDD is split into partitions, which can be distributed across different machines
    6, NarrowDependency: base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
    7, ShuffleDependency: also called a wide dependency; a partition of the child RDD can depend on all partitions of the parent RDD.
    8, DAG: directed acyclic graph; no parent RDD depends on its child RDD
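
The difference between a narrow and a shuffle dependency can be checked directly on an RDD's lineage. The following is a minimal sketch, assuming a local master (`local[2]`) and a made-up word list; the object and method names (`DependencyDemo`, `run`) are illustrative and not part of Spark.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyDemo {
  /** Counts words and reports which kind of dependency each step created. */
  def run(sc: SparkContext): (Boolean, Boolean, Map[String, Int]) = {
    // An RDD spread over 4 partitions (sample data is made up).
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 4)

    // map: each child partition reads exactly one parent partition (narrow).
    val pairs = words.map(w => (w, 1))

    // reduceByKey: records are regrouped by key, which requires a shuffle (wide).
    val counts = pairs.reduceByKey(_ + _)

    val mapIsNarrow = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
    val reduceIsShuffle = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

    // collect() is an action, so it submits a job. The shuffle splits that job
    // into two stages: ShuffleMapTasks before the shuffle, ResultTasks after it.
    (mapIsNarrow, reduceIsShuffle, counts.collect().toMap)
  }

  def main(args: Array[String]): Unit = {
    // "local[2]" and the app name are assumptions chosen for a quick local run.
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("DependencyDemo"))
    try println(run(sc)) finally sc.stop()
  }
}
```

Because `map` is narrow, Spark can pipeline it with the step before it inside one stage; only `reduceByKey` forces a stage boundary.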

Spark Core functions:

    1, SparkContext: drives the execution and output of the driver application; it must be initialized before Spark jobs are submitted
        SparkContext brings together:
        1) communication
        2) distributed deployment
        3) messaging
        4) storage
        5) computation
        6) cache
        7) metrics system
        8) file service
        9) web service
        An application uses the SparkContext API to create jobs:
            the DAGScheduler splits the RDDs in the DAG into stages and submits the stages
            the TaskScheduler requests resources, submits tasks, and asks the cluster to schedule them
    2, Storage System
        1) Spark uses memory first; when memory is insufficient it falls back to disk. Tachyon (a distributed in-memory file system) can also be used.
    3, Computation Engine:
    4, Deployment
        1) Standalone
        2) Yarn
        3) Mesos
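
To make the pieces above concrete, here is a small sketch that creates a SparkContext, caches an RDD with a memory-first storage level, and runs one job. The master URLs in the comments use placeholder host names and ports, and `ContextAndStorageDemo` / `sumOfSquares` are names made up for this example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ContextAndStorageDemo {
  // The master URL selects the deployment mode (hosts and ports are placeholders):
  //   "local[2]"           in-process, 2 threads (used below for a quick test)
  //   "spark://host:7077"  Standalone cluster
  //   "yarn"               YARN (Spark 2.x; 1.x used "yarn-client" / "yarn-cluster")
  //   "mesos://host:5050"  Mesos

  /** Sums the squares of 1..1000, caching the intermediate RDD. */
  def sumOfSquares(sc: SparkContext): Long = {
    val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)

    // Memory first: partitions that do not fit in memory are written to disk.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    // reduce() is an action; the SparkContext hands the job to the
    // DAGScheduler, which passes the resulting tasks to the TaskScheduler.
    val total = squares.reduce(_ + _)

    squares.unpersist()
    total
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("ContextAndStorageDemo")
    val sc = new SparkContext(conf)
    try println(sumOfSquares(sc)) finally sc.stop()
  }
}
```

Only the master URL changes between Standalone, YARN, and Mesos; the application code stays the same.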

Tuning Spark:

    1, Data Serialization:
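        1) Java serialization (object --> bytes --> object): the default; works with any Serializable class
        2) Kryo serialization (up to 10x faster and more compact than Java serialization); register your classes in advance for best performance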
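            import org.apache.spark.{SparkConf, SparkContext}
            val conf = new SparkConf().setMaster(...).setAppName(...)
            // Select Kryo explicitly, then register the classes that will be serialized
            conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
            val sc = new SparkContext(conf)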
    2, Memory Tuning:
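        1) Object header: every Java object has a header of about 16 bytes
        2) String: about 40 bytes of overhead on top of the raw string data
        3) Common collection classes such as HashMap and LinkedList: each entry has a wrapper object, with its own header plus pointers (typically 8 bytes each) to the next entry
        4) Collections of primitive types often store them as "boxed" objects such as java.lang.Integer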
    3, Memory management overview
        1) Memory usage in Spark largely falls under one of two categories: execution and storage.
            a) Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations
            b) Storage memory refers to that used for caching and propagating internal data across the cluster
        2) Execution and storage share a unified region (M); R is a storage subregion within M
            a) When no execution memory is used, storage can acquire all the available memory and vice versa.
            b) R describes a subregion within M where cached blocks are never evicted
        3) This design ensures several desirable properties:
            a) First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills.
            b) Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted.
            c) Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
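
The sizes of M and R can be estimated from the executor heap. This is a rough sketch in plain Scala (no Spark dependency), assuming the documented Spark 2.x behavior: 300 MiB of the heap is reserved, `spark.memory.fraction` (default 0.6) gives M as a fraction of the rest, and `spark.memory.storageFraction` (default 0.5) gives R as a fraction of M. Defaults differ between versions (Spark 1.6 used 0.75 for `spark.memory.fraction`), so check them for your release. `UnifiedMemorySketch`, `Regions`, and `regions` are names invented for this example.

```scala
object UnifiedMemorySketch {
  // Assumption: 300 MiB of the heap is set aside before M is computed.
  val ReservedBytes: Long = 300L * 1024 * 1024

  /** unified = M (execution + storage), protectedStorage = R (never evicted). */
  final case class Regions(unified: Long, protectedStorage: Long)

  def regions(heapBytes: Long,
              memoryFraction: Double = 0.6,   // spark.memory.fraction (assumed default)
              storageFraction: Double = 0.5   // spark.memory.storageFraction (assumed default)
             ): Regions = {
    val usable = math.max(heapBytes - ReservedBytes, 0L)
    val m = (usable * memoryFraction).toLong // M: shared by execution and storage
    val r = (m * storageFraction).toLong     // R: cached blocks here are not evicted
    Regions(m, r)
  }

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024
    val result = regions(4096 * mib) // a hypothetical 4 GiB executor heap
    println(s"M = ${result.unified / mib} MiB, R = ${result.protectedStorage / mib} MiB")
  }
}
```

Execution can use all of M when nothing is cached, and can evict cached blocks only until storage usage drops to R.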
