Learning Spark

Original post, 2016-08-29 14:09:57

Spark basic concepts:

    1, RDD (resilient distributed dataset)
    2, Task: ShuffleMapTask and ResultTask (similar to map and reduce)
    3, Job: a job is made up of multiple tasks
    4, Stage: a job can have multiple stages
    5, Partition: an RDD is split into partitions, which can live on different machines
    6, NarrowDependency: base class for dependencies where each partition of the child RDD depends on a small number of partitions of the parent RDD. Narrow dependencies allow for pipelined execution.
    7, ShuffleDependency: also called a wide dependency; a partition of the child RDD can depend on all partitions of the parent RDD.
    8, DAG: directed acyclic graph; dependencies point only from child RDDs to parent RDDs, so no parent depends on its child
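
As a rough illustration of items 6 and 7, the sketch below asks two RDDs what kind of dependency they have on their parent. The local master (`local[2]`), the application name, the object name `DependencyDemo` and the toy input are all assumptions made up for this example. It needs Spark on the classpath.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency, SparkConf, SparkContext}

object DependencyDemo {
  // Returns (map step is narrow, reduceByKey step is a shuffle).
  def dependencyKinds(sc: SparkContext): (Boolean, Boolean) = {
    val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
    val pairs  = words.map(w => (w, 1))    // each child partition reads one parent partition
    val counts = pairs.reduceByKey(_ + _)  // records are regrouped by key across partitions

    val mapIsNarrow     = pairs.dependencies.head.isInstanceOf[NarrowDependency[_]]
    val reduceIsShuffle = counts.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]
    (mapIsNarrow, reduceIsShuffle)
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings for a local test run, not taken from the notes.
    val conf = new SparkConf().setMaster("local[2]").setAppName("dependency-demo")
    val sc = new SparkContext(conf)
    try {
      println(dependencyKinds(sc))
    } finally {
      sc.stop()
    }
  }
}
```

The shuffle dependency is where a stage boundary falls: the `map` step can be pipelined inside one stage, while `reduceByKey` starts a new one.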

Spark Core functions:

    1, SparkContext: the entry point for driver application execution and output; SparkContext must be initialized before Spark jobs are submitted
        SparkContext covers:
        1) communication
        2) distributed deployment
        3) messaging
        4) storage
        5) computation
        6) cache
        7) metrics system
        8) file service
        9) web service
        Applications use the SparkContext API to create jobs:
            the DAGScheduler splits the RDDs in the DAG into stages and submits the stages;
            the TaskScheduler requests resources, submits tasks, and asks the cluster to schedule them
    2, Storage System
        1) Spark uses memory first; if memory is not enough, it falls back to disk or Tachyon (a distributed in-memory file system)
    3, Computation Engine:
    4, Deployment
        1) Standalone
        2) YARN
        3) Mesos
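
A minimal sketch of the SparkContext and storage points above, assuming a local master (`local[2]`) and made-up names (`ContextAndStorageDemo`, `run`) and data. It creates the context, caches an RDD with a memory-first, disk-fallback storage level, and runs two actions, each of which submits a job. It needs Spark on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object ContextAndStorageDemo {
  // Returns (sum of the squares of 1..1000, how many of those squares are even).
  def run(sc: SparkContext): (Long, Long) = {
    val squares = sc.parallelize(1 to 1000, numSlices = 4)
      .map(n => n.toLong * n)
      .persist(StorageLevel.MEMORY_AND_DISK) // keep in memory, spill to disk if it does not fit

    val total = squares.reduce(_ + _)              // first action: submits a job
    val evens = squares.filter(_ % 2 == 0).count() // second action: another job on the cached RDD
    squares.unpersist()
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    // Assumed settings for a local test run, not taken from the notes.
    val conf = new SparkConf().setMaster("local[2]").setAppName("context-storage-demo")
    val sc = new SparkContext(conf) // must exist before any job can be submitted
    try {
      println(run(sc))
    } finally {
      sc.stop()
    }
  }
}
```

On a cluster, only the master URL would change (Standalone, YARN or Mesos); the job code stays the same.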

Tuning Spark:

    1, Data Serialization:
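        1) Java serialization (object --> bytes --> object)
        2) Kryo serialization (often as much as 10x faster than Java serialization; object --> bytes --> object, in a more compact format)
            import org.apache.spark.{SparkConf, SparkContext}
            val conf = new SparkConf().setMaster(...).setAppName(...)
            // Use Kryo serialization and register the application's own classes with it
            conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
            val sc = new SparkContext(conf)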
    2, Memory Tuning:
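        1) Object header: each distinct Java object has a header of about 16 bytes
        2) String overhead: a Java String has about 40 bytes of overhead on top of the raw string data
        3) Common collection classes such as HashMap or LinkedList: each entry has a wrapper object with its own header plus pointers (typically 8 bytes each) to the next entry
        4) Collections of primitive types often store them as "boxed" objects such as java.lang.Integer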
    3, Memory management overview
        1) Memory usage in Spark largely falls under one of two categories: execution and storage.
            a) Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations
            b) Storage memory refers to that used for caching and propagating internal data across the cluster
        2) M/R: execution and storage share a unified region (M), which contains a storage subregion (R)
            a) When no execution memory is used, storage can acquire all the available memory and vice versa.
            b) R describes a subregion within M where cached blocks are never evicted
        3) This design ensures several desirable properties:
            a) First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills.
            b) Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted.
            c) Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.
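
The sizes of M and R can be estimated with simple arithmetic. The sketch below is not Spark code; it only mirrors the rule described in the Spark tuning guide, where `spark.memory.fraction` sets M as a fraction of (JVM heap minus 300 MiB of reserved memory) and `spark.memory.storageFraction` sets R as a fraction of M. The default values used here (0.6 and 0.5) are the ones I believe recent releases document and may differ between versions; the names `UnifiedMemorySketch` and `regions` are made up for this example.

```scala
object UnifiedMemorySketch {
  // Memory set aside before M is computed (300 MiB, per the tuning guide).
  val ReservedBytes: Long = 300L * 1024 * 1024

  final case class Regions(unifiedM: Long, storageR: Long)

  // memoryFraction  ~ spark.memory.fraction        (assumed default 0.6)
  // storageFraction ~ spark.memory.storageFraction (assumed default 0.5)
  def regions(heapBytes: Long,
              memoryFraction: Double = 0.6,
              storageFraction: Double = 0.5): Regions = {
    require(heapBytes > ReservedBytes, "heap must be larger than the reserved memory")
    val m = ((heapBytes - ReservedBytes) * memoryFraction).toLong // shared by execution and storage
    val r = (m * storageFraction).toLong                          // cached blocks here are not evicted by execution
    Regions(m, r)
  }

  def main(args: Array[String]): Unit = {
    val heap = 4L * 1024 * 1024 * 1024 // a hypothetical 4 GiB executor heap
    val Regions(m, r) = regions(heap)
    val mib = 1024L * 1024
    println(s"M is about ${m / mib} MiB, R is about ${r / mib} MiB")
  }
}
```

Execution can use all of M when nothing is cached, and storage can grow beyond R while execution leaves room, which is the behaviour items 2a and 2b describe.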
