Spark First Touch

Big Data Analytics

The process of examining large data sets to uncover hidden patterns and unknown correlations.

Big Data Analytics Types 

Batch Analytics

Analytics based on data collected over a period of time, i.e. historical data.

Real-Time Analytics

Analytics based on immediate data, for instant results.

Spark vs. Hadoop

Hadoop is used for batch processing.

Spark can be used for both batch processing and real-time processing.

MapReduce Challenges

  • Map tasks read 128 MB blocks of data from disk
  • Map tasks write intermediate results back to disk
  • Intermediate results are sent to reducers over the network
  • Reducers save the final result to disk

Spark Solution

  • RDD = Resilient Distributed Dataset; resilient (reliable) and immutable (cannot be changed once created)
  • An RDD can create another RDD (transformation)
  • An RDD can produce a result (action)
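
A minimal sketch of the transformation/action split, assuming an existing SparkContext sc (the names and numbers are illustrative):

// Transformations return new RDDs immediately; only the action runs the job
val numbers = sc.parallelize(1 to 100)    // source RDD
val doubled = numbers.map(_ * 2)          // transformation: RDD -> new RDD
val evens   = doubled.filter(_ % 4 == 0)  // another transformation
val total   = evens.reduce(_ + _)         // action: executes the chain (sum = 5100)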

Spark Features

  • Speed: in-memory computation.
  • Polyglot: Java, Scala, Python, R.
  • Advanced analytics: SQL queries, streaming, machine learning, and graph processing.
  • Hadoop integration.
  • Machine learning: Mahout (Hadoop) -> MLlib (Spark).

DAG

Directed: only a single direction

Acyclic: no looping


Structure

Driver

  • Broadcast: ships read-only data to every executor
  • Take: pulls a sample of results back to the driver
  • DAG: builds and schedules the execution graph

Executor: runs the tasks the driver schedules

Shuffle: moves data between executors across stage boundaries
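
A minimal sketch tying driver, executor, and broadcast together, assuming an existing SparkContext sc (the lookup map is illustrative):

// The driver ships a read-only map to every executor once
val lookup = sc.broadcast(Map("US" -> "United States", "CN" -> "China"))

val codes = sc.parallelize(Seq("US", "CN", "US"))
val named = codes.map(c => lookup.value.getOrElse(c, "unknown"))  // executors read the broadcast value

named.take(3).foreach(println)  // action: a small sample returns to the driver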


RDD

  • A set of data that cannot change (immutable)
  • Replay is important: lineage lets lost partitions be recomputed
  • In memory
    • Many caching options
  • DataFrame
    • An RDD[Row] with a schema (see the sketch below)
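
A sketch of both points, assuming a local SparkSession (the column names are illustrative): an RDD[Row] plus a schema becomes a DataFrame, and persist selects one of the caching options.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("rdd-to-df").master("local[2]").getOrCreate()

// An RDD[Row] with a schema is a DataFrame
val rowRdd = spark.sparkContext.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType)))
val df = spark.createDataFrame(rowRdd, schema)

// One of the many caching options: deserialized rows, memory only
df.persist(StorageLevel.MEMORY_ONLY)
df.show()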

DAG

A one-way flow that never reverses: source -> transformation -> RDD -> action.

Action

  • count
  • take
  • foreach

Transformation (builds the DAG, but doesn't execute until a question is asked, i.e. an action)

  • map
  • reduceByKey
  • groupByKey
  • join
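
A sketch of that laziness, assuming an existing SparkContext sc and an input file access.log (a hypothetical path): every transformation below returns instantly and only grows the DAG; nothing reads the file until count is called.

val logs = sc.textFile("access.log")           // nothing is read yet
val errors = logs.filter(_.contains("ERROR"))  // still nothing: the DAG just grows
val counts = errors
  .map(line => (line.split(" ")(0), 1))        // pair transformation
  .reduceByKey(_ + _)                          // still lazy
val n = counts.count()                         // action: the whole DAG executes now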

FlumeJava

The FlumeJava idea, which Spark's API follows: do distributed programming with the same code you would write for non-distributed programming.

// Set up the SparkContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("word-count").setMaster("local[2]")
val sc = new SparkContext(conf)

// RDD: read this file into 2 partitions
val lines = sc.textFile(path, 2)

// RDD: split each line on spaces
val words = lines.flatMap(_.split(" "))

// RDD: pair each word with a count of 1, i.e. (word, 1) tuples
val pairs = words.map(word => (word, 1))

// reduceByKey to get the aggregated count for each key
val wordCounts = pairs.reduceByKey(_ + _)

// Bring up to 1000 results back to the driver as a local collection
val localValues = wordCounts.take(1000)

localValues.foreach(r => println(r))

Managing Parallelism

  • Too many sub-tasks leads to excessive task start-up/shutdown overhead
  • Skew: most of the work lands on one machine (e.g. joining on a null value)
    • Fix: hash -> mod, and salt the keys (see the sketch below)
  • Cartesian join
    • Nested structures
    • Windowing
    • reduceByKey
  • Repartitioning
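
A sketch of key salting for skew, assuming a pair RDD pairs: RDD[(String, Int)] like the word-count pairs above: each hot key is spread over 10 salted sub-keys, aggregated in parallel, then the salt is stripped and a small second pass merges the partials.

val SALT = 10

val salted = pairs.map { case (k, v) =>
  (s"${scala.util.Random.nextInt(SALT)}_$k", v)    // spread each key over SALT sub-keys
}
val partial = salted.reduceByKey(_ + _)            // heavy aggregation, now balanced
val merged = partial
  .map { case (k, v) => (k.split("_", 2)(1), v) }  // strip the salt prefix
  .reduceByKey(_ + _)                              // small final merge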

Reposted from: https://my.oschina.net/u/3551123/blog/1301855
