Spark First Touch

最新推荐文章于 2024-09-27 00:45:01 发布

chiruxu4359

最新推荐文章于 2024-09-27 00:45:01 发布

阅读量93

点赞数

文章标签： java 大数据 scala

原文链接：https://my.oschina.net/u/3551123/blog/1301855

版权

Big Data Analytics

Is the process of examining large data sets to uncover hidden patterns, unknown correlations.

Big Data Analytics Types

Batch Analytics

Analytics based on the data collected over a period of time; Using historical data

Real-Time Analytics

Analytics based on immediate data for instant result

Spark v.s. Hadoop

Hadoop is used for batch processing.

Spark can be used for both batch processing and real-time processing.

MapReduce Challenges

Reading 128M data from Disk
Writing result back to Disk
Result Sent to Reduce over Network
Reducer Save the result to Disk

Spark Solution

RDD = Resilient Distributed Data; (Resilient = Reliable); Immutable (Not able to change)
RDD to create another RDD (Transformation)
RDD to get the result (Action)

Spark Features

Speed: in-memory computation.
Polyglot: Java, Scala, Python, R.
Advanced Analytics:
Hadoop Integration
Machine Learning: Mahout (Hadoop) -> MLib (Spark)

DAG

Directed: only a single direction

Acyclic: no looping

Structure

Driver

Broadcast
Take
DAG

Executor

Shuffle

RDD

A set of data that can not change
Replay is import
In Memory
- Many Caching Options
DataFrame
- RDD[Row] with a Schema

DAG

One way map flow control, that doesn't reverse back. Source, Transformation, RDD, Action;

Action

count
take
foreach

Transformation (build the DAG, but doesn't execute until the question is asked i.e. action)

Map
ReduceByKey
GroupByKey
JoinByKey

Flume Java

do distributed programming with the same code you could do non-distributed programming

// setting up SparkContext
val conf = new SparkConf().setMaster("local[2]")
val sc = new SparkContext(conf)

// RDD Read this file for me into
val line = sc.textFile(path, 2)

// RDD Split each line based on space
val words = lines.flatMap(_.split(" "))

// RDD (word, 1), i.e. tuple
val pairs = words.map(word => (word, 1))

// Reduce By Key to get the aggreated value of each key
val wordCounts = pairs.reduceByKey(_+_)

// Local Collection
val localValues = wordCounts.take(1000)

localValues.foreach(r => println(r))

Managing Parallelism

Too many sub-tasks lead to excessive amount of start/shutdown tasks;
Skew - most of the tasks go to one machine (joining on a null value;
- Hash -> Mod and Salt.
Cartesian Join
- Nested Structures
- Windowing
- ReduceByKey
Repartitioning