Lecture 2: Spark RDDs, DataFrames, ML Pipelines, and Parallelization
1. RDDs
A distributed memory abstraction enabling in-memory computations on large clusters in a fault-tolerant manner.
The primary data abstraction in Spark, enabling operations on collections of elements in parallel.
R (Resilient): recompute partitions missing due to node failure
D (Distributed): data distributed across multiple nodes in a cluster
D (Dataset): a collection of partitioned elements
traits:
- In-memory: the data inside an RDD is stored in memory as much (size) and as long (time) as possible
- Immutable (read-only): no change after creation; an RDD can only be transformed into new RDDs via transformations
- Lazily evaluated: the data in an RDD is not available/transformed until an action triggers the execution (see the sketch after this list)
- Parallel: the data is processed in parallel
- Partitioned: the data in an RDD is partitioned and then distributed across the nodes of a cluster
- Cacheable: can hold all the data in persistent "storage" such as memory or disk
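For example, a minimal sketch of lazy evaluation plus caching (assuming an existing SparkContext sc, as in the lab below):
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)  # transformation: nothing is computed yet
squared.cache()                     # mark for in-memory caching (still lazy)
squared.count()                     # action: triggers the computation and fills the cache
squared.sum()                       # reuses the cached partitions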
Operations:
- Transformation: takes an RDD and returns a new RDD, but nothing gets evaluated/computed
- Action: all the data-processing queries are computed (evaluated) and the resulting value is returned
Transformations
Lazy evaluation: Spark just remembers the transformations applied to the base dataset (the results are not evaluated until an action runs)
map(function): return a new distributed dataset formed by passing each element of the source through the function
filter(function): return a new dataset formed by selecting those elements of the source on which the function returns true
flatMap(function): similar to map, but each element can produce 0 or more results through the function (the function should return a sequence rather than a single item)
mapPartitions(function): similar to map, but runs separately on each partition (block) of the RDD
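A short sketch of these transformations (assuming an existing SparkContext sc; nothing runs until the collect() action):
nums = sc.parallelize([1, 2, 3, 4])
nums.map(lambda x: x * 2).collect()            # [2, 4, 6, 8]
nums.filter(lambda x: x % 2 == 0).collect()    # [2, 4]
nums.flatMap(lambda x: [x, x * 10]).collect()  # [1, 10, 2, 20, 3, 30, 4, 40]
nums.mapPartitions(lambda part: [sum(part)]).collect()  # one partial sum per partition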
Actions
reduce(function): aggregate the elements of the dataset using a function (which takes two arguments and returns one)
collect(): return all the elements of the dataset as an array at the driver
count(): return the number of elements in the dataset
first(): return the first element of the dataset (similar to take(1))
take(n): return an array with the first n elements of the dataset
takeSample(withReplacement, num, [seed]): return an array with a random sample of num elements of the dataset
takeOrdered(n, [ordering]): return the first n elements of the RDD using either their natural order or a custom comparator
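A quick sketch of the actions above (assuming an existing SparkContext sc):
nums = sc.parallelize([5, 3, 1, 4, 2])
nums.reduce(lambda a, b: a + b)  # 15
nums.collect()                   # [5, 3, 1, 4, 2]
nums.count()                     # 5
nums.first()                     # 5
nums.take(3)                     # [5, 3, 1]
nums.takeSample(False, 2)        # e.g. [4, 1] (random sample)
nums.takeOrdered(3)              # [1, 2, 3] (natural order)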
Spark key-value pairs
groupByKey([numPartitions]): return a dataset of (K, Iterable<V>) pairs
reduceByKey(function, [numPartitions]): return a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduction function
sortByKey([ascending], [numPartitions]): return a dataset of (K, V) pairs sorted by keys in ascending or descending order
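A small sketch of the pair-RDD operations (assuming an existing SparkContext sc; output order of groupByKey/reduceByKey may vary):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 3]), ('b', [2])]
pairs.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 2)]
pairs.sortByKey().collect()                      # [('a', 1), ('a', 3), ('b', 2)]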
2. DataFrames
Difference between DataFrame and Dataset:
DataFrame: has a schema; generic and untyped (like a table)
Dataset: static, strongly typed
Why DataFrames (rather than RDDs)?
- A distributed collection with the same schema
- Can be constructed from external data sources or from RDDs, essentially becoming an RDD of Row objects
- Supports relational operators (where, groupBy) as well as Spark operations
- Faster and more efficient than RDDs
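A brief sketch of the relational operators (assuming an existing SparkSession spark; the column names are made up for illustration):
df = spark.createDataFrame([("alice", 3), ("bob", 5), ("alice", 7)], ["name", "score"])
df.where(df.score > 4).show()           # relational filter, like SQL WHERE
df.groupBy("name").sum("score").show()  # relational aggregation, like SQL GROUP BY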
3. Machine Learning Pipelines
MLlib
- ML algorithms
- Featurization
- Pipelines
- Persistence: save/load algorithms, models and pipelines
- Utilities: linear algebra, statistics, data handling
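A minimal Pipeline sketch (assuming an existing SparkSession spark; the stages, column names, and save path are illustrative, not from the lecture):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
training = spark.createDataFrame(
    [("spark is great", 1.0), ("hello world", 0.0)], ["text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")      # featurization
hashingTF = HashingTF(inputCol="words", outputCol="features")  # featurization
lr = LogisticRegression(maxIter=10)                            # ML algorithm
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])         # chain the stages
model = pipeline.fit(training)  # fit the whole pipeline as one estimator
model.save("pipeline_model")    # persistence: save the fitted model (hypothetical path)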
Lab
RDDs and shared variables
# list -> RDD
data = [1,2,3,4,5]
rddData = sc.parallelize(data)
rddData.collect()
# output [1,2,3,4,5]
sc.parallelize(data, 16)
# the number of partitions can be set manually
broadcastVar = sc.broadcast([1,2,3])
broadcastVar.value
#[1,2,3]
# a broadcast variable avoids creating a copy of a large variable for each task
# it is cached in serialized form and deserialized before running each task
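# a hypothetical usage sketch: tasks read the broadcast value instead of a closure copy
sc.parallelize([1,2,3]).map(lambda x: x + broadcastVar.value[0]).collect()
# [2, 3, 4]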
accum = sc.accumulator(0)
sc.parallelize([1,2,3,4]).foreach(lambda x: accum.add(x))
accum.value
# 10
DataFrames
# from RDD to DataFrame: rows must be tuples (or Rows), one value per column
rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])
df = rdd.toDF(["a", "b", "c"])
# basic DataFrame operations
df.show()
df.printSchema()
df.drop("a").show()   # drop returns a new DataFrame without the given column
df.describe().show()  # summary statistics for the numeric columns
# DataFrame -> RDD of Row objects
rdd2 = df.rdd
rdd2.collect()  # [Row(a=1, b=2, c=3), Row(a=4, b=5, c=6)]