Lecture 2: Spark RDDs, DataFrames, ML Pipelines, and Parallelization

1. RDDs

A distributed memory abstraction that enables in-memory computation on large clusters in a fault-tolerant manner.

The primary data abstraction in Spark, enabling operations on collections of elements in parallel.

R (Resilient): missing partitions can be recomputed after node failure
D (Distributed): data is distributed across multiple nodes in a cluster
D (Dataset): a collection of partitioned elements

Traits:

  1. In-memory: the data inside an RDD is kept in memory as much (in size) and as long (in time) as possible
  2. Immutable (read-only): an RDD does not change after creation; it can only be transformed into new RDDs via transformations
  3. Lazily evaluated: the data in an RDD is not available/transformed until an action triggers the execution
  4. Parallel: the data is processed in parallel
  5. Partitioned: the data in an RDD is partitioned and then distributed across nodes in a cluster
  6. Cacheable: an RDD can hold all its data in persistent "storage" such as memory or disk (see the sketch after this list)
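
As a rough sketch of the Cacheable trait (assuming a live SparkContext sc, as in the Lab section below):

from pyspark import StorageLevel

rdd = sc.parallelize(range(1000)).map(lambda x: x * x)
rdd.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill to disk if needed
rdd.count()      # first action computes the data and populates the cache
rdd.sum()        # reuses the cached partitions instead of recomputing the map
rdd.unpersist()  # release the storage when the RDD is no longer needed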

Operations:

  1. Transformation: takes an RDD and returns a new RDD, but nothing gets evaluated/computed
  2. Action: all the recorded data-processing steps are computed (evaluated) and a result value is returned

Transformations

Lazy evaluation: Spark just remembers the transformations applied to the base dataset; the result is not evaluated until an action requires it.
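
A minimal illustration (assuming sc): the map call returns immediately because only the lineage is recorded; the action is what actually runs the computation.

rdd = sc.parallelize(range(10))
squares = rdd.map(lambda x: x * x)   # returns instantly: nothing is computed yet
squares.count()                      # action: the map is executed now
# 10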

map(function): return a new distributed dataset formed by passing each element of the source through the function

filter(function): return a new dataset formed by selecting those elements of the source on which the function returns true

flatMap(function): similar to map, but each element can produce 0 or more results through the function (the function should return a sequence rather than a single item)

mapPartitions(function): similar to map, but runs separately on each partition (block) of the RDD
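
A short sketch contrasting these transformations (assuming sc); note how flatMap flattens the per-element sequences and mapPartitions sees an iterator over a whole partition:

lines = sc.parallelize(["hello world", "spark rdd"])
lines.map(lambda s: s.split(" ")).collect()
# [['hello', 'world'], ['spark', 'rdd']]
lines.flatMap(lambda s: s.split(" ")).collect()
# ['hello', 'world', 'spark', 'rdd']
lines.filter(lambda s: "spark" in s).collect()
# ['spark rdd']
lines.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
# one element count per partition, e.g. [1, 1]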

Actions

reduce(function): aggregate the elements of the dataset using a function (which takes two arguments and returns one)

collect(): return all the elements of the dataset as an array at the driver program

count(): return the number of elements in the dataset

first(): return the first element of the dataset (like take(1), but returns the element itself)

take(n): return an array with the first n elements of the dataset

takeSample(withReplacement, num, [seed]): return an array with a random sample of num elements of the dataset

takeOrdered(n, [ordering]): return the first n elements of the RDD using either their natural order or a custom comparator
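
Running these actions on a small RDD (assuming sc); the comments show what each call returns:

nums = sc.parallelize([5, 3, 1, 4, 2])
nums.reduce(lambda a, b: a + b)   # 15
nums.collect()                    # [5, 3, 1, 4, 2]
nums.count()                      # 5
nums.first()                      # 5
nums.take(3)                      # [5, 3, 1]
nums.takeOrdered(3)               # [1, 2, 3] (natural order)
nums.takeSample(False, 2, 42)     # 2 random elements, fixed seed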

Spark key-value pairs

groupByKey([numPartitions]): return a dataset of (K, Iterable<V>) pairs

reduceByKey(function, [numPartitions]): return a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduce function

sortByKey([ascending], [numPartitions]): return a dataset of (K, V) pairs sorted by keys in ascending or descending order
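
A sketch of these pair-RDD operations (assuming sc; the output order of keys may vary):

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
pairs.groupByKey().mapValues(list).collect()
# [('a', [1, 2]), ('b', [1])]
pairs.reduceByKey(lambda a, b: a + b).collect()
# [('a', 3), ('b', 1)]
sc.parallelize([("b", 2), ("a", 1), ("c", 3)]).sortByKey().collect()
# [('a', 1), ('b', 2), ('c', 3)]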

2. DataFrames

Difference between DataFrame and Dataset

DataFrame: has a schema; rows are generic and untyped (like a table)
Dataset: statically, strongly typed

Why DataFrames (rather than RDDs)?

  1. A distributed collection of rows with the same schema
  2. Can be constructed from external data sources or from RDDs; essentially an RDD of Row objects
  3. Supports relational operators (where, groupBy) as well as Spark operations (see the example after this list)
  4. Faster and more efficient than RDDs
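
An illustrative sketch of the relational operators (assuming a SparkSession spark; the column names here are made up for the example):

df = spark.createDataFrame(
    [("alice", 20), ("bob", 30), ("alice", 40)],
    ["name", "amount"])
df.where(df.amount > 25).show()           # relational filter
df.groupBy("name").sum("amount").show()   # relational aggregation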

3. Machine Learning Pipelines

MLlib

  1. ML algorithms
  2. Featurization
  3. Pipelines (sketched after this list)
  4. Persistence: save/load algorithms, models, and pipelines
  5. Utilities: linear algebra, statistics, data handling
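
A minimal Pipeline sketch following the standard MLlib pattern (assuming spark; the stages, toy data, and save path are illustrative): featurization stages feed an ML algorithm, and the fitted model can be saved and reloaded (persistence).

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

train = spark.createDataFrame(
    [("spark is fast", 1.0), ("hadoop map reduce", 0.0)],
    ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")       # featurization
hashingTF = HashingTF(inputCol="words", outputCol="features")   # featurization
lr = LogisticRegression(maxIter=10)                             # ML algorithm

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(train)                           # fit all stages in order
model.write().overwrite().save("lr_pipeline_model")   # persistence: save
reloaded = PipelineModel.load("lr_pipeline_model")    # persistence: load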

Lab

RDDs and shared variables

# list -> RDD
data = [1, 2, 3, 4, 5]
rddData = sc.parallelize(data)   # distribute the local list as an RDD
rddData.collect()
# output: [1, 2, 3, 4, 5]

# the number of partitions can also be set manually
sc.parallelize(data, 16)

broadcastVar = sc.broadcast([1, 2, 3])
broadcastVar.value
# [1, 2, 3]
# a broadcast variable avoids shipping a copy of a large read-only variable with each task
# it can be cached in serialized form and deserialized before running each task

accum = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
accum.value
# 10

DataFrame

# from RDD to DataFrame (requires an active SparkSession)
rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])   # rows as tuples
df = rdd.toDF(["a", "b", "c"])                 # name the columns

# basic DataFrame operations
df.show()
df.printSchema()
df.drop("a")            # returns a new DataFrame without column "a"
df.describe().show()    # summary statistics

# DataFrame -> RDD
rdd2 = df.rdd    # the underlying RDD of Row objects
rdd2.collect()
