Lecture 2: Spark RDDs, DataFrames, ML Pipelines, and Parallelization
1. RDDs
A distributed memory abstraction enabling in-memory computations on large clusters in a fault-tolerant manner.
The primary data abstraction in Spark, enabling operations on collections of elements in parallel.
R (Resilient): recompute partitions missing due to node failure
D (Distributed): data distributed across multiple nodes in a cluster
D (Dataset): a collection of partitioned elements
traits:
- In-memory: the data inside an RDD is stored in memory as much (size) and as long (time) as possible
- Immutable (read-only): no change after creation; an RDD can only be transformed into new RDDs via transformations
- Lazily evaluated: the data in an RDD is not available/transformed until an action triggers the execution (see the sketch after this list)
- Parallel: the data is processed in parallel
- Partitioned: the data in an RDD is partitioned and then distributed across the nodes of a cluster
- Cacheable: can hold all the data in persistent "storage" such as memory or disk
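For example, a minimal sketch of lazy evaluation plus caching (assuming an existing SparkContext sc, as in the lab below):
rdd = sc.parallelize(range(10))
squared = rdd.map(lambda x: x * x)  # transformation: nothing is computed yet
squared.cache()                     # mark for in-memory caching (still lazy)
squared.count()                     # action: triggers the computation and fills the cache
squared.sum()                       # reuses the cached partitions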
Operations:
- Transformation: takes an RDD and returns a new RDD, but nothing gets evaluated/computed
- Action: all the data-processing queries are computed (evaluated) and the resulting value is returned
Transformations
Lazy evaluation: Spark just remembers the transformations applied to the base dataset (the results are not evaluated until an action runs)
map(function): return a new distributed dataset formed by passing each element of the source through the function
filter(function): return a new dataset formed by selecting those elements of the source on which the function returns true
flatMap(function): similar to map, but each element can produce 0 or more results through the function (the function should return a sequence rather than a single item)
mapPartitions(function): similar to map, but runs separately on each partition (block) of the RDD
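A short sketch of these transformations (assuming an existing SparkContext sc; nothing runs until the collect() action):
nums = sc.parallelize([1, 2, 3, 4])
nums.map(lambda x: x * 2).collect()            # [2, 4, 6, 8]
nums.filter(lambda x: x % 2 == 0).collect()    # [2, 4]
nums.flatMap(lambda x: [x, x * 10]).collect()  # [1, 10, 2, 20, 3, 30, 4, 40]
nums.mapPartitions(lambda part: [sum(part)]).collect()  # one partial sum per partition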
Actions
reduce(function): aggregate the elements of the dataset using a function (which takes two arguments and returns one)
collect(): return all the elements of the dataset as an array at the driver
count(): return the number of elements in the dataset
first(): return the first element of the dataset (similar to take(1))
take(n): return an array with the first n elements of the dataset
takeSample(withReplacement, num, [seed]): return an array with a random sample of num elements of the dataset
takeOrdered(n, [ordering]): return the first n elements of the RDD using either their natural order or a custom comparator
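A quick sketch of the actions above (assuming an existing SparkContext sc):
nums = sc.parallelize([5, 3, 1, 4, 2])
nums.reduce(lambda a, b: a + b)  # 15
nums.collect()                   # [5, 3, 1, 4, 2]
nums.count()                     # 5
nums.first()                     # 5
nums.take(3)                     # [5, 3, 1]
nums.takeSample(False, 2)        # e.g. [4, 1] (random sample)
nums.takeOrdered(3)              # [1, 2, 3] (natural order)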
Spark key-value pairs
groupByKey([numPartitions]): return a dataset of (K, Iterable<V>) pairs
reduceByKey(function, [numPartitions]): return a dataset of (K, V) pairs in which the values for each key are aggregated using the given reduction function
sortByKey([ascending], [numPartitions]): return a dataset of (K, V) pairs sorted by keys in ascending or descending order
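A small sketch of the pair-RDD operations (assuming an existing SparkContext sc; output order of groupByKey/reduceByKey may vary):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
pairs.groupByKey().mapValues(list).collect()     # [('a', [1, 3]), ('b', [2])]
pairs.reduceByKey(lambda a, b: a + b).collect()  # [('a', 4), ('b', 2)]
pairs.sortByKey().collect()                      # [('a', 1), ('a', 3), ('b', 2)]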
2. DataFrames
Difference between DataFrame and Dataset:
DataFrame: has a schema; generic and untyped (like a table)
Dataset: static, strongly typed
Why DataFrames (rather than RDDs)?
- A distributed collection with the same schema
- Can be constructed from external data sources or from RDDs, essentially becoming an RDD of Row objects
- Supports relational operators (where, groupBy) as well as Spark operations
- Faster and more efficient than RDDs
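A brief sketch of the relational operators (assuming an existing SparkSession spark; the column names are made up for illustration):
df = spark.createDataFrame([("alice", 3), ("bob", 5), ("alice", 7)], ["name", "score"])
df.where(df.score > 4).show()           # relational filter, like SQL WHERE
df.groupBy("name").sum("score").show()  # relational aggregation, like SQL GROUP BY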
3. Machine Learning Pipelines
MLlib
- ML algorithms
- Featurization
- Pipelines
- Persistence: save/load algorithms, models and pipelines
- Utilities: linear algebra, statistics, data handling
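A minimal Pipeline sketch (assuming an existing SparkSession spark; the stages, column names, and save path are illustrative, not from the lecture):
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
training = spark.createDataFrame(
    [("spark is great", 1.0), ("hello world", 0.0)], ["text", "label"])
tokenizer = Tokenizer(inputCol="text", outputCol="words")      # featurization
hashingTF = HashingTF(inputCol="words", outputCol="features")  # featurization
lr = LogisticRegression(maxIter=10)                            # ML algorithm
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])         # chain the stages
model = pipeline.fit(training)  # fit the whole pipeline as one estimator
model.save("pipeline_model")    # persistence: save the fitted model (hypothetical path)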
Lab
RDDs and shared variables
# list -> RDD
data = [1,2,3,4,5]
rddData = sc.parallelize(data)
rddData.collect()
# output [1,2,3,4,5]
sc.parallelize(data, 16)
# the number of partitions can be set manually
broadcastVar = sc.broadcast([1,2,3])
broadcastVar.value
#[1,2,3]
# a broadcast variable avoids creating a copy of a large variable for each task
# it is cached in serialized form and deserialized before running each task
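# a hypothetical usage sketch: tasks read the broadcast value instead of a closure copy
sc.parallelize([1,2,3]).map(lambda x: x + broadcastVar.value[0]).collect()
# [2, 3, 4]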
accum = sc.accumulator(0)
sc.parallelize([1,2,3,4]).foreach(lambda x: accum.add(x))
accum.value
# 10
DataFrames
# from RDD to DataFrame: rows must be tuples (or Rows), one value per column
rdd = sc.parallelize([(1, 2, 3), (4, 5, 6)])
df = rdd.toDF(["a", "b", "c"])
# basic DataFrame operations
df.show()
df.printSchema()
df.drop("a").show()   # drop returns a new DataFrame without the given column
df.describe().show()  # summary statistics for the numeric columns
# DataFrame -> RDD of Row objects
rdd2 = df.rdd
rdd2.collect()  # [Row(a=1, b=2, c=3), Row(a=4, b=5, c=6)]