spark-submit submission modes
Spark Core
RDD
- RDD definition and overview
RDD definition in the Spark source code:
abstract class RDD[T: ClassTag](
    @transient private var sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging
From the official docs:
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
RDD: Resilient Distributed Dataset
The scaladoc describes it as: "Represents an immutable, partitioned collection of elements that can be operated on in parallel."
immutable: cannot be modified once created
partitioned collection of elements: e.g. Array(1,2,3,4,5,6,7,8,9,10) with 3 partitions: (1,2,3) (4,5,6) (7,8,9,10)
can be operated on in parallel: partitions are computed in parallel
1) RDD is an abstract class
2) It is generic, so it can hold many element types: String, Person, User
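The 3-partition split of Array(1..10) shown above can be sketched with the same start/end slicing arithmetic Spark applies to parallelized collections (a minimal sketch; `slice` here is a hypothetical helper, not Spark's API):

```scala
// Split a sequence into numSlices contiguous partitions.
// For 10 elements and 3 slices this yields (1,2,3) (4,5,6) (7,8,9,10),
// matching the example above: later slices absorb the remainder.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] =
  (0 until numSlices).map { i =>
    val start = (i * seq.length) / numSlices
    val end   = ((i + 1) * seq.length) / numSlices
    seq.slice(start, end)
  }
```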
- RDD creation methods
There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
The two ways to create RDDs:
1. Parallelized Collections
2. External Datasets
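In spark-shell the two creation methods look roughly like this (a sketch assuming the shell's built-in `sc: SparkContext`; "data.txt" is a placeholder path, local or HDFS):

```scala
// 1. Parallelized Collections: distribute an existing driver-side collection,
// here explicitly requesting 3 partitions
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)

// 2. External Datasets: reference a dataset in external storage
// ("data.txt" is a placeholder; any Hadoop-supported path works)
val rdd2 = sc.textFile("data.txt")
```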
- The five main properties of an RDD
Internally, each RDD is characterized by five main properties:
- A list of partitions
  (a set of partitions/splits)
- A function for computing each split/partition
  (y = f(x), e.g. rdd.map(_ + 1) is applied to every partition)
- A list of dependencies on other RDDs
  (lineage, e.g. rdd1 ==> rdd2 ==> rdd3 ==> rdd4;
   a map over rdda with 5 partitions produces rddb with 5 partitions)
- Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
- Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
  (schedule the task onto the node holding the data: moving computation is cheaper than moving data)
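The five properties above can be mapped onto a toy structure (a minimal sketch only; `ToyRDD` and `ToyPartition` are made-up names, NOT Spark's real classes):

```scala
// A partition knows its index and (in this toy version) holds its data directly.
final case class ToyPartition[T](index: Int, data: Seq[T])

final class ToyRDD[T](
    val partitions: Seq[ToyPartition[T]],                      // 1) a list of partitions
    val dependencies: Seq[ToyRDD[_]] = Nil,                    // 3) dependencies on parent RDDs
    val partitioner: Option[Any => Int] = None,                // 4) optional Partitioner (key-value RDDs)
    val preferredLocations: Map[Int, Seq[String]] = Map.empty  // 5) optional preferred locations per partition
) {
  // 2) a function computed against each partition: map applies f
  //    partition by partition, and the result records this RDD as its parent
  def map[U](f: T => U): ToyRDD[U] =
    new ToyRDD(
      partitions.map(p => ToyPartition(p.index, p.data.map(f))),
      dependencies = Seq(this)
    )

  def collect(): Seq[T] = partitions.flatMap(_.data)
}
```

For example, `new ToyRDD(Seq(ToyPartition(0, Seq(1, 2, 3)))).map(_ + 1).collect()` gathers the per-partition results back into one sequence, and the mapped RDD's `dependencies` points at the original, mirroring the rdd1 ==> rdd2 lineage above.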
https://blog.csdn.net/struct_slllp_main/article/details/76209056
https://blog.csdn.net/budong282712018/article/details/51458974