Reading the SparkContext source code
SparkContext is Spark's runtime context and the entry point of a Spark program. In Spark 2.0 it was folded into SparkSession, so you can reach it directly via sparkSession.sparkContext.
A Spark program runs on the JVM, and a single JVM can only have one active SparkContext, so add a sparkContext.stop() at the end of your code to shut it down.
The rest of this post walks through the SparkContext APIs in the source.
Two parameters must always be set: the application name (appName) and the master, i.e. how the job runs.
/**
* @param master Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
* @param appName A name for your application, to display on the cluster web UI
* @param conf a [[org.apache.spark.SparkConf]] object specifying other Spark parameters
*/
def this(master: String, appName: String, conf: SparkConf) =
this(conf.setMaster(master).setAppName(appName))
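For example, a minimal sketch of building a context this way, using the Scala SparkContext (which has the same auxiliary constructor); the master URL and app name below are placeholder values:
import org.apache.spark.{SparkConf, SparkContext}

// "local[2]" and "demo-app" are just illustration values
val sc = new SparkContext("local[2]", "demo-app", new SparkConf())
// ... build and run RDD jobs here ...
sc.stop()   // remember: only one active SparkContext per JVM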
Besides the two parameters above, a few more can be passed in, but they are optional:
/**
* @param master Cluster URL to connect to (e.g. mesos://host:port, spark://host:port, local[4]).
* @param appName A name for your application, to display on the cluster web UI
* @param sparkHome The SPARK_HOME directory on the slave nodes
* @param jarFile JAR file to send to the cluster. This can be a path on the local file system
* or an HDFS, HTTP, HTTPS, or FTP URL.
*/
def this(master: String, appName: String, sparkHome: String, jarFile: String) =
this(new SparkContext(master, appName, sparkHome, Seq(jarFile)))
If you don't specify a number of partitions, Spark falls back to a default level of parallelism:
/** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD).
* */
def defaultParallelism: java.lang.Integer = sc.defaultParallelism
Similarly, the default minimum number of partitions:
/** Default min number of partitions for Hadoop RDDs when not given by user
* */
def defaultMinPartitions: java.lang.Integer = sc.defaultMinPartitions
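A quick way to see what these defaults resolve to on your setup (a sketch against the Scala SparkContext, which the Java wrapper above delegates to):
// With master = local[2], defaultParallelism is typically 2;
// defaultMinPartitions is math.min(defaultParallelism, 2).
println(s"defaultParallelism   = ${sc.defaultParallelism}")
println(s"defaultMinPartitions = ${sc.defaultMinPartitions}")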
parallelize
/** Distribute a local Scala collection to form an RDD.
* In other words, you can build an RDD straight from data you already have in memory.
* */
def parallelize[T](list: java.util.List[T], numSlices: Int): JavaRDD[T] = {
implicit val ctag: ClassTag[T] = fakeClassTag
sc.parallelize(list.asScala, numSlices)
}
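A small usage sketch (Scala API; the numbers are arbitrary sample data):
val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5), numSlices = 3)  // 3 partitions
println(rdd.getNumPartitions)   // 3
println(rdd.sum())              // 15.0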
emptyRDD
/** Get an RDD that has no partitions or elements.
* */
def emptyRDD[T]: JavaRDD[T] = {
implicit val ctag: ClassTag[T] = fakeClassTag
JavaRDD.fromRDD(new EmptyRDD[T](sc))
}
textFile
/**
* Read a text file from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI, and return it as an RDD of Strings.
*
*/
def textFile(path: String, minPartitions: Int): JavaRDD[String] =
sc.textFile(path, minPartitions)
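For example (the path is hypothetical; the second argument is only a hint for the minimum number of partitions):
val lines = sc.textFile("hdfs://namenode:8020/logs/app.log", minPartitions = 4)
val errors = lines.filter(_.contains("ERROR"))
println(errors.count())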
wholeTextFiles
/**
* Read a directory of text files from HDFS, a local file system (available on all nodes), or any
* Hadoop-supported file system URI. Each file is read as a single record and returned in a
* key-value pair, where the key is the path of each file, the value is the content of each file.
*
* <p> For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
* Do
* {{{
* JavaPairRDD<String, String> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
* }}}
*
* <p> then `rdd` contains
* {{{
* (a-hdfs-path/part-00000, its content)
* (a-hdfs-path/part-00001, its content)
* ...
* (a-hdfs-path/part-nnnnn, its content)
* }}}
*
* @note Small files are preferred, large file is also allowable, but may cause bad performance.
*
* @param minPartitions A suggestion value of the minimal splitting number for input data.
*/
def wholeTextFiles(path: String, minPartitions: Int): JavaPairRDD[String, String] =
new JavaPairRDD(sc.wholeTextFiles(path, minPartitions))
binaryFiles
/**
* Read a directory of binary files from HDFS, a local file system (available on all nodes),
* or any Hadoop-supported file system URI as a byte array. Each file is read as a single
* record and returned in a key-value pair, where the key is the path of each file,
* the value is the content of each file.
*
* For example, if you have the following files:
* {{{
* hdfs://a-hdfs-path/part-00000
* hdfs://a-hdfs-path/part-00001
* ...
* hdfs://a-hdfs-path/part-nnnnn
* }}}
*
* Do
* {{{
* JavaPairRDD<String, byte[]> rdd = sparkContext.dataStreamFiles("hdfs://a-hdfs-path")
* }}}
*
* then `rdd` contains
* {{{
* (a-hdfs-path/part-00000, its content)
* (a-hdfs-path/part-00001, its content)
* ...
* (a-hdfs-path/part-nnnnn, its content)
* }}}
*
* @note Small files are preferred; very large files may cause bad performance.
* @param minPartitions A suggestion value of the minimal splitting number for input data.
*/
def binaryFiles(path: String, minPartitions: Int): JavaPairRDD[String, PortableDataStream] =
new JavaPairRDD(sc.binaryFiles(path, minPartitions))
sequenceFile
/**
* Get an RDD for a Hadoop SequenceFile with given key and value types.
*
* @note Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
*
*/
def sequenceFile[K, V](path: String,
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int
): JavaPairRDD[K, V] = {
implicit val ctagK: ClassTag[K] = ClassTag(keyClass)
implicit val ctagV: ClassTag[V] = ClassTag(valueClass)
new JavaPairRDD(sc.sequenceFile(path, keyClass, valueClass, minPartitions))
}
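A sketch of reading a SequenceFile of (Text, IntWritable) pairs; per the @note above, the Writables are copied into plain Scala values with a map before caching (path and key/value types are just an assumed example):
import org.apache.hadoop.io.{IntWritable, Text}

val pairs = sc.sequenceFile("hdfs:///data/counts.seq", classOf[Text], classOf[IntWritable], 4)
// copy out of the reused Writable objects before caching
val cached = pairs.map { case (k, v) => (k.toString, v.get) }.cache()
println(cached.count())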
objectFile
/**
* Load an RDD saved as a SequenceFile containing serialized objects, with NullWritable keys and
* BytesWritable values that contain a serialized partition. This is still an experimental storage
* format and may not be supported exactly as is in future Spark releases. It will also be pretty
* slow if you use the default serializer (Java serialization), though the nice thing about it is
* that there's very little effort required to save arbitrary objects.
*
*/
def objectFile[T](path: String, minPartitions: Int): JavaRDD[T] = {
implicit val ctag: ClassTag[T] = fakeClassTag
sc.objectFile(path, minPartitions)(ctag)
}
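A round-trip sketch: saveAsObjectFile writes an RDD in this format and objectFile reads it back (the output path is made up):
val nums = sc.parallelize(1 to 100)
nums.saveAsObjectFile("/tmp/nums-obj")                    // Java-serialized SequenceFile
val reloaded = sc.objectFile[Int]("/tmp/nums-obj", minPartitions = 2)
println(reloaded.count())                                 // 100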
hadoopRDD
/**
* Get an RDD for a Hadoop-readable dataset from a Hadoop JobConf giving its InputFormat and any
* other necessary info (e.g. file name for a filesystem-based dataset, table name for HyperTable,
* etc).
*
* @param conf JobConf for setting up the dataset. Note: This will be put into a Broadcast.
* Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
* sure you won't modify the conf. A safe approach is always creating a new conf for
* a new RDD.
*
* @param inputFormatClass Class of the InputFormat
* @param keyClass Class of the keys
* @param valueClass Class of the values
* @param minPartitions Minimum number of Hadoop Splits to generate.
*
* @note Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
*
*/
def hadoopRDD[K, V, F <: InputFormat[K, V]](
conf: JobConf,
inputFormatClass: Class[F],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int
): JavaPairRDD[K, V] = {
implicit val ctagK: ClassTag[K] = ClassTag(keyClass)
implicit val ctagV: ClassTag[V] = ClassTag(valueClass)
val rdd = sc.hadoopRDD(conf, inputFormatClass, keyClass, valueClass, minPartitions)
new JavaHadoopRDD(rdd.asInstanceOf[HadoopRDD[K, V]])
}
hadoopFile
/**
* Get an RDD for a Hadoop file with an arbitrary InputFormat.
*
* @note Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
*
*/
def hadoopFile[K, V, F <: InputFormat[K, V]](
path: String,
inputFormatClass: Class[F],
keyClass: Class[K],
valueClass: Class[V],
minPartitions: Int
): JavaPairRDD[K, V] = {
implicit val ctagK: ClassTag[K] = ClassTag(keyClass)
implicit val ctagV: ClassTag[V] = ClassTag(valueClass)
val rdd = sc.hadoopFile(path, inputFormatClass, keyClass, valueClass, minPartitions)
new JavaHadoopRDD(rdd.asInstanceOf[HadoopRDD[K, V]])
}
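For instance, reading plain text through the old-API TextInputFormat, which yields (byte offset, line) pairs (the path is hypothetical):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val kv = sc.hadoopFile("hdfs:///data/input.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 4)
// copy the reused Writables before doing anything that caches
val lines = kv.map { case (_, text) => text.toString }
println(lines.count())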
newAPIHadoopFile
/**
* Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
* and extra configuration options to pass to the input format.
*
* @note Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
*
*/
def newAPIHadoopFile[K, V, F <: NewInputFormat[K, V]](
path: String,
fClass: Class[F],
kClass: Class[K],
vClass: Class[V],
conf: Configuration): JavaPairRDD[K, V] = {
implicit val ctagK: ClassTag[K] = ClassTag(kClass)
implicit val ctagV: ClassTag[V] = ClassTag(vClass)
val rdd = sc.newAPIHadoopFile(path, fClass, kClass, vClass, conf)
new JavaNewHadoopRDD(rdd.asInstanceOf[NewHadoopRDD[K, V]])
}
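The same idea with the new mapreduce API; note the TextInputFormat now comes from a different package, and the Configuration is usually sc.hadoopConfiguration (the path is again hypothetical):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val kv = sc.newAPIHadoopFile("hdfs:///data/input.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], sc.hadoopConfiguration)
println(kv.map(_._2.toString).count())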
newAPIHadoopRDD
/**
* Get an RDD for a given Hadoop file with an arbitrary new API InputFormat
* and extra configuration options to pass to the input format.
*
* @param conf Configuration for setting up the dataset. Note: This will be put into a Broadcast.
* Therefore if you plan to reuse this conf to create multiple RDDs, you need to make
* sure you won't modify the conf. A safe approach is always creating a new conf for
* a new RDD.
*
* @param fClass Class of the InputFormat
* @param kClass Class of the keys
* @param vClass Class of the values
*
* @note Because Hadoop's RecordReader class re-uses the same Writable object for each
* record, directly caching the returned RDD will create many references to the same object.
* If you plan to directly cache Hadoop writable objects, you should first copy them using
* a `map` function.
*
*/
def newAPIHadoopRDD[K, V, F <: NewInputFormat[K, V]](
conf: Configuration,
fClass: Class[F],
kClass: Class[K],
vClass: Class[V]): JavaPairRDD[K, V] = {
implicit val ctagK: ClassTag[K] = ClassTag(kClass)
implicit val ctagV: ClassTag[V] = ClassTag(vClass)
val rdd = sc.newAPIHadoopRDD(conf, fClass, kClass, vClass)
new JavaNewHadoopRDD(rdd.asInstanceOf[NewHadoopRDD[K, V]])
}
union
/** Build the union of two or more RDDs.
* */
override def union[T](first: JavaRDD[T], rest: java.util.List[JavaRDD[T]]): JavaRDD[T] = {
val rdds: Seq[RDD[T]] = (Seq(first) ++ rest.asScala).map(_.rdd)
implicit val ctag: ClassTag[T] = first.classTag
sc.union(rdds)
}
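Note that this builds a union, not an intersection; duplicates are kept. A small sketch:
val a = sc.parallelize(Seq(1, 2, 3))
val b = sc.parallelize(Seq(3, 4, 5))
val u = sc.union(Seq(a, b))
println(u.collect().mkString(","))   // 1,2,3,3,4,5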
intAccumulator
/**
* Create an [[org.apache.spark.Accumulator]] integer variable, which tasks can "add" values
* to using the `add` method. Only the master can access the accumulator's `value`.
*
*/
@deprecated("use sc().longAccumulator()", "2.0.0")
def intAccumulator(initialValue: Int): Accumulator[java.lang.Integer] =
sc.accumulator(initialValue)(IntAccumulatorParam).asInstanceOf[Accumulator[java.lang.Integer]]
doubleAccumulator
/**
* Create an [[org.apache.spark.Accumulator]] double variable, which tasks can "add" values
* to using the `add` method. Only the master can access the accumulator's `value`.
*
* This version supports naming the accumulator for display in Spark's web UI.
*/
@deprecated("use sc().doubleAccumulator(String)", "2.0.0")
def doubleAccumulator(initialValue: Double, name: String): Accumulator[java.lang.Double] =
sc.accumulator(initialValue, name)(DoubleAccumulatorParam)
.asInstanceOf[Accumulator[java.lang.Double]]
accumulator
/**
* Create an [[org.apache.spark.Accumulator]] integer variable, which tasks can "add" values
* to using the `add` method. Only the master can access the accumulator's `value`.
*
* This version supports naming the accumulator for display in Spark's web UI.
*/
@deprecated("use sc().longAccumulator(String)", "2.0.0")
def accumulator(initialValue: Int, name: String): Accumulator[java.lang.Integer] =
intAccumulator(initialValue, name)
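These Accumulator variants are deprecated since 2.0.0; the deprecation messages point to the newer accumulator helpers instead. A sketch of the recommended form (the sample data and accumulator name are made up):
val errorCount = sc.longAccumulator("errorCount")   // shows up under this name in the web UI
sc.parallelize(Seq("ok", "ERROR", "ok", "ERROR")).foreach { line =>
  if (line.contains("ERROR")) errorCount.add(1)     // tasks can only add
}
println(errorCount.value)                           // only the driver reads the value: 2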
broadcast
/**
* Broadcast a read-only variable to the cluster, returning a
* [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
* The variable will be sent to each cluster only once.
*
*/
def broadcast[T](value: T): Broadcast[T] = sc.broadcast(value)(fakeClassTag)
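A sketch of broadcasting a small lookup table and using it inside a task (the map contents are made-up sample data):
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val resolved = sc.parallelize(Seq("a", "b", "a"))
  .map(key => lookup.value.getOrElse(key, 0))
println(resolved.collect().mkString(","))   // 1,2,1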
stop
/** Shut down the SparkContext.
* */
def stop() {
sc.stop()
}
getSparkHome
/**
* Get Spark's home location from either a value set through the constructor,
* or the spark.home Java property, or the SPARK_HOME environment variable
* (in that order of preference). If neither of these is set, return None.
*
*/
def getSparkHome(): Optional[String] = JavaUtils.optionToOptional(sc.getSparkHome())
addFile
/**
* Add a file to be downloaded with this Spark job on every node.
* The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
* filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
* use `SparkFiles.get(fileName)` to find its download location.
*
*/
def addFile(path: String) {
sc.addFile(path)
}
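A sketch of shipping a side file to every node and reading it back inside a task (the file path is hypothetical):
import org.apache.spark.SparkFiles

sc.addFile("/local/path/stopwords.txt")                 // hypothetical local file
val firstLines = sc.parallelize(1 to 2).map { _ =>
  val localPath = SparkFiles.get("stopwords.txt")       // where the executor downloaded it
  scala.io.Source.fromFile(localPath).getLines().next()
}
println(firstLines.collect().mkString(";"))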
setCheckpointDir
/**
* Set the directory under which RDDs are going to be checkpointed. The directory must
* be a HDFS path if running on a cluster.
*/
def setCheckpointDir(dir: String) {
sc.setCheckpointDir(dir)
}
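Checkpointing only happens for RDDs that call checkpoint() before an action runs; a sketch (the HDFS path is a placeholder):
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")    // must be an HDFS path on a cluster
val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()            // marks the RDD; the data is written when an action runs
rdd.count()                 // triggers the job and materializes the checkpoint
println(rdd.isCheckpointed) // true after the action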
getConf
/**
* Return a copy of this JavaSparkContext's configuration. The configuration ''cannot'' be
* changed at runtime.
*/
def getConf: SparkConf = sc.getConf
setCallSite
/**
* Pass-through to SparkContext.setCallSite. For API support only.
*/
def setCallSite(site: String) {
sc.setCallSite(site)
}
clearCallSite
/**
* Pass-through to SparkContext.clearCallSite. For API support only.
*/
def clearCallSite() {
sc.clearCallSite()
}
setLocalProperty
/**
* Set a local property that affects jobs submitted from this thread, and all child
* threads, such as the Spark fair scheduler pool.
*
*
* These properties are inherited by child threads spawned from this thread. This
* may have unexpected consequences when working with thread pools. The standard java
* implementation of thread pools have worker threads spawn other worker threads.
* As a result, local properties may propagate unpredictably.
*
*/
def setLocalProperty(key: String, value: String): Unit = sc.setLocalProperty(key, value)
getLocalProperty
/**
* Get a local property set in this thread, or null if it is missing. See
* `org.apache.spark.api.java.JavaSparkContext.setLocalProperty`.
*/
def getLocalProperty(key: String): String = sc.getLocalProperty(key)
setLogLevel
/** Control our logLevel. This overrides any user-defined log settings.
* @param logLevel The desired log level as a string.
* Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
* Pick whichever of these levels your job needs.
*
*/
def setLogLevel(logLevel: String) {
sc.setLogLevel(logLevel)
}
setJobGroup
/**
* Assigns a group ID to all the jobs started by this thread until the group ID is set to a
* different value or cleared.
*
* Often, a unit of execution in an application consists of multiple Spark actions or jobs.
* Application programmers can use this method to group all those jobs together and give a
* group description. Once set, the Spark web UI will associate such jobs with this group.
*
* The application can also use `org.apache.spark.api.java.JavaSparkContext.cancelJobGroup`
* to cancel all running jobs in this group. For example,
* {{{
* // In the main thread:
* sc.setJobGroup("some_job_to_cancel", "some job description");
* rdd.map(...).count();
*
* // In a separate thread:
* sc.cancelJobGroup("some_job_to_cancel");
* }}}
*
* If interruptOnCancel is set to true for the job group, then job cancellation will result
* in Thread.interrupt() being called on the job's executor threads. This is useful to help ensure
* that the tasks are actually stopped in a timely manner, but is off by default due to HDFS-1208,
* where HDFS may respond to Thread.interrupt() by marking nodes as dead.
*/
def setJobGroup(groupId: String, description: String, interruptOnCancel: Boolean): Unit =
sc.setJobGroup(groupId, description, interruptOnCancel)
clearJobGroup
/** Clear the current thread's job group ID and its description.
* */
def clearJobGroup(): Unit = sc.clearJobGroup()
cancelJobGroup
/**
* Cancel active jobs for the specified group. See
* `org.apache.spark.api.java.JavaSparkContext.setJobGroup` for more information.
*
*/
def cancelJobGroup(groupId: String): Unit = sc.cancelJobGroup(groupId)
cancelAllJobs
/** Cancel all jobs that have been scheduled or are running.
* */
def cancelAllJobs(): Unit = sc.cancelAllJobs()
getPersistentRDDs
/**
* Returns a Java map of JavaRDDs that have marked themselves as persistent via cache() call.
*
* @note This does not necessarily mean the caching or computation was successful.
*/
def getPersistentRDDs: JMap[java.lang.Integer, JavaRDD[_]] = {
sc.getPersistentRDDs.mapValues(s => JavaRDD.fromRDD(s))
.asJava.asInstanceOf[JMap[java.lang.Integer, JavaRDD[_]]]
}
That wraps up the SparkContext APIs.
Improve a little every day; what matters is sticking with it.