What is a DAG
A DAG is a directed acyclic graph. When Spark runs an application (Application), it first builds a DAG in which each node is an operation. Spark operations fall into two categories: transformations (Transform) and actions (Action). During execution, a job (Job) is submitted only when an action is encountered, so one application may contain multiple jobs. After a job is submitted, Spark first uses the DAG to determine which stages (Stage) the job consists of, and then decomposes each stage into tasks (Task).
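As a concrete illustration (a minimal sketch; the input path and the filter keyword are assumptions), filter and map below are transformations that merely add nodes to the DAG, while count is an action that triggers the submission of a job:
import org.apache.spark.{SparkConf, SparkContext}

object DagDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("DagDemo").setMaster("local")
    val sc = new SparkContext(conf)
    // Transformations: recorded lazily as DAG nodes, nothing executes yet
    val lines   = sc.textFile("file:///D:/words")
    val matched = lines.filter(_.contains("spark"))
    val lengths = matched.map(_.length)
    // Action: triggers the submission of a Job computed from the DAG
    println(lengths.count())
    sc.stop()
  }
}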
SparkContext, SparkConf, and SparkEnv
While SparkContext is being instantiated, it in turn instantiates SparkEnv; to construct a SparkEnv, Spark brings up quite a few subsystems, as the arguments of SparkEnv's constructor reveal:
new SparkEnv(
executorId,
actorSystem,
serializer,
closureSerializer,
cacheManager,
mapOutputTracker,
shuffleManager,
broadcastManager,
blockTransferService,
blockManager,
securityManager,
httpFileServer,
sparkFilesDir,
metricsSystem,
shuffleMemoryManager,
conf)
Each of the arguments above covers one facet of Spark; their types are declared as follows:
class SparkEnv (
val executorId: String,
val actorSystem: ActorSystem,
val serializer: Serializer,
val closureSerializer: Serializer,
val cacheManager: CacheManager,
val mapOutputTracker: MapOutputTracker,
val shuffleManager: ShuffleManager,
val broadcastManager: BroadcastManager,
val blockTransferService: BlockTransferService,
val blockManager: BlockManager,
val securityManager: SecurityManager,
val httpFileServer: HttpFileServer,
val sparkFilesDir: String,
val metricsSystem: MetricsSystem,
val shuffleMemoryManager: ShuffleMemoryManager,
val conf: SparkConf) extends Logging {
// ... method body ...
}
Spark's ScalaDoc description of SparkEnv is:
/**
* :: DeveloperApi ::
* Holds all the runtime environment objects for a running Spark instance (either master or worker),
* including the serializer, Akka actor system, block manager, map output tracker, etc. Currently
* Spark code finds the SparkEnv through a global variable, so all the threads can access the same
* SparkEnv. It can be accessed by SparkEnv.get (e.g. after creating a SparkContext).
*
* NOTE: This is not intended for external use. This is exposed for Shark and may be made private
* in a future release.
*/
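The SparkEnv.get accessor mentioned in the ScalaDoc can be exercised directly. A minimal sketch (Spark 1.x API assumed; the app name is arbitrary):
import org.apache.spark.{SparkConf, SparkContext, SparkEnv}

object SparkEnvDemo {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkEnvDemo").setMaster("local")
    val sc = new SparkContext(conf)
    // Once the SparkContext is up, the driver-side SparkEnv is reachable
    // through the global accessor described above
    val env = SparkEnv.get
    println(env.executorId)                 // the driver's executor id
    println(env.conf.get("spark.app.name")) // the SparkConf held by SparkEnv
    sc.stop()
  }
}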
3. If val rdd = sc.textFile("file:///D:/words") is called and words is a directory containing N text files, the final output will consist of N files, part-00000 through part-0000X (X = N-1). Does this mean Spark partitioned the N input files into N partitions, with one Task per partition? In theory, yes; in practice, the partition count also depends on how many blocks the files are split into.
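One way to verify this (a sketch that assumes a SparkContext sc is in scope and the path exists) is to inspect the partitions of the RDD; note that textFile also accepts a minPartitions hint, although the file/block layout can still produce more partitions:
val rdd = sc.textFile("file:///D:/words")
println(rdd.partitions.length)   // actual partition count, driven by file/block layout

val hinted = sc.textFile("file:///D:/words", minPartitions = 8)
println(hinted.partitions.length)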
package spark.examples.rdd

import org.apache.spark.{SparkConf, SparkContext}

object SparkSaveMultiFiles {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkSaveMultiFiles").setMaster("local")
    val sc = new SparkContext(conf)
    // Read every text file under the directory into a single RDD
    val rdd = sc.textFile("file:///D:/wordcount")
    // Keep only the lines that contain the string "WordCount"
    val result = rdd.filter(_.contains("WordCount"))
    result.foreach(println)
    sc.stop()
  }
}
In the code above, the d:/wordcount directory holds multiple text files.
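To actually produce the part-00000 through part-0000X files discussed earlier, the result can be written back out; saveAsTextFile emits one part file per partition. A minimal sketch (the output path is an assumption and must not already exist):
// Writes one file per partition: part-00000 ... part-0000X
result.saveAsTextFile("file:///D:/wordcount-out")
println(result.partitions.length)   // equals the number of part files written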