Spark学习笔记之初识

1 spark官网 http://spark.apache.org/
2 学习版本为1.5.0

Spark架构,官方文档解读

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
跟其他分布式系统一样,每个节点的spark 应用程序都是一系列独立的进程,这些进程由主节点的SparkContext对象管理,这个对象叫做驱动程序。

Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications.
集群管理程序可能有很多种,Mesos or YARN等,主要是为应用程序分配资源,SparkContext要和集群管理程序进行连接才能在多集群上驱动应用程序。

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
完成连接之后,SparkContext向各个节点发送执行代码,最后分配执行任务。

spark 运行架构原理图

注意点:
1 Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
不同的SparkContext之间不能共享数据

2 Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
spark对YARN等集群管理器有很好的支持

3The driver program must listen for and accept incoming connections from its executors throughout its lifetime (e.g., see spark.driver.port and spark.fileserver.port in the network config section). As such, the driver program must be network addressable from the worker nodes.
驱动程序一直监视节点知道任务完成,因此这个期间要保证主节点和其他业务节点的网络通信

4Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.
最好做本地集群,驱动服务器和执行节点服务器最好在一个物理位置上就很靠近的局域网之内

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值