Spark is a new-generation distributed in-memory computing framework and a top-level Apache open-source project. Compared with the Hadoop MapReduce framework, Spark keeps intermediate results in memory, making it 10 to 100 times faster; it also provides a richer set of operators and uses Resilient Distributed Datasets (RDDs) for iterative computation, which suits data mining and machine-learning workloads well and greatly improves development efficiency. Compared with deploying on physical machines, running a Spark cluster on Kubernetes has the following advantages:
- Rapid deployment: to install a 1000-node Spark cluster on Kubernetes, simply set the worker replica count to replicas=1000 and deploy with one command.
- Rapid upgrades: upgrading the Spark version only requires swapping in a new Spark image.
- Elastic scaling: to scale out or in, just adjust the worker replicas value.
- High consistency: every Kubernetes node runs the same Spark environment and the same version.
- High availability: if a node or pod running Spark dies, Kubernetes automatically moves the work to other nodes or creates new pods.
- Strong isolation: with resource quotas and similar mechanisms, Spark can share a cluster with web-service applications, raising machine utilization and thus lowering server costs.
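The "elastic scaling" point above boils down to editing a single field: Kubernetes reconciles the actual worker count to whatever spec.replicas says. A minimal sketch of that idea in Python, using a trimmed dict in place of the real manifest (the `scale_workers` helper is illustrative, not part of any Kubernetes API):

```python
# Scaling the Spark worker ReplicationController is just a change to
# spec.replicas; Kubernetes then adds or removes worker pods to match.
# This dict is a trimmed stand-in for the real spark-cluster.yaml entry.
worker_rc = {
    "kind": "ReplicationController",
    "metadata": {"name": "spark-worker-controller"},
    "spec": {"replicas": 4, "selector": {"component": "spark-worker"}},
}

def scale_workers(manifest, replicas):
    """Return a copy of the manifest with spec.replicas updated."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return {**manifest, "spec": {**manifest["spec"], "replicas": replicas}}

# Scale the 4-worker cluster up to 1000 workers.
big = scale_workers(worker_rc, 1000)
print(big["spec"]["replicas"])  # 1000; the original dict is untouched
```

In practice the same change is one command, e.g. `kubectl scale rc spark-worker-controller --replicas=1000 -n spark-cluster`.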
The configuration file for creating the Spark cluster is as follows:
spark-cluster.yaml
# ================================= Spark Master =================================
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-master-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: spark-master
  template:
    metadata:
      labels:
        component: spark-master
    spec:
      containers:
        - name: spark-master
          image: index.docker.io/caicloud/spark:1.5.2
          env:
            - name: TZ
              value: Asia/Shanghai
          command: ["/start-master"]
          ports:
            - containerPort: 7077
            - containerPort: 8080
          resources:
            requests:
              cpu: 100m
# ================================= Master-Service =================================
---
kind: Service
apiVersion: v1
metadata:
  name: spark-master
  namespace: spark-cluster
spec:
  type: NodePort
  ports:
    - port: 7077
      targetPort: 7077
      name: spark
    - port: 8080
      targetPort: 8080
      # nodePort must fall within the cluster's service-node-port-range
      # (30000-32767 by default); the original value 8080 would be rejected.
      nodePort: 30080
      name: http
  selector:
    component: spark-master
# ================================= Spark Workers =================================
---
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-worker-controller
  namespace: spark-cluster
spec:
  replicas: 4
  selector:
    component: spark-worker
  template:
    metadata:
      labels:
        component: spark-worker
    spec:
      containers:
        - name: spark-worker
          image: index.docker.io/caicloud/spark:1.5.2
          env:
            - name: TZ
              value: Asia/Shanghai
          command: ["/start-worker"]
          ports:
            - containerPort: 8081
          resources:
            requests:
              cpu: 100m
# ================================= Spark UI Proxy =================================
---
kind: ReplicationController
apiVersion: v1
metadata:
  name: spark-ui-proxy-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: spark-ui-proxy
  template:
    metadata:
      labels:
        component: spark-ui-proxy
    spec:
      containers:
        - name: spark-ui-proxy
          image: elsonrodriguez/spark-ui-proxy:1.0
          env:
            - name: TZ
              value: Asia/Shanghai
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 100m
          args:
            - spark-master:8080
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 15
            timeoutSeconds: 60
# =============================== Spark UI Proxy Service ===============================
---
kind: Service
apiVersion: v1
metadata:
  name: spark-ui-proxy-service
  namespace: spark-cluster
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 80
      # nodePort must fall within 30000-32767 (the default range);
      # the original value 8081 would be rejected.
      nodePort: 30081
  selector:
    component: spark-ui-proxy
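The spark-ui-proxy container above declares an httpGet livenessProbe: the kubelet periodically issues an HTTP GET to path / on port 80 and restarts the container if the request fails or times out. A self-contained sketch of that check in Python, probing a throwaway local server (the `probe_http` helper is our own illustration, not Kubernetes code):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def probe_http(url, timeout):
    """Return True if an HTTP GET succeeds with a 2xx/3xx status,
    False on any error or timeout: roughly what an httpGet probe checks."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # covers URLError, HTTPError, timeouts, refusals
        return False

class _Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")
    def log_message(self, *args):  # silence per-request logging
        pass

# Serve on an ephemeral port in a background thread, then probe it,
# the way the kubelet probes the spark-ui-proxy container's port 80.
server = HTTPServer(("127.0.0.1", 0), _Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"
print(probe_http(url, timeout=1.0))  # True while the server is healthy
```

If the probe returned False repeatedly, Kubernetes would kill and restart the container; initialDelaySeconds gives the proxy 15 seconds to start before probing begins.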
Creating Zeppelin (optional)
zeppelin.yaml
kind: ReplicationController
apiVersion: v1
metadata:
  name: zeppelin-controller
  namespace: spark-cluster
spec:
  replicas: 1
  selector:
    component: zeppelin
  template:
    metadata:
      labels:
        component: zeppelin
    spec:
      containers:
        - name: zeppelin
          image: apache/zeppelin:0.9.0
          ports:
            - containerPort: 8080
          env:
            - name: TZ
              value: Asia/Shanghai
          resources:
            requests:
              cpu: 100m
zeppelin-svc.yaml
kind: Service
apiVersion: v1
metadata:
  name: zeppelin
  namespace: spark-cluster
spec:
  type: NodePort
  ports:
    - port: 8079
      targetPort: 8080
      # nodePort must fall within 30000-32767 (the default range);
      # the original value 8079 would be rejected.
      nodePort: 30079
  selector:
    component: zeppelin
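Every Service above finds its backend pods purely by label selector: spark-master matches pods labeled component=spark-master, zeppelin matches component=zeppelin, and so on. A small sketch of that equality-based matching rule (the `matches_selector` helper is our own name for it):

```python
def matches_selector(selector, pod_labels):
    """True if every key/value pair in the selector appears in the pod's
    labels, which is how an equality-based Kubernetes selector picks pods."""
    return all(pod_labels.get(k) == v for k, v in selector.items())

# Labels taken from the manifests above.
master_pod = {"component": "spark-master"}
worker_pod = {"component": "spark-worker"}
master_svc_selector = {"component": "spark-master"}

print(matches_selector(master_svc_selector, master_pod))  # True
print(matches_selector(master_svc_selector, worker_pod))  # False
```

This is why the pod template labels and the Service selectors must stay in sync: a typo in either side silently leaves the Service with no endpoints.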
Reference: https://github.com/kubernetes/examples/tree/master/staging/spark