spark partition

Starting Spark in HA mode

# Start Spark in HA mode: when the leader master goes down, the standby master becomes alive
./bin/spark-shell --master spark://xupan001:7070,xupan002:7070

 

Specifying the number of partitions

# With two partitions specified, the job runs two tasks and writes two part files to HDFS
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
rdd1.partitions.length // 2
# saveAsTextFile writes one file per partition
rdd1.saveAsTextFile("hdfs://xupan001:8020/user/root/spark/output/partition")

Permission  Owner  Group       Size   Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B    1            128 MB      _SUCCESS
-rw-r--r--  root   supergroup  8 B    1            128 MB      part-00000
-rw-r--r--  root   supergroup  10 B   1            128 MB      part-00001
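To see why the two part files are 8 B and 10 B, you can inspect each partition's contents with glom(). This is only a sketch of a follow-up check in the same spark-shell session; the 4/5 split is simply how parallelize divided the 9 elements here, which the file sizes above corroborate.

val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9), 2)
// glom() turns each partition into an array so the split is visible on the driver
rdd1.glom().collect()
// Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8, 9))
// 4 lines of "N\n" = 8 B in part-00000, 5 lines = 10 B in part-00001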

 

Relationship to cores

If no partition count is specified, the number of output files depends on the cores available to the application (the total core count).
val rdd1 = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8, 9))
rdd1.partitions.length //6
rdd1.saveAsTextFile("hdfs://xupan001:8020/user/root/spark/output/partition2")

Permission  Owner  Group       Size  Replication  Block Size  Name
-rw-r--r--  root   supergroup  0 B   1            128 MB      _SUCCESS
-rw-r--r--  root   supergroup  2 B   1            128 MB      part-00000
-rw-r--r--  root   supergroup  4 B   1            128 MB      part-00001
-rw-r--r--  root   supergroup  2 B   1            128 MB      part-00002
-rw-r--r--  root   supergroup  4 B   1            128 MB      part-00003
-rw-r--r--  root   supergroup  2 B   1            128 MB      part-00004
-rw-r--r--  root   supergroup  4 B   1            128 MB      part-00005
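A quick way to confirm this is to check sc.defaultParallelism, which parallelize falls back to when no partition count is given. This is a sketch, not re-run output; the value 6 assumes the 6-core allocation shown in the Master UI summary below.

sc.defaultParallelism                       // Int = 6, total cores of this application
sc.parallelize(1 to 9).partitions.length    // Int = 6, same as defaultParallelism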

 

For reference, the basic cluster configuration (Master UI summary):

  • URL: spark://xupan001:7070
  • REST URL: spark://xupan001:6066 (cluster mode)
  • Alive Workers: 3
  • Cores in use: 6 Total, 6 Used
  • Memory in use: 6.0 GB Total, 3.0 GB Used
  • Applications: 1 Running, 5 Completed
  • Drivers: 0 Running, 0 Completed
  • Status: ALIVE

Workers

Worker Id                                  Address             State  Cores       Memory
worker-20171211031717-192.168.0.118-7071   192.168.0.118:7071  ALIVE  2 (2 Used)  2.0 GB (1024.0 MB Used)
worker-20171211031718-192.168.0.119-7071   192.168.0.119:7071  ALIVE  2 (2 Used)  2.0 GB (1024.0 MB Used)
worker-20171211031718-192.168.0.120-7071   192.168.0.120:7071  ALIVE  2 (2 Used)  2.0 GB (1024.0 MB Used)

 

 

======================================================

Relationship to HDFS file size

When a file is read from HDFS without specifying a partition count, the default is 2 partitions:
scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input/zookeeper.out")
scala> rdd.partitions.length
res3: Int = 2

/**
 * Default min number of partitions for Hadoop RDDs when not given by user
 * Notice that we use math.min so the "defaultMinPartitions" cannot be higher than 2.
 * The reasons for this are discussed in https://github.com/mesos/spark/pull/718
 */
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
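For reference, textFile takes this value as the default for its minPartitions parameter; its signature in SparkContext (quoted from the 2.x API) is:

def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String]

Note that minPartitions is only a lower bound; a large file still produces more splits, as described next.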

 

If the HDFS file is large, it is split into roughly (file size / 128 MB) partitions; if there is a remainder smaller than 128 MB, the count becomes (file size / 128 MB) + 1.
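As a back-of-the-envelope check of that rule (a sketch only; the real HadoopRDD/FileInputFormat split logic also applies a small slack factor, which this ignores):

// Estimate the partition count for a single HDFS file under the default 128 MB split size
def estimatedPartitions(sizeMB: Double, splitMB: Double = 128.0): Int =
  math.ceil(sizeMB / splitMB).toInt

estimatedPartitions(284.35)   // 3 -> matches userLogBig.txt in the listing under 2.2 below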

 

Summary: the tests above were run on Spark 2.2.0.
1. If an RDD is created by parallelizing a Scala collection on the driver side and no partition count is specified, the number of partitions equals the total number of cores allocated to the application.
2. If the data is read from the HDFS file system:

2.1 A single file smaller than 128 MB:
scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input/zookeeper.out",1)
scala> rdd.partitions.length
res0: Int = 1

2.2 Multiple files, with sizes as follows:

Permission  Owner  Group       Size       Replication  Block Size  Name
-rw-r--r--  root   supergroup  4.9 KB     1            128 MB      userLog.txt
-rw-r--r--  root   supergroup  284.35 MB  1            128 MB      userLogBig.txt
-rw-r--r--  root   supergroup  51.83 KB   1            128 MB      zookeeper.out

 

scala> val rdd = sc.textFile("hdfs://xupan001:8020/user/root/spark/input")
rdd: org.apache.spark.rdd.RDD[String] = hdfs://xupan001:8020/user/root/spark/input MapPartitionsRDD[3] at textFile at <console>:24

 

scala> rdd.partitions.length
res1: Int = 5

userLogBig.txt spans 3 blocks (284.35 MB / 128 MB rounds up to 3), so it contributes 3 partitions; the two small files contribute 1 partition each, giving 5 partitions in total.

 

Reposted from: https://my.oschina.net/u/2253438/blog/1590655
