Counting the Elements in Each RDD Partition

Spark RDDs are partitioned. When you create an RDD you can usually specify the number of partitions explicitly. If you don't, an RDD created from an in-memory collection defaults to the number of CPU cores allocated to the application, while an RDD created from an HDFS file defaults to the number of Blocks in that file.
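Both defaults can be overridden at creation time. As a minimal sketch (the variable names are illustrative, and the path is the same log file used later in this post): makeRDD takes an explicit number of slices, and textFile takes a minimum partition count:

// Collection-backed RDD with exactly 3 partitions
val rddA = sc.makeRDD(1 to 50, 3)

// Ask for at least 100 partitions when reading an HDFS file
val rddB = sc.textFile("/logs/2015-07-05/lxw1234.com.log", 100)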

You can use the RDD method mapPartitionsWithIndex to find out which elements each partition holds and how many.

For an introduction to mapPartitionsWithIndex, see:

http://lxw1234.com/archives/2015/07/348.htm
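In short, mapPartitionsWithIndex runs a function once per partition, handing it the partition's index together with an iterator over that partition's elements. Its signature on RDD[T] is roughly:

def mapPartitionsWithIndex[U: ClassTag](
    f: (Int, Iterator[T]) => Iterator[U],
    preservesPartitioning: Boolean = false): RDD[U]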

Here are some concrete examples:

// Create an RDD. It gets 15 partitions by default, because my spark-shell
// was started with a total of 15 CPU cores: --total-executor-cores 15

 
 
scala> var rdd1 = sc.makeRDD(1 to 50)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[17] at makeRDD at <console>:21

scala> rdd1.partitions.size
res15: Int = 15
// Count the number of elements in each partition of rdd1

 
 
rdd1.mapPartitionsWithIndex{
  (partIdx, iter) => {
    var part_map = scala.collection.mutable.Map[String, Int]()
    while (iter.hasNext) {
      var part_name = "part_" + partIdx
      if (part_map.contains(part_name)) {
        var ele_cnt = part_map(part_name)
        part_map(part_name) = ele_cnt + 1
      } else {
        part_map(part_name) = 1
      }
      iter.next()
    }
    part_map.iterator
  }
}.collect

res16: Array[(String, Int)] = Array((part_0,3), (part_1,3), (part_2,4), (part_3,3), (part_4,3), (part_5,4), (part_6,3),
(part_7,3), (part_8,4), (part_9,3), (part_10,3), (part_11,4), (part_12,3), (part_13,3), (part_14,4))
// The number of elements in each of part_0 through part_14
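Because the partition index is fixed within each call, the mutable Map above isn't strictly needed. An equivalent, more compact sketch emits a single (name, count) pair per partition; iter.size consumes the iterator and returns its length:

rdd1.mapPartitionsWithIndex{
  (partIdx, iter) => Iterator(("part_" + partIdx, iter.size))
}.collect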

// List which elements are in each partition of rdd1

 
 
rdd1.mapPartitionsWithIndex{
  (partIdx, iter) => {
    var part_map = scala.collection.mutable.Map[String, List[Int]]()
    while (iter.hasNext) {
      var part_name = "part_" + partIdx
      var elem = iter.next()
      if (part_map.contains(part_name)) {
        var elems = part_map(part_name)
        elems ::= elem
        part_map(part_name) = elems
      } else {
        part_map(part_name) = List(elem)
      }
    }
    part_map.iterator
  }
}.collect

res17: Array[(String, List[Int])] = Array((part_0,List(3, 2, 1)), (part_1,List(6, 5, 4)), (part_2,List(10, 9, 8, 7)), (part_3,List(13, 12, 11)),
(part_4,List(16, 15, 14)), (part_5,List(20, 19, 18, 17)), (part_6,List(23, 22, 21)), (part_7,List(26, 25, 24)), (part_8,List(30, 29, 28, 27)),
(part_9,List(33, 32, 31)), (part_10,List(36, 35, 34)), (part_11,List(40, 39, 38, 37)), (part_12,List(43, 42, 41)), (part_13,List(46, 45, 44)),
(part_14,List(50, 49, 48, 47)))
// The elements contained in each of part_0 through part_14
// (each list is reversed, since new elements are prepended with ::=)
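A shorter equivalent sketch that also keeps each partition's elements in their original order (the loop above prepends with ::=, so its lists come out reversed):

rdd1.mapPartitionsWithIndex{
  (partIdx, iter) => Iterator(("part_" + partIdx, iter.toList))
}.collect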

// An RDD created from an HDFS file: it gets 65 partitions because the file has 65 Blocks

 
 
scala> var rdd2 = sc.textFile("/logs/2015-07-05/lxw1234.com.log")
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[21] at textFile at <console>:21

scala> rdd2.partitions.size
res18: Int = 65

// The number of elements in each partition of rdd2

 
 
scala> rdd2.mapPartitionsWithIndex{
     | (partIdx, iter) => {
     |   var part_map = scala.collection.mutable.Map[String, Int]()
     |   while (iter.hasNext) {
     |     var part_name = "part_" + partIdx
     |     if (part_map.contains(part_name)) {
     |       var ele_cnt = part_map(part_name)
     |       part_map(part_name) = ele_cnt + 1
     |     } else {
     |       part_map(part_name) = 1
     |     }
     |     iter.next()
     |   }
     |   part_map.iterator
     | }
     | }.collect

res19: Array[(String, Int)] = Array((part_0,202496), (part_1,225503), (part_2,214375), (part_3,215909),
(part_4,208941), (part_5,205379), (part_6,207894), (part_7,209496), (part_8,213806), (part_9,216962),
(part_10,216091), (part_11,215820), (part_12,217043), (part_13,216556), (part_14,218702), (part_15,218625),
(part_16,218519), (part_17,221056), (part_18,221250), (part_19,222092), (part_20,222339), (part_21,222779),
(part_22,223578), (part_23,222869), (part_24,221543), (part_25,219671), (part_26,222871), (part_27,223200),
(part_28,223282), (part_29,228212), (part_30,223978), (part_31,223024), (part_32,222889), (part_33,222106),
(part_34,221563), (part_35,219208), (part_36,216928), (part_37,216733), (part_38,217214), (part_39,219978),
(part_40,218155), (part_41,219880), (part_42,215833...
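As a quick sanity check (a sketch, reusing the compact form shown earlier), the per-partition counts should add up to the RDD's total element count:

val counts = rdd2.mapPartitionsWithIndex{
  (partIdx, iter) => Iterator(("part_" + partIdx, iter.size))
}.collect

// The sum of the per-partition counts should equal rdd2.count
counts.map(_._2).sum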