count

Returns the number of elements in the RDD.

```scala
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4
```

countByKey

Similar to `count`, but counts elements per key. Note that this function returns a `Map` (specifically a `Map[K, Long]`), not an `Int`.

```scala
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res1: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
c.countByKey.size
res2: Int = 2
```

countByValue

Counts how many times each value occurs in the RDD. Returns a map whose keys are the element values and whose values are the occurrence counts.

```scala
scala> val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[12] at parallelize at <console>:21

scala> b.countByValue
res3: scala.collection.Map[Int,Long] = Map(5 -> 1, 1 -> 6, 6 -> 1, 2 -> 3, 7 -> 1, 3 -> 1, 8 -> 1, 4 -> 2)
```
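As a side note, `countByValue` behaves like pairing each element with a dummy value and then counting by key. The sketch below (assuming a live `SparkContext` named `sc`, as in the examples above, and a hypothetical sample RDD) illustrates that equivalence. Keep in mind that all three actions collect their results to the driver, so they are only appropriate when the result map fits in driver memory.

```scala
// Sketch only: countByValue is equivalent to mapping each element
// to a (value, unit) pair and counting by key.
// Assumes a SparkContext `sc` is already available.
val b = sc.parallelize(List(1, 2, 2, 3, 3, 3))

val viaCountByValue = b.countByValue()            // counts each distinct value
val viaCountByKey   = b.map((_, ())).countByKey() // same counts, derived via keys

assert(viaCountByValue == viaCountByKey)
```

For large key spaces, a distributed alternative such as `b.map((_, 1L)).reduceByKey(_ + _)` keeps the counts in an RDD instead of materializing them on the driver.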