Spark Programming: Basic RDD Operators count, countApproxDistinct, countByValue, and Others
- 1 count
count returns the number of elements stored in an RDD.
def count(): Long
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4
- 2 countApproxDistinct
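Returns the approximate number of distinct values in the RDD. For a very large RDD spread across many nodes, this estimate is usually faster to compute than an exact distinct count.
The relativeSD parameter of the API controls the accuracy: the smaller the value, the more accurate the result, at the cost of a counter that needs more space.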
def countApproxDistinct(relativeSD: Double = 0.05): Long
val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224
b.countApproxDistinct(0.05)
res15: Long = 9750
b.countApproxDistinct(0.01)
res16: Long = 9947
b.countApproxDistinct(0.001)
res0: Long = 10000
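To make the trade-off above concrete, here is a minimal sketch that compares the estimate with the exact distinct count. It assumes a spark-shell session in which `sc` is the predefined SparkContext; the names `nums`, `repeated`, `exact`, `approx` and `relError` are illustrative only.

```scala
// Assumes spark-shell, where `sc` is already defined.
val nums = sc.parallelize(1 to 10000, 20)
val repeated = nums ++ nums ++ nums ++ nums ++ nums  // 50000 elements, 10000 distinct

// Exact answer: deduplicating requires a shuffle of the data.
val exact = repeated.distinct().count()

// Approximate answer, using the default relativeSD of 0.05.
val approx = repeated.countApproxDistinct(0.05)

// Observed relative error of this particular estimate.
val relError = math.abs(approx - exact).toDouble / exact
```

Note that relativeSD describes the expected spread of the estimate rather than a hard bound, so a single run can land somewhat outside it.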
- 3 countApproxDistinctByKey [Pair]
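This operator works on key-value pair data. It is similar to countApproxDistinct above, but it estimates the number of distinct values for each distinct key, so the elements of the RDD must be two-component tuples. The relativeSD parameter of the API controls the accuracy: the smaller the value, the more accurate the result.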
def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]
val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))
d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))
d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
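Because the sampled data above makes it hard to see what is being counted, the following sketch uses a tiny hand-written data set and computes the exact per-key distinct counts next to the approximate ones. It assumes a spark-shell session with `sc` predefined; `pairs`, `exactPerKey` and `approxPerKey` are illustrative names.

```scala
// Assumes spark-shell, where `sc` is already defined.
val pairs = sc.parallelize(Seq(
  ("Cat", 1), ("Cat", 1), ("Cat", 2),   // 2 distinct values for Cat
  ("Dog", 7), ("Dog", 8), ("Dog", 9),   // 3 distinct values for Dog
  ("Gnu", 5)                            // 1 distinct value for Gnu
), 2)

// Exact: drop duplicate (key, value) pairs, then count the remaining pairs per key.
val exactPerKey = pairs.distinct().countByKey()

// Approximate: one estimate per key, collected to the driver as a map.
val approxPerKey = pairs.countApproxDistinctByKey(0.05).collectAsMap()
```

The exact variant shuffles every distinct pair, which is why the approximate one tends to be preferable when each key has a very large number of values.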
- 4 countByKey
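This operator works on key-value pair elements and counts how many elements there are for each key. The result is returned to the driver as a Map.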
def countByKey(): Map[K, Long]
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
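Since countByKey returns a local Map, the whole result has to fit in the driver's memory. When the number of distinct keys may be very large, the same counts can be kept distributed, as in this sketch. It assumes a spark-shell session with `sc` predefined; `onDriver` and `distributed` are illustrative names.

```scala
// Assumes spark-shell, where `sc` is already defined.
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)

// Local result: the complete map is materialized on the driver.
val onDriver = c.countByKey()

// Distributed result: replace every value by 1 and sum per key.
// This stays an RDD[(Int, Long)] until it is explicitly collected.
val distributed = c.mapValues(_ => 1L).reduceByKey(_ + _)

// Collecting is safe here only because the example is tiny.
val collected = distributed.collectAsMap()
```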
- 5 countByValue
Counts how many times each distinct element occurs in an RDD. The result is a Map from each value to its number of occurrences.
def countByValue(): Map[T, Long]
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
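One way to see how countByValue relates to countByKey is to turn each element into a key and count per key. The sketch below does that and then lists the values by descending frequency. It assumes a spark-shell session with `sc` predefined; `byValue`, `viaKey` and `mostFrequent` are illustrative names, and the sketch shows an equivalence of results, not a claim about how Spark implements the operator.

```scala
// Assumes spark-shell, where `sc` is already defined.
val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1))

// Direct call.
val byValue = b.countByValue()

// Same counts: use each element as the key of a pair, then count per key.
val viaKey = b.map(v => (v, 1)).countByKey()

// The result is a plain local Map, so ordinary Scala collection methods apply.
val mostFrequent = byValue.toSeq.sortBy { case (_, n) => -n }.head
```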