The idea is to use Spark's built-in combineByKeyWithClassTag function, relying on the ordering of a TreeSet; in this example we keep the N largest elements within each group. The three callbacks work as follows, and the code is below:
createCombiner simply puts the first element into a TreeSet and returns it;
mergeValue inserts the new element, then removes the smallest element if the set now holds more than N elements;
mergeCombiners merges the two sets, then repeatedly removes the smallest element until the TreeSet holds only N elements.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.collection.mutable
object Main {
val N = 3
def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder()
.master("local[*]")
.appName("Spark")
.getOrCreate()
val sc = spark.sparkContext
val SampleDataset = List(
("apple.com", 3L),
("apple.com", 4L),
("apple.com", 1L),
("apple.com", 9L),
("google.com", 4L),
("google.com", 1L),
("google.com", 2L),
("google.com", 3L),
("google.com", 11L),
("google.com", 32L),
("slashdot.org", 11L),
("slashdot.org", 12L),
("slashdot.org", 13L),
("slashdot.org", 14L),
("slashdot.org", 15L),
("slashdot.org", 16L),
("slashdot.org", 17L),
("slashdot.org", 18L),
("microsoft.com", 5L),
("microsoft.com", 2L),
("microsoft.com", 6L),
("microsoft.com", 9L),
("google.com", 4L))
val urdd: RDD[(String, Long)] = sc.parallelize(SampleDataset)
val topNs = urdd.combineByKeyWithClassTag(
// createCombiner: seed a sorted TreeSet with the key's first value
(firstValue: Long) => mutable.TreeSet(firstValue),
// mergeValue: add a value, then drop minima so at most N remain
(uset: mutable.TreeSet[Long], value: Long) => {
uset += value
while (uset.size > N) {
uset.remove(uset.min)
}
uset
},
// mergeCombiners: union two partial sets, then trim to the N largest
(uset1: mutable.TreeSet[Long], uset2: mutable.TreeSet[Long]) => {
val resultSet = uset1 ++ uset2
while (resultSet.size > N) {
resultSet.remove(resultSet.min)
}
resultSet
}
)
import spark.implicits._
// Flatten each (key, TreeSet) pair into (key, value) rows directly,
// instead of round-tripping through "key/value" strings and split()
topNs.flatMap { case (key, values) =>
values.toList.map(v => (key, v))
}.toDF("key", "TopN_values").show()
}
}
The output:
+-------------+-----------+
| key|TopN_values|
+-------------+-----------+
| google.com| 4|
| google.com| 11|
| google.com| 32|
|microsoft.com| 9|
|microsoft.com| 6|
|microsoft.com| 5|
| apple.com| 4|
| apple.com| 9|
| apple.com| 3|
| slashdot.org| 16|
| slashdot.org| 17|
| slashdot.org| 18|
+-------------+-----------+
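The per-key logic of the three callbacks can be exercised without a SparkContext. Below is a minimal sketch in plain Scala (the object and method names here are illustrative, not part of the original code):

```scala
import scala.collection.mutable

// Sketch of the per-key top-N logic behind the three
// combineByKeyWithClassTag callbacks, testable without Spark.
object TopNLogic {
  val N = 3

  // createCombiner: seed a sorted set with the first value seen for a key
  def createCombiner(first: Long): mutable.TreeSet[Long] =
    mutable.TreeSet(first)

  // mergeValue: add one value, then drop minima so at most N remain
  def mergeValue(set: mutable.TreeSet[Long], v: Long): mutable.TreeSet[Long] = {
    set += v
    while (set.size > N) set.remove(set.min)
    set
  }

  // mergeCombiners: union two partial sets, then trim back to the N largest
  def mergeCombiners(a: mutable.TreeSet[Long],
                     b: mutable.TreeSet[Long]): mutable.TreeSet[Long] = {
    val merged = a ++ b
    while (merged.size > N) merged.remove(merged.min)
    merged
  }
}
```

Folding the google.com values through these functions yields the same {4, 11, 32} seen in the output above.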
Dependencies used (Maven):
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.2.0</version>
</dependency>
</dependencies>
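If you build with sbt instead of Maven, the equivalent configuration (assuming the same Scala 2.11 / Spark 2.2.0 versions as the Maven snippet above) would look like:

```scala
// build.sbt (assumed sbt equivalent of the Maven dependencies)
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0",
  "org.apache.spark" %% "spark-sql"  % "2.2.0"
)
```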