Problems
1. Across all teachers, find the Top 3 most popular teachers
2. Within each subject, find the top 3 most popular teachers (implement in at least two or three ways)
Data
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laozhao
http://bigdata.edu360.cn/laoduan
http://bigdata.edu360.cn/laoduan
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/xiaoxu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://php.edu360.cn/laoli
http://php.edu360.cn/laoliu
http://php.edu360.cn/laoli
http://php.edu360.cn/laoli
1. Across all teachers, find the Top 3 most popular teachers
Approach: standard word-count logic. Parse each line, map it to a (teacher, 1) tuple, then aggregate by key and sort the resulting RDD by count.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    // Map each URL to a (teacher, 1) tuple; the teacher name is the 4th "/"-separated field
    val teacherAndOne: RDD[(String, Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      // val project: String = line.split("/")(2).split("[.]")(0)
      (teacherName, 1)
    })
    // Sum the counts per teacher, sort descending, and keep only the Top 3
    val result: Array[(String, Int)] = teacherAndOne.reduceByKey(_ + _).sortBy(_._2, ascending = false).take(3)
    println(result.toBuffer)
    sparkContext.stop()
  }
}
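For reference, tallying the 42 sample lines above by hand gives laozhao 15, laoyang 9, xiaoxu 6, laoduan 6, laoli 3, laozhang 2, laoliu 1. So the Top 3 comes out to (laozhao, 15), (laoyang, 9), and one of the two teachers tied at 6; which of xiaoxu and laoduan lands third depends on how the sort breaks the tie.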
2. Within each subject, find the top 3 most popular teachers (implement in at least two or three ways)
Solution 1:
Approach: build ((project, teacherName), 1) tuples, aggregate with the composite key (project, teacherName) and value 1, then group by project alone and sort each group's values. Note that the sort used here is Scala's own collection sort, not Spark's.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherGroupT1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    // Map each URL to ((subject, teacher), 1); the subject is the first label of the host name
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(_ + _)
    // Group by subject, then sort each group's records in memory (Scala sortBy, not Spark's)
    val projectReduced: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
    val result: RDD[(String, List[((String, String), Int)])] = projectReduced.mapValues(_.toList.sortBy(_._2).reverse.take(3))
    println(result.collect().toBuffer)
    sparkContext.stop()
  }
}
The problem:
val projectReduced: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
val result: RDD[(String, List[((String, String), Int)])] = projectReduced.mapValues(_.toList.sortBy(_._2).reverse.take(3))
First, groupBy is inefficient. Second, the toList inside mapValues materializes an entire subject's records in memory before sorting; on a large data set that hurts performance and can overflow memory, so this pattern is not recommended. A bounded alternative is sketched below.
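A minimal sketch of that bounded alternative, using aggregateByKey (a technique not covered by the solutions below): the accumulator is trimmed back to three entries as it folds, so a full group is never materialized. It assumes the reduced RDD from the listing above; topN and bySubject are names introduced here for illustration.
val topN = 3
// Re-key by subject so all of a subject's (teacher, count) records aggregate together
val bySubject: RDD[(String, (String, Int))] =
  reduced.map { case ((subject, teacher), cnt) => (subject, (teacher, cnt)) }
val top3PerSubject: RDD[(String, List[(String, Int)])] = bySubject
  .aggregateByKey(List.empty[(String, Int)])(
    (acc, v) => (v :: acc).sortBy(-_._2).take(topN), // fold in one record, trim back to the top 3
    (a, b) => (a ++ b).sortBy(-_._2).take(topN)      // merge two partial top-3 lists
  )
println(top3PerSubject.collect().toBuffer)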
Solution 2:
Approach: instead of sorting a toList in memory, use Spark's own sortBy. Under the hood it builds a RangePartitioner, so each partition only sorts the slice of the value range it owns, e.g. values 1-100 go to the first partition and 101-200 to the second; the repartitioning itself happens through a ShuffledRDD.
The data is first filtered down to a single subject, and only that subject's records are range-partitioned and sorted. Without the filter, sortBy would rank all subjects' records together and the per-subject top 3 would be wrong.
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object TeacherGroupT2 {
  def main(args: Array[String]): Unit = {
    val subjects = Array("javaee", "bigdata", "php")
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(_ + _)
    for (sb <- subjects) {
      // Keep only this subject, then let Spark's sortBy range-partition and sort it
      val filtered: RDD[((String, String), Int)] = reduced.filter(_._1._1 == sb)
      val result = filtered.sortBy(_._2, ascending = false).take(3)
      println(result.toBuffer)
    }
    sparkContext.stop()
  }
}
The problem:
Multiple shuffles: reduceByKey already shuffles once, and then sortBy shuffles again for every subject, so with three subjects the loop adds three more shuffles.
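One way to make the shuffles visible is RDD.toDebugString, which prints an RDD's lineage; every ShuffledRDD in the output marks one shuffle. A minimal sketch, assuming the reduced and subjects values from the listing above:
println(reduced.toDebugString) // the lineage shows the ShuffledRDD introduced by reduceByKey
for (sb <- subjects) {
  val sorted = reduced.filter(_._1._1 == sb).sortBy(_._2, ascending = false)
  println(sorted.toDebugString) // each subject's sortBy adds a ShuffledRDD of its own
}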
Solution 3:
Approach: pass a custom partitioner directly into reduceByKey:
val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(sbPartitioner, _ + _)
This addresses the shuffle problem with a custom partitioner plus a custom ordering. The partitioner routes each subject to its own partition during the reduceByKey shuffle, so no further shuffle is needed; the ordering then drives an insertion-sort-like pass that pulls one record at a time from the partition into a small sorted set, avoiding the problems caused by loading all of a partition's data into memory to sort it.
import org.apache.spark.rdd.RDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

import scala.collection.mutable

object TeacherGroupT3 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("teacherWork").setMaster("local[3]")
    val sparkContext = new SparkContext(conf)
    val lines: RDD[String] = sparkContext.textFile(args(0))
    val projectAndTeacher: RDD[((String, String), Int)] = lines.map(line => {
      val teacherName: String = line.split("/")(3)
      val project: String = line.split("/")(2).split("[.]")(0)
      ((project, teacherName), 1)
    })
    // Collect the distinct subjects to build the partitioning rules
    val subjects: Array[String] = projectAndTeacher.map(_._1._1).distinct().collect()
    val sbPartitioner = new SubjectPartitioner(subjects)
    // One shuffle: reduceByKey aggregates and partitions by subject at the same time
    val reduced: RDD[((String, String), Int)] = projectAndTeacher.reduceByKey(sbPartitioner, _ + _)
    val result: RDD[((String, String), Int)] = reduced.mapPartitions(it => {
      // it.toList.sortBy(_._2).reverse.take(3).iterator  // the in-memory variant we are avoiding
      // Insert records one at a time into a small sorted set, dropping the
      // smallest count whenever the set grows past 3, so at most 4 records
      // are ever held in memory per partition
      var set = new mutable.TreeSet[((String, String), Int)]()(new MyComparator)
      val topN = 3
      it.foreach(x => {
        set += x
        if (set.size > topN) {
          set = set.dropRight(1)
        }
      })
      set.iterator
    })
    println(result.collect().toBuffer)
    sparkContext.stop()
  }
}
// Routes every (subject, teacher) key to the partition assigned to its subject
class SubjectPartitioner(val subjects: Array[String]) extends Partitioner {
  // Build the subject -> partition-index rules once, on the driver
  private val rules: mutable.HashMap[String, Int] = new mutable.HashMap[String, Int]()
  var i = 0
  for (sb <- subjects) {
    rules += ((sb, i))
    i += 1
  }

  override def numPartitions: Int = subjects.length

  override def getPartition(key: Any): Int = {
    val subject = key.asInstanceOf[(String, String)]._1
    rules(subject)
  }
}
// Orders records by count, descending. The tie-break on the key matters:
// TreeSet treats compare == 0 as a duplicate, so without it two teachers
// with the same count would silently collapse into a single entry.
class MyComparator extends Ordering[((String, String), Int)] {
  override def compare(x: ((String, String), Int), y: ((String, String), Int)): Int = {
    val byCount = y._2 - x._2
    if (byCount != 0) byCount
    else Ordering.Tuple2[String, String].compare(x._1, y._1)
  }
}
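Two usage notes on this version. First, projectAndTeacher is traversed by two jobs (the distinct().collect() and then the reduceByKey), so caching it spares a second read of the input file; a one-line, optional tweak:
projectAndTeacher.cache() // keep the mapped RDD in memory between the distinct() job and the reduceByKey job
Second, as a sanity check against the sample data above, counting by hand gives bigdata: (laozhao, 15), (laoduan, 6), (laozhang, 2); javaee: (laoyang, 9), (xiaoxu, 6); php: (laoli, 3), (laoliu, 1), which is what all three solutions should print, up to the ordering of ties.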