Case study:
Find the two most popular teachers in each subject.
Here is the data:
http://bigdata.edu360.cn/zhangsan
http://bigdata.edu360.cn/zhangsan
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/wangwu
http://bigdata.edu360.cn/wangwu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/wangwu
http://bigdata.edu360.cn/wangwu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/lisi
http://bigdata.edu360.cn/wangwu
http://bigdata.edu360.cn/wangwu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/zhaoliu
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://javaee.edu360.cn/laoyang
http://python.edu360.cn/laoli
http://python.edu360.cn/laoliu
http://python.edu360.cn/laoli
http://python.edu360.cn/laoli
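Each line encodes the subject (the first label of the host) and the teacher (the last path segment). Counting the lines above, the expected result is bigdata: (lisi,15), (wangwu,6); javaee: (laoyang,9), (zhaoliu,6); python: (laoli,3), (laoliu,1). As a warm-up, a minimal parsing sketch (not part of the original programs below, which do the same extraction inline):

import java.net.URL

object ParseDemo {
  def main(args: Array[String]): Unit = {
    val line = "http://bigdata.edu360.cn/zhangsan"
    //Teacher: everything after the last "/"
    val teacher = line.substring(line.lastIndexOf("/") + 1)
    //Subject: the first label of the host, e.g. "bigdata" in "bigdata.edu360.cn"
    val subject = new URL(line).getHost.split("[.]")(0)
    println((subject, teacher)) //prints (bigdata,zhangsan)
  }
}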
`Basic approach` -------> sorting inside a Scala List can cause an out-of-memory (OOM) error
package day03
/**
 * Get the top 2 most popular teachers per subject.
 */
import java.net.URL
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object FavTeacherWithObject {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val conf = new SparkConf()
conf.setAppName("FavTeacher").setMaster("local[2]") //local[*]表示用多个线程跑,2表示用两个线程
val sc = new SparkContext(conf)
//Read the data
val lines: RDD[String] = sc.textFile("D:\\data\\teacher.log")
//Shape the data: emit ((subject, teacher), 1) for each record
val subjectAddTeacher: RDD[((String, String), Int)] = lines.map(line => {
val teacher = line.substring(line.lastIndexOf("/") + 1)
val url = new URL(line).getHost
val subject = url.substring(0, url.indexOf("."))
((subject, teacher), 1)
})
//Aggregate the counts per (subject, teacher) key
val reduced: RDD[((String, String), Int)] = subjectAddTeacher.reduceByKey(_+_)
println(reduced.collect().toBuffer)
//Group by subject
val grouped: RDD[(String, Iterable[((String, String), Int)])] = reduced.groupBy(_._1._1)
println(grouped.collect().toBuffer)
//Sort and take the top two; the sorting here happens inside a Scala collection (a List)
//After grouping, each group is sorted in place; CompactBuffer (the grouped iterable, which extends Seq) is converted to a List for sorting
//This is the dangerous part: _ is the group's iterable, and toList pulls all of its data into memory at once, so a very large dataset will cause an OOM (out-of-memory error)
val sorted: RDD[(String, List[((String, String), Int)])] = grouped.mapValues(_.toList.sortBy(-_._2).take(2))
//println(sorted.collect().toBuffer)
val result = sorted.collect()
result.foreach(println)
//Release resources
sc.stop()
}
}
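If a single group might not fit in memory, one common alternative (different from the filter-based fix this tutorial uses next) is to keep only a bounded top-N per key while aggregating, so no full group is ever materialized. A hedged sketch, reusing the reduced RDD from the program above and assuming topN = 2:

//Re-key by subject, then aggregate into a list of at most topN entries per subject
val topN = 2
val bySubject = reduced.map { case ((subject, teacher), cnt) => (subject, (teacher, cnt)) }
val topPerSubject = bySubject.aggregateByKey(List.empty[(String, Int)])(
  (acc, v) => ((v :: acc).sortBy(-_._2)).take(topN), //fold one value into the bounded list
  (a, b) => ((a ++ b).sortBy(-_._2)).take(topN)      //merge two bounded lists
)
topPerSubject.collect().foreach(println)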
Filter the data so that each sorted RDD holds a single subject's records; sorting via the RDD itself then cannot overflow memory (data that does not fit spills to disk, so there is no OOM).
package day03
/**
 * Get the top 2 most popular teachers per subject (filter first, then sort)
* ((bigdata, wangwu),10)
* ((javaee,laoyang),8)
*
 * Data:
* http://bigdata.edu360.cn/wangwu
* http://bigdata.edu360.cn/wangwu
* http://javaee.edu360.cn/zhaoliu
* http://javaee.edu360.cn/zhaoliu
* ......
*/
import java.net.URL
import org.apache.log4j.{Level, Logger}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
object FavTeacherWithObject2 {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val conf = new SparkConf()
conf.setAppName("FavTeacherWithObject2").setMaster("local")
val subjects = Array("bigdata", "javaee", "python") //subjects are hard-coded here; these three appear in the sample data
val sc = new SparkContext(conf)
val lines = sc.textFile("D:\\data\\in\\teacher\\teacher.log")
//Shape the data
val sbjectTeacherAndOne: RDD[((String, String), Int)] = lines.map(line => {
val index = line.lastIndexOf("/")
val teacher = line.substring(index + 1)
val httpHost = line.substring(0, index)
val subject = new URL(httpHost).getHost.split("[.]")(0)
((subject, teacher), 1)
})
//Attach the 1 in a separate step (worse: it would call map twice)
//val map: RDD[((String, String), Int)] = sbjectAndteacher.map((_, 1))
//Aggregate, using (subject, teacher) together as the key
val reduced: RDD[((String, String), Int)] = sbjectTeacherAndOne.reduceByKey(_+_)
//Cache in memory
//val cached = reduced.cache()
//Scala collection sorting happens entirely in memory, and memory may not be enough
//Instead, call the RDD's sortBy method, which sorts using memory + disk
for (sb <- subjects) {
//This RDD now holds a single subject's data only (it has been filtered)
val filtered: RDD[((String, String), Int)] = reduced.filter(_._1._1 == sb)
//Now we call the RDD's sortBy method (take is an action, so it triggers a job)
val favTeacher = filtered.sortBy(_._2, false).take(2)
//Print
println(favTeacher.toBuffer)
}
sc.stop()
}
}
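A side note on the loop body above: sortBy(...).take(2) fully sorts the filtered RDD before taking two elements. Spark's takeOrdered action returns the same top two without a full sort, keeping only two candidates per partition; a sketch against the same filtered RDD:

//Hypothetical replacement for filtered.sortBy(_._2, false).take(2)
val favTeacher = filtered.takeOrdered(2)(Ordering.by { case (_, cnt) => -cnt })
println(favTeacher.toBuffer)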
With the default hash partitioning, keys for several subjects can land in the same partition, so sorting within a partition mixes subjects together. The fix is a custom partitioner that gives each subject its own partition.
package day03
/**
 * Get the top 2 most popular teachers per subject (custom partitioner)
* ((bigdata, wangwu),10)
* ((javaee,laoyang),8)
*
 * Data:
* http://bigdata.edu360.cn/wangwu
* http://bigdata.edu360.cn/wangwu
* http://javaee.edu360.cn/zhaoliu
* http://javaee.edu360.cn/zhaoliu
* ......
* Created by zhangjingcun on 2018/9/19 8:36.
* */
import java.net.URL
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable
object FavTeacherWithObject03 {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val topN = args(0).toInt
val conf = new SparkConf()
conf.setAppName("FavTeacher").setMaster("local[2]") //local[*]表示用多个线程跑,2表示用两个线程
val sc = new SparkContext(conf)
//Read the data
val lines: RDD[String] = sc.textFile("D:\\data\\in\\teacher\\teacher.log")
//Shape the data: emit ((subject, teacher), 1) for each record
val subjectAddTeacher: RDD[((String, String), Int)] = lines.map(line => {
val teacher = line.substring(line.lastIndexOf("/") + 1)
val url = new URL(line).getHost
val subject = url.substring(0, url.indexOf("."))
((subject, teacher), 1)
})
//Aggregate, using (subject, teacher) together as the key
val reduced: RDD[((String, String), Int)] = subjectAddTeacher.reduceByKey(_+_)
//Compute the distinct subjects
val subjects: Array[String] = reduced.map(_._1._1).distinct().collect()
//Build a custom partitioner and partition the data with it
val sbPatitioner = new SubjectParitioner(subjects)
//partitionBy repartitions the data according to the given partitioner
//When partitionBy is called, the RDD's key is (String, String)
val partitioned: RDD[((String, String), Int)] = reduced.partitionBy(sbPatitioner)
//Take one partition at a time (we can now operate on a whole partition's data)
val sorted: RDD[((String, String), Int)] = partitioned.mapPartitions(it => {
//Convert the iterator to a List, sort it, then return an iterator again
it.toList.sortBy(_._2).reverse.take(topN).iterator
})
//Collect the results to the driver and print them
val r: Array[((String, String), Int)] = sorted.collect()
println(r.toBuffer)
sc.stop()
}
}
//Custom partitioner
class SubjectParitioner(sbs: Array[String]) extends Partitioner {
//The class body acts as the primary constructor (it runs once when the partitioner is instantiated)
//A map that stores the partitioning rules
val rules = new mutable.HashMap[String, Int]()
var i = 0
for(sb <- sbs) {
//rules(sb) = i
rules.put(sb, i)
i += 1
}
//Return the number of partitions (how many partitions the next RDD will have)
override def numPartitions: Int = sbs.length
//Compute the partition index for the given key
//The key is a (String, String) tuple
override def getPartition(key: Any): Int = {
//Extract the subject name
val subject = key.asInstanceOf[(String, String)]._1
//Look up the partition index from the rules (this calls the map's apply method)
rules(subject)
}
}
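To make the partitioner concrete, a small standalone check (the subject array and keys are taken from the sample data):

val p = new SubjectParitioner(Array("bigdata", "javaee", "python"))
println(p.numPartitions)                         //3
println(p.getPartition(("bigdata", "zhangsan"))) //0 - first subject in the array
println(p.getPartition(("python", "laoli")))     //2 - third subject in the array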
The code above performs two shuffles, one for reduceByKey and one for partitionBy, but they can be merged into a single shuffle by passing the custom partitioner directly to reduceByKey.
package day03
/**
 * Get the top 2 most popular teachers per subject (custom partitioner, single shuffle)
* ((bigdata, wangwu),10)
* ((javaee,laoyang),8)
*
 * Data:
* http://bigdata.edu360.cn/wangwu
* http://bigdata.edu360.cn/wangwu
* http://javaee.edu360.cn/zhaoliu
* http://javaee.edu360.cn/zhaoliu
* ......
* Created by zhangjingcun on 2018/9/19 8:36.
* */
import java.net.URL
import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable
object FavTeacherWithObject04 {
def main(args: Array[String]): Unit = {
Logger.getLogger("org.apache.spark").setLevel(Level.OFF)
val topN = args(0).toInt
val conf = new SparkConf()
conf.setAppName("FavTeacher").setMaster("local[2]") //local[*]表示用多个线程跑,2表示用两个线程
val sc = new SparkContext(conf)
//Read the data
val lines: RDD[String] = sc.textFile("D:\\data\\in\\teacher\\teacher.log")
//Shape the data: emit ((subject, teacher), 1) for each record
val subjectAddTeacher: RDD[((String, String), Int)] = lines.map(line => {
val teacher = line.substring(line.lastIndexOf("/") + 1)
val url = new URL(line).getHost
val subject = url.substring(0, url.indexOf("."))
((subject, teacher), 1)
})
//Compute the distinct subjects
val subjects: Array[String] = subjectAddTeacher.map(_._1._1).distinct().collect()
//Build the custom partitioner up front
val sbPatitioner = new SubjectParitioner2(subjects)
//Aggregate with (subject, teacher) as the key, **passing the partitioner so that aggregation and repartitioning merge into a single shuffle**
val reduced: RDD[((String, String), Int)] = subjectAddTeacher.reduceByKey(sbPatitioner, _ + _)
//reduced is already partitioned by sbPatitioner, so a separate partitionBy call (a second shuffle) is unnecessary
//Take one partition at a time (we can now operate on a whole partition's data)
val sorted: RDD[((String, String), Int)] = reduced.mapPartitions(it => {
//Convert the iterator to a List, sort it, then return an iterator again
it.toList.sortBy(_._2).reverse.take(topN).iterator
})
//Collect the results to the driver and print them
val r: Array[((String, String), Int)] = sorted.collect()
println(r.toBuffer)
sc.stop()
}
}
//Custom partitioner
class SubjectParitioner2(sbs: Array[String]) extends Partitioner {
//The class body acts as the primary constructor (it runs once when the partitioner is instantiated)
//A map that stores the partitioning rules
val rules = new mutable.HashMap[String, Int]()
var i = 0
for(sb <- sbs) {
//rules(sb) = i
rules.put(sb, i)
i += 1
}
//Return the number of partitions (how many partitions the next RDD will have)
override def numPartitions: Int = sbs.length
//Compute the partition index for the given key
//The key is a (String, String) tuple
override def getPartition(key: Any): Int = {
//Extract the subject name
val subject = key.asInstanceOf[(String, String)]._1
//Look up the partition index from the rules (this calls the map's apply method)
rules(subject)
}
}
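Note that it.toList inside mapPartitions still pulls an entire partition into memory, so a single very popular subject could still cause an OOM. A bounded fold avoids that; a minimal sketch that could replace the mapPartitions call above (assuming reduced and topN are in scope as in the program):

val sorted = reduced.mapPartitions(it => {
  //Fold the partition into a list of at most topN records, never materializing the whole partition
  it.foldLeft(List.empty[((String, String), Int)]) { (top, rec) =>
    ((rec :: top).sortBy(-_._2)).take(topN)
  }.iterator
})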