Review
One partition corresponds to one Task, and the Task for a given partition runs on a single machine (inside an Executor); one machine may host the Tasks of several partitions.
Grouped Top-N
1. Aggregate, group by subject, then sort within each group (using Scala collection sorting).
2. Filter by subject first, then sort with RDD methods (multiple machines + memory + disk); the job has to be submitted once per subject.
3. Use a custom partitioner, then sort within each partition (partitionBy, mapPartitions).
4. Apply the custom partitioner already during aggregation, which saves one Shuffle.
5. Custom partitioner + mapPartitions, keeping a small sortable collection inside mapPartitions so that only the top N records per subject stay in memory (see the sketch below).
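A minimal sketch of approach 5. The record shape (subject, (teacher, count)), the names SubjectPartitioner and GroupTopN, and the sample data are illustrative assumptions, not taken from the original code:

import org.apache.spark.{Partitioner, SparkConf, SparkContext}
import scala.collection.mutable

//routes every record of one subject into the same partition
class SubjectPartitioner(subjects: Array[String]) extends Partitioner {
  private val idx = subjects.zipWithIndex.toMap
  override def numPartitions: Int = subjects.length
  override def getPartition(key: Any): Int = idx(key.asInstanceOf[String])
}

object GroupTopN {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GroupTopN").setMaster("local[4]"))
    val topN = 3
    //assume this is the already aggregated RDD: (subject, (teacher, count))
    val reduced = sc.parallelize(Seq(
      ("bigdata", ("t1", 10)), ("bigdata", ("t2", 20)), ("bigdata", ("t3", 15)),
      ("bigdata", ("t4", 8)), ("javaee", ("t5", 5)), ("javaee", ("t6", 8))
    ))
    val subjects = reduced.keys.distinct().collect()
    val result = reduced.partitionBy(new SubjectPartitioner(subjects))
      .mapPartitions(it => {
        //descending by count; ties broken by teacher name so the TreeSet keeps both
        val ord: Ordering[(String, (String, Int))] = Ordering.by(t => (-t._2._2, t._2._1))
        val sorted = mutable.TreeSet.empty(ord)
        it.foreach(t => {
          sorted += t
          if (sorted.size > topN) sorted -= sorted.last //drop the smallest, keep only N
        })
        sorted.iterator
      })
    println(result.collect().toBuffer)
    sc.stop()
  }
}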
Execution flow of WordCount
- 6 RDDs are generated,
- 2 Stages (reduceByKey introduces a shuffle boundary),
- number of Tasks generated = number of partitions × 2 (one wave of Tasks per Stage, assuming both Stages use the same number of partitions).
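A minimal WordCount with comments marking where the 6 RDDs come from (this breakdown assumes the standard implementation, where textFile alone already produces two RDDs):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val lines = sc.textFile(args(0))            //HadoopRDD + MapPartitionsRDD (RDDs 1 and 2)
    val words = lines.flatMap(_.split(" "))     //MapPartitionsRDD (3)
    val wordAndOne = words.map((_, 1))          //MapPartitionsRDD (4)
    val reduced = wordAndOne.reduceByKey(_ + _) //ShuffledRDD (5), Stage boundary
    reduced.saveAsTextFile(args(1))             //MapPartitionsRDD (6)
    sc.stop()
  }
}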
RDD methods
RDD cache
Caching data:
val cached = lines.cache //cache the data in the Executors
The first load takes the longest (about 4 s); the amount of data read is 20,015 B.
The second run is still slow, because the data is read back from local disk for the computation.
The third run caches the data in memory, and also takes a while.
The fourth run computes directly from the in-memory cache and is the fastest; the input size shown is 4.5 KB, which differs from the first load because the cached in-memory representation is not the same as the raw file.
Note: the first run with caching is not faster than running without a cache, because writing the data into memory costs extra time. If the data to be cached does not fit in memory, only part of it is cached. cache does not create a new RDD; it merely marks the RDD so that its data is kept in memory once computed.
Use case: the processed data is used over and over again.
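A minimal sketch of the experiment described above (the HDFS path is illustrative):

val lines = sc.textFile("hdfs://node1:9000/access.log")
val cached = lines.cache //only marks the RDD; nothing is cached yet
cached.count //first action: computes the RDD and fills the cache (slow)
cached.count //second action: served from the cache (fast)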
Clearing the cache
cached.unpersist(true) //blocking = true: wait until the cached blocks have been removed
When to cache
1. The computation must be fast.
2. The cluster has enough resources (memory).
3. Important: the cached data is used by repeatedly triggered Actions.
4. Filter first, then cache the reduced dataset in memory:
val filtered = lines.filter(_.contains("bigdata"))
val cached = filtered.cache()
//or cache an already aggregated RDD:
val cached = reduced.cache()
def cache(): this.type = persist()
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY) //the default storage level
persist(level) lets you pick the storage level explicitly and is therefore more flexible.
Storage levels
//1st parameter: store the data on disk
//2nd parameter: store the data in memory
//3rd parameter: use off-heap memory
//4th parameter: keep the data as deserialized Java objects (false means it is kept serialized)
//5th, optional parameter: number of replicas (default 1)
val NONE = new StorageLevel(false, false, false, false) //no caching
val DISK_ONLY = new StorageLevel(true, false, false, false) //disk only
val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2) //disk only, 2 replicas
val MEMORY_ONLY = new StorageLevel(false, true, false, true) //memory only, deserialized
val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2) //memory only, 2 replicas
val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false) //memory only, serialized
val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2) //memory only, serialized, 2 replicas
val MEMORY_AND_DISK = new StorageLevel(true, true, false, true) //memory + disk
val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2) //memory + disk, 2 replicas
val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false) //memory + disk, serialized
val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2) //memory + disk, serialized, 2 replicas
val OFF_HEAP = new StorageLevel(true, true, true, false, 1) //off-heap memory (in older Spark versions this meant Tachyon, a distributed in-memory storage system)
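For example, to cache serialized in memory with spill-to-disk (a common choice when the deserialized data would not fit in memory):

import org.apache.spark.storage.StorageLevel

val cached = reduced.persist(StorageLevel.MEMORY_AND_DISK_SER)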
RDD Checkpoint
Purpose: during a complex computation, save intermediate results to a distributed file system so they cannot be lost.
When to checkpoint
1. Iterative computations where the data must be kept safe.
2. Speed is not the top concern (compared with caching in memory).
3. The intermediate result should be stored on HDFS.
How to create a checkpoint:
1. Set the checkpoint directory (a directory on a distributed file system, i.e. an HDFS directory):
sc.setCheckpointDir("hdfs://node1:9000/ck2019")
2. Run the complex computation to obtain the intermediate result:
val filtered = lines.filter(_.contains("bigdata"))
3. Checkpoint the intermediate result to the configured HDFS directory:
filtered.checkpoint
4. Subsequent computations can then reuse the checkpointed data; the written files can be inspected with:
hdfs dfs -ls /ck2019/....
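Putting the steps together in one runnable sketch (caching before checkpointing is a common practice, because the checkpoint otherwise recomputes the RDD a second time when writing it out):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))
    sc.setCheckpointDir("hdfs://node1:9000/ck2019")
    val lines = sc.textFile(args(0))
    val filtered = lines.filter(_.contains("bigdata"))
    filtered.cache() //avoid computing the RDD twice (once for the job, once for the checkpoint)
    filtered.checkpoint() //takes effect when the first action runs
    println(filtered.count()) //the action that materializes both the cache and the checkpoint
    sc.stop()
  }
}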
Broadcast variables
Computing IP locations in a standalone program
Requirement: from the IP addresses in the access log, determine each user's location, count the number of visits per province, and write the result to MySQL.
1. Clean the data: extract the IP field and convert the IP address into a decimal number.
2. Load and clean the rules, keep only the useful fields, and cache the data in memory (on the Executors).
3. Match each access-log record against the IP rules (binary search).
4. Take the matched province name and pair it with a 1, i.e. (province, 1).
5. Aggregate by province name.
6. Write the aggregated data to MySQL.
package day4

import scala.io.{BufferedSource, Source}

object TestIP {
  //convert an IP address to its decimal representation,
  //e.g. "1.2.3.4" -> 1*2^24 + 2*2^16 + 3*2^8 + 4 = 16909060
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  //read the IP rules into memory
  def readRules(path: String): Array[(Long, Long, String)] = {
    //read the rule file
    val bf: BufferedSource = Source.fromFile(path)
    val lines: Iterator[String] = bf.getLines()
    //clean the rules and keep them in memory
    val rules: Array[(Long, Long, String)] = lines.map(line => {
      val fields: Array[String] = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    }).toArray
    rules
  }

  //binary search: returns the index of the rule whose [start, end] range contains ip, or -1
  def binarySearch(lines: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else
        low = middle + 1
    }
    -1
  }

  def main(args: Array[String]): Unit = {
    //read the rules into memory
    val rules: Array[(Long, Long, String)] = readRules("C:\\Users\\Desktop\\课件与代码\\ip\\ip.txt")
    //convert the IP address to a decimal number
    val ipNum = ip2Long("114.215.43.42")
    //binary search for the matching rule
    val index = binarySearch(rules, ipNum)
    //look the rule up by index
    val tp = rules(index)
    val province = tp._3
    println(province)
  }
}
Local-mode implementation
package day4

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.{BufferedSource, Source}

object IPLocation1 {
  //drawback: the rules can only be read from a local file on the Driver machine
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IpLocation1").setMaster("local[4]")
    val sc = new SparkContext(conf)
    //read all the IP rules on the Driver side (the rule file lives on one machine,
    //the same one the Driver runs on)
    val rules: Array[(Long, Long, String)] = MyUtils.readRules(args(0))
    //broadcast the Driver-side data to the Executors by calling sc.broadcast;
    //the returned broadcast reference still lives on the Driver
    val broadcastRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rules)
    //create an RDD that reads the access log
    val accessLines: RDD[String] = sc.textFile(args(1))
    //on which side is this function defined? (the Driver)
    val func = (line: String) => {
      val fields = line.split("[|]")
      val ip = fields(1)
      //convert the IP to a decimal number
      val ipNum = MyUtils.ip2Long(ip)
      //this code runs inside an Executor; the broadcast reference gives access
      //to the rules that were shipped to this Executor
      val rulesInExecutor: Array[(Long, Long, String)] = broadcastRef.value
      //binary search for the matching rule
      var province = "unknown"
      val index = MyUtils.binarySearch(rulesInExecutor, ipNum)
      if (index != -1) {
        province = rulesInExecutor(index)._3
      }
      (province, 1)
    }
    //clean the data
    val provinceAndOne = accessLines.map(func)
    //aggregate
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_ + _)
    //print the result
    val r = reduced.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
object MyUtils {
  //convert an IP address to its decimal representation
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | ipNum << 8L
    }
    ipNum
  }

  //read the IP rules into memory
  def readRules(path: String): Array[(Long, Long, String)] = {
    //read the rule file
    val bf: BufferedSource = Source.fromFile(path)
    val lines: Iterator[String] = bf.getLines()
    //clean the rules and keep them in memory
    val rules: Array[(Long, Long, String)] = lines.map(line => {
      val fields: Array[String] = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    }).toArray
    rules
  }

  //binary search: returns the index of the matching rule, or -1
  def binarySearch(lines: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = (low + high) / 2
      if ((ip >= lines(middle)._1) && (ip <= lines(middle)._2))
        return middle
      if (ip < lines(middle)._1)
        high = middle - 1
      else
        low = middle + 1
    }
    -1
  }
}
Optimized version
package day4

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.{BufferedSource, Source}

//the rules can now be read from a distributed file system
object IPLocation2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IpLocation2").setMaster("local[4]")
    val sc = new SparkContext(conf)
    //read the IP rules from HDFS
    val rulesLines: RDD[String] = sc.textFile(args(0))
    //clean the IP rule data
    val ipRulesRDD: RDD[(Long, Long, String)] = rulesLines.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    })
    //collect the rule fragments scattered over the Executors back to the Driver
    val rulesInDriver: Array[(Long, Long, String)] = ipRulesRDD.collect()
    //broadcast the Driver-side data to the Executors by calling sc.broadcast;
    //the returned broadcast reference still lives on the Driver
    val broadcastRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rulesInDriver)
    //create an RDD that reads the access log
    val accessLines: RDD[String] = sc.textFile(args(1))
    //clean the data
    val provinceAndOne = accessLines.map(log => {
      val fields = log.split("[|]")
      val ip = fields(1)
      //convert the IP to a decimal number
      val ipNum = MyUtils.ip2Long(ip)
      //this code runs inside an Executor; the broadcast reference gives access to the rules.
      //How does the Driver-side broadcast variable reach the Executor?
      //Tasks are created on the Driver, and the broadcast reference is shipped
      //to the Executors together with the Tasks.
      val rulesInExecutor: Array[(Long, Long, String)] = broadcastRef.value
      //binary search for the matching rule
      var province = "unknown"
      val index = MyUtils.binarySearch(rulesInExecutor, ipNum)
      if (index != -1) {
        province = rulesInExecutor(index)._3
      }
      (province, 1)
    })
    //aggregate
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_ + _)
    //print the result
    val r = reduced.collect()
    println(r.toBuffer)
    sc.stop()
  }
}
object MyUtils (ip2Long, readRules, binarySearch) is identical to the version shown in the previous listing.
Writing the data to MySQL
Complete version. Don't forget to add the MySQL JDBC driver dependency.
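For example with sbt (the version number is illustrative; pick one that matches your MySQL server):

libraryDependencies += "mysql" % "mysql-connector-java" % "5.1.47"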
package day4

import java.sql.{Connection, DriverManager}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import scala.io.{BufferedSource, Source}

object IPLocation2 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("IpLocation2").setMaster("local[4]")
    val sc = new SparkContext(conf)
    //read the IP rules from HDFS
    val rulesLines: RDD[String] = sc.textFile(args(0))
    //clean the IP rule data
    val ipRulesRDD: RDD[(Long, Long, String)] = rulesLines.map(line => {
      val fields = line.split("[|]")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    })
    //collect the rule fragments scattered over the Executors back to the Driver
    val rulesInDriver: Array[(Long, Long, String)] = ipRulesRDD.collect()
    //broadcast the Driver-side data to the Executors by calling sc.broadcast;
    //the returned broadcast reference still lives on the Driver
    val broadcastRef: Broadcast[Array[(Long, Long, String)]] = sc.broadcast(rulesInDriver)
    //create an RDD that reads the access log
    val accessLines: RDD[String] = sc.textFile(args(1))
    //clean the data
    val provinceAndOne = accessLines.map(log => {
      val fields = log.split("[|]")
      val ip = fields(1)
      //convert the IP to a decimal number
      val ipNum = MyUtils.ip2Long(ip)
      //this code runs inside an Executor; the broadcast reference gives access to the rules.
      //How does the Driver-side broadcast variable reach the Executor?
      //Tasks are created on the Driver, and the broadcast reference is shipped
      //to the Executors together with the Tasks.
      val rulesInExecutor: Array[(Long, Long, String)] = broadcastRef.value
      //binary search for the matching rule
      var province = "unknown"
      val index = MyUtils.binarySearch(rulesInExecutor, ipNum)
      if (index != -1) {
        province = rulesInExecutor(index)._3
      }
      (province, 1)
    })
    //aggregate
    val reduced: RDD[(String, Int)] = provinceAndOne.reduceByKey(_ + _)
    //take one partition at a time: one JDBC connection per partition, so all rows of a
    //partition are written before the connection is released, which saves resources.
    //Using foreach() instead would create one connection per record and waste resources
    //when writing large amounts of data.
    /*reduced.foreachPartition(it => {
      //the JDBC connection is obtained by the Task running inside an Executor
      val conn: Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "123")
      val pstm = conn.prepareStatement("insert into access_log values (?,?)")
      //write every record of the partition
      it.foreach(tp => {
        pstm.setString(1, tp._1)
        pstm.setInt(2, tp._2)
        pstm.executeUpdate()
      })
      pstm.close()
      conn.close()
    })*/
    reduced.foreachPartition(MyUtils.data2MySQL _)
    sc.stop()
  }
}
object MyUtils {
  //write one partition's records to MySQL over a single JDBC connection
  def data2MySQL(it: Iterator[(String, Int)]): Unit = {
    val conn: Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "123")
    val pstm = conn.prepareStatement("insert into access_log values (?,?)")
    //write every record of the partition
    it.foreach(tp => {
      pstm.setString(1, tp._1)
      pstm.setInt(2, tp._2)
      pstm.executeUpdate()
    })
    pstm.close()
    conn.close()
  }

  //ip2Long, readRules, and binarySearch are identical to the versions shown above
}
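A possible further refinement (not part of the original code): use JDBC batching so that each partition costs one connection and only a few network round trips. A sketch of a batched variant of data2MySQL, reusing the same hypothetical table and credentials as above:

import java.sql.{Connection, DriverManager}

def data2MySQLBatched(it: Iterator[(String, Int)]): Unit = {
  val conn: Connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?characterEncoding=UTF-8", "root", "123")
  val pstm = conn.prepareStatement("insert into access_log values (?,?)")
  try {
    var count = 0
    it.foreach(tp => {
      pstm.setString(1, tp._1)
      pstm.setInt(2, tp._2)
      pstm.addBatch() //accumulate instead of executing immediately
      count += 1
      if (count % 500 == 0) pstm.executeBatch() //flush every 500 rows
    })
    pstm.executeBatch() //flush the remaining rows
  } finally {
    //release the resources even if the write fails
    pstm.close()
    conn.close()
  }
}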