Notes on Must-Know Spark Concepts
I. Spark Broadcast Variables
Broadcast variables are used to distribute large objects efficiently: a large read-only value is shipped to every worker node once, so the tasks running there can reuse it in their RDD operations instead of receiving a fresh copy with each task.
- Code
package com.zxy.spark.Streaming.day005

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object demo3 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo3").setMaster("local[*]"))

    // The lookup table that will be broadcast
    val mapBroad: Map[String, String] = Map(
      "mi" -> "小米",
      "apple" -> "苹果",
      "lenovo" -> "联想"
    )

    // Define the broadcast variable
    val brandBC: Broadcast[Map[String, String]] = sc.broadcast(mapBroad)

    val ElecRDD: RDD[Electronic_brand] = sc.parallelize(List(
      Electronic_brand(0, "apple"),
      Electronic_brand(1, "mi"),
      Electronic_brand(2, "lenovo")
    ))

    // Look up each record's brand name in the broadcast map
    ElecRDD.map(info => {
      val brand: String = brandBC.value.getOrElse(info.EID, "NULL")
      Electronic_brand(info.id, brand)
    }).collect().foreach(println)
  }
}

case class Electronic_brand(id: Int, EID: String)
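Once the lookup table is no longer needed, its copies can be released explicitly. A minimal sketch, continuing the example above:

// Drop the broadcast copies held on the executors; the value is
// re-broadcast automatically if brandBC is used again afterwards.
brandBC.unpersist()

// Or release it for good; brandBC must not be used after this call.
brandBC.destroy()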
II. Spark Data Persistence
1) cache: keeps the data in memory temporarily for reuse.
Memory pressure can evict cached blocks, so the data is not guaranteed to stay available.
It adds a new dependency to the lineage but does not cut it,
so if anything goes wrong, the data can be recomputed from the original source.
2) persist: keeps the data in temporary storage for reuse, at a chosen StorageLevel (memory, disk, or both).
Disk-backed levels involve disk I/O and are slower, but the data is safer.
The temporary files are removed once the job finishes.
3) checkpoint: stores the data durably in files for reuse.
It involves disk I/O and is slower, but the data is safe.
It is usually combined with cache for efficiency, so the RDD is not computed a second time just to write the checkpoint.
During execution it cuts the lineage and establishes a new one:
a checkpoint is effectively equivalent to switching the data source.
- Code
package com.zxy.spark.Streaming.day005

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object demo2 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("demo2").setMaster("local[*]"))

    // 1. checkpoint: the checkpoint directory must be set first
    sc.setCheckpointDir("date/")

    val list = List("hello word", "hello world")
    val rdd: RDD[String] = sc.makeRDD(list)
    val flatRDD: RDD[String] = rdd.flatMap(_.split("\\s+"))
    val mapRDD: RDD[(String, Int)] = flatRDD.map(word => {
      // printed once per element per computation; with caching in place
      // it appears only for the first action below
      println("test")
      (word, 1)
    })

    // 2. cache: shorthand for persist(StorageLevel.MEMORY_ONLY)
    mapRDD.cache()
    // 3. persist: same level as cache() here, so this call is a no-op;
    //    a *different* level would throw, because the storage level
    //    cannot be changed once assigned
    mapRDD.persist(StorageLevel.MEMORY_ONLY)
    // mark the RDD for checkpointing; materialized by the first action
    mapRDD.checkpoint()

    // aggregation
    val reduceRDD: RDD[(String, Int)] = mapRDD.reduceByKey(_ + _)
    reduceRDD.collect().foreach(println)
    println("***********************")

    // grouping: reuses the cached/checkpointed mapRDD instead of recomputing it
    val groupRDD: RDD[(String, Iterable[Int])] = mapRDD.groupByKey()
    groupRDD.collect().foreach(println)
  }
}
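To observe the lineage truncation described above, toDebugString can be printed before and after the checkpoint is materialized. A minimal sketch, reusing mapRDD and reduceRDD from the code above:

println(mapRDD.toDebugString)  // full lineage, back to the source collection

reduceRDD.collect()            // first action: computes mapRDD and writes the checkpoint

println(mapRDD.toDebugString)  // lineage now starts at a ReliableCheckpointRDD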
III. The Kryo Serialization Framework in Spark
Java serialization can serialize any class, but it is verbose: serialized objects are large, so reading the serialized data back later is correspondingly slow.
For performance, Spark also supports the Kryo serialization framework.
Kryo is reportedly about ten times faster than Java serialization, but it supports fewer types.
Since Spark 2.0.0, shuffles of RDDs of simple types, arrays of simple types, and strings already use Kryo internally.
To serialize your own classes with Kryo, they need to be registered.
- Code
package com.zxy.spark.Streaming.day005

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object demo1 {
  def main(args: Array[String]): Unit = {
    val conf: SparkConf = new SparkConf()
      .setAppName("demo1")
      .setMaster("local[*]")
      // replace the default Java serializer with Kryo
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // register the classes that Kryo will serialize
      .registerKryoClasses(Array(classOf[Man], classOf[Woman], classOf[People]))

    val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    val df: DataFrame = spark.createDataFrame(Seq(People(Man("z"), Woman("x"))))
    df.show()
    spark.close()
  }

  case class Man(name: String)
  case class Woman(name: String)
  case class People(man: Man, woman: Woman)
}
- Output
+---+-----+
|man|woman|
+---+-----+
|[z]| [x]|
+---+-----+
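To verify that your classes really go through Kryo rather than silently falling back, Spark can be told to fail on unregistered classes. A sketch, assuming the same conf and classes as above:

val strictConf: SparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // fail fast when an unregistered class is serialized, instead of
  // writing its full class name with every object (larger and slower)
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Man], classOf[Woman], classOf[People]))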
IV. Spark Core: Intersection, Union, Subtract, and Zip
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object num6 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("num6"))
    val rdd1: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4))
    val rdd2: RDD[Int] = sc.makeRDD(List(3, 4, 5, 6))

    // intersection: [3,4]
    val rdd3: RDD[Int] = rdd1.intersection(rdd2)
    println(rdd3.collect().mkString("[", ",", "]"))

    // union (keeps duplicates): [1,2,3,4,3,4,5,6]
    val rdd4: RDD[Int] = rdd1.union(rdd2)
    println(rdd4.collect().mkString("[", ",", "]"))

    // difference: [1,2]
    val rdd5: RDD[Int] = rdd1.subtract(rdd2)
    println(rdd5.collect().mkString("[", ",", "]"))

    // difference the other way: [5,6]
    val rdd6: RDD[Int] = rdd2.subtract(rdd1)
    println(rdd6.collect().mkString("[", ",", "]"))

    // zip: [(1,3),(2,4),(3,5),(4,6)]
    val rdd7: RDD[(Int, Int)] = rdd1.zip(rdd2)
    println(rdd7.collect().mkString("[", ",", "]"))
  }
}
- Output
[3,4]
[1,2,3,4,3,4,5,6]
[1,2]
[5,6]
[(1,3),(2,4),(3,5),(4,6)]
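One caveat on zip: both RDDs must have the same number of partitions and the same number of elements per partition, or Spark throws at runtime. A small sketch, reusing sc from num6 above:

val a = sc.makeRDD(List(1, 2, 3), numSlices = 2)
val b = sc.makeRDD(List(4, 5, 6), numSlices = 3)
// a.zip(b) throws IllegalArgumentException:
//   "Can't zip RDDs with unequal numbers of partitions";
// equal partition counts with unequal element counts fail
// during execution instead.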
V. The sample Transformation and Its Underlying Algorithms
- Spark-Core
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object num5 {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("num5"))
    val RDD: RDD[Int] = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

    /**
     * Sampling without replacement (Bernoulli algorithm)
     * Bernoulli: the 0/1 distribution, like tossing a coin: heads or tails.
     * Implementation: a number derived from the seed and a random algorithm
     * is compared with the second parameter; the element is kept if the
     * number is smaller, and dropped if it is larger.
     * 1st parameter: whether sampled data is replaced; false = without replacement
     * 2nd parameter: probability of selecting each element, in [0, 1];
     *                0 = select nothing, 1 = select everything
     * 3rd parameter: random seed
     */
    val dateRDD1: RDD[Int] = RDD.sample(false, 0.5)
    dateRDD1.foreach(println)

    /**
     * Sampling with replacement (Poisson algorithm)
     * 1st parameter: whether sampled data is replaced; true = with replacement
     * 2nd parameter: expected number of times each element is selected; >= 0
     * 3rd parameter: random seed
     */
    val dateRDD2: RDD[Int] = RDD.sample(true, 2)
    dateRDD2.foreach(println)
  }
}
- The underlying implementation of sample
if (withReplacement) {
  // first parameter true: Poisson sampler
  new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
} else {
  // first parameter false: Bernoulli sampler
  new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
}
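The third parameter, the seed, defaults to a random value, so each run samples differently; pinning it makes the sample reproducible. A small sketch, reusing the RDD value from num5 above:

// the same seed yields the same sample on every run
val sampledA: RDD[Int] = RDD.sample(withReplacement = false, fraction = 0.5, seed = 42L)
val sampledB: RDD[Int] = RDD.sample(withReplacement = false, fraction = 0.5, seed = 42L)
// sampledA and sampledB contain exactly the same elements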
VI. Spark Shared Variables: Broadcast Variables
package com.zxy.spark.core.Day04

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

/**
 * file: broadcast
 * author: zxy
 * date: 2021-06-10
 * desc: shared variables - broadcast variables
 */
object Demo11_share {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("share"))

    // gender lookup table to share
    val genderMap = Map(
      "0" -> "小姐姐",
      "1" -> "小哥哥"
    )

    // name lookup table to share
    val nameMap = Map(
      "1" -> "鸣人",
      "2" -> "博人",
      "3" -> "佐助",
      "4" -> "凯",
      "5" -> "卡卡西"
    )

    // (id, gender code, age); the last record's gender code "2" is missing
    // from genderMap, so it falls back to the default below
    val listRDD: RDD[(String, String, Int)] = sc.parallelize(List(
      ("1", "1", 30),
      ("2", "0", 18),
      ("3", "1", 31),
      ("4", "0", 40),
      ("5", "2", 41)
    ))

    val genders: Broadcast[Map[String, String]] = sc.broadcast(genderMap)
    val names: Broadcast[Map[String, String]] = sc.broadcast(nameMap)

    listRDD.map(stu => {
      val id: String = stu._1
      val sex: String = stu._2
      val age: Int = stu._3
      val gender: String = genders.value.getOrElse(sex, "未知")
      val name: String = names.value.getOrElse(id, "未知")
      Student(id, name, gender, age)
    }).foreach(println)
  }
}

case class Student(id: String, name: String, sex: String, age: Int)
VII. Spark -> WordCount
1. The classic project: WordCount
WordCount, version 1
Written in the style of plain Scala collections.
package com.zxy.SparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // connect to the Spark framework
    val wordCount: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val context: SparkContext = new SparkContext(wordCount)

    // read the data from the given directory
    val lines: RDD[String] = context.textFile("spark-core\\dates")

    // split the data into words
    val words: RDD[String] = lines.flatMap(_.split("\\s+"))

    // group the data by word
    val map: RDD[(String, Iterable[String])] = words.groupBy(word => word)

    // reshape: (word, occurrences) -> (word, count)
    val WordToCount: RDD[(String, Int)] = map.map {
      case (word, list) => (word, list.size)
    }

    // collect the data to the driver
    val array: Array[(String, Int)] = WordToCount.collect()

    // print
    array.foreach(println)

    // close the connection
    context.stop()
  }
}
WordCount, version 1 (condensed)
package com.zxy.SparkCore

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // connect to the Spark framework
    val wordCount: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val context: SparkContext = new SparkContext(wordCount)

    // the same pipeline in one functional chain
    context.textFile("spark-core\\dates")
      .flatMap(_.split("\\s+"))
      .groupBy(word => word)
      .map(kv => (kv._1, kv._2.size))
      .collect()
      .foreach(println)

    // close the connection
    context.stop()
  }
}
WordCount, version 2
Uses Spark's own operators.
package com.zxy.SparkCore

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount2 {
  def main(args: Array[String]): Unit = {
    // connect to the Spark framework
    val wordCount: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val context: SparkContext = new SparkContext(wordCount)

    // read the data from the given directory
    val lines: RDD[String] = context.textFile("spark-core\\dates")

    // split the data into words
    val words: RDD[String] = lines.flatMap(_.split("\\s+"))

    // reshape each word into a (word, 1) pair
    val WordToOne: RDD[(String, Int)] = words.map(
      word => (word, 1)
    )

    // a Spark operator that combines grouping and aggregation in one step:
    // reduceByKey reduces the values of all records that share the same key
    val WordToCount: RDD[(String, Int)] = WordToOne.reduceByKey(_ + _)

    // collect the data to the driver
    val array: Array[(String, Int)] = WordToCount.collect()

    // print
    array.foreach(println)

    // close the connection
    context.stop()
  }
}
WordCount, version 2 (condensed)
package com.zxy.SparkCore

import org.apache.spark.{SparkConf, SparkContext}

object WordCount4 {
  def main(args: Array[String]): Unit = {
    // connect to the Spark framework
    val wordCount: SparkConf = new SparkConf().setMaster("local").setAppName("WordCount")
    val context: SparkContext = new SparkContext(wordCount)

    context.textFile("spark-core\\dates")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)

    // close the connection
    context.stop()
  }
}
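For completeness, the same result can also be obtained with a single action: countByValue groups and counts in one call and returns a local Map on the driver. A sketch, assuming the same context and input path as above:

context.textFile("spark-core\\dates")
  .flatMap(_.split("\\s+"))
  .countByValue()   // action: returns Map[String, Long] on the driver
  .foreach(println)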
Console output (screenshot omitted)
2. The Maven POM
I am using Scala 2.11 and Spark 2.4.7 here.
<dependencies>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scalap</artifactId>
    <version>2.11.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.7</version>
  </dependency>
</dependencies>
VIII. Spark Streaming -> WordCount
1. Consumer
package com.zxy.spark.Streaming

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

object wordCount {
  def main(args: Array[String]): Unit = {
    // Create the conf; usually local[*], i.e. use as many cores as are available
    // (streaming needs at least two threads: one for the receiver, one for processing)
    val conf: SparkConf = new SparkConf().setAppName("WordCount").setMaster("local[*]")

    // Create the StreamingContext with a 3-second batch interval
    val scc = new StreamingContext(conf, Seconds(3))

    // Read from a socket; port 9999 is where the Linux-side producer sends data
    val socketDS: ReceiverInputDStream[String] = scc.socketTextStream("192.168.130.110", 9999)

    val reduceDS: DStream[(String, Int)] = socketDS.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    /**
     * The line above, step by step:
     *
     * // flatten: split each input line on spaces
     * val flatMapDS: DStream[String] = socketDS.flatMap(_.split(" "))
     *
     * // reshape each word into a (word, 1) pair
     * val mapDS: DStream[(String, Int)] = flatMapDS.map((_, 1))
     *
     * // aggregate the pairs by key
     * val reduceDS: DStream[(String, Int)] = mapDS.reduceByKey(_ + _)
     */

    // Print the result. Note: this is the DStream print(), run once per batch
    reduceDS.print()

    // Start the receiver
    scc.start()

    // Do not stop the context right after starting it
    // scc.stop()

    // Wait for the computation to terminate; this keeps the application alive
    scc.awaitTermination()
  }
}
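When the application does eventually need to shut down (for example from a shutdown hook or a separate monitoring thread), a graceful stop lets in-flight batches finish first. A sketch, reusing scc from the code above:

// Stop the streaming context and the underlying SparkContext,
// letting already-received data finish processing before exiting
scc.stop(stopSparkContext = true, stopGracefully = true)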
2. POM dependency
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.11</artifactId>
  <!-- keep in line with the spark-core version (2.4.7 here) -->
  <version>2.4.7</version>
</dependency>
tips:
If this dependency is declared with <scope>provided</scope>, running the job locally fails with:
NoClassDefFoundError: org/apache/spark/streaming/StreamingContext
Removing the <scope>provided</scope> line puts the jars back on the runtime classpath and fixes the error.
3. Producer
Data can be pushed into a port in three common ways: nc, nmap, or telnet.
[root@hadoop ~]# yum install -y nc
[root@hadoop ~]# nc -lk 9999
You can now type data on this side and receive it on the consumer side, which recomputes the WordCount every 3 seconds whether or not anything was sent.
Time: 1624336719000 ms
(hive,1)
(word,1)
(hello,4)
(java,1)
(spark,1)