Big Data (8i): Spark Practice on TopN

2401_84186067

Published 2024-05-10 10:37:31


Article link: https://blog.csdn.net/2401_84186067/article/details/138655066


("2020", "foshan", "Blade Master", "B"),
("2020", "foshan", "Warden", "B"),
("2020", "shenzhen", "Archmage", "D"),
("2020", "guangzhou", "Lich", "C"),
("2020", "foshan", "Mountain King", "B"),
("2021", "guangzhou", "Demon Hunter", "A"),
("2021", "foshan", "Blade Master", "C"),
("2021", "foshan", "Warden", "C"),
("2021", "shenzhen", "Death Knight", "D"),
("2021", "guangzhou", "Paladin", "D"),
("2021", "foshan", "Blade Master", "D"),
("2021", "foshan", "Wind Runner", "C"),
("2021", "guangzhou", "Crypt Lord", "D"),
).toDF("time", "city", "user", "advertisement").createTempView("t0")
// Count clicks per city and advertisement
spark.sql(
"""
|SELECT city,advertisement,count(0) clicks FROM t0
|GROUP BY city,advertisement
|""".stripMargin).createTempView("t1")
// Window function: partition by city and rank by click count within each partition
spark.sql(
"""
|SELECT
| city,
| advertisement,
| clicks,
| RANK() OVER(PARTITION BY city ORDER BY clicks DESC) AS r
|FROM t1
|""".stripMargin).createTempView("t2")
// Keep the top 2 ranks
spark.sql("SELECT city,advertisement,clicks FROM t2 WHERE r<3").show()



> Printed output  
> ![](https://img-blog.csdnimg.cn/20210720132741585.png)
> 
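One detail of the query above is worth keeping in mind: `RANK()` gives tied rows the same rank, so `WHERE r<3` can return more than two rows per city. The sketch below mimics that behaviour with plain Scala collections rather than Spark. The rows in `t1` are made-up sample values, not the data behind the screenshot, and `RankSketch` and `ranked` are hypothetical names.

```scala
object RankSketch {
  // Hypothetical (city, advertisement, clicks) rows standing in for table t1
  val t1: Seq[(String, String, Int)] = Seq(
    ("foshan", "B", 3),
    ("foshan", "C", 2),
    ("foshan", "D", 2),
    ("guangzhou", "A", 3),
    ("guangzhou", "C", 2),
    ("guangzhou", "D", 1)
  )

  // RANK(): 1 + number of rows in the same city with strictly more clicks
  def ranked(rows: Seq[(String, String, Int)]): Seq[(String, String, Int, Int)] =
    rows.map { case (city, ad, clicks) =>
      val r = 1 + rows.count { case (c, _, n) => c == city && n > clicks }
      (city, ad, clicks, r)
    }

  def main(args: Array[String]): Unit =
    // foshan keeps three rows (ranks 1, 2, 2); guangzhou keeps two (ranks 1, 2)
    ranked(t1).filter(_._4 < 3).foreach(println)
}
```

If exactly two rows per city are required, `ROW_NUMBER()` is the usual substitute for `RANK()`, at the cost of breaking ties arbitrarily.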


## Requirement: Top 2 provinces by click count


### Data

// Create a SparkConf object and set the configuration
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("A").setMaster("local[8]")
// Create the SparkContext, through which Spark accesses the cluster
val sc = new SparkContext(conf)
// Create the data: city codes whose first two digits are the province code
val r0 = sc.makeRDD(Seq(
  4401, 4401, 4401, 4401, 4401, 4401, 4401,
  4401, 4401, 4401, 4401, 4401, 4401,
  4406, 4406, 4406, 4406, 4406, 4406, 4406, 4406,
  4602, 4602, 4601,
  4301, 4301,
))


### Method 1: reduceByKey on province

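// Count clicks per province (first two digits of the city code)
val r1 = r0.map(a => (a.toString.slice(0, 2), 1)).reduceByKey(_ + _)
// Print the elements of each partition
// (toList consumes the iterator, so collect here only triggers the job)
r1.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// Province TopN
r1.sortBy(-_._2).take(2).foreach(println)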
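Sorting everything and then calling `take(2)` orders every record even though only two are needed. RDDs also provide `top` and `takeOrdered` for this case. The sketch below shows the general bounded-heap idea behind a TopN in plain Scala, without Spark. It is an illustration under that assumption, not Spark's actual implementation, and `TopNSketch` and `topN` are hypothetical names.

```scala
import scala.collection.mutable

object TopNSketch {
  // Keep only the n pairs with the largest counts, using a min-heap of size n
  def topN(pairs: Iterator[(String, Int)], n: Int): List[(String, Int)] = {
    // Reversed ordering puts the smallest count at the head of the queue
    val heap = mutable.PriorityQueue.empty[(String, Int)](
      Ordering.by[(String, Int), Int](_._2).reverse)
    pairs.foreach { p =>
      heap.enqueue(p)
      if (heap.size > n) heap.dequeue() // drop the current smallest
    }
    heap.toList.sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    // Province totals computed from the article's data
    val totals = Iterator(("43", 2), ("44", 21), ("46", 3))
    topN(totals, 2).foreach(println) // (44,21) then (46,3)
  }
}
```

The heap never holds more than `n + 1` elements, so memory stays bounded no matter how many keys are scanned.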


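### Method 2: reduceByKey on city first, then on province


Reducing by city first spreads the work over more keys, which improves parallelism and mitigates data skew.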

// reduceByKey on city
val r1 = r0.map((_, 1)).reduceByKey(_ + _)
// Print the elements of each partition
r1.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// reduceByKey on province
val r2 = r1.map(t => (t._1.toString.slice(0, 2), t._2)).reduceByKey(_ + _)
// Print the elements of each partition
r2.mapPartitionsWithIndex((pId, iter) => {
  println("Partition " + pId + " elements: " + iter.toList)
  iter
}).collect
// Province TopN
r2.sortBy(-_._2).take(2).foreach(println)
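The two-stage aggregation above can be checked without a cluster. The following is a minimal sketch that uses plain Scala collections in place of RDDs, so it says nothing about partitioning or parallelism. `TwoStageSketch`, `byCity` and `byProvince` are hypothetical names.

```scala
object TwoStageSketch {
  // Same city codes as r0; the first two digits are the province code
  val codes: Seq[Int] =
    Seq.fill(13)(4401) ++ Seq.fill(8)(4406) ++ Seq(4602, 4602, 4601) ++ Seq(4301, 4301)

  // Stage 1: count clicks per city
  val byCity: Map[Int, Int] =
    codes.groupBy(identity).map { case (city, hits) => city -> hits.size }

  // Stage 2: fold the city counts into province counts
  val byProvince: Map[String, Int] =
    byCity.toSeq
      .groupBy { case (city, _) => city.toString.slice(0, 2) }
      .map { case (prov, cities) => prov -> cities.map(_._2).sum }

  def main(args: Array[String]): Unit =
    byProvince.toSeq.sortBy(-_._2).take(2).foreach(println) // (44,21) then (46,3)
}
```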


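### Output


Elements in each partition after reduceByKey on city

Partition 4 elements: List()
Partition 3 elements: List()
Partition 7 elements: List()
Partition 0 elements: List()
Partition 2 elements: List((4602,2))
Partition 6 elements: List((4406,8))
Partition 5 elements: List((4301,2))
Partition 1 elements: List((4401,13), (4601,1))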


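Elements in each partition after reduceByKey on province

Partition 5 elements: List()
Partition 3 elements: List()
Partition 1 elements: List()
Partition 4 elements: List()
Partition 6 elements: List()
Partition 7 elements: List((43,2))
Partition 0 elements: List((44,21))
Partition 2 elements: List((46,3))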
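The partition numbers in the listings above follow from hash partitioning: `reduceByKey` without an explicit partitioner falls back to a `HashPartitioner`, which places a key by its `hashCode` modulo the number of partitions (8 here, as the listings show). A small sketch in plain Scala reproduces the placement. `PartitionSketch` and `partitionOf` are hypothetical names, not Spark APIs.

```scala
object PartitionSketch {
  // Non-negative modulo of the key's hashCode, as a hash partitioner computes it
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def main(args: Array[String]): Unit = {
    // City keys are Ints (hashCode is the value itself); province keys are Strings
    val keys: Seq[Any] = Seq(4401, 4601, 4602, 4301, 4406, "44", "46", "43")
    keys.foreach(k => println(s"$k -> partition ${partitionOf(k, 8)}"))
  }
}
```

For example, 4401 and 4601 both leave a remainder of 1 when divided by 8, which is why they share partition 1, while `"44".hashCode` is 1664, a multiple of 8, so province 44 lands in partition 0.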


Result

(44,21)
(46,3)


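## TopN with a custom partitioner


A custom partitioner can mitigate data skew by spreading a hot key over several partitions, but a second aggregation is then needed to merge the partial results.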
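The idea can be sketched without Spark: scatter the hot key over several partitions, sum within each partition, then sum the partial results once more. The sketch below uses plain Scala collections and the routing rule used in this section (key `"44"` goes to a random partition in 0 to 6, every other key to partition 7). It is a simplification that routes each record individually, and `SkewSketch`, `route`, `partialSums`, `totals` and the fixed seed are assumptions made for illustration.

```scala
import scala.util.Random

object SkewSketch {
  private val random = new Random(42) // fixed seed, only to make runs repeatable

  // Hot key "44" is scattered over partitions 0-6; all other keys share partition 7
  def route(key: String): Int = key match {
    case "44" => random.nextInt(7)
    case _    => 7
  }

  // First aggregation: partial sums per (partition, key)
  def partialSums(records: Seq[(String, Int)]): Map[(Int, String), Int] =
    records
      .map { case (k, v) => ((route(k), k), v) }
      .groupBy(_._1)
      .map { case (pk, vs) => pk -> vs.map(_._2).sum }

  // Second aggregation: merge the partial sums of each key
  def totals(partials: Map[(Int, String), Int]): Map[String, Int] =
    partials.toSeq
      .groupBy { case ((_, k), _) => k }
      .map { case (k, vs) => k -> vs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val records = (Seq.fill(21)("44") ++ Seq.fill(3)("46") ++ Seq.fill(2)("43")).map((_, 1))
    val partials = partialSums(records)
    partials.toSeq.sortBy(_._1).foreach(println) // partial sums per (partition, key)
    totals(partials).toSeq.sortBy(-_._2).take(2).foreach(println) // (44,21) then (46,3)
  }
}
```

Whatever partitions the random routing picks, the second aggregation yields the same totals, which is why the final TopN stays correct.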

import org.apache.spark.{HashPartitioner, Partitioner, SparkConf, SparkContext}

import scala.util.Random

class MyPartitioner extends Partitioner {
  val random: Random = new Random
  // Total number of partitions
  override def numPartitions: Int = 8
  // Partition by key; key "44" is assumed to be the skewed one
  override def getPartition(key: Any): Int = key match {
    case "44" => random.nextInt(7)
    case _ => 7
  }
}

object Hello {
  def main(args: Array[String]): Unit = {
    // Create a SparkConf object and set the configuration
    val conf = new SparkConf().setAppName("A").setMaster("local[8]")
    // Create the SparkContext, through which Spark accesses the cluster
    val sc = new SparkContext(conf)
    // Create the data
    val r0 = sc.makeRDD(Seq(
      44, 44, 44, 44, 44, 44, 44,
      44, 44, 44, 44, 44, 44,
      44, 44, 44, 44, 44, 44, 44, 44,
      46, 46, 46,
      43, 43,
    ))
    // Map each record to (province, 1)
    val r1 = r0.map(a => (a.toString.slice(0, 2), 1))
    // First aggregation, using the custom partitioner
    val r2 = r1.reduceByKey(partitioner = new MyPartitioner, func = _ + _)
    // Print the elements of each partition
    r2.mapPartitionsWithIndex((pId, iter) => {
      println("Partition " + pId + " elements: " + iter.toList)
      iter
    }).collect
    // Second aggregation
    val r3 = r2.reduceByKey(partitioner = new HashPartitioner(1), func = _ + _)
    // Print the elements of each partition
    r3.mapPartitionsWithIndex((pId, iter) => {
      println("Partition " + pId + " elements: " + iter.toList)