Spark Data Analysis Case Study: Computing a Top N
Approach
Read the data into a DataFrame, register the DataFrame as a temporary view, and then use sparkSession.sql to compute the Top N with a SQL statement.
Data and Goal
The data is the JSON shown below, which can be read directly into a DataFrame. The goal is to find the top two scores in each clazz.
{"name":"a","clazz":1,"score":80}
{"name":"b","clazz":1,"score":78}
{"name":"c","clazz":1,"score":95}
{"name":"d","clazz":2,"score":74}
{"name":"e","clazz":2,"score":92}
{"name":"f","clazz":3,"score":99}
{"name":"g","clazz":3,"score":99}
{"name":"h","clazz":3,"score":45}
{"name":"i","clazz":3,"score":55}
{"name":"j","clazz":3,"score":78}
Code Implementation
import org.apache.spark.sql.SparkSession

object ScoreAnalysis {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("ScoreAnalysis")
      .master("local").getOrCreate()
    val sc = sparkSession.sparkContext
    sc.setLogLevel("WARN")

    // Read the JSON file into a DataFrame and register it as a temporary view
    val jsonDF = sparkSession.read.json("score.txt")
    jsonDF.createOrReplaceTempView("student_score")

    // Rank scores within each clazz (highest first) and keep ranks 1 and 2
    sparkSession.sql(
      """select t.name, t.clazz, t.score, t.drp from
        |(select name, clazz, score,
        |dense_rank() over(partition by clazz order by score desc) drp
        |from student_score) t where t.drp <= 2""".stripMargin).show()

    sc.stop()
    sparkSession.close()
  }
}
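As an aside, the same ranking can be expressed with the DataFrame window API instead of SQL. A minimal sketch, assuming the jsonDF from above; note that dense_rank keeps ties, so clazz 3 yields both students with 99 at rank 1 plus the 78 at rank 2:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Window partitioned by clazz, ordered by score descending
val byClazz = Window.partitionBy("clazz").orderBy(col("score").desc)

// Add the rank column, then keep only the top two ranks per clazz
jsonDF.withColumn("drp", dense_rank().over(byClazz))
  .filter(col("drp") <= 2)
  .show()

Use row_number() instead of dense_rank() if you want exactly two rows per clazz even when scores tie.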
Spark UDF Functions
Implement a custom UDF that converts lowercase letters to uppercase.
import org.apache.spark.sql.SparkSession
import org.apache.s
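The snippet above is cut off. A minimal, self-contained sketch of such a UDF, reusing the score.txt data from the previous example, might look like this (the object name UdfDemo and the UDF name toUpper are illustrative choices, not from the original):

import org.apache.spark.sql.SparkSession

object UdfDemo {
  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("UdfDemo")
      .master("local").getOrCreate()

    // Register a UDF named "toUpper" that maps a string to its uppercase form
    sparkSession.udf.register("toUpper", (s: String) => s.toUpperCase)

    val jsonDF = sparkSession.read.json("score.txt")
    jsonDF.createOrReplaceTempView("student_score")

    // The registered UDF can then be called directly in SQL
    sparkSession.sql(
      "select toUpper(name) as name, clazz, score from student_score").show()

    sparkSession.close()
  }
}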