The raw data is as follows:
Requirement: count accesses per uid, by day.
The main method:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, desc}

object TopNStatJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TopNStatJob")
      // Keep partition columns such as "day" as strings instead of inferring a numeric type
      .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
      .master("local[2]").getOrCreate()
    val accessDF = spark.read.format("parquet").load("file:///E:/test/clean")
    // accessDF.printSchema()
    accessDF.show(false)
    // Most popular TopN by netType
    netTypeAccessTopNStat(spark, accessDF)
    spark.stop()
  }
}
Approach 1: using the DataFrame API
/**
 * Most popular TopN by netType
 * @param spark    the active SparkSession
 * @param accessDF the cleaned access log data
 */
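// Requires: import org.apache.spark.sql.functions.{count, desc}
def netTypeAccessTopNStat(spark: SparkSession, accessDF: DataFrame): Unit = {
  // Keep only wifi accesses on 20190702, count them per (day, uid), most frequent first
  val wifiAccessTopNDF = accessDF
    .filter(accessDF.col("day") === "20190702" && accessDF.col("netType") === "wifi")
    .groupBy("day", "uid")
    .agg(count("uid").as("times"))
    .orderBy(desc("times"))
  wifiAccessTopNDF.show(false)
}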
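Note that the query above sorts every uid but does not cut the result off at N. A minimal sketch of one way to do that is to append `limit(n)` after the sort. The object name `TopNLimitSketch`, the helper `wifiTopN`, and the sample rows are made up for illustration; only the column names `day`, `uid` and `netType` come from the code above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, desc}

object TopNLimitSketch {
  // Hypothetical helper: count wifi accesses per (day, uid) for one day
  // and keep only the n uids with the highest counts.
  def wifiTopN(accessDF: DataFrame, day: String, n: Int): DataFrame =
    accessDF
      .filter(accessDF.col("day") === day && accessDF.col("netType") === "wifi")
      .groupBy("day", "uid")
      .agg(count("uid").as("times"))
      .orderBy(desc("times"))
      .limit(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopNLimitSketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the cleaned Parquet data.
    val accessDF = Seq(
      ("20190702", "u1", "wifi"),
      ("20190702", "u1", "wifi"),
      ("20190702", "u1", "wifi"),
      ("20190702", "u2", "wifi"),
      ("20190702", "u2", "wifi"),
      ("20190702", "u3", "wifi"),
      ("20190702", "u3", "4g"),
      ("20190703", "u4", "wifi")
    ).toDF("day", "uid", "netType")

    wifiTopN(accessDF, "20190702", 2).show(false)
    spark.stop()
  }
}
```

When several uids share the same count, which of them survives the cut is not deterministic unless a second sort key (for example `uid`) is added to `orderBy`.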
Approach 2: using Spark SQL
def netTypeAccessTopNStat(spark: SparkSession, accessDF: DataFrame): Unit = {
  // Register the DataFrame as a temporary view so it can be queried with SQL
  accessDF.createOrReplaceTempView("access_logs")
  val wifiAccessTopNDF = spark.sql(
    """select day, uid, count(1) as times
      |from access_logs
      |where day = '20190702' and netType = 'wifi'
      |group by day, uid
      |order by times desc""".stripMargin)
  wifiAccessTopNDF.show(false)
}
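The SQL above is pinned to a single day. If the goal is the top N uids for every day at once, one common option is a `row_number()` window partitioned by `day`. The following is only a sketch under that assumption: the object name `TopNPerDaySketch`, the helper `wifiTopNPerDay`, the aliases `counted`, `ranked` and `rn`, and the sample rows are all hypothetical and not part of the original job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TopNPerDaySketch {
  // Hypothetical helper: count wifi accesses per (day, uid), rank the uids
  // within each day by that count, and keep the first n ranks per day.
  def wifiTopNPerDay(spark: SparkSession, accessDF: DataFrame, n: Int): DataFrame = {
    accessDF.createOrReplaceTempView("access_logs")
    spark.sql(
      s"""select day, uid, times from (
         |  select day, uid, times,
         |         row_number() over (partition by day order by times desc) as rn
         |  from (
         |    select day, uid, count(1) as times
         |    from access_logs
         |    where netType = 'wifi'
         |    group by day, uid
         |  ) counted
         |) ranked
         |where rn <= $n
         |order by day, times desc""".stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TopNPerDaySketch")
      .master("local[2]")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows standing in for the cleaned Parquet data.
    val accessDF = Seq(
      ("20190702", "u1", "wifi"),
      ("20190702", "u1", "wifi"),
      ("20190702", "u2", "wifi"),
      ("20190703", "u3", "wifi"),
      ("20190703", "u4", "wifi"),
      ("20190703", "u4", "wifi"),
      ("20190703", "u4", "4g")
    ).toDF("day", "uid", "netType")

    wifiTopNPerDay(spark, accessDF, 1).show(false)
    spark.stop()
  }
}
```

`row_number()` breaks ties arbitrarily; `rank()` or `dense_rank()` could be used instead if tied uids should all be kept.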
Both approaches produce the same ranking, sorted by access count in descending order. The console prints the following: