Requirement 1: Among movies with more than 50 ratings, find the top ten by average rating, together with each movie's title and average rating. Output the result with the field structure below and save it in any text format (txt/csv) or store it in a database.
val sql1 =
  """
    |select
    |  r.movieId movie_id, count(r.movieId) rating_num, first(m.title) title,
    |  avg(r.rating) rating_avg,
    |  from_unixtime(max(r.timestamp), 'yyyy-MM-dd HH:mm:ss') update_time
    |from
    |  ratings r join movies m on r.movieId = m.movieId
    |group by
    |  r.movieId
    |having count(r.movieId) > 50
    |order by rating_avg desc
    |limit 10
    |""".stripMargin
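Before running this against Spark, the aggregation sql1 expresses (group per movie, keep movies with enough ratings, rank by average descending, take ten) can be sanity-checked on plain Scala collections. The ratings below are hypothetical, and the vote threshold is lowered from 50 to 1 only so the toy data qualifies:

```scala
// toy (movieId, rating) pairs standing in for the ratings table (hypothetical data)
val ratings = Seq((1, 4.0), (1, 5.0), (1, 3.0), (2, 2.0), (2, 4.0), (3, 5.0))
val minVotes = 1 // sql1 uses 50; lowered so the toy data has qualifying movies

val top = ratings
  .groupBy(_._1)                                                      // group by r.movieId
  .map { case (id, rs) => (id, rs.size, rs.map(_._2).sum / rs.size) } // rating_num, rating_avg
  .filter(_._2 > minVotes)                                            // having rating_num > threshold
  .toList
  .sortBy(-_._3)                                                      // order by rating_avg desc
  .take(10)                                                           // limit 10
```

With this data, movie 3 is dropped (only one rating) and movie 1 (average 4.0) ranks above movie 2 (average 3.0), mirroring the having/order by/limit pipeline of the SQL.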
Requirement 2: Find each movie genre and its average rating. Output the result with the field structure below and save it in any text format (txt/csv) or store it in a database.
val sql2 =
  """
    |select
    |  genre, avg(r.rating) rating_avg,
    |  from_unixtime(max(r.timestamp), 'yyyy-MM-dd HH:mm:ss') update_time
    |from
    |  ratings r join movies m on r.movieId = m.movieId
    |  lateral view explode(split(m.genres, '\\|')) g as genre
    |group by genre
    |""".stripMargin
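The genres column in movies.csv is a pipe-separated list, so a per-genre average needs one (genre, rating) row per genre of each rated movie first; that is what `explode(split(m.genres, '\\|'))` produces in the SQL. The same flattening can be sketched in plain Scala with hypothetical data:

```scala
// each movie carries a pipe-separated genre list (hypothetical data)
val movieGenres = Map(1 -> "Adventure|Comedy", 2 -> "Comedy")
val ratings = Seq((1, 4.0), (2, 2.0)) // (movieId, rating)

// one (genre, rating) row per genre of the rated movie -- the effect of explode(split(...))
val exploded = ratings.flatMap { case (id, r) =>
  movieGenres(id).split("\\|").map(g => (g, r))
}
val avgByGenre = exploded
  .groupBy(_._1)                                             // group by genre
  .map { case (g, rs) => g -> rs.map(_._2).sum / rs.size }   // avg(r.rating)
```

Here "Comedy" averages the ratings of both movies (3.0), while "Adventure" only sees movie 1 (4.0), which is exactly why grouping by movieId instead of by the exploded genre would not satisfy the requirement.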
Requirement 3: Find the ten most-rated movies. Output the result with the field structure below and save it in any text format (txt/csv) or store it in a database.
val sql3 =
  """
    |select
    |  r.movieId movie_id, first(m.title) title, count(r.userId) rating_num,
    |  from_unixtime(max(r.timestamp), 'yyyy-MM-dd HH:mm:ss') update_time
    |from
    |  ratings r join movies m on r.movieId = m.movieId
    |group by r.movieId
    |order by rating_num desc
    |limit 10
    |""".stripMargin
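The requirements allow saving as plain txt/csv; with Spark one would normally write the result directly, e.g. `spark.sql(sql3).write.option("header", true).csv(...)`. The target row layout itself can be sketched with a plain file write — the result rows and the output filename below are hypothetical:

```scala
import java.nio.file.{Files, Paths}

// hypothetical result rows in the required field order
val header = "movie_id,title,rating_num,update_time"
val rows = Seq(
  Seq("356", "Forrest Gump (1994)", "341", "2015-03-30 19:20:58"),
  Seq("296", "Pulp Fiction (1994)", "324", "2015-03-30 19:21:10")
)
val csv = (header +: rows.map(_.mkString(","))).mkString("\n")
Files.write(Paths.get("top10_by_count.csv"), csv.getBytes("UTF-8"))
```

Note that Spark's `csv` writer produces a directory of part files; `coalesce(1)` before writing yields a single file when the result is small, as it is here after `limit 10`.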
Environment setup
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  val conf: SparkConf = new SparkConf().setAppName("exercise2").setMaster("local[*]")
  val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
  // inferSchema makes rating/timestamp numeric so avg() and max() behave numerically
  val dfRating: DataFrame = spark.read.option("header", true).option("inferSchema", true).csv("exercise2/ratings.csv")
  val dfMovie: DataFrame = spark.read.option("header", true).option("inferSchema", true).csv("exercise2/movies.csv")
  // register temp views so the sql1/sql2/sql3 strings above can query them
  dfRating.createOrReplaceTempView("ratings")
  dfMovie.createOrReplaceTempView("movies")
  spark.stop() // run the queries (e.g. spark.sql(sql1)) before stopping
}