1、 看过”Father of the Bride Part II (1995) ” 用户年龄和性别分布
2、 年龄段在”18-24” 的男性年轻人, 最喜欢看哪10部电影
3、 得分最高的10部电影
4、 看过电影最多的前10个人
5、 女性看过最多的10部电影
6、 男性看过最多的10部电影
7、 年龄段在”45-49” 的男性观众中, 得分最高的10部电影名称
8、 1995年最受欢迎的前3部电影的用户年龄和性别数量分布
9、moiveid=2116这部电影各年龄段的平均影评(年龄段,平均影评分)
10、统计最喜欢看电影的那位女士最喜欢的10部电影,观众对他们的平均影评分(电影名,影评)
三、实现:
importorg.apache.spark.sql.{DataFrame, Dataset, SparkSession}//写三个样本类,等会儿将三个文件中的数据弄成对象caseclass Movies(mid:String,mtitle:String,mTicai:String)caseclass Users(uid:String,usex:String,uage:Long,uzhiye:String,uzip:String)caseclass Ratings(uid:String,mid:String,rating:String,times:String)object Work {def main(args: Array[String]):Unit={val spark: SparkSession = SparkSession.builder().master("local[*]").appName("homework").getOrCreate()/*
* 数据:现在有三个数据文件
*
* 1,movies.dat :电影号::标题::题材
* 2,users.dat :用户号::性别:年龄:职业:邮政编码
* 3,ratings.dat :UerID::MoviesID::Rating::Timestamp
*
* *///1.首先导入数据val movie: Dataset[String]= spark.read.textFile("f://data//movie_data//movies.dat")val ratings: Dataset[String]= spark.read.textFile("f://data//movie_data//ratings.dat")val users: Dataset[String]= spark.read.textFile("f://data//movie_data//users.dat")//导入隐式类importspark.implicits._
//2.将数据转换成一张张类似于SQL的表格val moviesDS: Dataset[Movies]= movie.map(v =>{val arr: Array[String]= v.split("::")new Movies(arr(0), arr(1), arr(2))})
moviesDS.createTempView("t_movies")
spark.sql("select * from t_movies").show()val ratingDS: Dataset[Ratings]= ratings.map(v =>{val arr: Array[String]= v.split("::")new Ratings(arr(0), arr(1), arr(2), arr(3))})
ratingDS.createTempView("t_rating")
spark.sql("select * from t_rating").show()val userDS: Dataset[Users]= users.map(v =>{val arr: Array[String]= v.split("::")new Users(arr(0), arr(1), arr(2).toLong, arr(3),arr(4))})
userDS.createTempView("t_user")
spark.sql("select * from t_user").show()//问题1:看过”Lord of the Rings,The(1978) ” 用户年龄和性别分布//先根据电影的名字求出电影的ID,在根据电影ID求出看过这个电影的用户ID,在根据用户ID求出用户的性别和年龄分布var sql="""
|select case usex when 'M' then '男' else '女' end as usex,c.uage age
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where b.mtitle = 'Father of the Bride Part II (1995)'
|""".stripMargin
spark.sql(sql).show()// 2、 年龄段在”18-24” 的男性年轻人, 最喜欢看哪10部电影var sql2="""
|select b.mtitle,count(b.mtitle) count
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where a.uid in (select u.uid from t_user u where uage>=18 and uage <=24 and usex='M')
|group by a.mid,b.mtitle
|order by count desc
|limit 10
|""".stripMargin
spark.sql(sql2).show()//3、 得分最高的10部电影var sql3 ="""
|select b.mtitle,avg(rating) rate
|from t_rating a left join t_movies b on a.mid=b.mid
|group by a.mid,b.mtitle
|order by rate desc
|limit 10
|""".stripMargin
spark.sql(sql3).show()// 4、 看过电影最多的前10个人var sql4="""
|select uid,count(*) num
|from t_rating
|group by uid
|order by num desc
|limit 10
|""".stripMargin
spark.sql(sql4).show()// 女性看过最多的10部电影var sql5 ="""
|select b.mtitle,count(b.mtitle) count
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where a.uid in (select u.uid from t_user u where usex='F')
|group by a.mid,b.mtitle
|order by count desc
|limit 10
|""".stripMargin
spark.sql(sql5).show()// 6、 男性看过最多的10部电影var sql6 ="""
|select b.mtitle,count(b.mtitle) count
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where a.uid in (select u.uid from t_user u where usex='M')
|group by a.mid,b.mtitle
|order by count desc
|limit 10
|""".stripMargin
spark.sql(sql6).show()// 7、 年龄段在”45-49” 的男性观众中,得分最高的10部电影名称var sql7 ="""
|select b.mtitle,sum(rating) rate
|from t_rating a left join t_movies b on a.mid=b.mid
|where a.mid in (
| select a.mid
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where a.uid in (select u.uid from t_user u where uage>=45 and uage <=49 and usex='M')
|)
|group by a.mid,b.mtitle
|order by rate desc
|limit 10
|""".stripMargin
spark.sql(sql7).show()// 8、 1995年最受欢迎的前3部电影的用户年龄和性别数量分布var sql8="""
|select case usex when 'M' then '男' else '女' end as usex,c.uage age
|from t_rating a left join t_movies b on a.mid=b.mid left join t_user c on a.uid=c.uid
|where b.mid in (
|select a.mid
|from t_rating a left join t_movies b on a.mid=b.mid
|where substr(a.times,-5,4) = 1970
|group by a.mid
|order by avg(a.rating) desc
|limit 3
|)
|""".stripMargin
spark.sql(sql8).show()// 9、moiveid=2116这部电影各年龄段的平均影评(年龄段,平均影评分)var sql9 ="""
|select a.uage age, avg(b.rating) rate
|from t_user a join t_rating b on a.uid=b.uid
|where b.mid=2116
|group by a.uage
|order by uage
|""".stripMargin
spark.sql(sql9).show()// 10、统计最喜欢看电影的那位女士最喜欢的10部电影,观众对他们的平均影评分(电影名,影评)var sql10 ="""
|select mtitle,avg(Rating) rate
|from t_rating a left join t_movies b on a.mid=b.mid
|where a.mid in (
|select mid
|from t_rating
|where uid in (
|select uid
|from t_rating
|group by uid
|order by count(*) desc
|limit 1
|)
|group by mid
|order by count(*) desc
|limit 10
|)
|group by mtitle
|
|""".stripMargin
spark.sql(sql10).show()}}