Spark implementation: using ratings.dat and movies.dat to list the movies whose average rating is above 4.0

Both data files can be downloaded from: http://grouplens.org/datasets/movielens

Note: this post uses the ml-1m dataset; look for it on the page linked above.

Sample data:

ratings.dat: (user id, movie id, rating, timestamp), fields separated by ::

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
1::3408::4::978300275
1::2355::5::978824291
1::1197::3::978302268
1::1287::5::978302039
1::2804::5::978300719
2::2628::3::978300051
2::1103::3::978298905
2::2916::3::978299809
2::3468::5::978298542
3::2167::5::978297600
3::1580::3::978297663
3::3619::2::978298201
3::260::5::978297512
3::2858::4::978297039
4::1214::4::978294260
4::1036::4::978294282
4::260::5::978294199
4::2028::5::978294230
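
To make the record layout concrete, here is a minimal sketch of turning one ratings.dat line into a typed value. The names `Rating` and `RatingParser` are made up for illustration and do not appear in the program further down; the sketch only assumes the four-field layout shown above.

```scala
// Hypothetical record type for one line of ratings.dat.
final case class Rating(userId: Int, movieId: Int, score: Int, timestamp: Long)

object RatingParser {
  // Returns None for lines that do not consist of exactly four numeric fields.
  def parse(line: String): Option[Rating] =
    line.split("::") match {
      case Array(u, m, s, t) =>
        scala.util.Try(Rating(u.toInt, m.toInt, s.toInt, t.toLong)).toOption
      case _ => None
    }
}
```

Returning an Option lets a caller drop malformed lines instead of failing on them, which plays the same role as the length check in the program below.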

movies.dat: (movie id, title, genres), fields separated by ::

1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
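
A movies.dat line can be parsed the same way. The sketch below also splits the genre field, which the program in this post does not need; `Movie` and `MovieParser` are hypothetical names used only here.

```scala
// Hypothetical record type for one line of movies.dat.
final case class Movie(movieId: Int, title: String, genres: List[String])

object MovieParser {
  // Returns None unless the line has exactly three fields and a numeric id.
  def parse(line: String): Option[Movie] =
    line.split("::") match {
      case Array(id, title, genres) =>
        scala.util.Try(id.toInt).toOption
          // split('|') takes a Char; split("|") would treat "|" as a regex alternation.
          .map(n => Movie(n, title, genres.split('|').toList))
      case _ => None
    }
}
```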

Code example:

import org.apache.spark.{SparkConf, SparkContext}

object MovieAverage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("movie").setMaster("local")
    val sc = new SparkContext(conf)
    val movieRate = sc.textFile("D://spark/ml-1m/ratings.dat")

    // Average rating per movie id, keeping only averages above 4.0
    val movieScore = movieRate
      .map(_.split("::"))
      .filter(_.length == 4)                  // skip malformed lines
      .map(st => (st(1).toInt, st(2).toInt))  // (movie id, rating)
      .groupByKey()
      .map { case (movieId, ratings) =>       // average for each movie
        var num = 0
        var total = 0.0
        for (r <- ratings) {
          total = total + r
          num = num + 1
        }
        (movieId, total / num)
      }
      .filter { case (_, avg) => avg > 4.0 }

    // (movie id, title) pairs from movies.dat
    val movieInfo = sc.textFile("D://spark/ml-1m/movies.dat")
    val movieTitles = movieInfo
      .map(_.split("::"))
      .filter(_.length == 3)
      .map(ss => (ss(0).toInt, ss(1)))

    // Join the two datasets on movie id and keep (title, average)
    val result = movieTitles.join(movieScore).map { case (_, (title, avg)) => (title, avg) }

    result.collect().foreach(println)
    sc.stop()
  }
}
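
The averaging step can also be expressed as a running (sum, count) pair per movie. The sketch below shows that idea on an ordinary Scala collection so it runs without a cluster; `AverageSketch`, `merge` and `highRated` are illustrative names, not part of the program above.

```scala
object AverageSketch {
  // Running (sum of ratings, number of ratings) for one movie.
  type Acc = (Double, Int)

  // Combines two partial results; associative, so partial sums can be merged in any order.
  def merge(a: Acc, b: Acc): Acc = (a._1 + b._1, a._2 + b._2)

  // pairs holds (movie id, rating); returns movie id -> average for averages above threshold.
  def highRated(pairs: Seq[(Int, Int)], threshold: Double): Map[Int, Double] =
    pairs
      .groupBy { case (id, _) => id }
      .map { case (id, rs) =>
        val (sum, n) = rs.map { case (_, r) => (r.toDouble, 1) }.foldLeft((0.0, 0))(merge)
        (id, sum / n)
      }
      .filter { case (_, avg) => avg > threshold }
}
```

On a pair RDD, the same `merge` function could presumably be passed to `reduceByKey` after a `mapValues(r => (r.toDouble, 1))`, which combines ratings on each partition before the shuffle instead of collecting every rating of a movie into one group first. That variant has not been run against the ml-1m data here.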


Output:

(American Movie (1999),4.013559322033898)
(Harmonists, The (1997),4.142857142857143)
(Bells, The (1926),4.5)
(Toy Story (1995),4.146846413095811)
(Ayn Rand: A Sense of Life (1997),4.125)
(Nights of Cabiria (Le Notti di Cabiria) (1957),4.207207207207207)
(Maya Lin: A Strong Clear Vision (1994),4.101694915254237)
(My Life as a Dog (Mitt liv som hund) (1985),4.1454545454545455)
(West Side Story (1961),4.057818659658344)
(Three Days of the Condor (1975),4.040752351097178)
(Crumb (1994),4.063136456211812)
(Raging Bull (1980),4.1875923190546525)
(Three Colors: Red (1994),4.227544910179641)
(Manon of the Spring (Manon des sources) (1986),4.259090909090909)
(Who's Afraid of Virginia Woolf? (1966),4.074074074074074)
(Wallace & Gromit: The Best of Aardman Animation (1996),4.426940639269406)
(Body Heat (1981),4.031746031746032)
(Shall We Dance? (1937),4.1657142857142855)
(Red Sorghum (Hong Gao Liang) (1987),4.015384615384615)
(Sunset Blvd. (a.k.a. Sunset Boulevard) (1950),4.491489361702127)
(Farewell My Concubine (1993),4.082677165354331)
(Callejón de los milagros, El (1995),4.5)
(Return with Honor (1998),4.4)
(Eighth Day, The (Le Huitième jour ) (1996),4.25)
(Ponette (1996),4.068493150684931)
(Double Indemnity (1944),4.415607985480944)
(To Kill a Mockingbird (1962),4.425646551724138)
(M*A*S*H (1970),4.124658780709736)
(Untouchables, The (1987),4.007985803016859)
(General, The (1927),4.368932038834951)
(Godfather, The (1972),4.524966261808367)
(Star Wars: Episode V - The Empire Strikes Back (1980),4.292976588628763)
(Matewan (1987),4.141242937853107)
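
The list above comes out in no particular order. If a ranking is wanted, the collected rows could be sorted by average before printing; this is a small sketch on a plain Seq, and `TopRated` is a hypothetical helper that is not part of the program.

```scala
object TopRated {
  // rows holds (title, average); returns the n rows with the highest averages, best first.
  def top(rows: Seq[(String, Double)], n: Int): Seq[(String, Double)] =
    rows.sortBy { case (_, avg) => -avg }.take(n)
}
```

For example, `TopRated.top(result.collect().toSeq, 10).foreach(println)` would print ten of the highest-rated titles. Spark's `sortBy` on the RDD with `ascending = false` should give a similar ordering without collecting everything first.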

This program was written as Spark practice; feedback from fellow bloggers is welcome.


