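Spark Movie Review Data Analysis
I. Data source
URL: https://grouplens.org/datasets/movielens/
II. Data structure
The analysis uses three tables: users, movies, and ratings. Their structure is as follows: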
1. Users table
Field        Type     Notes
User ID      String
Gender       String   “M” for male and “F” for female
Age          Int      Coded age range:
                      * 1: “Under 18”
                      * 18: “18-24”
                      * 25: “25-34”
                      * 35: “35-44”
                      * 45: “45-49”
                      * 50: “50-55”
                      * 56: “56+”
Occupation   String   Coded occupation:
                      * 0: “other” or not specified
                      * 1: “academic/educator”
                      * 2: “artist”
                      * 3: “clerical/admin”
                      * 4: “college/grad student”
                      * 5: “customer service”
                      * 6: “doctor/health care”
                      * 7: “executive/managerial”
                      * 8: “farmer”
                      * 9: “homemaker”
                      * 10: “K-12 student”
                      * 11: “lawyer”
                      * 12: “programmer”
                      * 13: “retired”
                      * 14: “sales/marketing”
                      * 15: “scientist”
                      * 16: “self-employed”
                      * 17: “technician/engineer”
                      * 18: “tradesman/craftsman”
                      * 19: “unemployed”
                      * 20: “writer”
Zip code     String
Sample record: 2::M::56::16::70072
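The record layout above can be turned into a typed value with a single split on the "::" separator. The following is a minimal sketch in plain Scala (no Spark needed); the names User, UserParsing, parseUser and ageLabels are illustrative assumptions and do not appear in the original code.

```scala
import scala.util.Try

// Hypothetical type for one line of users.dat; field types follow the table above.
final case class User(id: String, gender: String, age: Int, occupation: String, zip: String)

object UserParsing {
  // Age codes from the table above, mapped to the ranges they stand for.
  val ageLabels: Map[Int, String] = Map(
    1 -> "Under 18", 18 -> "18-24", 25 -> "25-34", 35 -> "35-44",
    45 -> "45-49", 50 -> "50-55", 56 -> "56+"
  )

  // Returns None when the line does not have exactly five fields
  // or when the age field is not an integer.
  def parseUser(line: String): Option[User] =
    line.split("::") match {
      case Array(id, gender, age, occupation, zip) =>
        Try(age.toInt).toOption.map(a => User(id, gender, a, occupation, zip))
      case _ => None
    }
}
```

Returning an Option rather than indexing into the array directly means a malformed line can be dropped with flatMap instead of failing the whole job.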
2. Movies table
Field        Type     Notes
Movie ID     String
Title        String
Genres       String   One or more of the following, separated by “|”:
                      * Action
                      * Adventure
                      * Animation
                      * Children's
                      * Comedy
                      * Crime
                      * Documentary
                      * Drama
                      * Fantasy
                      * Film-Noir
                      * Horror
                      * Musical
                      * Mystery
                      * Romance
                      * Sci-Fi
                      * Thriller
                      * War
                      * Western
Sample record: 2::Jumanji (1995)::Adventure|Children's|Fantasy
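A movie record needs two splits: one on "::" for the fields and one on "|" for the genres. The sketch below uses hypothetical names (Movie, MovieParsing, parseMovie) and assumes titles do not themselves contain "::". Note that a String argument to split is treated as a regular expression, where "|" means alternation, so the genres are split with the Char overload instead.

```scala
// Hypothetical type for one line of movies.dat.
final case class Movie(id: String, title: String, genres: List[String])

object MovieParsing {
  // Returns None when the line does not have exactly three "::"-separated fields.
  def parseMovie(line: String): Option[Movie] =
    line.split("::") match {
      case Array(id, title, genres) =>
        // split('|') takes a Char and splits literally; split("|") would be
        // read as a regex alternation and break the string up per character.
        Some(Movie(id, title, genres.split('|').toList))
      case _ => None
    }
}
```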
3. Ratings table
Field        Type     Notes
User ID      String   UserIDs range between 1 and 6040
Movie ID     String   MovieIDs range between 1 and 3952
Rating       Double   Ratings are made on a 5-star scale (whole-star ratings only)
Timestamp    String   Timestamp is represented in seconds since the epoch as returned by time(2)
Sample record: 1::1193::5::978300760
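Because the timestamp is in seconds since the epoch, it can be converted to a readable instant with java.time. This is a small sketch under that assumption; Rating, RatingParsing and parseRating are illustrative names, not part of the original code.

```scala
import java.time.Instant
import scala.util.Try

// Hypothetical type for one line of ratings.dat.
final case class Rating(userId: String, movieId: String, score: Double, time: Instant)

object RatingParsing {
  // Returns None when the line does not have exactly four fields,
  // or when the score or timestamp is not numeric.
  def parseRating(line: String): Option[Rating] =
    line.split("::") match {
      case Array(userId, movieId, score, seconds) =>
        Try(Rating(userId, movieId, score.toDouble, Instant.ofEpochSecond(seconds.toLong))).toOption
      case _ => None
    }
}
```

For the sample record, the timestamp 978300760 corresponds to 2000-12-31T22:12:40Z.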
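III. Requirements and implementation
3: Find the 10 movies watched most by male users and by female users, respectively (gender, movie title)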
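Before turning to Spark, the logic of this requirement can be sketched on plain Scala collections: attach a gender and a title to every rating, count ratings per (gender, title), and keep the largest counts for each gender. The sketch treats one rating as one viewing, which is an assumption; TopByGender and topMovies are hypothetical names. On pair RDDs the same steps would typically map to join and reduceByKey.

```scala
object TopByGender {
  // users: (userId, gender); ratings: (userId, movieId); movies: (movieId, title).
  // Returns, for each gender, the n most-rated titles with their rating counts.
  def topMovies(
      users: Seq[(String, String)],
      ratings: Seq[(String, String)],
      movies: Seq[(String, String)],
      n: Int
  ): Map[String, Seq[(String, Int)]] = {
    val genderOf = users.toMap
    val titleOf = movies.toMap

    // One (gender, title) pair per rating whose user and movie are both known.
    val pairs = ratings.flatMap { case (userId, movieId) =>
      for {
        gender <- genderOf.get(userId)
        title <- titleOf.get(movieId)
      } yield (gender, title)
    }

    // Count ratings per (gender, title), then keep the n largest counts per gender.
    // Ties are broken by title so the result is deterministic.
    pairs
      .groupBy(identity)
      .toSeq
      .map { case ((gender, title), hits) => (gender, (title, hits.size)) }
      .groupBy(_._1)
      .map { case (gender, rows) =>
        gender -> rows.map(_._2).sortBy { case (title, count) => (-count, title) }.take(n)
      }
  }
}
```

Calling topMovies with n = 10 on the parsed users, ratings and movies would give the two top-10 lists the requirement asks for.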
import org.apache.spark.{SparkConf, SparkContext}

object man {
  def main(args: Array[String]): Unit = {
    // Run locally; the application name is only a label.
    val conf = new SparkConf().setAppName("movie fg").setMaster("local")
    val sc = new SparkContext(conf)
    // Load the three MovieLens 1M files.
    val rdd1 = sc.textFile("C:\\Users\\Wikipedia\\Desktop\\dddd\\Spark\\实验作业\\Test\\ml-1m\\movies.dat")
    val rdd2 = sc.textFile("C:\\Users\\Wikipedia\\Desktop\\dddd\\Spark\\实验作业\\Test\\ml-1m\\ratings.dat")
    val rdd3 = sc.textFile("C:\\Users\\Wikipedia\\Desktop\\dddd\\Spark\\实验作业\\Test\\ml-1m\\users.dat")
    // (movie ID, movie title)
    val movie = rdd1.map { x => val line = x.split("::"); (line(0), line(1)) }
    // (user ID, (movie ID, count of 1))
    val rate = rdd2.map { x => val line = x.split("::"); (line(0), (line(1), 1)) }
    val user = rdd3.map {
x=>val line