Data Preparation
1 Download the data
Link: https://pan.baidu.com/s/165de8xKYl0QBq8lGzYGW6g  Password: brb9
Link: https://pan.baidu.com/s/1-jxcAYoybNV5TYL7xbzi9A  Password: id59
2 Upload to HDFS
[root@node1 data]# hdfs dfs -put ml-1m/ input
[root@node1 data]# hdfs dfs -ls input/ml-1m
3 Data formats
(1) users.dat
UserID::Gender::Age::Occupation::Zip-code
User table layout:
user ID::gender::age::occupation::zip code
(2) movies.dat
MovieID::Title::Genres
Movie table layout:
movie ID::title::genres
(3) ratings.dat
UserID::MovieID::Rating::Timestamp
Ratings table layout:
user ID::movie ID::rating::timestamp
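As a quick check of the layouts above, each record can be split on the "::" delimiter with plain Scala (no Spark needed). The user and rating lines below are the first records shown later in the spark-shell session; the movie line is illustrative only (its genre field is not taken from the file):

```scala
// "::"-delimited sample records for the three tables
val userLine   = "1::F::1::10::48067"                           // users.dat
val movieLine  = "2116::Lord of the Rings, The (1978)::Fantasy" // illustrative
val ratingLine = "1::1193::5::978300760"                        // ratings.dat

// split("::") yields the individual fields
val userFields   = userLine.split("::")   // Array(userID, gender, age, occupation, zip)
val movieFields  = movieLine.split("::")  // Array(movieID, title, genres)
val ratingFields = ratingLine.split("::") // Array(userID, movieID, rating, timestamp)

println(s"user ${userFields(0)}: gender=${userFields(1)}, age=${userFields(2)}")
println(s"movie ${movieFields(0)}: ${movieFields(1)}")
println(s"user ${ratingFields(0)} rated movie ${ratingFields(1)} a ${ratingFields(2)}")
```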
4 Load the data in Spark
scala> val usersRdd =sc.textFile("input/ml-1m/users.dat")
scala> usersRdd.first
res7: String = 1::F::1::10::48067
scala> usersRdd.count
scala> val moviesRdd=sc.textFile("input/ml-1m/movies.dat")
scala> moviesRdd.first
scala> moviesRdd.count
scala> val ratingsRdd=sc.textFile("input/ml-1m/ratings.dat")
scala> ratingsRdd.first
scala> ratingsRdd.count
5 Age and gender distribution of a movie's viewers
(1) First, to avoid a three-table query, we determine the ID of the movie "Lord of the Rings, The (1978)" in advance.
Searching movies.dat shows that its ID is 2116,
so we can define a constant:
scala> val MOVIE_ID="2116"
MOVIE_ID: String = 2116
(2) From the user table we only need the age and gender, with the user ID serving as the join field. So it is enough to keep the first three fields: the user ID becomes the key, and (gender, age) the value.
scala> val users=usersRdd.map(_.split("::")).map{x => (x(0),(x(1),x(2)))}
users: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[9] at map at <console>:26
scala> users.take(10)
res5: Array[(String, (String, String))] = Array((1,(F,1)), (2,(M,56)), (3,(M,25)), (4,(M,45)), (5,(M,25)), (6,(F,50)), (7,(M,35)), (8,(M,25)), (9,(M,25)), (10,(F,35)))
(3) From the ratings table we only need the user ID and movie ID, which are then filtered by the constant MOVIE_ID = "2116".
scala> val rating =ratingsRdd.map(_.split("::"))
rating: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[10] at map at <console>:26
scala> rating.first
res9: Array[String] = Array(1, 1193, 5, 978300760)
scala> val userMovie=rating.map{x=>(x(0),x(1))}.filter(_._2.equals(MOVIE_ID))
userMovie: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[12] at filter at <console>:30
scala> userMovie.first
res0: (String, String) = (17,2116)
Note: for a key/value RDD, _._1 refers to the key and _._2 to the value.
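The same accessors work on plain Scala tuples, so the pattern can be tried without Spark; a minimal sketch (the (17, 2116) pair matches the record shown above, the others are made up):

```scala
// (userID, movieID) pairs shaped like the userMovie RDD above
val pairs = List(("17", "2116"), ("18", "1193"), ("21", "2116"))

val keys   = pairs.map(_._1) // _._1 selects the key (userID)
val values = pairs.map(_._2) // _._2 selects the value (movieID)

// the same accessor drives the filter on MOVIE_ID
val watched = pairs.filter(_._2 == "2116")
println(watched)
```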
(4) Join the processed ratings table with the processed user table. Note that rdd1[key, value1] join rdd2[key, value2] yields [key, (value1, value2)]: the key is the join field, and the value combines the values of the two RDDs.
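The [key, (value1, value2)] result shape can be reproduced on local Scala collections; a minimal sketch (the (749, (M, 35)) entry matches the sample output below, the other entries are made up):

```scala
// userMovie-style pairs: (userID, movieID), already filtered to one film
val userMovie = List(("749", "2116"), ("17", "2116"))
// users-style pairs: userID -> (gender, age); user 749's values match the
// sample output, the rest are made up for illustration
val users = Map("749" -> ("M", "35"), "17" -> ("F", "25"), "99" -> ("M", "50"))

// An inner join keeps keys present on both sides and pairs up the values:
// the result shape is (key, (value1, value2)), as with rdd1.join(rdd2)
val joined = userMovie.collect {
  case (id, movie) if users.contains(id) => (id, (movie, users(id)))
}
println(joined)
```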
scala> val userRating =userMovie.join(users)
userRating: org.apache.spark.rdd.RDD[(String, (String, (String, String)))] = MapPartitionsRDD[11] at join at <console>:36
scala> userRating.first
res3: (String, (String, (String, String))) = (749,(2116,(M,35)))
(5) Process the join result: count users by (gender, age).
scala> val userDistribution=userRating.map{x=>(x._2._2,1)}.reduceByKey(_+_)
userDistribution: org.apache.spark.rdd.RDD[((String, String), Int)] = ShuffledRDD[13] at reduceByKey at <console>:38
scala> userDistribution.foreach(println)
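On local collections, the map-plus-reduceByKey counting step corresponds to a groupBy followed by summing the 1s; a minimal sketch with made-up (gender, age) values:

```scala
// the (gender, age) part of each joined record, as produced by the map above
val genderAge = List(("M", "35"), ("F", "25"), ("M", "35"), ("M", "18"))

// emit (key, 1) pairs, then sum per key — the reduceByKey(_ + _) pattern
val distribution = genderAge
  .map(x => (x, 1))
  .groupBy(_._1)
  .map { case (k, ones) => (k, ones.map(_._2).sum) }

println(distribution)
```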
6 Spark code for this analysis

import org.apache.spark._

/**
 * Gender and age distribution of the users who watched
 * "Lord of the Rings, The (1978)"
 */
object MovieUserAnalyzer {
  def main(args: Array[String]) {
    if (args.length < 1) {
      println("Usage: MovieUserAnalyzer dataPath")
      System.exit(1)
    }
    val conf = new SparkConf().setAppName("MovieUserAnalyzer")
    val sc = new SparkContext(conf)
    // 1. Load the data and create the RDDs
    val MOVIE_ID = "2116"
    val usersRdd = sc.textFile(args(0) + "/users.dat")
    val ratingsRdd = sc.textFile(args(0) + "/ratings.dat")
    // 2. Parse the user table: RDD[(userID, (gender, age))]
    val users = usersRdd.map(_.split("::")).map { x =>
      (x(0), (x(1), x(2)))
    }
    // 3. Parse the ratings table: RDD[Array(userID, movieID, rating, timestamp)]
    val rating = ratingsRdd.map(_.split("::"))
    // usermovie: RDD[(userID, movieID)], filtered to MOVIE_ID only
    val usermovie = rating.map { x => (x(0), x(1)) }.filter(_._2.equals(MOVIE_ID))
    // 4. Join the RDDs
    // userRating: RDD[(userID, (movieID, (gender, age)))]
    val userRating = usermovie.join(users)
    // userDistribution: RDD[((gender, age), count)]
    val userDistribution = userRating.map { x => (x._2._2, 1) }.reduceByKey(_ + _)
    // 5. Print the result
    userDistribution.collect.foreach(println)
    sc.stop()
  }
}
7 The 10 favorite movies of men aged 18-24

import org.apache.spark._
import scala.collection.immutable.HashSet

/**
 * The 10 movies watched most by men in the 18-24 age bracket
 */
object PopularMovieAnalyzer {
  def main(args: Array[String]) {
    var dataPath = "d:\\data\\ml-1m"
    val conf = new SparkConf().setAppName("PopularMovieAnalyzer")
    if (args.length > 0) {
      dataPath = args(0)
    } else {
      conf.setMaster("local[1]") // run locally when no path is given
    }
    val sc = new SparkContext(conf)
    // 1. Load the data and create the RDDs
    val USER_AGE = "18" // MovieLens age code 18 denotes the 18-24 bracket
    val usersRdd = sc.textFile(dataPath + "/users.dat")
    val moviesRdd = sc.textFile(dataPath + "/movies.dat")
    val ratingsRdd = sc.textFile(dataPath + "/ratings.dat")
    // 2. Extract columns from the RDDs
    // 2.1 users: RDD[(userID, age)], men in the target age bracket only
    val users = usersRdd.map(_.split("::"))
      .filter(x => x(1).equals("M")) // keep male users, as the task requires
      .map { x => (x(0), x(2)) }
      .filter(_._2.equals(USER_AGE))
    // 2.2 collect the matching user IDs
    val userlist = users.map(_._1).collect()
    // 2.3 ++ adds the elements to the set in bulk
    val userSet = HashSet() ++ userlist
    // 2.4 broadcast userSet to all executors
    val broadcastUserSet = sc.broadcast(userSet)
    // 3. Map-side join of the RDDs
    val topKmovies = ratingsRdd.map(_.split("::"))
      .map { x => (x(0), x(1)) }                             // RDD[(userID, movieID)]
      .filter { x => broadcastUserSet.value.contains(x._1) } // keep users in the broadcast set
      .map { x => (x._2, 1) }                                // RDD[(movieID, 1)]
      .reduceByKey(_ + _)                                    // RDD[(movieID, n)]
      .map { x => (x._2, x._1) }                             // RDD[(n, movieID)]
      .sortByKey(false)                                      // sort descending
      .map { x => (x._2, x._1) }                             // RDD[(movieID, n)]
      .take(10)
    // 4. Translate movie IDs into titles
    // 4.1 build a Map[movieID, title]
    val movieID2Name = moviesRdd.map(_.split("::"))
      .map { x => (x(0), x(1)) }
      .collect()
      .toMap
    // 4.2 getOrElse(key, default) returns the value for key, or default if the key is absent
    topKmovies.map(x => (movieID2Name.getOrElse(x._1, null), x._2))
      .foreach(println)
    sc.stop()
  }
}
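The broadcast variable turns the join into a cheap set-membership test on each ratings row (a map-side join). The filter/count/rank pipeline can be sketched on local collections with made-up data:

```scala
// made-up (userID, movieID) rating rows; userSet plays the broadcast role
val userSet = Set("1", "2")
val rows = List(("1", "260"), ("2", "260"), ("3", "260"),
                ("1", "1196"), ("2", "1196"), ("1", "2116"))

// keep rows whose user is in the set, count per movie, rank descending
val topK = rows
  .filter { case (user, _) => userSet.contains(user) } // map-side "join"
  .groupBy(_._2)                                       // movieID -> its rows
  .map { case (movie, rs) => (movie, rs.size) }        // (movieID, n)
  .toList
  .sortBy(-_._2)                                       // descending by count
  .take(2)

println(topK)
```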
8 Top 10 highest-rated movies; top 10 users who rated the most movies; top 10 movies watched most by women; top 10 movies watched most by men

import org.apache.spark._

/**
 * Top 10 highest-rated movies and top 10 users who rated the most movies.
 * (The two gender-based lists follow the same counting pattern after a
 * join with users.dat; only the first two queries are shown below.)
 */
object TopKMovieAnalyzer {
  def main(args: Array[String]) {
    var dataPath = "d:\\data\\ml-1m"
    val conf = new SparkConf().setAppName("TopKMovieAnalyzer")
    if (args.length > 0) {
      dataPath = args(0)
    } else {
      conf.setMaster("local[1]")
    }
    val sc = new SparkContext(conf)
    /**
     * Step 1: Create RDDs
     */
    val DATA_PATH = dataPath
    val ratingsRdd = sc.textFile(DATA_PATH + "/ratings.dat")
    /**
     * Step 2: Extract columns from RDDs
     */
    // ratings: RDD[(userID, movieID, rating)]
    val ratings = ratingsRdd.map(_.split("::"))
      .map { x => (x(0), x(1), x(2)) }
      .cache
    /**
     * Step 3: Analyze the results.
     * reduceByKey reduces the values of all elements that share the same key,
     * so each key is collapsed to a single reduced value, which is then
     * paired with the key to form a new key/value record.
     */
    // Top 10 highest-rated movies (by average rating)
    ratings.map { x => (x._2, (x._3.toInt, 1)) }                  // RDD[(movieID, (rating, 1))]
      .reduceByKey { (v1, v2) => (v1._1 + v2._1, v1._2 + v2._2) } // RDD[(movieID, (ratingSum, n))]
      .map { x => (x._2._1.toFloat / x._2._2.toFloat, x._1) }     // RDD[(avgRating, movieID)]
      .sortByKey(false)
      .take(10)
      .foreach(println)
    // Top 10 users who rated the most movies
    ratings.map { x => (x._1, 1) } // RDD[(userID, 1)]
      .reduceByKey(_ + _)          // RDD[(userID, n)]
      .map(x => (x._2, x._1))      // RDD[(n, userID)]
      .sortByKey(false)
      .take(10)
      .foreach(println)
    sc.stop()
  }
}
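The (sum, count) accumulator behind the average-rating query avoids averaging averages; the pattern can be checked locally with made-up (movieID, rating) pairs:

```scala
// made-up (movieID, rating) pairs
val ratingRows = List(("2116", 4), ("2116", 5), ("1193", 3))

// accumulate (ratingSum, count) per movie, then divide once at the end —
// the same shape as the reduceByKey { (v1, v2) => ... } step above
val avgByMovie = ratingRows
  .map { case (movie, r) => (movie, (r, 1)) }
  .groupBy(_._1)
  .map { case (movie, rs) =>
    val (sum, n) = rs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    (movie, sum.toFloat / n)
  }

println(avgByMovie)
```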