I. Preparing the environment
1. The MovieLens 100K dataset
Download link for the dataset
2. Inspecting the data
Fields in u.user
# Fields, in order: user_id, age, gender, occupation, ZIP code
sulei@sulei:~/下载/ml-100k$ head -5 u.user
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
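As a rough sketch of how these pipe-delimited rows map onto named fields, the snippet below parses the five sample lines shown above in plain Python. The `User` tuple and the `parse_user` helper are illustrative names only; they are not part of the dataset or of Spark.

```python
from collections import namedtuple

# Hypothetical record type; the field names follow the comment above the sample.
User = namedtuple("User", ["user_id", "age", "gender", "occupation", "zip_code"])

# The five rows printed by `head -5 u.user`.
SAMPLE_USERS = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
]


def parse_user(line):
    """Split one u.user row on '|' and convert the numeric fields."""
    user_id, age, gender, occupation, zip_code = line.strip().split("|")
    # ZIP codes are kept as strings: they are identifiers, not quantities.
    return User(int(user_id), int(age), gender, occupation, zip_code)


users = [parse_user(line) for line in SAMPLE_USERS]
print(users[0])
```

Keeping the parsing in a small function like this makes it easy to reuse later inside an RDD `map` call.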
Fields in u.item
# Fields: movie_id, title, release date, plus several attributes covering the IMDb link and the movie's genres
sulei@sulei:~/下载/ml-100k$ head -5 u.item
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
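The trailing 0/1 columns are easier to read once they are mapped back to genre names. The sketch below does this for the first sample row. The `GENRES` list is an assumption: it is the order I believe the `u.genre` file in ml-100k uses, so it is worth checking against that file before relying on it. `parse_item` is a hypothetical helper name.

```python
# Assumed genre order (believed to match u.genre in ml-100k; verify before use).
GENRES = [
    "unknown", "Action", "Adventure", "Animation", "Children's", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western",
]

# First row printed by `head -5 u.item`.
SAMPLE_ITEM = (
    "1|Toy Story (1995)|01-Jan-1995||"
    "http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)"
    "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0"
)


def parse_item(line):
    """Return (movie_id, title, release_date, genres) for one u.item row."""
    fields = line.strip().split("|")
    movie_id, title, release_date = int(fields[0]), fields[1], fields[2]
    # The genre flags are the last len(GENRES) fields of the row.
    flags = fields[-len(GENRES):]
    genres = [name for name, flag in zip(GENRES, flags) if flag == "1"]
    return movie_id, title, release_date, genres


movie = parse_item(SAMPLE_ITEM)
print(movie)
```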
Fields in u.data
# Fields: user_id, movie_id, rating, timestamp
sulei@sulei:~/下载/ml-100k$ head -5 u.data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
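The last column is hard to interpret as a raw integer. Assuming it holds Unix seconds in UTC (an assumption about the dataset, not something shown in the sample itself), a minimal plain-Python sketch for turning the rows above into typed records could look like this. `Rating` and `parse_rating` are illustrative names.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical record type for one u.data row.
Rating = namedtuple("Rating", ["user_id", "movie_id", "rating", "rated_at"])

# The five rows printed by `head -5 u.data`.
SAMPLE_RATINGS = [
    "196 242 3 881250949",
    "186 302 3 891717742",
    "22 377 1 878887116",
    "244 51 2 880606923",
    "166 346 1 886397596",
]


def parse_rating(line):
    """Split one u.data row on whitespace (spaces or tabs) and convert types."""
    user_id, movie_id, rating, ts = line.split()
    # Assumption: the timestamp is seconds since the Unix epoch, in UTC.
    rated_at = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Rating(int(user_id), int(movie_id), int(rating), rated_at)


ratings = [parse_rating(line) for line in SAMPLE_RATINGS]
mean_rating = sum(r.rating for r in ratings) / len(ratings)
print(ratings[0].rated_at.isoformat(), mean_rating)
```

Using `line.split()` without an argument keeps the helper working whether the columns are separated by spaces or by tabs.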
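II. Simple data analysis with Spark
1. Start the pyspark shell
2. Load u.user and count the users, genders, occupations and ZIP codes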
# `sc` is the SparkContext that the pyspark shell creates on startup.
user_data = sc.textFile("/home/sulei/下载/ml-100k/u.user")
user_data.first()  # peek at the first row
# Split each pipe-delimited row into its five fields.
user_fields = user_data.map(lambda line: line.split("|"))
num_users = user_fields.map(lambda fields: fields[0]).count()
num_genders = user_fields.map(lambda fields: fields[2]).distinct().count()
num_occupations = user_fields.map(lambda fields: fields[3]).distinct().count()
num_zipcodes = user_fields.map(lambda fields: fields[4]).distinct().count()
# print() with parentheses works in both Python 2 and Python 3.
print("User: %d,genders:%d,occupations:%d,ZIP codes:%d" % (num_users, num_genders, num_occupations, num_zipcodes))
The output is:
User: 943,genders:2,occupations:21,ZIP codes:795
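To see what the `map(...).distinct().count()` chain computes without starting Spark, the same logic can be sketched in plain Python over the five sample rows from `head -5 u.user`. This is only an illustration on that small sample, so the counts differ from the full-dataset figures above; a set comprehension stands in for `distinct()`, and `SAMPLE_USERS` is a name introduced here.

```python
# The five rows printed by `head -5 u.user`, not the full 943-user file.
SAMPLE_USERS = [
    "1|24|M|technician|85711",
    "2|53|F|other|94043",
    "3|23|M|writer|32067",
    "4|24|M|technician|43537",
    "5|33|F|other|15213",
]

# Equivalent of user_data.map(lambda line: line.split("|")).
user_fields = [line.split("|") for line in SAMPLE_USERS]

# count() becomes len(); distinct().count() becomes the size of a set.
num_users = len(user_fields)
num_genders = len({fields[2] for fields in user_fields})
num_occupations = len({fields[3] for fields in user_fields})
num_zipcodes = len({fields[4] for fields in user_fields})

print("User: %d,genders:%d,occupations:%d,ZIP codes:%d"
      % (num_users, num_genders, num_occupations, num_zipcodes))
```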