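Create the user, rating, and movie tables.
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('users.dat', sep='::', header=None, names=unames, engine='python')  # fields are split on '::', there is no header row, and the multi-character separator is handled by the Python engine
rating_names = ['user_id', 'movie_id', 'rating', 'timestamp']  # timestamp: when the rating was made
ratings = pd.read_table('ratings.dat', sep='::', header=None, names=rating_names, engine='python')
# The movie table is used below but never loaded; the file name, column names, and encoding are assumed from the MovieLens 1M layout.
movie_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('movies.dat', sep='::', header=None, names=movie_names, engine='python', encoding='latin-1')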
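The '::' separator is the one unusual detail in the loading step. Here is a minimal sketch of it on an in-memory sample instead of the real files. The two sample rows and the name `sample_users` are only for illustration.

```python
import io
import pandas as pd

# Two illustrative rows in the same '::'-separated layout as users.dat
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A separator longer than one character is not supported by the default
# C parser, so the Python engine is requested explicitly.
sample_users = pd.read_csv(sample, sep='::', header=None, names=unames, engine='python')
print(sample_users)
```

Without `header=None`, the first data row would be consumed as column names.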
Check the total number of records:
print(len(users))
6040
Look at the first 6 records:
print(users.head(6))
Merge the three tables so they can be analyzed together: first merge users with ratings, then merge the result with movies.
data = pd.merge(pd.merge(users, ratings), movies)
Look at all rows for the user whose user_id is 1:
print(data[data.user_id == 1])
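The nested merge works because `pd.merge`, when no `on=` argument is given, joins on the column names the two frames have in common. Here is a small sketch with made-up frames (the `*_demo` names and values are hypothetical, not the real dataset):

```python
import pandas as pd

users_demo = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_demo = pd.DataFrame({'user_id': [1, 1, 2],
                             'movie_id': [10, 20, 10],
                             'rating': [5, 3, 4]})
movies_demo = pd.DataFrame({'movie_id': [10, 20], 'title': ['Movie A', 'Movie B']})

# The inner merge joins on user_id (the only shared column),
# the outer merge joins on movie_id.
data_demo = pd.merge(pd.merge(users_demo, ratings_demo), movies_demo)

# A boolean mask selects the rows of a single user
print(data_demo[data_demo.user_id == 1])
```

Passing `on='user_id'` and `on='movie_id'` explicitly would behave the same here and makes the join keys visible to the reader.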
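Analyze how ratings differ by gender: use the movie title as the row index and gender as the column index, and take the mean rating.
rating_by_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
Then take the first 5 rows:
print(rating_by_gender.head(5))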
gender                                F         M
title
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
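To make the reshaping concrete, here is a toy sketch of the same kind of pivot on five made-up ratings (the frame `demo` and its values are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    'title':  ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating
by_gender = demo.pivot_table(values='rating', index='title',
                             columns='gender', aggfunc='mean')
print(by_gender)
```

Each distinct value of the `columns` field becomes its own column, which is why the real output has exactly the two columns F and M.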
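To see which movies men and women disagree on most, add a column holding the female mean minus the male mean:
rating_by_gender['diff'] = rating_by_gender['F'] - rating_by_gender['M']
print(rating_by_gender.head(5))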
gender                                F         M       diff
title
$1,000,000 Duck (1971)         3.375000  2.761905   0.613095
'Night Mother (1986)           3.388889  3.352941   0.035948
'Til There Was You (1997)      2.675676  2.733333  -0.057658
'burbs, The (1989)             2.793478  2.962085  -0.168607
...And Justice for All (1979)  3.828571  3.689024   0.139547
Sort by diff in descending order to list the movies that women rated highest relative to men:
print(rating_by_gender.sort_values(by='diff', ascending=False).head(5))
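The difference column and the sort can be sketched on a tiny made-up table (all names and numbers here are hypothetical). Bracket access is used for `'diff'` because `DataFrame.diff` is also a method name, so attribute access would not return the column.

```python
import pandas as pd

demo_means = pd.DataFrame(
    {'F': [3.5, 2.0, 4.0], 'M': [3.0, 3.5, 4.0]},
    index=pd.Index(['Movie A', 'Movie B', 'Movie C'], name='title'),
)

# Female mean minus male mean: positive means women rated the movie higher
demo_means['diff'] = demo_means['F'] - demo_means['M']

# Descending order puts the movies preferred by women first,
# ascending order puts the movies preferred by men first.
preferred_by_women = demo_means.sort_values(by='diff', ascending=False)
preferred_by_men = demo_means.sort_values(by='diff', ascending=True)
print(preferred_by_women)
```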
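II.
groupby is mainly used to split the data into groups and then run computations within each group.
First, count how many ratings each movie received: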
rating_by_title = data.groupby('title').size()
print(rating_by_title.head(5))
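As a rough illustration of the split-then-compute idea, the sketch below groups a made-up frame by title and runs two different per-group computations on it (the data and variable names are hypothetical):

```python
import pandas as pd

demo = pd.DataFrame({
    'title':  ['Movie A', 'Movie B', 'Movie A', 'Movie C', 'Movie A', 'Movie B'],
    'rating': [5, 3, 4, 2, 5, 4],
})

grouped = demo.groupby('title')
counts = grouped.size()            # number of rows (ratings) per title
means = grouped['rating'].mean()   # mean rating per title

print(counts.sort_values(ascending=False))
```

`size()` counts rows per group and returns a Series indexed by title, which is why it can be sorted and filtered directly in the steps that follow.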
Sort in descending order to list the most-rated (most popular) movies:
print(rating_by_title.sort_values(ascending=False).head(5))
title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672
dtype: int64
Compute the mean rating of each movie:
mean_rating = data.pivot_table(values='rating', index='title', aggfunc='mean')
Sort the movies by mean rating:
print(mean_rating.sort_values(by='rating', ascending=False).head(20))
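The per-movie mean can be computed with either pivot_table or groupby. The sketch below, on made-up data, shows that the two agree but return different types. With recent pandas versions, pivot_table returns a DataFrame with a 'rating' column here, which is why sort_values needs `by='rating'`.

```python
import pandas as pd

demo = pd.DataFrame({'title': ['Movie A', 'Movie A', 'Movie B'],
                     'rating': [4, 5, 2]})

# pivot_table gives a DataFrame with a single 'rating' column
via_pivot = demo.pivot_table(values='rating', index='title', aggfunc='mean')

# groupby on one column gives a Series with the same values
via_groupby = demo.groupby('title')['rating'].mean()

print(via_pivot.sort_values(by='rating', ascending=False))
print(via_groupby.sort_values(ascending=False))
```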
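The 10 most-rated movies:
top_10_hot = rating_by_title.sort_values(ascending=False).head(10)  # keep the Series; print() returns None
print(top_10_hot)
Treat a movie as popular when it has enough ratings; here, keep the movies with more than 1000 ratings:
hot_movies = rating_by_title[rating_by_title > 1000]
print(len(hot_movies))
print(hot_movies.head(10))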
207
title
2001: A Space Odyssey (1968)    1716
Abyss, The (1989)               1715
African Queen, The (1951)       1057
Air Force One (1997)            1076
Airplane! (1980)                1731
Aladdin (1992)                  1351
Alien (1979)                    2024
Aliens (1986)                   1820
Amadeus (1984)                  1382
American Beauty (1999)          3428
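hot_movies_rating = mean_rating.loc[hot_movies.index]  # mean_rating is a DataFrame, so rows are selected by title with .loc
top_10_good_movies = hot_movies_rating.sort_values(by='rating', ascending=False).head(10)  # best-rated among the popular movies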
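The last two steps, keeping only movies with enough ratings and then ranking them by mean rating, can be sketched end to end on made-up data. The threshold of 3 and every name ending in `_demo` are assumptions for this toy example only.

```python
import pandas as pd

demo = pd.DataFrame({
    'title':  ['Movie A'] * 3 + ['Movie B'] * 3 + ['Movie C'],
    'rating': [4, 5, 3, 2, 3, 4, 5],
})

counts_demo = demo.groupby('title').size()
mean_rating_demo = demo.pivot_table(values='rating', index='title', aggfunc='mean')

# Keep titles with at least 3 ratings (toy threshold, not the 1000 used above)
hot_demo = counts_demo[counts_demo >= 3]

# .loc selects rows by label; plain [] on a DataFrame would look for columns
hot_mean_demo = mean_rating_demo.loc[hot_demo.index]
best_hot_demo = hot_mean_demo.sort_values(by='rating', ascending=False)
print(best_hot_demo)
```

Movie C has the highest mean but only one rating, so the count filter removes it before ranking. That is the reason for filtering by popularity first.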