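Part 1: Find the five movies most popular with women
Approach: load the data with read_table, join the tables with merge, filter out movies with 250 or fewer ratings, then compute each movie's mean rating by gender.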
step1
import pandas as pd
movies=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\movies.dat',sep='::',
    header=None,names=['movie_id','title','genres'],engine='python')
users=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\users.dat',sep='::',
    header=None,names=['user_id','gender','age','occupation','zip'],engine='python')
ratings=pd.read_table(r'C:\Users\Administrator\Downloads\pydata-book-2nd-edition\datasets\movielens\ratings.dat',sep='::',
    header=None,names=['user_id','movie_id','rating','timestamp'],engine='python')
# The .dat files use '::' as the separator and have no header row, so each one is loaded with read_table() and given explicit column names.
# Each table comes from its own file (movies.dat, users.dat, ratings.dat); a multi-character separator needs engine='python'.
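To try the '::' parsing without the dataset on disk, the same options can be applied to an in-memory string. This is only a minimal sketch: the three sample rows and the name `sample_movies` are made up for illustration, and `read_csv` with an explicit `sep` is used as a stand-in for `read_table`.

```python
import io
import pandas as pd

# Hypothetical three-row sample laid out like movies.dat (no header row).
sample = (
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
    "3::Grumpier Old Men (1995)::Comedy|Romance\n"
)

# A separator longer than one character is handled by the Python engine;
# naming the engine explicitly avoids the fallback warning.
sample_movies = pd.read_csv(
    io.StringIO(sample),
    sep="::",
    header=None,
    names=["movie_id", "title", "genres"],
    engine="python",
)
print(sample_movies)
```

The `|` characters inside the genres field are left untouched because only `::` is treated as a delimiter.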
step2
movies.info()
RangeIndex: 3883 entries, 0 to 3882
Data columns (total 3 columns):
movie_id 3883 non-null int64
title 3883 non-null object
genres 3883 non-null object
dtypes: int64(1), object(2)
memory usage: 91.1+ KB
users.info()
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
user_id 6040 non-null int64
gender 6040 non-null object
age 6040 non-null int64
occupation 6040 non-null int64
zip 6040 non-null object
dtypes: int64(3), object(2)
memory usage: 236.0+ KB
ratings.info()
RangeIndex: 1000209 entries, 0 to 1000208
Data columns (total 4 columns):
user_id 1000209 non-null int64
movie_id 1000209 non-null int64
rating 1000209 non-null int64
timestamp 1000209 non-null int64
dtypes: int64(4)
memory usage: 30.5 MB
# Print the basic information for each table (row count, columns, dtypes)
step3
data=pd.merge(pd.merge(users,ratings,how='inner'),movies,how='inner')
data[:5]
   user_id gender  age  occupation    zip  movie_id  rating  timestamp                                   title genres
0        1      F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      M   56          16  70072      1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      M   25          12  32793      1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      M   25           7  22903      1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      M   50           1  95350      1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)  Drama
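The nested merge in step3 leaves the join keys implicit: with no `on` argument, pandas joins on whatever column names the two frames share (user_id first, then movie_id). The sketch below uses small made-up frames (`users_demo`, `ratings_demo`, `movies_demo` are hypothetical names) to show the same join with the keys spelled out, and checks that both forms give the same result.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
users_demo = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings_demo = pd.DataFrame(
    {"user_id": [1, 1, 2], "movie_id": [10, 20, 10], "rating": [5, 3, 4]}
)
movies_demo = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# Explicit keys: users -> ratings on user_id, then -> movies on movie_id.
merged = pd.merge(
    pd.merge(users_demo, ratings_demo, on="user_id", how="inner"),
    movies_demo,
    on="movie_id",
    how="inner",
)

# Implicit keys, as in step3: pandas infers them from overlapping column names.
implicit = pd.merge(pd.merge(users_demo, ratings_demo), movies_demo)

print(merged)
print(merged.equals(implicit))
```

Naming the keys is mostly a safeguard: if two tables ever gained another column with the same name, the implicit form would silently start joining on it as well.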
step4
rating_counts=data['title'].value_counts()
# Count how many ratings each movie received (sorted from most to fewest)
index=rating_counts.index[rating_counts>250]
# Keep only the titles with more than 250 ratings
index[:5]
Index(['American Beauty (1999)', 'Star Wars: Episode IV - A New Hope (1977)',
'Star Wars: Episode V - The Empire Strikes Back (1980)',
'Star Wars: Episode VI - Return of the Jedi (1983)',
'Jurassic Park (1993)'])
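The filtering in step4 is a boolean mask applied to the index of the counts. A minimal sketch on made-up titles, with a threshold of 1 standing in for 250 (all names here are hypothetical):

```python
import pandas as pd

# Hypothetical title column: "A" rated three times, "B" twice, "C" once.
titles = pd.Series(["A", "A", "A", "B", "B", "C"])

# value_counts() returns counts indexed by title, largest count first.
counts = titles.value_counts()

threshold = 1  # stands in for the 250 used in step4
# The comparison yields a boolean Series; indexing counts.index with it
# keeps only the titles whose count exceeds the threshold.
popular = counts.index[counts > threshold]
print(list(popular))
```

Because `value_counts()` sorts by count, the first entries of the filtered index are the most frequently rated titles, which is why `index[:5]` in step4 lists the most-rated movies.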
step5
table_rating=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean').loc[index]
# Pivot with title as the row index and gender as the columns, taking the mean of rating; .loc[index] drops movies with 250 or fewer ratings
rating_order=table_rating.sort_values(by='F',ascending=False)
# Sort by the female mean rating, highest first
rating_order.head()
gender F M
Close Shave, A (1995) 4.644444 4.473795
Wrong Trousers, The (1993) 4.588235 4.478261
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
Schindler's List (1993) 4.562602 4.491415
# These are the five movies with the highest mean rating among women
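The pivot-and-sort of step5 can be checked by hand on a tiny frame. This is a sketch on invented ratings (the frame `toy` and the result names are hypothetical), not the MovieLens data:

```python
import pandas as pd

# Hypothetical ratings: two movies, each rated by women and men.
toy = pd.DataFrame(
    {
        "title": ["A", "A", "A", "B", "B", "B"],
        "gender": ["F", "M", "M", "F", "F", "M"],
        "rating": [5, 3, 4, 2, 4, 5],
    }
)

# One row per title, one column per gender, each cell the mean rating.
mean_by_gender = toy.pivot_table(
    "rating", index="title", columns="gender", aggfunc="mean"
)

# Highest female mean first, as in step5.
top_for_women = mean_by_gender.sort_values(by="F", ascending=False)
print(top_for_women)
```

Movie A has a female mean of 5.0 and a male mean of 3.5, while movie B has 3.0 and 5.0, so A sorts first.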
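Part 2: Find the movies with the largest gender gap that are rated higher by men
Approach: add a 'diff' column holding each movie's male mean rating minus its female mean rating; the larger the value, the more strongly male viewers prefer the movie.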
step1
rating_order['diff']=rating_order['M']-rating_order['F']
rating_order.head()
gender F M diff
Close Shave, A (1995) 4.644444 4.473795 -0.170650
Wrong Trousers, The (1993) 4.588235 4.478261 -0.109974
Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589 -0.108060
Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075 -0.178032
Schindler's List (1993) 4.562602 4.491415 -0.071187
step2
rating_order.sort_values(by='diff',ascending=False).head()
gender F M diff
Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
Dumb & Dumber (1994) 2.697987 3.336595 0.638608
Longest Day, The (1962) 3.411765 4.031447 0.619682
Cable Guy, The (1996) 2.250000 2.863787 0.613787
# A larger diff means a wider gap between the genders, with men rating the movie higher
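Since diff is M minus F, the sort direction decides which audience comes out on top. The sketch below uses invented mean ratings (`means` and the derived names are hypothetical) and `assign`, which returns a new frame instead of adding the column in place:

```python
import pandas as pd

# Hypothetical per-movie mean ratings by gender.
means = pd.DataFrame(
    {"F": [4.5, 3.0, 2.5], "M": [4.0, 3.8, 2.6]},
    index=["A", "B", "C"],
)

# diff > 0: men rate the movie higher; diff < 0: women rate it higher.
with_diff = means.assign(diff=means["M"] - means["F"])

# Largest positive diff first: movies preferred by men.
male_pref = with_diff.sort_values(by="diff", ascending=False)

# Most negative diff first: movies preferred by women.
female_pref = with_diff.sort_values(by="diff")

print(male_pref)
print(female_pref)
```

Sorting on the absolute value of diff would instead rank movies by the size of the gap regardless of which gender prefers them.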
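Part 3: Find the movies with the least disagreement (ignoring gender)
Approach: use std() to compute the standard deviation of each movie's ratings, sort, and take the movies with the smallest standard deviation.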
step1
rating_std=data.groupby('title')['rating'].std().loc[index]
# Standard deviation of the ratings for each movie; .loc[index] drops movies with 250 or fewer ratings
step2
rating_std.sort_values()[:5]
title
Close Shave, A (1995) 0.667143
Rear Window (1954) 0.688946
Great Escape, The (1963) 0.692585
Shawshank Redemption, The (1994) 0.700443
Wrong Trousers, The (1993) 0.708666
# These are the five movies whose ratings disagree the least
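A small worked case shows what the standard deviation measures here. The ratings below are invented (`toy` and `std_by_title` are hypothetical names); note that pandas' `std()` computes the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical ratings: viewers agree on movie A and disagree on movie B.
toy = pd.DataFrame(
    {
        "title": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "rating": [4, 4, 5, 4, 1, 5, 3, 5],
    }
)

# Sample standard deviation (ddof=1) of the ratings for each title.
std_by_title = toy.groupby("title")["rating"].std()

# Smallest spread first: the movie with the least disagreement leads.
print(std_by_title.sort_values())
```

For movie A the mean is 4.25 and the squared deviations sum to 0.75, so the sample standard deviation is sqrt(0.75 / 3) = 0.5; movie B's ratings are far more spread out, so it sorts last.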