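Preparation
Open: https://grouplens.org/datasets/movielens/
Download:
Data analysis
Requirements:
- Load the data into pandas and merge the three tables
- Show all movies rated by user 1, with their ratings
- Compute each movie's mean rating from men and from women
- Find the movie with the largest gap between male and female ratings
- Find the most-watched (most popular) movie
- Find the highest-rated movie
- Popularity of the 10 highest-rated movies
- Ratings of the 10 most popular movies
- Top 10 good movies: more than 1000 views and a high rating
Python implementation: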
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Read the files
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
ratings_names = ['user_id', 'movie_id', 'rating', 'timestamp']
movie_names = ['movie_id', 'title', 'genres']
# engine='python' is required for the multi-character '::' separator.
# encoding='latin-1' is an assumption about movies.dat (some titles are not valid UTF-8);
# drop it if your copy of the file is UTF-8.
movies_data = pd.read_table('F:\\ml-1m\\movies.dat', sep='::', header=None, names=movie_names, engine='python', encoding='latin-1')
ratings_data = pd.read_table('F:\\ml-1m\\ratings.dat', sep='::', header=None, names=ratings_names, engine='python')
users_data = pd.read_table('F:\\ml-1m\\users.dat', sep='::', header=None, names=unames, engine='python')
# print(users_data.head())
# print(ratings_data.head())
# print(movies_data.head())

# Merge the three tables for analysis (joins on user_id, then on movie_id)
data = pd.merge(pd.merge(users_data, ratings_data), movies_data)
print(data.head())
print('All movies rated by user 1, with ratings:\n', data[data.user_id == 1])

# Mean rating of each movie from men and from women
ratings_by_gender = data.pivot_table(values='rating', index='title', columns='gender', aggfunc='mean')
print(ratings_by_gender)

# Movie with the largest gap between male and female ratings
ratings_by_gender['diff'] = ratings_by_gender.F - ratings_by_gender.M
# Order by the absolute gap so the largest difference in either direction comes first
diff_order = ratings_by_gender['diff'].abs().sort_values(ascending=False).index
print(ratings_by_gender.loc[diff_order])

# Most-watched movies (popularity = number of ratings)
ratings_by_data = data.groupby('title').size()
print(ratings_by_data.sort_values(ascending=False).head())

# Highest-rated movies
mean_ratings = data.pivot_table(values='rating', index='title', aggfunc='mean').sort_values(by='rating', ascending=False)
print(mean_ratings.head())

# Ratings of the 10 most popular movies
top10_hot = ratings_by_data.sort_values(ascending=False).head(10)
print(mean_ratings.loc[top10_hot.index])

# Popularity of the 10 highest-rated movies
mean_ratings_top10 = mean_ratings.head(10)
print(ratings_by_data.loc[mean_ratings_top10.index])

# Top 10 good movies: more than 1000 views and a high rating
hot_movies = ratings_by_data[ratings_by_data > 1000]
hot_movie_rating = mean_ratings.loc[hot_movies.index]
# head(10), not head(): the default would return only 5 rows
top10_good_movies = hot_movie_rating.sort_values(by='rating', ascending=False).head(10)
print(top10_good_movies)
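For readers who want to try the merge, pivot, and filter steps without downloading the dataset, here is a minimal sketch on a tiny made-up dataset. The toy tables, the `summarize` helper, and its `min_count` parameter are all hypothetical names introduced for illustration; they only mimic the column layout used above and are not part of MovieLens or pandas.

```python
import pandas as pd

# Hypothetical toy tables that mimic the users / ratings / movies layout
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4],
    'gender': ['F', 'M', 'F', 'M'],
})
movies = pd.DataFrame({
    'movie_id': [10, 20],
    'title': ['Movie A', 'Movie B'],
})
ratings = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2],
    'movie_id': [10, 10, 10, 10, 20, 20],
    'rating': [5, 3, 4, 2, 2, 4],
})


def summarize(users, ratings, movies, min_count=1):
    """Per-title rating count, mean, per-gender means and F - M gap.

    Only titles with at least `min_count` ratings are kept, sorted by mean rating.
    """
    # Join keys are spelled out instead of relying on shared column names
    data = pd.merge(pd.merge(users, ratings, on='user_id'), movies, on='movie_id')
    by_gender = data.pivot_table(values='rating', index='title',
                                 columns='gender', aggfunc='mean')
    # Count and mean in a single groupby pass
    stats = data.groupby('title')['rating'].agg(['count', 'mean'])
    stats = stats.join(by_gender)
    stats['diff'] = stats['F'] - stats['M']
    return stats[stats['count'] >= min_count].sort_values('mean', ascending=False)


summary = summarize(users, ratings, movies, min_count=3)
print(summary)
```

With `min_count=3` only 'Movie A' survives, because 'Movie B' has just two ratings. The same idea is what the 1000-view threshold does in the script above: it keeps a handful of enthusiastic ratings from pushing an obscure title to the top.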
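Results (output screenshots omitted):
- Reading and merging the data
- User 1
- Ratings by gender
- Gender difference
- Most popular movies
- Highest-rated movies
- Popularity of the 10 highest-rated movies
- Ratings of the 10 most popular movies
- Top 10 good movies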
IPython notebook code: http://localhost:8888/notebooks/movies_anlysis.ipynb#