数据分析实例:MovieLens电影数据分析
数据准备
数据集来源:grouplens.org/datasets/movielens/
下载 ml-1m.zip,read me 中有电影评分介绍
MovieLens 1M电影分级。 稳定的基准数据集。 6000个用户观看4000部电影时获得100万个评分。 发布2/2003。
数据读取
环境:ipython notebook
- 读取
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
user_names=['user_id','gender','age','occupation','zip']
users=pd.read_table('users.dat',sep='::',header=None,names=user_names,engine='python')
rating_names=['user_id','movie_id','rating','timestamp']
ratings=pd.read_table('ratings.dat',sep='::',header=None,names=rating_names,engine='python')
movie_names=['movie_id','title','genres']
movies=pd.read_table('movies.dat',sep='::',header=None,names=movie_names,engine='python')
users.head(5)
users.head(5)
Out[32]:
use_id gender age occupation zip
0 1 F 1 10 48067
1 2 M 56 16 70072
2 3 M 25 15 55117
3 4 M 45 7 02460
4 5 M 25 20 55455
In [33]:
ratings.head(5)
Out[33]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
In [38]:
movies.head(5)
Out[38]:
movie_id title genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
- 合并
data = pd.merge(pd.merge(ratings, users), movies)
Out[45]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama
data[data.user_id==1]
Out[46]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10