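Analyzing Movie Data with pandas
Life is short, use Python.
Python is a natural choice for data analysis, and pandas is one of its most important data-analysis packages. Other important packages include numpy and matplotlib.
Install pandas (the same on Linux, Mac, and Windows):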
pip install pandas
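Movie data source: http://grouplens.org/datasets/movielens/
Download and extract the data file. It contains the following 4 files:
- users.dat: user data
- movies.dat: movie data
- ratings.dat: rating data
- README: description of the files
The README file describes the format of the source data files: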
- users.dat (UserID::Gender::Age::Occupation::Zip-code)
- movies.dat (MovieID::Title::Genres)
- ratings.dat (UserID::MovieID::Rating::Timestamp)
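Notes: Occupation is the user's occupation, Zip-code is the postal code, Timestamp is the time of the rating, and Genres is the movie's genres (see the README file for more details).
The fields in each record are separated by ::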
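To make the record layout concrete, here is a minimal sketch that splits one raw line on the `::` separator using plain Python. The helper name `parse_record`, the lowercase field names, and the sample line are illustrative assumptions, not something defined by the data set.

```python
# Field names follow the users.dat layout given in the README; the
# lowercase spellings are an illustrative choice.
USER_FIELDS = ["user_id", "gender", "age", "occupation", "zip"]


def parse_record(line, field_names, sep="::"):
    """Split one raw line on the separator and pair values with field names."""
    values = line.rstrip("\n").split(sep)
    if len(values) != len(field_names):
        raise ValueError(
            "expected {} fields, got {}".format(len(field_names), len(values))
        )
    return dict(zip(field_names, values))


# Hypothetical sample line in the users.dat layout.
sample_line = "1::F::1::10::48067"
record = parse_record(sample_line, USER_FIELDS)
print(record)
```

Every value comes back as a string here. pandas performs the same split and additionally infers column types, which is why it is used for the real files.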
Environment:
- OS: Windows
- Language: Python 3.4
- Editor: Jupyter
Read the data with pandas.
Import the necessary packages:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
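First define the field names, because the source data has no header row, only records whose fields are separated by '::'.
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip'] # field names for the users table
Read the data, making sure the path points to where the source file is stored.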
users = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\users.dat', sep='::', header=None, names=user_names)
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
if __name__ == '__main__':
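The warning above can be ignored. It only says that the data was loaded with the Python engine instead of the C engine, because the C engine does not support regex separators such as '::'. As the message itself notes, specifying engine='python' avoids the warning.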
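As a minimal sketch of the fix the warning suggests, the snippet below passes `engine='python'` explicitly. To stay self-contained it reads a few sample lines from an in-memory buffer instead of the real `users.dat`. The sample rows and the `dtype={'zip': str}` argument are assumptions added for illustration: with such a small sample pandas would otherwise parse zip codes as integers and drop leading zeros.

```python
import io

import pandas as pd

# Hypothetical sample in the users.dat layout, standing in for the real file.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "3::M::25::15::55117\n"
    "4::M::45::7::02460\n"
    "5::M::25::20::55455\n"
)

user_names = ["user_id", "gender", "age", "occupation", "zip"]

# engine="python" is requested up front, so pandas does not need to fall
# back from the C engine and emits no ParserWarning.
users_sample = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=user_names,
    engine="python",
    dtype={"zip": str},  # keep zip codes as text (assumption for this sample)
)
print(users_sample)
```

`pd.read_table` accepts the same `sep`, `header`, `names`, and `engine` arguments, so the same keyword can be added to the calls in this article.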
Check how many records there are and look at the first 5 rows.
print(len(users))
users.head()
6040
| | user_id | gender | age | occupation | zip |
---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
Read in the movies and ratings data the same way.
ratings_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\ratings.dat', sep='::', header=None, names=ratings_names)
movies_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\movies.dat', sep='::', header=None, names=movies_names)
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
from ipykernel import kernelapp as app
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
Loading takes a moment because the ratings file holds about a million records.
Look at the ratings and movies tables.
print(len(ratings))
ratings.head()
1000209
| | user_id | movie_id | rating | timestamp |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
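The `timestamp` column is hard to read as a raw integer. The sketch below converts it to a datetime, assuming the values are seconds since the Unix epoch (check the README to confirm). The `rated_at` column name is a hypothetical choice, and the small frame is built by hand from rows shown above rather than loaded from `ratings.dat`.

```python
import pandas as pd

# A few ratings rows copied from the table above.
ratings_sample = pd.DataFrame(
    {
        "user_id": [1, 1, 1],
        "movie_id": [1193, 661, 914],
        "rating": [5, 3, 3],
        "timestamp": [978300760, 978302109, 978301968],
    }
)

# Assumption: timestamps are seconds since 1970-01-01 00:00:00 UTC.
ratings_sample["rated_at"] = pd.to_datetime(ratings_sample["timestamp"], unit="s")
print(ratings_sample[["user_id", "movie_id", "rating", "rated_at"]])
```

Under that assumption the first rating was made on 2000-12-31 at 22:12:40 UTC.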
print(len(movies))
movies.head()
3883
| | movie_id | title | genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation\|Children’s\|Comedy |
1 | 2 | Jumanji (1995) | Adventure\|Children’s\|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
There are just over 1 million movie ratings.
Merge the 3 tables into a single table, data.
data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head()
1000209
| | user_id | gender | age | occupation | zip | movie_id | rating | timestamp | title | genres |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
1 | 2 | M | 56 | 16 | 70072 | 1193 | 5 | 978298413 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
2 | 12 | M | 25 | 12 | 32793 | 1193 | 4 | 978220179 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
3 | 15 | M | 25 | 7 | 22903 | 1193 | 4 | 978199279 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
4 | 17 | M | 50 | 1 | 95350 | 1193 | 5 | 978158471 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
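When `pd.merge` is called without an `on` argument, it joins on the column names the two frames share: `user_id` for users and ratings, then `movie_id` for the result and movies. The sketch below spells those keys out on hand-built frames copied from rows shown above. The `_s` variable names are illustrative, and the explicit `on=`/`how=` form is offered as an equivalent alternative, not as the article's own code.

```python
import pandas as pd

# Tiny stand-ins for the three tables, using rows shown in the article.
users_s = pd.DataFrame(
    {
        "user_id": [1, 2],
        "gender": ["F", "M"],
        "age": [1, 56],
        "occupation": [10, 16],
        "zip": ["48067", "70072"],
    }
)
ratings_s = pd.DataFrame(
    {
        "user_id": [1, 2],
        "movie_id": [1193, 1193],
        "rating": [5, 5],
        "timestamp": [978300760, 978298413],
    }
)
movies_s = pd.DataFrame(
    {
        "movie_id": [1193],
        "title": ["One Flew Over the Cuckoo's Nest (1975)"],
        "genres": ["Drama"],
    }
)

# Implicit form: join keys are inferred from the shared column names.
implicit = pd.merge(pd.merge(users_s, ratings_s), movies_s)

# Explicit form: the same inner joins with the keys written out.
explicit = users_s.merge(ratings_s, on="user_id", how="inner").merge(
    movies_s, on="movie_id", how="inner"
)

print(explicit)
print(implicit.equals(explicit))
```

Writing the keys out guards against accidental joins if two tables later gain another column with the same name.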
Look at all the ratings given by the user whose id is 1.
data[data.user_id==1]
| | user_id | gender | age | occupation | zip | movie_id | rating | timestamp | title | genres |
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 | 1193 | 5 | 978300760 | One Flew Over the Cuckoo’s Nest (1975) | Drama |
1725 | 1 | F | 1 | 1 |