Analyzing a Million Movie Ratings with pandas

Life is short, use Python.

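This post does data analysis in Python. pandas is one of Python's key data-analysis packages; other important ones are numpy and matplotlib.

Install pandas (the command is the same on Linux, Mac, and Windows):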

pip install pandas

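Movie data source: http://grouplens.org/datasets/movielens/

Download the data file and unzip it. It contains the following 4 files:

  • users.dat: user data
  • movies.dat: movie data
  • ratings.dat: rating data
  • README: description of the files

The README describes the format of the source data files: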

  • users.dat (UserID::Gender::Age::Occupation::Zip-code)
  • movies.dat (MovieID::Title::Genres)
  • ratings.dat (UserID::MovieID::Rating::Timestamp)

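Notes: Occupation is the user's occupation, Zip-code is the postal code, Timestamp is the time of the rating, and Genres lists the movie's genres (see the README for more details).

Within each record, the fields are separated by ::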
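To make the record layout concrete, here is a minimal sketch that splits one line on the :: separator in plain Python. The sample line is an assumption written in the users.dat layout described above, and the field names are illustrative labels rather than anything stored in the file.

```python
# Hypothetical sample line in the users.dat layout
# (UserID::Gender::Age::Occupation::Zip-code).
line = "1::F::1::10::48067"

# Illustrative field names; the source files themselves carry no header row.
field_names = ["user_id", "gender", "age", "occupation", "zip"]

# Split on the two-character separator and pair each value with its name.
fields = line.split("::")
record = dict(zip(field_names, fields))
print(record)
```

Every value comes back as a string here; pandas performs the type conversion when it loads the files.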


Environment:

  • OS: Windows
  • Language: Python 3.4
  • Editor: Jupyter

Reading the data with pandas.

Import the required packages:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
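Before reading the data, define the column names: the source files have no header row, only records whose fields are separated by '::'.
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']  # column names for the users table
Read the data; adjust the path to wherever the source file is on your machine.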
users = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\users.dat', sep='::', header=None, names=user_names)
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

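The warning above can be ignored. It only says that the data was loaded with the python engine instead of the c engine, because the c engine does not support regex separators and '::' is treated as one. As the message itself notes, passing engine='python' avoids the warning.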
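As a minimal sketch of the fix the warning suggests, the snippet below passes engine='python' explicitly. It reads from a small in-memory sample instead of the real users.dat so that it runs anywhere; the two sample rows and the name users_sample are assumptions for illustration. pd.read_csv is used here because it accepts the same sep, header, and names arguments as the pd.read_table call above.

```python
import io

import pandas as pd

# Hypothetical two-line stand-in for users.dat, in the same '::' layout.
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Naming the engine explicitly means pandas does not have to fall back to it,
# so the ParserWarning about regex separators is not raised.
users_sample = pd.read_csv(sample, sep='::', header=None,
                           names=user_names, engine='python')
print(users_sample)
```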
Check how many records there are and look at the first 5 rows.

print(len(users))
users.head()
6040
   user_id gender  age  occupation    zip
0        1      F    1          10  48067
1        2      M   56          16  70072
2        3      M   25          15  55117
3        4      M   45           7  02460
4        5      M   25          20  55455
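The zip value 02460 in the output above keeps its leading zero, which suggests the column was loaded as text on the full file (presumably because some entries are not purely numeric). On a subset that contains only digits, pandas would infer integers and drop the zero. The sketch below, which uses a hypothetical two-row sample, shows the difference and how dtype={'zip': str} makes the text interpretation explicit.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the users.dat layout.
text = "3::M::25::15::55117\n4::M::45::7::02460\n"
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Default type inference: an all-digit column becomes integers.
inferred = pd.read_csv(io.StringIO(text), sep='::', header=None,
                       names=user_names, engine='python')

# Forcing the zip column to str preserves leading zeros.
as_text = pd.read_csv(io.StringIO(text), sep='::', header=None,
                      names=user_names, engine='python',
                      dtype={'zip': str})

print(inferred['zip'].tolist())
print(as_text['zip'].tolist())
```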

Read the ratings and movies data in the same way.

ratings_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\ratings.dat', sep='::', header=None, names=ratings_names)
movies_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\movies.dat', sep='::', header=None, names=movies_names)
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  from ipykernel import kernelapp as app
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.

Loading takes a little while because there are over a million records.
Look at the ratings and movies tables.

print(len(ratings))
ratings.head()
1000209
   user_id  movie_id  rating  timestamp
0        1      1193       5  978300760
1        1       661       3  978302109
2        1       914       3  978301968
3        1      3408       4  978300275
4        1      2355       5  978824291
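The timestamp column is easier to read as dates. The sketch below assumes the values are seconds since the Unix epoch (check the README to confirm) and converts them with pd.to_datetime. The two sample rows are taken from the ratings output above; the column name rated_at is a hypothetical choice.

```python
import pandas as pd

# Two rows copied from the ratings output above.
ratings_sample = pd.DataFrame({
    'user_id': [1, 1],
    'movie_id': [1193, 661],
    'rating': [5, 3],
    'timestamp': [978300760, 978302109],
})

# Assumption: timestamps are seconds since 1970-01-01 00:00:00 UTC.
ratings_sample['rated_at'] = pd.to_datetime(ratings_sample['timestamp'], unit='s')
print(ratings_sample[['timestamp', 'rated_at']])
```

Under that assumption, the first rating falls on 2000-12-31 (UTC).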
print(len(movies))
movies.head()
3883
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
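Each genres value packs several genres into one string separated by |. One possible way to work with them, sketched below on two rows from the output above, is to split the string into a list or to expand it into one indicator column per genre. The names movies_sample, genre_lists, and genre_flags are hypothetical.

```python
import pandas as pd

# Two rows copied from the movies output above.
movies_sample = pd.DataFrame({
    'movie_id': [1, 3],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance'],
})

# Option 1: turn each '|'-separated string into a list of genres.
genre_lists = movies_sample['genres'].str.split('|')

# Option 2: one 0/1 column per genre, convenient for counting.
genre_flags = movies_sample['genres'].str.get_dummies(sep='|')

print(genre_lists)
print(genre_flags.sum())
```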

There are just over 1 million movie ratings.
Merge the 3 tables into a single table, data.

data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head()
1000209
   user_id gender  age  occupation    zip  movie_id  rating  timestamp                                   title genres
0        1      F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1        2      M   56          16  70072      1193       5  978298413  One Flew Over the Cuckoo's Nest (1975)  Drama
2       12      M   25          12  32793      1193       4  978220179  One Flew Over the Cuckoo's Nest (1975)  Drama
3       15      M   25           7  22903      1193       4  978199279  One Flew Over the Cuckoo's Nest (1975)  Drama
4       17      M   50           1  95350      1193       5  978158471  One Flew Over the Cuckoo's Nest (1975)  Drama
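When no key is given, pd.merge joins on the column names the two tables share: user_id for users and ratings, then movie_id for the result and movies. The sketch below spells those keys out with the on parameter, using tiny made-up tables (the third rating row is purely illustrative) so the behavior is easy to check by eye.

```python
import pandas as pd

# Tiny illustrative tables; only the column names mirror the real data.
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_s = pd.DataFrame({
    'user_id': [1, 2, 2],
    'movie_id': [1193, 1193, 1],
    'rating': [5, 5, 4],  # the last row is a made-up rating
})
movies_s = pd.DataFrame({
    'movie_id': [1, 1193],
    'title': ['Toy Story (1995)', "One Flew Over the Cuckoo's Nest (1975)"],
})

# Same two-step inner join as above, with the join keys named explicitly.
user_ratings = pd.merge(users_s, ratings_s, on='user_id')
merged = pd.merge(user_ratings, movies_s, on='movie_id')
print(merged)
```

Naming the keys guards against an accidental join on some other column that happens to share a name in both tables.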

View every rating given by the user whose id is 1.

data[data.user_id==1]
   user_id gender  age  occupation    zip  movie_id  rating  timestamp                                   title genres
0        1      F    1          10  48067      1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)  Drama
1725 1 F 1 1
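The selection data[data.user_id==1] works by building a boolean mask and keeping the rows where it is True. A minimal sketch on a three-row sample (values taken from the outputs earlier in this post; the names data_sample, mask, and user_1 are hypothetical) makes the two steps visible:

```python
import pandas as pd

# Three ratings taken from the outputs shown earlier in the post.
data_sample = pd.DataFrame({
    'user_id': [1, 2, 1],
    'movie_id': [1193, 1193, 661],
    'rating': [5, 5, 3],
})

# Step 1: compare the column with 1 to get one True/False value per row.
mask = data_sample['user_id'] == 1

# Step 2: index the DataFrame with the mask to keep only the True rows.
user_1 = data_sample[mask]

# DataFrame.query expresses the same filter as a string.
via_query = data_sample.query('user_id == 1')

print(mask.tolist())
print(user_1)
```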