Analyzing a Million Movie Ratings with pandas


License: CC 4.0 BY-SA

Tags: #python #data-analysis #pandas

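This article shows how to use Python's pandas library to analyze movie data at the million-record scale, covering reading the data, preprocessing, and exploratory analysis. It loads the movie, user, and rating data, checks the size of each table, and displays sample rows. It then analyzes how average movie ratings differ between male and female users, finds the movies with the largest rating gap, and finds the movies rated most often.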

Analyzing movie data with pandas

Life is short, use Python.

pandas is a key package for data analysis in Python; other important packages are numpy and matplotlib.

Install pandas (the same on Linux, Mac, and Windows):

pip install pandas

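Movie data source: http://grouplens.org/datasets/movielens/

Download and unzip the data file; it contains the following 4 files:

users.dat: user data
movies.dat: movie data
ratings.dat: rating data
README: description of the files

The README describes the format of the source data files: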

users.dat (UserID::Gender::Age::Occupation::Zip-code)
movies.dat (MovieID::Title::Genres)
ratings.dat (UserID::MovieID::Rating::Timestamp)

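Notes: Occupation is the user's occupation, Zip-code is the postal code, Timestamp is the time of the rating, and Genres is the movie's genres (see the README for more details).

Within each record, fields are separated by ::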
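
To make the record layout concrete, here is a minimal sketch that splits one line of ratings.dat by hand. The sample line uses the values of the first rating shown later in this post; the names RATING_FIELDS and parse_rating are hypothetical and exist only for illustration.

```python
# Hypothetical helper: split one '::'-delimited line of ratings.dat into named fields.
RATING_FIELDS = ['user_id', 'movie_id', 'rating', 'timestamp']


def parse_rating(line):
    """Return a dict mapping each field name to its integer value."""
    values = line.strip().split('::')
    return dict(zip(RATING_FIELDS, (int(v) for v in values)))


# Sample line (assumed to match the first row of ratings.dat).
record = parse_rating('1::1193::5::978300760\n')
print(record)
```

pandas does the same splitting for every line of the file when it is given sep='::'.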

Environment:

OS: Windows
Language: Python 3.4
Editor: Jupyter

Reading the data with pandas.

Import the required packages:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

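Before reading the data, define the column names, because the source files have no header row, only records whose fields are separated by '::'.

user_names = ['user_id', 'gender', 'age', 'occupation', 'zip'] # column names for the users table

Read the data; adjust the path to wherever the source files are on your machine.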

users = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\users.dat', sep='::', header=None, names=user_names)

D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:1: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  if __name__ == '__main__':

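The warning above can be ignored. It only says that the data was loaded with the python engine instead of the c engine: a separator longer than one character, such as '::', is treated as a regex, which the c engine does not support. Passing engine='python' to the read call avoids the warning.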
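
A minimal sketch of the explicit form, assuming a two-row in-memory sample in place of the real users.dat. It uses pd.read_csv, which takes the same sep, header, names and engine arguments as the pd.read_table call in this post; the name users_sample is made up for the example.

```python
import io
import warnings

import pandas as pd

# Assumption: two sample rows stand in for the real users.dat file.
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")
user_names = ['user_id', 'gender', 'age', 'occupation', 'zip']

with warnings.catch_warnings():
    # Turn a ParserWarning into an error to confirm that none is raised.
    warnings.simplefilter('error', pd.errors.ParserWarning)
    users_sample = pd.read_csv(sample, sep='::', header=None,
                               names=user_names, engine='python')

print(users_sample)
```

The parsed result is the same as before; only the fallback warning goes away.
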
Check how many records there are and look at the first 5 rows.

print(len(users))
users.head()

6040

	user_id	gender	age	occupation	zip
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

Read in the movies and ratings data the same way.

ratings_names = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\ratings.dat', sep='::', header=None, names=ratings_names)
movies_names = ['movie_id', 'title', 'genres']
movies = pd.read_table('C:\\Users\\Administrator\\Downloads\\ml-1m\\movies.dat', sep='::', header=None, names=movies_names)

D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.
  from ipykernel import kernelapp as app
D:\Anaconda3\lib\site-packages\ipykernel\__main__.py:4: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.

Loading takes a little while because the ratings data has over a million rows.
Look at the ratings table and the movies table.

print(len(ratings))
ratings.head()

1000209

	user_id	movie_id	rating	timestamp
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291
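
The timestamp column is easier to read as a date. A minimal sketch, assuming the values are seconds since the Unix epoch (which is how the MovieLens README describes them) and using two sample rows copied from the table above; the names ratings_sample and rated_at are introduced here for illustration only.

```python
import pandas as pd

# Two sample rows copied from the ratings table above.
ratings_sample = pd.DataFrame({
    'user_id': [1, 1],
    'movie_id': [1193, 661],
    'rating': [5, 3],
    'timestamp': [978300760, 978302109],
})

# Interpret the integers as seconds since 1970-01-01 (UTC).
ratings_sample['rated_at'] = pd.to_datetime(ratings_sample['timestamp'], unit='s')
print(ratings_sample[['movie_id', 'rating', 'rated_at']])
```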

print(len(movies))
movies.head()

3883

	movie_id	title	genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama
4	5	Father of the Bride Part II (1995)	Comedy
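
Each genres value packs several genres into one string, separated by the | character. A minimal sketch of splitting it into one list per movie, on two sample rows copied from the table above; the names movies_sample and genre_list are hypothetical.

```python
import pandas as pd

# Two sample rows copied from the movies table above.
movies_sample = pd.DataFrame({
    'movie_id': [1, 3],
    'title': ['Toy Story (1995)', 'Grumpier Old Men (1995)'],
    'genres': ["Animation|Children's|Comedy", 'Comedy|Romance'],
})

# Split each genres string on the literal '|' character.
movies_sample['genre_list'] = movies_sample['genres'].str.split('|')
print(movies_sample[['title', 'genre_list']])
```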

There are just over one million movie ratings.
Merge the 3 tables into a single table, data.

data = pd.merge(pd.merge(users, ratings), movies)
print(len(data))
data.head()

1000209

	user_id	gender	age	occupation	zip	movie_id	rating	timestamp	title	genres
0	1	F	1	10	48067	1193	5	978300760	One Flew Over the Cuckoo's Nest (1975)	Drama
1	2	M	56	16	70072	1193	5	978298413	One Flew Over the Cuckoo's Nest (1975)	Drama
2	12	M	25	12	32793	1193	4	978220179	One Flew Over the Cuckoo's Nest (1975)	Drama
3	15	M	25	7	22903	1193	4	978199279	One Flew Over the Cuckoo's Nest (1975)	Drama
4	17	M	50	1	95350	1193	5	978158471	One Flew Over the Cuckoo's Nest (1975)	Drama
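
When no key is given, pd.merge joins on the columns the two tables have in common, so users and ratings are joined on user_id and the result is joined with movies on movie_id. A minimal sketch with the keys spelled out, using tiny hand-made tables (a few sample rows, not the full data); the names users_s, ratings_s, movies_s, explicit and implicit are made up for the example.

```python
import pandas as pd

# Tiny hand-made samples; the real tables are far larger.
users_s = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
ratings_s = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [1193, 661, 1193],
    'rating': [5, 3, 5],
})
movies_s = pd.DataFrame({
    'movie_id': [661, 1193],
    'title': ['James and the Giant Peach (1996)',
              "One Flew Over the Cuckoo's Nest (1975)"],
})

# Explicit keys: users with ratings on user_id, then with movies on movie_id.
explicit = users_s.merge(ratings_s, on='user_id').merge(movies_s, on='movie_id')

# Same call shape as in the article: keys inferred from the shared column names.
implicit = pd.merge(pd.merge(users_s, ratings_s), movies_s)

print(explicit)
```

Both forms use an inner join by default, so every rating whose user and movie are present yields exactly one row, which is why the merged table has the same length as ratings.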

Look at all the ratings given by the user whose id is 1.

data[data.user_id==1]

	user_id	gender	age	occupation	zip	movie_id	rating	timestamp	title	genres
0	1	F	1	10	48067	1193	5	978300760	One Flew Over the Cuckoo's Nest (1975)	Drama
1725	1	F	1	10	48067	661	3	978302109	James and the Giant Peach (1996)	Animation|Children's|Musical
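
The expression data.user_id==1 builds a boolean Series, and indexing the table with it keeps only the rows where the value is True. A minimal sketch on a small hand-made table (sample rows only); the names data_s, mask and user1 are hypothetical.

```python
import pandas as pd

# Small hand-made stand-in for the merged table.
data_s = pd.DataFrame({
    'user_id': [1, 2, 1],
    'movie_id': [1193, 1193, 661],
    'rating': [5, 5, 3],
})

# Boolean Series: True where the row belongs to user 1.
mask = data_s['user_id'] == 1

# Boolean indexing keeps the matching rows and their original index labels.
user1 = data_s[mask]
print(user1)
```

The selected rows keep their original index labels, which is why the index in the output above jumps from 0 to 1725.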