Pandas Project Practice 1: Analyzing a Million-Scale Hollywood Movie Rating Dataset

Article link: https://blog.csdn.net/kilotwo/article/details/97271590

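This project uses Pandas to analyze a Hollywood movie rating dataset of about one million records. It covers data integration, the highest-rated movies, rating differences between genders, and the most popular movies. Pivot tables, grouped aggregation, and data visualization are used to reveal rating trends and user behavior.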


Contents

Analysis of a Million-Scale Hollywood Movie Rating Dataset
Summary

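Analysis of a Million-Scale Hollywood Movie Rating Dataset

After working through the Pandas basics, the natural next step is a few small projects that tie the concepts and the API together. This one analyzes a Hollywood movie rating dataset with about one million records. Let's get started.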

Pandas topics covered

Reading data
Data integration
Pivot tables
Aggregation and group-by operations
Binned statistics
Data visualization

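Tasks

Loading and integrating the data
Movies with the highest average ratings
Average movie ratings by gender
Movies on which the genders disagree most
The most popular movies by number of ratings
Movies with the most disagreement across age groups
Optimization and summary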

All data used in this article can be downloaded here:

Link: https://pan.baidu.com/s/1KBphl8o-YEFXVp8N1IlsgA Extraction code: 8daa

Environment: Jupyter Notebook

1. Import the required libraries

import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
%matplotlib inline

2. Load the data

Reading the users

According to the README, the user data has the following format:

USERS FILE DESCRIPTION User information is in the file “users.dat” and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

The column names used here do not have to match the README exactly, as long as their meaning is clear to you.

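# In Jupyter, press Shift + Tab to see a function's signature and docstring
# Column names for the users table
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
# Arguments in order: path, separator, no header row, column names.
# A multi-character separator such as '::' needs the Python parsing engine.
users = pd.read_csv('./users.dat', sep='::', header=None, names=labels, engine='python')
# Check the dimensions after loading
users.shape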

(6040, 5)

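Any red output from this cell is a warning, not an error, so there is no need to worry. With a multi-character separator such as '::', pandas emits a ParserWarning when it has to fall back to the Python parsing engine; passing engine='python' to read_csv explicitly avoids it.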
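As a minimal sketch of the read_csv call above, the following parses a small in-memory sample laid out like users.dat. The sample rows and the name users_sample are illustrative only, and dtype={'Zip-code': str} is an optional addition (not part of the original code) that guarantees leading zeros in zip codes are kept.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same '::'-separated layout as users.dat.
sample = io.StringIO(
    "1::F::1::10::48067\n"
    "2::M::56::16::70072\n"
    "4::M::45::7::02460\n"
)
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

users_sample = pd.read_csv(
    sample,
    sep='::',                 # multi-character separator
    header=None,              # the file has no header row
    names=labels,             # assign our own column names
    engine='python',          # the C engine cannot split on '::'
    dtype={'Zip-code': str},  # keep zip codes as text, e.g. '02460'
)

print(users_sample.shape)
print(users_sample['Zip-code'].tolist())
```

The same keyword arguments should carry over unchanged when the path './users.dat' is passed instead of the in-memory sample.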

users.head()

	UserID	Gender	Age	Occupation	Zip-code
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455
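An Age of 1 in the first row looks odd until the column is read as a category code rather than an age in years. Assuming this is the MovieLens 1M dataset (the file layout and row counts suggest so), its README lists the codes used in the mapping below. The names AGE_GROUPS, users_sample, and AgeGroup are illustrative and do not appear in the original notebook.

```python
import pandas as pd

# Age codes as listed in the MovieLens 1M README (assumed to be this dataset).
AGE_GROUPS = {
    1: 'Under 18',
    18: '18-24',
    25: '25-34',
    35: '35-44',
    45: '45-49',
    50: '50-55',
    56: '56+',
}

# The first five users shown above, reduced to the columns needed here.
users_sample = pd.DataFrame({
    'UserID': [1, 2, 3, 4, 5],
    'Gender': ['F', 'M', 'M', 'M', 'M'],
    'Age': [1, 56, 25, 45, 25],
})

# Map each code to a readable label; unknown codes would become NaN.
users_sample['AgeGroup'] = users_sample['Age'].map(AGE_GROUPS)
print(users_sample)
```

A readable label like this can make the later per-age-group analysis easier to interpret, while the numeric Age column stays available for grouping.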

Reading the movies

MOVIES FILE DESCRIPTION

Movie information is in the file “movies.dat” and is in the following
format:

MovieID::Title::Genres

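labels2 = ['MovieID', 'Title', 'Genres']
# If this raises UnicodeDecodeError, try adding encoding='latin-1'
movie = pd.read_csv('./movies.dat', sep='::', header=None, names=labels2, engine='python')
# display() shows both objects in one cell
display(movie.head(), movie.shape)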

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

(3883, 3)
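The Genres column packs several genres into one '|'-separated string, so any per-genre statistic needs it split first. The following is one possible sketch, not a step from the original notebook; movie_sample, genres_long, and genre_counts are illustrative names, and the rows are copied from the output above.

```python
import pandas as pd

# Three of the rows shown above, rebuilt by hand for a self-contained example.
movie_sample = pd.DataFrame({
    'MovieID': [1, 3, 5],
    'Title': [
        'Toy Story (1995)',
        'Grumpier Old Men (1995)',
        'Father of the Bride Part II (1995)',
    ],
    'Genres': ["Animation|Children's|Comedy", 'Comedy|Romance', 'Comedy'],
})

# Split each string on '|' into a list, then give every genre its own row.
genres_long = (
    movie_sample
    .assign(Genre=movie_sample['Genres'].str.split('|'))
    .explode('Genre')
    .drop(columns='Genres')
)

# Count how many movies carry each genre label.
genre_counts = genres_long['Genre'].value_counts()
print(genres_long)
print(genre_counts)
```

The long format has one row per (movie, genre) pair, which is convenient for a later groupby on genre.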

Reading the ratings

RATINGS FILE DESCRIPTION

All ratings are contained in the file “ratings.dat” and are in the
following format:

UserID::MovieID::Rating::Timestamp

labels3 = ['UserID', 'MovieID', 'Rating', 'Time']
ratings = pd.read_csv('./ratings.dat', sep='::', header=None, names=labels3, engine='python')
# display() shows both objects in one cell
display(ratings.head(), ratings.shape)

Reading about a million rows may take a moment.

	UserID	MovieID	Rating	Time
0	1	1193	5	978300760
1	1	661	3	978302109
2	1	914	3	978301968
3	1	3408	4	978300275
4	1	2355	5	978824291

(1000209, 4)
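The Time column holds large integers that look like Unix timestamps. Assuming they are seconds since 1970-01-01 UTC, they can be turned into readable datetimes as sketched below. ratings_sample and RatedAt are illustrative names, and the rows are copied from the output above.

```python
import pandas as pd

# Three of the rows shown above, rebuilt by hand for a self-contained example.
ratings_sample = pd.DataFrame({
    'UserID': [1, 1, 1],
    'MovieID': [1193, 661, 2355],
    'Rating': [5, 3, 5],
    'Time': [978300760, 978302109, 978824291],
})

# unit='s' tells pandas the integers are seconds since the Unix epoch (UTC).
ratings_sample['RatedAt'] = pd.to_datetime(ratings_sample['Time'], unit='s')
print(ratings_sample[['Time', 'RatedAt']])
```

With a datetime column in place, the .dt accessor (for example ratings_sample['RatedAt'].dt.year) can be used for time-based grouping if that is ever needed.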

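3. Merging the data

The data is spread across three tables, so they need to be integrated. First, display the three tables together to see what columns each one has.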

display(users.head(),movie.head(),ratings.head())

	UserID	Gender	Age	Occupation	Zip-code
0	1	F	1	10	48067
1	2	M	56	16	70072
2	3	M	25	15	55117
3	4	M	45	7	02460
4	5	M	25	20	55455

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

	UserID	MovieID	Rating	Time