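Analyzing a Million Hollywood Movie Ratings
After an introduction to Pandas, the next step is a few simple projects that tie the material together. This one analyzes a dataset of about one million Hollywood movie ratings. Let's get started.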
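Pandas topics covered
- Reading data
- Data integration
- Pivot tables
- Aggregation and group operations
- Binned statistics
- Data visualization
Tasks
- Load and integrate the data
- Movies with the highest average ratings
- Average movie rating by gender
- Movies that divide the genders most
- Most-rated (most popular) movies
- Movies that divide the age groups most
- Optimization and summary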
All data used in this article can be downloaded here:
Link: https://pan.baidu.com/s/1KBphl8o-YEFXVp8N1IlsgA Extraction code: 8daa
Environment: Jupyter Notebook
1. Import the required libraries
import numpy as np
import pandas as pd
# plotting
import matplotlib.pyplot as plt
%matplotlib inline
2. Load the data
Reading users
The README gives the following format for the users data:
USERS FILE DESCRIPTION User information is in the file “users.dat” and is in the following format:
UserID::Gender::Age::Occupation::Zip-code
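The column names used here do not have to match the README exactly; any names that are clear to you will do.
# Shift + Tab shows a function's signature and docstring in Jupyter
# Column names
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']
# Arguments in order: path, separator, no header row, column names.
# engine='python' is passed because the C parser does not support multi-character separators.
users = pd.read_csv('./users.dat', sep='::', header=None, names=labels, engine='python')
# Check the dimensions after reading
users.shape
(6040, 5)
Without engine='python', pandas falls back to the Python parser on its own and prints a red ParserWarning. That output is only a log message and nothing to worry about.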
users.head()
| | UserID | Gender | Age | Occupation | Zip-code |
---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
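To make the parsing step concrete, here is a minimal sketch that reads three lines in the `users.dat` layout from memory instead of from disk. The rows are copied from the table above; the names `sample` and `users_demo` and the explicit `dtype` are illustrative additions, not part of the original notebook. With only numeric zip codes in such a small sample, pandas would infer integers and drop the leading zero of `02460`, so the sketch asks for strings.

```python
import io

import pandas as pd

# Three rows in the users.dat layout: UserID::Gender::Age::Occupation::Zip-code
sample = "1::F::1::10::48067\n2::M::56::16::70072\n4::M::45::7::02460\n"
labels = ['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code']

users_demo = pd.read_csv(
    io.StringIO(sample),      # any file-like object works in place of a path
    sep='::',                 # multi-character separator
    header=None,              # the file has no header row
    names=labels,             # supply the column names ourselves
    engine='python',          # the C parser does not handle multi-character separators
    dtype={'Zip-code': str},  # keep zip codes as text so leading zeros survive
)

print(users_demo)
print(users_demo.dtypes)
```

The same keyword arguments can be used when reading the real file by replacing `io.StringIO(sample)` with the file path.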
Reading movies
MOVIES FILE DESCRIPTION Movie information is in the file “movies.dat” and is in the following format:
MovieID::Title::Genres
labels2 = ['MovieID', 'Title', 'Genres']
# engine='python' is passed because of the multi-character separator
movie = pd.read_csv('./movies.dat', sep='::', header=None, names=labels2, engine='python')
# display() renders several objects from one cell
display(movie.head(), movie.shape)
| | MovieID | Title | Genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
(3883, 3)
Reading ratings
RATINGS FILE DESCRIPTION All ratings are contained in the file “ratings.dat” and are in the following format:
UserID::MovieID::Rating::Timestamp
labels3 = ['UserID', 'MovieID', 'Rating', 'Time']
# engine='python' is passed because of the multi-character separator
ratings = pd.read_csv('./ratings.dat', sep='::', header=None, names=labels3, engine='python')
# display() renders several objects from one cell
display(ratings.head(), ratings.shape)
Reading about a million rows may take a moment.
| | UserID | MovieID | Rating | Time |
---|---|---|---|---|
0 | 1 | 1193 | 5 | 978300760 |
1 | 1 | 661 | 3 | 978302109 |
2 | 1 | 914 | 3 | 978301968 |
3 | 1 | 3408 | 4 | 978300275 |
4 | 1 | 2355 | 5 | 978824291 |
(1000209, 4)
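The `Time` column holds raw integers. Assuming, as the README's `Timestamp` field suggests, that these are seconds since the Unix epoch, a small sketch of turning them into readable datetimes could look like this. The two rows are copied from the table above; `ratings_demo` and the new `RatedAt` column are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# First two rows of the ratings table shown above
ratings_demo = pd.DataFrame({
    'UserID': [1, 1],
    'MovieID': [1193, 661],
    'Rating': [5, 3],
    'Time': [978300760, 978302109],
})

# Assumption: Time is seconds since 1970-01-01 UTC, hence unit='s'
ratings_demo['RatedAt'] = pd.to_datetime(ratings_demo['Time'], unit='s')

print(ratings_demo)
```

Under that assumption the first rating was recorded on 2000-12-31 at 22:12:40 UTC.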
3. Merge the data
The data is spread across three tables, so it has to be integrated. First, display the three tables together to inspect their columns.
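The point of looking at the tables together is to spot the columns they share, since those are the natural join keys. As a hedged sketch of one possible way to integrate them, the toy frames below reuse the real column names; the rating values are made up, the two titles come from the movies table above, and all `*_demo` names are hypothetical. The merge in the full analysis may be written differently.

```python
import pandas as pd

# Toy stand-ins that reuse the real column names (rating values are made up)
users_demo = pd.DataFrame({'UserID': [1, 2], 'Gender': ['F', 'M']})
movie_demo = pd.DataFrame({'MovieID': [1, 2],
                           'Title': ['Toy Story (1995)', 'Jumanji (1995)']})
ratings_demo = pd.DataFrame({'UserID': [1, 1, 2],
                             'MovieID': [1, 2, 1],
                             'Rating': [5, 3, 4]})

# Columns shared by two tables are the candidate join keys
user_keys = sorted(set(ratings_demo.columns) & set(users_demo.columns))
movie_keys = sorted(set(ratings_demo.columns) & set(movie_demo.columns))
print(user_keys, movie_keys)

# Join ratings to users, then the result to movies (inner joins by default)
merged_demo = (ratings_demo
               .merge(users_demo, on=user_keys)
               .merge(movie_demo, on=movie_keys))
print(merged_demo)
```

Each rating row ends up carrying the rater's attributes and the movie's title, which is the shape that grouping and pivoting need.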
display(users.head(),movie.head(),ratings.head())
| | UserID | Gender | Age | Occupation | Zip-code |
---|---|---|---|---|---|
0 | 1 | F | 1 | 10 | 48067 |
1 | 2 | M | 56 | 16 | 70072 |
2 | 3 | M | 25 | 15 | 55117 |
3 | 4 | M | 45 | 7 | 02460 |
4 | 5 | M | 25 | 20 | 55455 |
| | MovieID | Title | Genres |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
| | UserID | MovieID | Rating | Time |
---|---|---|---|---|