04_好莱坞百万级电影评论数据分析


博文配套视频课程:24小时实现从零到AI人工智能


好莱坞百万级电影评论数据分析

经过Pandas的入门学习,急需要通过一些简单的项目来将所学知识和用法融会贯通,这里选择对好莱坞百万级电影评论数据进行分析处理,下面就开始吧~

Pandas 知识点

  • 数据读取
  • 数据集成
  • 透视表
  • 数据聚合与分组运算
  • 分段统计
  • 数据可视化

任务需求

  1. 数据加载和集成
  2. 平均分较高电影
  3. 不同性别对电影平均评分
  4. 不同性别争议最大电影
  5. 评分次数最多热门的电影
  6. 不同年龄段争议最大的电影
  7. 优化与总结

学习大礼包中有影评的测试数据:

链接: https://pan.baidu.com/s/1oyfrI7h0y2o3u_KTXh0hSQ 提取码: kahn

操作环境:Jupyter Notebook

1.导入所需库

import numpy as np
import pandas as pd
# draw
import matplotlib.pyplot as plt
%matplotlib inline

 
 
  • 1
  • 2
  • 3
  • 4
  • 5

2.导入数据

读取user

通过查看README可以得到USER数据的格式如下:

USERS FILE DESCRIPTION User information is in the file “users.dat” and is in the following format:

UserID::Gender::Age::Occupation::Zip-code

此处索引命名不一定非要一致,自己明白即可

# shift + Tab 查看函数提示
# 创建索引列表
labels = ['UserID','Gender','Age','Occupation','Zip-code']
# 以此输入路径,分隔符,不作为头部,赋值索引
users = pd.read_csv('./users.dat',sep = '::', header= None, names =labels)
# 读取后查看维度
users.shape

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
(6040, 5)

 
 
  • 1

若有红色输出则即可当做log日志,不用惊慌

users.head()

 
 
  • 1
UserIDGenderAgeOccupationZip-code
01F11048067
12M561670072
23M251555117
34M45702460
45M252055455

读取Movie

MOVIES FILE DESCRIPTION

Movie information is in the file “movies.dat” and is in the following
format:

MovieID::Title::Genres

labels2 = ['MovieID','Title','Genres']
movie =pd.read_csv('./movies.dat',sep='::',header = None,names=labels2)
# display同时显示两个
display(movie.head(),movie.shape)

 
 
  • 1
  • 2
  • 3
  • 4
MovieIDTitleGenres
01Toy Story (1995)Animation|Children’s|Comedy
12Jumanji (1995)Adventure|Children’s|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
(3883, 3)

 
 
  • 1

读取RATINGS

RATINGS FILE DESCRIPTION

All ratings are contained in the file “ratings.dat” and are in the
following format:

UserID::MovieID::Rating::Timestamp

labels3 = ['UserID','MovieID','Rating','Time']
ratings =pd.read_csv('./ratings.dat',sep='::',header = None,names=labels3)
# display()同时显示两组数据
display(ratings.head(),ratings.shape)

 
 
  • 1
  • 2
  • 3
  • 4

这里读取百万级数据可能需要稍作等待。。。

UserIDMovieIDRatingTime
0111935978300760
116613978302109
219143978301968
3134084978300275
4123555978824291
(1000209, 4)

 
 
  • 1

3. 数据合并

由于数据分布在三个表,所以需要对数据进行数据集成,首先将三张表简单展示在一起,查看各自特征。

display(users.head(),movie.head(),ratings.head())

UserIDGenderAgeOccupationZip-code
01F11048067
12M561670072
23M251555117
34M45702460
45M252055455
MovieIDTitleGenres
01Toy Story (1995)Animation|Children’s|Comedy
12Jumanji (1995)Adventure|Children’s|Fantasy
23Grumpier Old Men (1995)Comedy|Romance
34Waiting to Exhale (1995)Comedy|Drama
45Father of the Bride Part II (1995)Comedy
UserIDMovieIDRatingTime
0111935978300760
116613978302109
219143978301968
3134084978300275
4123555978824291

经过观察发现后两张表关于MovieID有重合,可以进行数据合并

# 关于MovieID可以合并
df1 = pd.merge(left = movie ,right=ratings)
df1.head(10)

 
 
  • 1
  • 2
  • 3
MovieIDTitleGenresUserIDRatingTime
01Toy Story (1995)Animation|Children’s|Comedy15978824268
11Toy Story (1995)Animation|Children’s|Comedy64978237008
21Toy Story (1995)Animation|Children’s|Comedy84978233496
31Toy Story (1995)Animation|Children’s|Comedy95978225952
41Toy Story (1995)Animation|Children’s|Comedy105978226474
51Toy Story (1995)Animation|Children’s|Comedy184978154768
61Toy Story (1995)Animation|Children’s|Comedy195978555994
71Toy Story (1995)Animation|Children’s|Comedy213978139347
81Toy Story (1995)Animation|Children’s|Comedy234978463614
91Toy Story (1995)Animation|Children’s|Comedy263978130703

merge()在并没有指定在哪一列进行连接时,连接键信息没有确定,此时merge()会自动将表中重叠列名作为连接的键,但是一般显式的设定链接键是好的习惯。

另外在合并后有可能出现缺少数据的情况,这是因为默认是内连接方式,即为两张表的交集部分进行合并,若是外连接方式则是键的并集,所以在数据合并后检查数据总量是好的习惯。

movie_data = pd.merge(df1,users,how="outer")
movie_data.shape
#检查数据没少

 
 
  • 1
  • 2
  • 3
(1000209, 10)

 
 
  • 1

movie_data.head(10)

MovieIDTitleGenresUserIDRatingTimeGenderAgeOccupationZip-code
01Toy Story (1995)Animation|Children’s|Comedy15978824268F11048067
148Pocahontas (1995)Animation|Children’s|Musical|Romance15978824351F11048067
2150Apollo 13 (1995)Drama15978301777F11048067
3260Star Wars: Episode IV - A New Hope (1977)Action|Adventure|Fantasy|Sci-Fi14978300760F11048067
4527Schindler’s List (1993)Drama|War15978824195F11048067
5531Secret Garden, The (1993)Children’s|Drama14978302149F11048067
6588Aladdin (1992)Animation|Children’s|Comedy|Musical14978824268F11048067
7594Snow White and the Seven Dwarfs (1937)Animation|Children’s|Musical14978302268F11048067
8595Beauty and the Beast (1991)Animation|Children’s|Musical15978824268F11048067
9608Fargo (1996)Crime|Drama|Thriller14978301398F11048067
# 通过标题查看有多少部电影
movie_data['Title'].unique().size

 
 
  • 1
  • 2
3706

 
 
  • 1

4.平均分较高电影

既要显示电影名称又要有分数 考虑作透视表

以名称为索引 以分数 并且计算平均分为数值来作透视表

movie_rate_mean = pd.pivot_table(movie_data,values=['Rating'],index=['Title'],aggfunc='mean')
movie_rate_mean.head()

 
 
  • 1
  • 2
Rating
Title
$1,000,000 Duck (1971)3.027027
'Night Mother (1986)3.371429
'Til There Was You (1997)2.692308
'burbs, The (1989)2.910891
…And Justice for All (1979)3.713568

按照分数排个序 inplace = True 节省内存不会打印输出 直接在原数据进行排序

movie_rate_mean.sort_values(by='Rating',ascending = False,inplace = True)
movie_rate_mean.head(20)

 
 
  • 1
  • 2
Rating
Title
Ulysses (Ulisse) (1954)5.000000
Lured (1947)5.000000
Follow the Bitch (1998)5.000000
Bittersweet Motel (2000)5.000000
Song of Freedom (1936)5.000000
One Little Indian (1973)5.000000
Smashing Time (1967)5.000000
Schlafes Bruder (Brother of Sleep) (1995)5.000000
Gate of Heavenly Peace, The (1995)5.000000
Baby, The (1973)5.000000
I Am Cuba (Soy Cuba/Ya Kuba) (1964)4.800000
Lamerica (1994)4.750000
Apple, The (Sib) (1998)4.666667
Sanjuro (1962)4.608696
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)4.560510
Shawshank Redemption, The (1994)4.554558
Godfather, The (1972)4.524966
Close Shave, A (1995)4.520548
Usual Suspects, The (1995)4.517106
Schindler’s List (1993)4.510417
# 查看评分最低的电影 倒序查看
movie_rate_mean[-20:]

 
 
  • 1
  • 2
Rating
Title
Lotto Land (1995)1.0
Nueba Yol (1995)1.0
Even Dwarfs Started Small (Auch Zwerge haben klein angefangen) (1971)1.0
Get Over It (1996)1.0
Venice/Venice (1992)1.0
Sleepover (1995)1.0
Silence of the Palace, The (Saimt el Qusur) (1994)1.0
Waltzes from Vienna (1933)1.0
Wirey Spindell (1999)1.0
Kestrel’s Eye (Falkens 鰃a) (1998)1.0
Spring Fever USA (a.k.a. Lauderdale) (1989)1.0
Loves of Carmen, The (1948)1.0
Underworld (1997)1.0
Low Life, The (1994)1.0
Santa with Muscles (1996)1.0
Fantastic Night, The (La Nuit Fantastique) (1949)1.0
Cheetah (1989)1.0
Torso (Corpi Presentano Tracce di Violenza Carnale) (1973)1.0
Mutters Courage (1995)1.0
Windows (1980)1.0

5. 不同性别对电影评分

依然是通过性别索引来建立透视表分析数据

movie_gender_rating = pd.pivot_table(movie_data,values=['Rating'],index=['Title','Gender'],aggfunc='mean')
movie_gender_rating.head(10)

 
 
  • 1
  • 2
Rating
TitleGender
$1,000,000 Duck (1971)F3.375000
M2.761905
'Night Mother (1986)F3.388889
M3.352941
'Til There Was You (1997)F2.675676
M2.733333
'burbs, The (1989)F2.793478
M2.962085
…And Justice for All (1979)F3.828571
M3.689024

这样分析数据似乎不是非常直观,且由于只分析分值所以可以不显示Rating数据

# 换种透视方法 去掉values中括号->去掉rating标题
movie_gender_rating2 = pd.pivot_table(movie_data,values='Rating',index=['Title'],columns=['Gender'],aggfunc='mean')
movie_gender_rating2.head()

 
 
  • 1
  • 2
  • 3
GenderFM
Title
$1,000,000 Duck (1971)3.3750002.761905
'Night Mother (1986)3.3888893.352941
'Til There Was You (1997)2.6756762.733333
'burbs, The (1989)2.7934782.962085
…And Justice for All (1979)3.8285713.689024

6.不同性别争议最大的电影

查看列索引

movie_gender_rating2.columns

 
 
  • 1
Index(['F', 'M'], dtype='object', name='Gender')

 
 
  • 1

创建新的列diff用来存放男女评分差异

#男女评分差异
movie_gender_rating2['diff'] = movie_gender_rating2.F - movie_gender_rating2.M
movie_gender_rating2.head()  

 
 
  • 1
  • 2
  • 3
GenderFMdiff
Title
$1,000,000 Duck (1971)3.3750002.7619050.613095
'Night Mother (1986)3.3888893.3529410.035948
'Til There Was You (1997)2.6756762.733333-0.057658
'burbs, The (1989)2.7934782.962085-0.168607
…And Justice for All (1979)3.8285713.6890240.139547

要分析差异最大,可以对数据进行正向排序

movie_gender_rating2.sort_values(by="diff",ascending=False,inplace=True)
movie_gender_rating2.head()

 
 
  • 1
  • 2
GenderFMdiff
Title
James Dean Story, The (1957)4.0000001.0000003.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)4.0000001.0000003.000000
Country Life (1994)5.0000002.0000003.000000
Babyfever (1994)3.6666671.0000002.666667
Woman of Paris, A (1923)5.0000002.4285712.571429

因为是女减去男 所以差异最大的,是女性最喜欢的

f = movie_gender_rating2[:10]
f

 
 
  • 1
  • 2
GenderFMdiff
Title
James Dean Story, The (1957)4.0000001.0000003.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)4.0000001.0000003.000000
Country Life (1994)5.0000002.0000003.000000
Babyfever (1994)3.6666671.0000002.666667
Woman of Paris, A (1923)5.0000002.4285712.571429
Cobra (1925)4.0000001.5000002.500000
Other Side of Sunday, The (S鴑dagsengler) (1996)5.0000002.9285712.071429
Theodore Rex (1995)3.0000001.0000002.000000
For the Moment (1994)5.0000003.0000002.000000
Separation, The (La S閜aration) (1994)4.0000002.0000002.000000

此时排在最后的就是男减女差异最大的也就是男性最喜欢的

# 最后十个就男性最喜欢的
m = movie_gender_rating2[-10:]
#处理一下 去掉存在的许多NAN
m = movie_gender_rating2.dropna()[-10:]
m

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
GenderFMdiff
Title
Jamaica Inn (1939)1.03.142857-2.142857
Flying Saucer, The (1950)1.03.300000-2.300000
Rosie (1998)1.03.333333-2.333333
In God’s Hands (1998)1.03.333333-2.333333
Dangerous Ground (1997)1.03.333333-2.333333
Killer: A Journal of Murder (1995)1.03.428571-2.428571
Stalingrad (1993)1.03.593750-2.593750
Enfer, L’ (1994)1.03.750000-2.750000
Neon Bible, The (1995)1.04.000000-3.000000
Tigrero: A Film That Was Never Made (1994)1.04.333333-3.333333

将男女最喜欢的电影合并成一张表,便于作图观察

diff = pd.concat([f,m])
diff

 
 
  • 1
  • 2
GenderFMdiff
Title
James Dean Story, The (1957)4.0000001.0000003.000000
Spiders, The (Die Spinnen, 1. Teil: Der Goldene See) (1919)4.0000001.0000003.000000
Country Life (1994)5.0000002.0000003.000000
Babyfever (1994)3.6666671.0000002.666667
Woman of Paris, A (1923)5.0000002.4285712.571429
Cobra (1925)4.0000001.5000002.500000
Other Side of Sunday, The (S鴑dagsengler) (1996)5.0000002.9285712.071429
Theodore Rex (1995)3.0000001.0000002.000000
For the Moment (1994)5.0000003.0000002.000000
Separation, The (La S閜aration) (1994)4.0000002.0000002.000000
Jamaica Inn (1939)1.0000003.142857-2.142857
Flying Saucer, The (1950)1.0000003.300000-2.300000
Rosie (1998)1.0000003.333333-2.333333
In God’s Hands (1998)1.0000003.333333-2.333333
Dangerous Ground (1997)1.0000003.333333-2.333333
Killer: A Journal of Murder (1995)1.0000003.428571-2.428571
Stalingrad (1993)1.0000003.593750-2.593750
Enfer, L’ (1994)1.0000003.750000-2.750000
Neon Bible, The (1995)1.0000004.000000-3.000000
Tigrero: A Film That Was Never Made (1994)1.0000004.333333-3.333333

分析结果 进行数据可视化

# barh水平柱状图
diff.plot(y='diff',kind='barh',figsize=(16,9))

 
 
  • 1
  • 2

图1

7.评论次数最多热门的电影

次数最多考虑利用groupby按照title来做电影的数据聚合,采用size得出每种电影的次数进行排序

# 用groupby 按照title 来做数据聚合
rating_count = movie_data.groupby(['Title']).size()
rating_count.sort_values(ascending = False)

 
 
  • 1
  • 2
  • 3
Title
American Beauty (1999)                                   3428
Star Wars: Episode IV - A New Hope (1977)                2991
Star Wars: Episode V - The Empire Strikes Back (1980)    2990
Star Wars: Episode VI - Return of the Jedi (1983)        2883
Jurassic Park (1993)                                     2672

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6

8.查看不同年龄段争议最大电影

分析不同年龄段需要对用户的年龄分布图作图分析

movie_data['Age'].plot(kind = 'hist',bins = 20)

 
 
  • 1

在这里插入图片描述
用pandas.cut 函数将用户年龄分组

# 建立分组索引
labels4 = ['0-9','10-19','20-29','30-39','40-49','50-59']
# range是左闭右开 所以取到61
movie_data['Age_range']=pd.cut(movie_data.Age,bins=range(0,61,10),labels=labels4)
movie_data.head()

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
MovieIDTitleGenresUserIDRatingTimeGenderAgeOccupationZip-codeAge_range
01Toy Story (1995)Animation|Children’s|Comedy15978824268F110480670-9
148Pocahontas (1995)Animation|Children’s|Musical|Romance15978824351F110480670-9
2150Apollo 13 (1995)Drama15978301777F110480670-9
3260Star Wars: Episode IV - A New Hope (1977)Action|Adventure|Fantasy|Sci-Fi14978300760F110480670-9
4527Schindler’s List (1993)Drama|War15978824195F110480670-9

9.每个年龄段用户评分人数和打分偏好

# agg 用来计算多个数据
movie_data.groupby('Age_range').agg({'Rating':[np.size,np.mean]})

 
 
  • 1
  • 2
Rating
sizemean
Age_range
0-9272113.549520
10-191835363.507573
20-293955563.545235
30-391990033.618162
40-491561233.673559
50-59387803.766632

10.优化数据分析,结果真实可靠

由于评分次数相差悬殊,导致有的电影评分人数少,却得到很高的分数

解决方案: 加入评分次数考核限制

10.1 加入评分次数限制来分析不同性别对电影的平均分

# 评论最多的50部电影排行
top_movie_title = movie_data.groupby('Title').size().sort_values()[::-1][:50].index
top_movie_title

 
 
  • 1
  • 2
  • 3
Index(['American Beauty (1999)', 'Star Wars: Episode IV - A New Hope (1977)',
       'Star Wars: Episode V - The Empire Strikes Back (1980)',
       'Star Wars: Episode VI - Return of the Jedi (1983)',
       'Jurassic Park (1993)', 'Saving Private Ryan (1998)',
       'Terminator 2: Judgment Day (1991)', 'Matrix, The (1999)',
       'Back to the Future (1985)', 'Silence of the Lambs, The (1991)',
       'Men in Black (1997)', 'Raiders of the Lost Ark (1981)', 'Fargo (1996)',
       'Sixth Sense, The (1999)', 'Braveheart (1995)',
       'Shakespeare in Love (1998)', 'Princess Bride, The (1987)',
       'Schindler's List (1993)', 'L.A. Confidential (1997)',
       'Groundhog Day (1993)', 'E.T. the Extra-Terrestrial (1982)',
       'Star Wars: Episode I - The Phantom Menace (1999)',
       'Being John Malkovich (1999)', 'Shawshank Redemption, The (1994)',
       'Godfather, The (1972)', 'Forrest Gump (1994)', 'Ghostbusters (1984)',
       'Pulp Fiction (1994)', 'Terminator, The (1984)', 'Toy Story (1995)',
       'Alien (1979)', 'Total Recall (1990)', 'Fugitive, The (1993)',
       'Gladiator (2000)', 'Aliens (1986)', 'Blade Runner (1982)',
       'Who Framed Roger Rabbit? (1988)', 'Stand by Me (1986)',
       'Usual Suspects, The (1995)', 'Babe (1995)', 'Airplane! (1980)',
       'Independence Day (ID4) (1996)', 'Galaxy Quest (1999)',
       'One Flew Over the Cuckoo's Nest (1975)', 'Wizard of Oz, The (1939)',
       '2001: A Space Odyssey (1968)', 'Abyss, The (1989)',
       'Bug's Life, A (1998)', 'Jaws (1975)',
       'Godfather: Part II, The (1974)'],
      dtype='object', name='Title')

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25

这里需要得到这50部电影的布尔索引,用来刷选男女差异最大的电影

# 获取布尔索引
flag = movie_gender_rating2.index.isin(top_movie_title)
flag

 
 
  • 1
  • 2
  • 3
df1 = movie_gender_rating2[flag].sort_values(by='diff')
df1.plot(kind ='barh',figsize =(12,9))

 
 
  • 1
  • 2

图3

10.2 加入评分次数限制分析平均分高的电影

movie_rating_mean = pd.pivot_table(movie_data,values='Rating',index='Title')
# 这里依然首先获取最受欢迎布尔索引
index = movie_data.groupby('Title').size().sort_values()[::-1][:50].index
flag2 = movie_rating_mean.index.isin(index)
# 利用布尔索引筛选平均分数组中最受欢迎的电影
movie_rating_top_mean = movie_rating_mean[flag2]
# 进行排序
movie_rating_top_mean.sort_values(by='Rating',ascending = False).head(10)

 
 
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
Rating
Title
Shawshank Redemption, The (1994)4.554558
Godfather, The (1972)4.524966
Usual Suspects, The (1995)4.517106
Schindler’s List (1993)4.510417
Raiders of the Lost Ark (1981)4.477725
Star Wars: Episode IV - A New Hope (1977)4.453694
Sixth Sense, The (1999)4.406263
One Flew Over the Cuckoo’s Nest (1975)4.390725
Godfather: Part II, The (1974)4.357565
Silence of the Lambs, The (1991)4.351823

总结

在数据处理过程中,合并、透视、分组、排序最为常用,通过此项目,熟悉了Pandas在处理百万级数据时的基本操作和一些常用API调用方法,了解到数据分析处理工作的流程,为后续深入学习打下基础。

  • 0
    点赞
  • 10
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
好莱坞电影整体利润表数据包括票房收入、制片成本、宣传费用、营销费用等,可以通过数据可视化分析来更好地理解和展示这些数据的关系和趋势。 以下是可能的数据可视化分析方法: 1. 柱状图或饼图展示不同电影类型的票房收入占比,以及制片成本和宣传费用的占比。这可以帮助我们了解哪些类型的电影更受欢迎,哪些类型的电影的制作成本较高。 2. 折线图展示每年的票房收入和制片成本。这可以帮助我们了解好莱坞电影市场的趋势,例如票房收入是否稳定增长,制片成本是否有变化。 3. 散点图展示电影的制片成本和票房收入之间的关系。这可以帮助我们了解电影制片成本和票房收入之间的相关性。 4. 热力图展示不同国家或地区的电影票房收入。这可以帮助我们了解哪些国家或地区对好莱坞电影市场的贡献最大。 5. 地图展示不同地区的电影票房收入和影院数量。这可以帮助我们了解不同地区的电影市场规模和发展情况。 6. 桑基图展示电影的制片成本和宣传费用等费用之间的关系。这可以帮助我们了解不同费用项之间的关系和权重。 7. 雷达图展示电影的票房收入、口碑评分、制片成本、宣传费用等指标之间的对比。这可以帮助我们了解电影的整体表现和评价。 以上是一些可能的数据可视化分析方法,可以根据具体情况选择合适的方法来分析好莱坞电影整体利润表数据。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值