[Udacity Project] Exploratory Analysis of the TMDb Movie Data Set

Article link: https://blog.csdn.net/weixin_41409944/article/details/102785941

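This project analyzes the TMDb movie data set in depth, looking at the highest-grossing and highest-rated movies and at how genre, cast, production company, and release window relate to box office revenue. The results show that highly rated movies usually perform well at the box office; adventure, action, and science fiction movies have grossed particularly well in recent years; Harrison Ford has the highest total box office, although not all of his movies were hits; Walt Disney Pictures leads in total box office; September is a crowded release month with mediocre box office results; and May and June are better release windows.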


Introduction

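This project explores and analyzes a data set with information on more than 10,000 movies. It examines how box office revenue relates to user ratings, genre, cast, production company, and release window, and looks for the characteristics of high-grossing movies. The data come from The Movie Database (TMDb). The project has three parts: data wrangling, exploratory data analysis, and conclusions. The data are assessed and cleaned with Python's Pandas library, explored and quickly visualized with Pandas' built-in functions, and finally the box office trends are analyzed and the questions answered using descriptive statistics and Matplotlib visualizations.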

# Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

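Questions to explore:

1. Which ten movies have the highest box office revenue? Which ten have the highest ratings? Is there a relationship between rating and revenue?
2. Which genre has the most releases each year? Which genre has the highest total revenue each year? Which genres have done best at the box office in recent years?
3. Which ten actors rank highest by total box office revenue? Were all the movies of the top-grossing actor hits?
4. Which production company has made the most money in recent years?
5. How has the number of releases changed over the past fifty-plus years? In which months is a release more likely to earn high revenue?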

Data Wrangling

Loading the Data

Load the data set and print the first five rows to check that the file was read correctly.

df = pd.read_csv("tmdb-movies.csv")

df.head()

	id	imdb_id	popularity	budget	revenue	original_title	cast	homepage	director	tagline	...	overview	runtime	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	tt0369610	32.985763	150000000	1513528810	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	http://www.jurassicworld.com/	Colin Trevorrow	The park is open.	...	Twenty-two years after the events of Jurassic ...	124	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09
1	76341	tt1392190	28.419936	150000000	378436354	Mad Max: Fury Road	Tom Hardy\|Charlize Theron\|Hugh Keays-Byrne\|Nic...	http://www.madmaxmovie.com/	George Miller	What a Lovely Day.	...	An apocalyptic story set in the furthest reach...	120	Action\|Adventure\|Science Fiction\|Thriller	Village Roadshow Pictures\|Kennedy Miller Produ...	5/13/15	6185	7.1	2015	1.379999e+08	3.481613e+08
2	262500	tt2908446	13.112507	110000000	295238201	Insurgent	Shailene Woodley\|Theo James\|Kate Winslet\|Ansel...	http://www.thedivergentseries.movie/#insurgent	Robert Schwentke	One Choice Can Destroy You	...	Beatrice Prior must confront her inner demons ...	119	Adventure\|Science Fiction\|Thriller	Summit Entertainment\|Mandeville Films\|Red Wago...	3/18/15	2480	6.3	2015	1.012000e+08	2.716190e+08
3	140607	tt2488496	11.173104	200000000	2068178225	Star Wars: The Force Awakens	Harrison Ford\|Mark Hamill\|Carrie Fisher\|Adam D...	http://www.starwars.com/films/star-wars-episod...	J.J. Abrams	Every generation has a story.	...	Thirty years after defeating the Galactic Empi...	136	Action\|Adventure\|Science Fiction\|Fantasy	Lucasfilm\|Truenorth Productions\|Bad Robot	12/15/15	5292	7.5	2015	1.839999e+08	1.902723e+09
4	168259	tt2820852	9.335014	190000000	1506249360	Furious 7	Vin Diesel\|Paul Walker\|Jason Statham\|Michelle ...	http://www.furious7.com/	James Wan	Vengeance Hits Home	...	Deckard Shaw seeks revenge against Dominic Tor...	137	Action\|Crime\|Thriller	Universal Pictures\|Original Film\|Media Rights ...	4/1/15	2947	7.3	2015	1.747999e+08	1.385749e+09

5 rows × 21 columns

General Properties

Check the number of rows and columns

df.shape

(10866, 21)

The raw data set has 10,866 records and 21 attributes.

View summary statistics

df.describe()

	id	popularity	budget	revenue	runtime	vote_count	vote_average	release_year	budget_adj	revenue_adj
count	10866.000000	10866.000000	1.086600e+04	1.086600e+04	10866.000000	10866.000000	10866.000000	10866.000000	1.086600e+04	1.086600e+04
mean	66064.177434	0.646441	1.462570e+07	3.982332e+07	102.070863	217.389748	5.974922	2001.322658	1.755104e+07	5.136436e+07
std	92130.136561	1.000185	3.091321e+07	1.170035e+08	31.381405	575.619058	0.935142	12.812941	3.430616e+07	1.446325e+08
min	5.000000	0.000065	0.000000e+00	0.000000e+00	0.000000	10.000000	1.500000	1960.000000	0.000000e+00	0.000000e+00
25%	10596.250000	0.207583	0.000000e+00	0.000000e+00	90.000000	17.000000	5.400000	1995.000000	0.000000e+00	0.000000e+00
50%	20669.000000	0.383856	0.000000e+00	0.000000e+00	99.000000	38.000000	6.000000	2006.000000	0.000000e+00	0.000000e+00
75%	75610.000000	0.713817	1.500000e+07	2.400000e+07	111.000000	145.750000	6.600000	2011.000000	2.085325e+07	3.369710e+07
max	417859.000000	32.985763	4.250000e+08	2.781506e+09	900.000000	9767.000000	9.200000	2015.000000	4.250000e+08	2.827124e+09

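A few basic facts follow from the summary statistics above:

The mean inflation-adjusted budget is 17,551,040 USD, and at least half of the movies have a budget of 0 (the median is 0).
The mean inflation-adjusted revenue is 51,364,360 USD; likewise, at least half of the movies have a revenue of 0.

In other words, although neither 'revenue_adj' nor 'budget_adj' contains null values, the zeros suggest that budget and revenue are actually missing for more than half of the records. These rows need to be removed before any revenue analysis.

Movie runtimes are concentrated between 90 and 111 minutes (the interquartile range), and the longest movie runs 900 minutes. A runtime of 0 is an anomaly, probably caused by missing information.
The average rating is about 6, with a minimum of 1.5 and a maximum of 9.2. Only about a quarter of the movies score 6.6 or higher, so highly rated movies are a minority.
The data set covers movies released from 1960 to 2015, 56 years in total.

Note: throughout the analysis, budget and revenue use the inflation-adjusted amounts (in 2010 US dollars), that is, the 'budget_adj' and 'revenue_adj' columns.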
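
The share of zero-valued entries can also be measured directly instead of being read off the quartiles. The following is a minimal sketch, assuming a small made-up frame named `sample` that stands in for the real data; only the column names come from the data set.

```python
import pandas as pd

# Hypothetical sample standing in for df; the values are made up.
sample = pd.DataFrame({
    "budget_adj": [0.0, 0.0, 1.5e7, 0.0],
    "revenue_adj": [0.0, 2.4e7, 0.0, 0.0],
})

# Comparing against 0 yields booleans; their mean is the fraction of zeros.
zero_share = (sample[["budget_adj", "revenue_adj"]] == 0).mean()
print(zero_share)
```

Applied to the real frame, the same expression would give the exact proportion of zero budgets and revenues.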

View column information and data types

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
id                      10866 non-null int64
imdb_id                 10856 non-null object
popularity              10866 non-null float64
budget                  10866 non-null int64
revenue                 10866 non-null int64
original_title          10866 non-null object
cast                    10790 non-null object
homepage                2936 non-null object
director                10822 non-null object
tagline                 8042 non-null object
keywords                9373 non-null object
overview                10862 non-null object
runtime                 10866 non-null int64
genres                  10843 non-null object
production_companies    9836 non-null object
release_date            10866 non-null object
vote_count              10866 non-null int64
vote_average            10866 non-null float64
release_year            10866 non-null int64
budget_adj              10866 non-null float64
revenue_adj             10866 non-null float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

From the output above:

Missing data: 9 columns in the data set have missing values.
Data types: the 'id' column should be converted to string type, and the 'release_date' column to datetime type.
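
The count of columns with missing values can be obtained programmatically rather than by reading the `df.info()` output line by line. A minimal sketch, assuming a small made-up frame `sample` in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names follow the data set, values are made up.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "cast": ["A|B", np.nan, "C"],
    "homepage": [np.nan, np.nan, "http://example.com"],
})

# Number of nulls per column, then keep only the columns that have any.
missing = sample.isnull().sum()
cols_with_missing = missing[missing > 0]
print(cols_with_missing)
print(len(cols_with_missing), "columns have missing values")
```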

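Data Cleaning

The cleaning process covers removing duplicates, dropping irrelevant columns, converting data types, renaming columns, handling missing values, and tidying and adding data.

Removing Duplicate Rows

# Check whether the data set contains duplicate rows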
df.duplicated().sum()

The data set contains one duplicate row, which is dropped directly.

df.drop_duplicates(inplace=True)

# Verify that no duplicate rows remain
df.duplicated().sum()
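
Before dropping duplicates it can be useful to look at the duplicated rows themselves. A minimal sketch on a made-up frame `sample` (an assumption, not the real data):

```python
import pandas as pd

# Hypothetical sample with one repeated row.
sample = pd.DataFrame({
    "id": [10, 20, 20, 30],
    "original_title": ["A", "B", "B", "C"],
})

# keep=False marks every copy of a duplicated row, so all copies can be inspected.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# drop_duplicates keeps the first copy of each row by default.
deduped = sample.drop_duplicates()
print(deduped.duplicated().sum())
```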

Dropping Irrelevant Columns

# List all column names in the data set
df.columns

Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
       'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
       'runtime', 'genres', 'production_companies', 'release_date',
       'vote_count', 'vote_average', 'release_year', 'budget_adj',
       'revenue_adj'],
      dtype='object')

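The columns irrelevant to this analysis are 'imdb_id', 'popularity', 'budget', 'revenue', 'homepage', 'tagline', 'keywords', 'overview', and 'runtime'; they are dropped directly.

Notes:

popularity is a metric TMDb uses to gauge how popular a movie is, for search and recommendations. It is computed from several factors such as daily votes, daily views, and the number of users who add the movie to their favorites each day. This analysis does not treat popularity as a general-purpose measure. (Source: https://developers.themoviedb.org/3/getting-started/popularity )
Budget and revenue use the inflation-adjusted values, so the original 'revenue' and 'budget' columns are dropped.

# Drop the irrelevant columns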
df.drop(['imdb_id', 'popularity', 'budget', 'revenue', 'homepage','tagline', 'keywords', 'overview', 'runtime'], axis=1, inplace=True)

# Check the data after dropping the columns
df.head(1)

	id	original_title	cast	director	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	6/9/15	5562	6.5	2015	1.379999e+08	1.392446e+09

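Converting Data Types

Convert the 'id' column to string type and the 'release_date' column to datetime type, and turn off scientific notation so that the 'budget_adj' and 'revenue_adj' columns display as plain floats.

# Convert the id column from int to str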
df['id'] = df['id'].astype(str)
df['id'].dtype

dtype('O')

# Convert the release_date column from str to datetime
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date'].dtype

dtype('<M8[ns]')
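
One caveat with this conversion: the dates are stored with two-digit years (for example 6/9/15), and the `%y` directive reads 00 to 68 as 2000 to 2068. Since the data set starts in 1960, releases from 1960 to 1968 may end up dated in the 2060s, which is worth checking before any analysis by month or date. A minimal sketch of one possible correction, assuming a made-up frame `sample` and using the existing 'release_year' column as the reference:

```python
import pandas as pd

# Hypothetical sample in the same m/d/yy format as the data set.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "12/15/66", "4/1/99"],
    "release_year": [2015, 1966, 1999],
})

parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# %y maps 00-68 to 2000-2068, so a 1966 release is read as 2066.
# Where the parsed year is later than release_year, shift back a century.
too_late = parsed.dt.year > sample["release_year"]
parsed = parsed.where(~too_late, parsed - pd.DateOffset(years=100))
sample["release_date"] = parsed
print(sample)
```

This relies on 'release_year' being correct; if only the month is needed later, the century error does not affect it.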

# Turn off scientific notation
pd.options.display.float_format = '{:20,.2f}'.format
df.head(1)

	id	original_title	cast	director	genres	production_companies	release_date	vote_count	vote_average	release_year	budget_adj	revenue_adj
0	135397	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	2015-06-09	5562	6.50	2015	137,999,939.28	1,392,445,892.52

Finally, verify that every column now has the correct data type.

df.dtypes

id                              object
original_title                  object
cast                            object
director                        object
genres                          object
production_companies            object
release_date            datetime64[ns]
vote_count                       int64
vote_average                   float64
release_year                     int64
budget_adj                     float64
revenue_adj                    float64
dtype: object

Renaming Columns

Rename the 'budget_adj' and 'revenue_adj' columns to 'budget' and 'revenue'.

df.rename(columns={
   'budget_adj':'budget','revenue_adj':'revenue'}, inplace=True)
df.head(1)

	id	original_title	cast	director	genres	production_companies	release_date	vote_count	vote_average	release_year	budget	revenue
0	135397	Jurassic World	Chris Pratt\|Bryce Dallas Howard\|Irrfan Khan\|Vi...	Colin Trevorrow	Action\|Adventure\|Science Fiction\|Thriller	Universal Studios\|Amblin Entertainment\|Legenda...	2015-06-09	5562	6.50	2015	137,999,939.28	1,392,445,892.52

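Handling Missing Values

Check the missing values in every column of the data set

While assessing the data we found that at least half of the movies have a budget or revenue of 0, which could seriously distort any revenue analysis. First, look at the distribution of the whole data set.

# Plot the distribution of the whole data set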
df.hist(figsize=(10,10));

[Figure: histograms of the numeric columns of df (output_34_0.png)]

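The 'budget' and 'revenue' histograms confirm this reading of the data: the large number of missing (zero) budgets and revenues swamps the part of the distribution that holds real values. So the first step is to clean out the records whose budget or revenue is 0.

Cleaning Records with Zero Budget or Revenue

First look at the distributions of the zero-budget group and the zero-revenue group separately, to check whether these records share any common features.

# Clean a copy of the data set rather than the original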
df_clean = df.copy()

# Plot the distributions of all records with a budget of 0
df_budget_zero = df_clean[(df_clean.budget == 0)]
df_budget_zero.hist(figsize=(10,10));

[Figure: histograms of the records with a budget of 0 (output_38_0.png)]

# Plot the distributions of all records with a revenue of 0
df_revenue_zero = df_clean[(df_clean.revenue == 0)]
df_revenue_zero.hist(figsize=(10,10));

[Figure: histograms of the records with a revenue of 0 (output_39_0.png)]

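In the two sets of histograms, release year, rating, and vote count are distributed in much the same way, so removing these records from the raw data should not badly bias the overall picture.

However, the zero-budget histograms also show that some of those movies actually earned high revenue. Because the whole analysis centers on revenue, I decided to delete only the movies with zero revenue and keep the zero-budget records that remain.

# Drop all records with a revenue of 0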
df_clean.drop(df_revenue_zero.index, inplace=True)
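
The same filter can also be written as a boolean mask, which avoids keeping a separate frame of rows to drop. This is only an alternative sketch on a made-up frame `sample`, not the code used in this project:

```python
import pandas as pd

# Hypothetical sample; the values are made up.
sample = pd.DataFrame(
    {"budget": [0.0, 1.0e7, 0.0, 5.0e6], "revenue": [0.0, 3.0e7, 8.0e6, 0.0]},
    index=[0, 1, 2, 3],
)

# Keep rows with non-zero revenue; zero-budget rows survive if revenue is known.
kept = sample[sample["revenue"] != 0].copy()
print(kept)
```

Row 2 in the sample has no budget but a known revenue, so it stays, matching the decision to remove only zero-revenue records.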

# View the summary statistics after cleaning
df_clean.describe()

	vote_count	vote_average	release_year	budget	revenue
count	4,849.00	4,849.00	4,849.00	4,849.00	4,849.00
mean	436.28	6.15	2,000.92	35,162,081.53	115,100,887.63
std	806.49	0.80	11.57	43,761,166.89	198,855,667.98
min	10.00	2.10	1,960.00	0.00	2.37
25%	46.00	5.60	1,994.00	2,329,409.26	10,465,848.09
50%	147.00	6.20	2,004.00	20,328,008.68	43,956,661.16
75%	435.00	6.70	2,010.00	49,735,160.27	131,648,235.91
max	9,767.00	8.40	2,015.00	425,000,000.00	2,827,123,750.41

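After cleaning, the revenue and budget distributions look better, and the middle 25% to 75% of budget and revenue values are reasonable.

Next, check the remaining missing values in the df_clean data set.

# Count the missing values in every column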
df_clean.isnull().sum()

id                       0
original_title           0
cast                     5
director                 1
genres                   0
production_companies    96
release_date             0
vote_count               0
vote_average             0
release_year             0
budget                   0
revenue                  0
dtype: int64

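Three columns in the data set still have missing values. The 'cast' and 'director' columns have only a few, so these will be filled in by hand. The 'production_companies' column has relatively more, but they have little effect on the main analysis, so they are left as NaN for now and dropped only when production companies are analyzed.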
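
Deferring the removal of rows without a production company might look like the following at analysis time. This is a minimal sketch on a made-up frame `sample`; the studio names and revenue figures are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; names and values are made up.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "production_companies": ["Studio X|Studio Y", np.nan, "Studio X"],
    "revenue": [100.0, 50.0, 30.0],
})

# Drop rows without a company only in the working copy for this question.
companies = sample.dropna(subset=["production_companies"]).copy()

# Split the pipe-delimited string and give each company its own row.
companies["production_companies"] = companies["production_companies"].str.split("|")
companies = companies.explode("production_companies")

totals = companies.groupby("production_companies")["revenue"].sum()
print(totals.sort_values(ascending=False))
```

The original frame keeps all of its rows, so other questions are unaffected. Note that this sketch credits a movie's full revenue to every listed company, which is one possible convention rather than the only one.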

Filling In Cast and Director Data

# Show the rows with missing cast
df_clean[df_clean.cast.isnull()]

	id	original_title	cast	director	genres	production_companies	release_date	vote_count	vote_average	release_year	budget	revenue
1088	169607	Finding Vivian Maier	NaN	John Maloof\|Charlie Siskel	Documentary	NaN	2014-03-28	70	7.80	2014	0.00	1,384,967.24
4127	21925	Naqoyqatsi	NaN	Godfrey Reggio	Documentary\|Drama\|Music\|Thriller	Qatsi Productions	2002-09-02	20	6.00	2002	3,636,784.18	16,132.77
4889	126509	2016: Obama's America	NaN	Dinesh D'Souza\|John Sullivan	Documentary	NaN	2012-07-13	11	4.70	2012	2,374,360.70	31,721,459.00
7813	22887	Loose Change: Final Cut	NaN	Dylan Avery	Documentary	Louder Than Words	2007-11-11	12	5.10	2007	6,310.01	6,310.01
9564	24348	Powaqqatsi	NaN	Godfrey Reggio	Documentary\|Drama\|Music	NaN	1988-04-29	18	7.20	1988	4,609,727.56	1,086,501.72

Fill in the missing cast names using the same format as the data set's cast column.

# Store the cast names to fill in in the array cast
cast = np.array(['Vivian Maier|John Maloof','Belladonna|Marlon Brando','Jay Bastian|Joe Biden','Dylan Avery|Mahmoud Ahmad','Christie Brinkley|David Brinkley'])

# Write the cast values into df_clean
df_clean.loc[[1088,4127,4889,7813,9564],'cast'] = cast
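
# The assignment above matches values to rows by position, so the order of the
# array has to agree with the order of the index list. A possible alternative,
# sketched here on a made-up frame with hypothetical names, keys each value by
# its row label so the pairing is explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; index labels mimic the data set, names are made up.
sample = pd.DataFrame(
    {"original_title": ["Film A", "Film B", "Film C"],
     "cast": [np.nan, "Actor 1|Actor 2", np.nan]},
    index=[1088, 2000, 4127],
)

# Hypothetical replacement values keyed by row label.
cast_by_index = {1088: "Person A|Person B", 4127: "Person C|Person D"}

# fillna with a Series aligns on the index, so the order of entries does not
# matter and rows that already have a cast are left untouched.
sample["cast"] = sample["cast"].fillna(pd.Series(cast_by_index))
print(sample)
```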

# Check the result after filling in
df_clean.loc[