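TMDb Movie Data Analysis

1. Introduction

About the data

The dataset contains a little over 10,000 film records from The Movie Database (TMDb), with 21 fields per film, including id, popularity, revenue, budget, title, cast, director, genres, average user rating, number of votes, and release date.
Period covered: 1960-2015

Main variables analyzed

Popularity (popularity), budget (budget), genre (genres), release date (release_date), box-office revenue (revenue), average rating (vote_average), and number of votes (vote_count)

Scope of the analysis

[Figure]
Tools: Python, Power BI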

2. Results

2.1 Genres (market distribution)

[Figures]

2.2 Revenue

[Figures]

2.3 Release timing

[Figures]

2.4 Audience ratings (rating and number of votes)

[Figures]

2.5 Audience preference (popularity)

[Figures]

2.6 Profitability

[Figures]

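3. Conclusions

1) Genres:

  • Overall, the number of films grew rapidly from 2000 onward, with drama, comedy, thriller, action, horror, science fiction, and documentary growing fastest.
  • From 2005 to 2015, the shares of comedy, fantasy, romance, and family films declined, while the shares of music, horror, documentary, science fiction, and thriller films rose.
  • Of the 20 genres, the six most common are drama, comedy, thriller, action, romance, and horror. The least common are western, TV movie, foreign, war, and history.

2) Revenue:

  • Revenue trends upward overall: growth accelerated after 1986, then levelled off, and picked up sharply again after 2006.
  • The six genres with the highest average revenue are animation, fantasy, adventure, family, science fiction, and action. Their trends over time are fairly flat, with adventure and action declining. Documentary, horror, foreign, drama, and history films earn comparatively little, so these genres call for caution.
  • Drama shows the fastest revenue growth, followed by thriller, horror, comedy, and crime.
  • Revenue is positively correlated with popularity, number of votes, and budget. A larger budget can pay for higher production quality and more marketing, which may help raise revenue.

3) Release timing:

  • The highest-grossing months are June, May, November, and July; the lowest are September, January, and August. September, October, December, and January see the most releases, so competition is stiffest in those months.
  • Weighing revenue against competition, May, June, July, and November are the best release windows, while August, September, and January call for caution. December brings high revenue but also heavy competition, so it offers both opportunity and risk.
  • Friday has the most releases, but films released on Monday, Tuesday, and Wednesday have the highest average revenue, far above weekend releases.

4) Ratings:

  • The number of votes trends upward overall, rising from 2007 and dipping slightly after 2013.
  • From the 1970s on, as output grew, quality became more uneven: low-rated films began to appear and the average rating slipped slightly, which may also reflect rising audience expectations.
  • Average ratings by genre all fall between 5 and 7, so genre has little effect on rating. Number of votes, revenue, and popularity also have little effect; a film's rating depends mainly on its quality and on viewers' subjective judgment.

5) Audience preference (popularity):

  • The six most popular genres are science fiction, adventure, fantasy, animation, action, and family. Films with imaginative or non-realistic subject matter (science fiction, adventure, fantasy, animation) are clearly more popular than other genres.
  • Popularity is positively correlated with number of votes, revenue, and budget, although some low-budget films are highly popular. A larger budget can pay for higher production quality and more marketing, which may help raise popularity.

6) Profitability:

  • Profit: the top six genres are animation, fantasy, adventure, family, science fiction, and action. Documentary, western, horror, and history films have the lowest profit.
  • Return on investment: the top six genres are documentary, music, romance, animation, family, and horror. Western and history films have the lowest return on investment and carry a higher risk of loss, so they call for caution.

7) Limitations of the analysis:

  • Budget or revenue is missing for about 60% of the records. After filtering, only a little over 3,000 rows remain, which may reduce the accuracy of the analysis.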

4. Data processing (Python)

# Import modules
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

4.1 Data cleaning

General properties

# Load the data and print the first few rows as a sanity check
df = pd.read_csv('tmdb_movies.csv')
df.head()
idimdb_idpopularitybudgetrevenueoriginal_titlecasthomepagedirectortagline...overviewruntimegenresproduction_companiesrelease_datevote_countvote_averagerelease_yearbudget_adjrevenue_adj
0135397tt036961032.9857631500000001513528810Jurassic WorldChris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...http://www.jurassicworld.com/Colin TrevorrowThe park is open....Twenty-two years after the events of Jurassic ...124Action|Adventure|Science Fiction|ThrillerUniversal Studios|Amblin Entertainment|Legenda...6/9/1555626.520151.379999e+081.392446e+09
176341tt139219028.419936150000000378436354Mad Max: Fury RoadTom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...http://www.madmaxmovie.com/George MillerWhat a Lovely Day....An apocalyptic story set in the furthest reach...120Action|Adventure|Science Fiction|ThrillerVillage Roadshow Pictures|Kennedy Miller Produ...5/13/1561857.120151.379999e+083.481613e+08
2262500tt290844613.112507110000000295238201InsurgentShailene Woodley|Theo James|Kate Winslet|Ansel...http://www.thedivergentseries.movie/#insurgentRobert SchwentkeOne Choice Can Destroy You...Beatrice Prior must confront her inner demons ...119Adventure|Science Fiction|ThrillerSummit Entertainment|Mandeville Films|Red Wago...3/18/1524806.320151.012000e+082.716190e+08
3140607tt248849611.1731042000000002068178225Star Wars: The Force AwakensHarrison Ford|Mark Hamill|Carrie Fisher|Adam D...http://www.starwars.com/films/star-wars-episod...J.J. AbramsEvery generation has a story....Thirty years after defeating the Galactic Empi...136Action|Adventure|Science Fiction|FantasyLucasfilm|Truenorth Productions|Bad Robot12/15/1552927.520151.839999e+081.902723e+09
4168259tt28208529.3350141900000001506249360Furious 7Vin Diesel|Paul Walker|Jason Statham|Michelle ...http://www.furious7.com/James WanVengeance Hits Home...Deckard Shaw seeks revenge against Dominic Tor...137Action|Crime|ThrillerUniversal Pictures|Original Film|Media Rights ...4/1/1529477.320151.747999e+081.385749e+09

5 rows × 21 columns

# Check column types and look for missing or malformed data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 
 15  release_date          10866 non-null  object 
 16  vote_count            10866 non-null  int64  
 17  vote_average          10866 non-null  float64
 18  release_year          10866 non-null  int64  
 19  budget_adj            10866 non-null  float64
 20  revenue_adj           10866 non-null  float64
dtypes: float64(4), int64(6), object(11)
memory usage: 1.7+ MB

Cleaning (drop unneeded columns, drop rows with missing values, remove duplicates)

# Drop columns that are not needed
df.drop(['imdb_id','cast','homepage', 'director', 'tagline', 'keywords', 'overview','production_companies'], axis = 1, inplace = True)
df.head(2)
       id  popularity     budget     revenue      original_title  runtime                                     genres release_date  vote_count  vote_average  release_year    budget_adj   revenue_adj
0  135397   32.985763  150000000  1513528810      Jurassic World      124  Action|Adventure|Science Fiction|Thriller       6/9/15        5562           6.5          2015  1.379999e+08  1.392446e+09
1   76341   28.419936  150000000   378436354  Mad Max: Fury Road      120  Action|Adventure|Science Fiction|Thriller      5/13/15        6185           7.1          2015  1.379999e+08  3.481613e+08
# Count missing values per column
df.isnull().sum()
id                 0
popularity         0
budget             0
revenue            0
original_title     0
runtime            0
genres            23
release_date       0
vote_count         0
vote_average       0
release_year       0
budget_adj         0
revenue_adj        0
dtype: int64
# Drop every row that contains a missing value
df.dropna(inplace = True)
# Check whether any column still has missing values
df.isnull().sum().any()
False
# Count duplicate rows
sum(df.duplicated())
1
# Remove duplicate rows
df.drop_duplicates(inplace = True)
# Confirm that the duplicates are gone
sum(df.duplicated())
0
df.describe()
                  id    popularity        budget       revenue       runtime    vote_count  vote_average  release_year    budget_adj   revenue_adj
count   10842.000000  10842.000000  1.084200e+04  1.084200e+04  10842.000000  10842.000000  10842.000000  10842.000000  1.084200e+04  1.084200e+04
mean    65870.675521      0.647461  1.465531e+07  3.991138e+07    102.138443    217.823649      5.974064   2001.314794  1.758712e+07  5.147797e+07
std     91981.355752      1.001032  3.093971e+07  1.171179e+08     31.294612    576.180993      0.934257     12.813617  3.433437e+07  1.447723e+08
min         5.000000      0.000065  0.000000e+00  0.000000e+00      0.000000     10.000000      1.500000   1960.000000  0.000000e+00  0.000000e+00
25%     10589.250000      0.208210  0.000000e+00  0.000000e+00     90.000000     17.000000      5.400000   1995.000000  0.000000e+00  0.000000e+00
50%     20557.000000      0.384532  0.000000e+00  0.000000e+00     99.000000     38.000000      6.000000   2006.000000  0.000000e+00  0.000000e+00
75%     75186.000000      0.715393  1.500000e+07  2.414118e+07    111.000000    146.000000      6.600000   2011.000000  2.092507e+07  3.387838e+07
max    417859.000000     32.985763  4.250000e+08  2.781506e+09    900.000000   9767.000000      9.200000   2015.000000  4.250000e+08  2.827124e+09
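
The summary shows a median of 0 for both budget and revenue, which suggests that 0 is acting as a placeholder for an unreported amount rather than a real figure. A minimal sketch of how the share of such placeholders could be measured by converting them to NaN first; the rows below are invented for illustration and only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy rows (invented values); 0 stands in for an unreported amount
toy = pd.DataFrame({
    'budget':  [150_000_000, 0, 0, 20_000_000],
    'revenue': [378_436_354, 0, 5_000_000, 0],
})

# Treat 0 as missing so that the usual NaN tooling applies
money = toy[['budget', 'revenue']].replace(0, np.nan)

missing_share = money.isna().mean()       # fraction missing per column
complete_rows = money.dropna().shape[0]   # rows with both values present
print(missing_share)
print(complete_rows)
```

On the real data the same two lines could be applied to df before deciding how many rows a budget or revenue analysis can actually rely on.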

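4.2 Exploratory data analysis

1) Genre analysis: how are films distributed across genres, and how has this changed over time?
(1) Extract the genres

# A set removes duplicates automatically
genres_set = set()
# Split each genres string on '|' and add the pieces to the set
for i in df['genres']:
    genres_set.update(i.split('|'))
genres_set.discard(' ')
genres_set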

{'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western'}
# Create an empty DataFrame
genres_df = pd.DataFrame()
# One indicator column per genre: 1 if the film has that genre, otherwise 0
for gen in genres_set:
    genres_df[gen] = df['genres'].str.contains(gen, regex = False).astype(int)
genres_df.head(10)
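   Music  Action  Thriller  Horror  War  Fantasy  TV Movie  Family  Science Fiction  Comedy  Mystery  Crime  Animation  Drama  Foreign  Adventure  Western  History  Documentary  Romance
0      0       1         1       0    0        0         0       0                1       0        0      0          0      0        0          1        0        0            0        0
1      0       1         1       0    0        0         0       0                1       0        0      0          0      0        0          1        0        0            0        0
2      0       0         1       0    0        0         0       0                1       0        0      0          0      0        0          1        0        0            0        0
3      0       1         0       0    0        1         0       0                1       0        0      0          0      0        0          1        0        0            0        0
4      0       1         1       0    0        0         0       0                0       0        0      1          0      0        0          0        0        0            0        0
5      0       0         1       0    0        0         0       0                0       0        0      0          0      1        0          1        1        0            0        0
6      0       1         1       0    0        0         0       0                1       0        0      0          0      0        0          1        0        0            0        0
7      0       0         0       0    0        0         0       0                1       0        0      0          0      1        0          1        0        0            0        0
8      0       0         0       0    0        0         0       1                0       1        0      0          1      0        0          1        0        0            0        0
9      0       0         0       0    0        0         0       1                0       1        0      0          1      0        0          0        0        0            0        0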
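
The same indicator table can also be produced in one call. The sketch below is an alternative to the loop, not the code used for this analysis, and runs on three invented genre strings. Series.str.get_dummies splits on the separator and matches whole tags, so no substring matching is involved:

```python
import pandas as pd

# Three invented genre strings in the dataset's 'A|B|C' format
genres = pd.Series([
    'Action|Adventure|Science Fiction|Thriller',
    'Action|Crime|Thriller',
    'Drama',
])

# One column per genre, 1 if the film carries that tag, else 0
one_hot = genres.str.get_dummies(sep='|')
print(one_hot)
```

The columns come back in alphabetical order, which also makes the layout reproducible between runs, unlike iteration over a set.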
# Count the films without (0) and with (1) each genre
g1 = pd.DataFrame(index = [0,1])
for gen in genres_set:
    g1[gen] = genres_df[gen].value_counts()

g1
MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
010434845879359205105729926106759611961370491003294881014360821065493711067710508103229130
14082384290716372709161671231122937938101354699476018814711653345201712
# Add the release year to the indicator frame
genres_df['release_year'] = df['release_year']
# Group by year and sum to get the number of films per genre per year
gen_year = genres_df.groupby('release_year').sum()
gen_year.head()

MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
release_year
1960186722033802013156506
19612703220541012116163307
1962187531022543021173405
196304109120321364113272408
196455963404416410220151309

(2) Number of films per genre by year, 1960-2015


plt.figure(figsize = (12, 6))
plt.plot(gen_year)
plt.title('Number of movies by genres and year')
plt.xticks(range(1960, 2020, 5))
plt.xlabel('Year')
plt.ylabel('Number of movies')
plt.legend(gen_year.columns)  # one legend entry per genre

[Figure: Number of movies by genres and year]

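# Total number of genre tags per year (a film with several genres is counted once per genre)
genres_sum_year = gen_year.sum(axis = 1)
# Share of each genre per year
gen_proportion_year = pd.DataFrame()
for i in gen_year.columns:
    gen_proportion_year[i] = gen_year[i]/genres_sum_year
gen_proportion_year.head()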
MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
release_year
19600.0128210.1025640.0769230.0897440.0256410.0256410.00.0384620.0384620.1025640.0000000.0256410.0000000.1666670.0128210.0641030.0769230.0641030.00.076923
19610.0266670.0933330.0000000.0400000.0266670.0266670.00.0666670.0533330.1333330.0133330.0266670.0133330.2133330.0133330.0800000.0400000.0400000.00.093333
19620.0121950.0975610.0853660.0609760.0365850.0121950.00.0243900.0243900.0609760.0487800.0365850.0000000.2560980.0121950.0853660.0365850.0487800.00.060976
19630.0000000.0439560.1098900.0989010.0109890.0219780.00.0329670.0219780.1428570.0659340.0439560.0109890.1428570.0219780.0769230.0219780.0439560.00.087912
19640.0450450.0450450.0810810.0540540.0270270.0360360.00.0360360.0360360.1441440.0360360.0900900.0180180.1801800.0090090.0450450.0090090.0270270.00.081081
# Genre shares, 2005-2015
gen_proportion_year1 = gen_proportion_year.loc[2005:,:]
gen_proportion_year1
MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
release_year
20050.0139040.0748660.0983960.0598930.0053480.0374330.0032090.0609630.0267380.1529410.0310160.0481280.0278070.1946520.0106950.0566840.0021390.0106950.0181820.066310
20060.0114940.0766280.1091950.0536400.0067050.0325670.0076630.0622610.0287360.1484670.0287360.0507660.0373560.1886970.0095790.0526820.0009580.0114940.0153260.067050
20070.0133210.0843690.1110120.0701600.0053290.0417410.0053290.0399640.0364120.1341030.0301950.0577260.0284190.1749560.0150980.0532860.0044400.0115450.0168740.065719
20080.0161420.0799030.1025020.0613400.0145280.0347050.0040360.0451980.0419690.1364000.0234060.0500400.0266340.1880550.0145280.0508470.0016140.0193700.0209850.067797
20090.0125180.0795290.1156110.0662740.0088370.0360820.0058910.0441830.0522830.1458030.0375550.0382920.0353460.1649480.0125180.0530190.0000000.0117820.0184090.061119
20100.0082580.0883570.1106520.0644100.0057800.0363340.0066060.0454170.0371590.1395540.0264240.0412880.0412880.1734100.0107350.0487200.0049550.0115610.0289020.070190
20110.0133020.0899840.1142410.0610330.0070420.0359940.0078250.0563380.0438180.1345850.0297340.0375590.0359940.1674490.0109550.0485130.0023470.0062600.0383410.058685
20120.0172550.0776470.1254900.0815690.0078430.0266670.0109800.0329410.0423530.1380390.0258820.0423530.0313730.1819610.0047060.0392160.0031370.0101960.0384310.061961
20130.0216920.0874910.1265370.0737530.0050610.0282000.0072310.0347070.0441070.1265370.0274770.0513380.0303690.1829360.0000000.0484450.0021690.0101230.0448300.046999
20140.0190480.0877550.1217690.0714290.0156460.0244900.0095240.0292520.0421770.1258500.0244900.0442180.0244900.1931970.0000000.0455780.0040820.0102040.0496600.057143
20150.0238100.0772010.1233770.0901880.0064940.0238100.0144300.0317460.0620490.1168830.0303030.0367970.0281390.1875900.0000000.0497840.0043290.0108230.0411260.041126
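
The per-year shares are each year's counts divided by that year's row total. As a hedged alternative to the column-by-column loop, DataFrame.div with axis=0 does the same division in one step. The counts below are invented, not taken from the dataset:

```python
import pandas as pd

# Invented yearly counts for three genres
counts = pd.DataFrame(
    {'Drama': [13, 16], 'Comedy': [8, 10], 'Action': [4, 6]},
    index=pd.Index([1960, 1961], name='release_year'),
)

# Divide every row by its own total; axis=0 aligns the totals on the row index
share = counts.div(counts.sum(axis=1), axis=0)
print(share)
```

Each row of share sums to 1, which is a quick check that the division ran along the intended axis.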

(3) Change in genre shares, 2005-2015

# Genre shares by year, 2005-2015
plt.figure(figsize = (12, 6))
plt.plot(gen_proportion_year.loc[2005:, :])
plt.title('Proportion of movies by genres and year (2005-2015)')
plt.xticks(range(2005, 2016, 1))
plt.xlabel('Year')
plt.ylabel('Proportion')
plt.legend(gen_proportion_year.columns)  # one legend entry per genre

[Figure: Proportion of movies by genres and year (2005-2015)]

(4) Overall genre shares

# Number of films per genre
genres_sum = genres_df.sum().sort_values(ascending = False).drop('release_year')
genres_sum
Drama              4760
Comedy             3793
Thriller           2907
Action             2384
Romance            1712
Horror             1637
Adventure          1471
Crime              1354
Family             1231
Science Fiction    1229
Fantasy             916
Mystery             810
Animation           699
Documentary         520
Music               408
History             334
War                 270
Foreign             188
TV Movie            167
Western             165
dtype: int64
genres_total = genres_sum.sum()
# Share of each genre among all genre tags
genres_proportion = genres_sum/ genres_total
genres_proportion
Drama              0.176591
Comedy             0.140716
Thriller           0.107846
Action             0.088444
Romance            0.063513
Horror             0.060731
Adventure          0.054572
Crime              0.050232
Family             0.045669
Science Fiction    0.045595
Fantasy            0.033983
Mystery            0.030050
Animation          0.025932
Documentary        0.019291
Music              0.015136
History            0.012391
War                0.010017
Foreign            0.006975
TV Movie           0.006196
Western            0.006121
dtype: float64
# Horizontal bar chart
genres_proportion.plot.barh(label = 'genre', figsize = (12, 6))
plt.title('Proportion of Genres')
plt.xlabel('Proportion')
plt.ylabel('Genre')

[Figure: Proportion of Genres]
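
Because a film can carry several genres, the proportions above are shares of all genre tags, not shares of films. The short calculation below uses only figures printed earlier (4,760 Drama films, 10,842 rows after cleaning, and Drama's tag share of 0.176591) to show how far the two measures differ:

```python
# Figures copied from the outputs earlier in this section
drama_films = 4760       # films tagged Drama
total_films = 10842      # rows left after cleaning
drama_tag_share = 0.176591   # Drama's share of all genre tags

# Share of films that are tagged Drama
film_share = drama_films / total_films

# Ratio of the two shares = average number of genre tags per film
tags_per_film = film_share / drama_tag_share

print(round(film_share, 3), round(tags_per_film, 2))
```

So roughly 44% of films are dramas even though Drama accounts for under 18% of tags, and the average film carries about two and a half genres.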

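Further cleaning, with the result stored in df2:

Keep only films with more than 50 votes. When vote_count is too low, vote_average is not statistically meaningful.
Drop rows where budget or revenue is 0.

# Keep films with more than 50 votes
df2 = df.query('vote_count > 50')
# Drop rows whose budget or revenue is 0; copy so that columns can be added to df2 later
df2 = df2[(df2['budget'] != 0) & (df2['revenue'] != 0)].copy()
df2.head()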
idpopularitybudgetrevenueoriginal_titleruntimegenresrelease_datevote_countvote_averagerelease_yearbudget_adjrevenue_adj
013539732.9857631500000001513528810Jurassic World124Action|Adventure|Science Fiction|Thriller6/9/1555626.520151.379999e+081.392446e+09
17634128.419936150000000378436354Mad Max: Fury Road120Action|Adventure|Science Fiction|Thriller5/13/1561857.120151.379999e+083.481613e+08
226250013.112507110000000295238201Insurgent119Adventure|Science Fiction|Thriller3/18/1524806.320151.012000e+082.716190e+08
314060711.1731042000000002068178225Star Wars: The Force Awakens136Action|Adventure|Science Fiction|Fantasy12/15/1552927.520151.839999e+081.902723e+09
41682599.3350141900000001506249360Furious 7137Action|Crime|Thriller4/1/1529477.320151.747999e+081.385749e+09
df2.describe()
                  id    popularity        budget       revenue       runtime    vote_count  vote_average  release_year    budget_adj   revenue_adj
count    3123.000000   3123.000000  3.123000e+03  3.123000e+03   3123.000000   3123.000000   3123.000000   3123.000000  3.123000e+03  3.123000e+03
mean    42688.892731      1.392294  4.261919e+07  1.296105e+08    110.037464    644.723343      6.254371   2002.292027  4.968112e+07  1.629720e+08
std     71354.906196      1.569972  4.461333e+07  1.891287e+08     19.588370    939.891310      0.760518     10.935365  4.697721e+07  2.311534e+08
min         5.000000      0.010335  1.000000e+00  2.000000e+00     26.000000     51.000000      3.300000   1960.000000  9.693980e-01  2.861934e+00
25%      4000.000000      0.617888  1.300000e+07  2.496112e+07     96.000000    135.000000      5.700000   1997.000000  1.657964e+07  3.145014e+07
50%     10585.000000      0.976612  2.800000e+07  6.556987e+07    106.000000    299.000000      6.300000   2005.000000  3.463336e+07  8.285793e+07
75%     44927.500000      1.593279  5.950000e+07  1.556332e+08    120.000000    717.500000      6.800000   2011.000000  6.911341e+07  1.967146e+08
max    336004.000000     32.985763  4.250000e+08  2.781506e+09    248.000000   9767.000000      8.400000   2015.000000  4.250000e+08  2.827124e+09
df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3123 entries, 0 to 10822
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              3123 non-null   int64  
 1   popularity      3123 non-null   float64
 2   budget          3123 non-null   int64  
 3   revenue         3123 non-null   int64  
 4   original_title  3123 non-null   object 
 5   runtime         3123 non-null   int64  
 6   genres          3123 non-null   object 
 7   release_date    3123 non-null   object 
 8   vote_count      3123 non-null   int64  
 9   vote_average    3123 non-null   float64
 10  release_year    3123 non-null   int64  
 11  budget_adj      3123 non-null   float64
 12  revenue_adj     3123 non-null   float64
dtypes: float64(4), int64(6), object(3)
memory usage: 341.6+ KB

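2) Revenue analysis: which features is revenue related to?

(1) Factors associated with revenue (numeric variables)

# Heatmap of the correlation matrix
plt.subplots(figsize=(8,8))  # set the figure size
# numeric_only skips the text columns (requires pandas 1.5 or later)
sns.heatmap(df2.corr(numeric_only = True), annot = True, vmax = 1, square = True, cmap = 'Reds' )

[Figure: correlation heatmap for df2]

Revenue is most strongly correlated with number of votes (0.74), budget (0.67), and popularity (0.59).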
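
DataFrame.corr uses the Pearson correlation coefficient by default. As a reminder of what the numbers in the heatmap measure, here is a minimal pure-Python sketch of that coefficient. The function name pearson_r and the five data points are made up for illustration:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented budget and revenue figures (arbitrary units)
budget = [10, 20, 30, 40, 50]
revenue = [25, 30, 70, 60, 110]

r = pearson_r(budget, revenue)
print(round(r, 2))
```

The coefficient only describes how closely the points follow a straight line. A high value does not by itself show that raising one variable would raise the other.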

Scatter plots with fitted regression lines for revenue against popularity, number of votes, and budget

# Revenue together with number of votes, budget, and popularity
revenue = df2[['vote_count', 'budget', 'popularity', 'revenue']]
plt.figure(figsize = (18,6))

# Revenue vs. popularity: scatter plot and linear regression line
ax1 = plt.subplot(1, 3, 1)
ax1 = sns.regplot(x = 'popularity', y = 'revenue', data = revenue, color = 'y')
ax1.text(0.05, 0.9, 'r = 0.59', transform = ax1.transAxes)  # position given in axes coordinates
plt.title('Revenue and Popularity')
plt.xlabel('Popularity')
plt.ylabel('Revenue')

# Revenue vs. number of votes: scatter plot and linear regression line
ax2 = plt.subplot(1, 3, 2)
ax2 = sns.regplot(x ='vote_count' , y = 'revenue', data = revenue, color = 'r')
ax2.text(0.05, 0.9, 'r = 0.74', transform = ax2.transAxes)
plt.title('Revenue and Vote count')
plt.xlabel('Vote count')
plt.ylabel('Revenue')

# Revenue vs. budget: scatter plot and linear regression line
ax3 = plt.subplot(1, 3, 3)
ax3 = sns.regplot(x = 'budget' , y ='revenue', data = revenue, color = 'b')
ax3.text(0.05, 0.9, 'r = 0.67', transform = ax3.transAxes)
plt.title('Revenue and Budget')
plt.xlabel('Budget')
plt.ylabel('Revenue')

[Figure: Revenue and Popularity; Revenue and Vote count; Revenue and Budget]
(2) Revenue by genre

# Create an empty DataFrame
genres_df2 = pd.DataFrame()
# One indicator column per genre: 1 if the film has that genre, otherwise 0
for gen in genres_set:
    genres_df2[gen] = df2['genres'].str.contains(gen, regex = False).astype(int)
genres_df2.head()
MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
001100000100000010000
101100000100000010000
200100000100000010000
301000100100000010000
401100000000100000000
# Count the films without (0) and with (1) each genre
g0 = pd.DataFrame(index = [0,1])
for gen in genres_set:
    g0[gen] = genres_df2[gen].value_counts()

g0
MusicActionThrillerHorrorWarFantasyTV MovieFamilyScience FictionComedyMysteryCrimeAnimationDramaForeignAdventureWesternHistoryDocumentaryRomance
03023219820952747302327693123.02758267120582836257029371768312224463084301431082625
11009251028376100354NaN3654521065287553186135516773910915498
# Drop the genres with too few films

genres_df2.drop(['TV Movie','Foreign'],axis = 1, inplace = True)
genres_df2.head()
MusicActionThrillerHorrorWarFantasyFamilyScience FictionComedyMysteryCrimeAnimationDramaAdventureWesternHistoryDocumentaryRomance
0011000010000010000
1011000010000010000
2001000010000010000
3010001010000010000
4011000000010000000
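# Combine the genre indicators with popularity, budget, revenue, release date, vote count, rating, and release year
cols = ['popularity', 'budget', 'revenue', 'release_date', 'vote_count', 'vote_average', 'release_year']
genres_df3 = pd.concat([genres_df2, df2[cols]], axis = 1)
genres_df3.head()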
MusicActionThrillerHorrorWarFantasyFamilyScience FictionComedyMystery...HistoryDocumentaryRomancepopularitybudgetrevenuerelease_datevote_countvote_averagerelease_year
00110000100...00032.98576315000000015135288106/9/1555626.52015
10110000100...00028.4199361500000003784363545/13/1561857.12015
20010000100...00013.1125071100000002952382013/18/1524806.32015
30100010100...00011.173104200000000206817822512/15/1552927.52015
40110000000...0009.33501419000000015062493604/1/1529477.32015

5 rows × 25 columns

genres_set.remove('TV Movie')
genres_set.remove('Foreign')
genres_set
{'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'Thriller',
 'War',
 'Western'}

Compute the average rating, popularity, revenue, and budget for each genre

# Four empty Series indexed by genre (a set cannot be used as an index directly)
genres_list = list(genres_set)
vote_by_genre = pd.Series(index = genres_list, dtype = float)
pop_by_genre = pd.Series(index = genres_list, dtype = float)
rev_by_genre = pd.Series(index = genres_list, dtype = float)
bud_by_genre = pd.Series(index = genres_list, dtype = float)

# Average each measure over the films that carry the genre
for gen in genres_list:
    in_genre = genres_df3[genres_df3[gen] == 1]
    vote_by_genre[gen] = in_genre['vote_average'].mean()
    pop_by_genre[gen] = in_genre['popularity'].mean()
    rev_by_genre[gen] = in_genre['revenue'].mean()
    bud_by_genre[gen] = in_genre['budget'].mean()
# Combine the four Series into one DataFrame
movie_by_genre = pd.DataFrame({ 'vote_average': vote_by_genre, 'popularity': pop_by_genre, 'revenue': rev_by_genre, 'budget': bud_by_genre})

movie_by_genre

                 vote_average  popularity       revenue        budget
Music                6.385000    1.101270  1.047237e+08  3.013560e+07
Action               6.099568    1.777369  1.802467e+08  6.309883e+07
Thriller             6.143093    1.413200  1.123881e+08  4.111380e+07
Horror               5.834840    0.974864  6.870872e+07  2.116128e+07
War                  6.691000    1.412921  1.238682e+08  4.698285e+07
Fantasy              6.135876    1.916209  2.413179e+08  7.649616e+07
Family               6.210137    1.638098  2.320525e+08  6.898803e+07
Science Fiction      6.107522    2.097370  1.865328e+08  6.238452e+07
Comedy               6.115305    1.200185  1.200283e+08  3.844007e+07
Mystery              6.283275    1.300413  1.071573e+08  3.850171e+07
Crime                6.366004    1.261440  9.649637e+07  3.647491e+07
Animation            6.454839    1.821527  2.740895e+08  8.015984e+07
Drama                6.519188    1.197703  9.213394e+07  3.224194e+07
Adventure            6.177400    2.027486  2.392040e+08  7.520216e+07
Western              6.541026    1.407328  1.119227e+08  6.475754e+07
History              6.659633    1.085134  9.390964e+07  4.473635e+07
Documentary          6.726667    0.449032  4.494842e+07  8.142680e+06
Romance              6.317470    1.163230  1.097685e+08  3.171751e+07
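
Per-genre averages like these can also be computed without indicator columns, by giving each film one row per genre and grouping. This is a sketch of that alternative on three invented films, not the code that produced the table:

```python
import pandas as pd

# Three invented films; only the column names follow the dataset
films = pd.DataFrame({
    'genres': ['Action|Thriller', 'Drama', 'Action|Drama'],
    'revenue': [300.0, 100.0, 200.0],
    'vote_average': [6.0, 7.0, 8.0],
})

# Split the genre string into a list, then give each (film, genre) pair its own row
long = films.assign(genre=films['genres'].str.split('|')).explode('genre')

# Average the measures over the films in each genre
by_genre = long.groupby('genre')[['revenue', 'vote_average']].mean()
print(by_genre)
```

A film with several genres contributes to the average of each of them, which matches how the indicator-column approach counts films.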

Average revenue by genre

movie_by_genre.sort_values(by = ['revenue'])['revenue'].plot.barh(figsize = (10, 5))
plt.title('Average revenue by Genres')
plt.xlabel('Revenue')
plt.ylabel('Genres')

[Figure: Average revenue by Genres]
(3) Revenue over time

plt.figure(figsize = (10, 5))
# Select the revenue column before averaging so that text columns are not averaged
revenue_by_year = df2.groupby('release_year')['revenue'].mean()
plt.plot(revenue_by_year)

plt.title('Average revenue by Release year')
plt.xlabel('Release year')
plt.ylabel('Average revenue')

[Figure: Average revenue by Release year]

(4) Revenue trends by genre

# Average revenue per year for each genre
revnue_genre_year = pd.DataFrame(index = gen_proportion_year.index)

for gen in genres_set:
    # Level 1 of the group index is the genre indicator: 1 selects the films in the genre
    # (0 would select every film outside it)
    revnue_genre_year[gen] = genres_df3.groupby(['release_year', gen]).revenue.mean().xs(1, level = 1)

revnue_genre_year.head()
# Fill NaN (years with no film in a genre) with 0
revnue_genre_year = revnue_genre_year.fillna(0)


plt.figure(figsize = (12, 6))
plt.plot(revnue_genre_year)
plt.title('Average revenue by genres and year')
plt.xticks(range(1960, 2020, 5))
plt.xlabel('Year')
plt.ylabel('Average revenue')
plt.legend(revnue_genre_year.columns)  # one legend entry per genre

[Figure: Average revenue by genres and year]
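
When revenue is grouped by year and by a 0/1 genre indicator, the second index level decides which films each average covers. A small sketch on invented rows shows the difference between selecting 1 (films in the genre) and 0 (all other films):

```python
import pandas as pd

# Invented rows: a year, a 0/1 Drama indicator, and a revenue figure
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2000, 2001, 2001],
    'Drama':        [1, 0, 0, 1, 1],
    'revenue':      [10.0, 50.0, 70.0, 20.0, 40.0],
})

# Mean revenue for every (year, indicator) pair
means = toy.groupby(['release_year', 'Drama'])['revenue'].mean()

drama_only = means.xs(1, level=1)   # films tagged Drama
others = means.xs(0, level=1)       # films not tagged Drama
print(drama_only)
print(others)
```

For a per-genre trend line the value 1 is the one to select. With 0, every column would track the average of everything except that genre, and the columns would look almost identical.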

3) Release timing

(1) Number of releases and average revenue by month

genres_df3['release_date'] = pd.to_datetime(genres_df3['release_date'])


genres_df3['month'] = genres_df3['release_date'].dt.month
genres_df3['day'] = genres_df3['release_date'].dt.weekday  # 0-6, Monday to Sunday
genres_df3.head()
MusicActionThrillerHorrorWarFantasyFamilyScience FictionComedyMystery...Romancepopularitybudgetrevenuerelease_datevote_countvote_averagerelease_yearmonthday
00110000100...032.98576315000000015135288102015-06-0955626.5201561
10110000100...028.4199361500000003784363542015-05-1361857.1201552
20010000100...013.1125071100000002952382012015-03-1824806.3201532
30100010100...011.17310420000000020681782252015-12-1552927.52015121
40110000000...09.33501419000000015062493602015-04-0129477.3201542

5 rows × 27 columns
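
The release_date strings use two-digit years, which are ambiguous. Under the %y convention, 69-99 map to 1969-1999 and 00-68 map to 2000-2068, so a film from 1962 would be parsed as 2062. The month is unaffected but the weekday is not. A hedged sketch of one way to correct this using the separate release_year column, on three invented rows:

```python
import pandas as pd

# Invented rows in the dataset's date format, with the four-digit year alongside
toy = pd.DataFrame({
    'release_date': ['6/9/15', '10/5/62', '12/25/99'],
    'release_year': [2015, 1962, 1999],
})

# Two-digit years: '62' is read as 2062 under the %y convention
parsed = pd.to_datetime(toy['release_date'], format='%m/%d/%y')

# Rebuild each date from the trusted four-digit year plus the parsed month and day
fixed = pd.to_datetime(pd.DataFrame({
    'year': toy['release_year'],
    'month': parsed.dt.month,
    'day': parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

After the correction, weekday values for 1960s releases refer to the actual release date rather than to a date a century later.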

plt.figure(figsize = (18,6))

# Average revenue by release month
revenue_by_month = genres_df3.groupby('month')['revenue'].mean()
plt.subplot(1, 2, 1)
revenue_by_month.plot(kind = 'bar')
plt.title('Average revenue by Month')
plt.xlabel('Month')
plt.ylabel('Average revenue')

# Number of releases by month (all films in df, not just those in df2)
df['release_date'] = pd.to_datetime(df['release_date'])


df['month'] = df['release_date'].dt.month
num_by_month = df.groupby('month')['revenue'].count()
plt.subplot(1, 2, 2)
num_by_month.plot(kind = 'bar')
plt.title('Number of movies by Month')
plt.xlabel('Month')
plt.ylabel('Number of movies')

[Figure: Average revenue by Month; Number of movies by Month]
(2) Number of releases and average revenue by day of the week

plt.figure(figsize = (18,6))

# Average revenue by release weekday
revenue_by_day = genres_df3.groupby('day')['revenue'].mean()
plt.subplot(1, 2, 1)
revenue_by_day.plot(kind = 'bar')
plt.title('Average revenue by Day of week')
plt.xlabel('Day of week')
plt.ylabel('Average revenue')

# Number of releases by weekday
df['release_date'] = pd.to_datetime(df['release_date'])

df['day'] = df['release_date'].dt.weekday  # 0-6, Monday to Sunday
num_by_day = df.groupby('day')['revenue'].count()
plt.subplot(1, 2, 2)
num_by_day.plot(kind = 'bar')
plt.title('Number of movies by Day of week')
plt.xlabel('Day of week')
plt.ylabel('Number of movies')

[Figure: Average revenue by Day of week; Number of movies by Day of week]
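4) Audience ratings: which features is a film's rating related to?

(1) Rating by genre

movie_by_genre.sort_values(by = ['vote_average'])['vote_average'].plot.barh(figsize = (10, 5))
plt.title('Average vote by Genres')
plt.xlabel('Average vote')
plt.ylabel('Genres')

[Figure: Average vote by Genres]

(2) Rating and other variables

plt.subplots(figsize=(8,8))  # set the figure size
# numeric_only skips the text and date columns (requires pandas 1.5 or later)
sns.heatmap(df.corr(numeric_only = True), annot = True, vmax = 1, square = True, cmap = 'Reds' )

[Figure: correlation heatmap for df]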
(3) Number of votes over time (2000-2015)

# Total number of votes by release year
plt.figure(figsize = (10,6))
vote_count_by_year = genres_df3.groupby('release_year')['vote_count'].sum().loc[2000:]
vote_count_by_year.plot(kind = 'bar')
plt.title('Vote count by Year (2000-2015)')
plt.xlabel('Year')
plt.ylabel('Vote count')

[Figure: Vote count by Year (2000-2015)]
(4) Ratings by decade

# Box plot of ratings per decade (1960s = 1960-1969, and so on)
decade = genres_df3['release_year'] // 10 * 10
decades = sorted(decade.unique())
ratings_by_decade = [genres_df3.loc[decade == d, 'vote_average'] for d in decades]

plt.figure(figsize = (12, 6))
plt.boxplot(ratings_by_decade)
plt.xticks(range(1, len(decades) + 1), [f'{d}s' for d in decades])
plt.title('Average Vote by Decade')
plt.xlabel('Decade')
plt.ylabel('Average Vote')

[Figure: Average Vote by Decade]
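5) Popularity analysis: which features is a film's popularity related to?

(1) Popularity by genre

movie_by_genre.sort_values(by = ['popularity'])['popularity'].plot.barh(figsize = (10, 5))
plt.title('Popularity by Genres')
plt.xlabel('Popularity')
plt.ylabel('Genres')
plt.show()

[Figure: Popularity by Genres]
(2) Popularity and other variables

plt.subplots(figsize=(8,8))  # set the figure size
# numeric_only skips the text columns (requires pandas 1.5 or later)
sns.heatmap(df2.corr(numeric_only = True), annot = True, vmax = 1, square = True, cmap = 'Blues' )

[Figure: correlation heatmap for df2]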
Scatter plots with fitted regression lines for popularity against number of votes, revenue, and budget

# Popularity together with number of votes, revenue, and budget
popularity_df = df2[['vote_count', 'budget', 'popularity', 'revenue']]
plt.figure(figsize = (18,6))

# Popularity vs. number of votes: scatter plot and linear regression line
ax1 = plt.subplot(1, 3, 1)
ax1 = sns.regplot(x = 'vote_count', y = 'popularity', data = popularity_df, color = 'y')
ax1.text(0, 25, 'r = 0.76')
plt.title('Popularity and Vote_count')
plt.xlabel('Vote_count')
plt.ylabel('Popularity')

# Popularity vs. revenue: scatter plot and linear regression line
ax2 = plt.subplot(1, 3, 2)
ax2 = sns.regplot(x ='revenue' , y = 'popularity', data = popularity_df, color = 'r')
ax2.text(0, 25, 'r = 0.59')
plt.title('Popularity and Revenue')
plt.xlabel('Revenue')
plt.ylabel('Popularity')

# Popularity vs. budget: scatter plot and linear regression line
ax3 = plt.subplot(1, 3, 3)
ax3 = sns.regplot(x = 'budget' , y ='popularity', data = popularity_df, color = 'b')
ax3.text(0, 25, 'r = 0.52')
plt.title('Popularity and Budget')
plt.xlabel('Budget')
plt.ylabel('Popularity')

[Figure: Popularity and Vote_count; Popularity and Revenue; Popularity and Budget]
6) Profitability analysis: which factors is a film's profit related to?

(1) Profit by genre

# Average profit per genre = average revenue minus average budget
movie_by_genre['profit'] = movie_by_genre['revenue'] - movie_by_genre['budget']
movie_by_genre
                 vote_average  popularity       revenue        budget        profit
Music                6.385000    1.101270  1.047237e+08  3.013560e+07  7.458808e+07
Action               6.099568    1.777369  1.802467e+08  6.309883e+07  1.171479e+08
Thriller             6.143093    1.413200  1.123881e+08  4.111380e+07  7.127430e+07
Horror               5.834840    0.974864  6.870872e+07  2.116128e+07  4.754744e+07
War                  6.691000    1.412921  1.238682e+08  4.698285e+07  7.688537e+07
Fantasy              6.135876    1.916209  2.413179e+08  7.649616e+07  1.648217e+08
Family               6.210137    1.638098  2.320525e+08  6.898803e+07  1.630644e+08
Science Fiction      6.107522    2.097370  1.865328e+08  6.238452e+07  1.241483e+08
Comedy               6.115305    1.200185  1.200283e+08  3.844007e+07  8.158827e+07
Mystery              6.283275    1.300413  1.071573e+08  3.850171e+07  6.865555e+07
Crime                6.366004    1.261440  9.649637e+07  3.647491e+07  6.002145e+07
Animation            6.454839    1.821527  2.740895e+08  8.015984e+07  1.939297e+08
Drama                6.519188    1.197703  9.213394e+07  3.224194e+07  5.989200e+07
Adventure            6.177400    2.027486  2.392040e+08  7.520216e+07  1.640018e+08
Western              6.541026    1.407328  1.119227e+08  6.475754e+07  4.716515e+07
History              6.659633    1.085134  9.390964e+07  4.473635e+07  4.917329e+07
Documentary          6.726667    0.449032  4.494842e+07  8.142680e+06  3.680574e+07
Romance              6.317470    1.163230  1.097685e+08  3.171751e+07  7.805097e+07
movie_by_genre.profit.sort_values().plot.barh(figsize = (10, 5))

plt.title('Profit by Genres')
plt.xlabel('Profit')
plt.ylabel('Genres')

[Figure: Profit by Genres]

(2) Return on investment (ROI) by genre

# ROI = average profit divided by average budget
movie_by_genre['ROI'] = movie_by_genre['profit']/movie_by_genre['budget']
movie_by_genre
                 vote_average  popularity       revenue        budget        profit       ROI
Music                6.385000    1.101270  1.047237e+08  3.013560e+07  7.458808e+07  2.475082
Action               6.099568    1.777369  1.802467e+08  6.309883e+07  1.171479e+08  1.856577
Thriller             6.143093    1.413200  1.123881e+08  4.111380e+07  7.127430e+07  1.733586
Horror               5.834840    0.974864  6.870872e+07  2.116128e+07  4.754744e+07  2.246908
War                  6.691000    1.412921  1.238682e+08  4.698285e+07  7.688537e+07  1.636456
Fantasy              6.135876    1.916209  2.413179e+08  7.649616e+07  1.648217e+08  2.154640
Family               6.210137    1.638098  2.320525e+08  6.898803e+07  1.630644e+08  2.363663
Science Fiction      6.107522    2.097370  1.865328e+08  6.238452e+07  1.241483e+08  1.990050
Comedy               6.115305    1.200185  1.200283e+08  3.844007e+07  8.158827e+07  2.122480
Mystery              6.283275    1.300413  1.071573e+08  3.850171e+07  6.865555e+07  1.783182
Crime                6.366004    1.261440  9.649637e+07  3.647491e+07  6.002145e+07  1.645554
Animation            6.454839    1.821527  2.740895e+08  8.015984e+07  1.939297e+08  2.419287
Drama                6.519188    1.197703  9.213394e+07  3.224194e+07  5.989200e+07  1.857581
Adventure            6.177400    2.027486  2.392040e+08  7.520216e+07  1.640018e+08  2.180812
Western              6.541026    1.407328  1.119227e+08  6.475754e+07  4.716515e+07  0.728335
History              6.659633    1.085134  9.390964e+07  4.473635e+07  4.917329e+07  1.099180
Documentary          6.726667    0.449032  4.494842e+07  8.142680e+06  3.680574e+07  4.520101
Romance              6.317470    1.163230  1.097685e+08  3.171751e+07  7.805097e+07  2.460817
# ROI by genre
movie_by_genre['ROI'].sort_values().plot.barh(figsize = (10, 5))

plt.title('ROI by Genres')
plt.xlabel('ROI')
plt.ylabel('Genres')

[Figure: ROI by Genres]
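
The ROI figures here are ratios of genre averages (average profit divided by average budget). That is not the same as the average of each film's own ROI, and the two can differ a great deal when a genre mixes very small and very large budgets. A minimal sketch with two invented films:

```python
# Two invented films in one genre (amounts in millions)
budgets = [1.0, 100.0]
revenues = [10.0, 150.0]

# Ratio of averages, the definition used for the table
mean_budget = sum(budgets) / len(budgets)
mean_revenue = sum(revenues) / len(revenues)
roi_of_means = (mean_revenue - mean_budget) / mean_budget

# Average of per-film ROI, an alternative definition
per_film_roi = [(r - b) / b for b, r in zip(budgets, revenues)]
mean_of_rois = sum(per_film_roi) / len(per_film_roi)

print(round(roi_of_means, 3), mean_of_rois)
```

The ratio of averages is dominated by the big-budget film, while the per-film average is dominated by the small film's ninefold return, so which definition is used should be stated alongside any ROI ranking.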
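(3) Profit and other variables

df2['profit'] = df2['revenue'] - df2['budget']
plt.subplots(figsize=(8,8))  # set the figure size
# numeric_only skips the text columns (requires pandas 1.5 or later)
sns.heatmap(df2.corr(numeric_only = True), annot = True, vmax = 1, square = True, cmap = 'Reds' )

[Figure: correlation heatmap for df2 including profit]
Scatter plots with fitted regression lines for profit against number of votes and popularity

# Profit together with number of votes and popularity
profit = df2[['vote_count', 'popularity', 'profit']]
plt.figure(figsize = (18,6))

# Profit vs. number of votes: scatter plot and linear regression line
ax1 = plt.subplot(1, 2, 1)
ax1 = sns.regplot(x = 'vote_count', y = 'profit', data = profit, color = 'y')
ax1.text(0.05, 0.9, 'r = 0.71', transform = ax1.transAxes)  # position given in axes coordinates
plt.title('Profit and Vote_count')
plt.xlabel('Vote_count')
plt.ylabel('Profit')

# Profit vs. popularity: scatter plot and linear regression line
ax2 = plt.subplot(1, 2, 2)
ax2 = sns.regplot(x ='popularity' , y = 'profit', data = profit, color = 'r')
ax2.text(0.05, 0.9, 'r = 0.57', transform = ax2.transAxes)
plt.title('Profit and Popularity')
plt.xlabel('Popularity')
plt.ylabel('Profit')

[Figure: Profit and Vote_count; Profit and Popularity]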
