Exploring the Movie Dataset
by Kimi
In this project, you will apply what you have learned and use functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.
Download the dataset:
TMDb movie data
Meaning of each column in the dataset:
Column name | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Meaning | ID | IMDB ID | Popularity | Budget | Revenue | Title | Cast | Homepage | Director | Tagline | Keywords | Overview | Runtime | Genres | Production companies | Release date | Vote count | Vote average | Release year | Budget (adjusted) | Revenue (adjusted) |
Note: you must submit the .html, .ipynb, and .py files exported from this report.
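Section 1: Importing and Processing the Data
In this part, you will write code that reads the data with Pandas and preprocesses it.
Task 1.1: Import the libraries and the data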
- Load the required libraries: NumPy, Pandas, matplotlib, and seaborn.
- Use the Pandas library to read the data in tmdb-movies.csv and store it as movie_data.
Hint: remember to use the notebook magic command %matplotlib inline; otherwise the plots you create later will not be displayed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
movie_data = pd.read_csv('tmdb-movies.csv', index_col=['id'])
movie_data.head()
imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
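As a side note on the index_col argument used above, the following is a minimal sketch of its effect on the resulting table. The two-row CSV text and the names csv_text, with_index, and without_index are made up for illustration, and the sketch passes the string "id" where the cell above passes the list ['id'].

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for tmdb-movies.csv (values are made up).
csv_text = "id,original_title,runtime\n10,Film A,100\n20,Film B,90\n"

# With index_col, the id column becomes the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col="id")

# Without index_col, id stays an ordinary column and rows get a 0..n-1 index.
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.shape)      # id is the index, so it is not counted as a column
print(without_index.shape)   # id is counted as a column here
print(with_index.loc[20, "original_title"])  # rows can be looked up by id
```

Because the index is not counted in .shape, a table loaded this way reports one column fewer than the number of column names in the CSV file.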
Examine how the number of releases of different movie genres changes across years.
# Split the '|'-separated genres so that each row holds a single genre
df_genres = movie_data.drop('genres', axis=1).join(movie_data['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres'))
# Plotting area
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
# For readability, show only some of the genres
top5_genres = df_genres['genres'].value_counts().nlargest(5).index
btm5_genres = df_genres['genres'].value_counts().nsmallest(5).index
# Draw a reference line: the mean number of releases per genre in each year
mean_cir = df_genres.groupby('release_year')['genres'].value_counts().unstack().mean(axis=1)
mean_cir.plot(ax=axes[0], ls='--', label='mean', legend=True)
mean_cir.plot(ax=axes[1], ls='--', label='mean', legend=True)
# Plot the yearly counts
vis_params = {'grid': True, 'marker': 'o', 'markersize': 2, 'linewidth': 1}
df_genres[df_genres['genres'].isin(top5_genres)] \
    .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
    .plot(ax=axes[0], title='circulation over years of top 5 genres', **vis_params)
df_genres[df_genres['genres'].isin(btm5_genres)] \
    .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
    .plot(ax=axes[1], title='circulation over years of bottom 5 genres', **vis_params)
plt.tight_layout()
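The genre-splitting step above can be hard to follow on the full dataset, so here is a minimal sketch of the same idea on a toy table. Note that it swaps the split/stack/join chain for DataFrame.explode, which should give one row per movie and genre in a similar way. The data and the names toy, toy_long, and counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_data (values are made up).
toy = pd.DataFrame(
    {
        "release_year": [2015, 2015, 2014],
        "genres": ["Action|Thriller", "Drama", "Action|Drama"],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# Turn each '|'-separated string into a list, then give every list
# element its own row; the other columns are repeated for each genre.
toy_long = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

# Count movies per (year, genre) and pivot genres into columns;
# combinations that never occur become 0.
counts = (
    toy_long.groupby("release_year")["genres"]
    .value_counts()
    .unstack()
    .fillna(0)
)

print(toy_long)
print(counts)
```

A movie tagged with several genres is counted once in each of them, so the per-genre counts add up to more than the number of movies.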
Measured by total revenue, which movie keywords stand out the most?
# Build a word cloud of the movie keywords
from wordcloud import WordCloud
%config InlineBackend.figure_format = 'retina'
# Split the '|'-separated keywords so that each row holds a single keyword
kw_expand = movie_data['keywords'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('keywords')
# Attach the adjusted revenue to each keyword
df_kw_rev = movie_data[['revenue_adj']].join(kw_expand)
# Sum the revenue per keyword and convert the result to a dictionary
word_dict = df_kw_rev.groupby('keywords')['revenue_adj'].sum().to_dict()
# Create the word cloud
params = {
    'mode': 'RGBA',
    'background_color': 'rgba(255, 255, 255, 0)',
    'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)
# Draw the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
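To make clear what the word sizes represent, here is a minimal sketch of how the weight dictionary is built, run on a toy table. It uses DataFrame.explode as a stand-in for the split/stack/join chain above, and the data and the names toy and kw_long are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_data (values are made up).
toy = pd.DataFrame(
    {
        "keywords": ["dna|island", "island", None],
        "revenue_adj": [100.0, 50.0, 30.0],
    },
    index=pd.Index([1, 2, 3], name="id"),
)

# One row per (movie, keyword); the movie without keywords keeps a missing value.
kw_long = toy.assign(keywords=toy["keywords"].str.split("|")).explode("keywords")

# Sum the revenue per keyword; groupby skips the missing keyword by default.
word_dict = kw_long.groupby("keywords")["revenue_adj"].sum().to_dict()

print(word_dict)
```

Each keyword is weighted by the summed revenue of the movies that carry it, not by how often it occurs, and a movie with several keywords contributes its full revenue to each of them.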
Measured by total revenue, which movie genres stand out the most?
# Build a word cloud of the movie genres
from wordcloud import WordCloud
%config InlineBackend.figure_format = 'retina'
# Split the '|'-separated genres so that each row holds a single genre
kw_expand = movie_data['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres')
# Attach the adjusted revenue to each genre
df_kw_rev = movie_data[['revenue_adj']].join(kw_expand)
# Sum the revenue per genre and convert the result to a dictionary
word_dict = df_kw_rev.groupby('genres')['revenue_adj'].sum().to_dict()
# Create the word cloud
params = {
    'mode': 'RGBA',
    'background_color': 'rgba(255, 255, 255, 0)',
    'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)
# Draw the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
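**Task 1.2:** Understand the data
You will work with many different kinds of data tables, so after reading one in, it is worth using a few simple methods to see what the table looks like.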
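- Get the number of rows and columns of the data table and print them.
- Use the .head(), .tail(), and .sample() methods to inspect the data table.
- Use the .dtypes attribute to check the data type of each column.
- Use isnull() together with .any() and similar methods to check whether each column contains null values.
- Use the .describe() method to see how the numeric data in the table is distributed.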
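The checks listed above can be sketched on a small toy table as follows. This is only an illustration under made-up data, not the project answer; the table toy and the variable names are assumptions, while the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for movie_data (values are made up).
toy = pd.DataFrame(
    {
        "original_title": ["Film A", "Film B", "Film C"],
        "runtime": [124, 120, 119],
        "homepage": ["http://example.com/", None, None],
    }
)

# Number of rows and columns.
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# First rows, last rows, and a random sample.
print(toy.head(2))
print(toy.tail(2))
print(toy.sample(2, random_state=0))

# Data type of each column.
print(toy.dtypes)

# Whether each column contains any null value, and how many.
has_null = toy.isnull().any()
null_counts = toy.isnull().sum()
print(has_null)
print(null_counts)

# Summary statistics; by default only numeric columns are included.
summary = toy.describe()
print(summary)
```

Using .isnull().sum() alongside .any() shows how many values are missing in each column, which helps when deciding whether to drop or fill them.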
# Get the number of rows and columns of the data table
movie_data.shape
(10866, 20)
# Use head()
movie_data.head(10)
imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | monster|dna|tyrannosaurus rex|velociraptor|island | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | future|chase|post-apocalyptic|dystopia|australia | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | based on novel|revolution|dystopia|sequel|dyst... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | android|spaceship|jedi|space opera|3d | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | car race|speed|revenge|suspense|car | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
281957 | tt1663202 | 9.110700 | 135000000 | 532950503 | The Revenant | Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... | http://www.foxmovies.com/movies/the-revenant | Alejandro González Iñárritu | (n. One who has returned, as if from the dead.) | father-son relationship|rape|based on novel|mo... | In the 1820s, a frontiersman, Hugh Glass, sets... | 156 | Western|Drama|Adventure|Thriller | Regency Enterprises|Appian Way|CatchPlay|Anony... | 12/25/15 | 3929 | 7.2 | 2015 | 1.241999e+08 | 4.903142e+08 |
87101 | tt1340138 | 8.654359 | 155000000 | 440603537 | Terminator Genisys | Arnold Schwarzenegger|Jason Clarke|Emilia Clar... | http://www.terminatormovie.com/ | Alan Taylor | Reset the future | saving the world|artificial intelligence|cybor... | The year is 2029. John Connor, leader of the r... | 125 | Science Fiction|Action|Thriller|Adventure | Paramount Pictures|Skydance Productions | 6/23/15 | 2598 | 5.8 | 2015 | 1.425999e+08 | 4.053551e+08 |
286217 | tt3659388 | 7.667400 | 108000000 | 595380321 | The Martian | Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... | http://www.foxmovies.com/movies/the-martian | Ridley Scott | Bring Him Home | based on novel|mars|nasa|isolation|botanist | During a manned mission to Mars, Astronaut Mar... | 141 | Drama|Adventure|Science Fiction | Twentieth Century Fox Film Corporation|Scott F... | 9/30/15 | 4572 | 7.6 | 2015 | 9.935996e+07 | 5.477497e+08 |
211672 | tt2293640 | 7.404165 | 74000000 | 1156730962 | Minions | Sandra Bullock|Jon Hamm|Michael Keaton|Allison... | http://www.minionsmovie.com/ | Kyle Balda|Pierre Coffin | Before Gru, they had a history of bad bosses | assistant|aftercreditsstinger|duringcreditssti... | Minions Stuart, Kevin and Bob are recruited by... | 91 | Family|Animation|Adventure|Comedy | Universal Pictures|Illumination En |