My Completed Report on Exploring the Movie Dataset

Exploring the Movie Dataset

by Kimi

In this project, you will apply what you have learned and use functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.

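Download the dataset:
TMDb movie data

Meaning of each column in the dataset:

Column name           Meaning
id                    ID
imdb_id               IMDB ID
popularity            Popularity
budget                Budget
revenue               Box-office revenue
original_title        Title
cast                  Lead cast
homepage              Website
director              Director
tagline               Tagline
keywords              Keywords
overview              Synopsis
runtime               Runtime
genres                Genres
production_companies  Production companies
release_date          Release date
vote_count            Vote count
vote_average          Average vote
release_year          Release year
budget_adj            Budget (adjusted)
revenue_adj           Box-office revenue (adjusted)

Note that you need to submit the .html, .ipynb, and .py files exported from this report.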



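Section 1: Importing and Processing the Data

In this part, you need to write code that reads the data with Pandas and preprocesses it.

Task 1.1: Import the libraries and the data

  1. Load the required libraries: NumPy, Pandas, matplotlib, and seaborn.
  2. Use Pandas to read the data in tmdb-movies.csv and store it as movie_data.

Hint: remember to use the notebook magic command %matplotlib inline, otherwise the figures you plot later will not be displayed.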

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

movie_data = pd.read_csv('tmdb-movies.csv', index_col=['id'])
movie_data.head()
imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
id
135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You based on novel|revolution|dystopia|sequel|dyst... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. android|spaceship|jedi|space opera|3d Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home car race|speed|revenge|suspense|car Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
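
As a small illustration of what the index_col=['id'] argument in the read_csv call above does, the sketch below reads a made-up two-row CSV from memory. The CSV text and the name toy are assumptions for illustration only and are not part of tmdb-movies.csv.

```python
import io
import pandas as pd

# Hypothetical two-row CSV standing in for tmdb-movies.csv (values are made up)
csv_text = "id,original_title,runtime\n10,Film A,100\n20,Film B,95\n"

# index_col=['id'] promotes the id column to the row index,
# so it no longer appears among the regular columns
toy = pd.read_csv(io.StringIO(csv_text), index_col=['id'])

print(toy.index.name)          # id
print(list(toy.columns))       # ['original_title', 'runtime']
print(toy.loc[20, 'runtime'])  # 95
```

This is why the output of movie_data.head() shows id on its own header row and why movie_data.shape reports 20 columns rather than 21.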

Examine how the number of releases in each genre changes from year to year.

# Split the pipe-delimited genres so that each row holds a single genre
df_genres = movie_data.drop('genres', axis=1).join(
    movie_data['genres'].str.split('|', expand=True).stack()
    .reset_index(level=1, drop=True).rename('genres'))
# Plotting area
fig, axes = plt.subplots(2, 1, figsize=(20, 10))
# For readability, show only some of the genres
top5_genres = df_genres['genres'].value_counts().nlargest(5).index
btm5_genres = df_genres['genres'].value_counts().nsmallest(5).index

# Reference line: mean number of releases per genre in each year
# (averaged over the genres that have at least one release that year)
mean_cir = df_genres.groupby('release_year')['genres'].value_counts().unstack().mean(axis=1)
mean_cir.plot(ax=axes[0], ls='--', label='mean', legend=True)
mean_cir.plot(ax=axes[1], ls='--', label='mean', legend=True)

# Plot by year
vis_params = {'grid': True, 'marker': 'o', 'markersize': 2, 'linewidth': 1}
df_genres[df_genres['genres'].isin(top5_genres)] \
    .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
    .plot(ax=axes[0], title='circulation over years of top 5 genres', **vis_params)

df_genres[df_genres['genres'].isin(btm5_genres)] \
    .groupby('release_year')['genres'].value_counts().unstack().fillna(0) \
    .plot(ax=axes[1], title='circulation over years of bottom 5 genres', **vis_params)

plt.tight_layout()

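
The split, stack, and join chain used for the genre plot is easier to follow on a tiny table. The following is a minimal sketch on made-up data: the frame toy and its values are hypothetical, and the explicit dropna() is an addition here so the result does not depend on whether stack() discards missing values.

```python
import pandas as pd

# Hypothetical three-movie table; ids, years and genres are made up
toy = pd.DataFrame(
    {'release_year': [2014, 2015, 2015],
     'genres': ['Action|Drama', 'Action', 'Drama|Comedy']},
    index=pd.Index([1, 2, 3], name='id'))

# One row per (movie, genre): split on '|', stack into a long Series,
# then drop the inner position level so the index is the movie id again
genres_long = (toy['genres'].str.split('|', expand=True)
               .stack().dropna()
               .reset_index(level=1, drop=True)
               .rename('genres'))
toy_genres = toy.drop('genres', axis=1).join(genres_long)

# Releases per year and genre; combinations that never occur become 0
counts = (toy_genres.groupby('release_year')['genres']
          .value_counts().unstack().fillna(0))
print(counts)
```

In this toy table, Comedy has no 2014 release, so its 2014 cell is filled with 0, which is what fillna(0) does for the real plot as well.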

Weighted by total revenue, which keywords describing movies are the most prominent?

# Build a word cloud of movie keywords
from wordcloud import WordCloud
%config InlineBackend.figure_format = 'retina'
# The keywords are separated by "|", so split them apart first
kw_expand = movie_data['keywords'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('keywords')
# Join the revenue data
df_kw_rev = movie_data[['revenue_adj']].join(kw_expand)
# Sum the adjusted revenue per keyword and convert to a dict
word_dict = df_kw_rev.groupby('keywords')['revenue_adj'].sum().to_dict()
# Create the word cloud
params = {
    'mode': 'RGBA',
    'background_color': 'rgba(255, 255, 255, 0)',
    'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)

# Draw the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')

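
The dictionary passed to generate_from_frequencies holds summed revenue rather than raw keyword counts. A minimal sketch of that weighting on made-up data follows; the frame toy, its keywords, and its revenue figures are hypothetical, and the word cloud itself is left out so the example only needs pandas.

```python
import pandas as pd

# Hypothetical data; keywords and revenue figures are made up
toy = pd.DataFrame(
    {'keywords': ['space|robot', 'robot', None],
     'revenue_adj': [100.0, 50.0, 30.0]},
    index=pd.Index([1, 2, 3], name='id'))

# Long format: one row per (movie, keyword)
kw_long = (toy['keywords'].str.split('|', expand=True)
           .stack().dropna()
           .reset_index(level=1, drop=True)
           .rename('keywords'))

# Each movie's full revenue is credited to every one of its keywords;
# movies without keywords drop out of the groupby
weights = (toy[['revenue_adj']].join(kw_long)
           .groupby('keywords')['revenue_adj'].sum().to_dict())
print(weights)  # {'robot': 150.0, 'space': 100.0}
```

Because a movie's revenue is counted once for each of its keywords, the weights do not add up to the total revenue of the dataset. The word size therefore reflects the revenue associated with a keyword, not how often it appears.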

Weighted by total revenue, which genres are the most prominent?

# Build a word cloud of movie genres
from wordcloud import WordCloud
%config InlineBackend.figure_format = 'retina'
# The genres are separated by "|", so split them apart first
genre_expand = movie_data['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genres')
# Join the revenue data
df_genre_rev = movie_data[['revenue_adj']].join(genre_expand)
# Sum the adjusted revenue per genre and convert to a dict
word_dict = df_genre_rev.groupby('genres')['revenue_adj'].sum().to_dict()
# Create the word cloud
params = {
    'mode': 'RGBA',
    'background_color': 'rgba(255, 255, 255, 0)',
    'colormap': 'Spectral'}
wordcloud = WordCloud(width=1200, height=800, **params)
wordcloud.generate_from_frequencies(word_dict)

# Draw the word cloud
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')



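**Task 1.2:** Understand the data

You will come across all kinds of data tables, so after reading one in, it is worth using a few simple methods to see what the table looks like.

  1. Get the number of rows and columns of the table and print it.
  2. Use the .head(), .tail(), and .sample() methods to inspect the table.
  3. Use the .dtypes attribute to check the data type of each column.
  4. Use isnull() together with methods such as .any() to check whether any column contains missing values.
  5. Use the .describe() method to see how the numeric columns are distributed.

# Get the number of rows and columns
movie_data.shape
(10866, 20)
# Use head()
movie_data.head(10)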
imdb_id popularity budget revenue original_title cast homepage director tagline keywords overview runtime genres production_companies release_date vote_count vote_average release_year budget_adj revenue_adj
id
135397 tt0369610 32.985763 150000000 1513528810 Jurassic World Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... http://www.jurassicworld.com/ Colin Trevorrow The park is open. monster|dna|tyrannosaurus rex|velociraptor|island Twenty-two years after the events of Jurassic ... 124 Action|Adventure|Science Fiction|Thriller Universal Studios|Amblin Entertainment|Legenda... 6/9/15 5562 6.5 2015 1.379999e+08 1.392446e+09
76341 tt1392190 28.419936 150000000 378436354 Mad Max: Fury Road Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... http://www.madmaxmovie.com/ George Miller What a Lovely Day. future|chase|post-apocalyptic|dystopia|australia An apocalyptic story set in the furthest reach... 120 Action|Adventure|Science Fiction|Thriller Village Roadshow Pictures|Kennedy Miller Produ... 5/13/15 6185 7.1 2015 1.379999e+08 3.481613e+08
262500 tt2908446 13.112507 110000000 295238201 Insurgent Shailene Woodley|Theo James|Kate Winslet|Ansel... http://www.thedivergentseries.movie/#insurgent Robert Schwentke One Choice Can Destroy You based on novel|revolution|dystopia|sequel|dyst... Beatrice Prior must confront her inner demons ... 119 Adventure|Science Fiction|Thriller Summit Entertainment|Mandeville Films|Red Wago... 3/18/15 2480 6.3 2015 1.012000e+08 2.716190e+08
140607 tt2488496 11.173104 200000000 2068178225 Star Wars: The Force Awakens Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... http://www.starwars.com/films/star-wars-episod... J.J. Abrams Every generation has a story. android|spaceship|jedi|space opera|3d Thirty years after defeating the Galactic Empi... 136 Action|Adventure|Science Fiction|Fantasy Lucasfilm|Truenorth Productions|Bad Robot 12/15/15 5292 7.5 2015 1.839999e+08 1.902723e+09
168259 tt2820852 9.335014 190000000 1506249360 Furious 7 Vin Diesel|Paul Walker|Jason Statham|Michelle ... http://www.furious7.com/ James Wan Vengeance Hits Home car race|speed|revenge|suspense|car Deckard Shaw seeks revenge against Dominic Tor... 137 Action|Crime|Thriller Universal Pictures|Original Film|Media Rights ... 4/1/15 2947 7.3 2015 1.747999e+08 1.385749e+09
281957 tt1663202 9.110700 135000000 532950503 The Revenant Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... http://www.foxmovies.com/movies/the-revenant Alejandro González Iñárritu (n. One who has returned, as if from the dead.) father-son relationship|rape|based on novel|mo... In the 1820s, a frontiersman, Hugh Glass, sets... 156 Western|Drama|Adventure|Thriller Regency Enterprises|Appian Way|CatchPlay|Anony... 12/25/15 3929 7.2 2015 1.241999e+08 4.903142e+08
87101 tt1340138 8.654359 155000000 440603537 Terminator Genisys Arnold Schwarzenegger|Jason Clarke|Emilia Clar... http://www.terminatormovie.com/ Alan Taylor Reset the future saving the world|artificial intelligence|cybor... The year is 2029. John Connor, leader of the r... 125 Science Fiction|Action|Thriller|Adventure Paramount Pictures|Skydance Productions 6/23/15 2598 5.8 2015 1.425999e+08 4.053551e+08
286217 tt3659388 7.667400 108000000 595380321 The Martian Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... http://www.foxmovies.com/movies/the-martian Ridley Scott Bring Him Home based on novel|mars|nasa|isolation|botanist During a manned mission to Mars, Astronaut Mar... 141 Drama|Adventure|Science Fiction Twentieth Century Fox Film Corporation|Scott F... 9/30/15 4572 7.6 2015 9.935996e+07 5.477497e+08
211672 tt2293640 7.404165 74000000 1156730962 Minions Sandra Bullock|Jon Hamm|Michael Keaton|Allison... http://www.minionsmovie.com/ Kyle Balda|Pierre Coffin Before Gru, they had a history of bad bosses assistant|aftercreditsstinger|duringcreditssti... Minions Stuart, Kevin and Bob are recruited by... 91 Family|Animation|Adventure|Comedy Universal Pictures|Illumination En
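
The other inspection steps listed for Task 1.2 (.tail(), .sample(), .dtypes, isnull().any(), and .describe()) can be sketched on a small stand-in table. The frame toy and all of its values are made up for illustration; on the real data the same calls would be made on movie_data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_data; all values are made up
toy = pd.DataFrame(
    {'original_title': ['Film A', 'Film B', 'Film C', 'Film D'],
     'homepage': ['http://example.com/a', np.nan, np.nan, 'http://example.com/d'],
     'runtime': [100, 95, 120, 88],
     'vote_average': [6.5, 7.1, 5.8, 7.6]},
    index=pd.Index([1, 2, 3, 4], name='id'))

print(toy.shape)                      # (4, 4)
print(toy.tail(2))                    # last two rows
print(toy.sample(2, random_state=0))  # two random rows, reproducible via the seed
print(toy.dtypes)                     # data type of each column
print(toy.isnull().any())             # True for columns with missing values
print(toy.describe())                 # summary statistics of the numeric columns
```

By default describe() summarises only the numeric columns, so in this sketch it reports runtime and vote_average and skips the two text columns.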