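The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers movie data visualization and a simple movie recommendation model, including:

1. The distribution of movie genres and how it changes over time
2. The relationship between profit, rating, and popularity
3. Which directors' films sell well or are rated well
4. The most hardworking cast and crew
5. Movie keyword analysis
6. Similarity-based movie recommendation

## Data analysis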
```python
import json
import warnings

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
warnings.filterwarnings('ignore')  # suppress warnings
```
```python
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
```

```python
movie.head(1)
```
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
```python
movie.tail(3)
```
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4800 | 0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam… | http://www.hallmarkchannel.com/signedsealeddel… | 231617 | [{"id": 248, "name": "date"}, {"id": 699, "nam… | en | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic… | 1.444476 | [{"name": "Front Street Pictures", "id": 3958}… | [{"iso_3166_1": "US", "name": "United States o… | 2013-10-13 | 0 | 120.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Signed, Sealed, Delivered | 7.0 | 6 |
| 4801 | 0 | [] | http://shanghaicalling.com/ | 126186 | [] | en | Shanghai Calling | When ambitious New York attorney Sam is sent t… | 0.857008 | [] | [{"iso_3166_1": "US", "name": "United States o… | 2012-05-03 | 0 | 98.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Yorker in Shanghai | Shanghai Calling | 5.7 | 7 |
| 4802 | 0 | [{"id": 99, "name": "Documentary"}] | NaN | 25975 | [{"id": 1523, "name": "obsession"}, {"id": 224… | en | My Date with Drew | Ever since the second grade when he first saw … | 1.929883 | [{"name": "rusty bear entertainment", "id": 87… | [{"iso_3166_1": "US", "name": "United States o… | 2005-08-05 | 0 | 90.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | My Date with Drew | 6.3 | 16 |
```python
movie.info()  # 4,803 samples; some features have missing values
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
```

The dataset has 4,803 samples, and some features have missing values. `homepage` and `tagline` are missing the most, but neither affects the basic analysis; `release_date` and `runtime` can be filled in. On closer inspection, some samples have `[]` as the value of `genres`, `keywords`, or `production_companies`, which needs attention.

```python
credit.info()
```
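The text above says `runtime` and `release_date` can be filled but does not say how. The following is a minimal sketch on a made-up toy frame, not the real `movie` table: it assumes the median is an acceptable fill value for `runtime`, and it only parses `release_date` into datetimes, leaving the missing entry as `NaT`. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real `movie` table (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [100.0, None, 120.0, 90.0],
    "release_date": ["2009-12-10", None, "2012-05-03", "2005-08-05"],
})

# Count missing values per column before any filling.
missing_before = toy.isnull().sum()

# Assumption: the column median is a reasonable stand-in for a missing runtime.
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())

# Parse the date strings; the missing entry becomes NaT instead of raising.
toy["release_date"] = pd.to_datetime(toy["release_date"])

print(missing_before)
print(toy)
```

With only one or two gaps in the real data, looking up the true values by hand is also an option; the median is just one convenient default.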
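## Data cleaning
Many features in the data are stored as JSON, that is, as dictionary-like key-value pairs. To simplify later processing, we convert them to `str` or `list` form, which is easier to work with in Python and makes it easier to extract the useful information.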
```python
# movie genres: parse the JSON so that films can be grouped by genre
movie['genres'] = movie['genres'].apply(json.loads)
# Series.apply calls the given function on every element of the column
```
```python
movie['genres'].head(1)
```

```
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
```

```python
list(zip(movie.index, movie['genres']))[:2]
```

```
[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]
```

```python
for index, i in zip(movie.index, movie['genres']):
    list1 = []
    for j in range(len(i)):
        list1.append(i[j]['name'])  # value of the 'name' key: Action, ...
    # store the genre names back as the string form of the list
    movie.loc[index, 'genres'] = str(list1)
```
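The row-by-row loop above can also be written as a small helper. The sketch below uses only the standard library and sample strings shaped like the raw `genres` cells; the function name `names_from_json` is hypothetical and not part of the original notebook.

```python
import json

def names_from_json(raw):
    """Return the 'name' field of every object in a JSON array string."""
    return [item["name"] for item in json.loads(raw)]

# Sample cell values in the same shape as the raw `genres` column.
raw_genres = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]',  # some rows carry an empty list
]

parsed = [names_from_json(cell) for cell in raw_genres]
print(parsed)
```

Applied to the raw column before any conversion, something like `movie['genres'].apply(names_from_json)` would do the parsing and extraction in one pass. Note that it would keep real Python lists, whereas the loop above stores `str(list1)`.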
```python
movie.head(1)
# genres is no longer JSON: the values of the 'name' key (the genre names)
# have been extracted and assigned back to genres
```
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
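Because `genres` now holds the string form of a list rather than a list, later steps such as counting genres need to parse it back. The following is a minimal sketch with `ast.literal_eval` and `collections.Counter` on a few sample cell values; `genre_cells` is a hypothetical stand-in for the column, not the real data.

```python
import ast
from collections import Counter

# Sample cells in the stringified form produced by str(list1).
genre_cells = [
    "['Action', 'Adventure', 'Fantasy', 'Science Fiction']",
    "['Adventure', 'Fantasy', 'Action']",
    "[]",  # rows without any genre stay as an empty list
]

# literal_eval parses a Python literal safely, without executing code.
genre_lists = [ast.literal_eval(cell) for cell in genre_cells]

# Flatten the lists and count how often each genre appears.
genre_counts = Counter(g for genres in genre_lists for g in genres)
print(genre_counts)
```

`ast.literal_eval` is used here instead of `eval` because it only accepts literals, which is the safer choice for values read from a file.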
```python
# Apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index, i in zip(movie.index, movie['keywords']):
    list2 = []
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index, 'keywords'] = str(list2)
```

```python
# Likewise for production_companies
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index, i in zip(movie.index, movie['production_companies']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_companies'] = str(list3)
```

```python
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index, i in zip(movie.index, movie['production_countries']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_countries'] = str(list3)
```

```python
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index, i in zip(movie.index, movie['spoken_languages']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'spoken_languages'] = str(list3)
```

```python
movie.head(1)
```
| | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | ['United States of America', 'United Kingdom'] | 2009-12-10 | 2787965087 | 162.0 | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
```python
credit.head(1)
```
| | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "… | [{"credit_id": "52fe48009251416c750aca23", "de… |
```python
credit['cast'] = credit['cast'].apply(json.loads)
for index, i in zip(credit.index, credit['cast']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index, 'cast'] = str(list3)
```
```code
credit['crew'] = credit['crew'].apply(json.loads)
# Extract the director from crew and add a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name&