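The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers movie data visualization and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' films sell well or are well rated
4. The most prolific cast and crew members
5. Movie keyword analysis
6. Similarity-based movie recommendation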
Data analysis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
movie.head(1)
Row 0 (columns listed vertically):
budget: 237000000
genres: [{"id": 28, "name": "Action"}, {"id": 12, "nam…
homepage: http://www.avatarmovie.com/
id: 19995
keywords: [{"id": 1463, "name": "culture clash"}, {"id":…
original_language: en
original_title: Avatar
overview: In the 22nd century, a paraplegic Marine is di…
popularity: 150.437577
production_companies: [{"name": "Ingenious Film Partners", "id": 289…
production_countries: [{"iso_3166_1": "US", "name": "United States o…
release_date: 2009-12-10
revenue: 2787965087
runtime: 162.0
spoken_languages: [{"iso_639_1": "en", "name": "English"}, {"iso…
status: Released
tagline: Enter the World of Pandora.
title: Avatar
vote_average: 7.2
vote_count: 11800
movie.tail(3)
Row 4800 (columns listed vertically):
budget: 0
genres: [{"id": 35, "name": "Comedy"}, {"id": 18, "nam…
homepage: http://www.hallmarkchannel.com/signedsealeddel…
id: 231617
keywords: [{"id": 248, "name": "date"}, {"id": 699, "nam…
original_language: en
original_title: Signed, Sealed, Delivered
overview: "Signed, Sealed, Delivered" introduces a dedic…
popularity: 1.444476
production_companies: [{"name": "Front Street Pictures", "id": 3958}…
production_countries: [{"iso_3166_1": "US", "name": "United States o…
release_date: 2013-10-13
revenue: 0
runtime: 120.0
spoken_languages: [{"iso_639_1": "en", "name": "English"}]
status: Released
tagline: NaN
title: Signed, Sealed, Delivered
vote_average: 7.0
vote_count: 6
Row 4801:
budget: 0
genres: []
homepage: http://shanghaicalling.com/
id: 126186
keywords: []
original_language: en
original_title: Shanghai Calling
overview: When ambitious New York attorney Sam is sent t…
popularity: 0.857008
production_companies: []
production_countries: [{"iso_3166_1": "US", "name": "United States o…
release_date: 2012-05-03
revenue: 0
runtime: 98.0
spoken_languages: [{"iso_639_1": "en", "name": "English"}]
status: Released
tagline: A New Yorker in Shanghai
title: Shanghai Calling
vote_average: 5.7
vote_count: 7
Row 4802:
budget: 0
genres: [{"id": 99, "name": "Documentary"}]
homepage: NaN
id: 25975
keywords: [{"id": 1523, "name": "obsession"}, {"id": 224…
original_language: en
original_title: My Date with Drew
overview: Ever since the second grade when he first saw …
popularity: 1.929883
production_companies: [{"name": "rusty bear entertainment", "id": 87…
production_countries: [{"iso_3166_1": "US", "name": "United States o…
release_date: 2005-08-05
revenue: 0
runtime: 90.0
spoken_languages: [{"iso_639_1": "en", "name": "English"}]
status: Released
tagline: NaN
title: My Date with Drew
vote_average: 6.3
vote_count: 16
movie.info()  # 4,803 samples; some features have missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget 4803 non-null int64
genres 4803 non-null object
homepage 1712 non-null object
id 4803 non-null int64
keywords 4803 non-null object
original_language 4803 non-null object
original_title 4803 non-null object
overview 4800 non-null object
popularity 4803 non-null float64
production_companies 4803 non-null object
production_countries 4803 non-null object
release_date 4802 non-null object
revenue 4803 non-null int64
runtime 4801 non-null float64
spoken_languages 4803 non-null object
status 4803 non-null object
tagline 3959 non-null object
title 4803 non-null object
vote_average 4803 non-null float64
vote_count 4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB
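There are 4,803 samples, and some features have missing values. homepage (3,091 missing) and tagline (844 missing) have the most gaps, but neither affects the basic analysis. release_date (1 missing) and runtime (2 missing) can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies. These count as non-null in the summary above, so they need separate attention.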
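The two kinds of gaps described above (true NaN values and present-but-empty `[]` cells) can be counted separately. The following is a minimal sketch on a made-up four-row frame named `sample`, which stands in for `movie`. Filling runtime with the column median is an assumption chosen for illustration, not a choice stated in this post.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for `movie`; all values are invented for illustration.
sample = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [120.0, np.nan, 90.0, 100.0],
    "release_date": ["2009-12-10", None, "2012-05-03", "2005-08-05"],
    "genres": ['[{"id": 28, "name": "Action"}]', "[]", "[]",
               '[{"id": 35, "name": "Comedy"}]'],
})

# Count true missing values (NaN / None) per column.
missing = sample.isnull().sum()

# Count cells that are present but hold an empty JSON list.
empty_lists = (sample["genres"].str.strip() == "[]").sum()

# One possible fill for runtime: the column median (an assumption).
filled = sample.copy()
filled["runtime"] = filled["runtime"].fillna(filled["runtime"].median())

print(missing)
print("empty genres:", empty_lists)
```

Running the same two counts on the real `movie` frame would show how many rows are affected by each kind of gap before deciding how to handle them.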
credit.info()
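Data cleaning
Many features are stored as JSON strings, that is, lists of key-value objects similar to Python dicts. To simplify later processing, we convert them to str or list values that are easy to manipulate in Python and from which useful information can be extracted.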
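To make the conversion concrete, here is a small sketch of what happens to a single cell. The string `raw` is a shortened, hand-written example in the shape of a genres value; it is not copied from the dataset.

```python
import json

# One cell of a JSON-style column, as read from the CSV: a plain string.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads turns the string into a list of dicts.
parsed = json.loads(raw)

# Keep only the value stored under the "name" key of each dict.
names = [item["name"] for item in parsed]

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

An empty cell such as `"[]"` parses to an empty list, so the same steps work for rows without genres.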
# Movie genres, useful for grouping
movie['genres'] = movie['genres'].apply(json.loads)
# Series.apply calls the function on every element of the column,
# so each JSON string becomes a list of dicts.
movie['genres'].head(1)
0 [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
list(zip(movie.index, movie['genres']))[:2]
[(0,
[{'id': 28, 'name': 'Action'},
{'id': 12, 'name': 'Adventure'},
{'id': 14, 'name': 'Fantasy'},
{'id': 878, 'name': 'Science Fiction'}]),
(1,
[{'id': 12, 'name': 'Adventure'},
{'id': 14, 'name': 'Fantasy'},
{'id': 28, 'name': 'Action'}])]
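for index, i in zip(movie.index, movie['genres']):
    list1 = []
    for j in range(len(i)):
        list1.append(i[j]['name'])  # value under the 'name' key: Action, ...
    movie.loc[index, 'genres'] = str(list1)
movie.head(1)
# The genres column is no longer JSON: the value under each 'name' key (the genre)
# has been extracted and written back to genres as the string form of a list.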
Row 0 (columns listed vertically):
budget: 237000000
genres: ['Action', 'Adventure', 'Fantasy', 'Science Fi…
homepage: http://www.avatarmovie.com/
id: 19995
keywords: [{"id": 1463, "name": "culture clash"}, {"id":…
original_language: en
original_title: Avatar
overview: In the 22nd century, a paraplegic Marine is di…
popularity: 150.437577
production_companies: [{"name": "Ingenious Film Partners", "id": 289…
production_countries: [{"iso_3166_1": "US", "name": "United States o…
release_date: 2009-12-10
revenue: 2787965087
runtime: 162.0
spoken_languages: [{"iso_639_1": "en", "name": "English"}, {"iso…
status: Released
tagline: Enter the World of Pandora.
title: Avatar
vote_average: 7.2
vote_count: 11800
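Because the loop writes `str(list1)` back into the column, each genres cell is now a string that looks like a list rather than an actual list. The sketch below, using a hand-built `cell` value, shows one way to recover the list with `ast.literal_eval` from the standard library when list behaviour is needed later.

```python
import ast

# What a genres cell looks like after the loop: a string, not a list.
cell = str(["Action", "Adventure", "Fantasy", "Science Fiction"])

# On the string, `in` is a substring test, so partial words also match.
substring_hit = "Fiction" in cell

# ast.literal_eval safely parses the Python literal back into a list.
genres = ast.literal_eval(cell)

# On the list, `in` only matches whole genre names.
exact_hit = "Fiction" in genres

print(substring_hit, exact_hit, len(genres))
```

Whether to store strings or lists depends on the later steps; strings are convenient for `str.contains` filters, while lists are easier to count and explode.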
# Apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index, i in zip(movie.index, movie['keywords']):
    list2 = []
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index, 'keywords'] = str(list2)
# Likewise for production_companies
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index, i in zip(movie.index, movie['production_companies']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_companies'] = str(list3)
# Likewise for production_countries
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index, i in zip(movie.index, movie['production_countries']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'production_countries'] = str(list3)
# Likewise for spoken_languages
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index, i in zip(movie.index, movie['spoken_languages']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index, 'spoken_languages'] = str(list3)
movie.head(1)
Row 0 (columns listed vertically):
budget: 237000000
genres: ['Action', 'Adventure', 'Fantasy', 'Science Fi…
homepage: http://www.avatarmovie.com/
id: 19995
keywords: ['culture clash', 'future', 'space war', 'spac…
original_language: en
original_title: Avatar
overview: In the 22nd century, a paraplegic Marine is di…
popularity: 150.437577
production_companies: ['Ingenious Film Partners', 'Twentieth Century…
production_countries: ['United States of America', 'United Kingdom']
release_date: 2009-12-10
revenue: 2787965087
runtime: 162.0
spoken_languages: ['English', 'Español']
status: Released
tagline: Enter the World of Pandora.
title: Avatar
vote_average: 7.2
vote_count: 11800
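The five loops above differ only in the column name, so they could be folded into one helper applied column by column. This is a sketch rather than the code used in this post: `names_from_json`, `raw_movie`, and `json_columns` are hypothetical names, and the frame holds invented values. Unlike the loops, it keeps real Python lists in the cells.

```python
import json
import pandas as pd

def names_from_json(raw):
    """Parse a JSON list of objects and return the 'name' of each entry."""
    return [item["name"] for item in json.loads(raw)]

# Stand-in for the raw `movie` frame; values are illustrative only.
raw_movie = pd.DataFrame({
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
               "[]"],
    "keywords": ['[{"id": 1463, "name": "culture clash"}]', "[]"],
})

# Extend this list with the other JSON columns as needed.
json_columns = ["genres", "keywords"]
for col in json_columns:
    # Series.apply runs the helper on every cell and stores the resulting list.
    raw_movie[col] = raw_movie[col].apply(names_from_json)

print(raw_movie)
```

To reproduce the string form used in the rest of this post, the result can be passed through `str`, for example `raw_movie[col].apply(str)`.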
credit.head(1)
Row 0 (columns listed vertically):
movie_id: 19995
title: Avatar
cast: [{"cast_id": 242, "character": "Jake Sully", "…
crew: [{"credit_id": "52fe48009251416c750aca23", "de…
credit['cast'] = credit['cast'].apply(json.loads)
for index, i in zip(credit.index, credit['cast']):
    list3 = []
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index, 'cast'] = str(list3)
credit['crew'] = credit['crew'].apply(json.loads)
def director(x):
    # Return the name of the first crew member whose job is 'Director';
    # falls through and returns None if the crew list has no director.
    for i in x:
        if i['job'] == 'Director':
            return i['name']
credit['crew'] = credit['crew'].apply(director)
credit.rename(columns={'crew': 'director'}, inplace=True)
credit.head(1)
Row 0 (columns listed vertically):
movie_id: 19995
title: Avatar
cast: ['Sam Worthington', 'Zoe Saldana', 'Sigourney …
director: James Cameron
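The `director` function above keeps only the first crew entry whose job is 'Director' and implicitly returns None when there is none. A sketch of the same behaviour with the None case made explicit is shown below; `first_director` and the two crew lists are hypothetical, with placeholder names apart from the director shown in the output above.

```python
def first_director(crew):
    """Return the name of the first crew member whose job is 'Director', else None."""
    return next(
        (member["name"] for member in crew if member.get("job") == "Director"),
        None,
    )

# Illustrative crew lists; real ones come from json.loads on the crew column.
crew_with = [
    {"job": "Editor", "name": "Editor A"},
    {"job": "Director", "name": "James Cameron"},
]
crew_without = [{"job": "Editor", "name": "Editor B"}]

print(first_director(crew_with))
print(first_director(crew_without))
```

Rows that come back as None become missing values in the director column, which is worth keeping in mind when ranking directors later.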
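The id column in movie and the movie_id column in credit hold the same identifiers, so the two tables can be merged to keep all the information in one table.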
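The claim that both key columns hold the same identifiers can be checked in code before relying on it. The sketch below uses two invented two-row frames, `movie_s` and `credit_s`, in place of the real tables. `validate="one_to_one"` is an optional pandas argument that raises an error if either key contains duplicates.

```python
import pandas as pd

# Stand-ins for `movie` and `credit`; values are illustrative only.
movie_s = pd.DataFrame({"id": [19995, 2], "title": ["Avatar", "Movie B"]})
credit_s = pd.DataFrame({
    "movie_id": [2, 19995],
    "title": ["Movie B", "Avatar"],
    "director": ["Director B", "James Cameron"],
})

# The keys should cover the same set of films in both tables.
same_keys = set(movie_s["id"]) == set(credit_s["movie_id"])

# validate="one_to_one" makes pandas raise MergeError on duplicate keys.
merged = pd.merge(movie_s, credit_s, left_on="id", right_on="movie_id",
                  how="left", validate="one_to_one")

# Both inputs have a title column, so pandas suffixes them as title_x / title_y.
print(same_keys, list(merged.columns))
```

If `same_keys` were False on the real data, a left merge would leave missing cast and director values for the unmatched movies.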
fulldf = pd.merge(movie, credit, left_on='id', right_on='movie_id', how='left')
fulldf.head(1)
Row 0 (columns listed vertically; … marks columns elided in the display):
budget: 237000000
genres: ['Action', 'Adventure', 'Fantasy', 'Science Fi…
homepage: http://www.avatarmovie.com/
id: 19995
keywords: ['culture clash', 'future', 'space war', 'spac…
original_language: en
original_title: Avatar
overview: In the 22nd century, a paraplegic Marine is di…
popularity: 150.437577
production_companies: ['Ingenious Film Partners', 'Twentieth Century…
…
spoken_languages: ['English', 'Español']
status: Released
tagline: Enter the World of Pandora.
title_x: Avatar
vote_average: 7.2
vote_count: 11800
movie_id: 19995
title_y: Avatar
cast: ['Sam Worthington', 'Zoe Saldana', 'Sigourney …
director: James Cameron
1 rows × 24 columns
fulldf.shape
(4803, 24)
fulldf.rename(columns={
'title_x' :'title' },inplace=