Kaggle TMDB 5000 Movie Data Analysis and Movie Recommendation Model

iam_emily


Article link: https://blog.csdn.net/iam_emily/article/details/80418800

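The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationships among profit, rating, and popularity
3. Which directors' films are box-office hits or well rated
4. The most prolific cast and crew members
5. Movie keyword analysis
6. Similarity-based movie recommendation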

Data Analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # suppress warnings

movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')

movie.head(1)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam…	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[{"name": "Ingenious Film Partners", "id": 289…	[{"iso_3166_1": "US", "name": "United States o…	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso…	Released	Enter the World of Pandora.	Avatar	7.2	11800

movie.tail(3)

	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
4800	[{"id": 35, "name": "Comedy"}, {"id": 18, "nam…	http://www.hallmarkchannel.com/signedsealeddel…	231617	[{"id": 248, "name": "date"}, {"id": 699, "nam…	en	Signed, Sealed, Delivered	"Signed, Sealed, Delivered" introduces a dedic…	1.444476	[{"name": "Front Street Pictures", "id": 3958}…	[{"iso_3166_1": "US", "name": "United States o…	2013-10-13	120.0	[{"iso_639_1": "en", "name": "English"}]	Released	NaN	Signed, Sealed, Delivered	7.0	6
4801	[]	http://shanghaicalling.com/	126186	[]	en	Shanghai Calling	When ambitious New York attorney Sam is sent t…	0.857008	[]	[{"iso_3166_1": "US", "name": "United States o…	2012-05-03	98.0	[{"iso_639_1": "en", "name": "English"}]	Released	A New Yorker in Shanghai	Shanghai Calling	5.7	7
4802	[{"id": 99, "name": "Documentary"}]	NaN	25975	[{"id": 1523, "name": "obsession"}, {"id": 224…	en	My Date with Drew	Ever since the second grade when he first saw …	1.929883	[{"name": "rusty bear entertainment", "id": 87…	[{"iso_3166_1": "US", "name": "United States o…	2005-08-05	90.0	[{"iso_639_1": "en", "name": "English"}]	Released	NaN	My Date with Drew	6.3	16

movie.info()  # 4,803 samples; some features have missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

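There are 4,803 samples, and some features have missing values. homepage and tagline are missing most often, but neither affects the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.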
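A minimal sketch of how these gaps could be inspected and filled, run on a small made-up frame rather than the real CSV. The fill choices (median for runtime, leaving unknown dates as NaT after parsing) are assumptions for illustration, not necessarily what the rest of this analysis uses.

```python
import pandas as pd

# Toy stand-in for the movie table; all values are invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [120.0, None, 90.0, 100.0],
    "release_date": ["2009-12-10", "2012-05-03", None, "2005-08-05"],
    "genres": ['[{"id": 28, "name": "Action"}]', "[]", "[]",
               '[{"id": 35, "name": "Comedy"}]'],
})

# Count true missing values per column.
missing = toy.isnull().sum()

# "[]" is a valid string, so isnull() does not flag it; count it separately.
empty_genres = (toy["genres"] == "[]").sum()

# Assumed fill strategy: median for runtime; parse dates, unknown ones become NaT.
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())
toy["release_date"] = pd.to_datetime(toy["release_date"])

print(missing)
print("empty genres:", empty_genres)
```

The point of counting "[]" separately is that info() reports those columns as fully non-null even though some rows carry no genre or keyword information.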

credit.info()

Data Cleaning

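Many features are stored as JSON strings, that is, lists of dictionary-like key-value pairs. To make later processing easier, we convert them into str or list forms that are convenient to handle in Python, so the useful information can be extracted.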
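As a small illustration of that conversion, the sketch below parses one JSON string of the kind found in the genres column and keeps only the "name" values. names_from_json is a hypothetical helper name introduced here, not something from the original notebook.

```python
import json

def names_from_json(text):
    """Parse a JSON array of objects and return the list of their "name" values."""
    return [item["name"] for item in json.loads(text)]

# A string shaped like one cell of the genres column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

print(names_from_json(sample))
print(names_from_json("[]"))  # rows with no genres give an empty list
```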

# movie genres: parse the JSON string so that films can be grouped by genre
movie['genres'] = movie['genres'].apply(json.loads)
# Series.apply calls the function on every element of the column

movie['genres'].head(1)

0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object

list(zip(movie.index,movie['genres']))[:2]

[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]

for index, items in zip(movie.index, movie['genres']):
    names = [item['name'] for item in items]  # genre names: Action, Adventure, ...
    movie.loc[index, 'genres'] = str(names)

movie.head(1)
# the genres column is no longer JSON: the value under each "name" key (the genre) has been extracted and assigned back to genres

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	['Action', 'Adventure', 'Fantasy', 'Science Fi…	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	[{"name": "Ingenious Film Partners", "id": 289…	[{"iso_3166_1": "US", "name": "United States o…	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso…	Released	Enter the World of Pandora.	Avatar	7.2	11800
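Because the loop stores the names with str(...), each genres cell is now a string that only looks like a list. A minimal sketch, assuming that storage format, of how the list could be recovered later with ast.literal_eval from the standard library; the variable names are illustrative.

```python
import ast

# What a genres cell holds after the conversion above: a string, not a list.
cell = str(["Action", "Adventure", "Fantasy", "Science Fiction"])

# Membership tests on the raw string are substring tests and can mislead.
substring_hit = "Fiction" in cell   # matches inside "Science Fiction"

# literal_eval parses the Python literal back into a real list.
genres = ast.literal_eval(cell)
exact_hit = "Fiction" in genres     # compares against whole genre names

print(type(cell).__name__, type(genres).__name__)
print(substring_hit, exact_hit)
```

Keeping the string form is fine for display, but any per-genre counting is safer on the recovered list.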

# apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index, items in zip(movie.index, movie['keywords']):
    names = [item['name'] for item in items]
    movie.loc[index, 'keywords'] = str(names)

# likewise for production_companies
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index, items in zip(movie.index, movie['production_companies']):
    names = [item['name'] for item in items]
    movie.loc[index, 'production_companies'] = str(names)

movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index, items in zip(movie.index, movie['production_countries']):
    names = [item['name'] for item in items]
    movie.loc[index, 'production_countries'] = str(names)

movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index, items in zip(movie.index, movie['spoken_languages']):
    names = [item['name'] for item in items]
    movie.loc[index, 'spoken_languages'] = str(names)

movie.head(1)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	['Action', 'Adventure', 'Fantasy', 'Science Fi…	http://www.avatarmovie.com/	19995	['culture clash', 'future', 'space war', 'spac…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	['Ingenious Film Partners', 'Twentieth Century…	['United States of America', 'United Kingdom']	2009-12-10	2787965087	162.0	['English', 'Español']	Released	Enter the World of Pandora.	Avatar	7.2	11800

credit.head(1)

	movie_id	title	cast	crew
0	19995	Avatar	[{"cast_id": 242, "character": "Jake Sully", "…	[{"credit_id": "52fe48009251416c750aca23", "de…

credit['cast'] = credit['cast'].apply(json.loads)
for index, members in zip(credit.index, credit['cast']):
    names = [member['name'] for member in members]
    credit.loc[index, 'cast'] = str(names)

credit['crew'] = credit['crew'].apply(json.loads)
# extract the director from crew and add a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return None  # no crew entry with the job 'Director'
credit['crew'] = credit['crew'].apply(director)
credit.rename(columns={'crew': 'director'}, inplace=True)

credit.head(1)

	movie_id	title	cast	director
0	19995	Avatar	['Sam Worthington', 'Zoe Saldana', 'Sigourney …	James Cameron
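The director function keeps only the first crew entry whose job is 'Director' and yields None when a film has no such entry. A minimal sketch of a variant that makes both cases visible by collecting every director; all_directors is a hypothetical name, and the crew records below are invented.

```python
def all_directors(crew):
    """Return the names of every crew member whose job is 'Director'."""
    return [member["name"] for member in crew if member.get("job") == "Director"]

# Invented crew lists in the same shape as the parsed crew column.
co_directed = [
    {"job": "Director", "name": "Person A"},
    {"job": "Producer", "name": "Person B"},
    {"job": "Director", "name": "Person C"},
]
no_director = [{"job": "Editor", "name": "Person D"}]

print(all_directors(co_directed))
print(all_directors(no_director))
```

Whether a list or a single name is more convenient depends on the later director analysis; the single-name version is simpler to group by.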

The id column in movie matches movie_id in credit, so the two tables can be merged to bring all the information together in one table.

fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')

fulldf.head(1)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	…	spoken_languages	status	tagline	title_x	vote_average	vote_count	movie_id	title_y	cast	director
0	237000000	['Action', 'Adventure', 'Fantasy', 'Science Fi…	http://www.avatarmovie.com/	19995	['culture clash', 'future', 'space war', 'spac…	en	Avatar	In the 22nd century, a paraplegic Marine is di…	150.437577	['Ingenious Film Partners', 'Twentieth Century…	…	['English', 'Español']	Released	Enter the World of Pandora.	Avatar	7.2	11800	19995	Avatar	['Sam Worthington', 'Zoe Saldana', 'Sigourney …	James Cameron

1 rows × 24 columns

fulldf.shape

(4803, 24)
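The unchanged row count suggests the keys line up, but a left merge would also keep 4,803 rows if some movies had no credits. A minimal sketch, on invented miniature tables, of how the validate and indicator parameters of pd.merge could check this; it also shows where the title_x and title_y columns come from.

```python
import pandas as pd

# Invented miniature versions of the two tables.
movie_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credit_toy = pd.DataFrame({
    "movie_id": [1, 2],
    "title": ["A", "B"],
    "director": ["X", "Y"],
})

merged = pd.merge(
    movie_toy,
    credit_toy,
    left_on="id",
    right_on="movie_id",
    how="left",
    validate="one_to_one",  # raises MergeError if either key column has duplicates
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows that found no partner in the credits table.
unmatched = (merged["_merge"] == "left_only").sum()

print(merged.columns.tolist())
print("movies without credits:", unmatched)
```

On the real data one would expect zero unmatched rows; if that does not hold, the director and cast columns will contain NaN for those films.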

# both tables have a title column, which the merge automatically renamed to title_x and title_y
fulldf.rename(columns={'title_x': 'title'}, inplace=True)
