Kaggle TMDB 5000 Movie Data Analysis and Movie Recommendation Model

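The data comes from the TMDB 5000 movie dataset on Kaggle. This analysis covers visualization of the movie data and a simple movie recommendation model, including:
1. The distribution of movie genres and how it changes over time
2. The relationship between profit, rating, and popularity
3. Which directors' movies sell well or are well rated
4. The most prolific cast and crew members
5. Movie keyword analysis
6. Similarity-based movie recommendation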

Data analysis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import json
import warnings
warnings.filterwarnings('ignore')  # suppress warnings
movie = pd.read_csv('tmdb_5000_movies.csv')
credit = pd.read_csv('tmdb_5000_credits.csv')
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [{“id”: 28, “name”: “Action”}, {“id”: 12, “nam… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
movie.tail(3)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
4800 0 [{“id”: 35, “name”: “Comedy”}, {“id”: 18, “nam… http://www.hallmarkchannel.com/signedsealeddel… 231617 [{“id”: 248, “name”: “date”}, {“id”: 699, “nam… en Signed, Sealed, Delivered “Signed, Sealed, Delivered” introduces a dedic… 1.444476 [{“name”: “Front Street Pictures”, “id”: 3958}… [{“iso_3166_1”: “US”, “name”: “United States o… 2013-10-13 0 120.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN Signed, Sealed, Delivered 7.0 6
4801 0 [] http://shanghaicalling.com/ 126186 [] en Shanghai Calling When ambitious New York attorney Sam is sent t… 0.857008 [] [{“iso_3166_1”: “US”, “name”: “United States o… 2012-05-03 0 98.0 [{“iso_639_1”: “en”, “name”: “English”}] Released A New Yorker in Shanghai Shanghai Calling 5.7 7
4802 0 [{“id”: 99, “name”: “Documentary”}] NaN 25975 [{“id”: 1523, “name”: “obsession”}, {“id”: 224… en My Date with Drew Ever since the second grade when he first saw … 1.929883 [{“name”: “rusty bear entertainment”, “id”: 87… [{“iso_3166_1”: “US”, “name”: “United States o… 2005-08-05 0 90.0 [{“iso_639_1”: “en”, “name”: “English”}] Released NaN My Date with Drew 6.3 16
movie.info()  # 4803 samples; some features have missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

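The dataset has 4803 samples, and some features have missing values. homepage and tagline are missing the most, but neither affects the basic analysis; release_date and runtime can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies. These are not counted as missing by info(), so they need attention.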
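The checks described above can be sketched on a small made-up table. This is only an illustration: the `toy` frame and its values are hypothetical, and filling runtime with the median is one possible strategy assumed here, not necessarily what the original analysis does.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie table; the values are invented.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'release_date': ['2009-12-10', None, '2012-05-03', '2005-08-05'],
    'runtime': [162.0, 120.0, np.nan, 90.0],
    'genres': ['[{"id": 28, "name": "Action"}]', '[]', '[]',
               '[{"id": 99, "name": "Documentary"}]'],
})

# Count true missing values (NaN / None) per column.
missing = toy.isnull().sum()

# Empty JSON lists are ordinary strings, so isnull() does not flag them;
# they have to be counted separately.
empty_genres = (toy['genres'] == '[]').sum()

# Assumed fill strategy: replace a missing runtime with the column median.
filled = toy.copy()
filled['runtime'] = filled['runtime'].fillna(filled['runtime'].median())
```

A missing release_date is better looked up and entered by hand than imputed, since a single date cannot be estimated reliably from the other rows.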

credit.info()

Data cleaning

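Many of the features are stored as JSON strings, that is, lists of dictionary-like key-value pairs. To make later processing easier, we convert them into a str or list that Python can work with directly, so that the useful information can be extracted.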
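As a minimal sketch of this conversion, the parsing and the name extraction can be combined into one helper. The name `extract_names` and the `toy` frame are hypothetical and used only for illustration. The sketch keeps real Python lists, whereas the code in this post stores `str(list)` in the column.

```python
import json
import pandas as pd

def extract_names(json_text):
    """Parse a JSON array of objects and return the list of their 'name' values."""
    return [item['name'] for item in json.loads(json_text)]

# Hypothetical two-row stand-in for the genres column.
toy = pd.DataFrame({
    'genres': [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        '[]',  # empty JSON list, as seen in some rows of the dataset
    ]
})

# Series.apply calls the helper once per cell.
toy['genre_list'] = toy['genres'].apply(extract_names)
```

An empty JSON list simply yields an empty Python list, so rows without genres need no special handling at this step.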

# movie genres: parse the JSON so the genre names can be used for grouping
movie['genres']=movie['genres'].apply(json.loads)
# Series.apply calls the given function on every value in the column
movie['genres'].head(1)
0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
Name: genres, dtype: object
list(zip(movie.index,movie['genres']))[:2]
[(0,
  [{'id': 28, 'name': 'Action'},
   {'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 878, 'name': 'Science Fiction'}]),
 (1,
  [{'id': 12, 'name': 'Adventure'},
   {'id': 14, 'name': 'Fantasy'},
   {'id': 28, 'name': 'Action'}])]
for index,i in zip(movie.index,movie['genres']):
    list1=[]
    for j in range(len(i)):
        list1.append((i[j]['name']))# name:genres,Action...
    movie.loc[index,'genres']=str(list1)
movie.head(1)
# The genres column is no longer JSON: the value under each 'name' key (the genre) has been extracted and assigned back to genres
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [{“id”: 1463, “name”: “culture clash”}, {“id”:… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [{“name”: “Ingenious Film Partners”, “id”: 289… [{“iso_3166_1”: “US”, “name”: “United States o… 2009-12-10 2787965087 162.0 [{“iso_639_1”: “en”, “name”: “English”}, {“iso… Released Enter the World of Pandora. Avatar 7.2 11800
# Apply the same method to the keywords column
movie['keywords'] = movie['keywords'].apply(json.loads)
for index,i in zip(movie.index,movie['keywords']):
    list2=[]
    for j in range(len(i)):
        list2.append(i[j]['name'])
    movie.loc[index,'keywords'] = str(list2)
# Likewise for production_companies, production_countries and spoken_languages
movie['production_companies'] = movie['production_companies'].apply(json.loads)
for index,i in zip(movie.index,movie['production_companies']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_companies']=str(list3)
movie['production_countries'] = movie['production_countries'].apply(json.loads)
for index,i in zip(movie.index,movie['production_countries']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'production_countries']=str(list3)
movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
for index,i in zip(movie.index,movie['spoken_languages']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    movie.loc[index,'spoken_languages']=str(list3)
movie.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [‘culture clash’, ‘future’, ‘space war’, ‘spac… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [‘Ingenious Film Partners’, ‘Twentieth Century… [‘United States of America’, ‘United Kingdom’] 2009-12-10 2787965087 162.0 [‘English’, ‘Español’] Released Enter the World of Pandora. Avatar 7.2 11800
credit.head(1)
movie_id title cast crew
0 19995 Avatar [{“cast_id”: 242, “character”: “Jake Sully”, “… [{“credit_id”: “52fe48009251416c750aca23”, “de…
credit['cast'] = credit['cast'].apply(json.loads)
for index,i in zip(credit.index,credit['cast']):
    list3=[]
    for j in range(len(i)):
        list3.append(i[j]['name'])
    credit.loc[index,'cast']=str(list3)
credit['crew'] = credit['crew'].apply(json.loads)
# Extract the director from crew and keep it as a director column for later analysis
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
credit['crew']=credit['crew'].apply(director)
credit.rename(columns={
  'crew':'director'},inplace=True)
credit.head(1)
movie_id title cast director
0 19995 Avatar [‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney … James Cameron
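Two details of the director extraction are worth making explicit: only the first crew member with the job 'Director' is kept, and a movie with no such entry ends up with None. A small sketch, assuming a hypothetical helper `first_director` and made-up crew lists, shows both cases:

```python
def first_director(crew):
    """Return the name of the first crew member whose job is 'Director', else None."""
    return next((m['name'] for m in crew if m.get('job') == 'Director'), None)

# Made-up crew lists in the same shape as the parsed crew column.
crew_with = [
    {'job': 'Editor', 'name': 'E'},
    {'job': 'Director', 'name': 'D1'},
    {'job': 'Director', 'name': 'D2'},  # a co-director is ignored
]
crew_without = [{'job': 'Editor', 'name': 'E'}]

picked = first_director(crew_with)
missing = first_director(crew_without)
```

Rows where the result is None show up as missing values in the director column, so they should be dropped or filled before any per-director statistics are computed.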

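The id column in movie holds the same values as movie_id in credit, so the two tables can be merged to bring all the information together in a single table.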

fulldf = pd.merge(movie,credit,left_on='id',right_on='movie_id',how='left')
fulldf.head(1)
budget genres homepage id keywords original_language original_title overview popularity production_companies spoken_languages status tagline title_x vote_average vote_count movie_id title_y cast director
0 237000000 [‘Action’, ‘Adventure’, ‘Fantasy’, ‘Science Fi… http://www.avatarmovie.com/ 19995 [‘culture clash’, ‘future’, ‘space war’, ‘spac… en Avatar In the 22nd century, a paraplegic Marine is di… 150.437577 [‘Ingenious Film Partners’, ‘Twentieth Century… [‘English’, ‘Español’] Released Enter the World of Pandora. Avatar 7.2 11800 19995 Avatar [‘Sam Worthington’, ‘Zoe Saldana’, ‘Sigourney … James Cameron

1 rows × 24 columns

fulldf.shape
(4803, 24)
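When both tables contain a column with the same name that is not used as a merge key, pandas keeps both copies and appends the default suffixes _x and _y. A minimal sketch on two made-up frames (`left`, `right` and their values are hypothetical) shows this behavior, along with one possible way to tidy the result:

```python
import pandas as pd

# Hypothetical miniature versions of the movie and credit tables.
left = pd.DataFrame({'id': [1, 2], 'title': ['A', 'B'], 'budget': [10, 20]})
right = pd.DataFrame({'movie_id': [1, 2], 'title': ['A', 'B'],
                      'director': ['X', 'Y']})

# Left join on differently named keys; the shared 'title' column is
# duplicated as title_x (from left) and title_y (from right).
merged = pd.merge(left, right, left_on='id', right_on='movie_id', how='left')

# One way to tidy up: drop the redundant copies and restore the plain name.
tidy = (merged
        .drop(columns=['movie_id', 'title_y'])
        .rename(columns={'title_x': 'title'}))
```

Dropping title_y is only safe if the two title columns really agree, which is worth checking on the real data before discarding one of them.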
# Both tables have a title column, so the merge automatically named them title_x and title_y
fulldf.rename(columns={
  'title_x':'title'},inplace=True)