Kaggle TMDB 5000 movie data analysis and movie recommendation model

wx1871428

Category: Data analysis

Article link: https://blog.csdn.net/wx1871428/article/details/118540847

This post analyzes the TMDB 5000 movie dataset from Kaggle: genre distribution, the relationship between profit and rating, director performance, the most active cast and crew, keyword analysis, and a movie recommendation model. It also covers handling JSON-formatted columns with functions such as json.loads and json.dumps.

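The data is the TMDB 5000 movie dataset from Kaggle. The analysis consists mainly of visualizing the movie data and building a simple movie recommendation model, covering:
1. The distribution of movie genres and how it changes over time
2. The relationship between profit, rating, and popularity
3. Which directors' movies sell well or are rated highly
4. The most prolific cast and crew members
5. Movie keyword analysis
6. Recommendation based on movie similarity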

## Data analysis

```code
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    plt.style.use('ggplot')
    import json
    import warnings
    warnings.filterwarnings('ignore')  # suppress warnings
[/code]

```code
    movie = pd.read_csv('tmdb_5000_movies.csv')
    credit = pd.read_csv('tmdb_5000_credits.csv')
[/code]

```code
    movie.head(1)
[/code]

|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | [{"id": 28, "name": "Action"}, {"id": 12, "nam… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```code
    movie.tail(3)
[/code]  
  
|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4800 | 0 | [{"id": 35, "name": "Comedy"}, {"id": 18, "nam… | http://www.hallmarkchannel.com/signedsealeddel… | 231617 | [{"id": 248, "name": "date"}, {"id": 699, "nam… | en | Signed, Sealed, Delivered | "Signed, Sealed, Delivered" introduces a dedic… | 1.444476 | [{"name": "Front Street Pictures", "id": 3958}… | [{"iso_3166_1": "US", "name": "United States o… | 2013-10-13 | 0 | 120.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | Signed, Sealed, Delivered | 7.0 | 6 |
| 4801 | 0 | [] | http://shanghaicalling.com/ | 126186 | [] | en | Shanghai Calling | When ambitious New York attorney Sam is sent t… | 0.857008 | [] | [{"iso_3166_1": "US", "name": "United States o… | 2012-05-03 | 0 | 98.0 | [{"iso_639_1": "en", "name": "English"}] | Released | A New Yorker in Shanghai | Shanghai Calling | 5.7 | 7 |
| 4802 | 0 | [{"id": 99, "name": "Documentary"}] | NaN | 25975 | [{"id": 1523, "name": "obsession"}, {"id": 224… | en | My Date with Drew | Ever since the second grade when he first saw … | 1.929883 | [{"name": "rusty bear entertainment", "id": 87… | [{"iso_3166_1": "US", "name": "United States o… | 2005-08-05 | 0 | 90.0 | [{"iso_639_1": "en", "name": "English"}] | Released | NaN | My Date with Drew | 6.3 | 16 |

```code
    movie.info()  # 4803 samples; some features have missing values
[/code]

```code
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 20 columns):
    budget                  4803 non-null int64
    genres                  4803 non-null object
    homepage                1712 non-null object
    id                      4803 non-null int64
    keywords                4803 non-null object
    original_language       4803 non-null object
    original_title          4803 non-null object
    overview                4800 non-null object
    popularity              4803 non-null float64
    production_companies    4803 non-null object
    production_countries    4803 non-null object
    release_date            4802 non-null object
    revenue                 4803 non-null int64
    runtime                 4801 non-null float64
    spoken_languages        4803 non-null object
    status                  4803 non-null object
    tagline                 3959 non-null object
    title                   4803 non-null object
    vote_average            4803 non-null float64
    vote_count              4803 non-null int64
    dtypes: float64(3), int64(4), object(13)
    memory usage: 750.5+ KB

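[/code]

The dataset has 4803 samples, and some features have missing values. homepage and tagline are missing in many rows (only 1712 and 3959 non-null values respectively), but neither matters for the basic analysis. The few missing release_date and runtime values can be filled in. On closer inspection, some samples have [] as the value of genres, keywords, or production_companies, which needs attention.

```code
    credit.info()
[/code]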
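
The missing-value handling described above could be sketched as follows. This is a minimal illustration on a small hypothetical DataFrame: the `toy` frame and its values are made up, and only the column names come from the dataset. Filling `runtime` with the median is an assumption, since the post does not say which fill value it uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; only the column names match the TMDB data.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "runtime": [100.0, np.nan, 90.0, 120.0],
    "genres": [
        '[{"id": 18, "name": "Drama"}]',
        "[]",
        "[]",
        '[{"id": 35, "name": "Comedy"}]',
    ],
})

# Count missing values per column (the complement of info()'s non-null counts).
missing_per_column = toy.isna().sum()

# Assumption: fill missing runtimes with the median runtime.
toy["runtime"] = toy["runtime"].fillna(toy["runtime"].median())

# An empty JSON list "[]" is not NaN, so it has to be counted separately.
empty_genres = int((toy["genres"] == "[]").sum())

print(missing_per_column["runtime"], toy["runtime"].tolist(), empty_genres)
```

Counting the `[]` cells separately matters because `info()` reports them as non-null, so they would otherwise go unnoticed.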

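##  Data cleaning

Many of the features are stored as JSON strings, that is, lists of dictionary-like key-value pairs. To simplify later processing, we convert them into Python str or list form, which makes the useful information easy to extract.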
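
As a small illustration of the conversion described above, the sketch below parses one JSON string with `json.loads` and serializes it back with `json.dumps`. The `raw` value is a shortened example in the same shape as the dataset's genres column, not a row copied from the file.

```python
import json

# A shortened genres value in the same shape as the dataset's JSON columns.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads: JSON string -> Python objects (here a list of dicts).
parsed = json.loads(raw)
names = [item["name"] for item in parsed]

# json.dumps goes the other way: Python objects -> JSON string.
round_trip = json.dumps(parsed)

print(type(parsed).__name__, names)
print(json.loads(round_trip) == parsed)
```

After parsing, each element is an ordinary dict, so the genre name is available as `item["name"]`.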

```code
    # genres: the movie's genres, useful for grouping
    movie['genres']=movie['genres'].apply(json.loads)
    # Series.apply calls the given function (here json.loads) on every value in the column
[/code]

```code
    movie['genres'].head(1)
[/code]

```code
    0    [{'id': 28, 'name': 'Action'}, {'id': 12, 'nam...
    Name: genres, dtype: object
[/code]

```code
    list(zip(movie.index,movie['genres']))[:2]
[/code]

```code
    [(0,
      [{'id': 28, 'name': 'Action'},
       {'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
       {'id': 878, 'name': 'Science Fiction'}]),
     (1,
      [{'id': 12, 'name': 'Adventure'},
       {'id': 14, 'name': 'Fantasy'},
       {'id': 28, 'name': 'Action'}])]
[/code]

```code
    for index,i in zip(movie.index,movie['genres']):
        list1=[]
        for j in range(len(i)):
            list1.append(i[j]['name'])  # keep only each genre's name, e.g. Action
        movie.loc[index,'genres']=str(list1)  # store the names as the string form of a list
[/code]
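
The row-by-row loop above can also be written as a small helper passed to `Series.apply`. The sketch below is one possible alternative, not the post's own code: `extract_names` is a hypothetical helper name, and the two-row `demo` frame stands in for `movie`. Unlike the loop, the helper takes the raw JSON string, so it parses and extracts in one step. It keeps the post's choice of storing the result as the string form of a list.

```python
import json

import pandas as pd


def extract_names(json_text):
    """Parse a JSON list of dicts and return str() of the 'name' values."""
    return str([item["name"] for item in json.loads(json_text)])


# Hypothetical stand-in for the movie frame, holding raw JSON strings.
demo = pd.DataFrame({
    "genres": [
        '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
        "[]",
    ]
})

demo["genres"] = demo["genres"].apply(extract_names)
print(demo["genres"].tolist())
```

The same helper could be reused for the other JSON columns whose entries carry a `name` key.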

```code
    movie.head(1)
    # genres is no longer JSON: the value under each 'name' key (the genre) was extracted and assigned back to genres
[/code]

|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | [{"id": 1463, "name": "culture clash"}, {"id":… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | [{"name": "Ingenious Film Partners", "id": 289… | [{"iso_3166_1": "US", "name": "United States o… | 2009-12-10 | 2787965087 | 162.0 | [{"iso_639_1": "en", "name": "English"}, {"iso… | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |

```code
    # apply the same method to the keywords column
    movie['keywords'] = movie['keywords'].apply(json.loads)
    for index,i in zip(movie.index,movie['keywords']):
        list2=[]
        for j in range(len(i)):
            list2.append(i[j]['name'])
        movie.loc[index,'keywords'] = str(list2)
[/code]

```code
    # likewise for production_companies
    movie['production_companies'] = movie['production_companies'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_companies']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_companies']=str(list3)
[/code]

```code
    movie['production_countries'] = movie['production_countries'].apply(json.loads)
    for index,i in zip(movie.index,movie['production_countries']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'production_countries']=str(list3)
[/code]

```code
    movie['spoken_languages'] = movie['spoken_languages'].apply(json.loads)
    for index,i in zip(movie.index,movie['spoken_languages']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        movie.loc[index,'spoken_languages']=str(list3)
[/code]

```code
    movie.head(1)
[/code]  
  
|  | budget | genres | homepage | id | keywords | original_language | original_title | overview | popularity | production_companies | production_countries | release_date | revenue | runtime | spoken_languages | status | tagline | title | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 237000000 | ['Action', 'Adventure', 'Fantasy', 'Science Fi… | http://www.avatarmovie.com/ | 19995 | ['culture clash', 'future', 'space war', 'spac… | en | Avatar | In the 22nd century, a paraplegic Marine is di… | 150.437577 | ['Ingenious Film Partners', 'Twentieth Century… | ['United States of America', 'United Kingdom'] | 2009-12-10 | 2787965087 | 162.0 | ['English', 'Español'] | Released | Enter the World of Pandora. | Avatar | 7.2 | 11800 |
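
Each cleaned cell above holds the string form of a list (the result of `str(list)`), not a list object. Later steps that need individual items therefore have to parse the cell back or search within the string. One possible way, offered as a sketch and not as the post's method, is `ast.literal_eval` from the standard library. The `cell` value below is copied from the spoken_languages cell shown in the table.

```python
import ast

cell = "['English', 'Español']"  # as stored after str(list)

# literal_eval safely evaluates a Python literal, giving back a real list.
languages = ast.literal_eval(cell)
print(languages, len(languages))

# Substring tests work directly on the string form, without parsing.
print("English" in cell)
```

Parsing is only needed when the items are used individually; simple membership checks can stay on the string.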

```code
    credit.head(1)
[/code]  
  
|  | movie_id | title | cast | crew |
|---|---|---|---|---|
| 0 | 19995 | Avatar | [{"cast_id": 242, "character": "Jake Sully", "… | [{"credit_id": "52fe48009251416c750aca23", "de… |

```code
    credit['cast'] = credit['cast'].apply(json.loads)
    for index,i in zip(credit.index,credit['cast']):
        list3=[]
        for j in range(len(i)):
            list3.append(i[j]['name'])
        credit.loc[index,'cast']=str(list3)
[/code]

```code
    credit['crew'] = credit['crew'].apply(json.loads)
    # extract the director from crew and add a director column for later analysis
    def director(x):
        for i in x:
            if i['job'] == 'Director':
                return i['name&