电影数据集TMDB数据分析练习

最新推荐文章于 2024-05-10 18:34:04 发布

新奥尔良乳猪

最新推荐文章于 2024-05-10 18:34:04 发布

阅读量7.6k

点赞数 7

分类专栏： PythonR 文章标签：可视化数据分析大数据机器学习

本文链接：https://blog.csdn.net/xh542428798/article/details/108706757

版权

PythonR 专栏收录该内容

4 篇文章 2 订阅

订阅专栏

加载TMDB数据集

TMDb电影数据库”，数据集中包含来自1960-2016年上映的近11000部电影的基本信息，主要包括了电影类型、预算、票房、演职人员、时长、评分等信息。
本文作为自学练习小项目，将从最原始的数据格式化、数据清洗、数据分析进行全面的学习
并且事无巨细，展示练习全过程

参考文章 https://blog.csdn.net/moyue1002/article/details/80332186
python 3.7
pandas 0.23
numpy 1.18
metplotlib 2.2

import pandas as pd

credits = pd.read_csv('./tmdb_5000_credits.csv')
movies = pd.read_csv('./tmdb_5000_movies.csv')

查看各个dataframe的一般信息

# 这是movies表的信息
movies.head(1)
print(movies.info())

Out[3]: 
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 20 columns):
    budget                  4803 non-null int64
    genres                  4803 non-null object
    homepage                1712 non-null object
    id                      4803 non-null int64
    keywords                4803 non-null object
    original_language       4803 non-null object
    original_title          4803 non-null object
    overview                4800 non-null object
    popularity              4803 non-null float64
    production_companies    4803 non-null object
    production_countries    4803 non-null object
    release_date            4802 non-null object
    revenue                 4803 non-null int64
    runtime                 4801 non-null float64
    spoken_languages        4803 non-null object
    status                  4803 non-null object
    tagline                 3959 non-null object
    title                   4803 non-null object
    vote_average            4803 non-null float64
    vote_count              4803 non-null int64
    dtypes: float64(3), int64(4), object(13)
    memory usage: 750.5+ KB
    None

这是credits表的信息

print(credits.info())
credits.head(1)

Out[4]: 
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 4803 entries, 0 to 4802
    Data columns (total 4 columns):
    movie_id    4803 non-null int64
    title       4803 non-null object
    cast        4803 non-null object
    crew        4803 non-null object
    dtypes: int64(1), object(3)
    memory usage: 150.2+ KB
    None

   movie_id                        ...                                                                       crew
0     19995                        ...                          [{"credit_id": "52fe48009251416c750aca23", "de...

credits表的cast列很奇怪，数据很多

进行具体查看

# 查看credists表的cast列索引0的值，发现是一长串东西
print('cast格式：', type(credits['cast'][0])) # 查看其类型，为`str`类型，无法处理
Out[5]:
    cast格式： <class 'str'>

json格式化数据处理

从表中看出，cast列其实是json格式化数据，应该用json包进行处理

json格式是`[{},{}]`

将json格式的字符串转换成Python对象用`json.loads()`

`json.load()`针对的是文件，从文件中读取json

import json
type(json.loads(credits['cast'][0]))
Out[6]:
    list

从上面可以看出`json.loads()`将json字符串转成了list,可以知道list里面又包裹多个dict

接下来批量处理

import json
json_col = ['cast','crew']
for i in json_col:
    credits[i] = credits[i].apply(json.loads)

credits['cast'][0][:3]

Out[7]:
    [{'cast_id': 242,
      'character': 'Jake Sully',
      'credit_id': '5602a8a7c3a3685532001c9a',
      'gender': 2,
      'id': 65731,
      'name': 'Sam Worthington',
      'order': 0},
     {'cast_id': 3,
      'character': 'Neytiri',
      'credit_id': '52fe48009251416c750ac9cb',
      'gender': 1,
      'id': 8691,
      'name': 'Zoe Saldana',
      'order': 1},
     {'cast_id': 25,
      'character': 'Dr. Grace Augustine',
      'credit_id': '52fe48009251416c750aca39',
      'gender': 1,
      'id': 10205,
      'name': 'Sigourney Weaver',
      'order': 2}]

print('再次查看cast类型是:',type(credits['cast'][0])) 
# 数据类型变成了list，可以用于循环处理

Out[8]:
    再次查看cast类型是: <class 'list'>

提取其中的名字

credits['cast'][0][:3]
# credits第一行的cast，是个列表

Out[9]:
    [{'cast_id': 242,
      'character': 'Jake Sully',
      'credit_id': '5602a8a7c3a3685532001c9a',
      'gender': 2,
      'id': 65731,
      'name': 'Sam Worthington',
      'order': 0},
     {'cast_id': 3,
      'character': 'Neytiri',
      'credit_id': '52fe48009251416c750ac9cb',
      'gender': 1,
      'id': 8691,
      'name': 'Zoe Saldana',
      'order': 1},
     {'cast_id': 25,
      'character': 'Dr. Grace Augustine',
      'credit_id': '52fe48009251416c750aca39',
      'gender': 1,
      'id': 10205,
      'name': 'Sigourney Weaver',
      'order': 2}]

credits['cast'][0][0]['name'] # 获取第一行第一个字典的人名

Out[10]:

    'Sam Worthington'

dict字典常用的函数

dict.get() 返回指定键的值，如果值不在字典中返回default值
dict.items() 以列表返回可遍历的(键, 值) 元组数组

# 代码测试如下：
i = credits['cast'][0][0]
for x in i.items():
    print(x)

Out[11]:
    ('cast_id', 242)
    ('character', 'Jake Sully')
    ('credit_id', '5602a8a7c3a3685532001c9a')
    ('gender', 2)
    ('id', 65731)
    ('name', 'Sam Worthington')
    ('order', 0)

创建get_names()函数，进一步分割cast

def get_names(x):
   return ','.join(i['name'] for i in x)
credits['cast'] = credits['cast'].apply(get_names)
credits['cast'][:3]

Out[12]:
    0    Sam Worthington,Zoe Saldana,Sigourney Weaver,S...
    1    Johnny Depp,Orlando Bloom,Keira Knightley,Stel...
    2    Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...
    Name: cast, dtype: object

crew提取导演

credits['crew'][0][0]
Out[13]:
    {'credit_id': '52fe48009251416c750aca23',
     'department': 'Editing',
     'gender': 0,
     'id': 1721,
     'job': 'Editor',
     'name': 'Stephen E. Rivkin'}

# 需要创建循环，找到job是director的，然后读取名字并返回
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']

credits['crew'] = credits['crew'].apply(director)

print(credits[['crew']][:3])
credits.rename(columns = {'crew':'director'},inplace=True) #修改列名
credits[['director']][:3]

Out[[14]:
    crew
    0   James Cameron
    1  Gore Verbinski
    2      Sam Mendes

movies表进行json解析

>>> movies.head(1)
      budget                                             genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  [{"id": 28, "name": "Action"}, {"id": 12, "nam...  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

可以看出genres, keywords, spoken_languages, production_countries, producion_companies需要json解析的

# 方法同crew表
json_col = ['genres','keywords','spoken_languages','production_countries','production_companies']
for i in json_col:
    movies[i] = movies[i].apply(json.loads)
    movies[i] = movies[i].apply(get_names)
>>> movies.head(1)  
      budget                                    genres                     homepage     id    ...                          tagline   title vote_average vote_count
0  237000000  Action,Adventure,Fantasy,Science Fiction  http://www.avatarmovie.com/  19995    ...      Enter the World of Pandora.  Avatar          7.2      11800

开始分析数据

credits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
movie_id    4803 non-null int64
title       4803 non-null object
cast        4803 non-null object
director    4773 non-null object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB

movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(13)
memory usage: 750.5+ KB

credits和movies都有一个id和title，他们是不是同一个东西？

检测一下

(credits['movie_id'] == movies['id']).describe()

count     4803
unique       1
top       True
freq      4803
dtype: object

(credits['title'] == movies['title']).describe()

count     4803
unique       1
top       True
freq      4803
Name: title, dtype: object

两列相同，合并数据

df = credits.merge(right=movies,how='inner',left_on='movie_id',right_on='id')
>>> df.head()
   movie_id                                   title_x    ...     vote_average vote_count
0     19995                                    Avatar    ...              7.2      11800
1       285  Pirates of the Caribbean: At World's End    ...              6.9       4500
2    206647                                   Spectre    ...              6.3       4466
3     49026                     The Dark Knight Rises    ...              7.6       9106
4     49529                               John Carter    ...              6.1       2124

df中有24个字段

movie_id:TMDB电影标识号
title_x & title_y: 这是合并时形成的两个一样的列，可删除一列，电影名称
cast：演员列表
direcor：导演
budget:预算
genres：电影风格
homepages:电影URL
id:同movie_id
original_language:电影语言
overview:剧情摘要
popularity:在database上的点击次数
production_companies:制作公司
production_countries:制作国家
release_date:上映时间
spoken_languages:口语
status:状态
tagline:电影标语
vote_average:平均评分
vote_count:评分次数

df.info()
# df[['movie_id','id']]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 24 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title_y                 4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(5), object(16)
memory usage: 938.1+ KB

字段缺失值处理

del df['title_y']
del df['id']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(15)
memory usage: 863.0+ KB

同时，从上面可以看到director,release_date,runtime有缺失值

director无法处理，只能处理release_date,runtime的缺失值
另外，homepage,original_tille,overview,spoken_language,tagline这几列数据我们也是用不到的，可以删除

df['release_date']=df['release_date'].fillna('2014-06-01')
df['runtime']=df['runtime'].fillna(df['runtime'].mean())

>>> df[['release_date','runtime']].isnull().describe()
       release_date runtime
count          4803    4803
unique            1       1
top           False   False
freq           4803    4803

>>> df.head(3)
   movie_id                                   title_x    ...     vote_average vote_count
0     19995                                    Avatar    ...              7.2      11800
1       285  Pirates of the Caribbean: At World's End    ...              6.9       4500
2    206647                                   Spectre    ...              6.3       4466

数据分析及可视化

处理日期时间

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4803 entries, 0 to 4802
Data columns (total 22 columns):
movie_id                4803 non-null int64
title_x                 4803 non-null object
cast                    4803 non-null object
director                4773 non-null object
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4803 non-null object
revenue                 4803 non-null int64
runtime                 4803 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null int64
dtypes: float64(3), int64(4), object(15)
memory usage: 863.0+ KB

# 从上面可以看出，release_time是object格式，因此要先转化为时间格式
df['release_year'] = pd.to_datetime(df.release_date,format='%Y-%m-%d').dt.year
df['release_month'] = pd.to_datetime(df.release_date,format='%Y-%m-%d').dt.month

>>> df.head(3)
   movie_id                                   title_x                                               cast      ...      vote_count  release_year release_month
0     19995                                    Avatar  Sam Worthington,Zoe Saldana,Sigourney Weaver,S...      ...           11800          2009            12
1       285  Pirates of the Caribbean: At World's End  Johnny Depp,Orlando Bloom,Keira Knightley,Stel...      ...            4500          2007             5
2    206647                                   Spectre  Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...      ...            4466          2015            10

电影类型分析

df['genres'][1].split(',') #split()分割字符串

['Adventure', 'Fantasy', 'Action']

set()可创建一个集合，集合的最重要特性是元素不可重复性

因此可以以此来得到电影所有类型总共归属于哪些

genre = set()
for i in df['genres'].str.split(','):
    genre=set().union(i,genre) # union()可以将i和genre合并到一起
genre

{'',
 'Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Foreign',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'TV Movie',
 'Thriller',
 'War',
 'Western'}

# 将上述集合转为list，并去除无用的‘’
genre = list(genre)
genre.remove('')
genre

['War',
 'History',
 'Science Fiction',
 'Foreign',
 'Western',
 'Action',
 'Comedy',
 'Family',
 'Documentary',
 'Animation',
 'Romance',
 'Drama',
 'Mystery',
 'Music',
 'Fantasy',
 'Horror',
 'TV Movie',
 'Adventure',
 'Thriller',
 'Crime']

电影类型和数量

for i in genre:
    df[i] = 0 # 创建名为i的列
    df[i][df.genres.str.contains(i)]  = 1 #genres列包含字符i时，赋值为1
>>> df.head(8)
   movie_id                                   title_x                                               cast           director   ...    Fantasy Romance Horror Foreign
0     19995                                    Avatar  Sam Worthington,Zoe Saldana,Sigourney Weaver,S...      James Cameron   ...          1       0      0       0
1       285  Pirates of the Caribbean: At World's End  Johnny Depp,Orlando Bloom,Keira Knightley,Stel...     Gore Verbinski   ...          1       0      0       0
2    206647                                   Spectre  Daniel Craig,Christoph Waltz,Léa Seydoux,Ralph...         Sam Mendes   ...          0       0      0       0
3     49026                     The Dark Knight Rises  Christian Bale,Michael Caine,Gary Oldman,Anne ...  Christopher Nolan   ...          0       0      0       0
4     49529                               John Carter  Taylor Kitsch,Lynn Collins,Samantha Morton,Wil...     Andrew Stanton   ...          0       0      0       0
5       559                              Spider-Man 3  Tobey Maguire,Kirsten Dunst,James Franco,Thoma...          Sam Raimi   ...          1       0      0       0
6     38757                                   Tangled  Zachary Levi,Mandy Moore,Donna Murphy,Ron Perl...       Byron Howard   ...          0       0      0       0
7     99861                   Avengers: Age of Ultron  Robert Downey Jr.,Chris Hemsworth,Mark Ruffalo...        Joss Whedon   ...          0       0      0       0

# 这里有另外一种更好的方法：
# for i in genre:
#     df[i] = df['genres'].str.contains(i).apply(lambda x:1 if x else 0)

建立包含电影类型和年份的dataframe

df_gy = df[genre+['release_year']]
>>> df_gy.head(10)
   War  Thriller  Animation  Action  Adventure  Music  Science Fiction  Documentary      ...       Family  Drama  Mystery  Fantasy  Romance  Horror  Foreign  release_year
0    0         0          0       1          1      0                1            0      ...            0      0        0        1        0       0        0          2009
1    0         0          0       1          1      0                0            0      ...            0      0        0        1        0       0        0          2007
2    0         0          0       1          1      0                0            0      ...            0      0        0        0        0       0        0          2015
3    0         1          0       1          0      0                0            0      ...            0      1        0        0        0       0        0          2012
4    0         0          0       1          1      0                1            0      ...            0      0        0        0        0       0        0          2012
5    0         0          0       1          1      0                0            0      ...            0      0        0        1        0       0        0          2007
6    0         0          1       0          0      0                0            0      ...            1      0        0        0        0       0        0          2010
7    0         0          0       1          1      0                1            0      ...            0      0        0        0        0       0        0          2015
8    0         0          0       0          1      0                0            0      ...            1      0        0        1        0       0        0          2009
9    0         0          0       1          1      0                0            0      ...            0      0        0        1        0       0        0          2016

可视化电影年度趋势

import matplotlib.pyplot as plt
x = df_gy['release_year'].value_counts().sort_index()
plt.plot(x) # 绘制电影数与时间的总的密度图
plt.xlabel('Time (year)')
plt.ylabel('Counts')
plt.show()

绘制分类型电影-时间图

x = df_gy.groupby('release_year').sum(axis = 1)
plt.figure(figsize=(12,6))
plt.xticks(range(1915,2018,5))
plt.plot(x)
plt.legend(x.columns.values,fontsize = 9)
plt.xlabel('Time (year)')
plt.ylabel('Counts')
plt.show()

绘制电影总量柱状图

y = x.sum().sort_values()
plt.figure(figsize=(12,6))
plt.xlabel('Counts',fontsize = 15)
plt.ylabel('Category',fontsize = 15)
plt.barh(y.index,y)
plt.show()

饼状图

bl = y / y.sum()
plt.figure(figsize=(6,6))
plt.pie(bl,labels=bl.index,autopct='%1.1f%%',explode=(bl>=0.06)/20+0.02)
plt.title('Pie of Category')
plt.show()

分析电影票房与哪些因素有关

df_revenue = df.groupby('release_year')['revenue'].sum() # 统计票房
df_revenue[:5]

release_year
1916     8394751
1925    22000000
1927      650422
1929     4358000
1930     8000000
Name: revenue, dtype: int64

年份和票房

df_revenue.plot(figsize=(12,6))
plt.xticks(range(1915,2018,6))
plt.title('Total revenue in each year',fontsize = 15)
plt.xlabel('Year',fontsize = 15)
plt.ylabel('Total revenue',fontsize = 15)
plt.show()

电影预算和票房的关系

plt.scatter(x=df.budget,y = df.revenue)
plt.xlabel('Revenue')
plt.ylabel('Budget')
plt.show()

评分和票房的关系

plt.scatter(x = df.vote_average,y = df.revenue)
plt.xlabel('Vote')
plt.ylabel('Revenue')
plt.show()

电影时长和票房的关系

plt.scatter(df.runtime,df.revenue)
plt.xlabel('Run time')
plt.ylabel('Revenue')
plt.show()

评分和受欢迎程度

plt.scatter(df.vote_average,df.popularity)
plt.xlabel('Vote')
plt.ylabel('Popularity')
plt.show()

时长和受欢迎程度

plt.scatter(df.runtime,df.popularity)
plt.xlabel('Runtime(minutes)',fontsize = 15)
plt.ylabel('Popularity',fontsize = 15)
plt.show() # 看起来观众更喜欢60-160之间的电影

根据两列相关信息绘制合适的可视化图形，即为数据分析的初级阶段

今天的训练就到这里

新奥尔良乳猪

关注

7
点赞
踩
61

收藏

觉得还不错? 一键收藏
0
评论
电影数据集TMDB数据分析练习

加载TMDB数据集TMDb电影数据库”，数据集中包含来自1960-2016年上映的近11000部电影的基本信息，主要包括了电影类型、预算、票房、演职人员、时长、评分等信息。本文作为自学练习小项目，将从最原始的数据格式化、数据清洗、数据分析进行全面的学习并且事无巨细，展示练习全过程参考文章 https://blog.csdn.net/moyue1002/article/details/80332186python 3.7pandas 0.23numpy 1.18metplotlib 2.2
复制链接

扫一扫