Netflix Movies and TV Shows --- Exploratory Data Analysis

Author: 追忆无义

Last modified: 2022-08-02 09:44:17

Category: EDA. Tags: data analysis, python, machine learning

First published: 2022-08-02 00:54:19

Article link: https://blog.csdn.net/m0_66235114/article/details/126113041

Background

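Netflix is one of the most popular media and video streaming platforms. It has more than 8,000 movies and TV shows on its platform, and as of mid-2021 it had over 200 million subscribers worldwide. This tabular dataset lists all the movies and TV shows on Netflix, along with details such as cast, director, rating, release year, and duration.
This article is a simple exploratory data analysis of the dataset; later follow-ups will dig deeper into Netflix.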

Importing the required packages

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

Loading and preprocessing the data

netflix_overall = pd.read_csv('./netflix-shows/netflix_titles.csv')
netflix_overall.head()

	show_id	type	title	director	cast	country	date_added	release_year	rating	duration	listed_in	description
0	s1	Movie	Dick Johnson Is Dead	Kirsten Johnson	NaN	United States	September 25, 2021	2020	PG-13	90 min	Documentaries	As her father nears the end of his life, filmm...
1	s2	TV Show	Blood & Water	NaN	Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...	South Africa	September 24, 2021	2021	TV-MA	2 Seasons	International TV Shows, TV Dramas, TV Mysteries	After crossing paths at a party, a Cape Town t...
2	s3	TV Show	Ganglands	Julien Leclercq	Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...	NaN	September 24, 2021	2021	TV-MA	1 Season	Crime TV Shows, International TV Shows, TV Act...	To protect his family from a powerful drug lor...
3	s4	TV Show	Jailbirds New Orleans	NaN	NaN	NaN	September 24, 2021	2021	TV-MA	1 Season	Docuseries, Reality TV	Feuds, flirtations and toilet talk go down amo...
4	s5	TV Show	Kota Factory	NaN	Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...	India	September 24, 2021	2021	TV-MA	2 Seasons	International TV Shows, Romantic TV Shows, TV ...	In a city of coaching centers known to train I...

netflix_overall.shape

(8807, 12)

netflix_overall.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
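
The info() output shows that date_added and duration are stored as object (string) columns. As a rough sketch of how they could be turned into analysis-friendly types, the snippet below parses both on a toy frame. The toy rows, the leading space in the second date, and the new column names (date_added_parsed, duration_value, duration_unit) are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy rows mimicking the head() output (illustrative, not the real file).
# The leading space in the second date is an assumed irregularity.
toy = pd.DataFrame({
    "date_added": ["September 25, 2021", " September 24, 2021", np.nan],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Strip stray whitespace, then parse; missing or unparseable values become NaT.
toy["date_added_parsed"] = pd.to_datetime(
    toy["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)

# Split "90 min" / "2 Seasons" into a numeric value and a unit.
parts = toy["duration"].str.extract(r"(?P<value>\d+)\s*(?P<unit>\w+)")
toy["duration_value"] = parts["value"].astype(float)
toy["duration_unit"] = parts["unit"]

print(toy)
```

Keeping the unit in its own column matters because movie durations are in minutes while TV show durations are in seasons, so the numbers are only comparable within one type.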

netflix_overall.nunique()

show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64

The type column has only two distinct values (Movie and TV Show), so let's start the analysis there.

# plt.rcParams['figure.dpi'] = 200
# plt.rcParams['figure.figsize'] = [6, 3.0]
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set3")

[Figure: count plot of titles by type]

plt.figure(figsize=(12,6))
plt.title('netflix type')
plt.pie(netflix_overall.type.value_counts(), labels=netflix_overall.type.value_counts().index, autopct='%1.1f%%', startangle=180);

[Figure: pie chart of the share of each type]

Observations:

Netflix's catalog is still dominated by movies.
Movies account for nearly 70% of the 8,807 titles, i.e. roughly 6,000 entries.
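
The plots show the split visually; the exact counts and shares can also be read off directly with value_counts. The sketch below runs on a made-up five-row frame rather than the real file, so its numbers are illustrative only. On the real data the same two calls would be applied to netflix_overall["type"].

```python
import pandas as pd

# Toy stand-in for netflix_overall (made-up rows, not the real dataset).
toy = pd.DataFrame({"type": ["Movie", "Movie", "TV Show", "Movie", "TV Show"]})

# Absolute counts per type, most frequent first.
counts = toy["type"].value_counts()

# Shares in percent; normalize=True divides each count by the total.
shares = (toy["type"].value_counts(normalize=True) * 100).round(1)

# Both series share the same index, so they line up in one table.
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```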

Missing-value analysis

netflix_overall.isnull().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

sns.heatmap(netflix_overall.isnull(),cmap = 'viridis');

[Figure: heatmap of missing values]

total = netflix_overall.isnull().sum().sort_values(ascending = False)
percent = (netflix_overall.isnull().sum()/netflix_overall.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(7)

	Total	Percent
director	2634	29.908028
country	831	9.435676
cast	825	9.367549
date_added	10	0.113546
rating	4	0.045418
duration	3	0.034064
show_id	0	0.000000
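
The same table can be built a little more compactly, since the mean of a boolean mask is already the fraction of missing values. The following is a sketch on a toy frame with made-up values; the variable name missing is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps (illustrative values only).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "TV-MA", np.nan, "R"],
})

# Missing cells per column, and the same as a percentage of rows.
total = toy.isnull().sum()
percent = toy.isnull().mean() * 100

missing = (
    pd.DataFrame({"Total": total, "Percent": percent})
    .sort_values("Total", ascending=False)
)

# Keep only columns that actually have gaps.
missing = missing[missing["Total"] > 0]
print(missing)
```

Filtering on Total > 0 avoids having to hard-code how many rows to show, as head(7) does.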

plt.figure(figsize=(12,6))
plt.title('Percentage of missing values')
plt.pie(missing_data.Total[:4], labels=missing_data.Total.index[:4], autopct='%1.2f%%', startangle=180);

[Figure: pie chart of missing values for the four columns with the most gaps]

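The chart shows how the missing values are distributed. Missing directors and cast members cannot be filled in arbitrarily, so dropping those rows is one option; columns with only a few missing values can be filled with the median, the mode, or a similar statistic.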
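
A minimal sketch of both strategies, on a toy frame with made-up values. Since info() reports the sparse columns (date_added, rating, duration) as object dtype, the mode is used here; a median would only apply to numeric columns. The names dropped and filled are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in director, cast and rating (illustrative values).
toy = pd.DataFrame({
    "director": ["X", np.nan, "Y", "Z"],
    "cast": ["a, b", "c", np.nan, "d"],
    "rating": ["TV-MA", "TV-MA", "PG", np.nan],
})

# Option 1: drop rows where director or cast is missing.
dropped = toy.dropna(subset=["director", "cast"])

# Option 2: keep all rows and fill a sparse categorical column with its mode.
filled = toy.copy()
rating_mode = filled["rating"].mode()[0]  # most frequent non-null value
filled["rating"] = filled["rating"].fillna(rating_mode)

print(len(toy), len(dropped))
print(filled)
```

Dropping on director would discard close to 30% of the real dataset, so it is worth checking how many rows survive before committing to that option.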

netflix_overall[netflix_overall['type'] == 'TV Show']['director'].isnull().sum()

netflix_overall[netflix_overall['type'] == 'TV Show'].shape[0]

netflix_overall[netflix_overall['type'] == 'Movie']['director'].isnull().sum()

netflix_overall[netflix_overall['type'] == 'Movie'].shape
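
The four expressions above compare missing directors between TV shows and movies one cell at a time. A single groupby can produce the same comparison as one table; the sketch below assumes a toy frame with made-up rows and a hypothetical helper column director_missing.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative rows, not the real dataset).
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "director": ["X", np.nan, np.nan, np.nan, "Y"],
})

# Flag missing directors, then count flags and rows per type.
by_type = (
    toy.assign(director_missing=toy["director"].isnull())
    .groupby("type")["director_missing"]
    .agg(missing="sum", total="count")
)

# Share of titles without a director, per type.
by_type["percent"] = by_type["missing"] / by_type["total"] * 100
print(by_type)
```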
