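Background
Netflix is one of the most popular media and video streaming platforms. It offers more than 8,000 movies and TV shows, and as of mid-2021 it had more than 200 million subscribers worldwide. This tabular dataset lists all movies and TV shows on Netflix, along with details such as cast, director, rating, release year, and duration.
This article is mainly a simple exploratory data analysis (EDA) of the dataset; later posts will dig deeper into Netflix.
Import the required packages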
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
Data loading and preprocessing
netflix_overall = pd.read_csv('./netflix-shows/netflix_titles.csv')
netflix_overall.head()
 | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
netflix_overall.shape
(8807, 12)
netflix_overall.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 show_id 8807 non-null object
1 type 8807 non-null object
2 title 8807 non-null object
3 director 6173 non-null object
4 cast 7982 non-null object
5 country 7976 non-null object
6 date_added 8797 non-null object
7 release_year 8807 non-null int64
8 rating 8803 non-null object
9 duration 8804 non-null object
10 listed_in 8807 non-null object
11 description 8807 non-null object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
netflix_overall.nunique()
show_id 8807
type 2
title 8807
director 4528
cast 7692
country 748
date_added 1767
release_year 74
rating 17
duration 220
listed_in 514
description 8775
dtype: int64
The type column has only two values (Movie and TV Show), so let's start the analysis there.
# plt.rcParams['figure.dpi'] = 200
# plt.rcParams['figure.figsize'] = [6, 3.0]
sns.set(style="darkgrid")
# Bar chart: number of titles per type
ax = sns.countplot(x="type", data=netflix_overall, palette="Set3")
# Pie chart on a separate figure: share of each type
type_counts = netflix_overall.type.value_counts()
plt.figure(figsize=(12,6))
plt.title('Netflix type')
plt.pie(type_counts, labels=type_counts.index, autopct='%1.1f%%', startangle=180);
Observations:
- Netflix's catalog is still dominated by movies.
- Movies make up nearly 70% of all titles: 6,131 of 8,807, since 2,676 are TV shows.
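The share of each type can be checked directly from the value counts rather than read off the chart. The following is a minimal sketch, assuming a hypothetical helper `type_summary` (not part of the original notebook) and a small toy DataFrame standing in for the real file:

```python
import pandas as pd


def type_summary(df: pd.DataFrame, column: str = "type") -> pd.DataFrame:
    """Return the count and percentage share of each value in `column`.

    Hypothetical helper for illustration; not from the original notebook.
    """
    counts = df[column].value_counts()
    share = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "percent": share})


# Toy data for illustration only (not the real Netflix file)
toy = pd.DataFrame({"type": ["Movie", "Movie", "Movie", "TV Show"]})
summary = type_summary(toy)
print(summary)
```

On the real data, `type_summary(netflix_overall)` should report 6,131 movies and 2,676 TV shows, given the totals printed in this section.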
Missing value analysis
netflix_overall.isnull().sum()
show_id 0
type 0
title 0
director 2634
cast 825
country 831
date_added 10
release_year 0
rating 4
duration 3
listed_in 0
description 0
dtype: int64
sns.heatmap(netflix_overall.isnull(),cmap = 'viridis');
total = netflix_overall.isnull().sum().sort_values(ascending = False)
percent = (netflix_overall.isnull().sum()/netflix_overall.isnull().count()*100).sort_values(ascending = False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(7)
 | Total | Percent |
---|---|---|
director | 2634 | 29.908028 |
country | 831 | 9.435676 |
cast | 825 | 9.367549 |
date_added | 10 | 0.113546 |
rating | 4 | 0.045418 |
duration | 3 | 0.034064 |
show_id | 0 | 0.000000 |
plt.figure(figsize=(12,6))
plt.title('Share of missing values (top 4 columns)')
plt.pie(missing_data.Total[:4], labels=missing_data.Total.index[:4], autopct='%1.2f%%', startangle=180);
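The chart shows how the missing values are distributed. Missing directors and cast members cannot be filled in arbitrarily, so we can consider dropping those rows; columns with only a few missing values can be filled with the mode (or the median for numeric columns).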
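One way to turn that strategy into code is sketched below. The helper `handle_missing` and the toy DataFrame are assumptions made for illustration, not the author's implementation. The sketch drops rows that lack a value in the "cannot guess" columns and fills the remaining columns with their most frequent value.

```python
import pandas as pd


def handle_missing(df, drop_cols, fill_cols):
    """Drop rows missing any of `drop_cols`, then fill `fill_cols` with their mode.

    Hypothetical helper for illustration; returns a new DataFrame.
    """
    out = df.dropna(subset=drop_cols).copy()
    for col in fill_cols:
        modes = out[col].mode(dropna=True)
        if not modes.empty:
            # mode() can return several values on ties; take the first
            out[col] = out[col].fillna(modes.iloc[0])
    return out


# Toy data for illustration only (not the real Netflix file)
toy = pd.DataFrame({
    "director": ["A", None, "B", "C"],
    "cast": ["x", "y", None, "z"],
    "rating": ["TV-MA", "PG", "TV-MA", None],
})
cleaned = handle_missing(toy, drop_cols=["director", "cast"], fill_cols=["rating"])
print(cleaned)
```

Note that on the real data, dropping every row with a missing director would remove 2,634 of 8,807 titles (about 30%), so filling with a placeholder such as "Unknown" may be a gentler alternative.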
netflix_overall[netflix_overall['type'] == 'TV Show']['director'].isnull().sum()
2446
netflix_overall[netflix_overall['type'] == 'TV Show'].shape[0]
2676
netflix_overall[netflix_overall['type'] == 'Movie']['director'].isnull().sum()
188
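The counts above show that missing directors are concentrated in TV shows (2,446 of 2,676), while only 188 movies lack one. The same comparison can be made in a single grouped step. This is a minimal sketch, assuming a hypothetical helper `missing_by_group` and toy data in place of the real file:

```python
import pandas as pd


def missing_by_group(df, group_col, target_col):
    """Count and percentage of missing `target_col` values per `group_col`.

    Hypothetical helper for illustration; not from the original notebook.
    """
    missing = df[target_col].isnull()
    grouped = missing.groupby(df[group_col])
    return pd.DataFrame({
        "missing": grouped.sum(),    # number of missing values per group
        "total": grouped.size(),     # number of rows per group
        "percent": (grouped.mean() * 100).round(2),
    })


# Toy data for illustration only (not the real Netflix file)
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show"],
    "director": ["A", "B", None, "C"],
})
result = missing_by_group(toy, "type", "director")
print(result)
```

On the real data the call would be `missing_by_group(netflix_overall, "type", "director")`.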
netflix_overall[netflix_overall['type'] == 'Movie'].shape