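Netflix Movies and TV Shows: Exploratory Data Analysis

Background

Netflix is one of the most popular media and video streaming platforms. Its catalog holds more than 8,000 movies and TV shows, and as of mid-2021 it had more than 200 million subscribers worldwide. This tabular dataset lists all movies and TV shows on Netflix, with details such as cast, director, rating, release year, and duration.
This article performs a simple exploratory data analysis of the dataset; later posts will build on it for a deeper look at Netflix.

Import the required packages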

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

Loading and preprocessing the data

netflix_overall = pd.read_csv('./netflix-shows/netflix_titles.csv')
netflix_overall.head()
  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description
0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm...
1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t...
2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor...
3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo...
4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I...
netflix_overall.shape
(8807, 12)
netflix_overall.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
netflix_overall.nunique()
show_id         8807
type               2
title           8807
director        4528
cast            7692
country          748
date_added      1767
release_year      74
rating            17
duration         220
listed_in        514
description     8775
dtype: int64
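
The `info()` output above shows `date_added` and `duration` stored as `object` (plain strings). The following is a minimal sketch of how those two columns could be converted to typed values. It is not part of the original notebook: the `sample` frame is a hand-made stand-in for `netflix_overall`, and the `duration_value` and `duration_unit` column names are hypothetical.

```python
import pandas as pd

# Hand-made stand-in that mimics the string formats shown in head();
# in the notebook the data would come from pd.read_csv as above.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "date_added": ["September 25, 2021", "September 24, 2021", None],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Parse the free-text date; missing entries become NaT.
sample["date_added"] = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y"
)

# Split "90 min" / "2 Seasons" into a number and a unit.
parts = sample["duration"].str.extract(
    r"(?P<duration_value>\d+)\s*(?P<duration_unit>\w+)"
)
sample["duration_value"] = pd.to_numeric(parts["duration_value"])
sample["duration_unit"] = parts["duration_unit"]

print(sample)
```

Keeping the unit in its own column matters because movie durations are in minutes while TV show durations are in seasons, so the numbers are only comparable within one type.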

The type column takes only two values (Movie and TV Show), so we start the analysis there.

# plt.rcParams['figure.dpi'] = 200
# plt.rcParams['figure.figsize'] = [6, 3.0]
sns.set(style="darkgrid")
ax = sns.countplot(x="type", data=netflix_overall, palette="Set3")

[Figure: count plot of titles by type]

plt.figure(figsize=(12,6))
plt.title('netflix type')
plt.pie(netflix_overall.type.value_counts(), labels=netflix_overall.type.value_counts().index, autopct='%1.1f%%', startangle=180);

[Figure: pie chart of the Movie / TV Show split]

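Observations:

  1. Netflix's catalog is still dominated by movies.
  2. Movies make up nearly 70% of the titles, more than 6,000 in total (8,807 titles minus 2,676 TV shows leaves 6,131).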
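
The exact numbers behind the two charts can be read off with `value_counts`. The sketch below is an illustration, not the author's code: `types` is a stand-in for `netflix_overall["type"]`, rebuilt from the counts reported in this article (8,807 rows in total, 2,676 of them TV shows).

```python
import pandas as pd

# Stand-in for netflix_overall["type"], rebuilt from the article's counts.
n_total, n_tv = 8807, 2676
types = pd.Series(["Movie"] * (n_total - n_tv) + ["TV Show"] * n_tv, name="type")

counts = types.value_counts()                                   # absolute counts
shares = (types.value_counts(normalize=True) * 100).round(1)    # percentages

print(pd.DataFrame({"count": counts, "percent": shares}))
```

On the real data the same two calls would be made on `netflix_overall["type"]` directly.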

Missing-value analysis

netflix_overall.isnull().sum() 
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
sns.heatmap(netflix_overall.isnull(),cmap = 'viridis');

[Figure: heatmap of missing values by column]

total = netflix_overall.isnull().sum().sort_values(ascending = False)
percent = (netflix_overall.isnull().sum()/netflix_overall.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(7)
            Total    Percent
director     2634  29.908028
country       831   9.435676
cast          825   9.367549
date_added     10   0.113546
rating          4   0.045418
duration        3   0.034064
show_id         0   0.000000
plt.figure(figsize=(12,6))
plt.title('Percentage of missing values')
plt.pie(missing_data.Total[:4], labels=missing_data.Total.index[:4], autopct='%1.2f%%', startangle=180);

[Figure: pie chart of the four columns with the most missing values]
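
Note that the pie chart normalizes the four missing counts against each other, so its slices are not the same numbers as the Percent column of `missing_data`. The short calculation below uses only the counts printed above to show both views side by side; the variable and column names are chosen for illustration.

```python
import pandas as pd

# Missing counts for the four plotted columns, copied from isnull().sum().
missing = pd.Series({"director": 2634, "country": 831, "cast": 825, "date_added": 10})
n_rows = 8807  # total rows, from netflix_overall.shape

# Share of all plotted missing cells: this is what each pie slice shows.
share_of_missing = (missing / missing.sum() * 100).round(2)

# Share of rows where the column is missing: this is the Percent column.
share_of_rows = (missing / n_rows * 100).round(2)

print(pd.DataFrame({"pie_slice_pct": share_of_missing,
                    "rows_missing_pct": share_of_rows}))
```

For example, director accounts for about 61% of the pie but is missing in only about 30% of rows.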

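The figure shows how the missing values are distributed. Missing director and cast values cannot be filled in arbitrarily, so dropping those rows is one option to consider. For columns with only a few missing values, filling with the mode (or the median, for a numeric column) is a reasonable choice.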
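
The options just described could look roughly like the sketch below. It runs on a small hypothetical `sample` frame rather than the real data. Option B (marking gaps with the placeholder string "Unknown") is an alternative added here for comparison, not something the original analysis does.

```python
import pandas as pd

# Small hypothetical sample with the same kinds of gaps as the real data.
sample = pd.DataFrame({
    "director": ["Kirsten Johnson", None, "Julien Leclercq", None],
    "country": ["United States", "South Africa", None, "India"],
    "rating": ["PG-13", "TV-MA", "TV-MA", None],
})

# Option A: drop rows that lack a director (every such title is lost).
dropped = sample.dropna(subset=["director"])

# Option B (assumption): keep the rows and mark the gap explicitly.
filled = sample.copy()
filled[["director", "country"]] = filled[["director", "country"]].fillna("Unknown")

# Column with few gaps and categorical values: fill with the mode.
filled["rating"] = filled["rating"].fillna(filled["rating"].mode()[0])

print(len(sample), len(dropped))
print(filled)
```

Which option fits depends on the later analysis: dropping keeps only fully known rows, while a placeholder keeps every title available for counts that do not involve the director.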

netflix_overall[netflix_overall['type'] == 'TV Show']['director'].isnull().sum()
2446
netflix_overall[netflix_overall['type'] == 'TV Show'].shape[0]
2676
netflix_overall[netflix_overall['type'] == 'Movie']['director'].isnull().sum()
188
netflix_overall[netflix_overall['type'] == 'Movie'].shape
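
The four lookups above can also be collapsed into one grouped summary. The following sketch shows the idea on a stand-in frame `demo`, rebuilt from the counts printed above; the movie total of 6,131 is an inference (8,807 rows minus 2,676 TV shows), since the output of the last cell is not shown, and `make_rows` is a hypothetical helper used only to build the stand-in.

```python
import pandas as pd

def make_rows(kind, n_total, n_missing):
    """Build n_total rows of one type, n_missing of them without a director."""
    return pd.DataFrame({
        "type": kind,
        "director": [None] * n_missing + ["some director"] * (n_total - n_missing),
    })

# Stand-in for netflix_overall, using the counts reported above.
demo = pd.concat(
    [make_rows("TV Show", 2676, 2446), make_rows("Movie", 8807 - 2676, 188)],
    ignore_index=True,
)

# Count missing directors and total titles per type in one pass.
summary = demo["director"].isnull().groupby(demo["type"]).agg(
    missing="sum", total="count"
)
summary["missing_pct"] = (summary["missing"] / summary["total"] * 100).round(1)

print(summary)
```

Under these counts, roughly 91% of TV shows have no director listed versus about 3% of movies, so dropping every row with a missing director would remove most TV shows from the dataset.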
