Machine Learning 10: Movie Data Case Study

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

1 Question 1

We want to compute summary statistics, such as the mean, for some columns of the movie dataset.

In [24]:

movie = pd.read_csv("./data/IMDB-Movie-Data.csv")

In [25]:

movie.head()

Out[25]:

  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0

In [26]:

movie["Rating"].mean()

Out[26]:

6.723200000000003

In [27]:

movie["Director"].count()

Out[27]:

1000

In [28]:

movie["Director"].unique().shape[0]

Out[28]:

644
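The separate calls above can also be collapsed into fewer steps. The following is a minimal sketch, not part of the original notebook: it assumes a toy DataFrame named `sample` (a hypothetical stand-in built from the five rows shown by movie.head()) instead of the real CSV, and uses the standard pandas methods `describe()` and `nunique()`.

```python
import pandas as pd

# Toy stand-in for the CSV: the five rows shown by movie.head() above.
sample = pd.DataFrame({
    "Director": ["James Gunn", "Ridley Scott", "M. Night Shyamalan",
                 "Christophe Lourdelet", "David Ayer"],
    "Rating": [8.1, 7.0, 7.3, 7.2, 6.2],
    "Runtime (Minutes)": [121, 124, 117, 108, 123],
})

# describe() reports count, mean, std, min, quartiles and max
# for each numeric column in one call.
summary = sample[["Rating", "Runtime (Minutes)"]].describe()
print(summary)

# nunique() counts distinct values directly; it matches
# unique().shape[0] when the column has no missing values.
n_directors = sample["Director"].nunique()
print(n_directors)
```

On the full dataset the same two calls would be applied to `movie` instead of `sample`.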

2 Question 2

For this movie dataset, if we want to see how Rating and Runtime (Minutes) are distributed, how should we present the data?

In [29]:

movie["Rating"].plot(kind ='hist')

Out[29]:

<AxesSubplot:ylabel='Frequency'>

[Figure: histogram of Rating from the default pandas plot]

In [30]:

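# Improved version
# 1. Create the figure
plt.figure(figsize=(20,8),dpi=100)

# 2. Plot the histogram with 20 bins
plt.hist(movie["Rating"].values,20)

# 2.1 Set the x-axis ticks to the bin edges
x_max = movie["Rating"].max()
x_min = movie["Rating"].min()
x1 = np.linspace(x_min ,x_max, 21) # 21 evenly spaced points from x_min to x_max, i.e. 20 intervals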

# print(x1)
plt.xticks(x1)
plt.grid()
plt.show()

[Figure: histogram of Rating with 20 bins and x-axis ticks at the bin edges]

  • Movie runtime

In [31]:

# Create the figure
plt.figure(figsize=(20,8),dpi = 100 )

# Plot the histogram with 20 bins
plt.hist(movie["Runtime (Minutes)"].values,20)

# Set the x-axis ticks to the bin edges
max_ = movie["Runtime (Minutes)"].max()
min_ = movie["Runtime (Minutes)"].min()
x2 = np.linspace(min_,max_,21)

plt.xticks(x2)
plt.grid()
plt.show()

[Figure: histogram of Runtime (Minutes) with 20 bins and x-axis ticks at the bin edges]
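Both histograms use 20 bins but 21 tick positions, because 20 equal-width bins have 21 edges. The sketch below checks this with `np.histogram`, which performs the same kind of equal-width binning. The `ratings` array is hypothetical data, used only to make the example runnable.

```python
import numpy as np

# Hypothetical ratings, not taken from the dataset.
ratings = np.array([1.9, 3.5, 5.2, 6.2, 6.8, 7.0, 7.2, 7.3, 8.1, 9.0])

# With an integer bin count, [min, max] is split into equal-width bins.
counts, edges = np.histogram(ratings, bins=20)

# 21 evenly spaced points from min to max, as passed to plt.xticks above.
ticks = np.linspace(ratings.min(), ratings.max(), 21)

print(len(counts), len(edges))
print(np.allclose(edges, ticks))
```

Because the ticks coincide with the bin edges, every bar in the plot starts and ends exactly on a labelled tick.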

3 Question 3

For this movie dataset, if we want to count how many movies fall into each genre, how should we process the data?

In [32]:

movie.head()

Out[32]:

  | Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore
0 | 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced … | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S… | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0
1 | 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te… | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa… | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0
2 | 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag… | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar… | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0
3 | 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea… | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma… | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0
4 | 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th… | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D… | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0

In [33]:

m_g = [i.split(",") for i in movie['Genre']]

In [34]:

m_g[:10]
# first ten entries

Out[34]:

[['Action', 'Adventure', 'Sci-Fi'],
 ['Adventure', 'Mystery', 'Sci-Fi'],
 ['Horror', 'Thriller'],
 ['Animation', 'Comedy', 'Family'],
 ['Action', 'Adventure', 'Fantasy'],
 ['Action', 'Adventure', 'Fantasy'],
 ['Comedy', 'Drama', 'Music'],
 ['Comedy'],
 ['Action', 'Adventure', 'Biography'],
 ['Adventure', 'Drama', 'Romance']]

In [35]:

genre_un = np.unique([j for i in m_g for j in i])

In [36]:

genre_un

Out[36]:

array(['Action', 'Adventure', 'Animation', 'Biography', 'Comedy', 'Crime',
       'Drama', 'Family', 'Fantasy', 'History', 'Horror', 'Music',
       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Sport', 'Thriller',
       'War', 'Western'], dtype='<U9')

In [37]:

genre_un.shape[0] # shape[0]: number of rows; shape[1]: number of columns (2-D only; genre_un is 1-D)

Out[37]:

20

In [38]:

movie.shape[0]

Out[38]:

1000

In [39]:

# Build an all-zero table: one row per movie, one column per genre
data_gen = pd.DataFrame(np.zeros([movie.shape[0],genre_un.shape[0]]),columns = genre_un)

In [40]:

data_gen.head()

Out[40]:

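  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0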

In [41]:

for i in range(movie.shape[0]):
    data_gen.loc[i,m_g[i]] = 1
    # df.loc[row labels, column labels]
    # loc selects by label: here row i and every column named in m_g[i]
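The key step in the loop is that `loc` accepts a list of column labels, so one assignment marks all genres of a movie at once. A minimal sketch on a hypothetical two-movie table (`toy`, `genres` and `toy_m_g` are illustrative names, not from the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the genre table: 2 movies, 4 genre columns.
genres = ["Action", "Comedy", "Drama", "Sci-Fi"]
toy = pd.DataFrame(np.zeros((2, len(genres))), columns=genres)

# One list of genre labels per movie, as produced by splitting on ",".
toy_m_g = [["Action", "Sci-Fi"], ["Comedy"]]

for i, labels in enumerate(toy_m_g):
    # Row label i, several column labels at once:
    # every selected cell in that row is set to 1.
    toy.loc[i, labels] = 1

print(toy)
```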

In [42]:

data_gen.head()

Out[42]:

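  | Action | Adventure | Animation | Biography | Comedy | Crime | Drama | Family | Fantasy | History | Horror | Music | Musical | Mystery | Romance | Sci-Fi | Sport | Thriller | War | Western
0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0
2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0
3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0
4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0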

In [43]:

data_gen.sum().sort_values()

Out[43]:

Musical        5.0
Western        7.0
War           13.0
Music         16.0
Sport         18.0
History       29.0
Animation     49.0
Family        51.0
Biography     81.0
Fantasy      101.0
Mystery      106.0
Horror       119.0
Sci-Fi       120.0
Romance      141.0
Crime        150.0
Thriller     195.0
Adventure    259.0
Comedy       279.0
Action       303.0
Drama        513.0
dtype: float64

In [44]:

data_gen.sum().sort_values(ascending = False).plot(kind = "bar",figsize = (20,5),fontsize = 15)

Out[44]:

<AxesSubplot:>

[Figure: bar chart of the number of movies per genre, sorted in descending order]
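The zero table and the explicit loop can also be replaced by built-in pandas string methods. This is a sketch of an alternative, not the approach the notebook takes: it assumes a Series named `genre` holding the five Genre strings shown by movie.head(), and uses `Series.str.get_dummies` and `Series.explode`.

```python
import pandas as pd

# The Genre strings of the five rows shown by movie.head().
genre = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Adventure,Mystery,Sci-Fi",
    "Horror,Thriller",
    "Animation,Comedy,Family",
    "Action,Adventure,Fantasy",
])

# One 0/1 indicator column per genre, built without a Python loop.
dummies = genre.str.get_dummies(sep=",")
counts_from_dummies = dummies.sum().sort_values(ascending=False)

# Alternative: split into lists, put one genre per row, then count.
counts_from_explode = genre.str.split(",").explode().value_counts()

print(counts_from_dummies)
```

On the full dataset, `movie["Genre"]` would take the place of `genre`. Both variants should give the same per-genre totals as the loop, with integer rather than float counts.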
