scatter_matrix&df.plot&sns.boxplot

最新推荐文章于 2024-08-21 21:18:17 发布

mmい

最新推荐文章于 2024-08-21 21:18:17 发布

阅读量2.6k

点赞数

分类专栏：数据挖掘—dataquest

本文链接：https://blog.csdn.net/zm714981790/article/details/51225088

版权

数据挖掘—dataquest 专栏收录该内容

38 篇文章 4 订阅

订阅专栏

Dataset

本文的数据集hollywood_movies.csv是好莱坞2007到2011年的电影的信息，可视化的目标是更好的理解好莱坞的基本经济以及探索电影成功的异常性质。

下面是csv文件中某些重要的属性

Year: the year the movie was released.
Critic Rating: average rating by the critics.
Audience Rating: average rating by the audience.
Genre: the genre the movie belongs to. （电影类型）
Budget: the movie’s budget, in millions of dollars.（国内(美国)收入）
Domestic Gross: domestic (U.S.) revenue, in millions of dollars.（国外收入）
Worldwide Gross: total revenue worldwide, in millions of dollars.（全球总收入）
Profitability: ratio of Budget to Worldwide Gross.

因为exclude这一列都是空值，因此用drop函数将其删掉

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

hollywood_movies = pd.read_csv("hollywood_movies.csv")
hollywood_movies = hollywood_movies.drop("exclude", axis=1)

Scatter Plots - Profitability And Audience Ratings

观察电影的收益与观众评论的关系

fig = plt.figure(figsize=(6,10))
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)
ax1.scatter(hollywood_movies["Profitability"], hollywood_movies["Audience Rating"])
ax1.set_ylabel("Audience Rating")
ax1.set_xlabel("Profitability")
ax1.set_title("Hollywood Movies, 2007 - 2011")
ax2.scatter(hollywood_movies["Audience Rating"], hollywood_movies["Profitability"])
ax2.set_ylabel("Profitability")
ax2.set_xlabel("Audience Rating")
ax2.set_title("Hollywood Movies, 2007 - 2011")
plt.show()

这里写图片描述

Scatter Matrix - Profitability And Critic Ratings

上面两个图中都有一个离群点，因此需要过滤掉这个电影然后对其他的数据进行可视化。

scatter_matrix函数优点类似Seaborn中的Pairplot函数，就是描述属性的两两匹配，但这里不同的是，只有两个属性，但是匹配的方式是调换横纵坐标并且刻画散点图和柱状图，总共产生四个图。

from pandas.tools.plotting import scatter_matrix
normal_movies = hollywood_movies[hollywood_movies["Profitability"] < 10000]
scatter_matrix(normal_movies[["Profitability", "Audience Rating"]], figsize=(6,6))

这里写图片描述

Box Plot - Audience And Critic Ratings

这里调用的是DataFrame自己的可视化工具，当然也是基于matplotlib的。

normal_movies[["Critic Rating", "Audience Rating"]].plot(kind="box")

这里写图片描述

Box Plot - Critic Vs Audience Ratings Per Year

可视化每年的观众评论和专家评论的盒图，利用Seaborn

fig = plt.figure(figsize=(8,4))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
normal_movies = normal_movies.sort("Year")
sns.boxplot(data=normal_movies[pd.notnull(normal_movies["Genre"])], x="Year", y="Critic Rating", ax=ax1)
sns.boxplot(data=normal_movies[pd.notnull(normal_movies["Genre"])], x="Year", y="Audience Rating", ax=ax2)
plt.show()

这里写图片描述

Box Plots - Profitable Vs Unprofitable Movies

可视化盈利电影和非盈利电影，创建了一个属性Profitable：是否盈利

def is_profitable(row):
    if row["Profitability"] <= 1.0:
        return False
    return True
normal_movies["Profitable"] = normal_movies.apply(is_profitable, axis=1)
fig = plt.figure(figsize=(12,6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)
sns.boxplot(x="Profitable", y="Audience Rating", data=normal_movies, ax=ax1)
sns.boxplot(x="Profitable", y="Critic Rating", data=normal_movies, ax=ax2)