Visualization Series Roundup: Correlation Plots

庄闪闪

Article link: https://blog.csdn.net/qq_37379316/article/details/127192679

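Introduction

Data analysis almost always ends with visualizing the results. So which kind of chart best suits your data? An effective chart has the following properties:

It conveys the information correctly, without ambiguity;
It is simple in style yet easy to understand;
Any aesthetics added to it help the reader understand the information;
It contains no redundant or useless information.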

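In this series I collect seven commonly used families of charts for readers' reference, with one post per family. The main reference is Top 50 ggplot2 Visualizations. Other sites and resources with similar content include:

Series contents

This post covers part one: correlation plots.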

Loading the dataset

The examples use datasets that ship with the ggplot2 package.

library(ggplot2)
library(plotrix)
data("midwest", package = "ggplot2")  # load the dataset

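The midwest dataset

Global theme settings

Set the color palette and theme globally. Note that this post uses discrete color scales; to use a continuous color scale you need to override them.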

options(scipen=999)  # turn off scientific notation such as 1e+48
# Color palette (variant starting with grey)
cbp1 <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

# Color palette (variant starting with black)
cbp2 <- c("#000000", "#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")


# Wrap ggplot() so every plot gets the palette and theme_bw() by default
ggplot <- function(...) ggplot2::ggplot(...) + 
  scale_color_manual(values = cbp1) +
  scale_fill_manual(values = cbp1) + # note: must be overridden for continuous scales
  theme_bw()
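
The note about continuous color scales can be illustrated with a minimal sketch. It assumes you want to color points by a numeric variable (here cty from the mpg dataset, chosen only for illustration) and bypasses the wrapper by calling ggplot2::ggplot() directly, so the manual discrete scales are never attached. The gradient colors are arbitrary picks from the palette above.

```r
library(ggplot2)

# Call ggplot2::ggplot() directly so the discrete manual scales
# added by the wrapper are not attached to this plot.
p_cont <- ggplot2::ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_gradient(low = "#56B4E9", high = "#D55E00") +  # continuous scale
  theme_bw()

print(p_cont)
```

Adding a continuous scale to a plot built with the wrapper should also work, because a newly added scale replaces an existing one for the same aesthetic; ggplot2 prints a message when that happens.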

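1. Correlation

1.1 Scatterplot of two variables

A scatterplot is the most common way to show the relationship between two variables. In ggplot2 it is drawn with geom_point(). In addition, geom_smooth() adds a smoothing curve, which by default is a loess fit for datasets with fewer than 1,000 observations; set method='lm' to draw a straight line of best fit instead.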

# Scatterplot
gg <- ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) + 
  geom_smooth(method="loess", se=FALSE) + 
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot", 
       caption = "Source: midwest")

plot(gg)

Scatterplot of two variables
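
For comparison, here is a sketch of the same plot with method="lm", which replaces the loess curve with a straight least-squares line. The object name gg_lm and the explicit formula = y ~ x are my own additions; everything else mirrors the code above.

```r
library(ggplot2)
data("midwest", package = "ggplot2")

# Same scatterplot, but with a linear fit instead of the loess curve
gg_lm <- ggplot(midwest, aes(x = area, y = poptotal)) +
  geom_point(aes(colour = state, size = popdensity)) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  xlim(c(0, 0.1)) +
  ylim(c(0, 500000)) +
  labs(subtitle = "Area Vs Population",
       y = "Population",
       x = "Area",
       title = "Scatterplot with linear fit",
       caption = "Source: midwest")

print(gg_lm)
```

Note that xlim() and ylim() drop the points outside the limits before the fit is computed, so both the loess curve and the straight line describe only the counties that remain in view.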

1.2 Scatterplot with encircling

The geom_encircle() function from the ggalt package draws an outline around a chosen group of points in a scatterplot:

# install.packages("ggalt")
library(ggplot2)
library(ggalt)
options(scipen = 999)  # turn off scientific notation

# Select the points to encircle
midwest_select <- midwest[midwest$poptotal > 350000 & 
                            midwest$poptotal <= 500000 & 
                            midwest$area > 0.01 & 
                            midwest$area < 0.1, ]

# Plot
ggplot(midwest, aes(x=area, y=poptotal)) + 
  geom_point(aes(col=state, size=popdensity)) +   # draw points
  geom_smooth(method="loess", se=FALSE) +   # draw smoothing line
  xlim(c(0, 0.1)) + 
  ylim(c(0, 500000)) + 
  geom_encircle(aes(x=area, y=poptotal), 
                data=midwest_select, 
                color="red", 
                size=2, 
                expand=0.08) +   # encircle the selected points
  labs(subtitle="Area Vs Population", 
       y="Population", 
       x="Area", 
       title="Scatterplot + Encircle", 
       caption="Source: midwest")

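Scatterplot with encircling

1.3 Jitter plot

A jitter plot addresses the problem of overlapping data points. The width argument of geom_jitter() sets the amount of horizontal jitter, and overlapping points are moved randomly around their original positions.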

g <- ggplot(mpg, aes(cty, hwy))
g + geom_jitter(width = .5, size=1) +
  labs(subtitle="mpg: city vs highway mileage", 
       y="hwy", 
       x="cty", 
       title="Jittered Points")

Jitter plot
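
Because the offsets are random, the plot above looks slightly different on every run. The sketch below shows one way to make it repeatable: pass position_jitter() to geom_point() and fix its seed argument. The height value and the seed 42 are arbitrary choices for illustration.

```r
library(ggplot2)

# position_jitter() takes width, height, and seed; a fixed seed
# makes the random offsets identical between renders.
p_jit <- ggplot(mpg, aes(cty, hwy)) +
  geom_point(position = position_jitter(width = 0.5, height = 0.5, seed = 42),
             size = 1) +
  labs(subtitle = "mpg: city vs highway mileage",
       y = "hwy",
       x = "cty",
       title = "Jittered Points (fixed seed)")

print(p_jit)
```

Keep the jitter small relative to the spacing of the data; a large width or height can make points appear to belong to a neighboring value.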

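1.4 Counts plot

A second way to deal with overlapping points is a counts plot, drawn with geom_count(). The more points overlap at a location, the larger the circle.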

# reuses g from section 1.3
g + geom_count(col="tomato3", show.legend=FALSE) +
  labs(subtitle="mpg: city vs highway mileage", 
       y="hwy", 
       x="cty", 
       title="Counts Plot")

Counts plot
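
To see what the circle sizes represent, the following base R sketch tabulates how many observations share each (cty, hwy) location. It is only an illustration of the idea, not the code geom_count() runs internally; the name counts is my own.

```r
library(ggplot2)  # only needed for the mpg dataset

mpg_df <- as.data.frame(mpg)

# Count the observations at each (cty, hwy) location
counts <- as.data.frame(table(cty = mpg_df$cty, hwy = mpg_df$hwy))
counts <- counts[counts$Freq > 0, ]          # keep locations that occur
counts <- counts[order(-counts$Freq), ]      # most crowded locations first

head(counts)
```

Locations with a large Freq are the ones drawn as large circles in the counts plot.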

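1.5 Bubble chart

A bubble chart suits four-dimensional data: two numeric variables on the X and Y axes, one categorical variable mapped to color, and one more numeric variable mapped to size.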

data(mpg, package="ggplot2")

mpg_select <- mpg[mpg$manufacturer %in% c("audi", "ford", "honda", "hyundai"), ]

# Scatterplot
theme_set(theme_bw())  # pre-set the bw theme.
g <- ggplot(mpg_select, aes(displ, cty)) + 
  labs(subtitle="mpg: Displacement vs City Mileage",
       title="Bubble chart")

g + geom_jitter(aes(col=manufacturer, size=hwy)) + 
  geom_smooth(aes(col=manufacturer), method="lm", se=FALSE)

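Bubble chart

1.6 Marginal histogram / boxplot

To show both the relationship between two variables and their individual distributions in one figure, use a marginal histogram. Changing the type argument of ggMarginal() swaps the marginal histograms for boxplots or density plots.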

library(ggExtra)
data(mpg, package="ggplot2")

# Scatterplot
theme_set(theme_bw())  
mpg_select <- mpg[mpg$hwy >= 35 & mpg$cty > 27, ]
g <- ggplot(mpg, aes(cty, hwy)) + 
  geom_count() + 
  geom_smooth(method="lm", se=FALSE)

ggMarginal(g, type = "histogram", fill="transparent")

ggMarginal(g, type = "boxplot", fill="transparent")

ggMarginal(g, type = "density", fill="transparent")

1.7 Correlogram

A correlogram shows the pairwise correlations among several continuous variables in the same dataset.

library(ggcorrplot)

# Correlation matrix
data(mtcars)
corr <- round(cor(mtcars), 1)

# Plot
ggcorrplot(corr, hc.order = TRUE, 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           method="circle", 
           colors = c("tomato2", "white", "springgreen3"), 
           title="Correlogram of mtcars", 
           ggtheme=theme_bw)
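
Before plotting, it can help to inspect the matrix itself. The short base R sketch below checks that the correlation matrix is symmetric with ones on the diagonal, which is why type = "lower" discards no information, and then ranks the other variables by the strength of their correlation with mpg. Choosing mpg as the variable of interest is my own assumption for illustration.

```r
data(mtcars)
corr <- round(cor(mtcars), 1)  # Pearson correlation by default

# The matrix is symmetric with a unit diagonal, so the lower
# triangle carries all of the information.
isSymmetric(corr)
all(diag(corr) == 1)

# Rank the other variables by absolute correlation with mpg
others <- colnames(corr) != "mpg"
sort(abs(corr["mpg", others]), decreasing = TRUE)
```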