FALL_20_NOTE EDAV (Exploratory Data Analysis and Visualization): Graph Visualization

喝冰可乐吃辣火锅的追星胖胖要找工作上岸

Article link: https://blog.csdn.net/LAJIANGJIADEXIANGYE/article/details/108699891

I. Class Notes

1. Intro to EDA

WHY

• detecting patterns
• finding outliers
• making comparisons
• identifying clusters

HOW

• Deep understanding of the dataset, where it came from, what its limitations are
• Experiment with different graphic forms, based on theory on what forms work well with different data types

Evaluating Graphs

• Wrong or misleading
• Meaningless
• Little added value
• Good alternatives

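Q0: About RStudio + R Markdown

R Markdown user guide

By the way, when you run a chunk with the ▶️ button in the code pane (top left), the result is shown inline below the chunk by default (Chunk Output Inline). To send the output to the console instead, add the following at the very top of the file 👇
---
output: pdf_document
editor_options:
  chunk_output_type: console
---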

2. ggplot2


3. Histograms

For Continuous Variables, we’re looking for features such as:

• Asymmetry
• Outliers
• Multimodality
• Gaps
• Heaping/Rounding
• Impossibilities/Errors

Parameters

bin boundaries
bin number

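How do you draw two ggplot plots side by side?

library(ggplot2)
# `finches` is not defined in these notes; it is assumed to be a data frame with a numeric Depth column
# using binwidth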
p1 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(binwidth = 0.5, boundary = 6) +
  # You can use boundary to specify the endpoint of any bin
  ggtitle("Changed binwidth value")
# using bins
p2 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(bins = 48, boundary = 6) +
  ggtitle("Changed bins value")

# format plot layout
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

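When drawing a histogram, pay particular attention to the bin boundaries and to which side of each bin is open.
A simple way to control this is to set center (or boundary) explicitly. In the example below, binwidth = 5 with center = 52.5 gives bins 50-55, 55-60, and so on. The bin edges then fall on integers, so which side is closed still matters (ggplot2 bins are right-closed by default). To keep integer-valued data off the bin edges altogether, choose bin edges that end in .5 instead.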

# import ggplot
library(ggplot2)
# must store data as dataframe
df <- data.frame(x)

# plot data
ggplot(df, aes(x)) +
  geom_histogram(color = "grey", fill = "lightBlue", 
                 binwidth = 5, center = 52.5) +
  ggtitle("ggplot2 histogram of x")
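
The effect of the open or closed side can be checked without drawing anything. The following is a minimal sketch using base R's hist() with plot = FALSE; the vector x is made-up data, chosen so that values sit exactly on the bin edges.

```r
# Made-up values that fall exactly on the bin edges 50, 55 and 60
x <- c(50, 55, 55, 60)
breaks <- c(50, 55, 60)

# right = TRUE: bins are right-closed, (50, 55] and (55, 60]
# (include.lowest = TRUE, the default, keeps the value 50 in the first bin)
right_closed <- hist(x, breaks = breaks, right = TRUE, plot = FALSE)$counts

# right = FALSE: bins are left-closed, [50, 55) and [55, 60]
left_closed <- hist(x, breaks = breaks, right = FALSE, plot = FALSE)$counts

print(right_closed)  # the two 55s are counted in the first bin
print(left_closed)   # the two 55s are counted in the second bin
```

geom_histogram() exposes the same choice through its closed argument.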

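Q1: Both geom_ and stat_ offer a boxplot function. How do the plots they draw differ?

Relationship between geom_ and stat_:
They can stand in for each other.
E.g. geom_bar and stat_count: geom_bar uses stat = "count" by default, and stat_count uses geom = "bar" by default, so both draw a bar chart by default.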

# geom_bar vs stat_count
library(ggplot2)  # the mpg data set ships with ggplot2

ggplot(mpg,aes(x=class)) + geom_bar() # bar chart from a single variable
ggplot(mpg,aes(x=class)) + stat_count() # same as above

# geom_bar vs geom_col -> leads to [Q2]
ggplot(mpg,aes(x=class,y=displ)) + geom_bar(stat="identity", fill = 'pink') # changing the default "count" to "identity" lets geom_bar take two variables (btw, this one got a colour)
# stat="identity" means no statistical transformation: each x is plotted against its own y
# the default stat="count" instead plots the number of rows at each x

ggplot(mpg,aes(x=class,y=displ)) + geom_col(fill = 'pink') # same as above; geom_col uses stat="identity" by default

# Also note the two common colour arguments of a geom: color (outline of the bars) & fill (inside of the bars)

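# But this raises another question
# ggplot(mpg,aes(x=class,y=displ)) + geom_bar(stat="identity", fill = 'pink') 

# If it is changed to 👇, is the result the same?
ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="bar", fill = 'pink') 
# ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="col") gives the same result as the line above
# but neither of these two matches the plots drawn earlier ❌

Having seen that geom and stat can be swapped, it is natural to expect geom_bar with stat = "identity" and stat_identity with geom = "bar" to give the same plot. In practice they do not.

What does the latter draw? The scatterplot gives a hint. The first bar tops out at 7, and in the scatterplot the highest point for 2seater is also at 7. So the guess is that each point is replaced by a bar of the same height, and because the x variable is discrete, many bars overlap and cannot be told apart. Trying a data set with a continuous x variable, mtcars, confirms the guess: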

ggplot(mtcars, aes(wt, mpg)) + stat_identity(geom="bar")
# continuous variables
mtcars[c('wt','mpg')]

# part of the returned output
                       wt  mpg
Mazda RX4           2.620 21.0
Mazda RX4 Wag       2.875 21.0
Datsun 710          2.320 22.8
Hornet 4 Drive      3.215 21.4
Hornet Sportabout   3.440 18.7
Valiant             3.460 18.1
Duster 360          3.570 14.3
Merc 240D           3.190 24.4
Merc 230            3.150 22.8

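Why do geom_bar(stat = "identity") and stat_identity(geom = "bar") give different results?

The two layers have different default positions: geom_bar() uses position = "stack", while stat_identity() uses position = "identity".

position = "identity": bars are drawn in place, one in front of another, so only the tallest is visible
position = "stack": bars are piled on top of one another, like building blocks

# change it to
ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="bar",position="stack", fill = 'pink')
# now the two plots are identical

ggplot(mpg,aes(x=class,y=displ)) + stat_identity() # scatterplot
ggplot(mpg,aes(x=class,y=displ)) + geom_point() # equivalent to the previous line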
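
The stacking explanation can also be checked numerically rather than by eye. The sketch below, which assumes only ggplot2 and its built-in mpg data, pulls the drawn bar tops out of each plot with layer_data() and compares them with per-class sums and maxima; the helper name top_of_bars is made up for this illustration.

```r
library(ggplot2)

# geom_bar() stacks by default; stat_identity() leaves bars where they are
p_stack <- ggplot(mpg, aes(x = class, y = displ)) + geom_bar(stat = "identity")
p_ident <- ggplot(mpg, aes(x = class, y = displ)) + stat_identity(geom = "bar")

# Highest bar top per class, as actually drawn by each plot
top_of_bars <- function(p) {
  d <- layer_data(p)
  as.numeric(tapply(d$ymax, as.integer(d$x), max))
}

stack_tops <- top_of_bars(p_stack)  # stacked: heights add up within a class
ident_tops <- top_of_bars(p_ident)  # identity: bars overlap, the tallest wins

# Plain summaries of the data, classes in alphabetical order in both cases
class_sum <- as.numeric(tapply(mpg$displ, mpg$class, sum))
class_max <- as.numeric(tapply(mpg$displ, mpg$class, max))

print(all.equal(stack_tops, class_sum))
print(all.equal(ident_tops, class_max))
```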

Q2: Both draw bar charts. What is the difference between geom_bar() and geom_col()?

Difference:
geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

> library(ggplot2)
> df <- data.frame(x = rep(c(2.9, 3.1, 4.5), c(5, 10, 4)))
> ggplot(df, aes(x)) + geom_bar()

> ggplot(df, aes(x)) + geom_col()
Error: geom_col requires the following missing aesthetics: y

# The bars look the same as with geom_bar above, but there the heights are counts, while here they are the values of a second variable
> df1 <- data.frame(x = c(2.9, 3.1, 4.5), y = c(5, 10, 4))
> ggplot() + geom_col(data = df1, aes(x,y))

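Q3: Differences between count, relative frequency, and density histograms

Count = number of observations in a bin; relative frequency = count / total number of observations; density = relative frequency / bin width.

P.S. When reading a histogram, check whether all bins have the same width. If they do not, a count histogram is misleading.

-> Common fixes:

• If the raw data are available, redraw the plot
• If not, merge bins so that every bin has the same width (here, binwidth = 5)
• Draw a density histogram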
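
To make the three scales concrete, here is a small sketch with made-up data and deliberately unequal bin widths, using base R's hist() with plot = FALSE. The bin edges 0, 5, 10, 20 are an assumption chosen for illustration.

```r
# Made-up data, binned with unequal widths: 5, 5 and 10
x <- c(1, 2, 3, 4, 6, 7, 8, 9, 12, 18)
h <- hist(x, breaks = c(0, 5, 10, 20), plot = FALSE)

widths   <- diff(h$breaks)
count    <- h$counts                    # number of observations per bin
rel_freq <- h$counts / sum(h$counts)    # share of observations per bin
density  <- rel_freq / widths           # share per unit of x; bar AREAS sum to 1

print(data.frame(widths, count, rel_freq, density))
```

On the density scale the wide last bin is no longer visually inflated, because its height is divided by its width.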

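Q4: ggvis = interactive plots

-> adjusting parameters of a histogram interactively while coding
The data can be downloaded from GitHub
(While looking for the data I found out that our professor is a big name 🐂)
The code currently does not work in RStudio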

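Q5: facet_grid VS facet_wrap

facet_wrap ("wrapped" facets)
facet_wrap treats the panels as a single sequence, usually defined by one grouping variable, and lays them out left to right, top to bottom, wrapping onto a new row as needed.
facet_grid (grid facets)
facet_grid can split the data by more than one variable at once: one variable defines the rows of the panel grid and another the columns.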
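
A short sketch of the two layouts, assuming ggplot2's built-in mpg data; the choice of class, drv and cyl as faceting variables is only for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_wrap: one panel per class, wrapped into rows of three
p_wrap <- base + facet_wrap(~ class, ncol = 3)

# facet_grid: rows defined by drv, columns defined by cyl
p_grid <- base + facet_grid(drv ~ cyl)

print(p_wrap)
print(p_grid)
```

facet_grid keeps a cell for every row/column combination, so some panels can be empty, whereas facet_wrap only draws panels that exist in the data.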

4. tidy data

The table shown in the original post (lower left of the figure) is messy data, going by the definition of tidy data:

1 variable per column
1 observation per row
*also depends on the use case

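library(tidyr)  # pivot_longer(); `messydata` is assumed to be a wide data frame with an id column
# Key point: pivot every column except id (⚠️ the id column stays untouched)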
tidydata <- messydata %>% 
 pivot_longer(cols = !id, names_to = "roadtype", 
 values_to = "mpg")

-> But sometimes there is no id column
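
A minimal sketch of both cases. The data frame below is hypothetical, and creating an id from the row number is just one possible way to handle a missing id column (the parallel coordinates section later in these notes uses rownames_to_column("ID") for the same purpose).

```r
library(tidyr)

# Hypothetical messy table: one row per car, one column per road type
messydata <- data.frame(id = c("car1", "car2"),
                        city = c(19, 22),
                        hwy = c(27, 30))

# With an id column: pivot everything except id
tidydata <- pivot_longer(messydata, cols = !id,
                         names_to = "roadtype", values_to = "mpg")

# Without an id column: create one first so each original row stays identifiable
no_id <- messydata[, c("city", "hwy")]
no_id$id <- seq_len(nrow(no_id))
tidydata2 <- pivot_longer(no_id, cols = !id,
                          names_to = "roadtype", values_to = "mpg")

print(tidydata)
print(tidydata2)
```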

5. Rounding Pattern

Test for Normality:

Density Curve + Normal Curve
QQ-plot
Shapiro Wilk test
- $H_0$: the data are normally distributed
- $H_a$: the data are not normally distributed

shapiro.test(x)
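
As a sketch of how the test and the QQ plot behave, the example below runs both on simulated data; the sample sizes, seed and distributions are arbitrary choices for illustration.

```r
set.seed(1)
x_normal <- rnorm(200, mean = 10, sd = 2)  # simulated normal data
x_skewed <- rexp(200, rate = 1)            # simulated right-skewed data

res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

# A small p-value is evidence against H0 (normality)
print(res_normal$p.value)
print(res_skewed$p.value)

# QQ plot: points close to the reference line suggest approximate normality
qqnorm(x_skewed)
qqline(x_skewed)
```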

6. Boxplot

Inspecting the boxplot statistics
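
One way to look at the numbers behind a boxplot is base R's boxplot.stats(). The vector below is made-up data with one extreme value.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)  # made-up data with one extreme value

bs <- boxplot.stats(x)

# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# points more than 1.5 * (hinge spread) beyond the hinges, drawn as outliers
print(bs$out)
```

Note that ggplot2's geom_boxplot() computes the box from quantiles rather than hinges, so its values can differ slightly on small samples.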

Density Curve

+geom_density()
or ggvis

Violin plots

Ridgeline plot

-> ggridges package

7. Categorical Variables

Types of data

• nominal: no fixed category order. Sort from highest to lowest count (left to right, or top to bottom).
• ordinal: fixed category order. Sort in the logical order of the categories (left to right, or starting at the bottom or the top).
• discrete: 'real' discrete data with a small number of possibilities, or 'fake' discrete data produced by rounding (e.g. height) -> Cleveland dot plot
• Not always clear-cut: nominal vs. ordinal, ordinal vs. discrete, and…
• Sometimes numbers are nominal, not discrete

8. WebScraping_rvest


Data in table form

library(tidyverse)
library(rvest)
library(robotstxt)
paths_allowed("https://cran.r-project.org/web/packages/forcats/index.html")

forcats_data <- read_html("https://cran.r-project.org/web/packages/forcats/index.html") %>%
  html_table()
length(forcats_data)

forcats_data[[1]]

mytable <- forcats_data[[1]]
str(mytable)

version <- mytable %>% filter(X1 == "Version:") %>% pull(X2)
date <- mytable %>% filter(X1 == "Published:") %>% pull(X2)

Data not in table form

<h2 id="current_visitors" class="data">319,942</h2>

- h2 tag 
- html_nodes("h2") 
- id attribute 
- html_nodes("#current_visitors") 
- class attribute 
- html_nodes(".data")

paths_allowed("https://analytics.usa.gov/")

webdata <- read_html("https://analytics.usa.gov/")
webdata %>% html_nodes("h2")

webdata %>% html_nodes("#current_visitors")

webdata %>% html_nodes(".data")

webdata %>% html_nodes("h2") %>% html_text()

webdata_dl <- read_html("analytics.html")  # read the page from a copy saved locally
webdata_dl %>% html_nodes("h2") %>% html_text()

webdata %>% html_nodes("script")
webdata %>% html_nodes("script") %>% html_attr("type")

9. Categorical Variables Code

9.1 Character vs factor data

character data: plotted alphabetically
factor data: plotted in order of factor levels
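
The difference comes down to factor levels, which can be inspected in base R without plotting. A minimal sketch with a made-up temperature vector:

```r
temperature <- c("cold", "warm", "hot")

# Character data converted to a factor gets alphabetical levels
default_levels <- levels(factor(temperature))
print(default_levels)

# Setting the levels explicitly fixes the order used on the axis
temp_ordered <- factor(temperature, levels = c("cold", "warm", "hot"))
print(levels(temp_ordered))
```

The forcats helpers in the following subsections are convenient ways of rearranging these levels.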

9.2 Binned, ordinal data, levels out of order

Recoding factor levels -》 fct_recode()

x <- factor(c("G234", "G452", "G136")) 
y <- fct_recode(x, Physics = "G234", Math = "G452", Chemistry = "G136")

If the row order is correct, use fct_inorder()

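library(ggplot2)
library(forcats)
# mycolor and myfill are colour values assumed to be defined earlier in the course material
df <- data.frame(temperature = factor(c("cold", "warm", "hot")), count = c(15, 5, 22)) 

# row order is correct (think: factor in ROW order) 
ggplot(df, aes(x = fct_inorder(temperature), y = count)) + 
	geom_col(color = mycolor, fill = myfill) + 
	theme_grey(16)
# without fct_inorder() the x axis would be in the order cold-hot-warm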

fct_relevel(): move levels to a different position

move levels to the beginning

x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2")

## [1] A B C move1 D E move2 F 
## Levels: move1 move2 A B C D E F

to move levels after an item (by position)

x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2", after = 4) # move after the fourth item

## [1] A B C move1 D E move2 F 
## Levels: A B C D move1 move2 E F

move levels to the end

x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2", after = Inf)

## [1] A B C move1 D E move2 F 
## Levels: A B C D E F move1 move2


9.3 Binned, nominal

Order bars by frequency count using fct_reorder()

P.S. Note ⚠️: if the bars are flipped to horizontal, set .desc = FALSE

9.4 Unbinned, ordinal, levels out of order -> fct_relevel()

9.5 Unbinned, nominal data -> fct_infreq() (default is decreasing order of frequency)

Likewise ⚠️, for horizontal bars -> fct_rev() + fct_infreq()

ggplot(df, aes(fct_rev(fct_infreq(mmcolor)))) + 
	geom_bar() + 
	coord_flip() + 
	theme_grey(16)

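9.6 Dealing with NAs -> fct_explicit_na() (deprecated since forcats 1.0.0 in favour of fct_na_value_to_level())

df <- data.frame(temperature = factor(c("cold", "warm", "hot", NA)), count = c(15, 5, 22, 12))
 
ggplot(df, aes(x = fct_inorder(temperature), y = count)) + 
# wrapping this in fct_rev() + fct_inorder() does not help either:
# it only changes the order of the levels other than NA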
	geom_col() + 
	coord_flip() + 
	ggtitle("(NA bar is too prominent)") + 
	theme_grey(16)


df %>% 
	mutate(temperature = fct_explicit_na(temperature, "NA") %>%
	fct_relevel("NA", "hot", "warm", "cold")) %>% 
	ggplot(aes(x = temperature, y = count)) + 
	geom_col(color = mycolor, fill = myfill) + coord_flip() + 
	theme_grey(16)


9.7 Rebinning

df <- as.data.frame(Titanic)
# problem
ggplot(df, aes(Class, Freq)) + 
	geom_col(color = "grey50", fill = "lightblue") + 
	theme_grey(16)


df %>% 
	group_by(Class) %>% 
	summarize(Freq = sum(Freq)) %>% 
	ggplot(aes(Class, Freq)) + 
	geom_col(color = "grey50", fill = "lightblue") + 
	theme_grey(16)


9.8 Percentages, more than one group!

# 1. Share of each Class
df %>% 
	group_by(Class) %>% 
	summarize(Freq = sum(Freq)) %>% 
	mutate(prop = Freq/sum(Freq))
	
# 2. Overall percentages: share of each Class + Survived combination
df2 <- df %>% 
 	group_by(Class, Survived) %>% 
 	summarize(Freq = sum(Freq)) %>% 
 	ungroup()  # very important 

df2 %>% mutate(prop = Freq/sum(Freq))

summarize() removes the last group

df %>% group_by(Class, Survived) %>% groups()
# shows two grouping variables: Class and Survived
df %>% group_by(Class, Survived) %>% summarize(Freq = sum(Freq)) %>% groups()
# shows only the first one: Class

10. Dependency Relationships

Linear model
+geom_smooth(method = "lm", se = FALSE)
Residual Plot
Augment accepts a model object and a dataset and adds information about each observation in the dataset. Most commonly, this includes predicted values in the .fitted column, residuals in the .resid column, and standard errors for the fitted values in a .se.fit column. New columns always begin with a . prefix to avoid overwriting columns in the original dataset.


library(broom) 
df <- mod %>% augment() 
ggplot(df, aes(.fitted, .std.resid)) + 
	geom_point() + 
	geom_hline(yintercept = 0, col = "blue")
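
The notes do not say which model mod refers to. The sketch below assumes a simple linear model on the built-in mtcars data and rebuilds the three diagnostic columns with base R functions, which should correspond to the .fitted, .resid and .std.resid columns that augment() returns for an lm fit.

```r
# Assumed example model; the notes do not show how `mod` was defined
mod <- lm(mpg ~ wt, data = mtcars)

diag_df <- data.frame(
  .fitted    = fitted(mod),     # predicted values
  .resid     = resid(mod),      # observed minus predicted
  .std.resid = rstandard(mod)   # standardized residuals
)

# Base-graphics version of the residual plot
plot(diag_df$.fitted, diag_df$.std.resid,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, col = "blue")
```

A patternless band of points around the horizontal line is what a well-specified linear model would produce.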

Interactive (Plotly ggplot2 library)

library(plotly) 
ggplotly(g) # g is a ggplot object

Interactive (Plotly R library)

plot_ly()

What can a scatterplot tell us?
- Associations (describe what you see)
- Outliers
- Clusters
- Gaps
- Barriers (boundaries)
- Conditional relationships (different relationships for different intervals of x)
What if the scatterplot has too many points?
- set alpha & stroke
- Don’t plot all points (remove outliers, subset data, sample data)
- Transform to log scale
- Heatmaps (bin counts or density estimates)
- Density contour lines
- Combination of above
- Multiple variable : scatterplot matrices

# example
# These are templates: the ggplot() calls omit data and aes(), and x, y, n, xx are placeholders
# (1) subset
library(dplyr)
binned <- movies %>%
	mutate(mybin = ntile(votes, 10)) %>%  # ntile() splits the rows into 10 groups
	filter(mybin == 1)  # keep a single group for plotting (which group to keep is a free choice)

# (2) log
ggplot()+
	geom_point()+
	scale_x_log10() # breaks can also be set here

# (3) heatmap
# (3.1) square heatmap of bin counts (default: 30 bins)
ggplot()+
	scale_fill_viridis_c()+ # fill colour scale
	# or pick the colours yourself: scale_fill_gradient(low = '#F6F8FB',high = '#09005F')
	theme_classic()+
	geom_bin2d() # binwidth = c(xx,xx) controls the size of the squares
	
# (3.2) hex heatmap
+ geom_hex() # binwidth can be set here too

# 4.1 Density estimate contour lines
library(MASS)
ggplot()+geom_point()+geom_density_2d()

# 4.2 2D Kernel density estimate
f <- kde2d(x,y,n)
image(f)
# w/ contour lines
contour(f,add = T)
# w/ points
points(x,y,pch = xx)

# 4.3 another 2D kernel density estimation
smoothScatter(x,y)

# 4.4 calculate the kde, plot with ggplot2
df <- con2tr(f)
ggplot(df,aes(x,y))+
	geom_contour(aes(z = z))
# or
ggplot(df,aes(x,y))+
	geom_tile(aes(fill = z))+
	scale_fill_viridis_c()
	
# 5. scatterplot matrices
plot(data_frame)	

# or
library(lattice)
splom(splomvar)
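
Since the templates above leave the data out, here is a runnable sketch of the 2D kernel density step on simulated data. The sample size, seed and grid size n = 50 are arbitrary assumptions.

```r
library(MASS)  # kde2d()

set.seed(42)
x <- rnorm(500)
y <- x + rnorm(500, sd = 0.5)   # simulated, positively associated data

f <- kde2d(x, y, n = 50)        # density estimated on a 50 x 50 grid

image(f)                        # heatmap of the estimated density
contour(f, add = TRUE)          # contour lines on top
points(x, y, pch = ".")         # raw points on top
```

The returned list f holds the grid coordinates in f$x and f$y and the estimated density in the matrix f$z.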

11. Graphical Perception

Graphs can lie

Ordered Elementary Tasks

Position along a common scale
Position along identical, nonaligned scales
Length
Angle / Slope
Area
Volume
Color hue / Color saturation / Density

11.1 Position along a common scale


11.2 Position along identical, nonaligned scales

11.3 Length


12. Multivariate Continuous

Two continuous variables: scatterplot
Three continuous variables:
- scatterplot matrix
- 3D scatterplot (R: scatterplot3d)
- interactive 3D scatterplot

library(plotly) 
plot_ly(df, x = ~x, y = ~y, z = ~z, mode = "markers", marker = list(size = 4)) %>% add_markers()

Four continuous variables: Parallel Coordinates Plot

12.1 Drawing with ggplot2

x <- rnorm(50, 20, 5) 
y <- runif(50, 8, 12) - x 
df <- data.frame(x, y)

tidydf <- df %>% 
	select(x, y) %>% 
	rownames_to_column("ID") %>% 
	gather(var, value, -ID)
	
ggplot(tidydf, aes(x = var, y = value, group = ID)) + geom_line()
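# note the group aesthetic: one line per ID

Figures in the original post: the paired data; the plot without group; the plot with group but without standardization.

With group, standardized: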

standardize <- function(x) (x-mean(x))/sd(x) 

df2 <- tidydf %>% 
	group_by(var) %>% 
	mutate(value = standardize(value)) %>% 
	ungroup() 
	
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Standardize")

With group, rescaled to [0, 1]:

df2 <- tidydf %>% 
	group_by(var) %>% 
	mutate(value = scales::rescale(value)) %>% 
	ungroup() 
	
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Rescale")
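
The two transformations differ only in what they pin down. A base R sketch on made-up values; rescale01 is a hypothetical helper written here to mirror what scales::rescale() does with its default 0 to 1 target range.

```r
standardize <- function(x) (x - mean(x)) / sd(x)
rescale01   <- function(x) (x - min(x)) / (max(x) - min(x))  # hypothetical helper

v <- c(2, 4, 6, 8, 20)  # made-up values

z <- standardize(v)
r <- rescale01(v)

print(c(mean = mean(z), sd = sd(z)))  # standardized: mean 0, standard deviation 1
print(range(r))                       # rescaled: minimum 0, maximum 1
```

Standardizing keeps outliers far from the bulk of the lines, while rescaling squeezes every variable into the same fixed band.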

What if compare different distributions?

# original plot
x <- rnorm(50, 20, 5)
weirdvar <- c(1, rep(50, 48), 100) 
df <- data.frame(x, weirdvar) 
tidydf <- df %>% 
	rownames_to_column("ID") %>% 
	gather(var, value, -ID) 

ggplot(tidydf, aes(x = var, y = value, group = ID)) + 	
	geom_line(alpha = .5)

# standardized
tidydf %>%
    group_by(var) %>%
    mutate(value = standardize(value)) %>%
    ungroup() %>%
    ggplot(aes(x = var, y = value, group = ID))+
    geom_line()

# rescaled to [0, 1]

tidydf %>%
    group_by(var) %>%
    mutate(value = scales::rescale(value)) %>%
    ungroup() %>%
    ggplot(aes(x = var, y = value, group = ID))+
    geom_line()


12.2 Drawing with ggparcoord from GGally

(1) scale = "globalminmax": no scaling

library(GGally) 
mystates <- data.frame(state.x77) %>% 
	rownames_to_column("State") %>% 
	mutate(Region = factor(state.region)) 
# state.region is a separate vector; it is not part of the state.x77 data set
 
mystates$Region <- factor(mystates$Region, levels = c("Northeast", "North Central", "South","West")) 

ggparcoord(mystates, columns = 2:9, scale = "globalminmax")

(2) scale = "std" (default)

ggparcoord(mystates, columns = 2:9)

(3) scale = "std" (default) + reordered columns

ggparcoord(mystates, columns = c(2, 4, 6, 8, 3, 5, 7, 9))

(4) alpha (alphaLines) + rescale (scale = "uniminmax")

# scale = "uniminmax" rescales each variable to [0, 1], replacing the default scale = "std"
ggparcoord(mystates, columns = 2:9, alphaLines = .3, scale = "uniminmax")

(5) Dataset with repeats -> Splines

x <- 1:10 
y <- c(2,2,4,4,5,5,5,10,10,10) 
z <- c(3,3,2,3,3,7,7,5,7,7) 
w <- c(1, 1, 1, 7, 7, 7, 8, 8, 8, 8) 
df <- data.frame(x,y,z, w) 
ggparcoord(df, columns = 1:4, scale = "globalminmax") + 
    geom_vline(xintercept = 1:4, color = "lightblue")

# drawn like this, you cannot tell which line is which once they meet at a node

ggparcoord(df, columns = 1:4, scale = "globalminmax",splineFactor = 10) + 
    geom_vline(xintercept = 1:4, color = "lightblue")
# alpha + rescale + splines can all be combined

# Alpha + rescale + splines + group (rescaled with scale = "uniminmax")
ggparcoord(mystates, columns = 2:9, alphaLines = .5, scale = "uniminmax", splineFactor = 10, groupColumn = 10) + 
	geom_vline(xintercept = 2:9, color = "lightblue")


(6) Highlighting a trend -> ifelse+scale_color_manual

mystates %>% 
	mutate(color = factor(ifelse(Murder > 11, 1, 0))) %>% 
	arrange(color) %>%
	ggparcoord(columns = 2:9, groupColumn = "color") + 
	scale_color_manual(values = c("grey70", "red")) + 
	coord_flip() + 
	guides(color = "none") + 
	ggtitle("States with Murder Rate > 11 (per 100000) in red")


(7) Watch out for categorical variables

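library(parcoords)  # parcoords(); withD3 = TRUE additionally needs the d3r package
data.frame(Titanic) %>% 
	parcoords( rownames = F, # turn off rownames from the data.frame
	brushMode = "1D-axes" , 
	reorderable = T , 
	queue = T , 
	# the original notes had colorBy = "Region", which is not a column of Titanic; "Class" is an assumed substitute
	color = list( colorBy = "Class" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) , 
	withD3 = TRUE )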

12.3 Html widget: parcoords -> interactive

# See: http://www.buildingwidgets.com/blog/2015/1/30/week-04-interactive-parallel-coordinates-1 
# devtools::install_github("timelyportfolio/parcoords") 

library(parcoords) 
mystates %>% 
	arrange(Region) %>% 
	parcoords( rownames = F ,  # turn off rownames from the data.frame
		brushMode = "1D-axes" , 
		reorderable = T , 
		queue = T,
		alpha = 0.5 )

# with color
parcoords(mystates , 
	rownames = F , 
	brushMode = "1D-axes" , 
	reorderable = T , 
	queue = T , 
	color = list( colorBy = "Region" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) , 
	withD3 = TRUE )

13. Multivariate Categorical

Frequency
- Bar charts
  - Stacked bar chart
  - Grouped bar chart （2variables）
  - Grouped bar chart w/ facets （3 variables）
- Cleveland dot plots
Proportion / Association
- Mosaic plots
- Fluctuation diagrams

Chi Square Test of Independence

$H_0$: the two variables are independent
Original dataset:

localmat <- as.matrix(local[,2:3])
rownames(localmat) <- local$Age
X <- chisq.test(localmat, correct = FALSE)

X$observed
X$expected

X
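
Because the local data set is not included in these notes, the sketch below uses a hypothetical 3 x 2 table of counts to show where the expected values and the test statistic come from.

```r
# Hypothetical counts (the `local` data are not shown in these notes)
counts <- matrix(c(20, 30,
                   25, 25,
                   35, 15),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Age = c("young", "middle", "old"),
                                 Answer = c("yes", "no")))

X <- chisq.test(counts, correct = FALSE)

# Expected count under independence = row total * column total / grand total
expected_by_hand <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Test statistic = sum of (observed - expected)^2 / expected
stat_by_hand <- sum((counts - expected_by_hand)^2 / expected_by_hand)

print(X$expected)
print(X$statistic)
print(X$p.value)
```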

Mosaic plots


P.S.

“Treatment” level should be on the bottom and darker than the other shades (for ordinal data)
The levels for all ordinal data should appear in order.
Choose one variable to order by frequency count
When using mosaic() from vcd, the count column is assumed to be named "Freq" by default

Mosaic pairs plot


Similar plots

mosaic plot = filled rectangular plot with consistent number of rows and columns, where each small rectangle represents a unique combination of levels of factors of the variables displayed
treemap = filled rectangular plot representing hierarchical data (fill color does not necessarily represent frequency count)
spine plot = mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction

Categorical data formats - Conversions

cases
counts (tidy data with Freq column)
contingency or pivot table
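
The three formats can be converted into one another with base R alone. A sketch using the built-in Titanic table; the choice of Class and Survived for the two-way table is only for illustration.

```r
# counts: tidy data frame with a Freq column
counts <- as.data.frame(Titanic)

# counts -> contingency table (here Class by Survived)
tab <- xtabs(Freq ~ Class + Survived, data = counts)

# counts -> cases: repeat each row Freq times, then drop the Freq column
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq),
                c("Class", "Sex", "Age", "Survived")]

# cases -> contingency table again
tab_from_cases <- table(cases$Class, cases$Survived)

print(tab)
print(nrow(cases))  # one row per person
```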


Likert data (satisfaction ratings)

Stacked Bar Chart


Diverging Stacked Bar Chart


Diverging Stacked Bar Chart w/ Separate Neutrals


14. Alluvial diagrams

14.1 Alluvial Plots in ggplot2 (official vignette)

Define the following elements of a typical alluvial plot:

An axis is a dimension (variable) along which the data are vertically grouped at a fixed horizontal position. The plot above uses three categorical axes: Class, Sex, and Age.
The groups at each axis are depicted as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.
Horizontal (x -) splines called alluvia span the width of the plot. In this plot, each alluvium corresponds to a fixed value of each axis variable, indicated by its vertical position at the axis, as well as of the Survived variable, indicated by its fill color.
The segments of the alluvia between pairs of adjacent axes are flows.
The alluvia intersect the strata at lodes. The lodes are not visualized in the above plot, but they can be inferred as filled rectangles extending the flows through the strata at each end of the plot or connecting the flows on either side of the center stratum.


14.2 Alluvial data

(1) Alluvia (wide) format

head(as.data.frame(UCBAdmissions))


install.packages("ggalluvial")
library(ggalluvial)
library(ggplot2)
library(dplyr)

ggplot(as.data.frame(UCBAdmissions),
       aes(y = Freq, axis1 = Gender, axis2 = Dept, axis3 = Admit)) +
    geom_alluvium(aes(fill = Gender), width = 1/12) +
    geom_stratum(width = 1/12, fill = "grey80", color = "grey") +
    geom_label(stat = "stratum", 
               aes(label = after_stat(stratum))) +
    scale_x_discrete(expand = c(.05, .05)) +
    scale_fill_brewer(type = "qual", palette = "Set1") +
    ggtitle("UC Berkeley admissions and rejections") +
 theme_void()

-> change geom_alluvium to geom_flow

An example (in Chinese) of wide vs. long data

(2) Lodes (long) format

UCB_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
                           axes = 1:3,
                           id = "Cohort")


x, the “key” variable indicating the axis to which the row corresponds, which are to be arranged along the horizontal axis;
stratum, the “value” taken by the axis variable indicated by x; and
alluvium, the indexing scheme that links the rows of a single alluvium.

data(Refugees, package = "alluvial")
country_regions <- c(
  Afghanistan = "Middle East",
  Burundi = "Central Africa",
  `Congo DRC` = "Central Africa",
  Iraq = "Middle East",
  Myanmar = "Southeast Asia",
  Palestine = "Middle East",
  Somalia = "Horn of Africa",
  Sudan = "Central Africa",
  Syria = "Middle East",
  Vietnam = "Southeast Asia"
)
Refugees$region <- country_regions[Refugees$country]
ggplot(data = Refugees,
       aes(x = year, y = refugees, alluvium = country)) +
  geom_alluvium(aes(fill = country, colour = country),
                alpha = .75, decreasing = FALSE) +
  scale_x_continuous(breaks = seq(2003, 2013, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -30, hjust = 0)) +
  scale_fill_brewer(type = "qual", palette = "Set3") +
  scale_color_brewer(type = "qual", palette = "Set3") +
  facet_wrap(~ region, scales = "fixed")

The format allows us to assign aesthetics that change from axis to axis along the same alluvium, which is useful for repeated measures datasets.

15. Simpson's Paradox

16. Color

RColorBrewer Color Schemes

sequential
diverging
qualitative (for categorical data)
perceptually uniform color spaces

Continuous data

+scale_color_viridis_c()
+scale_color_distiller(palette = "PuBu")

create your own sequential
+scale_color_gradient(low = "white", high = "red")
create your own diverging
+scale_color_gradient2(low = "blue", mid = "white", high = "red")

Discrete data

+scale_color_viridis_d()
+scale_color_brewer(palette = "PuBu")
create your own
+scale_color_manual(values = c("red", "yellow", "blue"))

Color Vision Deficiency

protanopia (red)
deuteranopia (green)
tritanopia (blue)

Legend order matches graph order

17. Heatmaps

can show frequency counts (2D histogram) or value of a third variable
can be used for continuous or categorical data (both for axes and fill color)

17.1 Drawing heatmaps with ggplot2

(1) geom_tile with numerical data, compare to geom_point

library(ggplot2) # plotting
library(gridExtra) # arranging plots

x <- 1:3 
y <- c(5, 2, 7) 
df <- data.frame(x, y) 
g1 <- ggplot(df, aes(x, y)) + geom_point() 
g2 <- ggplot(df, aes(x, y)) + geom_tile() 
grid.arrange(g1, g2, nrow = 1)

(2) geom_raster as geom_tile w/ uniform w, h & faster

geom_tile() uses the center of the tile and its size (x, y, width, height). geom_raster is a high performance special case for when all the tiles are the same size

x <- c("apples", "bananas", "oranges") 
y <- c("NJ", "NY", "NJ") 
df <- data.frame(x, y) 
ggplot(df, aes(x,y)) + geom_raster() + theme_grey()

(3) Complete set of (x,y) pairs

y <- rep(c("apples", "bananas", "oranges"), 2) 
x <- rep(c("NJ", "NY"), each = 3) 
df <- data.frame(x, y) 
ggplot(df, aes(x, y)) + geom_raster()

-> Add fill color

set.seed(2019) 
df$z <- sample(6) 
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))


-> What if z is categorical ?

df$z <- c("A", "C", "B", "B", "A", "D") 
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))

-> Create a heat map theme
theme(axis.line = element_blank(), axis.ticks = element_blank()) -> removes the axis lines and ticks

theme_heat <- theme_classic() + 
  theme(axis.line = element_blank(), axis.ticks = element_blank()) 
ggplot(df, aes(x, y)) + 
  geom_raster(aes(fill = z)) + 
  theme_heat


-> Add coord_fixed

ggplot(df, aes(x, y)) + 
  geom_raster(aes(fill = z)) + 
  coord_fixed() + 
  theme_heat

-> Add white border
(doesn't work with geom_raster())

ggplot(df, aes(x, y)) + 
  geom_tile(aes(fill = z), color = 'white') + 
  coord_fixed() + 
  theme_heat

II. Homework Review

Q1: Draw a parallel coordinates plot of the numeric columns in the dataset -> how to find all the numeric columns in a data set

library(dplyr)  # select_if()
library(palmerpenguins)
select_if(penguins_raw, is.numeric)
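
If palmerpenguins or dplyr is not at hand, the same selection can be sketched in base R. The example uses the built-in iris data as a stand-in for penguins_raw.

```r
# Base R: keep only the numeric columns of a data frame
numeric_cols <- Filter(is.numeric, iris)
print(names(numeric_cols))

# dplyr equivalent (select_if() is superseded since dplyr 1.0.0):
# dplyr::select(iris, where(is.numeric))
```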


something else

Making Faceted Heatmaps with ggplot2

References 🧾

Zhihu: Dwzb
Last year's course slides by the EDAV professor
R function study notes: facet_wrap, facet_grid
