FALL_20_NOTE EDAV (Exploratory Data Analysis and Visualization): Graphical Visualization

Part I. Class Notes

1. Intro to EDA

WHY

• detecting patterns
• finding outliers
• making comparisons
• identifying clusters

HOW

• Deep understanding of the dataset: where it came from and what its limitations are
• Experiment with different graphic forms, based on theory on what forms work well with different data types

Evaluating Graphs

• Wrong or misleading
• Meaningless
• Little added value
• Good alternatives

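Q0: About RStudio + R Markdown

R Markdown user guide

By the way, if you run code with the ▶️ button in the code pane (top left), the result is shown inline below the chunk by default (Chunk Output Inline). To send the output to the console instead, add this at the very top of the file 👇
---
output: pdf_document
editor_options:
  chunk_output_type: console
---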

2. ggplot2

Recommended order for writing the parts of a plot (figure not preserved):

Here, <labels> includes

  • ggtitle()
  • labs()
  • xlab()
  • ylab()
  • annotate()

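Miscellaneous points: scale/coord/facet/theme (use at most one coord and one facet per plot, and one scale per aesthetic; a later one replaces an earlier one)

  • After building the plot, you can add:
    • + theme_wsj()        # needs library(ggthemes)
    • + coord_polar()      # switches to polar coordinates
    • + facet_grid()
    • + scale_x_reverse()  # reverses the x-axis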
  • One scale per mapping
    • scale_x_date()
    • scale_y_continuous()
    • scale_color_manual()
    • scale_fill_viridis_c()

Recommended plotting code

library(Sleuth3) # data
library(ggplot2) # plotting

# load data
finches <- Sleuth3::case0201
# finch histograms by year with overlaid density curves
# (after_stat(density) replaces the deprecated ..density.. notation)
ggplot(finches, aes(x = Depth, y = after_stat(density))) + 
  # plotting
  geom_histogram(bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
  geom_density(color = "#3D6480") + 
  facet_wrap(~Year) +
  # formatting
  ggtitle("Severe Drought Led to Finches with Bigger Chompers",
          subtitle = "Beak Depth Density of Galapagos Finches by Year") +
  labs(x = "Beak Depth (mm)", caption = "Source: Sleuth3::case0201") +
  theme(plot.title = element_text(face = "bold")) +
  theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
  theme(plot.caption = element_text(color = "grey68"))

3. Histograms

For Continuous Variables, we’re looking for features such as:

  • Asymmetry
  • Outliers
  • Multimodality
  • Gaps
  • Heaping/Rounding
  • Impossibilities/Errors

Parameters

  • bin boundaries
  • bin number

How do you output two side-by-side plots with ggplot?

# using binwidth
p1 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(binwidth = 0.5, boundary = 6) +
  # You can use boundary to specify the endpoint of any bin
  ggtitle("Changed binwidth value")
# using bins
p2 <- ggplot(finches, aes(x = Depth)) +
  geom_histogram(bins = 48, boundary = 6) +
  ggtitle("Changed bins value")

# format plot layout
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

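When drawing a histogram, pay close attention to the bin boundaries and to which side of each bin is open (ggplot2 bins are closed on the right by default; closed = "left" changes that).
A simple way to sidestep the issue is to keep data values off the bin edges, e.g. for integer data put the edges at xx.5 with boundary = xx.5. Note that center and boundary only fix the bin alignment: with binwidth = 5, center = 52.5 places the edges at 50, 55, 60, and so on.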

# import ggplot
library(ggplot2)
# must store data as dataframe
df <- data.frame(x)

# plot data
ggplot(df, aes(x)) +
  geom_histogram(color = "grey", fill = "lightBlue", 
                 binwidth = 5, center = 52.5) +
  ggtitle("ggplot2 histogram of x")
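
To see why the open/closed side of a bin matters, here is a minimal base-R sketch. The vector x is made-up illustration data (not from the course), and hist() is used only because it returns the counts directly. It bins the same integer data with edges on the integers, closed on the right and then on the left, and then with edges shifted to xx.5:

```r
# Hypothetical integer-valued data; every value sits on a multiple of 5
x <- c(50, 55, 55, 60, 60, 60)

# Edges on the integers: the counts depend on which side of each bin is closed
edges_int    <- seq(45, 65, by = 5)
right_closed <- hist(x, breaks = edges_int, right = TRUE,  plot = FALSE)$counts
left_closed  <- hist(x, breaks = edges_int, right = FALSE, plot = FALSE)$counts

# Edges at xx.5: no data value can land on an edge, so both conventions agree
edges_half <- seq(47.5, 62.5, by = 5)
half_right <- hist(x, breaks = edges_half, right = TRUE,  plot = FALSE)$counts
half_left  <- hist(x, breaks = edges_half, right = FALSE, plot = FALSE)$counts

print(rbind(right_closed, left_closed))
print(rbind(half_right, half_left))
```

With edges on the integers the two rows differ, because every observation moves to a neighbouring bin when the closed side changes. With edges at xx.5 the two rows are identical. In geom_histogram() the corresponding controls are boundary (or center) and closed.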

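Q1: Both geom_ and stat_ offer a boxplot function, so how do their plots differ?

The relationship between geom_ and stat_
They can stand in for each other.
e.g. geom_bar and stat_count: geom_bar uses stat = "count" by default, and stat_count uses geom = "bar" by default, so both draw a bar chart.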

# geom_bar vs stat_count
library(ggplot2) # the mpg dataset ships with ggplot2

ggplot(mpg, aes(x = class)) + geom_bar()   # bar chart from a single variable
ggplot(mpg, aes(x = class)) + stat_count() # same as above

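# geom_bar vs geom_col -> leads to [Q2]
ggplot(mpg, aes(x = class, y = displ)) + geom_bar(stat = "identity", fill = 'pink') # overriding the default "count" with "identity" lets geom_bar take two variables (and, by the way, this one is colored)
# stat = "identity" means no statistical transformation: each x keeps its original y
# stat = "bin" would instead count the x values falling into each bin

ggplot(mpg, aes(x = class, y = displ)) + geom_col(fill = 'pink') # same as above; geom_col uses stat = "identity" by default

# Also note the two common color arguments of a geom: color (bar outline) & fill (bar interior)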

# But this raises another question
# ggplot(mpg, aes(x = class, y = displ)) + geom_bar(stat = "identity", fill = 'pink')

# If it becomes the following 👇, is the result the same?
ggplot(mpg, aes(x = class, y = displ)) + stat_identity(geom = "bar", fill = 'pink')
# ggplot(mpg, aes(x = class, y = displ)) + stat_identity(geom = "col") gives the same result as the line above
# But neither of these two matches the earlier plots ❌

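Given that geom and stat can replace each other, it is natural to expect geom_bar with stat = "identity" and stat_identity with geom = "bar" to produce the same plot. In practice they do not.

What does the latter draw? The scatterplot gives a hint. The first bar tops out at 7, and in the scatterplot the highest point for 2seater also has y = 7. So the guess is that each point is replaced by a bar of the same height. Because x is discrete here, many bars overlap and cannot be told apart, so we try again with data whose x is continuous. The mtcars plot confirms the guess.

ggplot(mtcars, aes(wt, mpg)) + stat_identity(geom="bar")
# continuous variables
mtcars[c('wt','mpg')]

# part of the returned output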
                       wt  mpg
Mazda RX4           2.620 21.0
Mazda RX4 Wag       2.875 21.0
Datsun 710          2.320 22.8
Hornet 4 Drive      3.215 21.4
Hornet Sportabout   3.440 18.7
Valiant             3.460 18.1
Duster 360          3.570 14.3
Merc 240D           3.190 24.4
Merc 230            3.150 22.8

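Why do geom_bar(stat = "identity") and stat_identity(geom = "bar") give different results?

The two differ in their default position adjustment: geom_bar uses position = "stack", while stat_identity uses position = "identity".

  • position = identity: bars are drawn in place, one in front of another, so only the tallest is visible
  • position = stack: bars are stacked on top of one another, like building blocks
# change it to
ggplot(mpg, aes(x = class, y = displ)) + stat_identity(geom = "bar", position = "stack", fill = 'pink')
# now the two plots are identical

ggplot(mpg, aes(x = class, y = displ)) + stat_identity() # scatterplot
ggplot(mpg, aes(x = class, y = displ)) + geom_point() # equivalent to the previous line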

Q2: Both draw bar charts, so what is the difference between geom_bar() and geom_col()?

Difference
geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.

> library(ggplot2)
> df <- data.frame(x = rep(c(2.9, 3.1, 4.5), c(5, 10, 4)))
> ggplot(df, aes(x)) + geom_bar()


> ggplot(df, aes(x)) + geom_col()
Error: geom_col requires the following missing aesthetics: y

# Comparable to the earlier bar chart, but that one plots counts while this one plots the values of two variables
> df1 <- data.frame(x = c(2.9, 3.1, 4.5), y = c(5, 10, 4))
> ggplot() + geom_col(data = df1, aes(x,y))

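Q3: Differences among count, relative frequency, and density histograms

P.S. When reading a histogram, check whether all bins have the same width. Unequal bin widths on a count scale are misleading.

-> Common fixes:

  • If the raw data is available, redraw the plot
  • Without the raw data, merge bins so that every bin has binwidth = 5
  • Draw a density histogram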
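
The three scales can be computed by hand, which makes the difference concrete. The sketch below is only an illustration: the data and the deliberately unequal bin edges are made up, and base R's hist() is used because it exposes the numbers directly.

```r
# Hypothetical data and deliberately unequal bin widths (5, 5, 10)
x     <- c(1, 2, 3, 6, 7, 12, 13, 14, 15, 18)
edges <- c(0, 5, 10, 20)

h      <- hist(x, breaks = edges, plot = FALSE)
widths <- diff(edges)

count    <- h$counts            # number of observations per bin
rel_freq <- count / sum(count)  # share of observations per bin (heights sum to 1)
density  <- rel_freq / widths   # share per unit of x (bar AREAS sum to 1)

print(data.frame(bin = c("[0,5]", "(5,10]", "(10,20]"), count, rel_freq, density))
```

The widest bin has the largest count (5 of 10 observations) but not the largest density, which is why a count histogram with unequal bin widths misleads and a density histogram does not.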

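Q4: ggvis = interactive plots

-> adjusting parameters of a histogram interactively while coding
The data can be downloaded from GitHub
(While looking for the data I discovered that our professor is seriously impressive 🐂)
The code currently does not work in RStudio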

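Q5: facet_grid VS facet_wrap

facet_wrap (wrapped facets)
facet_wrap splits the data by a single criterion; the resulting small panels are laid out left to right, top to bottom, wrapping onto new rows
facet_grid (grid facets)
facet_grid can split the data by several criteria at once, e.g. one variable for the rows and another for the columns.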
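
A short sketch of the two layouts, using the mpg data that ships with ggplot2. The choice of variables (class, drv, cyl) is just an assumption for illustration:

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# facet_wrap: one variable, panels wrapped row by row
p_wrap <- base + facet_wrap(~class, ncol = 3)

# facet_grid: rows by drv, columns by cyl
# (combinations with no data still get an empty panel)
p_grid <- base + facet_grid(drv ~ cyl)

print(p_wrap)
print(p_grid)
```

facet_wrap suits a single variable with many levels, since it fills the available space. facet_grid suits two crossed variables, since every row and every column shares a level.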


4. tidy data

The table at the lower left (figure not preserved) is messy data, because tidy data is defined as:

  • 1 variable per column
  • 1 observation per row
  • *also depends on the use case

# Lesson here: keep one column fixed. ⚠️ The id column stays put, every other column is pivoted
tidydata <- messydata %>% 
 pivot_longer(cols = !id, names_to = "roadtype", 
 values_to = "mpg")

-> But sometimes there is no id column
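
When there is no id column, one common approach (a sketch, not necessarily what the lost screenshots showed) is to create one first, for example from the row names, and then pivot as before. The table messydata below is made up for illustration:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical messy table: the car names live in the row names, not in a column
messydata <- data.frame(city = c(19, 22), hwy = c(27, 30),
                        row.names = c("car_a", "car_b"))

tidydata <- messydata %>%
  rownames_to_column("id") %>%   # create the id column first
  pivot_longer(cols = !id, names_to = "roadtype", values_to = "mpg")

print(tidydata)
```

If the row names carry no information, an index column such as mutate(id = row_number()) would serve the same purpose.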

5. Rounding Pattern

Test for Normality:

  • Density Curve + Normal Curve
  • QQ-plot
  • Shapiro Wilk test
    • H_0: data is normally distributed
    • H_α: data is not normally distributed
shapiro.test(x)
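
The three checks listed above can be run together. This is a minimal base-R sketch on simulated data; the variable obs, the seed, and the distribution parameters are arbitrary assumptions:

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
obs <- rnorm(100, mean = 50, sd = 10)
m <- mean(obs)
s <- sd(obs)

# 1. Density curve of the data with a fitted normal curve on top
plot(density(obs), main = "Density curve vs. normal curve")
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)

# 2. QQ-plot: points close to the line suggest normality
qqnorm(obs)
qqline(obs)

# 3. Shapiro-Wilk test: a small p-value is evidence against H_0 (normality)
res <- shapiro.test(obs)
print(res)
```

A large p-value does not prove normality; it only means the test found no evidence against it, so the plots are worth reading alongside the test.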

6. Boxplot

Viewing the statistics behind a boxplot
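
One way to read off the numbers a boxplot is drawn from is base R's boxplot.stats(). The data below is made up for illustration:

```r
# Hypothetical data with one clear outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x)

# lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# points beyond 1.5 * IQR from the hinges, drawn individually as outliers
print(bs$out)
```

For a ggplot2 boxplot, layer_data() on the plot object returns comparable columns (ymin, lower, middle, upper, ymax). The box edges may differ slightly from the hinges above, because ggplot2 computes quartiles with quantile().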


Density Curve

+ geom_density()
or ggvis

Violin plots


Ridgeline plot

-> ggridges package

7. Categorical Variables

Types of data

  • nominal - no fixed category order. Sort from highest to lowest count (left to right, or top to bottom)
  • ordinal - fixed category order. Sort in the logical order of the categories (left to right, or starting at the bottom OR the top)
  • discrete - small # of possibilities ('real' discrete, vs. 'fake' discrete such as rounded heights) -> Cleveland dot plot
  • Not always clearcut: nominal vs. ordinal, ordinal vs. discrete, and…
  • Sometimes numbers = nominal, not discrete

8. WebScraping_rvest


Data in table form

library(tidyverse)
library(rvest)
library(robotstxt)
paths_allowed("https://cran.r-project.org/web/packages/forcats/index.html")

forcats_data <- read_html("https://cran.r-project.org/web/packages/forcats/index.html") %>% html_table()
length(forcats_data)

forcats_data[[1]]

mytable <- forcats_data[[1]]
str(mytable)

version <- mytable %>% filter(X1 == "Version:") %>% pull(X2)
date <- mytable %>% filter(X1 == "Published:") %>% pull(X2)

Data not in table form

<h2 id="current_visitors" class="data">319,942</h2>

  • h2 tag -> html_nodes("h2")
  • id attribute -> html_nodes("#current_visitors")
  • class attribute -> html_nodes(".data")

paths_allowed("https://analytics.usa.gov/")

webdata <- read_html("https://analytics.usa.gov/")
webdata %>% html_nodes("h2")

webdata %>% html_nodes("#current_visitors")

webdata %>% html_nodes(".data")

webdata %>% html_nodes("h2") %>% html_text()

webdata_dl <- read_html("analytics.html")  # save the page locally first, then read it
webdata_dl %>% html_nodes("h2") %>% html_text()

webdata %>% html_nodes("script")
webdata %>% html_nodes("script") %>% html_attr("type")

9. Categorical Variables Code

9.1 Character vs factor data

  • character data: plotted alphabetically
  • factor data: plotted in order of factor levels

9.2 Binned, ordinal data, levels out of order

Recoding factor levels -> fct_recode()
x <- factor(c("G234", "G452", "G136")) 
y <- fct_recode(x, Physics = "G234", Math = "G452", Chemistry = "G136")
If the row order is correct, use fct_inorder()
df <- data.frame(temperature = factor(c("cold", "warm", "hot")), count = c(15, 5, 22)) 

# row order is correct (think: factor in ROW order) 
ggplot(df, aes(x = fct_inorder(temperature), y = count)) + 
	geom_col(color = mycolor, fill = myfill) + 
	theme_grey(16)
# without fct_inorder() the x-axis would be ordered cold-hot-warm (alphabetical)
fct_relevel(): move levels to a new position
  • move levels to the beginning
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2")

## [1] A B C move1 D E move2 F 
## Levels: move1 move2 A B C D E F
  • to move levels after an item (by position)
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2", after = 4) # move after the fourth item

## [1] A B C move1 D E move2 F 
## Levels: A B C D move1 move2 E F
  • move levels to the end
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F") 
fct_relevel(x, "move1", "move2", after = Inf)

## [1] A B C move1 D E move2 F 
## Levels: A B C D E F move1 move2


9.3 Binned, nominal

Order bars by frequency count using fct_reorder()

P.S. Note ⚠️: if the bars are flipped to horizontal, use .desc = FALSE
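
A minimal sketch of ordering bars by count with fct_reorder(), for both orientations. The data frame and its column names (color, count) are made up for illustration:

```r
library(ggplot2)
library(forcats)

# Hypothetical binned nominal data: one row per category, counts already tallied
df <- data.frame(color = c("blue", "green", "red", "yellow"),
                 count = c(12, 30, 7, 21))

# Vertical bars: tallest first, so reorder in descending order of count
ord_desc <- fct_reorder(df$color, df$count, .desc = TRUE)
p_vert <- ggplot(df, aes(x = fct_reorder(color, count, .desc = TRUE), y = count)) +
  geom_col() +
  xlab("color")

# Horizontal bars: coord_flip() draws the first level at the bottom,
# so ascending order (.desc = FALSE, the default) puts the longest bar on top
p_horiz <- ggplot(df, aes(x = fct_reorder(color, count), y = count)) +
  geom_col() +
  coord_flip() +
  xlab("color")

print(levels(ord_desc))
print(p_vert)
print(p_horiz)
```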

9.4 Unbinned, ordinal, levels out of order -> fct_relevel()

9.5 Unbinned, nominal data -> fct_infreq() (default is decreasing order of frequency)

Likewise, note ⚠️: Horizontal bars -> fct_rev() + fct_infreq()

ggplot(df, aes(fct_rev(fct_infreq(mmcolor)))) + 
	geom_bar() + 
	coord_flip() + 
	theme_grey(16)

9.6 Dealing with NAs -> fct_explicit_na()

df <- data.frame(temperature = factor(c("cold", "warm", "hot", NA)), count = c(15, 5, 22, 12))
 
ggplot(df, aes(x = fct_inorder(temperature), y = count)) + 
# using fct_rev() + fct_inorder() here does not help either:
# it only changes the order of the non-NA levels
	geom_col() + 
	coord_flip() + 
	ggtitle("(NA bar is too prominent)") + 
	theme_grey(16)


df %>% 
	mutate(temperature = fct_explicit_na(temperature, "NA") %>%
	fct_relevel("NA", "hot", "warm", "cold")) %>% 
	ggplot(aes(x = temperature, y = count)) + 
	geom_col(color = mycolor, fill = myfill) + coord_flip() + 
	theme_grey(16)


9.7 Rebinning

df <- as.data.frame(Titanic)
# problem
ggplot(df, aes(Class, Freq)) + 
	geom_col(color = "grey50", fill = "lightblue") + 
	theme_grey(16)


df %>% 
	group_by(Class) %>% 
	summarize(Freq = sum(Freq)) %>% 
	ggplot(aes(Class, Freq)) + 
	geom_col(color = "grey50", fill = "lightblue") + 
	theme_grey(16)


9.8 Percentages, more than one group!

# 1. Proportion of each Class
df %>% 
	group_by(Class) %>% 
	summarize(Freq = sum(Freq)) %>% 
	mutate(prop = Freq/sum(Freq))
	
# 2. Overall percentages: proportion of each Class + Survived combination
df2 <- df %>% 
 	group_by(Class, Survived) %>% 
 	summarize(Freq = sum(Freq)) %>% 
 	ungroup()  # very important 

df2 %>% mutate(prop = Freq/sum(Freq))


  • summarize() removes the last group
df %>% group_by(Class, Survived) %>% groups()
# shows 2 groups: Class + Survived
df %>% group_by(Class, Survived) %>% summarize(Freq = sum(Freq)) %>% groups()
# shows only the first: Class

10. Dependency Relationships

  • Linear model
    + geom_smooth(method = 'lm', se = FALSE)

  • Residual Plot
    Augment accepts a model object and a dataset and adds information about each observation in the dataset. Most commonly, this includes predicted values in the .fitted column, residuals in the .resid column, and standard errors for the fitted values in a .se.fit column. New columns always begin with a . prefix to avoid overwriting columns in the original dataset.


library(broom) 
df <- mod %>% augment() 
ggplot(df, aes(.fitted, .std.resid)) + 
	geom_point() + 
	geom_hline(yintercept = 0, col = "blue")
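
The snippet above assumes a fitted model object mod already exists. As a rough base-R sketch of what the augmented columns contain, the example below fits a hypothetical model (mpg on wt from mtcars, chosen only for illustration) and builds the same three quantities by hand. The .std.resid column is assumed here to correspond to rstandard():

```r
# Hypothetical example model; any lm object would do
mod <- lm(mpg ~ wt, data = mtcars)

res_df <- data.frame(
  .fitted    = fitted(mod),     # predicted values
  .resid     = residuals(mod),  # raw residuals: observed minus fitted
  .std.resid = rstandard(mod)   # standardized residuals
)

# Residual plot: no visible pattern around 0 supports the linear model
plot(res_df$.fitted, res_df$.std.resid,
     xlab = "fitted", ylab = "standardized residual")
abline(h = 0, col = "blue")

print(head(res_df))
```

A curve or a funnel shape in this plot would suggest that a straight line is the wrong model or that the variance is not constant.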


  • Interactive (Plotly ggplot2 library)
library(plotly) 
ggplotly(g) # g is a ggplot object
  • Interactive (Plotly R library)
plot_ly()
  • What can a scatterplot tell us?
    • Associations (describe what you see)
    • Outliers
    • Clusters
    • Gaps
    • Barriers (boundaries)
    • Conditional relationships (different relationships for different intervals of x)
  • What if the scatterplot has a great many points?
    • set alpha & stroke
    • Don’t plot all points (remove outliers, subset data, sample data)
    • Transform to log scale
    • Heatmaps (bin counts or density estimates)
    • Density contour lines
    • Combination of above
    • Multiple variable : scatterplot matrices
# example
# (1) subset
binned <- movies %>%
	mutate(mybin = ntile(votes, 10))  # ntile() splits the rows into 10 groups; then filter() one group and plot it

# (2) log
ggplot()+
	geom_point()+
	scale_x_log10() # breaks can also be set here

# (3) heatmap
# (3.1) square heatmap of bin counts (default: 30 bins)
ggplot()+
	scale_fill_viridis_c()+ # fill color scale
	# or pick the colors yourself: scale_fill_gradient(low = '#F6F8FB',high = '#09005F')
	theme_classic()+
	geom_bin2d() # binwidth = c(xx, xx) controls the size of the squares
	
# (3.2) hex heatmap
+ geom_hex() # binwidth can be set here too

# 4.1 Density estimate contour lines
library(MASS)
ggplot()+geom_point()+geom_density_2d()

# 4.2 2D Kernel density estimate
f <- kde2d(x,y,n)
image(f)
# w/ contour lines
contour(f,add = T)
# w/ points
points(x,y,pch = xx)

# 4.3another 2D kernel density estimation
smoothScatter(x,y)

# 4.4 calculate the kde, plot with ggplot2
df <- con2tr(f)
ggplot(df,aes(x,y))+
	geom_contour(aes(z = z))
# or
ggplot(df,aes(x,y))+
	geom_tile(aes(fill = z))+
	scale_fill_viridis_c()
	
# 5. scatterplot matrices
plot(data_frame)	

# or
library(lattice)
splom(splomvar)

11. Graphical Perception

Graphs can lie

Ordered Elementary Tasks

  • Position along a common scale
  • Position along identical, nonaligned scales
  • Length
  • Angle / Slope
  • Area
  • Volume
  • Color hue / Color saturation / Density

11.1 Position along a common scale

11.2 Position along identical, nonaligned scales

11.3 Length


12. Multivariate Continuous

  • Two continuous variables: scatterplot
  • Three continuous variables:
    • scatterplot matrix
    • 3D scatterplot (R: scatterplot3d)
    • interactive 3D scatterplot
library(plotly) 
plot_ly(df, x = ~x, y = ~y, z = ~z, mode = "markers", marker = list(size = 4)) %>% add_markers()
  • Four continuous variables: Parallel Coordinates Plot

12.1 Drawing with ggplot2

x <- rnorm(50, 20, 5) 
y <- runif(50, 8, 12) - x 
df <- data.frame(x, y)

tidydf <- df %>% 
	select(x, y) %>% 
	rownames_to_column("ID") %>% 
	gather(var, value, -ID)
	
ggplot(tidydf, aes(x = var, y = value, group = ID)) + geom_line()
# note the group aesthetic: one line per ID

Plots compared here (figures not preserved): the paired data, no group aesthetic, grouped but not standardized.

Grouped and standardized

standardize <- function(x) (x-mean(x))/sd(x) 

df2 <- tidydf %>% 
	group_by(var) %>% 
	mutate(value = standardize(value)) %>% 
	ungroup() 
	
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Standardize")

Grouped and rescaled to [0, 1]

df2 <- tidydf %>% 
	group_by(var) %>% 
	mutate(value = scales::rescale(value)) %>% 
	ungroup() 
	
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Rescale")
  • What if the variables being compared have different distributions?
# original plot
x <- rnorm(50, 20, 5)
weirdvar <- c(1, rep(50, 48), 100) 
df <- data.frame(x, weirdvar) 
tidydf <- df %>% 
	rownames_to_column("ID") %>% 
	gather(var, value, -ID) 

ggplot(tidydf, aes(x = var, y = value, group = ID)) + 	
	geom_line(alpha = .5)

# standardized
tidydf %>%
    group_by(var) %>%
    mutate(value = standardize(value)) %>%
    ungroup() %>%
    ggplot(aes(x = var, y = value, group = ID))+
    geom_line()

# rescaled to [0, 1]
tidydf %>%
    group_by(var) %>%
    mutate(value = scales::rescale(value)) %>%
    ungroup() %>%
    ggplot(aes(x = var, y = value, group = ID))+
    geom_line()

12.2 Drawing with ggparcoord from GGally

(1) scale = "globalminmax": no transformation
library(GGally) 
mystates <- data.frame(state.x77) %>% 
	rownames_to_column("State") %>% 
	mutate(Region = factor(state.region)) 
# state.region is a separate vector; it is not part of the state.x77 dataset
 
mystates$Region <- factor(mystates$Region, levels = c("Northeast", "North Central", "South","West")) 

ggparcoord(mystates, columns = 2:9, scale = "globalminmax")

(2) scale = "std" (default)
ggparcoord(mystates, columns = 2:9)


(3) scale = std (default) + reordered
ggparcoord(mystates, columns = c(2, 4, 6, 8, 3, 5, 7, 9))

(4) alpha (alphaLines) + rescale (scale = "uniminmax")
ggparcoord(mystates, columns = 2:9, alphaLines = .3, scale = "uniminmax")

(5) Dataset with repeats -> Splines
x <- 1:10 
y <- c(2,2,4,4,5,5,5,10,10,10) 
z <- c(3,3,2,3,3,7,7,5,7,7) 
w <- c(1, 1, 1, 7, 7, 7, 8, 8, 8, 8) 
df <- data.frame(x,y,z, w) 
ggparcoord(df, columns = 1:4, scale = "globalminmax") + 
    geom_vline(xintercept = 1:4, color = "lightblue")

# drawn like this, you cannot tell which line is which once they pass through a shared node

ggparcoord(df, columns = 1:4, scale = "globalminmax",splineFactor = 10) + 
    geom_vline(xintercept = 1:4, color = "lightblue")
# alpha + rescale + splines can all be combined

# Alpha + rescale + splines + group
ggparcoord(mystates, columns = 2:9, alphaLines = .5, scale = "uniminmax", splineFactor = 10, groupColumn = 10) + 
	geom_vline(xintercept = 2:9, color = "lightblue")


(6) Highlighting a trend -> ifelse+scale_color_manual
mystates %>% 
	mutate(color = factor(ifelse(Murder > 11, 1, 0))) %>% 
	arrange(color) %>%
	ggparcoord(columns = 2:9, groupColumn = "color") + 
	scale_color_manual(values = c("grey70", "red")) + 
	coord_flip() + 
	guides(color = "none") + 
	ggtitle("States with Murder Rate > 11 (per 100000) in red") 


(7) Watch out for categorical variables
library('d3r')
data.frame(Titanic) %>% 
	parcoords( rownames = F, # turn off rownames from the data.frame
	brushMode = "1D-axes" , 
	reorderable = T , 
	queue = T , 
	color = list( colorBy = "Region" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) , 
	withD3 = TRUE )

12.3 Html widget: parcoords -> interactive

# See: http://www.buildingwidgets.com/blog/2015/1/30/week-04-interactive-parallel-coordinates-1 
# devtools::install_github("timelyportfolio/parcoords") 

library(parcoords) 
mystates %>% 
	arrange(Region) %>% 
	parcoords( rownames = F ,  # turn off rownames from the data.frame
		brushMode = "1D-axes" , 
		reorderable = T , 
		queue = T,
		alpha = 0.5 )

# with color
parcoords(mystates , 
	rownames = F , 
	brushMode = "1D-axes" , 
	reorderable = T , 
	queue = T , 
	color = list( colorBy = "Region" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) , 
	withD3 = TRUE )

13. Multivariate Categorical

  • Frequency
    • Bar charts
      • Stacked bar chart
      • Grouped bar chart (2variables)
      • Grouped bar chart w/ facets (3 variables)
    • Cleveland dot plots
  • Proportion / Association
    • Mosaic plots
    • Fluctuation diagrams

Chi Square Test of Independence

H_0: the two variables are independent
Original dataset (figure not preserved):

localmat <- as.matrix(local[,2:3])
rownames(localmat) <- local$Age
X <- chisq.test(localmat, correct = FALSE)

X$observed
X$expected

X 
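
The course dataset local is not reproduced here, so as a stand-in the sketch below runs the same test on a two-way table built from R's built-in HairEyeColor data. It also recomputes the expected counts by hand to show where X$expected comes from:

```r
# Stand-in data: collapse the built-in 3-way table to Hair x Eye
tab <- margin.table(HairEyeColor, c(1, 2))

X <- chisq.test(tab, correct = FALSE)

print(X$observed)
print(round(X$expected, 1))

# Under H_0 (independence): expected = row total * column total / grand total
manual <- outer(rowSums(tab), colSums(tab)) / sum(tab)

print(X)
```

The degrees of freedom are (rows - 1) * (columns - 1), here 3 * 3 = 9. A small p-value is evidence against independence.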

Mosaic plots


P.S.

  • “Treatment” level should be on the bottom and darker than the other shades (for ordinal data)
  • The levels for all ordinal data should appear in order.
  • Choose one variable to order by frequency count
  • vcd's mosaic() expects the frequency column to be named "Freq" by default

Mosaic pairs plot


Similar plots

  • mosaic plot = filled rectangular plot with consistent number of rows and columns, where each small rectangle represents a unique combination of levels of factors of the variables displayed
  • treemap = filled rectangular plot representing hierarchical data (fill color does not necessarily represent frequency count)
  • spine plot = mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction

Categorical data formats - Conversions

  • cases

  • counts (tidy data with Freq column)

  • contingency or pivot table

Likert data (satisfaction ratings)

Stacked Bar Chart


Diverging Stacked Bar Chart


Diverging Stacked Bar Chart w/ Separate Neutrals

14. Alluvial diagrams

14.1 Alluvial Plots in ggplot2 (official vignette)

[Figure: alluvial plot of the Titanic data with three axes (Class, Sex, Age), filled by Survived]

Define the following elements of a typical alluvial plot:

  • An axis is a dimension (variable) along which the data are vertically grouped at a fixed horizontal position. The plot above uses three categorical axes: Class, Sex, and Age.
  • The groups at each axis are depicted as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.
  • Horizontal (x -) splines called alluvia span the width of the plot. In this plot, each alluvium corresponds to a fixed value of each axis variable, indicated by its vertical position at the axis, as well as of the Survived variable, indicated by its fill color.
  • The segments of the alluvia between pairs of adjacent axes are flows.
  • The alluvia intersect the strata at lodes. The lodes are not visualized in the above plot, but they can be inferred as filled rectangles extending the flows through the strata at each end of the plot or connecting the flows on either side of the center stratum.


14.2 Alluvial data

(1) Alluvia (wide) format
head(as.data.frame(UCBAdmissions))


install.packages("ggalluvial")
library(ggalluvial)
library(ggplot2)
library(dplyr)

ggplot(as.data.frame(UCBAdmissions),
       aes(y = Freq, axis1 = Gender, axis2 = Dept, axis3 = Admit)) +
    geom_alluvium(aes(fill = Gender), width = 1/12) +
    geom_stratum(width = 1/12, fill = "grey80", color = "grey") +
    geom_label(stat = "stratum", 
               aes(label = after_stat(stratum))) +
    scale_x_discrete(expand = c(.05, .05)) +
    scale_fill_brewer(type = "qual", palette = "Set1") +
    ggtitle("UC Berkeley admissions and rejections") +
 theme_void()

-> change geom_alluvium to geom_flow

An example (in Chinese) of wide vs. long data

(2)Lodes (long) format
UCB_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
                           axes = 1:3,
                           id = "Cohort")


  • x, the “key” variable indicating the axis to which the row corresponds, which are to be arranged along the horizontal axis;
  • stratum, the “value” taken by the axis variable indicated by x; and
  • alluvium, the indexing scheme that links the rows of a single alluvium.
data(Refugees, package = "alluvial")
country_regions <- c(
  Afghanistan = "Middle East",
  Burundi = "Central Africa",
  `Congo DRC` = "Central Africa",
  Iraq = "Middle East",
  Myanmar = "Southeast Asia",
  Palestine = "Middle East",
  Somalia = "Horn of Africa",
  Sudan = "Central Africa",
  Syria = "Middle East",
  Vietnam = "Southeast Asia"
)
Refugees$region <- country_regions[Refugees$country]
ggplot(data = Refugees,
       aes(x = year, y = refugees, alluvium = country)) +
  geom_alluvium(aes(fill = country, colour = country),
                alpha = .75, decreasing = FALSE) +
  scale_x_continuous(breaks = seq(2003, 2013, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -30, hjust = 0)) +
  scale_fill_brewer(type = "qual", palette = "Set3") +
  scale_color_brewer(type = "qual", palette = "Set3") +
  facet_wrap(~ region, scales = "fixed")

The format allows us to assign aesthetics that change from axis to axis along the same alluvium, which is useful for repeated measures datasets.

15. Simpson's Paradox

16. Color

RColorBrewer Color Schemes

  • sequential
  • diverging
  • qualitative (for categorical data)
  • perceptually uniform color spaces

Continuous data

+scale_color_viridis_c()
+scale_color_distiller(palette = 'PuBu')

create your own sequential
+scale_color_gradient(low = 'white', high = 'red')
create your own diverging
+scale_color_gradient2(low = 'blue', mid = 'white', high = 'red')

Discrete data

+scale_color_viridis_d()
+scale_color_brewer(palette = 'PuBu')
create your own
+scale_color_manual(values = c('red', 'yellow', 'blue'))

Color Vision Deficiency

  • protanopia (red)
  • deuteranopia (green)
  • tritanopia (blue)

Legend order matches graph order

17. Heatmaps

  • can show frequency counts (2D histogram) or value of a third variable
  • can be used for continuous or categorical data (both for axes and fill color)

17.1 Drawing heatmaps with ggplot2

(1)geom_tile with numerical data, compare to geom_point
library(ggplot2) # plotting
library(gridExtra) # arranging plots side by side

x <- 1:3 
y <- c(5, 2, 7) 
df <- data.frame(x, y) 
g1 <- ggplot(df, aes(x, y)) + geom_point() 
g2 <- ggplot(df, aes(x, y)) + geom_tile() 
grid.arrange(g1, g2, nrow = 1)


(2)geom_raster as geom_tile w/ uniform w, h & faster

geom_tile() uses the center of the tile and its size (x, y, width, height). geom_raster is a high performance special case for when all the tiles are the same size

x <- c("apples", "bananas", "oranges") 
y <- c("NJ", "NY", "NJ") 
df <- data.frame(x, y) 
ggplot(df, aes(x,y)) + geom_raster() + theme_grey()


(3)Complete set of (x,y) pairs
y <- rep(c("apples", "bananas", "oranges"), 2) 
x <- rep(c("NJ", "NY"), each = 3) 
df <- data.frame(x, y) 
ggplot(df, aes(x, y)) + geom_raster()

-> Add fill color

set.seed(2019) 
df$z <- sample(6) 
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))


-> What if z is categorical ?

df$z <- c("A", "C", "B", "B", "A", "D") 
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))


-> Create a heat map theme
theme(axis.line = element_blank(), axis.ticks = element_blank()) -> removes the axis lines and tick marks

theme_heat <- theme_classic() + 
  theme(axis.line = element_blank(), axis.ticks = element_blank()) 
ggplot(df, aes(x, y)) + 
  geom_raster(aes(fill = z)) + 
  theme_heat


-> Add coord_fixed

ggplot(df, aes(x, y)) + 
  geom_raster(aes(fill = z)) + 
  coord_fixed() + 
  theme_heat

-> Add white border
(doesn’t work with geom_raster())

ggplot(df, aes(x, y)) + 
  geom_tile(aes(fill = z), color = 'white') + 
  coord_fixed() + 
  theme_heat

Part II. Homework Review

Q1: Draw a parallel coordinates plot of the numeric columns in the dataset -> how do you find all the numeric columns in a dataset?

library(dplyr)
library(palmerpenguins)
select_if(penguins_raw, is.numeric)  # select(penguins_raw, where(is.numeric)) also works in dplyr >= 1.0.0

Multiple-choice questions (screenshots not preserved)


something else

Making Faceted Heatmaps with ggplot2

References 🧾

Zhihu: Dwzb
Last year's EDAV course slides by the professor
Learning R functions: facet_wrap, facet_grid
