I. Class Notes
1. Intro to EDA
WHY
• detecting patterns
• finding outliers
• making comparisons
• identifying clusters
HOW
• Deep understanding of the dataset, where it came from, what its limitations are
• Experiment with different graphic forms, based on theory on what forms work well with different data types
Evaluating Graphs
• Wrong or misleading
• Meaningless
• Little added value
• Good alternatives
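Q0: About RStudio + R Markdown
By the way, when you run a chunk with the ▶️ button in the source pane, the result is shown inline under the chunk by default (Chunk Output Inline). To send it to the console instead, put this at the very top of the file:
---
output: pdf_document
editor_options:
  chunk_output_type: console
---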
2. ggplot2
Recommended order for writing a plot:
where <labels> includes
- ggtitle()
- labs()
- xlab()
- ylab()
- annotate()
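Odds and ends: scale / coord / theme (a plot keeps only one coord and one scale per aesthetic; a later one replaces the earlier one)
- after the plot is built, append any of:
- + theme_wsj() # library(ggthemes)
- + coord_polar() to switch to polar coordinates
- + facet_grid()
- + scale_x_reverse() to reverse the x-axis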
- One scale per mapping
- scale_x_date()
- scale_y_continuous()
- scale_color_manual()
- scale_fill_viridis_c()
Recommended plotting code
library(Sleuth3) # data
library(ggplot2) # plotting
# load data
finches <- Sleuth3::case0201
# finch histograms by year with overlaid density curves
ggplot(finches, aes(x = Depth, y = after_stat(density))) +
# plotting
geom_histogram(bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
geom_density(color = "#3D6480") +
facet_wrap(~Year) +
# formatting
ggtitle("Severe Drought Led to Finches with Bigger Chompers",
subtitle = "Beak Depth Density of Galapagos Finches by Year") +
labs(x = "Beak Depth (mm)", caption = "Source: Sleuth3::case0201") +
theme(plot.title = element_text(face = "bold")) +
theme(plot.subtitle = element_text(face = "bold", color = "grey35")) +
theme(plot.caption = element_text(color = "grey68"))
3. Histograms
For Continuous Variables, we’re looking for features such as:
- Asymmetry
- Outliers
- Multimodality
- Gaps
- Heaping/Rounding
- Impossibilities/Errors
Parameters
- bin boundaries
- bin number
How do you show two ggplot plots side by side?
# using binwidth
p1 <- ggplot(finches, aes(x = Depth)) +
geom_histogram(binwidth = 0.5, boundary = 6) +
# You can use boundary to specify the endpoint of any bin
ggtitle("Changed binwidth value")
# using bins
p2 <- ggplot(finches, aes(x = Depth)) +
geom_histogram(bins = 48, boundary = 6) +
ggtitle("Changed bins value")
# format plot layout
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
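When drawing a histogram, pay close attention to the bin boundaries and to which side of each bin is open.
A simple trick is to choose center (or boundary) so that the bin edges fall on xx.5 values; then no integer data point sits on an edge. Note that binwidth = 5 with center = 52.5, as in the code below, puts the edges at 50, 55, ..., so integers on those edges go to whichever side is closed.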
# import ggplot
library(ggplot2)
# must store data as dataframe
df <- data.frame(x)
# plot data
ggplot(df, aes(x)) +
geom_histogram(color = "grey", fill = "lightBlue",
binwidth = 5, center = 52.5) +
ggtitle("ggplot2 histogram of x")
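Q1: Both geom_ and stat_ have a boxplot function; how do the plots they draw differ?
The relationship between geom_ and stat_
They can substitute for each other.
e.g. geom_bar and stat_count: geom_bar uses stat = "count" by default, and stat_count uses geom = "bar" by default, so both draw a bar chart out of the box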
# geom_bar vs stat_count
library(ggplot2)
library(MASS)
ggplot(mpg,aes(x=class)) + geom_bar() # bar chart from a single variable
ggplot(mpg,aes(x=class)) + stat_count() # same as above
# geom_bar vs geom_col -> leads to [Q2]
ggplot(mpg,aes(x=class,y=displ)) + geom_bar(stat="identity", fill = 'pink') # changing the default stat from "count" to "identity" lets geom_bar take two variables (and adds some color)
# stat="identity" applies no statistical transformation: each x is drawn with its own y
# stat="count" (the geom_bar default) would instead count the rows at each x
ggplot(mpg,aes(x=class,y=displ)) + geom_col(fill = 'pink') # same as above; geom_col defaults to stat="identity"
# Note the two common color arguments of a geom: color (bar outline) and fill (bar interior)
# But this raises a question. Take
# ggplot(mpg,aes(x=class,y=displ)) + geom_bar(stat="identity", fill = 'pink')
# and rewrite it as below. Is the result the same?
ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="bar", fill = 'pink')
# ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="col") gives the same result as the line above
# but neither of these two matches the earlier plots
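Having seen that geom_ and stat_ can stand in for each other, a natural guess is that geom_bar with stat changed to "identity" and stat_identity with geom changed to "bar" should give the same plot. In fact they do not.
What does the latter draw? The scatterplot gives a hint. The first bar tops out at 7, and in the scatterplot the highest point for 2seater also has a y value of 7. So the guess is that each point is replaced by a bar of the same height; because x is discrete here, many bars overlap and cannot be told apart. Trying a dataset whose x is continuous, mtcars, confirms the guess.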
ggplot(mtcars, aes(wt, mpg)) + stat_identity(geom="bar")
# continuous variables
mtcars[c('wt','mpg')]
# part of the returned result
wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1
Duster 360 3.570 14.3
Merc 240D 3.190 24.4
Merc 230 3.150 22.8
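Why do geom_bar(stat = "identity") and stat_identity(geom = "bar") give different results? Their default position differs: geom_bar() defaults to position = "stack", stat_identity() to position = "identity".
- position = "identity": each bar is drawn in place, one in front of another, so only the tallest is visible
- position = "stack": bars are stacked on top of each other, like building blocks
# change it to
ggplot(mpg,aes(x=class,y=displ)) + stat_identity(geom="bar",position="stack", fill = 'pink')
# now the result is identical
ggplot(mpg,aes(x=class,y=displ)) + stat_identity() # scatterplot
ggplot(mpg,aes(x=class,y=displ)) + geom_point() # equivalent to the previous line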
Q2: Both draw bar charts; how do geom_bar() and geom_col() differ?
Difference:
geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights). If you want the heights of the bars to represent values in the data, use geom_col() instead. geom_bar() uses stat_count() by default: it counts the number of cases at each x position. geom_col() uses stat_identity(): it leaves the data as is.
> library(ggplot2)
> df <- data.frame(x = rep(c(2.9, 3.1, 4.5), c(5, 10, 4)))
> ggplot(df, aes(x)) + geom_bar()
> ggplot(df, aes(x)) + geom_col()
Error: geom_col requires the following missing aesthetics: y
# Same picture as the geom_bar() plot above, but one counts rows while the other plots y values supplied in the data
> df1 <- data.frame(x = c(2.9, 3.1, 4.5), y = c(5, 10, 4))
> ggplot() + geom_col(data = df1, aes(x,y))
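Q3: Differences between count, relative frequency, and density histograms
P.S. When reading a histogram, check whether all bins have the same binwidth; unequal widths on a count scale are misleading.
-> Common fixes:
- If you have the raw data, redraw the plot
- If you do not have the raw data, merge the two bins with binwidth = 5
- Draw a density histogram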
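To make the three scales concrete, here is a minimal base-R sketch on a made-up vector with deliberately unequal bin widths (both the data and the breaks are assumptions for illustration). Count is the number of points per bin, relative frequency is count divided by n, and density is relative frequency divided by bin width, so only the density bars have areas that sum to 1.

```r
# Made-up data and unequal breaks, chosen only for illustration
x <- c(50, 51, 53, 55, 56, 60, 65, 65, 68)
breaks <- c(50, 55, 60, 70)

# hist() closes bins on the right by default; plot = FALSE returns the numbers only
h <- hist(x, breaks = breaks, plot = FALSE)

counts   <- h$counts                   # points per bin
rel_freq <- counts / sum(counts)       # share of all points per bin
dens     <- rel_freq / diff(h$breaks)  # share per unit of x

print(data.frame(width = diff(breaks), counts, rel_freq, dens))

# Bar areas (density * width) add up to 1 even though the widths differ
total_area <- sum(dens * diff(breaks))
```

On the count scale the wide last bin looks as important as the first; on the density scale its height is divided by its width, which is why the density histogram is the safe choice when bin widths differ.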
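Q4: ggvis = interactive plots
-> adjusting parameters of a histogram interactively while coding
The data can be downloaded from GitHub
(While looking for the data I found out that our professor is a big name 🐂)
The code currently does not work in RStudio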
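Q5: facet_grid vs facet_wrap
Wrapped facets: facet_wrap
facet_wrap lays the panels out as a single sequence, typically split by one variable, and "wraps" it left to right, top to bottom
Grid facets: facet_grid
facet_grid can split the data by more than one variable, one for the rows and one for the columns of the grid.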
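A small sketch of the two layouts on the mpg data that ships with ggplot2. The choice of class, drv and cyl as facetting variables is only an illustration; the panel counts are computed from the data, relying on the fact that facet_grid draws every row-by-column combination, including empty ones.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_wrap: one sequence of panels (one per class), wrapped into rows
p_wrap <- base + facet_wrap(~class)

# facet_grid: rows from drv, columns from cyl
p_grid <- base + facet_grid(drv ~ cyl)

# Panels to expect in each layout
n_wrap <- length(unique(mpg$class))
n_grid <- length(unique(mpg$drv)) * length(unique(mpg$cyl))
```

Printing p_wrap and p_grid draws the two layouts; the grid version keeps a shared axis per row and per column, which makes comparisons along one variable easier.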
4. tidy data
The table at the lower left is messy data; the tidy definition is:
- 1 variable per column
- 1 observation per row
- *also depends on the use case
# Takeaway: keep one column fixed (the id column stays as is) and pivot all the others
tidydata <- messydata %>%
pivot_longer(cols = !id, names_to = "roadtype",
values_to = "mpg")
-> But sometimes there is no id column
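One way to handle the no-id case, sketched under assumptions: the messydata table below is invented, and turning row names into an id with tibble::rownames_to_column() is just one option (the same helper is used later in these notes).

```r
library(tidyr)   # pivot_longer(), also provides %>%
library(tibble)  # rownames_to_column()

# Hypothetical messy table: one column per road type, no id column
messydata <- data.frame(city = c(21, 18, 25), hwy = c(29, 26, 33))

tidydata <- messydata %>%
  rownames_to_column("id") %>%   # row names become an explicit id column
  pivot_longer(cols = !id, names_to = "roadtype", values_to = "mpg")

print(tidydata)
```

Without the id, the long table would lose track of which city and hwy values came from the same row.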
5. Rounding Pattern
Test for Normality:
- Density Curve + Normal Curve
- QQ-plot
- Shapiro Wilk test
- H_0: data is normally distributed
- H_a: data is not normally distributed
shapiro.test(x)
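A hedged illustration of how the test behaves, using simulated data rather than anything from the course: one sample drawn from a normal distribution and one from a right-skewed exponential distribution. The seed and sample size are arbitrary.

```r
set.seed(1)            # arbitrary seed so the run is repeatable
x_norm <- rnorm(200)   # simulated normal data
x_skew <- rexp(200)    # simulated right-skewed data

res_norm <- shapiro.test(x_norm)
res_skew <- shapiro.test(x_skew)

# A small p-value is evidence against H_0 (normality)
print(c(normal = res_norm$p.value, skewed = res_skew$p.value))
```

A large p-value does not prove normality; it only means the test found no evidence against it, so it is worth reading alongside the QQ-plot.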
6. Boxplot
Inspecting the statistics behind a boxplot
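One way to look at those numbers, as a sketch: base R's boxplot.stats() returns the five values a boxplot draws plus the points flagged as outliers. The vector below is made up for illustration.

```r
# Made-up data with one obvious outlier
x <- c(1:9, 30)

bs <- boxplot.stats(x)

# $stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

# $out: points beyond 1.5 times the hinge spread from the box
print(bs$out)
```

Here the hinges are 3 and 8, so the upper fence is 8 + 1.5 * 5 = 15.5; the value 30 lies beyond it and the upper whisker stops at 9.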
Density Curve
+geom_density()
or ggvis
Violin plots
Ridgeline plot
-> ggridges package
7. Categorical Variables
Types of data
- nominal: no fixed category order. Sort from highest to lowest count (left to right, or top to bottom)
- ordinal: fixed category order. Sort in the logical order of the categories (left to right, or starting at the bottom OR the top)
- discrete: 'real' discrete data has a small number of possible values; 'fake' discrete data comes from rounding, e.g. rounded heights -> Cleveland dot plot
- Not always clearcut: nominal vs. ordinal, ordinal vs. discrete, and…
- Sometimes numbers = nominal, not discrete
8. WebScraping_rvest
Data in table form
library(tidyverse)
library(rvest)
library(robotstxt)
paths_allowed("https://cran.r-project.org/web/packages/forcats/index.html")
forcats_data <- read_html("https://cran.r-project.org/web/packages/forcats/index.html") %>% html_table()
length(forcats_data)
forcats_data[[1]]
mytable <- forcats_data[[1]]
str(mytable)
version <- mytable %>% filter(X1 == "Version:") %>% pull(X2)
date <- mytable %>% filter(X1 == "Published:") %>% pull(X2)
Data not in table form
<h2 id="current_visitors" class="data">319,942</h2>
- h2 tag
- html_nodes("h2")
- id attribute
- html_nodes("#current_visitors")
- class attribute
- html_nodes(".data")
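The three selectors can be tried offline on the one-line snippet above. This sketch parses that literal string instead of a live page, so the number is simply whatever the snippet contains.

```r
library(rvest)

# Parse the one-line snippet from the notes instead of a live page
page <- read_html('<h2 id="current_visitors" class="data">319,942</h2>')

by_tag   <- page %>% html_nodes("h2") %>% html_text()                 # select by tag
by_id    <- page %>% html_nodes("#current_visitors") %>% html_text()  # select by id
by_class <- page %>% html_nodes(".data") %>% html_text()              # select by class

# Strip the thousands separator to get a number
visitors <- as.numeric(gsub(",", "", by_id))
print(visitors)
```

All three selectors hit the same node here; on a real page the id is usually the most specific, while a tag or class selector may return many nodes.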
paths_allowed("https://analytics.usa.gov/")
webdata <- read_html("https://analytics.usa.gov/")
webdata %>% html_nodes("h2")
webdata %>% html_nodes("#current_visitors")
webdata %>% html_nodes(".data")
webdata %>% html_nodes("h2") %>% html_text()
webdata_dl <- read_html("analytics.html") # save the page locally first, then read the local copy
webdata_dl %>% html_nodes("h2") %>% html_text()
webdata %>% html_nodes("script")
webdata %>% html_nodes("script") %>% html_attr("type")
9. Categorical Variables Code
9.1 Character vs factor data
- character data: plotted alphabetically
- factor data: plotted in order of factor levels
9.2 Binned, ordinal data, levels out of order
Recoding factor levels -> fct_recode()
x <- factor(c("G234", "G452", "G136"))
y <- fct_recode(x, Physics = "G234", Math = "G452", Chemistry = "G136")
If the row order is correct, use fct_inorder()
df <- data.frame(temperature = factor(c("cold", "warm", "hot")), count = c(15, 5, 22))
# row order is correct (think: factor in ROW order)
ggplot(df, aes(x = fct_inorder(temperature), y = count)) +
geom_col(color = mycolor, fill = myfill) +
theme_grey(16)
# without fct_inorder() the x-axis falls back to alphabetical order: cold, hot, warm
fct_relevel() moves levels to a new position
- move levels to the beginning
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F")
fct_relevel(x, "move1", "move2")
## [1] A B C move1 D E move2 F
## Levels: move1 move2 A B C D E F
- to move levels after an item (by position)
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F")
fct_relevel(x, "move1", "move2", after = 4) # move after the fourth item
## [1] A B C move1 D E move2 F
## Levels: A B C D move1 move2 E F
- move levels to the end
x <- c("A", "B", "C", "move1", "D", "E", "move2", "F")
fct_relevel(x, "move1", "move2", after = Inf)
## [1] A B C move1 D E move2 F
## Levels: A B C D E F move1 move2
9.3 Binned, nominal
Order bars by frequency count using fct_reorder()
P.S. Note: if the bars are flipped to horizontal, use .desc = FALSE
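A minimal sketch of both orderings with fct_reorder(), on an invented three-row table (the column names color and count are assumptions for illustration).

```r
library(forcats)

# Hypothetical binned data: one row per category with its count
df_counts <- data.frame(color = factor(c("blue", "red", "green")),
                        count = c(5, 22, 15))

# Vertical bars: largest count first (left to right)
lv_desc <- levels(fct_reorder(df_counts$color, df_counts$count, .desc = TRUE))

# Horizontal bars (after coord_flip()): ascending order puts the largest bar on top
lv_asc <- levels(fct_reorder(df_counts$color, df_counts$count, .desc = FALSE))

print(lv_desc)
print(lv_asc)
```

The flip matters because coord_flip() draws the first level at the bottom, so ascending order reads from largest to smallest going down the page.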
9.4 Unbinned, ordinal, levels out of order -> fct_relevel()
9.5 Unbinned, nominal data -> fct_infreq() (default is decreasing order of frequency)
Same caveat for horizontal bars -> fct_rev() + fct_infreq()
ggplot(df, aes(fct_rev(fct_infreq(mmcolor)))) +
geom_bar() +
coord_flip() +
theme_grey(16)
9.6 Dealing with NAs -> fct_explicit_na() (deprecated since forcats 1.0.0 in favor of fct_na_value_to_level())
df <- data.frame(temperature = factor(c("cold", "warm", "hot", NA)), count = c(15, 5, 22, 12))
ggplot(df, aes(x = fct_inorder(temperature), y = count)) +
# even fct_rev() + fct_inorder() does not help here:
# they only reorder the levels other than NA
geom_col() +
coord_flip() +
ggtitle("(NA bar is too prominent)") +
theme_grey(16)
df %>%
mutate(temperature = fct_explicit_na(temperature, "NA") %>%
fct_relevel("NA", "hot", "warm", "cold")) %>%
ggplot(aes(x = temperature, y = count)) +
geom_col(color = mycolor, fill = myfill) + coord_flip() +
theme_grey(16)
9.7 Rebinning
df <- as.data.frame(Titanic)
# problem
ggplot(df, aes(Class, Freq)) +
geom_col(color = "grey50", fill = "lightblue") +
theme_grey(16)
df %>%
group_by(Class) %>%
summarize(Freq = sum(Freq)) %>%
ggplot(aes(Class, Freq)) +
geom_col(color = "grey50", fill = "lightblue") +
theme_grey(16)
9.8 Percentages, more than one group!
# 1. share of each Class
df %>%
group_by(Class) %>%
summarize(Freq = sum(Freq)) %>%
mutate(prop = Freq/sum(Freq))
# 2. Overall percentages: share of each Class + Survived combination
df2 <- df %>%
group_by(Class, Survived) %>%
summarize(Freq = sum(Freq)) %>%
ungroup() # very important
df2 %>% mutate(prop = Freq/sum(Freq))
- summarize() removes the last group
df %>% group_by(Class, Survived) %>% groups()
# shows two grouping variables: Class and Survived
df %>% group_by(Class, Survived) %>% summarize(Freq = sum(Freq)) %>% groups()
# shows only the first one: Class
10. Dependency Relationships
- Linear model
+ geom_smooth(method = "lm", se = FALSE)
- Residual Plot
Augment accepts a model object and a dataset and adds information about each observation in the dataset. Most commonly, this includes predicted values in the .fitted column, residuals in the .resid column, and standard errors for the fitted values in a .se.fit column. New columns always begin with a . prefix to avoid overwriting columns in the original dataset.
library(broom)
df <- mod %>% augment()
ggplot(df, aes(.fitted, .std.resid)) +
geom_point() +
geom_hline(yintercept = 0, col = "blue")
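The snippet above relies on a fitted model called mod that is not shown in these notes. A self-contained version, assuming a simple linear model on the built-in mtcars data as a stand-in:

```r
library(broom)
library(ggplot2)

# Assumed model, standing in for `mod` in the notes
mod <- lm(mpg ~ wt, data = mtcars)

# One row per observation, with .fitted, .resid, .std.resid, ... added
df_aug <- augment(mod)

p_resid <- ggplot(df_aug, aes(.fitted, .std.resid)) +
  geom_point() +
  geom_hline(yintercept = 0, col = "blue")
```

Printing p_resid shows the plot. Points scattered evenly around the blue line suggest the linear fit is adequate; a curve or a funnel shape suggests it is not.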
- Interactive (Plotly ggplot2 library)
library(plotly)
ggplotly(g) # g is a ggplot object
- Interactive (Plotly R library)
plot_ly()
- What can a scatterplot tell us?
- Associations (describe what you see)
- Outliers
- Clusters
- Gaps
- Barriers (boundaries)
- Conditional relationships (different relationships for different intervals of x)
- What if the scatterplot has too many points?
- set alpha & stroke
- Don’t plot all points (remove outliers, subset data, sample data)
- Transform to log scale
- Heatmaps (bin counts or density estimates)
- Density contour lines
- Combination of above
- Multiple variable : scatterplot matrices
# example
# (1) subset
binned <- movies %>%
  mutate(mybin = ntile(votes, 10)) # ntile() splits the rows into 10 groups; then filter() to one group and plot it
# (2) log
ggplot()+
geom_point()+
scale_x_log10() # breaks can also be set here
# (3) heatmap
# (3.1) square heatmap of bin counts (default: 30 bins)
ggplot()+
scale_fill_viridis_c()+ # fill color scale
# or choose the colors yourself: scale_fill_gradient(low = '#F6F8FB',high = '#09005F')
theme_classic()+
geom_bin2d() # binwidth = c(xx, xx) controls the size of the squares
# (3.2) hex heatmap
+ geom_hex() # binwidth can be set here too
# 4.1 Density estimate contour lines
library(MASS)
ggplot()+geom_point()+geom_density_2d()
# 4.2 2D Kernel density estimate
f <- kde2d(x,y,n)
image(f)
# w/ contour lines
contour(f,add = T)
# w/ points
points(x,y,pch = xx)
# 4.3another 2D kernel density estimation
smoothScatter(x,y)
# 4.4 calculate the kde, plot with ggplot2
df <- con2tr(f)
ggplot(df,aes(x,y))+
geom_contour(aes(z = z))
# or
ggplot(df,aes(x,y))+
geom_tile(aes(fill = z))+
scale_fill_viridis_c()
# 5. scatterplot matrices
plot(data_frame)
# or
library(lattice)
splom(splomvar)
11. Graphical Perception
Graphs can lie
Ordered Elementary Tasks
- Position along a common scale
- Position along identical, nonaligned scales
- Length
- Angle / Slope
- Area
- Volume
- Color hue / Color saturation / Density
11.1 Position along a common scale
11.2 Position along identical, nonaligned scales
11.3 Length
12. Multivariate Continuous
- Two continuous variables: scatterplot
- Three continuous variables:
- scatterplot matrix
- 3D scatterplot (R: scatterplot3d)
- interactive 3D scatterplot
library(plotly)
plot_ly(df, x = ~x, y = ~y, z = ~z, mode = "markers", marker = list(size = 4)) %>% add_markers()
- Four continuous variables: Parallel Coordinates Plot
12.1 Drawing with ggplot2
x <- rnorm(50, 20, 5)
y <- runif(50, 8, 12) - x
df <- data.frame(x, y)
tidydf <- df %>%
select(x, y) %>%
rownames_to_column("ID") %>%
gather(var, value, -ID)
ggplot(tidydf, aes(x = var, y = value, group = ID)) + geom_line()
# **group**
The data are paired
Without group
With group, not standardized
With group, standardized
standardize <- function(x) (x-mean(x))/sd(x)
df2 <- tidydf %>%
group_by(var) %>%
mutate(value = standardize(value)) %>%
ungroup()
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Standardize")
With group, rescaled to [0, 1]
df2 <- tidydf %>%
group_by(var) %>%
mutate(value = scales::rescale(value)) %>%
ungroup()
ggplot(df2, aes(x = var, y = value, group = ID)) + geom_line(alpha = .5) + ggtitle("Rescale")
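To see what the two transformations do to the numbers themselves, here is a small base-R sketch on a made-up vector. standardize() is the helper defined above; rescale01() is a hypothetical stand-in written here so the example does not need the scales package, and it is assumed to match what scales::rescale() does with its default 0 to 1 range.

```r
standardize <- function(x) (x - mean(x)) / sd(x)             # as in the notes
rescale01   <- function(x) (x - min(x)) / (max(x) - min(x))  # hypothetical stand-in for scales::rescale()

v <- c(2, 4, 6, 8, 10)   # made-up values

z <- standardize(v)   # mean 0, standard deviation 1
r <- rescale01(v)     # minimum 0, maximum 1

print(round(z, 3))
print(r)
```

Standardizing keeps outliers far from the bulk of the data, while rescaling pins the extremes to 0 and 1, so a single extreme value can squash every other line into a narrow band.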
- What if compare different distributions?
# original plot
x <- rnorm(50, 20, 5)
weirdvar <- c(1, rep(50, 48), 100)
df <- data.frame(x, weirdvar)
tidydf <- df %>%
rownames_to_column("ID") %>%
gather(var, value, -ID)
ggplot(tidydf, aes(x = var, y = value, group = ID)) +
geom_line(alpha = .5)
# standardized
tidydf %>%
group_by(var) %>%
mutate(value = standardize(value)) %>%
ungroup() %>%
ggplot(aes(x = var, y = value, group = ID))+
geom_line()
tidydf %>%
group_by(var) %>%
mutate(value = scales::rescale(value)) %>%
ungroup() %>%
ggplot(aes(x = var, y = value, group = ID))+
geom_line()
12.2 Drawing with ggparcoord from GGally
(1) scale = "globalminmax": no rescaling at all
library(GGally)
mystates <- data.frame(state.x77) %>%
rownames_to_column("State") %>%
mutate(Region = factor(state.region))
# state.region is a separate vector; it is not part of the state.x77 dataset
mystates$Region <- factor(mystates$Region, levels = c("Northeast", "North Central", "South","West"))
ggparcoord(mystates, columns = 2:9, scale = "globalminmax")
(2) scale = "std" (default)
ggparcoord(mystates, columns = 2:9)
(3) scale = std (default) + reordered
ggparcoord(mystates, columns = c(2, 4, 6, 8, 3, 5, 7, 9))
(4) alpha (alphaLines) + rescale (scale = "uniminmax")
# scale = "uniminmax" rescales each variable to [0, 1]; the default would be "std"
ggparcoord(mystates, columns = 2:9, alphaLines = .3, scale = "uniminmax")
(5) Dataset with repeats -》 Splines
x <- 1:10
y <- c(2,2,4,4,5,5,5,10,10,10)
z <- c(3,3,2,3,3,7,7,5,7,7)
w <- c(1, 1, 1, 7, 7, 7, 8, 8, 8, 8)
df <- data.frame(x,y,z, w)
ggparcoord(df, columns = 1:4, scale = "globalminmax") +
geom_vline(xintercept = 1:4, color = "lightblue")
# drawn this way, you cannot tell which line continues as which once several pass through the same point
ggparcoord(df, columns = 1:4, scale = "globalminmax",splineFactor = 10) +
geom_vline(xintercept = 1:4, color = "lightblue")
# alpha + rescale + splines can be combined
# Alpha + rescale + splines + group
# scale = "uniminmax" here; the default would be "std"
ggparcoord(mystates, columns = 2:9, alphaLines = .5, scale = "uniminmax", splineFactor = 10, groupColumn = 10) +
geom_vline(xintercept = 2:9, color = "lightblue")
(6) Highlighting a trend -> ifelse+scale_color_manual
mystates %>%
mutate(color = factor(ifelse(Murder > 11, 1, 0))) %>%
arrange(color) %>%
ggparcoord(columns = 2:9, groupColumn = "color") +
scale_color_manual(values = c("grey70", "red")) +
coord_flip() +
guides(color = "none") +
ggtitle("States with Murder Rate > 11 (per 100000) in red")
(7) Watch out for categorical variables
library(d3r)
library(parcoords) # parcoords() comes from this package
data.frame(Titanic) %>%
parcoords( rownames = F, # turn off rownames from the data.frame
brushMode = "1D-axes" ,
reorderable = T ,
queue = T ,
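# note: data.frame(Titanic) has the columns Class, Sex, Age, Survived and Freq, so colorBy = "Region" does not match any of them
color = list( colorBy = "Region" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) ,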
withD3 = TRUE )
12.3 Html widget: parcoords -> interactive
# See: http://www.buildingwidgets.com/blog/2015/1/30/week-04-interactive-parallel-coordinates-1
# devtools::install_github("timelyportfolio/parcoords")
library(parcoords)
mystates %>%
arrange(Region) %>%
parcoords( rownames = F , # turn off rownames from the data.frame
brushMode = "1D-axes" ,
reorderable = T ,
queue = T,
alpha = 0.5 )
# with color
parcoords(mystates ,
rownames = F ,
brushMode = "1D-axes" ,
reorderable = T ,
queue = T ,
color = list( colorBy = "Region" ,colorScale = "scaleOrdinal" ,colorScheme = "schemeCategory10" ) ,
withD3 = TRUE )
13. Multivariate Categorical
- Frequency
- Bar charts
- Stacked bar chart
- Grouped bar chart (2variables)
- Grouped bar chart w/ facets (3 variables)
- Cleveland dot plots
- Bar charts
- Proportion / Association
- Mosaic plots
- Fluctuation diagrams
Chi Square Test of Independence
H_0: the two variables are independent
Original dataset:
localmat <- as.matrix(local[,2:3])
rownames(localmat) <- local$Age
X <- chisq.test(localmat, correct = FALSE)
X$observed
X$expected
X
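Since the local dataset is not included in these notes, the following sketch uses a made-up 2 x 2 table (the labels and counts are assumptions) to show where X$expected and the test statistic come from.

```r
# Made-up 2 x 2 table of counts, for illustration only
tab <- matrix(c(20, 30, 30, 20), nrow = 2,
              dimnames = list(Age = c("young", "old"),
                              Answer = c("yes", "no")))

X <- chisq.test(tab, correct = FALSE)

# Expected count = row total * column total / grand total
expected_by_hand <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Test statistic = sum over cells of (observed - expected)^2 / expected
stat_by_hand <- sum((tab - expected_by_hand)^2 / expected_by_hand)

print(X$expected)
print(stat_by_hand)
```

Every expected count is 25 here, each cell is off by 5, and the statistic is 4 * 25 / 25 = 4 on 1 degree of freedom.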
Mosaic plots
P.S.
- “Treatment” level should be on the bottom and darker than the other shades (for ordinal data)
- The levels for all ordinal data should appear in order.
- Choose one variable to order by frequency count
- When using mosaic() from vcd, the count column is assumed to be named "Freq" by default
Mosaic pairs plot
Similar plots
- mosaic plot = filled rectangular plot with consistent number of rows and columns, where each small rectangle represents a unique combination of levels of factors of the variables displayed
- treemap = filled rectangular plot representing hierarchical data (fill color does not necessarily represent frequency count)
- spine plot = mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction
Categorical data formats - Conversions
- cases
- counts (tidy data with Freq column)
- contingency or pivot table
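The three formats can be converted into one another with base R alone. This is a sketch using the built-in Titanic table; picking Class and Survived for the two-way table is an arbitrary choice.

```r
# counts: tidy data with a Freq column
counts <- as.data.frame(Titanic)

# counts -> contingency table (here Class by Survived)
tab <- xtabs(Freq ~ Class + Survived, data = counts)

# counts -> cases: repeat each row Freq times, then drop the Freq column
cases <- counts[rep(seq_len(nrow(counts)), counts$Freq), names(counts) != "Freq"]

# cases -> contingency table
tab_from_cases <- table(cases$Class, cases$Survived)

# contingency table -> counts
counts_again <- as.data.frame(tab)
```

The cases table has one row per person, so its row count equals the total of the Freq column.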
Likert data (satisfaction ratings)
Stacked Bar Chart
Diverging Stacked Bar Chart
Diverging Stacked Bar Chart w/ Separate Neutrals
14. Alluvial diagrams
14.1 Alluvial Plots in ggplot2 (from the official help documentation)
Define the following elements of a typical alluvial plot:
- An axis is a dimension (variable) along which the data are vertically grouped at a fixed horizontal position. The plot above uses three categorical axes: Class, Sex, and Age.
- The groups at each axis are depicted as opaque blocks called strata. For example, the Class axis contains four strata: 1st, 2nd, 3rd, and Crew.
- Horizontal (x -) splines called alluvia span the width of the plot. In this plot, each alluvium corresponds to a fixed value of each axis variable, indicated by its vertical position at the axis, as well as of the Survived variable, indicated by its fill color.
- The segments of the alluvia between pairs of adjacent axes are flows.
- The alluvia intersect the strata at lodes. The lodes are not visualized in the above plot, but they can be inferred as filled rectangles extending the flows through the strata at each end of the plot or connecting the flows on either side of the center stratum.
14.2 Alluvial data
(1) Alluvia (wide) format
head(as.data.frame(UCBAdmissions))
install.packages("ggalluvial")
library(ggalluvial)
library(ggplot2)
library(dplyr)
ggplot(as.data.frame(UCBAdmissions),
aes(y = Freq, axis1 = Gender, axis2 = Dept, axis3 = Admit)) +
geom_alluvium(aes(fill = Gender), width = 1/12) +
geom_stratum(width = 1/12, fill = "grey80", color = "grey") +
geom_label(stat = "stratum",
aes(label = after_stat(stratum))) +
scale_x_discrete(expand = c(.05, .05)) +
scale_fill_brewer(type = "qual", palette = "Set1") +
ggtitle("UC Berkeley admissions and rejections") +
theme_void()
-> change geom_alluvium to geom_flow
(2)Lodes (long) format
UCB_lodes <- to_lodes_form(as.data.frame(UCBAdmissions),
axes = 1:3,
id = "Cohort")
- x, the "key" variable indicating the axis to which the row corresponds, which are to be arranged along the horizontal axis;
- stratum, the "value" taken by the axis variable indicated by x; and
- alluvium, the indexing scheme that links the rows of a single alluvium.
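A quick way to see the long format is to convert the wide table and look at its shape. This sketch repeats the to_lodes_form() call from above; because id = "Cohort" is passed, the linking column is named Cohort rather than the default alluvium.

```r
library(ggalluvial)

ucb_wide  <- as.data.frame(UCBAdmissions)   # wide: Admit, Gender, Dept, Freq
ucb_lodes <- to_lodes_form(ucb_wide, axes = 1:3, id = "Cohort")

# Each wide row becomes one row per axis
print(head(ucb_lodes))
print(dim(ucb_lodes))
```

With three axes, the long table has three times as many rows as the wide one.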
data(Refugees, package = "alluvial")
country_regions <- c(
Afghanistan = "Middle East",
Burundi = "Central Africa",
`Congo DRC` = "Central Africa",
Iraq = "Middle East",
Myanmar = "Southeast Asia",
Palestine = "Middle East",
Somalia = "Horn of Africa",
Sudan = "Central Africa",
Syria = "Middle East",
Vietnam = "Southeast Asia"
)
Refugees$region <- country_regions[Refugees$country]
ggplot(data = Refugees,
aes(x = year, y = refugees, alluvium = country)) +
geom_alluvium(aes(fill = country, colour = country),
alpha = .75, decreasing = FALSE) +
scale_x_continuous(breaks = seq(2003, 2013, 2)) +
theme_bw() +
theme(axis.text.x = element_text(angle = -30, hjust = 0)) +
scale_fill_brewer(type = "qual", palette = "Set3") +
scale_color_brewer(type = "qual", palette = "Set3") +
facet_wrap(~ region, scales = "fixed")
The format allows us to assign aesthetics that change from axis to axis along the same alluvium, which is useful for repeated measures datasets.
15. Simpson's Paradox
16. Color
- sequential
- diverging
- qualitative (for categorical data)
- perceptually uniform color spaces
Continuous data
+scale_color_viridis_c()
+scale_color_distiller(palette = "PuBu")
create your own sequential
+scale_color_gradient(low = "white", high = "red")
create your own diverging
+scale_color_gradient2(low = "blue", mid = "white", high = "red")
Discrete data
+scale_color_viridis_d()
+scale_color_brewer(palette = "PuBu")
create your own
+scale_color_manual(values = c("red", "yellow", "blue"))
Color Vision Deficiency
- protanopia (red)
- deuteranopia (green)
- tritanopia (blue)
Legend order matches graph order
17. Heatmaps
- can show frequency counts (2D histogram) or value of a third variable
- can be used for continuous or categorical data (both for axes and fill color)
17.1 Drawing heatmaps with ggplot2
(1)geom_tile with numerical data, compare to geom_point
library(ggplot2) # plotting
library(gridExtra) # arranging plots side by side
x <- 1:3
y <- c(5, 2, 7)
df <- data.frame(x, y)
g1 <- ggplot(df, aes(x, y)) + geom_point()
g2 <- ggplot(df, aes(x, y)) + geom_tile()
grid.arrange(g1, g2, nrow = 1)
(2)geom_raster as geom_tile w/ uniform w, h & faster
geom_tile() uses the center of the tile and its size (x, y, width, height). geom_raster is a high performance special case for when all the tiles are the same size
x <- c("apples", "bananas", "oranges")
y <- c("NJ", "NY", "NJ")
df <- data.frame(x, y)
ggplot(df, aes(x,y)) + geom_raster() + theme_grey()
(3)Complete set of (x,y) pairs
y <- rep(c("apples", "bananas", "oranges"), 2)
x <- rep(c("NJ", "NY"), each = 3)
df <- data.frame(x, y)
ggplot(df, aes(x, y)) + geom_raster()
-> Add fill color
set.seed(2019)
df$z <- sample(6)
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))
-> What if z is categorical ?
df$z <- c("A", "C", "B", "B", "A", "D")
ggplot(df, aes(x, y)) + geom_raster(aes(fill = z))
-> Create a heat map theme
theme(axis.line = element_blank(), axis.ticks = element_blank()) -> removes the axis lines and ticks
theme_heat <- theme_classic() +
theme(axis.line = element_blank(), axis.ticks = element_blank())
ggplot(df, aes(x, y)) +
geom_raster(aes(fill = z)) +
theme_heat
-> Add coord_fixed
ggplot(df, aes(x, y)) +
geom_raster(aes(fill = z)) +
coord_fixed() +
theme_heat
-> Add white border
(doesn’t work with geom_raster())
ggplot(df, aes(x, y)) +
geom_tile(aes(fill = z), color = 'white') +
coord_fixed() +
theme_heat
II. Homework Review
Q1: Draw a parallel coordinates plot of the numeric columns in the dataset -> how to find all numeric columns in a dataset
library(palmerpenguins)
select_if(penguins_raw, is.numeric)
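The same selection can be done without any package. In this sketch the built-in iris data stands in for penguins_raw so it runs anywhere; with dplyr 1.0 or later, select(where(is.numeric)) is an alternative to select_if().

```r
# Built-in iris stands in for penguins_raw, so no extra package is needed
is_num <- sapply(iris, is.numeric)   # TRUE for each numeric column
numeric_cols <- iris[is_num]

print(names(numeric_cols))
```

The resulting data frame can be passed straight to a parallel coordinates function such as ggparcoord().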
Multiple-choice questions
something else
Making Faceted Heatmaps with ggplot2