R for Data Science Summary: Factors
In R, the factor type is used to work with categorical variables. Here we use the forcats package, whose name is short for "for categorical variables":
library(tidyverse)
library(forcats)
Creating a factor:
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
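For contrast, here is a minimal sketch of what happens without a factor: sorting the raw strings orders them alphabetically rather than by calendar month. It uses base R only. `month.abb` is R's built-in vector of abbreviated English month names, used here as a stand-in for month_levels, and the names `sorted_chr` and `sorted_fct` are just illustrative.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# A plain character vector sorts alphabetically, which is not useful for months.
sorted_chr <- sort(x1)
sorted_chr
#> [1] "Apr" "Dec" "Jan" "Mar"

# With explicit levels, sort() follows the level order instead.
sorted_fct <- sort(factor(x1, levels = month.abb))
as.character(sorted_fct)
#> [1] "Jan" "Mar" "Apr" "Dec"
```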
Any value that is not in the levels is silently converted to NA:
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
y2
#> [1] Dec Apr <NA> Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
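Because the conversion is silent, it can help to check which input values were lost. The following is only a sketch in base R: it assumes the input contains the typo "Jam" (as the parsing warning further down suggests), and `bad_values` is a hypothetical name.

```r
month_levels <- month.abb  # built-in "Jan" ... "Dec"
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)

# Values that became NA in the factor but were not NA in the input
bad_values <- x2[is.na(y2) & !is.na(x2)]
bad_values
#> [1] "Jam"

# The same check without building the factor first
setdiff(x2, month_levels)
#> [1] "Jam"
```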
If you want a warning instead, use readr::parse_factor():
y2 <- parse_factor(x2, levels = month_levels)
#> Warning: 1 parsing failure.
#> row col           expected actual
#>   3  NA value in level set    Jam
If you omit levels, they are taken from the data and sorted alphabetically:
factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar
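As a rough check of this default, the levels of factor(x1) should match the sorted unique values of the input. This is a sketch in base R. Note that for other strings the sort order can depend on the locale; `default_levels` is an illustrative name.

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")

# Default levels: the distinct values of the input, sorted
default_levels <- levels(factor(x1))
default_levels
#> [1] "Apr" "Dec" "Jan" "Mar"

identical(default_levels, sort(unique(x1)))
#> [1] TRUE
```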
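To make the order of the levels match the order of first appearance in the data, set the levels with unique() when creating the factor, or apply fct_inorder() afterwards: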
f1 <- factor(x1, levels = unique(x1))
f1
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
f2 <- x1 %>% factor() %>% fct_inorder()
f2
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
To access the set of levels directly, use levels():
levels(f2)
#> [1] "Dec" "Apr" "Jan" "Mar"
Now let's experiment with the gss_cat dataset: