R notes: DataCamp, Introduction to R, 3: Factors
- What’s a factor and why would you use it?
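(1) Variables are either continuous or categorical. R stores categorical variables as factors.
Categorical variable: sex is a typical example.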
`# Assign to the variable theory what this chapter is about!
theory <- "factors"`
- What’s a factor and why would you use it? (2)
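(1) To create a factor in R, use the factor() function:
First create a vector that holds all the observations, each belonging to one of a limited number of categories. For example: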
# Sex vector
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
This example has two categories, "Male" and "Female". R calls them "factor levels".
The factor() function converts the vector to a factor:
# Convert sex_vector to a factor
factor_sex_vector <- factor(sex_vector)
Printing the factor shows the values and the levels:
# Print out factor_sex_vector
factor_sex_vector
> factor_sex_vector
[1] Male Female Female Male Male
Levels: Female Male
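To see what factor() actually stores, the following minimal sketch inspects the factor built above. It only uses base R functions (levels(), nlevels(), as.integer()); the names sex_levels, n_levels and sex_codes are introduced here for illustration and are not part of the DataCamp exercise.

```r
# Sex vector from the notes above
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# The distinct categories; by default R sorts them alphabetically
sex_levels <- levels(factor_sex_vector)

# Number of categories
n_levels <- nlevels(factor_sex_vector)

# Integer codes: each observation's position in the levels vector
sex_codes <- as.integer(factor_sex_vector)

print(sex_levels)  # "Female" "Male"
print(n_levels)    # 2
print(sex_codes)   # 2 1 1 2 2
```

A factor is therefore a vector of integer codes plus a lookup table of level names, which is why the printed levels appear in alphabetical order rather than in order of first appearance.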
- What’s a factor and why would you use it? (3)
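(1) Categorical variables come in two types: nominal categorical variables and ordinal categorical variables.
A. A nominal categorical variable has no implied order, so its categories cannot be ranked. For example, animal categories: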
# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
Output:
[1] Elephant Giraffe Donkey Horse
Levels: Donkey Elephant Giraffe Horse
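B. In contrast, an ordinal categorical variable has a natural order, so its categories can be ranked. For example, temperature categories: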
# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
Output:
[1] High Low High Low Medium
Levels: Low < Medium < High
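The difference between the two kinds of factor can be checked programmatically. This is a small sketch using base R's is.ordered(); the variable names animals_ordered, temperature_ordered and default_levels are illustrative additions, not part of the original exercise.

```r
# Nominal factor: no order among the levels
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)

# Ordinal factor: levels listed from lowest to highest
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))

animals_ordered <- is.ordered(factor_animals_vector)          # FALSE
temperature_ordered <- is.ordered(factor_temperature_vector)  # TRUE

# Without an explicit levels argument R falls back to alphabetical order,
# which would not reflect the temperature scale
default_levels <- levels(factor(temperature_vector))  # "High" "Low" "Medium"

print(animals_ordered)
print(temperature_ordered)
print(default_levels)
```

Passing levels explicitly matters for ordinal data because the alphabetical default ("High", "Low", "Medium") does not reflect the real ranking.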
- Factor levels
(1) Use the levels() function to rename the default factor levels.
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
factor_survey_vector
Output:
[1] M F F M M
Levels: F M
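Specify the new level names. Their order must match the current levels (F, then M), which R sorted alphabetically: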
# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female", "Male")
factor_survey_vector
Output:
[1] Male Female Female Male Male
Levels: Female Male
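Assigning to levels() relies on knowing the current level order. The sketch below shows what happens if the names are given in the wrong order, and one alternative that pairs each raw value with its label through the levels and labels arguments of factor(). The names wrong_factor and labelled_factor are hypothetical and used only for illustration.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Pitfall: the current levels are "F", "M", so this assignment
# labels every "F" as "Male" and every "M" as "Female"
wrong_factor <- factor(survey_vector)
levels(wrong_factor) <- c("Male", "Female")
print(as.character(wrong_factor))  # "Female" "Male" "Male" "Female" "Female"

# Alternative: state the raw values and their labels side by side
labelled_factor <- factor(survey_vector,
                          levels = c("F", "M"),
                          labels = c("Female", "Male"))
print(as.character(labelled_factor))  # "Male" "Female" "Female" "Male" "Male"
```

With levels and labels given together, the mapping is visible in the call itself and does not depend on alphabetical sorting.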
- The summary() function
(1) R's summary() function gives a quick overview of a variable.
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
factor_survey_vector
# Generate summary for survey_vector
summary(survey_vector)
# Generate summary for factor_survey_vector
summary(factor_survey_vector)
Output:
> factor_survey_vector
[1] Male Female Female Male Male
Levels: Female Male
> summary(survey_vector)
   Length     Class      Mode
        5 character character
> summary(factor_survey_vector)
Female   Male
     2      3
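The per-level counts that summary() reports for a factor can also be obtained with base R's table(), which additionally feeds into prop.table() for proportions. This is a short sketch; level_counts and level_shares are names introduced here for illustration.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Count observations per level
level_counts <- table(factor_survey_vector)
print(level_counts)

# Convert the counts to proportions of the total
level_shares <- prop.table(level_counts)
print(level_shares)
```

For this vector the counts are 2 for Female and 3 for Male, matching the summary() output above, and the shares are 0.4 and 0.6.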
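- Checking whether nominal categorical variables can be compared
(1) In R, values of a nominal categorical variable cannot be ranked with > or <. The comparison returns NA with a warning.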
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
# Male
male <- factor_survey_vector[1]
# Female
female <- factor_survey_vector[2]
# Battle of the sexes: Male 'larger' than female?
male > female
Output:
> male > female
[1] NA
Warning message:
In Ops.factor(male, female) : ‘>’ not meaningful for factors
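Only the ranking operators are undefined for nominal factors. Equality tests still work, as this minimal sketch shows; same_sex, is_male and male_flags are illustrative names that do not appear in the original exercise.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Equality between two factor values with the same levels
same_sex <- male == female   # FALSE

# Equality between a factor value and a level name
is_male <- male == "Male"    # TRUE

# Vectorised: flag every observation that is "Male"
male_flags <- factor_survey_vector == "Male"  # TRUE FALSE FALSE TRUE TRUE

print(same_sex)
print(is_male)
print(male_flags)
```

So == and != are meaningful for any factor, while >, <, >= and <= need an ordered factor.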
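- Ordered factors
(1) In R, ordinal categorical variables can be compared.
This requires passing two arguments to factor():
ordered - whether the factor is ordered (set to TRUE)
levels - the level order, listed from left to right, smallest to largest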
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)
Output:
> factor_speed_vector
[1] medium slow slow medium fast
Levels: slow < medium < fast
> summary(factor_speed_vector)
  slow medium   fast
     2      2      1
Values of an ordered factor can be compared:
# Create factor_speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "medium", "fast"))
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]
# Is data analyst 2 faster than data analyst 5?
da2 > da5
Output:
> da2 > da5
[1] FALSE
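The same comparison also works across the whole vector, and ordered factors support min() and max(). The sketch below assumes the factor_speed_vector defined above; faster_than_slow, slowest and fastest are names added here for illustration.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector,
                              ordered = TRUE,
                              levels = c("slow", "medium", "fast"))

# Compare every analyst against a level name
faster_than_slow <- factor_speed_vector > "slow"  # TRUE FALSE FALSE TRUE TRUE

# Lowest and highest level present in the data
slowest <- min(factor_speed_vector)
fastest <- max(factor_speed_vector)

print(faster_than_slow)
print(as.character(slowest))  # "slow"
print(as.character(fastest))  # "fast"
```

The comparison follows the order given in levels, not alphabetical order, which is why "fast" ranks above "medium" here.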