What's a factor and why would you use it?
In this chapter you dive into the wonderful world of factors.
The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can take only a limited number of categories, whereas a continuous variable can take an infinite number of values.
It is important that R knows whether it is dealing with a continuous or a categorical variable, as the statistical models you will develop in the future treat both types differently. (You will see later why this is the case.)
- Assign to the variable theory the value "factors for categorical variables".
# Assign to the variable theory what this chapter is about!
theory <- "factors for categorical variables"
theory
console:
> # Assign to the variable theory what this chapter is about!
> theory <- "factors for categorical variables"
> theory
[1] "factors for categorical variables"
What's a factor and why would you use it? (2)
To create factors in R, you use the function factor(). The first thing you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, gender_vector contains the sex of 5 different individuals:
gender_vector <- c("Male","Female","Female","Male","Male")
It is clear that there are two categories, or in R terms 'factor levels', at work here: "Male" and "Female".
The function factor() will encode the vector as a factor:
factor_gender_vector <- factor(gender_vector)
- Convert the character vector gender_vector to a factor with factor() and assign the result to factor_gender_vector.
- Print out factor_gender_vector and check that R prints the factor levels below the actual values.
# Gender vector
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
# Convert gender_vector to a factor
factor_gender_vector <- factor(gender_vector)
# Print out factor_gender_vector
factor_gender_vector
console:
> # Gender vector
> gender_vector <- c("Male", "Female", "Female", "Male", "Male")
>
> # Convert gender_vector to a factor
> factor_gender_vector <- factor(gender_vector)
>
> # Print out factor_gender_vector
> factor_gender_vector
[1] Male Female Female Male Male
Levels: Female Male
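To make the encoding step concrete, here is a minimal sketch of what a factor stores: a set of levels plus one integer code per observation. The helper names gender_levels, gender_codes and rebuilt are introduced only for this illustration.

```r
# Same data as in the exercise above
gender_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_gender_vector <- factor(gender_vector)

# The distinct categories (hypothetical helper name: gender_levels)
gender_levels <- levels(factor_gender_vector)
print(gender_levels)

# The integer code stored for each observation, i.e. its index into the levels
gender_codes <- as.integer(factor_gender_vector)
print(gender_codes)

# Looking the codes up in the levels recovers the original character vector
rebuilt <- gender_levels[gender_codes]
print(identical(rebuilt, gender_vector))
```

Because "Female" sorts before "Male", it becomes level 1 and "Male" becomes level 2, which is why the Levels line in the console lists them in that order.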
What's a factor and why would you use it? (3)
There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.
A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that 'one is worth more than the other'. For example, think of the categorical variable animals_vector with the categories "Elephant", "Giraffe", "Donkey" and "Horse". Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).
An ordinal variable, in contrast, does have a natural ordering. Consider, for example, the categorical variable temperature_vector with the categories "Low", "Medium" and "High". Here it is obvious that "Medium" stands above "Low", and "High" stands above "Medium".
# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
factor_temperature_vector
console:
> # Animals
> animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
> factor_animals_vector <- factor(animals_vector)
> factor_animals_vector
[1] Elephant Giraffe Donkey Horse
Levels: Donkey Elephant Giraffe Horse
>
> # Temperature
> temperature_vector <- c("High", "Low", "High","Low", "Medium")
> factor_temperature_vector <- factor(temperature_vector, order = TRUE, levels = c("Low", "Medium", "High"))
> factor_temperature_vector
[1] High Low High Low Medium
Levels: Low < Medium < High
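The difference between the two kinds of factor can also be checked in code. The sketch below rebuilds both factors under shorter, purely illustrative names (animals, temperature, sorted_temperature) and uses base R's is.ordered(), sort(), min() and max().

```r
# Nominal factor: no order among the levels
animals <- factor(c("Elephant", "Giraffe", "Donkey", "Horse"))

# Ordinal factor: levels listed from lowest to highest
temperature <- factor(c("High", "Low", "High", "Low", "Medium"),
                      ordered = TRUE,
                      levels = c("Low", "Medium", "High"))

print(is.ordered(animals))      # nominal
print(is.ordered(temperature))  # ordinal

# Sorting an ordered factor follows the level order, not the alphabet
sorted_temperature <- sort(temperature)
print(sorted_temperature)

# Lowest and highest observed category
print(min(temperature))
print(max(temperature))
```

Calling min() or max() on the nominal animals factor would raise an error instead, since those summaries are only defined once an order exists.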
Factor levels
When you first get a data set, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():
levels(factor_vector) <- c("name1", "name2",...)
A good illustration is the raw data provided to you by a survey. A standard question in every questionnaire is the gender of the respondent. Recall from the previous exercise that this is a factor; when the questionnaire is conducted on the street, its levels are often coded as "M" and "F".
survey_vector <- c("M", "F", "F", "M", "M")
Next, when you want to start your data analysis, your main concern is to keep a nice overview of all the variables and what they mean. At that point, you will often want to change the factor levels to "Male" and "Female" instead of "M" and "F" to make your life easier.
Watch out: the order in which you assign the levels matters. If you type levels(factor_survey_vector), you'll see that it outputs [1] "F" "M". If you don't specify the levels of the factor when creating the vector, R will automatically assign them alphabetically. To correctly map "F" to "Female" and "M" to "Male", the levels should be set to c("Female", "Male"), in this order.
- Check out the code that builds a factor vector from survey_vector. You should use factor_survey_vector in the next instruction.
- Change the factor levels of factor_survey_vector to c("Female", "Male"). Mind the order of the vector elements here.
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female","Male")
factor_survey_vector
console:
> # Code to build factor_survey_vector
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
>
> # Specify the levels of factor_survey_vector
> levels(factor_survey_vector) <- c("Female","Male")
>
> factor_survey_vector
[1] Male Female Female Male Male
Levels: Female Male
> # Where does this sequence come from? It confused me at first: when I swapped the order of the levels, the printed sequence changed too.
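The puzzle in the comment above comes from levels<- renaming by position: the new names replace the existing levels c("F", "M") one by one, so swapping the new names flips every observation. A minimal sketch follows; the names by_position, swapped and explicit are illustrative only, and the labels argument of factor() is shown as one possible way to make the mapping explicit.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Renaming by position: "F" -> "Female", "M" -> "Male"
by_position <- factor(survey_vector)
levels(by_position) <- c("Female", "Male")

# Swapped names: "F" -> "Male", "M" -> "Female", so every value flips
swapped <- factor(survey_vector)
levels(swapped) <- c("Male", "Female")

# Explicit pairing of each old level with its new label
explicit <- factor(survey_vector,
                   levels = c("F", "M"),
                   labels = c("Female", "Male"))

print(as.character(by_position))
print(as.character(swapped))
print(identical(as.character(by_position), as.character(explicit)))
```

Writing the old levels and the new labels side by side keeps the mapping visible in the code, rather than relying on the alphabetical default.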
> ## assign individual levels
> x <- gl(2, 4, 8)
> levels(x)[1] <- "low"
> levels(x)[2] <- "high"
> x
[1] low low low low high high high high
Levels: low high
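> # gl(2, 4, 8) gives 8 observations in total across two levels, each repeated 4 times; renaming the levels to low and high yields the sequence above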
>
> ## or as a group
> y <- gl(2, 4, 8)
> levels(y) <- c("low", "high")
> y
[1] low low low low high high high high
Levels: low high
>
> ## combine some levels
> z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
> z
[1] apple apple salad salad orange orange apple apple salad salad
[11] orange orange
Levels: apple salad orange
> levels(z) <- c("fruit", "veg", "fruit")
> z
[1] fruit fruit veg veg fruit fruit fruit fruit veg veg fruit fruit
Levels: fruit veg
>
> ## same, using a named list
> z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
> z
[1] apple apple salad salad orange orange apple apple salad salad
[11] orange orange
Levels: apple salad orange
> levels(z) <- list("fruit" = c("apple","orange"),
"veg" = "salad")
> z
[1] fruit fruit veg veg fruit fruit fruit fruit veg veg fruit fruit
Levels: fruit veg
>
> ## we can add levels this way:
> f <- factor(c("a","b"))
> levels(f) <- c("c", "a", "b")
> f
[1] c a
Levels: c a b
>
> f <- factor(c("a","b"))
> levels(f) <- list(C = "C", A = "a", B = "b")
> f
[1] A B
Levels: C A B
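The transcripts above rely on gl(), which generates a factor from a pattern: gl(n, k, length) produces n levels, each repeated k times, recycled to the requested length. The sketch below compares it with an equivalent rep() call and counts the merged levels with table(); the name manual is introduced only for this illustration.

```r
# n = 2 levels, each repeated k = 4 times, total length 8
x <- gl(2, 4, 8)
print(x)  # the default level labels are "1" and "2"

# A hand-built equivalent using rep() and factor() (illustrative name: manual)
manual <- factor(rep(1:2, each = 4, length.out = 8))
print(identical(as.integer(x), as.integer(manual)))

# After two levels are merged into "fruit", table() counts the combined category
z <- gl(3, 2, 12, labels = c("apple", "salad", "orange"))
levels(z) <- c("fruit", "veg", "fruit")
print(table(z))
```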
After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:
summary(my_var)
Going back to the survey, you would like to know how many "Male" responses you have in your study, and how many "Female" responses. The summary() function gives you the answer to this question.
- Ask for a summary() of survey_vector and of factor_survey_vector. Interpret the results for both vectors. Are they equally useful in this case?
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
factor_survey_vector
# Generate summary for survey_vector
summary(survey_vector)
# Generate summary for factor_survey_vector
summary(factor_survey_vector)
console:
> # Build factor_survey_vector with clean levels
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
> levels(factor_survey_vector) <- c("Female", "Male")
> factor_survey_vector
[1] Male Female Female Male Male
Levels: Female Male
>
> # Generate summary for survey_vector
>
> summary(survey_vector)
Length Class Mode
5 character character
>
> # Generate summary for factor_survey_vector
> summary(factor_survey_vector)
Female Male
2 3
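As the output shows, summary() only counts categories once the data is a factor. For comparison, here is a small sketch using table(), which counts values for a plain character vector as well; counts_chr and counts_fct are illustrative names, and the factor is built with levels and labels so the mapping is explicit.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector,
                               levels = c("F", "M"),
                               labels = c("Female", "Male"))

# table() counts categories even for a character vector
counts_chr <- table(survey_vector)
print(counts_chr)

# For a factor, summary() reports the count per level
counts_fct <- summary(factor_survey_vector)
print(counts_fct)

# Proportions instead of raw counts
print(prop.table(table(factor_survey_vector)))
```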
In factor_survey_vector we have a factor with two levels: Male and Female. But how does R value these relative to each other? In other words, who does R think is better, males or females?
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")
# Male
male <- factor_survey_vector[1]
# Female
female <- factor_survey_vector[2]
# Battle of the sexes: Male 'larger' than female?
male > female
console:
> # Build factor_survey_vector with clean levels
> survey_vector <- c("M", "F", "F", "M", "M")
> factor_survey_vector <- factor(survey_vector)
> levels(factor_survey_vector) <- c("Female", "Male")
>
> # Male
> male <- factor_survey_vector[1]
>
> # Female
> female <- factor_survey_vector[2]
>
> # Battle of the sexes: Male 'larger' than female?
> male > female
Warning message: '>' not meaningful for factors
[1] NA
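The NA above does not mean that unordered factors cannot be compared at all. As a rough sketch, equality tests remain well defined, while ordering tests return NA with a warning (silenced here with suppressWarnings() purely to keep the output tidy). The name ordering_result is introduced for illustration.

```r
factor_survey_vector <- factor(c("M", "F", "F", "M", "M"),
                               levels = c("F", "M"),
                               labels = c("Female", "Male"))
male <- factor_survey_vector[1]
female <- factor_survey_vector[2]

# Equality and inequality work for unordered factors
print(male == female)
print(male != female)

# Ordering does not: the result is NA
ordering_result <- suppressWarnings(male > female)
print(ordering_result)

# Equality against a level name works element by element
print(factor_survey_vector == "Male")
```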
Ordered factors
Since "Male" and "Female" are unordered (or nominal) factor levels, R returns a warning message, telling you that the greater-than operator is not meaningful. As seen before, R attaches an equal value to the levels of such factors.
Some factors, however, do have a natural ordering between their categories. Suppose you evaluate the speed of five data analysts, rate each one as "slow", "fast" or "insane", and save the results in speed_vector.
As a first step, assign speed_vector knowing that:
- Analyst 1 is fast,
- Analyst 2 is slow,
- Analyst 3 is slow,
- Analyst 4 is fast and
- Analyst 5 is insane.
# Create speed_vector
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
console:
> # Create speed_vector
> speed_vector <- c("fast", "slow", "slow", "fast", "insane")
> speed_vector
[1] "fast" "slow" "slow" "fast" "insane"
Ordered factors (2)
speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.
factor(some_vector,
ordered = TRUE,
levels = c("lev1", "lev2" ...))
By setting the argument ordered to TRUE in the function factor(), you indicate that the factor is ordered. With the argument levels you give the values of the factor in the correct order.
- From speed_vector, create an ordered factor vector: factor_speed_vector. Set ordered to TRUE, and set levels to c("slow", "fast", "insane").
# Create speed_vector
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))
# Print factor_speed_vector
factor_speed_vector
summary(factor_speed_vector)
console:
> # Create speed_vector
> speed_vector <- c("fast", "slow", "slow", "fast", "insane")
>
> # Convert speed_vector to ordered factor vector
> factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))
>
> # Print factor_speed_vector
> factor_speed_vector
[1] fast slow slow fast insane
Levels: slow < fast < insane
> summary(factor_speed_vector)
slow fast insane
2 2 1
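Two details of this step may be worth seeing side by side: base R's ordered() is a shorthand for factor(..., ordered = TRUE), and leaving out levels falls back to alphabetical order, which is rarely the intended ranking. A minimal sketch, with speed_levels, via_factor, via_ordered and alphabetical as illustrative names:

```r
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
speed_levels <- c("slow", "fast", "insane")

# Two equivalent ways to build the ordered factor
via_factor  <- factor(speed_vector, ordered = TRUE, levels = speed_levels)
via_ordered <- ordered(speed_vector, levels = speed_levels)
print(identical(via_factor, via_ordered))

# Without explicit levels the order is alphabetical
alphabetical <- factor(speed_vector, ordered = TRUE)
print(levels(alphabetical))
```

In the alphabetical version "insane" would rank between "fast" and "slow", so any later comparison would give misleading results.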
Comparing ordered factors
Having a bad day at work, 'data analyst number two' enters your office and starts complaining that 'data analyst number five' is slowing down the entire project. Since you know that 'data analyst number two' has the reputation of being a smarty-pants, you first decide to check if his statement is true.
The fact that factor_speed_vector is now ordered enables us to compare different elements (the data analysts in this case). You can do this with the well-known comparison operators.
- Use [2] to select from factor_speed_vector the factor value for the second data analyst. Store it as da2.
- Use [5] to select from factor_speed_vector the factor value for the fifth data analyst. Store it as da5.
- Check if da2 is greater than da5; simply print out the result. Remember that you can use the > operator to check whether one element is larger than the other.
# Create factor_speed_vector
speed_vector <- c("fast", "slow", "slow", "fast", "insane")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]
# Is data analyst 2 faster than data analyst 5?
da2 > da5
console:
> # Create factor_speed_vector
> speed_vector <- c("fast", "slow", "slow", "fast", "insane")
> factor_speed_vector <- factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane"))
>
> # Factor value for second data analyst
> da2 <- factor_speed_vector[2]
>
> # Factor value for fifth data analyst
> da5 <- factor_speed_vector[5]
>
> # Is data analyst 2 faster than data analyst 5?
> da2 > da5
[1] FALSE
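The same comparison also works on the whole vector at once. The sketch below compares every analyst against a level name and then locates the fastest one; faster_than_slow is an illustrative name, and which.max() on the integer codes is just one possible way to find that position.

```r
factor_speed_vector <- factor(c("fast", "slow", "slow", "fast", "insane"),
                              ordered = TRUE,
                              levels = c("slow", "fast", "insane"))

# Compare all analysts against the level "slow" in one go
faster_than_slow <- factor_speed_vector > "slow"
print(faster_than_slow)

# Positions of the analysts rated above "slow"
print(which(faster_than_slow))

# Position of the analyst with the highest rating
print(which.max(as.integer(factor_speed_vector)))
```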