R swirl Tutorial (R Programming) 6: Subsetting Vectors
| In this lesson, we’ll see how to extract elements from a vector based on some conditions that we specify.
| For example, we may only be interested in the first 20 elements of a vector, or only the elements that are not NA, or only those that are positive or correspond to a specific variable of interest. By the end of this lesson, you’ll know how to handle each of these scenarios.
| I’ve created for you a vector called x that contains a random ordering of 20 numbers (from a standard normal distribution) and 20 NAs. Type x now to see what it looks like.
x
[1] 0.618804705 NA -0.561717045 NA NA 2.121961845 NA NA
[9] NA NA NA NA -0.116223284 -1.115846510 NA -1.404021991
[17] -0.902626087 NA -1.200279418 -0.171053254 0.729439833 NA 0.353889277 NA
[25] NA NA 1.005925106 NA -1.679218407 -0.670461758 NA -0.443677827
[33] NA -0.276915842 0.007862519 NA -0.047982745 -1.334484562 -1.102239409 NA
| The way you tell R that you want to select some particular elements (i.e. a ‘subset’) from a vector is by placing an ‘index vector’ in square brackets immediately following the name of the vector.
| For a simple example, try x[1:10] to view the first ten elements of x.
x[1:10]
[1] 0.6188047 NA -0.5617170 NA NA 2.1219618 NA NA NA
[10] NA
| Index vectors come in four different flavors – logical vectors, vectors of positive integers, vectors of negative integers, and vectors of character strings – each of which we’ll cover in this lesson.
| Let’s start by indexing with logical vectors. One common scenario when working with real-world data is that we want to extract all elements of a vector that are not NA (i.e. missing data). Recall that is.na(x) yields a vector of logical values the same length as x, with TRUEs corresponding to NA values in x and FALSEs corresponding to non-NA values in x.
| What do you think x[is.na(x)] will give you?
1: A vector with no NAs
2: A vector of length 0
3: A vector of all NAs
4: A vector of TRUEs and FALSEs
Selection: 3
| Prove it to yourself by typing x[is.na(x)].
x[is.na(x)]
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
| Recall that ! gives us the negation of a logical expression, so !is.na(x) can be read as ‘is not NA’. Therefore, if we want to create a vector called y that contains all of the non-NA values from x, we can use y <- x[!is.na(x)]. Give it a try.
y <- x[!is.na(x)]
| Print y to the console.
y
[1] 0.618804705 -0.561717045 2.121961845 -0.116223284 -1.115846510 -1.404021991 -0.902626087 -1.200279418
[9] -0.171053254 0.729439833 0.353889277 1.005925106 -1.679218407 -0.670461758 -0.443677827 -0.276915842
[17] 0.007862519 -0.047982745 -1.334484562 -1.102239409
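The same logical-index pattern can be sketched on a smaller vector. The vector `v` and its values below are made up for illustration and are not part of the lesson.

```r
# Hypothetical vector with missing values (not from the lesson).
v <- c(1.5, NA, -0.3, NA, 2)

# is.na(v) is a logical vector: TRUE where v is missing.
missing_mask <- is.na(v)

# Indexing with the mask keeps only the NAs;
# indexing with its negation keeps only the non-missing values.
only_na <- v[missing_mask]
non_na  <- v[!missing_mask]

print(missing_mask)
print(only_na)
print(non_na)
```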
| Now that we’ve isolated the non-missing values of x and put them in y, we can subset y as we please.
| Recall that the expression y > 0 will give us a vector of logical values the same length as y, with TRUEs corresponding to values of y that are greater than zero and FALSEs corresponding to values of y that are less than or equal to zero. What do you think y[y > 0] will give you?
1: A vector of length 0
2: A vector of all NAs
3: A vector of all the negative elements of y
4: A vector of all the positive elements of y
5: A vector of TRUEs and FALSEs
Selection: 4
| Type y[y > 0] to see that we get all of the positive elements of y, which are also the positive elements of our original vector x.
y[y > 0]
[1] 0.618804705 2.121961845 0.729439833 0.353889277 1.005925106 0.007862519
| Keep working like that and you’ll get there!
| You might wonder why we didn’t just start with x[x > 0] to isolate the positive elements of x. Try that now to see why.
x[x > 0]
[1] 0.618804705 NA NA NA 2.121961845 NA NA NA
[9] NA NA NA NA NA 0.729439833 NA 0.353889277
[17] NA NA NA 1.005925106 NA NA NA 0.007862519
[25] NA NA
| Since NA is not a value, but rather a placeholder for an unknown quantity, the expression NA > 0 evaluates to NA. Hence we get a bunch of NAs mixed in with our positive numbers when we do this.
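A minimal sketch of why the NAs show up, using a made-up three-element vector `v` (an assumption for illustration): the comparison itself yields NA, and an NA inside a logical index produces an NA in the result.

```r
# Comparing NA with a number gives NA, not TRUE or FALSE.
na_compare <- NA > 0

# Hypothetical vector (not from the lesson).
v <- c(3, NA, -1)
mask <- v > 0        # TRUE NA FALSE

# The NA in the mask becomes an NA in the subset.
result <- v[mask]    # 3 NA

print(na_compare)
print(mask)
print(result)
```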
| Combining our knowledge of logical operators with our new knowledge of subsetting, we could do this – x[!is.na(x) & x > 0]. Try it out.
x[!is.na(x) & x > 0]
[1] 0.618804705 2.121961845 0.729439833 0.353889277 1.005925106 0.007862519
| In this case, we request only values of x that are both non-missing AND greater than zero.
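As a rough check that the one-step form matches the two-step approach used earlier, here is a sketch on a made-up vector `v` (hypothetical values). It works because FALSE & NA evaluates to FALSE, so missing elements are excluded rather than passed through as NA.

```r
v <- c(3, NA, -1, 0.5, NA)   # hypothetical values

# One step: keep elements that are non-missing AND positive.
one_step <- v[!is.na(v) & v > 0]

# Two steps: drop the NAs first, then filter on the sign.
w <- v[!is.na(v)]
two_step <- w[w > 0]

print(one_step)
print(identical(one_step, two_step))
```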
| I’ve already shown you how to subset just the first ten values of x using x[1:10]. In this case, we’re providing a vector of positive integers inside of the square brackets, which tells R to return only the elements of x numbered 1 through 10.
| Many programming languages use what’s called ‘zero-based indexing’, which means that the first element of a vector is considered element 0. R uses ‘one-based indexing’, which (you guessed it!) means the first element of a vector is considered element 1.
| Can you figure out how we’d subset the 3rd, 5th, and 7th elements of x? Hint – Use the c() function to specify the element numbers as a numeric vector.
x[c(1,3,5)]
[1] 0.6188047 -0.5617170 NA
| You almost had it, but not quite. Try again. Or, type info() for more options.
| Create a vector of indexes with c(3, 5, 7), then put that inside of the square brackets.
x[c(3,5,7)]
[1] -0.561717 NA NA
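Positive integer indexing is easier to follow when the values reveal their positions. The vector `v` below is a made-up example, not part of the lesson.

```r
v <- c(10, 20, 30, 40, 50, 60, 70)   # hypothetical values

first  <- v[1]            # one-based: the first element is element 1
picked <- v[c(3, 5, 7)]   # the 3rd, 5th, and 7th elements

print(first)
print(picked)
```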
| It’s important that when using integer vectors to subset our vector x, we stick with the set of indexes {1, 2, …, 40} since x only has 40 elements. What happens if we ask for the zeroth element of x (i.e. x[0])? Give it a try.
x[0]
numeric(0)
| As you might expect, we get nothing useful. Unfortunately, R doesn’t prevent us from doing this. What if we ask for the 3000th element of x? Try it out.
x[3000]
[1] NA
| Again, nothing useful, but R doesn’t prevent us from asking for it. This should be a cautionary tale. You should always make sure that what you are asking for is within the bounds of the vector you’re working with.
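One possible way to guard against out-of-bounds requests is sketched below. The helper `safe_subset` is hypothetical (it is not part of base R or swirl) and only handles positive integer indexes.

```r
v <- c(10, 20, 30)   # hypothetical values

zero_index <- v[0]    # numeric(0): an empty vector
too_far    <- v[10]   # NA: past the end of v

# Hypothetical helper: subset only if every index lies in 1..length(vec),
# otherwise stop with an error instead of silently returning NA.
safe_subset <- function(vec, idx) {
  if (any(idx < 1 | idx > length(vec))) {
    stop("index out of bounds")
  }
  vec[idx]
}

ok  <- safe_subset(v, c(1, 3))
err <- tryCatch(safe_subset(v, 10), error = function(e) conditionMessage(e))

print(ok)
print(err)
```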
| What if we’re interested in all elements of x EXCEPT the 2nd and 10th? It would be pretty tedious to construct a vector containing all numbers 1 through 40 EXCEPT 2 and 10.
| Luckily, R accepts negative integer indexes. Whereas x[c(2, 10)] gives us ONLY the 2nd and 10th elements of x, x[c(-2, -10)] gives us all elements of x EXCEPT for the 2nd and 10th elements. Try x[c(-2, -10)] now to see this.
x[c(-2, -10)]
[1] 0.618804705 -0.561717045 NA NA 2.121961845 NA NA NA
[9] NA NA -0.116223284 -1.115846510 NA -1.404021991 -0.902626087 NA
[17] -1.200279418 -0.171053254 0.729439833 NA 0.353889277 NA NA NA
[25] 1.005925106 NA -1.679218407 -0.670461758 NA -0.443677827 NA -0.276915842
[33] 0.007862519 NA -0.047982745 -1.334484562 -1.102239409 NA
| A shorthand way of specifying multiple negative numbers is to put the negative sign out in front of the vector of positive numbers. Type x[-c(2, 10)] to get the exact same result.
x[-c(2, 10)]
[1] 0.618804705 -0.561717045 NA NA 2.121961845 NA NA NA
[9] NA NA -0.116223284 -1.115846510 NA -1.404021991 -0.902626087 NA
[17] -1.200279418 -0.171053254 0.729439833 NA 0.353889277 NA NA NA
[25] 1.005925106 NA -1.679218407 -0.670461758 NA -0.443677827 NA -0.276915842
[33] 0.007862519 NA -0.047982745 -1.334484562 -1.102239409 NA
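The equivalence of the two negative-index forms can be checked on a short made-up vector (`v` and its values are assumptions for illustration):

```r
v <- c(10, 20, 30, 40, 50)   # hypothetical values

a <- v[c(-2, -4)]   # everything except the 2nd and 4th elements
b <- v[-c(2, 4)]    # same request, negating the whole index vector

print(a)
print(identical(a, b))
```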
| So far, we’ve covered three types of index vectors – logical, positive integer, and negative integer. The only remaining type requires us to introduce the concept of ‘named’ elements.
| Create a numeric vector with three named elements using vect <- c(foo = 11, bar = 2, norf = NA).
vect <- c(foo = 11, bar = 2, norf = NA)
| Your dedication is inspiring!
| When we print vect to the console, you’ll see that each element has a name. Try it out.
vect
foo bar norf
11 2 NA
| We can also get the names of vect by passing vect as an argument to the names() function. Give that a try.
names(vect)
[1] "foo" "bar" "norf"
| Alternatively, we can create an unnamed vector vect2 with c(11, 2, NA). Do that now.
c(11, 2, NA)
[1] 11 2 NA
| You’re close…I can feel it! Try it again. Or, type info() for more options.
| Create an ordinary (unnamed) vector called vect2 that contains c(11, 2, NA).
vect2 <- c(11, 2, NA)
| Then, we can add the names attribute to vect2 after the fact with names(vect2) <- c("foo", "bar", "norf"). Go ahead.
names(vect2) <- c("foo", "bar", "norf")
| Now, let’s check that vect and vect2 are the same by passing them as arguments to the identical() function.
identical(vect, vect2)
[1] TRUE
| Indeed, vect and vect2 are identical named vectors.
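To make the "after the fact" step concrete, the sketch below looks at the names of a made-up vector before and after assignment. The vector `v` and the names "a" and "b" are hypothetical.

```r
v <- c(1, 2)               # hypothetical unnamed vector

before <- names(v)         # NULL: no names attribute yet
names(v) <- c("a", "b")    # attach names after the fact
after <- names(v)

# The result matches a vector that was created with names directly.
same <- identical(v, c(a = 1, b = 2))

print(before)
print(after)
print(same)
```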
| Now, back to the matter of subsetting a vector by named elements. Which of the following commands do you think would give us the second element of vect?
1: vect[bar]
2: vect["bar"]
3: vect["2"]
Selection: 2
| Now, try it out.
vect["bar"]
bar
2
| Likewise, we can specify a vector of names with vect[c("foo", "bar")]. Try it out.
vect[c("foo", "bar")]
foo bar
11 2
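The sketch below revisits the three quiz options to show why only the quoted name works. It assumes a fresh session in which no variable called `bar` exists.

```r
vect <- c(foo = 11, bar = 2, norf = NA)   # same vector as in the lesson

by_name  <- vect["bar"]             # the element named "bar"
by_names <- vect[c("foo", "bar")]   # several names at once

# "2" is treated as a name, not a position; no element has that name,
# so the result is NA.
no_match <- vect["2"]

# Unquoted bar is looked up as a variable; assuming no such variable
# exists in the session, this raises an error.
unquoted <- tryCatch(vect[bar], error = function(e) "error")

print(by_name)
print(by_names)
print(no_match)
print(unquoted)
```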
| Now you know all four methods of subsetting data from vectors. Different approaches are best in different scenarios and when in doubt, try it out!