Loop Functions
Writing for and while loops is useful when programming, but not particularly convenient when working interactively on the command line. R provides several functions that implement looping compactly to make life easier.
1. lapply & sapply
lapply: Loop over a list and evaluate a function on each element
sapply: Same as lapply but tries to simplify the result
lapply
lapply takes three arguments: (1) a list X; (2) a function (or the name of a function) FUN; (3) other arguments via its ... argument. If X is not a list, it will be coerced to a list using as.list.
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<bytecode: 0x7faa481311a0>
<environment: namespace:base>
lapply always returns a list, regardless of the class of the input
> x <- list(a = 1:5, b = rnorm(10))
> lapply(x,mean)
$a
[1] 3
$b
[1] 0.06460763
> x <- list(a = 1:4, b = rnorm(10), c=rnorm(20,1), d=rnorm(100,5))
> lapply(x, mean)
$a
[1] 2.5
$b
[1] 0.08151065
$c
[1] 0.8754845
$d
[1] 4.895723
> x <- 1:4
> lapply(x, runif)
[[1]]
[1] 0.5393189
[[2]]
[1] 0.7646683 0.9085368
[[3]]
[1] 0.2349948 0.2697491 0.9805410
[[4]]
[1] 0.3462296 0.4752699 0.6940489 0.6286985
> x <- 1:4
> lapply(x, runif, min=0, max=10)
[[1]]
[1] 7.896477
[[2]]
[1] 3.310554 7.511894
[[3]]
[1] 2.697588 5.187521 6.146860
[[4]]
[1] 8.548282 3.073971 9.337631 3.123481
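Because the examples above use random numbers, their output changes on every run. As a reproducible complement, here is a minimal sketch showing that lapply returns a list even for a plain vector and that extra arguments are forwarded to FUN through `...`. The vector `v` and the choice of `round` are illustrative assumptions, not part of the original examples.

```r
# Illustrative input: a plain numeric vector, not a list
v <- c(1.234, 5.678)

# lapply coerces v with as.list() and returns a list of the same length
res <- lapply(v, round)

# digits = 1 is not an argument of lapply itself;
# it is forwarded to round() through ...
res1 <- lapply(v, round, digits = 1)

str(res)
str(res1)
```

The same forwarding mechanism is what passes `min = 0, max = 10` to runif in the call above.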
lapply and friends make heavy use of anonymous functions
> x <- list(a=matrix(1:4,2,2), b=matrix(1:6,3,2))
> x
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4
$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
An anonymous function for extracting the first column of each matrix
> lapply(x, function(elt) elt[,1])
$a
[1] 1 2
$b
[1] 1 2 3
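An anonymous function is just a function without a name, so a named helper should give the same result. The sketch below checks this; the helper name `first_col` is hypothetical and chosen only for illustration.

```r
# Same matrices as in the example above
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))

# Hypothetical helper; does the same job as function(elt) elt[, 1]
first_col <- function(m) m[, 1]

by_name <- lapply(x, first_col)
by_anon <- lapply(x, function(elt) elt[, 1])

identical(by_name, by_anon)
```

The anonymous form is usually preferred when the function is short and used only once.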
sapply
sapply will try to simplify the result of lapply if possible
- if the result is a list where every element is length 1, then a vector is returned
- if the result is a list where every element is a vector of the same length (>1), a matrix is returned
- if it can’t figure things out, a list is returned
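The three simplification cases can be sketched with small deterministic inputs. The inputs below are illustrative assumptions chosen so that each rule is triggered once.

```r
# Every result has length 1 -> a named numeric vector
v <- sapply(list(a = 1:5, b = c(2, 4)), mean)

# Every result has the same length (2) -> a matrix, one column per element
m <- sapply(list(a = 1:5, b = 6:10), range)

# Results have different lengths -> sapply falls back to a list
l <- sapply(1:3, seq_len)

v
dim(m)
class(l)
```

Since the return type depends on the data, code that needs a predictable type may prefer lapply (always a list) or check the result before using it.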
2. apply
Apply a function over the margins of an array
apply is used to evaluate a function (often an anonymous one) over the margins of an array
- It is most often used to apply a function to the rows or columns of a matrix
- It can be used with general arrays, e.g. taking the average of an array of matrices
- It is not really faster than writing a loop, but it works in one line!
> str(apply)
function (X, MARGIN, FUN, ..., simplify = TRUE)
- X is an array
- MARGIN is an integer vector indicating which margins should be “retained”
- FUN is a function to be applied
- ... is for other arguments to be passed to FUN
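To make the meaning of MARGIN concrete, here is a small sketch on a matrix whose values are easy to add by hand. The matrix `m` is an illustrative assumption.

```r
# Small deterministic matrix: 2 rows, 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# MARGIN = 1 retains the rows: one sum per row
row_totals <- apply(m, 1, sum)

# MARGIN = 2 retains the columns: one sum per column
col_totals <- apply(m, 2, sum)

row_totals
col_totals
```

The length of each result matches the size of the retained dimension: two row totals and three column totals.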
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean)
[1] -0.21539902 0.25480669 0.29069982 0.17461701
[5] 0.37034020 0.12646704 -0.32566278 -0.22870461
[9] -0.09823548 -0.25911445
> apply(x, 1, sum)
[1] -7.0689350 8.6579044 5.1903690 3.2852374
[5] 1.1214267 5.5725971 1.3352220 -1.2947558
[9] 1.9802187 -0.6288018 4.0929522 -3.6821994
[13] 3.1798243 -6.3139959 3.2065088 1.1054827
[17] -3.2508571 -4.7663893 -7.3932119 -2.5323088
col/row sums and means
For sums and means of matrix dimensions, we have some shortcuts
- rowSums = apply(x, 1, sum)
- rowMeans = apply(x, 1, mean)
- colSums = apply(x, 2, sum)
- colMeans = apply(x, 2, mean)
The shortcut functions are much faster, but you won’t notice unless you’re using a large matrix
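A quick way to convince yourself that the shortcuts compute the same values is to compare them with the corresponding apply calls. This is a minimal sketch; the seed is arbitrary and only makes the comparison repeatable.

```r
set.seed(1)  # arbitrary seed so the comparison is repeatable
x <- matrix(rnorm(200), 20, 10)

# Each shortcut should agree with the corresponding apply() call
all.equal(rowSums(x),  apply(x, 1, sum))
all.equal(rowMeans(x), apply(x, 1, mean))
all.equal(colSums(x),  apply(x, 2, sum))
all.equal(colMeans(x), apply(x, 2, mean))
```

all.equal is used instead of identical because floating-point results may differ in the last few digits.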
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 1, quantile, probs = c(0.25, 0.75))
          [,1]       [,2]         [,3]       [,4]
25% -1.4089414 -1.0943368 -0.363817672 -0.8012552
75%  0.1710051  0.4501446  0.007953445  0.5578421
          [,5]       [,6]       [,7]       [,8]
25% -0.3777239 -0.3780244 -0.2093595 -0.2881016
75%  0.7154293  1.1488982  0.6749194  0.2354858
         [,9]      [,10]      [,11]      [,12]
25% -1.082317 0.03608219 -0.7720221 -1.2136858
75%  0.748666 1.17685990  0.3569027  0.7026888
         [,13]      [,14]      [,15]      [,16]
25% -0.4269014 -0.1967092 -0.6222264 -0.9687748
75%  1.2411431  1.1737596  0.4463872  0.3376294
         [,17]      [,18]      [,19]      [,20]
25% -0.7954193 -1.1253970 -0.4542702 -0.4136788
75%  0.9285497  0.4474869  0.9182585  0.8779093
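In the quantile example, FUN returns two values per row, so the result is a matrix with 2 rows and one column for each of the 20 rows of x. The sketch below shows the same orientation on a tiny matrix; `m` and the use of `range` are illustrative assumptions.

```r
# 3 x 2 matrix; its rows are (1, 4), (2, 5), (3, 6)
m <- matrix(1:6, nrow = 3, ncol = 2)

# range() returns two values per row, so the result has 2 rows
# and one column for each of the 3 rows of m
r <- apply(m, 1, range)

dim(r)
r
```

If one row per input row is preferred, the result can be transposed with t().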
Average matrix in an array
> a <- array(rnorm(2 * 2 * 10), c(2,2,10))
> apply(a, c(1,2), mean)
            [,1]       [,2]
[1,]  0.43270737 -0.5288165
[2,] -0.04156793  0.1450441
> rowMeans(a, dims=2)
            [,1]       [,2]
[1,]  0.43270737 -0.5288165
[2,] -0.04156793  0.1450441
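The same averaging can be checked by hand on an array with known values. This is a minimal sketch; the array `a` below is an illustrative assumption.

```r
# 2 x 2 x 2 array: two stacked 2 x 2 matrices holding 1:4 and 5:8
a <- array(1:8, c(2, 2, 2))

# Retain dimensions 1 and 2, average over dimension 3
avg_apply <- apply(a, c(1, 2), mean)

# With dims = 2, rowMeans averages over everything after the first two dimensions
avg_fast <- rowMeans(a, dims = 2)

avg_apply
all.equal(avg_apply, avg_fast)
```

Each cell of the result is the mean of the two stacked values, for example (1 + 5) / 2 = 3 in position [1, 1].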
3. mapply
Multivariate version of lapply
mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments
> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE,
USE.NAMES = TRUE)
- FUN is a function to apply
- ... contains arguments to apply over
- MoreArgs is a list of other arguments to FUN
- SIMPLIFY indicates whether the result should be simplified
The following is tedious to type
list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
Instead we can do
> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
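The remaining arguments, MoreArgs and SIMPLIFY, can be sketched in the same way. The functions and values below are illustrative assumptions.

```r
# Element-wise over two vectors: calls f(1, 4), f(2, 5), f(3, 6)
sums <- mapply(function(x, y) x + y, 1:3, 4:6)

# MoreArgs holds arguments that are the same for every call (times = 2).
# Each call returns 2 values, so the result is simplified to a 2 x 3 matrix.
reps <- mapply(rep, 1:3, MoreArgs = list(times = 2))

# SIMPLIFY = FALSE always returns a list
reps_list <- mapply(rep, 1:3, MoreArgs = list(times = 2), SIMPLIFY = FALSE)

sums
dim(reps)
```

In the rep example above the results have different lengths, which is why mapply returns a list there even though SIMPLIFY defaults to TRUE.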
Vectorizing a Function
> noise <- function(n, mean, sd) {
+ rnorm(n, mean, sd)
+ }
> noise(5, 1, 2)
[1] 1.196562 -1.215140 3.092332 3.702626 5.585344
> noise(1:5, 1:5, 2)
[1] 1.852460 5.124322 3.231025 2.935376 4.500935
Passing vectors directly does not do what we want: when n is a vector, rnorm uses its length as the number of values to generate, so this call returns five numbers (one for each mean) rather than five vectors of increasing length.
> mapply(noise, 1:5, 1:5, 2)
[[1]]
[1] 4.008276
[[2]]
[1] 3.300472 2.697185
[[3]]
[1] 0.9169633 2.3585491 2.6048531
[[4]]
[1] 2.0001993 5.1287517 0.3977513 3.4983892
[[5]]
[1] 6.236692 5.622495 6.975805 6.385192 5.001197
Instant Vectorization
> mapply(noise, 1:5, 1:5, 2)
[[1]]
[1] 0.09917791
[[2]]
[1] 2.544472 2.226350
[[3]]
[1] 1.798817 1.783018 4.084838
[[4]]
[1] 2.463081 2.472774 6.615392 9.716881
[[5]]
[1] 6.534139 4.407679 5.400077 4.202349 3.646371
Which is the same as
> list(noise(1,1,2), noise(2,2,2), noise(3,3,2), noise(4,4,2), noise(5,5,2))
[[1]]
[1] 4.234073
[[2]]
[1] 4.641468 4.531963
[[3]]
[1] 1.039511 4.629856 2.239910
[[4]]
[1] 3.4594122 -0.8233385 1.1815806 4.7479968
[[5]]
[1] 4.727629 6.513389 2.653866 3.381691 6.811396
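Because noise draws random numbers, the two outputs above cannot be compared by eye. A minimal sketch of how to check the equivalence is to reset the random seed before each version; the seed value 42 is an arbitrary assumption.

```r
noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}

set.seed(42)  # arbitrary seed
via_mapply <- mapply(noise, 1:5, 1:5, 2)

set.seed(42)  # reset so the same random numbers are drawn again
by_hand <- list(noise(1, 1, 2), noise(2, 2, 2), noise(3, 3, 2),
                noise(4, 4, 2), noise(5, 5, 2))

lengths(via_mapply)
all.equal(via_mapply, by_hand)
```

mapply makes the calls in order, so with the same seed both versions should consume the random number stream identically.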
4. tapply
Apply a function over subsets of a vector
tapply is used to apply a function over subsets of a vector. I don’t know why it’s called tapply
> str(tapply)
function (X, INDEX, FUN = NULL, ..., default = NA,
simplify = TRUE)
- X is a vector
- INDEX is a factor or a list of factors (or else they are coerced to factors)
- FUN is a function to be applied
- ... contains other arguments to be passed to FUN
- simplify indicates whether the result should be simplified
Take group means
> x <- c(rnorm(10), runif(10), rnorm(10,1))
> f <- gl(3,10)
> f
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
[26] 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
        1         2         3
0.1242822 0.4960855 1.6125871
Take group means without simplification
> tapply(x, f, mean, simplify = FALSE)
$`1`
[1] 0.1242822
$`2`
[1] 0.4960855
$`3`
[1] 1.612587
Find group ranges
> tapply(x, f, range)
$`1`
[1] -1.537179 2.028251
$`2`
[1] 0.01933642 0.95697515
$`3`
[1] 0.2878385 2.5830202
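For a version that can be verified by hand, here is a minimal sketch with fixed values. The vector `x` and the factor labels "a" and "b" are illustrative assumptions.

```r
# Illustrative data: two groups of three values each
x <- c(1, 2, 3, 10, 20, 30)
f <- factor(c("a", "a", "a", "b", "b", "b"))

# One mean per group, simplified to a named one-dimensional array
means <- tapply(x, f, mean)

# range() returns two values per group, so the result stays a list
ranges <- tapply(x, f, range)

means
ranges
```

Group "a" averages to 2 and group "b" to 20, and each element of `ranges` holds the minimum and maximum of its group.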
5. split
split takes a vector or other object and splits it into groups determined by a factor or list of factors
> str(split)
function (x, f, drop = FALSE, ...)
- x is a vector (or list) or data frame
- f is a factor (or coerced to one) or a list of factors
- drop indicates whether empty factor levels should be dropped
> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> split(x, f)
$`1`
[1] 0.5330910 0.2794371 0.5029999 1.5984695
[5] -1.0672447 0.3206135 0.5849916 0.3912841
[9] -1.6406344 -1.2607067
$`2`
[1] 0.058477467 0.004412661 0.955095140 0.696776107
[5] 0.918116786 0.578598479 0.406501306 0.733634812
[9] 0.048131931 0.527895519
$`3`
[1] 0.01931473 -1.31360908 2.39626004 0.73020077
[5] -0.25639517 -0.20425572 2.16285391 -0.11501219
[9] -0.71055291 -0.43085714
A common idiom is split followed by an lapply
> lapply(split(x, f), mean)
$`1`
[1] 0.02423008
$`2`
[1] 0.492764
$`3`
[1] 0.2277947
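For simple summaries, split followed by lapply or sapply gives the same numbers as tapply. The sketch below checks this on fixed values; the data are illustrative assumptions.

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)   # factor with levels 1 and 2, each repeated 3 times

groups <- split(x, f)              # list with elements `1` and `2`
via_split  <- sapply(groups, mean)
via_tapply <- tapply(x, f, mean)

groups
all.equal(unname(via_split), as.vector(via_tapply))
```

The advantage of split is that the intermediate list is available, so arbitrary functions, including ones that return more complex objects, can be applied to each group.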
Splitting a Data Frame
> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> s <- split(airquality, airquality$Month)
> lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
$`5`
   Ozone  Solar.R     Wind
      NA       NA 11.62258
$`6`
    Ozone   Solar.R      Wind
       NA 190.16667  10.26667
$`7`
     Ozone    Solar.R       Wind
        NA 216.483871   8.941935
$`8`
   Ozone  Solar.R     Wind
      NA       NA 8.793548
$`9`
   Ozone  Solar.R     Wind
      NA 167.4333  10.1800
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
               5         6          7        8
Ozone         NA        NA         NA       NA
Solar.R       NA 190.16667 216.483871       NA
Wind    11.62258  10.26667   8.941935 8.793548
               9
Ozone         NA
Solar.R 167.4333
Wind     10.1800
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
                5         6          7          8
Ozone    23.61538  29.44444  59.115385  59.961538
Solar.R 181.29630 190.16667 216.483871 171.857143
Wind     11.62258  10.26667   8.941935   8.793548
                9
Ozone    31.44828
Solar.R 167.43333
Wind     10.18000
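The effect of na.rm is easier to see on a toy data frame where the missing value is obvious. This is a minimal sketch; the data frame `df` and its column names are hypothetical.

```r
# Hypothetical toy data frame; group "a" contains a missing value
df <- data.frame(g = c("a", "a", "b", "b"),
                 v = c(1, NA, 3, 5))

s <- split(df, df$g)   # one data frame per value of g

# Without na.rm, any NA in a group makes that group's mean NA
with_na <- sapply(s, function(d) mean(d$v))

# na.rm = TRUE drops missing values before averaging
without_na <- sapply(s, function(d) mean(d$v, na.rm = TRUE))

with_na
without_na
```

This mirrors the airquality output above: months with missing Ozone or Solar.R readings give NA until na.rm = TRUE is passed through to colMeans.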
Splitting on more than one level
> x <- rnorm(10)
> f1 <- gl(2,5)
> f2 <- gl(5,2)
> f1
[1] 1 1 1 1 1 2 2 2 2 2
Levels: 1 2
> f2
[1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
> interaction(f1, f2)
[1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5
Interactions can create empty levels
> str(split(x, list(f1,f2)))
List of 10
$ 1.1: num [1:2] -0.3 -1.55
$ 2.1: num(0)
$ 1.2: num [1:2] -0.5569 0.0888
$ 2.2: num(0)
$ 1.3: num 0.426
$ 2.3: num -0.329
$ 1.4: num(0)
$ 2.4: num [1:2] -1.1 2.15
$ 1.5: num(0)
$ 2.5: num [1:2] 0.807 -1.418
Empty levels can be dropped
> str(split(x, list(f1, f2), drop = TRUE))
List of 6
$ 1.1: num [1:2] -0.3 -1.55
$ 1.2: num [1:2] -0.5569 0.0888
$ 1.3: num 0.426
$ 2.3: num -0.329
$ 2.4: num [1:2] -1.1 2.15
$ 2.5: num [1:2] 0.807 -1.418
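The effect of drop can also be checked on fixed values. In this sketch the factors are illustrative assumptions, constructed so that only two of the four possible combinations occur.

```r
x  <- 1:4
f1 <- gl(2, 2)                          # 1 1 2 2
f2 <- factor(c("x", "x", "y", "y"))

# Only the combinations 1.x and 2.y occur in the data
all_groups  <- split(x, list(f1, f2))
kept_groups <- split(x, list(f1, f2), drop = TRUE)

names(all_groups)
names(kept_groups)
```

Dropping empty levels is useful before an lapply or sapply call, since functions such as mean return NaN for zero-length groups.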