
R Language Notes 3 (Loop Control)


Looping on the Command Line

  • lapply: Loop over a list and evaluate a function on each element
  • sapply: Same as lapply but try to simplify the result
  • apply: Apply a function over the margins of an array
  • tapply: Apply a function over subsets of a vector
  • mapply: Multivariate version of lapply

    An auxiliary function split is also useful, particularly in conjunction with lapply.

lapply

lapply takes three arguments: 1) a list X; 2) a function (or the name of a function) FUN; 3) other arguments passed on to FUN via its ... argument. If X is not a list, it will be coerced to a list using as.list.

function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}

lapply always returns a list, regardless of the class of the input.
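
As a small illustration of that rule, the sketch below calls lapply on a named vector and on a data frame. The inputs v and df are made up for this example; only base R functions are used.

```r
# lapply accepts a non-list input and still returns a list.
v <- c(a = 1, b = 4, c = 9)          # a named numeric vector, not a list
res <- lapply(v, sqrt)               # each element is passed to sqrt()

is.list(res)                         # the result is a list
names(res)                           # element names are carried over

# A data frame is a list of columns, so lapply iterates over the columns.
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))
col_means <- lapply(df, mean)
str(col_means)
```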

> x <- list(a = 1:5, b = rnorm(10))
> lapply(x, mean)
$a
[1] 3

$b
[1] -0.3547574

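lapply with runif: each element of x is passed to runif as the number of uniform random values to generate.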
> x <- 1:4
> lapply(x, runif)
[[1]]
[1] 0.5858749

[[2]]
[1] 0.9175867 0.5018911

[[3]]
[1] 0.9469087 0.9433378 0.1517459

[[4]]
[1] 0.09512925 0.74208153 0.15909282 0.87364938

> lapply(x, runif, min = 0, max = 10)
[[1]]
[1] 5.831673

[[2]]
[1] 7.570051 7.305406

[[3]]
[1] 9.368319 5.455867 7.415463

[[4]]
[1] 6.826887 7.357280 3.238416 7.802379

An anonymous function for extracting the first column of each matrix

> x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
> x
$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> lapply(x, function(elt) elt[, 1])
$a
[1] 1 2

$b
[1] 1 2 3

sapply

sapply will try to simplify the result of lapply if possible.

  • if the result is a list where every element is length 1, then a vector is returned
  • if the result is a list where every element is a vector of the same length (>1), a matrix is returned.
  • if it can't figure things out, a list is returned
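
The three cases above can be sketched as follows. The list x is invented for this example; the point is only the shape of each result.

```r
x <- list(a = 1:4, b = 5:8, c = 9:12)

# Every result has length 1, so sapply returns a named vector.
m <- sapply(x, mean)

# Every result has the same length (2), so sapply returns a matrix
# with one column per list element.
r <- sapply(x, range)

# The results have different lengths, so sapply falls back to a list.
l <- sapply(1:3, seq_len)

print(m)
print(r)
str(l)
```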

apply

apply is used to evaluate a function (often an anonymous one) over the margins of an array.

  • It is most often used to apply a function to the rows or columns of a matrix
  • It can be used with general arrays, e.g. taking the average of an array of matrices
  • It is not really faster than writing a loop, but it works in one line!
> str(apply)
 function (X, MARGIN, FUN, ...)
  • X is an array
  • MARGIN is an integer vector indicating which margins should be "retained".
  • FUN is a function to be applied
  • ... is for other arguments to be passed to FUN
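
A minimal sketch of how MARGIN and ... behave, using a tiny matrix made up for this example so the results are easy to check by hand:

```r
x <- matrix(1:6, nrow = 2, ncol = 3)   # rows are (1, 3, 5) and (2, 4, 6)

# MARGIN = 2 retains the columns: one result per column.
col_totals <- apply(x, 2, sum)

# MARGIN = 1 retains the rows: one result per row.
row_totals <- apply(x, 1, sum)

# Arguments after FUN are handed to FUN through ...
x_na <- x
x_na[1, 1] <- NA
row_means <- apply(x_na, 1, mean, na.rm = TRUE)

print(col_totals)
print(row_totals)
print(row_means)
```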
> x <- matrix(rnorm(200), 20, 10)
> apply(x, 2, mean)
 [1]  0.075145604  0.035773125  0.028000869 -0.002397926 -0.093946806 -0.137180745
 [7]  0.185798470  0.176721101  0.333040593 -0.023563408
> apply(x, 1, sum)
 [1]  3.6178468 -1.7521142 -2.8018156 -3.0790132 -1.5047359  1.1693516  2.8311426
 [8] -0.3017269  3.9315377 -3.0270337  1.9042014  0.1322095  3.4130400  4.5420898
[15]  6.5186712  2.2904191 -3.4532168  0.9306271 -2.3215379 -1.4921250

col/row sums and means
For sums and means of matrix dimensions, we have some shortcuts.

  • rowSums = apply(x, 1, sum)
  • rowMeans = apply(x, 1, mean)
  • colSums = apply(x, 2, sum)
  • colMeans = apply(x, 2, mean)
    The shortcut functions are much faster, but you won’t notice unless you’re using a large matrix.
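
To check that the shortcuts agree with the apply versions, a quick comparison can be run. The seed is arbitrary and only there for reproducibility.

```r
set.seed(1)                            # arbitrary seed, reproducibility only
x <- matrix(rnorm(200), 20, 10)

# The shortcut and the apply call should agree up to floating-point error.
rows_match <- all.equal(rowSums(x),  apply(x, 1, sum))
cols_match <- all.equal(colMeans(x), apply(x, 2, mean))

print(rows_match)
print(cols_match)
```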

Other Ways to Apply
Quantiles of the rows of a matrix.

> x <- matrix(rnorm(200), 20, 10)
> apply(x, 1, quantile, probs = c(0.25, 0.75))
          [,1]       [,2]       [,3]        [,4]       [,5]       [,6]       [,7]
25% -1.0785842 -0.1784524 -0.5438079 -0.64180635 -1.0005862 -0.6028029 -0.3994210
75%  0.4211757  1.1409644  0.8497324  0.07363245  0.3657194  1.7941782  0.7042971
          [,8]       [,9]      [,10]      [,11]      [,12]      [,13]      [,14]
25% -1.0833555 -0.4054171 -0.7756497 -0.3573221 -0.3508735 -0.5804230 -1.2599964
75%  0.4103444  0.1445452  0.8813965  0.1172065  0.4001036  0.4037903  0.4085925
         [,15]      [,16]      [,17]      [,18]      [,19]      [,20]
25% -0.3023458 -0.8841903 -0.3171798 -0.9172023 -0.5017125 -0.2736942
75%  0.5983026  0.8812674  0.7821295  0.6989358  0.6052281  0.6885168

Average of the matrices in an array

> a <- array(rnorm(2 * 2 * 10), c(2,2,10))
> apply(a,c(1,2),mean)
            [,1]      [,2]
[1,] -0.55440455 0.0617595
[2,] -0.05850618 0.2768596
> rowMeans(a, dims = 2)
            [,1]      [,2]
[1,] -0.55440455 0.0617595
[2,] -0.05850618 0.2768596
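
The same idea with a small deterministic array (made up for this example) makes it easier to see which dimensions are retained and which one is averaged away:

```r
a <- array(1:8, dim = c(2, 2, 2))      # two 2x2 matrices stacked along dim 3

# Retain dimensions 1 and 2, so the mean is taken across the third dimension.
avg <- apply(a, c(1, 2), mean)

# rowMeans with dims = 2 treats the first two dimensions as "rows".
same <- rowMeans(a, dims = 2)

# Retain dimension 3 instead: one sum per stacked matrix.
slice_sums <- apply(a, 3, sum)

print(avg)
print(slice_sums)
```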

mapply

mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

> str(mapply)
function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, 
          USE.NAMES = TRUE) 
  • FUN is a function to apply
  • ... contains arguments to apply over
  • MoreArgs is a list of other arguments to FUN
  • SIMPLIFY indicates whether the result should be simplified

The following is tedious to type

list(rep(1, 4), rep(2, 3), rep(3,2), rep(4, 1))

Instead we can do

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Vectorizing a Function

> noise <- function(n, mean, sd){
+         rnorm(n, mean, sd)
+ }
> noise(5, 1, 2)
[1]  0.8869572 -3.2587213  1.6896915 -2.8099109 -0.6223403
> noise(1:5, 1:5, 2)
[1] 3.648009 3.231274 5.183338 4.613210 4.779682

> mapply(noise, 1:5, 1:5, 2)
[[1]]
[1] -0.8486255

[[2]]
[1] 5.185828 2.090021

[[3]]
[1] 1.569743 4.730446 5.148882

[[4]]
[1] 7.791310 2.794005 3.218264 3.167556

[[5]]
[1] 4.248685 4.266738 4.408645 7.883641 3.604923

Instant Vectorization
Which is the same as

> list(noise(1, 1, 2),noise(2, 2, 2),
+      noise(3, 3, 2),noise(4, 4, 2),
+      noise(5, 5, 2))
[[1]]
[1] 0.223665

[[2]]
[1] 3.305073 4.249545

[[3]]
[1] 1.455778 1.983828 4.047241

[[4]]
[1] 6.035508 3.497671 1.140013 7.418242

[[5]]
[1] 7.870139 3.579258 4.869865 1.481063 6.139446
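
The difference between the direct call and the mapply call shows up in the lengths of the results. This sketch repeats the noise function from above; the seed is arbitrary.

```r
noise <- function(n, mean, sd) {
  rnorm(n, mean, sd)
}

set.seed(42)                                 # arbitrary seed

# rnorm treats a vector 'n' as "draw length(n) values", so this
# returns 5 numbers in total, not 1 + 2 + 3 + 4 + 5.
direct <- noise(1:5, 1:5, 2)

# mapply makes one call per (n, mean) pair; sd = 2 is recycled.
vectorized <- mapply(noise, 1:5, 1:5, 2)

length(direct)
lengths(vectorized)
```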

> unique(c(3, 4, 5, 5, 5, 6, 6))
[1] 3 4 5 6

> sapply(flags,unique)
$name
  [1] Afghanistan              Albania                 
  [3] Algeria                  American-Samoa          
  [5] Andorra                  Angola                  
  [7] Anguilla                 Antigua-Barbuda         
  [9] Argentina                Argentine               
 [11] Australia                Austria                 
 [13] Bahamas                  Bahrain      



lapply(unique_vals, function(elem) elem[2]) will return a list containing
the second item from each element of the unique_vals list (unique_vals is
assumed here to hold the per-column unique values of flags, such as the
list printed by sapply(flags, unique) above). Note that our function takes
one argument, elem, which is just a 'dummy variable' that takes on the
value of each element of unique_vals, in turn.

> lapply(unique_vals,function(elem) elem[2])
$name
[1] Albania
194 Levels: Afghanistan Albania Algeria American-Samoa Andorra ... Zimbabwe

$landmass
[1] 3

$zone
[1] 3

$area
[1] 29

$population
[1] 3

$language
[1] 6

$religion
[1] 6

$bars
[1] 2

$stripes
[1] 0

tapply

tapply is used to apply a function over subsets of a vector. I don’t know why it’s called tapply

> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
  • X is a vector
  • INDEX is a factor or a list of factors (or else they are coerced to factors)
  • FUN is a function to be applied
  • ... contains other arguments to be passed to FUN
  • simplify, should we simplify the result?
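
A minimal sketch of these arguments with made-up data (scores and group are hypothetical names), small enough to verify by hand:

```r
scores <- c(10, 20, 30, 40, 50, 60)           # hypothetical values
group  <- c("a", "a", "b", "b", "c", "c")     # coerced to a factor by tapply

# One scalar per group, simplified to a named array.
means <- tapply(scores, group, mean)

# range returns two values per group, so the result is a list.
ranges <- tapply(scores, group, range)

# simplify = FALSE returns a list even for scalar results.
means_list <- tapply(scores, group, mean, simplify = FALSE)

print(means)
print(ranges)
```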

Take group means.

> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> f
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3
[24] 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
        1         2         3 
0.1144464 0.5163468 1.2463678 

Take group means without simplification

> tapply(x, f, mean, simplify = FALSE)
$`1`
[1] 0.1144464

$`2`
[1] 0.5163468

$`3`
[1] 1.246368

Find group ranges

> tapply(x, f, range)
$`1`
[1] -1.097309  2.694970

$`2`
[1] 0.09479023 0.79107293

$`3`
[1] 0.4717443 2.5887025

split

split takes a vector or other objects and splits it into groups determined by a factor or list of
factors.

> str(split)
function (x, f, drop = FALSE, ...)
  • x is a vector (or list) or data frame
  • f is a factor (or coerced to one) or a list of factors
  • drop indicates whether empty factor levels should be dropped

    > x <- c(rnorm(10), runif(10), rnorm(10, 1))
    > f <- gl(3, 10)
    > split(x, f)
    $`1`
     [1] -0.8493038 -0.5699717 -0.8385255 -0.8842019
     [5]  0.2849881  0.9383361 -1.0973089  2.6949703
     [9]  1.5976789 -0.1321970

    $`2`
     [1] 0.09479023 0.79107293 0.45857419 0.74849293
     [5] 0.34936491 0.35842084 0.78541705 0.57732081
     [9] 0.46817559 0.53183823

    $`3`
     [1] 0.6795651 0.9293171 1.0318103 0.4717443
     [5] 2.5887025 1.5975774 1.3246333 1.4372701
    

A common idiom is split followed by an lapply

> lapply(split(x, f), mean)
$`1`
[1] 0.1144464

$`2`
[1] 0.5163468

$`3`
[1] 1.246368
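
The split-then-apply idiom gives the same group means as tapply, and the drop argument controls whether unused factor levels appear in the result. A small sketch with made-up data:

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3)                          # levels 1 and 2, three values each

groups    <- split(x, f)               # list with one vector per level
by_split  <- sapply(groups, mean)
by_tapply <- tapply(x, f, mean)

# A factor with an unused level "c": split keeps it unless drop = TRUE.
f_extra <- factor(c("a", "a", "b"), levels = c("a", "b", "c"))
kept    <- names(split(1:3, f_extra))
dropped <- names(split(1:3, f_extra, drop = TRUE))

print(by_split)
print(kept)
print(dropped)
```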

Splitting a Data Frame

> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
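
One way to continue from here, sketched under the assumption that the goal is per-month column means: split the data frame by Month, then apply a function to each piece. The choice of columns and of na.rm = TRUE is illustrative, not taken from the notes above.

```r
library(datasets)

# One data frame per month (5 through 9).
s <- split(airquality, airquality$Month)

# Column means within each month; na.rm = TRUE skips the missing readings.
monthly <- sapply(s, function(d) {
  colMeans(d[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE)
})

print(names(s))
print(dim(monthly))    # 3 variables by 5 months
```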

vapply and tapply
Whereas sapply() tries to ‘guess’ the correct format of the result, vapply()
allows you to specify it explicitly. If the result doesn’t match the format
you specify, vapply() will throw an error, causing the operation to stop.
This can prevent significant problems in your code that might be caused by
getting unexpected return values from sapply().

vapply() may perform faster than sapply() for large datasets.
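
A minimal sketch of this behaviour on a small made-up data frame (df is hypothetical and stands in for any dataset): the first call matches its template, the second does not and is caught with tryCatch so the script keeps running.

```r
df <- data.frame(id = 1:3, label = c("x", "y", "z"),
                 stringsAsFactors = FALSE)    # hypothetical data

# class() returns one string per column, which matches character(1).
classes <- vapply(df, class, character(1))

# unique() returns three values per column, which does not match
# numeric(1), so vapply raises an error instead of changing shape.
res <- tryCatch(
  vapply(df, unique, numeric(1)),
  error = function(e) "vapply stopped with an error"
)

print(classes)
print(res)
```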


> vapply(flags, unique,numeric(1))
Error in vapply(flags, unique, numeric(1)) : values must be length 1,
 but FUN(X[[1]]) result is length 194

| Recall from the previous lesson that sapply(flags, class) will return a
| character vector containing the class of each column in the dataset. Try
| that again now to see the result.

> sapply(flags, class)
      name   landmass       zone       area population   language   religion 
  "factor"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
      bars    stripes    colours        red      green       blue       gold 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
     white      black     orange    mainhue    circles    crosses   saltires 
 "integer"  "integer"  "integer"   "factor"  "integer"  "integer"  "integer" 
  quarters   sunstars   crescent   triangle       icon    animate       text 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
   topleft   botright 
  "factor"   "factor" 


| If we wish to be explicit about the format of the result we expect, we can
| use vapply(flags, class, character(1)). The 'character(1)' argument tells R
| that we expect the class function to return a character vector of length 1
| when applied to EACH column of the flags dataset. Try it now.

> vapply(flags, class, character(1))
      name   landmass       zone       area population   language   religion 
  "factor"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
      bars    stripes    colours        red      green       blue       gold 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
     white      black     orange    mainhue    circles    crosses   saltires 
 "integer"  "integer"  "integer"   "factor"  "integer"  "integer"  "integer" 
  quarters   sunstars   crescent   triangle       icon    animate       text 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
   topleft   botright 
  "factor"   "factor" 


| Note that since our expectation was correct (i.e. character(1)), the
| vapply() result is identical to the sapply() result -- a character vector of
| column classes.


| You might think of vapply() as being 'safer' than sapply(), since it
| requires you to specify the format of the output in advance, instead of just
| allowing R to 'guess' what you wanted. In addition, vapply() may perform
| faster than sapply() for large datasets. However, when doing data analysis
| interactively (at the prompt), sapply() saves you some typing and will often
| be good enough.


| As a data analyst, you'll often wish to split your data up into groups based
| on the value of some variable, then apply a function to the members of each
| group. The next function we'll look at, tapply(), does exactly that.


| Use ?tapply to pull up the documentation.

> ?tapply


| The 'landmass' variable in our dataset takes on integer values between 1 and
| 6, each of which represents a different part of the world. Use
| table(flags$landmass) to see how many flags/countries fall into each group.

> table(flags$landmass)

 1  2  3  4  5  6 
31 17 35 52 39 20 


| The 'animate' variable in our dataset takes the value 1 if a country's flag
| contains an animate image (e.g. an eagle, a tree, a human hand) and 0
| otherwise. Use table(flags$animate) to see how many flags contain an animate
| image.

> table(flags$animate)

  0   1 
155  39 


| This tells us that 39 flags contain an animate object (animate = 1) and 155
| do not (animate = 0).


| If you take the arithmetic mean of a bunch of 0s and 1s, you get the
| proportion of 1s. Use tapply(flags$animate, flags$landmass, mean) to apply
| the mean function to the 'animate' variable separately for each of the six
| landmass groups, thus giving us the proportion of flags containing an
| animate image WITHIN each landmass group.

> tapply(flags$animate,flags$landmass,mean)
        1         2         3         4         5         6 
0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000 


| The first landmass group (landmass = 1) corresponds to North America and
| contains the highest proportion of flags with an animate image (0.4194).
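
The flags dataset is not part of base R, so here is the same idea on a tiny invented example: the mean of a 0/1 indicator within each group is the proportion of 1s in that group. The vectors has_animal and region are hypothetical.

```r
has_animal <- c(1, 0, 1, 1, 0, 0, 0, 1)   # hypothetical 0/1 indicator
region     <- c(1, 1, 1, 2, 2, 2, 2, 2)   # hypothetical group codes

# Group sizes, as with table(flags$landmass).
counts <- table(region)

# Proportion of 1s within each group.
props <- tapply(has_animal, region, mean)

print(counts)
print(props)
```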


| Similarly, we can look at a summary of population values (in round millions)
| for countries with and without the color red on their flag with
| tapply(flags$population, flags$red, summary).

> tapply(flags$population,flags$red,summary)
$`0`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    3.00   27.63    9.00  684.00 

$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    0.0     0.0     4.0    22.1    15.0  1008.0 



| What is the median population (in millions) for countries *without* the
| color red on their flag?

1: 3.0
2: 9.0
3: 4.0
4: 22.1
5: 0.0
6: 27.6

Selection: 1


| Lastly, use the same approach to look at a summary of population values for
| each of the six landmasses.

> tapply(flags$population,flags$landmass,summary)
$`1`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   12.29    4.50  231.00 

$`2`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    6.00   15.71   15.00  119.00 

$`3`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    8.00   13.86   16.00   61.00 

$`4`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   5.000   8.788   9.750  56.000 

$`5`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    2.00   10.00   69.18   39.00 1008.00 

$`6`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   11.30    1.25  157.00 



| What is the maximum population (in millions) for the fourth landmass group
| (Africa)?

1: 5.00
2: 56.00
3: 119.0
4: 157.00
5: 1010.0

Selection: 2


| In this lesson, you learned how to use vapply() as a safer alternative to
| sapply(), which is most helpful when writing your own functions. You also
| learned how to use tapply() to split your data into groups based on the
| value of some variable, then apply a function to each group. These functions
| will come in handy on your quest to become a better data analyst.