R Programming - Simulation & Profiling

The str function

str: Compactly display the internal structure of an R object

  • A diagnostic function and an alternative to ‘summary’
  • It is specially well suited to compactly displayed the (abbreviated) contents of (possibly nested) lists
  • Roughly one line per basic object

> str(str)
function (object, ...)  
> str(lm)
function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, 
    x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, 
    offset, ...)  
> str(ls)
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
    pattern, sorted = TRUE)  
> x <- rnorm(100, 2, 4)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.1496 -0.6004  1.7820  2.1097  4.8787 11.0769 
> str(x)
 num [1:100] 6.6988 -0.0482 -3.1542 -0.1203 3.6359 ...
> f <- gl(40, 10)
> str(f)
 Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(f)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 
28 29 30 31 32 33 34 35 36 37 38 39 40 
10 10 10 10 10 10 10 10 10 10 10 10 10 
> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
> m <- matrix(rnorm(100), 10, 10)
> str(m)
 num [1:10, 1:10] 0.528 -0.295 0.619 0.62 0.123 ...
> m[, 1]
 [1]  0.52753155 -0.29487283  0.61877207  0.61966875  0.12279945  0.06094321
 [7] -0.05693048  0.25212265  0.03397175  1.45011753
> str(s)
Error in str(s) : object 's' not found
> s <- split(airquality, airquality$Month)
> str(s)
List of 5
 $ 5:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 6:'data.frame':	30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
  ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
  ..$ Wind   : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
  ..$ Temp   : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
  ..$ Month  : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
 $ 7:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
  ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
  ..$ Wind   : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
  ..$ Temp   : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
  ..$ Month  : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 8:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
  ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
  ..$ Wind   : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
  ..$ Temp   : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
  ..$ Month  : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 9:'data.frame':	30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
  ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
  ..$ Wind   : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
  ..$ Temp   : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
  ..$ Month  : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...

Simulation - Generating Random Numbers

Functions for probability distributions in R

  • rnorm: generate random Normal variates with a given mean and standard deviation
  • dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or a vector of points)
  • pnorm: evaluate the cumulative distribution function for a Normal distribution
  • rpois: generate random Poisson variates with a given rate

Probability distribution functions usually have four functions associated with them. The functions are prefixed with a

  • d for density
  • r for random number generation
  • p for cumulative distribution
  • q for quantile function (inverse cumulative distribution)

Generating Random Numbers

Working with the Normal distributions requires using these four functions

> dnorm(x, mean = 0, sd = 1, log = FALSE)
  [1] 7.187808e-11 3.984792e-01 2.757201e-03 3.960669e-01 5.373512e-04 2.097601e-13
  [7] 1.926606e-01 1.226585e-20 3.813674e-01 3.988513e-01 6.738565e-14 2.955863e-02
 [13] 7.105746e-04 3.104784e-01 2.098806e-13 1.897479e-06 4.158647e-05 3.932090e-01
 [19] 1.065878e-19 9.851913e-02 4.689308e-05 1.682792e-02 2.125712e-01 2.507108e-02
 [25] 7.796784e-04 5.576236e-03 1.712846e-01 2.198139e-01 3.685826e-01 2.216065e-02
 [31] 4.191433e-06 9.729525e-09 6.733673e-02 8.506511e-02 2.449181e-09 7.810368e-02
 [37] 3.704840e-01 3.401082e-02 3.278842e-04 4.380518e-05 6.335342e-21 1.559798e-03
 [43] 3.044051e-06 9.036626e-03 6.290214e-05 3.958070e-12 4.629443e-03 3.547483e-07
 [49] 2.998722e-12 1.759182e-01 9.098334e-06 7.366268e-08 1.370371e-01 1.231870e-02
 [55] 3.074389e-01 3.588206e-02 5.255136e-02 3.513189e-05 7.813230e-04 1.756514e-01
 [61] 2.029827e-03 3.038831e-01 3.802915e-17 2.891341e-01 9.065887e-28 3.635220e-07
 [67] 3.966053e-01 4.900804e-03 7.516807e-03 1.017334e-01 4.199301e-07 1.908631e-01
 [73] 8.475977e-03 3.336663e-01 6.038725e-03 3.700839e-01 4.296542e-02 3.459423e-01
 [79] 3.764495e-01 1.136650e-04 1.580771e-03 5.349155e-02 3.919150e-01 6.771772e-04
 [85] 1.351935e-02 1.653212e-02 3.289164e-03 2.357015e-12 3.308475e-07 8.883980e-07
 [91] 1.278336e-20 3.126601e-09 3.165500e-04 1.196407e-10 8.160051e-06 1.750200e-06
 [97] 3.401751e-01 2.640385e-01 3.815204e-01 6.035775e-16
> pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> rnorm(n, mean = 0, sd = 1)

If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = φ(q) and qnorm§ = Φ^-1§

> x <- rnorm(10)
> x
 [1]  0.60641305 -1.37254000  0.48166479  0.19117564  1.24776799 -1.76404648
 [7] -0.88679076 -0.01719335 -0.07781784 -0.76860577
> x <- rnorm(10, 20,2)
> x
 [1] 18.76953 22.76850 18.79425 19.99632 19.18692 20.95383 20.71457 21.43382 15.01286
[10] 20.40358
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.01   18.89   20.20   19.80   20.89   22.77 

Setting the random number seed with set.seed ensures reproducibility

set.seed(): It can be used to specify which random number generating algorithm should use, ensuring consistency and reproducibility; It ensures that the sequence of random numbers starts in a specific place and is therefore reproducible.

> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
> rnorm(5)
[1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Always set the random number seed when conducting a simulation!

Generating Poisson data

> rpois(10, 1)
 [1] 0 0 1 1 2 1 1 4 1 2
> rpois(10, 2)
 [1] 4 1 2 0 1 1 0 1 4 1
> rpois(10, 20)
 [1] 19 19 24 23 22 24 23 20 11 22
> ppois(2, 2) ## Cumulative distribution
[1] 0.6766764 ## Pr(x <= 2)
> ppois(4, 2)
[1] 0.947347 ## Pr(x <= 4)
> ppois(6, 2)
[1] 0.9954662 ## Pr(x <= 6)

Simulation - Simulating a Linear Model

Generating Random Numbers From a Linear Model

Suppose we want to simulate from the following linear model
在这里插入图片描述

> set.seed(20)
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + 2
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.279   1.299   2.450   2.510   4.037   6.917 
> plot(x, y)

在这里插入图片描述

What if x is binary?

> set.seed(10)
> x <- rbinom(100, 1, 0.5)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4936 -0.1409  1.5767  1.4322  2.8397  6.9410 
> plot(x, y)

在这里插入图片描述

Generating Random Numbers From a Generalized Linear Model

Suppose we want to simulate from a Poisson model where
在这里插入图片描述
We need to use the ropes function for this

> set.seed(1) # 产生随机数的初始值, 括号里的数字为编号
> x <- rnorm(100)
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    1.00    1.55    2.00    6.00 
> plot(x, y)

在这里插入图片描述

Simulation - Random Sampling

The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions


> set.seed(1)
> sample(1:10, 4)
[1] 9 4 7 1
> sample(1:10, 4)
[1] 2 7 3 6
> sample(letters, 5)
[1] "r" "s" "a" "u" "w"
> sample(1:10) ## permutation
 [1] 10  6  9  2  1  5  8  4  3  7
> sample(1:10)
 [1]  5 10  2  8  6  1  4  3  9  7
> sample(1:10, replace = TRUE) ## Sample w/replacement
 [1]  3  6 10 10  6  4  4 10  9  7
  • Drawing samples from specific probability distributions can be done with r* functions
  • Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc
  • The sample function can be used to draw random samples from arbitrary vectors
  • Setting the random number generator seed via set.seed is critical for reproducibility

R profiler

Why is My Code So Slow?

  • Profiling is a systematic way to examine how much time is spent in different parts of a program
  • Useful when trying to optimize your code
  • Often code runs fine one, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?
  • Profiling is better than guessing

On Optimizing Your Code

  • Getting biggest impact on speeding up code depends on knowing where the code spends most of its time
  • This cannot be done without performance analysis or profiling

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. --Donald Knuth

General Principles of Optimization

  • Design first, then optimize
  • Remember: Premature optimization is the root of all evil
  • Measure (collect data), don’t guess
  • If you’re going to be scientist, you need to apply the same principles here!

Using system.time()

  • Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression
  • Computes the time (in seconds) needed to execute an expression
    • If there’s an error, gives time until the error occured
  • Returns an object of class proc_time
    • user time: time charger to the CPU(s) for this expression
    • elapsed time: “wall clock” time
  • Usually, the user time and elapsed time are relatively close, for straight computing tasks
  • Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
  • Elapsed time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)
    • Multi-threader BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
    • Parallel processing via the parallel package

The R Profiler

  • The Rprof() function starts the profiler in R
    • R must be compiled with profiler support (but this is usually the case)
  • The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)
  • DO NOT use system.time() and Rprof() together or you will be sad
  • Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function
  • Default sampling interval is 0.02 seconds
  • NOTE: If your code runs very quickly, the profiler is not useful, but then you probably don’t need it in that case

Using summaryRprof()

  • The summaryRprof() function tabulates the R profiler output and calculates how much time is spent in which function
  • There are two methods for normalizing the data
  • “by.total” divides the time spend in each function by the total run time
  • “by.sell” does the same but first subtracts out time spent in functions above in the call stack
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值