R Programming - Simulation & Profiling_r programming simulation & profiling-CSDN博客

本文链接：https://blog.csdn.net/jeonghin/article/details/124897058

Simulation & Profiling

The str function
Simulation - Generating Random Numbers
Generating Random Numbers
Simulation - Simulating a Linear Model
- Generating Random Numbers From a Linear Model
- Generating Random Numbers From a Generalized Linear Model
Simulation - Random Sampling
R profiler

The str function

str: Compactly display the internal structure of an R object

A diagnostic function and an alternative to ‘summary’
It is specially well suited to compactly displayed the (abbreviated) contents of (possibly nested) lists
Roughly one line per basic object


> str(str)
function (object, ...)  
> str(lm)
function (formula, data, subset, weights, na.action, method = "qr", model = TRUE, 
    x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL, 
    offset, ...)  
> str(ls)
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE, 
    pattern, sorted = TRUE)  
> x <- rnorm(100, 2, 4)
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.1496 -0.6004  1.7820  2.1097  4.8787 11.0769 
> str(x)
 num [1:100] 6.6988 -0.0482 -3.1542 -0.1203 3.6359 ...
> f <- gl(40, 10)
> str(f)
 Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(f)
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 
28 29 30 31 32 33 34 35 36 37 38 39 40 
10 10 10 10 10 10 10 10 10 10 10 10 10 
> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
> str(airquality)
'data.frame':	153 obs. of  6 variables:
 $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
 $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
 $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
 $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
 $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
 $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
> m <- matrix(rnorm(100), 10, 10)
> str(m)
 num [1:10, 1:10] 0.528 -0.295 0.619 0.62 0.123 ...
> m[, 1]
 [1]  0.52753155 -0.29487283  0.61877207  0.61966875  0.12279945  0.06094321
 [7] -0.05693048  0.25212265  0.03397175  1.45011753
> str(s)
Error in str(s) : object 's' not found
> s <- split(airquality, airquality$Month)
> str(s)
List of 5
 $ 5:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
  ..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
  ..$ Wind   : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
  ..$ Temp   : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
  ..$ Month  : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 6:'data.frame':	30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
  ..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
  ..$ Wind   : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
  ..$ Temp   : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
  ..$ Month  : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
 $ 7:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
  ..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
  ..$ Wind   : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
  ..$ Temp   : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
  ..$ Month  : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 8:'data.frame':	31 obs. of  6 variables:
  ..$ Ozone  : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
  ..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
  ..$ Wind   : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
  ..$ Temp   : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
  ..$ Month  : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
  ..$ Day    : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
 $ 9:'data.frame':	30 obs. of  6 variables:
  ..$ Ozone  : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
  ..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
  ..$ Wind   : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
  ..$ Temp   : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
  ..$ Month  : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
  ..$ Day    : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...

Simulation - Generating Random Numbers

Functions for probability distributions in R

rnorm: generate random Normal variates with a given mean and standard deviation
dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or a vector of points)
pnorm: evaluate the cumulative distribution function for a Normal distribution
rpois: generate random Poisson variates with a given rate

Probability distribution functions usually have four functions associated with them. The functions are prefixed with a

d for density
r for random number generation
p for cumulative distribution
q for quantile function (inverse cumulative distribution)

Generating Random Numbers

Working with the Normal distributions requires using these four functions

> dnorm(x, mean = 0, sd = 1, log = FALSE)
  [1] 7.187808e-11 3.984792e-01 2.757201e-03 3.960669e-01 5.373512e-04 2.097601e-13
  [7] 1.926606e-01 1.226585e-20 3.813674e-01 3.988513e-01 6.738565e-14 2.955863e-02
 [13] 7.105746e-04 3.104784e-01 2.098806e-13 1.897479e-06 4.158647e-05 3.932090e-01
 [19] 1.065878e-19 9.851913e-02 4.689308e-05 1.682792e-02 2.125712e-01 2.507108e-02
 [25] 7.796784e-04 5.576236e-03 1.712846e-01 2.198139e-01 3.685826e-01 2.216065e-02
 [31] 4.191433e-06 9.729525e-09 6.733673e-02 8.506511e-02 2.449181e-09 7.810368e-02
 [37] 3.704840e-01 3.401082e-02 3.278842e-04 4.380518e-05 6.335342e-21 1.559798e-03
 [43] 3.044051e-06 9.036626e-03 6.290214e-05 3.958070e-12 4.629443e-03 3.547483e-07
 [49] 2.998722e-12 1.759182e-01 9.098334e-06 7.366268e-08 1.370371e-01 1.231870e-02
 [55] 3.074389e-01 3.588206e-02 5.255136e-02 3.513189e-05 7.813230e-04 1.756514e-01
 [61] 2.029827e-03 3.038831e-01 3.802915e-17 2.891341e-01 9.065887e-28 3.635220e-07
 [67] 3.966053e-01 4.900804e-03 7.516807e-03 1.017334e-01 4.199301e-07 1.908631e-01
 [73] 8.475977e-03 3.336663e-01 6.038725e-03 3.700839e-01 4.296542e-02 3.459423e-01
 [79] 3.764495e-01 1.136650e-04 1.580771e-03 5.349155e-02 3.919150e-01 6.771772e-04
 [85] 1.351935e-02 1.653212e-02 3.289164e-03 2.357015e-12 3.308475e-07 8.883980e-07
 [91] 1.278336e-20 3.126601e-09 3.165500e-04 1.196407e-10 8.160051e-06 1.750200e-06
 [97] 3.401751e-01 2.640385e-01 3.815204e-01 6.035775e-16
> pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> rnorm(n, mean = 0, sd = 1)

If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = φ(q) and qnorm§ = Φ^-1§

> x <- rnorm(10)
> x
 [1]  0.60641305 -1.37254000  0.48166479  0.19117564  1.24776799 -1.76404648
 [7] -0.88679076 -0.01719335 -0.07781784 -0.76860577
> x <- rnorm(10, 20,2)
> x
 [1] 18.76953 22.76850 18.79425 19.99632 19.18692 20.95383 20.71457 21.43382 15.01286
[10] 20.40358
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  15.01   18.89   20.20   19.80   20.89   22.77

Setting the random number seed with set.seed ensures reproducibility

set.seed(): It can be used to specify which random number generating algorithm should use, ensuring consistency and reproducibility; It ensures that the sequence of random numbers starts in a specific place and is therefore reproducible.

> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078
> rnorm(5)
[1] -0.8204684  0.4874291  0.7383247  0.5757814 -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078

Always set the random number seed when conducting a simulation!

Generating Poisson data

> rpois(10, 1)
 [1] 0 0 1 1 2 1 1 4 1 2
> rpois(10, 2)
 [1] 4 1 2 0 1 1 0 1 4 1
> rpois(10, 20)
 [1] 19 19 24 23 22 24 23 20 11 22
> ppois(2, 2) ## Cumulative distribution
[1] 0.6766764 ## Pr(x <= 2)
> ppois(4, 2)
[1] 0.947347 ## Pr(x <= 4)
> ppois(6, 2)
[1] 0.9954662 ## Pr(x <= 6)

Simulation - Simulating a Linear Model

Generating Random Numbers From a Linear Model

Suppose we want to simulate from the following linear model
在这里插入图片描述

> set.seed(20)
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + 2
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -3.279   1.299   2.450   2.510   4.037   6.917 
> plot(x, y)

在这里插入图片描述

What if x is binary?

> set.seed(10)
> x <- rbinom(100, 1, 0.5)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.4936 -0.1409  1.5767  1.4322  2.8397  6.9410 
> plot(x, y)

在这里插入图片描述

Generating Random Numbers From a Generalized Linear Model

Suppose we want to simulate from a Poisson model where
在这里插入图片描述
We need to use the ropes function for this

> set.seed(1) # 产生随机数的初始值， 括号里的数字为编号
> x <- rnorm(100)
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    1.00    1.55    2.00    6.00 
> plot(x, y)

在这里插入图片描述

Simulation - Random Sampling

The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions


> set.seed(1)
> sample(1:10, 4)
[1] 9 4 7 1
> sample(1:10, 4)
[1] 2 7 3 6
> sample(letters, 5)
[1] "r" "s" "a" "u" "w"
> sample(1:10) ## permutation
 [1] 10  6  9  2  1  5  8  4  3  7
> sample(1:10)
 [1]  5 10  2  8  6  1  4  3  9  7
> sample(1:10, replace = TRUE) ## Sample w/replacement
 [1]  3  6 10 10  6  4  4 10  9  7

Drawing samples from specific probability distributions can be done with r* functions
Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc
The sample function can be used to draw random samples from arbitrary vectors
Setting the random number generator seed via set.seed is critical for reproducibility

R profiler

Why is My Code So Slow?

Profiling is a systematic way to examine how much time is spent in different parts of a program
Useful when trying to optimize your code
Often code runs fine one, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?
Profiling is better than guessing

On Optimizing Your Code

Getting biggest impact on speeding up code depends on knowing where the code spends most of its time
This cannot be done without performance analysis or profiling

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. --Donald Knuth

General Principles of Optimization

Design first, then optimize
Remember: Premature optimization is the root of all evil
Measure (collect data), don’t guess
If you’re going to be scientist, you need to apply the same principles here!

Using system.time()

Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression
Computes the time (in seconds) needed to execute an expression
- If there’s an error, gives time until the error occured
Returns an object of class proc_time
- user time: time charger to the CPU(s) for this expression
- elapsed time: “wall clock” time
Usually, the user time and elapsed time are relatively close, for straight computing tasks
Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
Elapsed time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)
- Multi-threader BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
- Parallel processing via the parallel package

The R Profiler

The Rprof() function starts the profiler in R
- R must be compiled with profiler support (but this is usually the case)
The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)
DO NOT use system.time() and Rprof() together or you will be sad
Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function
Default sampling interval is 0.02 seconds
NOTE: If your code runs very quickly, the profiler is not useful, but then you probably don’t need it in that case

Using summaryRprof()

The summaryRprof() function tabulates the R profiler output and calculates how much time is spent in which function
There are two methods for normalizing the data
“by.total” divides the time spend in each function by the total run time
“by.sell” does the same but first subtracts out time spent in functions above in the call stack