Simulation & Profiling
The str function
str: Compactly display the internal structure of an R object
- A diagnostic function and an alternative to ‘summary’
- It is specially well suited to compactly displayed the (abbreviated) contents of (possibly nested) lists
- Roughly one line per basic object
> str(str)
function (object, ...)
> str(lm)
function (formula, data, subset, weights, na.action, method = "qr", model = TRUE,
x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE, contrasts = NULL,
offset, ...)
> str(ls)
function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
pattern, sorted = TRUE)
> x <- rnorm(100, 2, 4)
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.1496 -0.6004 1.7820 2.1097 4.8787 11.0769
> str(x)
num [1:100] 6.6988 -0.0482 -3.1542 -0.1203 3.6359 ...
> f <- gl(40, 10)
> str(f)
Factor w/ 40 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
> summary(f)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
28 29 30 31 32 33 34 35 36 37 38 39 40
10 10 10 10 10 10 10 10 10 10 10 10 10
> library(datasets)
> head(airquality)
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
> str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
> m <- matrix(rnorm(100), 10, 10)
> str(m)
num [1:10, 1:10] 0.528 -0.295 0.619 0.62 0.123 ...
> m[, 1]
[1] 0.52753155 -0.29487283 0.61877207 0.61966875 0.12279945 0.06094321
[7] -0.05693048 0.25212265 0.03397175 1.45011753
> str(s)
Error in str(s) : object 's' not found
> s <- split(airquality, airquality$Month)
> str(s)
List of 5
$ 5:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 41 36 12 18 NA 28 23 19 8 NA ...
..$ Solar.R: int [1:31] 190 118 149 313 NA NA 299 99 19 194 ...
..$ Wind : num [1:31] 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
..$ Temp : int [1:31] 67 72 74 62 56 66 65 59 61 69 ...
..$ Month : int [1:31] 5 5 5 5 5 5 5 5 5 5 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 6:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] NA NA NA NA NA NA 29 NA 71 39 ...
..$ Solar.R: int [1:30] 286 287 242 186 220 264 127 273 291 323 ...
..$ Wind : num [1:30] 8.6 9.7 16.1 9.2 8.6 14.3 9.7 6.9 13.8 11.5 ...
..$ Temp : int [1:30] 78 74 67 84 85 79 82 87 90 87 ...
..$ Month : int [1:30] 6 6 6 6 6 6 6 6 6 6 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
$ 7:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 135 49 32 NA 64 40 77 97 97 85 ...
..$ Solar.R: int [1:31] 269 248 236 101 175 314 276 267 272 175 ...
..$ Wind : num [1:31] 4.1 9.2 9.2 10.9 4.6 10.9 5.1 6.3 5.7 7.4 ...
..$ Temp : int [1:31] 84 85 81 84 83 83 88 92 92 89 ...
..$ Month : int [1:31] 7 7 7 7 7 7 7 7 7 7 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 8:'data.frame': 31 obs. of 6 variables:
..$ Ozone : int [1:31] 39 9 16 78 35 66 122 89 110 NA ...
..$ Solar.R: int [1:31] 83 24 77 NA NA NA 255 229 207 222 ...
..$ Wind : num [1:31] 6.9 13.8 7.4 6.9 7.4 4.6 4 10.3 8 8.6 ...
..$ Temp : int [1:31] 81 81 82 86 85 87 89 90 90 92 ...
..$ Month : int [1:31] 8 8 8 8 8 8 8 8 8 8 ...
..$ Day : int [1:31] 1 2 3 4 5 6 7 8 9 10 ...
$ 9:'data.frame': 30 obs. of 6 variables:
..$ Ozone : int [1:30] 96 78 73 91 47 32 20 23 21 24 ...
..$ Solar.R: int [1:30] 167 197 183 189 95 92 252 220 230 259 ...
..$ Wind : num [1:30] 6.9 5.1 2.8 4.6 7.4 15.5 10.9 10.3 10.9 9.7 ...
..$ Temp : int [1:30] 91 92 93 93 87 84 80 78 75 73 ...
..$ Month : int [1:30] 9 9 9 9 9 9 9 9 9 9 ...
..$ Day : int [1:30] 1 2 3 4 5 6 7 8 9 10 ...
Simulation - Generating Random Numbers
Functions for probability distributions in R
- rnorm: generate random Normal variates with a given mean and standard deviation
- dnorm: evaluate the Normal probability density (with a given mean/SD) at a point (or a vector of points)
- pnorm: evaluate the cumulative distribution function for a Normal distribution
- rpois: generate random Poisson variates with a given rate
Probability distribution functions usually have four functions associated with them. The functions are prefixed with a
- d for density
- r for random number generation
- p for cumulative distribution
- q for quantile function (inverse cumulative distribution)
Generating Random Numbers
Working with the Normal distributions requires using these four functions
> dnorm(x, mean = 0, sd = 1, log = FALSE)
[1] 7.187808e-11 3.984792e-01 2.757201e-03 3.960669e-01 5.373512e-04 2.097601e-13
[7] 1.926606e-01 1.226585e-20 3.813674e-01 3.988513e-01 6.738565e-14 2.955863e-02
[13] 7.105746e-04 3.104784e-01 2.098806e-13 1.897479e-06 4.158647e-05 3.932090e-01
[19] 1.065878e-19 9.851913e-02 4.689308e-05 1.682792e-02 2.125712e-01 2.507108e-02
[25] 7.796784e-04 5.576236e-03 1.712846e-01 2.198139e-01 3.685826e-01 2.216065e-02
[31] 4.191433e-06 9.729525e-09 6.733673e-02 8.506511e-02 2.449181e-09 7.810368e-02
[37] 3.704840e-01 3.401082e-02 3.278842e-04 4.380518e-05 6.335342e-21 1.559798e-03
[43] 3.044051e-06 9.036626e-03 6.290214e-05 3.958070e-12 4.629443e-03 3.547483e-07
[49] 2.998722e-12 1.759182e-01 9.098334e-06 7.366268e-08 1.370371e-01 1.231870e-02
[55] 3.074389e-01 3.588206e-02 5.255136e-02 3.513189e-05 7.813230e-04 1.756514e-01
[61] 2.029827e-03 3.038831e-01 3.802915e-17 2.891341e-01 9.065887e-28 3.635220e-07
[67] 3.966053e-01 4.900804e-03 7.516807e-03 1.017334e-01 4.199301e-07 1.908631e-01
[73] 8.475977e-03 3.336663e-01 6.038725e-03 3.700839e-01 4.296542e-02 3.459423e-01
[79] 3.764495e-01 1.136650e-04 1.580771e-03 5.349155e-02 3.919150e-01 6.771772e-04
[85] 1.351935e-02 1.653212e-02 3.289164e-03 2.357015e-12 3.308475e-07 8.883980e-07
[91] 1.278336e-20 3.126601e-09 3.165500e-04 1.196407e-10 8.160051e-06 1.750200e-06
[97] 3.401751e-01 2.640385e-01 3.815204e-01 6.035775e-16
> pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
> rnorm(n, mean = 0, sd = 1)
If Φ is the cumulative distribution function for a standard Normal distribution, then pnorm(q) = φ(q) and qnorm§ = Φ^-1§
> x <- rnorm(10)
> x
[1] 0.60641305 -1.37254000 0.48166479 0.19117564 1.24776799 -1.76404648
[7] -0.88679076 -0.01719335 -0.07781784 -0.76860577
> x <- rnorm(10, 20,2)
> x
[1] 18.76953 22.76850 18.79425 19.99632 19.18692 20.95383 20.71457 21.43382 15.01286
[10] 20.40358
> summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.01 18.89 20.20 19.80 20.89 22.77
Setting the random number seed with set.seed ensures reproducibility
set.seed(): It can be used to specify which random number generating algorithm should use, ensuring consistency and reproducibility; It ensures that the sequence of random numbers starts in a specific place and is therefore reproducible.
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
> rnorm(5)
[1] -0.8204684 0.4874291 0.7383247 0.5757814 -0.3053884
> set.seed(1)
> rnorm(5)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Always set the random number seed when conducting a simulation!
Generating Poisson data
> rpois(10, 1)
[1] 0 0 1 1 2 1 1 4 1 2
> rpois(10, 2)
[1] 4 1 2 0 1 1 0 1 4 1
> rpois(10, 20)
[1] 19 19 24 23 22 24 23 20 11 22
> ppois(2, 2) ## Cumulative distribution
[1] 0.6766764 ## Pr(x <= 2)
> ppois(4, 2)
[1] 0.947347 ## Pr(x <= 4)
> ppois(6, 2)
[1] 0.9954662 ## Pr(x <= 6)
Simulation - Simulating a Linear Model
Generating Random Numbers From a Linear Model
Suppose we want to simulate from the following linear model
> set.seed(20)
> x <- rnorm(100)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + 2
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.279 1.299 2.450 2.510 4.037 6.917
> plot(x, y)
What if x is binary?
> set.seed(10)
> x <- rbinom(100, 1, 0.5)
> e <- rnorm(100, 0, 2)
> y <- 0.5 + 2 * x + e
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.4936 -0.1409 1.5767 1.4322 2.8397 6.9410
> plot(x, y)
Generating Random Numbers From a Generalized Linear Model
Suppose we want to simulate from a Poisson model where
We need to use the ropes function for this
> set.seed(1) # 产生随机数的初始值, 括号里的数字为编号
> x <- rnorm(100)
> log.mu <- 0.5 + 0.3 * x
> y <- rpois(100, exp(log.mu))
> summary(y)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 1.00 1.00 1.55 2.00 6.00
> plot(x, y)
Simulation - Random Sampling
The sample function draws randomly from a specified set of (scalar) objects allowing you to sample from arbitrary distributions
> set.seed(1)
> sample(1:10, 4)
[1] 9 4 7 1
> sample(1:10, 4)
[1] 2 7 3 6
> sample(letters, 5)
[1] "r" "s" "a" "u" "w"
> sample(1:10) ## permutation
[1] 10 6 9 2 1 5 8 4 3 7
> sample(1:10)
[1] 5 10 2 8 6 1 4 3 9 7
> sample(1:10, replace = TRUE) ## Sample w/replacement
[1] 3 6 10 10 6 4 4 10 9 7
- Drawing samples from specific probability distributions can be done with r* functions
- Standard distributions are built in: Normal, Poisson, Binomial, Exponential, Gamma, etc
- The sample function can be used to draw random samples from arbitrary vectors
- Setting the random number generator seed via set.seed is critical for reproducibility
R profiler
Why is My Code So Slow?
- Profiling is a systematic way to examine how much time is spent in different parts of a program
- Useful when trying to optimize your code
- Often code runs fine one, but what if you have to put it in a loop for 1,000 iterations? Is it still fast enough?
- Profiling is better than guessing
On Optimizing Your Code
- Getting biggest impact on speeding up code depends on knowing where the code spends most of its time
- This cannot be done without performance analysis or profiling
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. --Donald Knuth
General Principles of Optimization
- Design first, then optimize
- Remember: Premature optimization is the root of all evil
- Measure (collect data), don’t guess
- If you’re going to be scientist, you need to apply the same principles here!
Using system.time()
- Takes an arbitrary R expression as input (can be wrapped in curly braces) and returns the amount of time taken to evaluate the expression
- Computes the time (in seconds) needed to execute an expression
- If there’s an error, gives time until the error occured
- Returns an object of class proc_time
- user time: time charger to the CPU(s) for this expression
- elapsed time: “wall clock” time
- Usually, the user time and elapsed time are relatively close, for straight computing tasks
- Elapsed time may be greater than user time if the CPU spends a lot of time waiting around
- Elapsed time may be smaller than the user time if your machine has multiple cores/processors (and is capable of using them)
- Multi-threader BLAS libraries (vecLib/Accelerate, ATLAS, ACML, MKL)
- Parallel processing via the parallel package
The R Profiler
- The Rprof() function starts the profiler in R
- R must be compiled with profiler support (but this is usually the case)
- The summaryRprof() function summarizes the output from Rprof() (otherwise it’s not readable)
- DO NOT use system.time() and Rprof() together or you will be sad
- Rprof() keeps track of the function call stack at regularly sampled intervals and tabulates how much time is spent in each function
- Default sampling interval is 0.02 seconds
- NOTE: If your code runs very quickly, the profiler is not useful, but then you probably don’t need it in that case
Using summaryRprof()
- The summaryRprof() function tabulates the R profiler output and calculates how much time is spent in which function
- There are two methods for normalizing the data
- “by.total” divides the time spend in each function by the total run time
- “by.sell” does the same but first subtracts out time spent in functions above in the call stack