R Language Study Notes: The swirl Package

Link to this post: https://blog.csdn.net/weixin_43262379/article/details/86989747

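Here are my study notes on the swirl package. I needed R for my experiments, so I had no choice but to learn it. On edX I found Harvard's PH525x course series. Its introductory chapter lists many high-quality learning resources, as MOOCs from well-known universities tend to do (it pays to see what the experts recommend). Early on, the course strongly recommends learning R first and points to these three websites: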

swirl: https://swirlstats.com/

Quick-R: https://www.statmethods.net/

Pluralsight: https://www.pluralsight.com/

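I had heard of the last one long ago; it offers a free 10-day trial. The first one, swirl, is the top recommendation, and I liked the website as soon as I opened it, so I started with swirl.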

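swirl is step-by-step, fully open source, and runs inside RStudio. This is how I like to learn: get hands-on first, whether or not everything has sunk in yet.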

Below is the introductory course, R Programming:

1: Basic Building Blocks 2: Workspace and Files 3: Sequences of Numbers 4: Vectors 5: Missing Values
6: Subsetting Vectors 7: Matrices and Data Frames 8: Logic 9: Functions 10: lapply and sapply
11: vapply and tapply 12: Looking at Data 13: Simulation 14: Dates and Times 15: Base Graphics

1: Basic Building Blocks
When given two vectors of the same length, R simply performs the specified arithmetic operation (+, -, *, etc.) element-by-element. If the vectors are of different lengths, R 'recycles' the shorter vector until it is the same length as the longer vector.
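The recycling rule is easiest to see on small vectors. The following is a minimal sketch; the vector names and values are made up for illustration.

```r
# Element-by-element arithmetic on two vectors of the same length
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
same_len <- a + b            # 11 22 33 44

# Recycling: the shorter vector c(0, 100) is reused as 0 100 0 100
recycled <- a + c(0, 100)    # 1 102 3 104

# A single number is a vector of length 1, so it is recycled too
scaled <- a * 2              # 2 4 6 8

print(same_len)
print(recycled)
print(scaled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.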

2: Workspace and Files
Working directory: getwd(). Objects in the workspace: ls().
List all the files in your working directory: list.files() or dir().
See the arguments to list.files(): args(list.files)

old.dir <- getwd()

dir.create("testdir")
setwd("testdir")

file.create("mytest.R")
list.files()

Check whether "mytest.R" exists in the working directory: file.exists("mytest.R")

You can use the $ operator to grab specific items, e.g. file.info("mytest.R")$mode.

 file.rename("mytest.R","mytest2.R")
 file.copy("mytest2.R","mytest3.R")

Provide the relative path to the file "mytest3.R" by using file.path(): file.path("mytest3.R")
file.path() also builds a platform-independent pathname from several parts:

file.path('folder1', 'folder2')
[1] "folder1/folder2"

In order to create nested directories, ‘recursive’ must be set to TRUE.

dir.create(file.path('testdir2','testdir3'), recursive = TRUE)

setwd(old.dir)
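The file commands in this lesson can be run end to end as a small script. The sketch below follows the same steps but works inside a temporary directory; the folder name swirl_files_demo is an assumption made only for this example, so that nothing in a real project is touched.

```r
# Remember where we started, then move into a throwaway sandbox directory
old.dir <- getwd()
sandbox <- file.path(tempdir(), "swirl_files_demo")   # hypothetical folder name
dir.create(sandbox, recursive = TRUE, showWarnings = FALSE)
setwd(sandbox)

# Create, rename and copy a file
file.create("mytest.R")
exists_before <- file.exists("mytest.R")
file.rename("mytest.R", "mytest2.R")
file.copy("mytest2.R", "mytest3.R")
files_now <- list.files()

# Nested directories need recursive = TRUE
dir.create(file.path("testdir2", "testdir3"), recursive = TRUE)
nested_ok <- dir.exists(file.path("testdir2", "testdir3"))

# Go back to the original directory and clean up the sandbox
setwd(old.dir)
unlink(sandbox, recursive = TRUE)

print(files_now)
```

Saving the old directory first and restoring it at the end mirrors the old.dir pattern in the notes.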

3: Sequences of Numbers
To get help on an operator such as the colon, enclose the symbol in backticks: ?`:`
If we don't care what the increment is and just want a sequence of 30 numbers between 5 and 10, seq(5, 10, length = 30) does the trick ('length' is matched to the full argument name length.out).

seq_along(my_seq): 1 to length(my_seq), in steps of 1.
Replicate numbers, vectors, ...: rep(c(0, 1, 2), times = 10) gives 0 1 2 0 1 2 ...
A vector of 10 zeros, then 10 ones, then 10 twos: rep(c(0, 1, 2), each = 10)
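The sequence helpers above can be compared side by side. This is a small sketch with shortened repetition counts so the output stays readable; the variable names are illustrative.

```r
s1 <- 1:5                            # colon operator: 1 2 3 4 5
s2 <- seq(0, 10, by = 0.5)           # fixed increment of 0.5
s3 <- seq(5, 10, length.out = 30)    # 30 evenly spaced numbers from 5 to 10
idx <- seq_along(s3)                 # 1, 2, ..., length(s3)

r_times <- rep(c(0, 1, 2), times = 2)   # 0 1 2 0 1 2
r_each  <- rep(c(0, 1, 2), each = 2)    # 0 0 1 1 2 2

print(length(s3))
print(r_times)
print(r_each)
```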

4: Vectors
atomic vectors and lists
An atomic vector contains exactly one data type, whereas a list may contain multiple data types.

num_vect <- c(0.5, 55, -10, 6)

num_vect >= 6
[1] FALSE TRUE FALSE TRUE

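The < and >= symbols in these examples are called 'logical operators'. Other logical operators include >, <=, == for exact equality, and != for inequality.

A | B means 'A or B', A & B means 'A and B', and !A means 'not A'.
%% gives the remainder of a division.
sum() adds up the elements of a vector.
length() gives the number of elements.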

my_char
[1] "My" "name" "is"
paste(my_char, collapse = " ")    # collapses one vector into a single string
[1] "My name is"
vs.
my_name <- c(my_char, "Jinlu")    # appends an element; the result is still a vector of separate strings
my_name
[1] "My" "name" "is" "Jinlu"
vs.
paste("Hello", "world!", sep = " ")    # joins two strings
[1] "Hello world!"
vs.
paste(1:3, c("X", "Y", "Z"), sep = "")    # joins two vectors element by element
[1] "1X" "2Y" "3Z"
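The difference between sep and collapse in paste() is easy to mix up, so here is a runnable sketch that puts the cases together. The last line, which combines both arguments, goes slightly beyond the notes and is included only as an illustration.

```r
my_char <- c("My", "name", "is")

# collapse glues the elements of ONE vector into a single string
one_string <- paste(my_char, collapse = " ")        # "My name is"

# c() only appends; the result is still a vector of separate strings
my_name <- c(my_char, "Jinlu")

# sep is placed between the arguments, element by element
pairs <- paste(1:3, c("X", "Y", "Z"), sep = "")     # "1X" "2Y" "3Z"

# Both together: join element-wise first, then collapse the result
both <- paste(1:3, c("X", "Y", "Z"), sep = "", collapse = "-")   # "1X-2Y-3Z"

print(one_string)
print(length(my_name))
print(pairs)
print(both)
```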

5: Missing Values
NaN: not a number (Inf - Inf, 0/0, etc.)
rnorm(5, 100, 1): the arguments are the number of values, the mean, and the standard deviation.
y <- rnorm(1000): draws 1000 random values from a standard normal distribution.
z <- rep(NA, 1000): replicates NA to make a vector of 1000 NAs.
my_data <- sample(c(y, z), 100): randomly selects 100 of the 2000 values, in random order.

my_na <- is.na(my_data)
The is.na() function tells us whether each element of a vector is NA. It returns a logical vector such as TRUE FALSE TRUE ...
!is.na(x) can be read as 'is not NA'.
my_data == NA returns a vector made up entirely of NAs, because NA is not really a value but just a placeholder for a quantity that is not available.
Likewise, NA > 0 evaluates to NA.

R represents TRUE as the number 1 and FALSE as the number 0. Therefore, if we take the sum of a bunch of TRUEs and FALSEs, we get the total number of TRUEs: sum(my_na)
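The steps of this lesson can be replayed as one short script. This is a sketch only; the seed value is arbitrary and is there just to make the run reproducible.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only

y <- rnorm(1000)                   # 1000 draws from a standard normal
z <- rep(NA, 1000)                 # 1000 missing values
my_data <- sample(c(y, z), 100)    # 100 values picked at random from the 2000

my_na <- is.na(my_data)            # TRUE where a value is missing
n_missing <- sum(my_na)            # TRUE counts as 1, so this counts the NAs

compared <- my_data == NA          # comparing with NA yields NA everywhere
nan_val <- 0 / 0                   # NaN

print(n_missing)
print(all(is.na(compared)))
print(nan_val)
```

The comparison my_data == NA never tells you where the NAs are, which is why is.na() exists.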

6: Subsetting Vectors
x[1:10]: the 1st through 10th elements (R indexing starts at 1).
(Or use the c() function to specify the element numbers as a numeric vector: x[c(3, 5, 7)])

y <- x[!is.na(x)]: isolates the non-missing values of x and puts them in y.
y[y > 0]: a vector of all the positive elements of y. Elements where the condition is TRUE are returned; the rest are dropped.
x[!is.na(x) & x > 0] gives the same result in one step.

x[0]
numeric(0)    # not useful, but R does not prevent it

x[c(-2, -10)]: gives us all elements of x EXCEPT the 2nd and 10th elements.
Equivalent to: x[-c(2, 10)]

Named elements:
vect <- c(foo = 11, bar = 2, norf = NA)
vect
foo bar norf
 11   2   NA
Get the names of vect: names(vect) returns "foo" "bar" "norf"
Assign names to an unnamed vector: names(vect2) <- c("foo", "bar", "norf")

vect[c("foo", "bar")]
foo bar
 11   2
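The four ways of indexing a vector (logical, positive integer, negative integer, and by name) can be tried on a small made-up vector. The values of x below are hypothetical; vect is the one from the notes.

```r
x <- c(3, NA, -1, 7, NA, 0, 12, -5, 9, 4)   # hypothetical data with some NAs

# Logical indexing
y <- x[!is.na(x)]                 # drop the missing values
pos <- y[y > 0]                   # keep only the positive ones
pos_one_step <- x[!is.na(x) & x > 0]

# Positive and negative integer indexing
first_three <- x[1:3]
dropped <- x[c(-2, -10)]          # everything except the 2nd and 10th elements

# Indexing by name
vect <- c(foo = 11, bar = 2, norf = NA)
vect2 <- c(11, 2, NA)
names(vect2) <- c("foo", "bar", "norf")
by_name <- vect[c("foo", "bar")]

print(pos)
print(dropped)
print(by_name)
```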

7: Matrices and Data Frames

Matrices can only contain a single class of data, while data frames can consist of many different classes of data.
The dim() function allows you to get OR set the dim attribute of an R object (to set it, assign with <-).
(With a data frame, the $ operator gets a specific variable.)
dim(my_vector) <- c(4, 5)
my_vector
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
The values fill one column after another. In matrix(), the byrow argument is a logical: if FALSE (the default) the matrix is filled by columns, otherwise the matrix is filled by rows.
Or build it with matrix(): matrix(1:20, nrow = 4, ncol = 5) gives the same column-filled matrix. Note that a <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE, dimnames = list(c(2, 4, 6, 8), c(1, 3, 5, 7, 9))) fills by rows instead, and also sets the row and column names.

patients <- c("Bill", "Gina", "Kelly", "Sean")
cbind(patients, my_matrix)    # note the order of patients and my_matrix
(patients is bound in as the first column, since cbind() combines column by column. Matrices hold only one type of data, so the numbers in my_matrix are coerced to characters.)
my_data <- data.frame(patients, my_matrix) gives:
  patients X1 X2 X3 X4 X5
1     Bill  1  5  9 13 17
2     Gina  2  6 10 14 18
3    Kelly  3  7 11 15 19
4     Sean  4  8 12 16 20
cnames <- c("patient", "age", "weight", "bp", "rating", "test")
colnames(my_data) <- cnames    # sets the column names
my_data
  patient age weight bp rating test
1    Bill   1      5  9     13   17
2    Gina   2      6 10     14   18
3   Kelly   3      7 11     15   19
4    Sean   4      8 12     16   20
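The contrast between cbind() on a matrix and data.frame() shows up clearly when the steps are run together. The sketch below assumes my_vector starts out as 1:20, which matches the printed matrix in the notes.

```r
my_vector <- 1:20                  # assumed starting vector
dim(my_vector) <- c(4, 5)          # setting dim turns the vector into a 4 x 5 matrix
my_matrix <- my_vector

# matrix() fills by columns by default; byrow = TRUE fills by rows instead
by_col <- matrix(1:20, nrow = 4, ncol = 5)
by_row <- matrix(1:20, nrow = 4, ncol = 5, byrow = TRUE)

patients <- c("Bill", "Gina", "Kelly", "Sean")

# cbind() keeps a matrix, so every value is coerced to character
combined <- cbind(patients, my_matrix)

# data.frame() keeps each column's own type
my_data <- data.frame(patients, my_matrix)
colnames(my_data) <- c("patient", "age", "weight", "bp", "rating", "test")

print(class(combined[1, 2]))
print(class(my_data$age))
print(my_data)
```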

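8: Logic
R has no NOT keyword; negation is written with "!".
A logical operation is built from expressions that evaluate to TRUE or FALSE, rather than from equations.

TRUE & c(TRUE, FALSE, FALSE): TRUE FALSE FALSE (the left-hand TRUE is recycled across the vector)
TRUE && c(TRUE, FALSE, FALSE): TRUE (only the first element of the vector is compared). This describes older versions of R; since R 4.3.0, giving && or || an operand longer than one element is an error.
The || version of OR likewise evaluated only the first member of a vector.

5 > 8 || 6 != 8 && 4 > 3.9
[1] TRUE
&& is evaluated before ||. 6 != 8 is TRUE and 4 > 3.9 is TRUE, so TRUE && TRUE is TRUE; then 5 > 8 is FALSE, and FALSE || TRUE is TRUE.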

9: Functions
isTRUE(): returns TRUE only if its argument is a single TRUE
identical(): returns TRUE if two objects are exactly the same
xor(): exclusive OR; xor(5 == 6, !FALSE) is TRUE
which(): returns the indices of the TRUE elements; which(c(TRUE, FALSE, TRUE)) returns 1 3
any(): returns TRUE if at least one element is TRUE
all(): returns TRUE if every element is TRUE
range(): returns the minimum and maximum of its first argument
If you take the arithmetic mean of a bunch of 0s and 1s, you get the proportion of 1s.
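The logical operators and helper functions listed above can be checked in a few lines. This sketch sticks to forms that run on current versions of R, so it does not apply && or || to vectors longer than one element.

```r
# Vectorised AND: the single TRUE is recycled across the vector
v_and <- TRUE & c(TRUE, FALSE, FALSE)          # TRUE FALSE FALSE

# && binds tighter than ||, so these two lines are equivalent
res <- 5 > 8 || 6 != 8 && 4 > 3.9
res_explicit <- (5 > 8) || ((6 != 8) && (4 > 3.9))

# Helper functions
x_or   <- xor(5 == 6, !FALSE)                  # exactly one side is TRUE
idx    <- which(c(TRUE, FALSE, TRUE))          # 1 3
some   <- any(c(FALSE, FALSE, TRUE))
every  <- all(c(TRUE, TRUE, FALSE))
same   <- identical("twins", "twins")
single <- isTRUE(6 > 4)

# Mean of TRUE/FALSE values = proportion of TRUEs
prop_true <- mean(c(TRUE, FALSE, TRUE, TRUE))  # 0.75

print(v_and)
print(res)
print(idx)
print(prop_true)
```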

When you explicitly designate argument values by name, the ordering of the arguments becomes unimportant. You can try this out by typing: remainder(divisor = 11, num = 5).
(Otherwise, unnamed values are matched to arguments by position: the first value goes to the first argument, and so on.)
You can pass functions as arguments!

Arguments that come after an ellipsis (...) in a function definition can only be supplied by name, so they are normally given default values.

Creating new binary operators:
%[whatever]%, where [whatever] is any valid variable name
"%p%" <- function(x, y) { paste(x, y, sep = " ") }    # x is the left operand, y is the right operand
"I" %p% "love" %p% "R!"
[1] "I love R!"
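The points about argument matching, functions as arguments, the ellipsis, and custom operators can be sketched together. The notes only show the call remainder(divisor = 11, num = 5), so the definition of remainder() below is an assumption, and evaluate() and telegram() are hypothetical helper names chosen for this example.

```r
# Assumed definition: num %% divisor, with divisor defaulting to 2
remainder <- function(num, divisor = 2) {
  num %% divisor
}
by_position <- remainder(5, 11)                  # num = 5, divisor = 11
by_name     <- remainder(divisor = 11, num = 5)  # same call, order swapped
use_default <- remainder(5)                      # 5 %% 2

# A function can be passed as an argument (hypothetical helper)
evaluate <- function(func, dat) {
  func(dat)
}
total <- evaluate(sum, c(1, 2, 3))

# An argument placed after ... must be given by name, so it gets a default
telegram <- function(..., stop_word = "STOP") {
  paste("START", ..., stop_word)
}
msg <- telegram("Good", "morning")

# Custom binary operator
"%p%" <- function(x, y) {
  paste(x, y, sep = " ")
}
sentence <- "I" %p% "love" %p% "R!"

print(by_name)
print(msg)
print(sentence)
```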

10: lapply and sapply
The lapply() function takes a list as input, applies a function to each element of the list, then returns a list of the same length as the original one.
A data frame is really just a list of vectors: each column is one element of the list.

sapply() allows you to automate this process by calling lapply() behind the scenes, but then attempting to simplify (hence the 's' in 'sapply') the result for you. (For example, a list in which every element is a single character string, with no integers or other types mixed in, can be simplified to a character vector.)
In general, if the result is a list where every element is of length one, then sapply() returns a vector. If the result is a list where every element is a vector of the same length (> 1), sapply() returns a matrix. If sapply() can't figure things out, then it just returns a list, no different from what lapply() would give you.
For example, when each column (circles, crosses, ...) yields a vector of length 2, the result is a matrix with two rows:
     circles crosses saltires quarters sunstars
[1,]       0       0        0        0        0
[2,]       4       2        1        4       50

use flag_colors <- flags[, 11:17] to extract the columns containing the color data and store them in a new data frame called flag_colors. (Note the comma before 11:17. This subsetting command tells R that we want all rows, but only columns 11 through 17.)

lapply(flag_colors, sum)
The second argument is just the name of the function, with no parentheses after it.
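The three outcomes of sapply() (vector, matrix, list) can be reproduced without the flags dataset. The small data frames below are hypothetical stand-ins, not the swirl data.

```r
# Hypothetical stand-in for the colour columns of the flags data
flag_colors <- data.frame(
  red   = c(1, 0, 1),
  green = c(0, 1, 1),
  blue  = c(1, 1, 0)
)

as_list <- lapply(flag_colors, sum)     # always a list, one element per column
as_vec  <- sapply(flag_colors, sum)     # every result has length 1 -> named vector
as_mat  <- sapply(flag_colors, range)   # every result has length 2 -> 2 x 3 matrix

# Results of different lengths cannot be simplified, so sapply() returns a list
uneven <- data.frame(a = c(1, 1, 1), b = c(1, 2, 3))
as_mixed <- sapply(uneven, unique)

print(as_vec)
print(as_mat)
print(class(as_mixed))
```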

11: vapply and tapply
Whereas sapply() tries to 'guess' the correct format of the result, vapply() allows you to specify it explicitly. If the result doesn't match the format you specify, vapply() will throw an error, causing the operation to stop. This can prevent significant problems in your code that might be caused by getting unexpected return values from sapply().
vapply(flags, unique, numeric(1))
Error in vapply(flags, unique, numeric(1)) : values must be length 1,
but FUN(X[[1]]) result is length 194

tapply: As a data analyst, you’ll often wish to split your data up into groups based on the value of some variable, then apply a function to the members of each group.

The 'landmass' variable in our dataset takes on integer values between 1 and 6, each of which represents a different part of the world. Use table(flags$landmass) to see how many flags/countries fall into each group.

table(flags$landmass)
 1  2  3  4  5  6
31 17 35 52 39 20
table(flags$animate)
  0   1
155  39
This gives us the proportion of flags containing an animate image WITHIN each landmass group:
tapply(flags$animate, flags$landmass, mean)
(The animate values are split into groups by landmass, and mean() is applied to each group. Because animate is 0 or 1, each group mean is the proportion of 1s.)
        1         2         3         4         5         6
0.4193548 0.1764706 0.1142857 0.1346154 0.1538462 0.3000000
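Both functions can be tried on a tiny made-up table. The data frame below is a hypothetical six-row stand-in that only borrows the column names landmass and animate; it is not the swirl flags data.

```r
# Hypothetical miniature version of the flags data
flags <- data.frame(
  landmass = c(1, 1, 2, 2, 2, 3),
  animate  = c(1, 0, 0, 0, 1, 1)
)

# vapply() with an explicit promise: each result is one number
col_means <- vapply(flags, mean, numeric(1))

# The promise is checked: unique() returns more than one value here,
# so vapply() stops with an error instead of silently changing shape
failed <- inherits(
  try(vapply(flags, unique, numeric(1)), silent = TRUE),
  "try-error"
)

# table() counts rows per group; tapply() applies mean() within each group
counts <- table(flags$landmass)
prop_animate <- tapply(flags$animate, flags$landmass, mean)

print(col_means)
print(counts)
print(prop_animate)
```

Because animate only takes the values 0 and 1, each group mean is the share of rows in that landmass group with an animate image.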

12: Looking at Data
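head(plants, 10): look at the first 10 rows
tail(plants, 15): look at the last 15 rows
nrow() and ncol(): the number of rows and the number of columns
names(): the columns have names, so this returns the column names
summary(plants): shows how each variable is distributed and how much of the dataset is missing.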
summary() provides different output for each variable, depending on its class. For numeric data such as Precip_Min, summary() displays the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
For categorical variables (called ‘factor’ variables in R), summary() displays the number of times each value (or ‘level’) occurs in the data.

table(plants$Active_Growth_Period) counts how many times each value of a categorical/factor variable occurs.

str(plants): Any time you want to understand the structure of something (a dataset, function, etc.), str() is a good place to start
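The plants dataset ships with the swirl course, so the sketch below uses R's built-in iris dataset as a stand-in to walk through the same inspection functions. The choice of iris is an assumption made only for this example.

```r
data(iris)   # built-in dataset, used here in place of plants

dims <- c(nrow(iris), ncol(iris))        # rows and columns
first_rows <- head(iris, 10)             # first 10 rows
last_rows  <- tail(iris, 15)             # last 15 rows
col_names  <- names(iris)                # column names

print(dims)
print(col_names)
print(summary(iris))                     # numeric columns: quartiles; factor column: counts
species_counts <- table(iris$Species)    # how often each level occurs
print(species_counts)
str(iris)                                # compact overview of the structure
```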

13: Simulation
sample(x, size, replace = FALSE, prob = NULL)
Sampling with replacement simply means that each number is "replaced" after it is selected, so that the same number can show up more than once.
The sample() function can also be used to permute, or rearrange, the elements of a vector:
sample(LETTERS)
This is identical to taking a sample of size 26 from LETTERS, without replacement. When the 'size' argument to sample() is not specified, R takes a sample equal in size to the vector from which you are sampling.

We want to simulate 100 flips of an unfair two-sided coin. This particular coin has a 0.3 probability of landing 'tails' (0) and a 0.7 probability of landing 'heads' (1):
sample(c(0,1), 100, replace = TRUE, prob=c(0.3, 0.7))
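The two uses of sample() above, permuting and weighted sampling with replacement, can be sketched as follows. The seed is arbitrary and only makes the run repeatable.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# No size given: a random permutation of all 26 letters
perm <- sample(LETTERS)

# 100 flips of the unfair coin: 0 = tails (prob 0.3), 1 = heads (prob 0.7)
flips <- sample(c(0, 1), 100, replace = TRUE, prob = c(0.3, 0.7))
n_heads <- sum(flips)

print(perm)
print(n_heads)
```

With a size of 100 drawn from only two values, replace = TRUE is required; without it sample() would stop with an error.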

Each probability distribution in R has an r*** function (for "random"), a d*** function (for "density"), a p*** function (for "probability"), and a q*** function (for "quantile").
rbinom(): you only specify the probability of 'success' (heads):
rbinom(1, size = 100, prob = 0.7): returns 1 observation, the number of heads in 100 flips
rbinom(100, size = 1, prob = 0.7): returns 100 observations, each a single flip (0 or 1)

The standard normal distribution has mean 0 and standard deviation 1. As you can see under the 'Usage' section in the documentation, the default values for the 'mean' and 'sd' arguments to rnorm() are 0 and 1, respectively. Thus, rnorm(10) will generate 10 random numbers from a standard normal distribution.

rpois(n, lambda): Poisson distribution, where lambda is the mean
replicate(100, rpois(5, 10)): repeats rpois(5, 10) 100 times

Other distributions: exponential (rexp()), chi-squared (rchisq()), gamma (rgamma())
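The random-number functions in this lesson can be compared in one short script. This is a sketch with an arbitrary seed; the final colMeans() step is an extra illustration of how the replicate() result might be used.

```r
set.seed(2)   # arbitrary seed, for reproducibility only

# One observation: number of heads in 100 flips
one_count <- rbinom(1, size = 100, prob = 0.7)

# 100 observations: each is a single flip, 0 or 1
hundred_flips <- rbinom(100, size = 1, prob = 0.7)

# Normal draws: defaults are mean = 0, sd = 1
std_normal <- rnorm(10)
shifted    <- rnorm(10, mean = 100, sd = 25)

# replicate() repeats rpois(5, 10) 100 times and returns a 5 x 100 matrix
pois_mat  <- replicate(100, rpois(5, 10))
col_means <- colMeans(pois_mat)       # mean of each group of 5 draws

# The p*** and q*** functions of the same family
p_half <- pnorm(0)                    # probability below 0 for a standard normal
q_half <- qnorm(0.5)                  # the value with half the probability below it

print(one_count)
print(dim(pois_mat))
print(p_half)
```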

14: Dates and Times
class(): the class of an object
unclass(): shows the internal representation that the computer stores

Date objects: 2019-02-12
Time objects (POSIXct): 2019-02-12 hh:mm:ss
POSIXlt objects: a list of values that make up the date and time.
as.POSIXlt() converts a time to POSIXlt, e.g. t2 <- as.POSIXlt(Sys.time()); t2$min then extracts only the minutes.

Current time: Sys.time()
Sys.time() - t1    Time difference of 13.09417 hours    (t1 is a time stored earlier)
difftime(Sys.time(), t1, units = "days")    Time difference of 0.5464803 days

The weekdays() function will return the day of week from any date or time object.

strptime() converts character vectors to POSIXlt:
t3 <- "October 17, 1986 08:24"
t4 <- strptime(t3, "%B %d, %Y %H:%M")
t4 prints as "1986-10-17 08:24:00 CST" (the time zone shown depends on your system)
class(t4) returns "POSIXlt" "POSIXt"
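The date and time classes can be explored with fixed values so the results are repeatable. This sketch makes two assumptions that differ from the notes: it pins the time zone to UTC, and it parses a numeric date string rather than "October 17, 1986", because %B depends on the system locale.

```r
# Dates are stored as the number of days since 1970-01-01
d1 <- as.Date("2019-02-12")
days_since_epoch <- unclass(d1)

# POSIXct stores one number (seconds since 1970-01-01); POSIXlt stores a list of parts
t1 <- as.POSIXct("2019-02-12 08:24:00", tz = "UTC")
t2 <- as.POSIXlt(t1)
minutes <- t2$min
hours   <- t2$hour

# Time differences, with explicit units via difftime()
later <- t1 + 12 * 3600                        # 12 hours later, in seconds
gap_days <- difftime(later, t1, units = "days")

# strptime() parses text into POSIXlt according to a format string
t4 <- strptime("1986-10-17 08:24", "%Y-%m-%d %H:%M", tz = "UTC")

print(days_since_epoch)
print(minutes)
print(gap_days)
print(class(t4))
```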

15: Base Graphics
head(cars): the first 6 rows of data.
First thing to do: get a sense of the data with dim(), names(), head(), tail() and summary().
plot(cars) draws a scatterplot.
R was smart enough to know that the first element (i.e., the first column) in cars should be assigned to the x argument and the second element to the y argument.

First, R notes that the data frame you have given it has just two columns, so it assumes that you want to plot one column versus the other. Second, since we do not provide labels for either axis, R uses the names of the columns. Third, it creates axis tick marks at nice round numbers and labels them accordingly. Fourth, it uses the other defaults supplied in plot().

Do not type plot(cars$speed, cars$dist), although that will work. Instead, use plot(x = cars$speed, y = cars$dist).

col = 2: plotted points are colored red.
plot(cars, xlim = c(10, 15)): limits the x-axis to 10 through 15.
plot(cars, pch = 2): plots the points as triangles.

boxplot(), like many R functions, also takes a "formula" argument, generally an expression with a tilde ("~") which indicates the relationship between the input variables. This allows you to enter something like mpg ~ cyl to plot the relationship between cyl (number of cylinders) on the x-axis and mpg (miles per gallon) on the y-axis.

boxplot(mpg ~ cyl, data = mtcars)
For histograms, the associated R function is hist().
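The plotting calls in this lesson can be collected into one script. The sketch below draws to a null PDF device so that it also runs where no display is available; that device setup and the choice of mtcars$mpg for the histogram are assumptions for this example, and in an interactive session you would leave out the pdf(NULL) and dev.off() lines.

```r
pdf(NULL)   # null device: nothing is written to disk (assumption for non-interactive runs)

data(cars)
data(mtcars)

# Scatterplot of a two-column data frame: first column on x, second on y
plot(cars)

# The same plot with named arguments, red triangles, and a restricted x-axis
plot(x = cars$speed, y = cars$dist, col = 2, pch = 2, xlim = c(10, 15))

# Formula interface: mpg on the y-axis, one box per value of cyl
bp <- boxplot(mpg ~ cyl, data = mtcars)

# Histogram of a single variable
h <- hist(mtcars$mpg)

invisible(dev.off())

print(bp$names)        # the group labels used on the x-axis
print(sum(h$counts))   # every car falls into exactly one bin
```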