[R Language Basics Notes]

Latest recommended article published on 2022-03-17 22:01:11

Guangshan Hu

Tags: R, programming languages

Article link: https://blog.csdn.net/xhy1996/article/details/121852027

Basic Concepts

Data problems

-Missing data
-Confounding variable (confounder)
a variable that influences both the dependent variable and the independent variable, causing a spurious association
-Bias
-Outliers
Outliers are data points that lie well beyond the bulk of the samples for a variable along one or more dimensions

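Variables

Numerical (quantitative)
-Continuous
(A variable that can take any value within an interval is continuous: its values are unbroken, and the gap between two neighbouring values can be subdivided indefinitely, so it can take infinitely many values.)
-Discrete
(A discrete variable is one whose values can be listed one by one in order, usually taking integer values.)
Categorical (qualitative)
-Ordinal
(The values not only classify the items but also rank them by some characteristic. The "education level" question commonly found in the demographic part of a questionnaire, and attitude-scale items, are ordinal variables. Their values can be compared in size or ordered by strength.)
-Nominal
(The different values only stand for different classes of items. Among the demographic questions most often used in questionnaires, the respondent's "gender" is a nominal variable.)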

Common Summary Statistics

Mean
Median
Variance
Standard deviation
Range
IQR (inter-quartile range)
Quartiles (Q1, Q2, Q3)
– Quartiles are 3 points that divide the sorted data into 4 equal groups
– Each group is a quarter of the data
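The statistics listed above can be tried out on a small vector. This is only a sketch on made-up data; the variable names are illustrative and not from the original notes. Note that quantile() uses R's default interpolation rule (type 7), so other software may report slightly different quartiles.

```r
# Toy data chosen so the results are easy to check by hand
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

x_mean   <- mean(x)                                   # arithmetic mean
x_median <- median(x)                                 # middle value of the sorted data
x_var    <- var(x)                                    # sample variance (divides by n - 1)
x_sd     <- sd(x)                                     # standard deviation = sqrt(variance)
x_range  <- diff(range(x))                            # range as max - min
x_quart  <- quantile(x, probs = c(0.25, 0.5, 0.75))   # Q1, Q2, Q3
x_iqr    <- IQR(x)                                    # Q3 - Q1

print(c(mean = x_mean, median = x_median, var = x_var,
        sd = x_sd, range = x_range, IQR = x_iqr))
print(x_quart)
```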

Boxplots in R

A boxplot represents numerical data through its quartiles (the lower hinge is Q1, the upper hinge is Q3)
– lower inner fence is Q1 − 1.5 × IQR
– upper inner fence is Q3 + 1.5 × IQR
whiskers extend to the min/max data values inside the fences
outliers outside the fences are highlighted
the whiskers indicate variability outside the upper and lower quartiles

Figure: BoxPlot
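As a rough illustration of the fence rule, the sketch below computes the inner fences by hand on made-up data and compares the flagged points with boxplot.stats(). One caveat: R's boxplot works from hinges (as returned by fivenum()), which can differ slightly from quantile(), so the two approaches agree here but need not agree on every data set.

```r
x <- c(1:9, 30)                           # 30 is an obvious outlier

q1  <- quantile(x, 0.25, names = FALSE)   # first quartile
q3  <- quantile(x, 0.75, names = FALSE)   # third quartile
iqr <- q3 - q1

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are the ones a boxplot would highlight
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)

# boxplot.stats() applies the same 1.5 rule, based on the hinges
print(boxplot.stats(x)$out)
```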

Covariance and Correlation

The covariance of two random variables shows how they are related:
Positive covariance: they are positively related
Negative covariance: they are negatively related
The correlation coefficient of two random variables is their covariance divided by the product of their standard deviations:
it shows how strongly the two random variables are linearly related
if close to 1, they are positively linearly related
if close to −1, they are negatively linearly related
if close to 0, they are weakly linearly related
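The definition above can be checked numerically. A small sketch on made-up vectors (the names are illustrative): the correlation computed by hand from cov() and sd() should match the built-in cor().

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cov_xy      <- cov(x, y)                    # sample covariance
cor_manual  <- cov_xy / (sd(x) * sd(y))     # covariance / product of standard deviations
cor_builtin <- cor(x, y)                    # Pearson correlation

print(c(cov = cov_xy, manual = cor_manual, builtin = cor_builtin))

# Exact linear relationships give the extreme values
cor_pos <- cor(x, 2 * x + 1)    # increasing line
cor_neg <- cor(x, -3 * x)       # decreasing line
print(c(cor_pos, cor_neg))
```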

R Code Snippets and Output

1. Checking Types and Creating Sequences

# Check the type of each value
> typeof(2)
[1] "double"
> typeof(2L) # the L suffix makes the number an integer; without it, it is a double
[1] "integer"
> typeof(3.14)
[1] "double"
> typeof(TRUE)
[1] "logical"
> typeof("TRUE")
[1] "character"
#######
> round(2/3,4) # round to 4 decimal places
[1] 0.6667
> seq(from=3, to=10, by=2) # create a sequence
[1] 3 5 7 9
> seq(from = 2, by = -0.1, length.out = 4) # step of -0.1, length 4
[1] 2.0 1.9 1.8 1.7

> x <- c(1,3,6,9,0)
> x[-2] # all elements except the second
[1] 1 6 9 0
> rep(3,4)  # repeat 3 four times
[1] 3 3 3 3
> rep(4,2)
[1] 4 4
> rep(1:5,3) # repeat 1-5 three times
 [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
# Join two variables
> c1 <- "Hello" # character variable 
> c2 <- "The World!" # another character variable 
> paste(c1, c2)
[1] "Hello The World!"
# Numbers are converted to character; c(c1,c2) is passed as a single vector, so its elements are not joined together
> c1 <- 1  
> c2 <- 2   
> paste(c(c1,c2))
[1] "1" "2"

2. Factors

# Factors (categorical data)
> x <- factor(c("male", "female", "male", "male", "female", "male")) 
> print(x)  # show the result
[1] male   female male   male   female male  
Levels: female male
> nlevels(x) # number of levels
[1] 2
> unclass(x) # integer code of each element
[1] 2 1 2 2 1 2
attr(,"levels") # the distinct values
[1] "female" "male"  
> table(x) # frequency of each level
x
female   male 
     2      4 
> ordered(x, levels = c("male", "female")) # ordered factor with an explicit level order
[1] male   female male   male   female male  
Levels: male < female
# Rename the levels; the new names are assigned in the order of levels(x), i.e. female, male
> levels(x) <- c("F", "M")
> x
[1] M F M M F M
Levels: F M
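Because levels are stored in sorted order rather than in order of appearance, renaming them by position is easy to get backwards. One way to make the mapping explicit, sketched here with illustrative labels, is to assign a named list to levels(), where each name is the new label and each value is the old one.

```r
x <- factor(c("male", "female", "male", "male", "female", "male"))

# Levels are sorted alphabetically, not by first appearance
old_levels <- levels(x)
print(old_levels)

# Named list: new label = old label, so the order no longer matters
levels(x) <- list(M = "male", F = "female")
print(x)
print(table(x))
```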

3. Common Calculations

> x1 <- c(4, 2.5, 3, NA, 1)
> summary(x1) # works with NA 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  1.000   2.125   2.750   2.625   3.250   4.000       1 
> mean(x1) # does not work: the mean cannot be computed while an NA is present
[1] NA
> mean(x1, na.rm=TRUE) # ignore NA when computing the mean
[1] 2.625
> is.na(x1) # flag the NA values
[1] FALSE FALSE FALSE  TRUE FALSE
> which(is.na(x1)) # positions of the NA values
[1] 4
> x1[is.na(x1)] <- 0 # replace the NA values in x1 with 0

4. Matrices

# Create matrices
> m <- matrix(nrow=2, ncol=3) #empty matrix with dimension
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> attributes(m) 
$dim
[1] 2 3
> dim(m)
[1] 2 3
> m <- matrix(c(1,3,6,2,8,4), nrow=2, ncol=3 ) # matrices are built column-wise
     [,1] [,2] [,3]
[1,]    1    6    8
[2,]    3    2    4
> str(m) # str is short for structure; it gives a more compact display
num [1:2, 1:3] 1 3 6 2 8 4

# Two more ways to create matrices:
x <- c(1,11,111)                 
y <- c(2,22,222)
m1 <- cbind(x,y)  #column-building
     x   y
[1,]   1   2
[2,]  11  22
[3,] 111 222
m2 <- rbind(x,y)  #row-building
  [,1] [,2] [,3]
x    1   11  111
y    2   22  222

5. Data Frame

> x <- c(1,2,3) 
> y <- c("a", "b", "c") 
> z <- c(TRUE, TRUE, FALSE) 
> df <- data.frame(x,y,z)
> df
  x y     z
1 1 a  TRUE
2 2 b  TRUE
3 3 c FALSE

> x <- c(1,3,5)
> names(x) <- c('a','b','c') # name the elements
> x
a b c 
1 3 5
> names(x) <- NULL # remove the names again
> x
[1] 1 3 5

> y <- list(low=3, med=5, high=7)
> y
$low
[1] 3
$med
[1] 5
$high
[1] 7
> y$low    # look up a list element by name
[1] 3


> m <- matrix(1:6, nrow=3, ncol=2) 
> dimnames(m)<- list(c("a", "b", "c"), c("d", "e"))
> m
  d e 
a 1 4 
b 2 5 
c 3 6

> rownames(df) <- c("f1", "f2", "f3")
> colnames(df) <- c("rank", "character", "value")
> df
   rank character value
f1    1         a  TRUE
f2    2         b  TRUE
f3    3         c FALSE

# read.table() reads .txt data files; read.csv() reads .csv files
# source() runs an .R file and makes the code in it available
# write.table() and write.csv() export data to files
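A round trip through these functions might look like the sketch below. The file paths come from tempfile() purely for illustration, and the data frame mirrors the one built above; in practice the paths would be real file names.

```r
df <- data.frame(rank = c(1, 2, 3),
                 character = c("a", "b", "c"),
                 value = c(TRUE, TRUE, FALSE))

# CSV: write, then read back
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)    # row.names = FALSE avoids an extra index column
df_csv <- read.csv(csv_path)

# Tab-separated text: write, then read back
txt_path <- tempfile(fileext = ".txt")
write.table(df, txt_path, sep = "\t", row.names = FALSE, quote = FALSE)
df_txt <- read.table(txt_path, header = TRUE, sep = "\t")

print(df_csv)
print(df_txt)
```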

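6. Logical Expressions

# & (AND), element-wise
> x <- c(1,2,3,4,5,6,7,8,9)
> x>2 & x<5
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
# && (AND) is meant for length-1 operands only; older R versions silently compared just the first elements, while R 4.3.0 and later raise an error for longer vectors
> x[1]>2 && x[1]<5
[1] FALSE
# | is the element-wise OR; || relates to | as && relates to &
> x <- seq(10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
 # if ... then ... otherwise, element-wise
> ifelse(x %% 2 == 0,"even","odd")
 [1] "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even"
# if, else, else if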
> x <- 10
> x
[1] 10
> if(x > 2){ 
+   print("Greater") 
+ } else if(x < 2) { 
+   print("Smaller") 
+ } else { 
+   print("Equal") 
+ }
[1] "Greater"
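Since if() and && expect a single TRUE or FALSE, a vector comparison usually has to be reduced first. A minimal sketch using any() and all(), with illustrative variable names:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

in_range <- x > 2 & x < 5           # element-wise result, one value per element

any_in_range <- any(in_range)       # TRUE if at least one element qualifies
all_in_range <- all(in_range)       # TRUE only if every element qualifies

if (any_in_range) {
  print("at least one value lies strictly between 2 and 5")
}

# && is fine when both sides have length 1
first_ok <- x[1] > 2 && x[1] < 5
print(first_ok)
```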

7. Loops

# for loop
for (i in 1:5){ 
  print(i) 
}
[1] 1 
[1] 2 
[1] 3 
[1] 4 
[1] 5
# Initialize a 2x3 matrix
> m <- matrix(nrow=2, ncol=3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
# Fill the matrix using nested for loops
for (i in 1:nrow(m)){
  for(j in 1:ncol(m)){
    m[i,j] <- i*j
  }
}
> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    2    4    6
# while loop
count <- 5 
while(count >0){ 
  print(count) 
  count <- count -1 
}
[1] 5
[1] 4
[1] 3
[1] 2
[1] 1
# for loop with if and next (skips the even numbers)
for(i in 1:10){
  if(i %% 2==0){
    next
  }
  print(i)
}
[1] 1
[1] 3
[1] 5
[1] 7
[1] 9

8. Functions

# Define a function with function()
# Example 1
myfunc1 <- function(n){ 
  n*n 
}
> myfunc1(5)
[1] 25
> t <- 1:5 
> myfunc1(t)
[1]  1  4  9 16 25
> y <- myfunc1(10)
> y
[1] 100

# Example 2
f2 <- function(x,y=2){ 
  x+y 
}
# Both arguments are supplied, so the default y=2 is not used: 5+5 = 10
> f2(5,5)
[1] 10
# y falls back to its default value 2: 5+2 = 7
> f2(5)
[1] 7

# A function that computes the mean of each column
col.mean <- function(y, removeNA=TRUE) {    
  nc <- ncol(y)   # number of columns
  means <- numeric(nc) # a zero-filled vector of length nc that will hold the column means
  for(i in seq_len(nc)) {
    means[i] <- mean(y[,i], na.rm=removeNA) # NA values are dropped unless removeNA=FALSE
  }
  means
}
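A possible way to exercise col.mean on a small matrix with a missing value is sketched below. The function is repeated so the block runs on its own, and the test matrix is made up. The built-in colMeans() is shown for comparison.

```r
col.mean <- function(y, removeNA = TRUE) {
  nc <- ncol(y)
  means <- numeric(nc)                 # holds one mean per column
  for (i in seq_len(nc)) {             # seq_len() is safe even when nc is 0
    means[i] <- mean(y[, i], na.rm = removeNA)
  }
  means
}

# Column 1 is (1, 2, NA); column 2 is (4, 5, 6)
m <- matrix(c(1, 2, NA, 4, 5, 6), nrow = 3, ncol = 2)

with_removal    <- col.mean(m)                     # NA dropped before averaging
without_removal <- col.mean(m, removeNA = FALSE)   # NA propagates to column 1
builtin         <- colMeans(m, na.rm = TRUE)       # vectorized built-in equivalent

print(with_removal)
print(without_removal)
print(builtin)
```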

# A function for x^n: a function nested inside a function
make.power <- function(n) { # a function that returns a function
  pow <- function(x) {
    x^n
  }
  pow # the return value is the inner function
}
cube <- make.power(3)  # n is 3
cube(2)
[1] 8

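# A function that contains plotting code
# plot() draws the figure
drawFun <- function(f){
  x <- seq(-5, 5, length.out=1000)
  # sapply returns a vector, whereas lapply returns a list
  # the first argument is the input vector, the second is the function to apply
  y <- sapply(x, f)
  # draw a line plot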
  plot(x, y, type="l", col="blue")
}
drawFun(cos)
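The difference between sapply and lapply mentioned in the comments can be seen on a tiny input. This sketch also shows vapply, which does the same job as sapply but checks that each result has the declared type and length; the helper name square is illustrative.

```r
x <- c(0, 1, 2)
square <- function(v) v^2

as_vector <- sapply(x, square)               # simplified to a numeric vector
as_list   <- lapply(x, square)               # always a list, one element per input
checked   <- vapply(x, square, numeric(1))   # like sapply, with a type/length check

print(as_vector)
str(as_list)
print(checked)
```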

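9. Generating Random Numbers

These are R's probability distribution functions; they let us simulate and evaluate variables that follow a given probability distribution.
• rnorm generates random draws from a normal distribution
• pnorm gives the cumulative distribution function of the normal distribution
• dnorm gives the probability density of the normal distribution
• qnorm gives the quantile of the normal distribution for a given probability
d stands for density
r stands for random number generator
p stands for cumulative distribution
q stands for quantile function

q: vector of quantiles
p: vector of probabilities
n: number of observations to generate; if length(n) > 1, the length is taken to be the number required
mean: vector of means
sd: vector of standard deviations
log, log.p: logical; if TRUE, probabilities p are given as log(p)
lower.tail: logical; if TRUE (default), probabilities are P[X ≤ x], otherwise P[X > x]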
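A few calls on the standard normal distribution may help tie the prefixes and arguments together. This is a sketch with illustrative variable names; the values in the comments are the well-known ones for N(0, 1).

```r
d0 <- dnorm(0)                           # density at 0, equals 1 / sqrt(2 * pi)
p0 <- pnorm(0)                           # P[X <= 0] = 0.5 by symmetry
q_med <- qnorm(0.5)                      # the median of N(0, 1) is 0

# lower.tail = FALSE returns P[X > x] instead of P[X <= x]
p_upper <- pnorm(1, lower.tail = FALSE)

# log = TRUE returns the log of the density
log_d0 <- dnorm(0, log = TRUE)

# qnorm is the inverse of pnorm
round_trip <- qnorm(pnorm(1.3))

# mean and sd shift and scale the distribution
p_shifted <- pnorm(20, mean = 20, sd = 2)

print(c(d0 = d0, p0 = p0, q_med = q_med, p_upper = p_upper,
        log_d0 = log_d0, round_trip = round_trip, p_shifted = p_shifted))
```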

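# Generate 10 normally distributed random numbers with mean 20 and standard deviation 2
# set.seed(1) fixes the state of the random number generator; it must be called before the draw, and calling set.seed(1) again reproduces the same numbers
> set.seed(1)
> x1 <- rnorm(10,20,2)
> x1
 [1] 18.74709 20.36729 18.32874 23.19056 20.65902 18.35906 20.97486 21.47665 21.15156 19.38922
 # Re-running the code above shows that the 10 values of x1 are identical
> set.seed(1)
> x1 <- rnorm(10,20,2)
> x1
 [1] 18.74709 20.36729 18.32874 23.19056 20.65902 18.35906 20.97486 21.47665 21.15156 19.38922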

The sample() function draws at random from a specified set of (scalar) objects.
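A brief sketch of sample() in use, with an arbitrary seed and illustrative names. The exact numbers drawn depend on the seed and R version, so the example only relies on properties that always hold.

```r
set.seed(42)
draw1 <- sample(1:10, size = 3)            # 3 distinct values, without replacement
set.seed(42)
draw2 <- sample(1:10, size = 3)            # same seed, so the same draw

same_draw <- identical(draw1, draw2)
print(same_draw)

# replace = TRUE allows the same element to be drawn more than once
coin <- sample(c("H", "T"), size = 5, replace = TRUE)
print(coin)

# With no size argument, sample() returns a random permutation
shuffled <- sample(1:5)
print(shuffled)
```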

10. Plotting

plot(): plots based on the object type of the input
lines(): add lines to the plot (just connect dots)
points(): add points
text(): add text labels to a plot using x,y coordinates
title(): add titles
mtext(): add arbitrary text to the margin
axis(): adding axis ticks/labels

Different values for type
"p" - points (default): a scatter plot
"l" - lines
"b" - both points and lines, with the points connected by the lines
"c" - empty points joined by lines (the lines of "b" without the points)
"o" - overplotted points and lines, with the lines passing through the points
"s" and "S" - stair steps
"h" - histogram-like vertical lines from each point down to the x-axis
"n" - does not produce any points or lines
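To compare the type values side by side, one option is to draw the same curve once per type in a grid of panels. This is a sketch, assuming a 2 x 4 layout fits the graphics device; the variable names are illustrative.

```r
x <- seq(0, 2 * pi, length.out = 20)
types <- c("p", "l", "b", "c", "o", "s", "h", "n")

# 2 rows x 4 columns; par() returns the previous setting so it can be restored
opar <- par(mfrow = c(2, 4))
for (tp in types) {
  plot(x, sin(x), type = tp,
       main = paste0('type = "', tp, '"'))
}
par(opar)   # restore the previous layout
```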

# Example 1:
x <- seq(-2*pi,2*pi,0.1) 
# point plot + main title + x and y axis labels
plot(x, sin(x), 
	# main title
	main="my Sine function", 
	# x and y axis labels
	xlab="the values", 
	ylab="the sine values",
	# plot type
	type="p", 
	# color
	col="blue")

# Example 2:
plot(x, sin(x), 
     main="Overlaying Graphs", 
     type="l", col="blue") 
lines(x,cos(x), col="red") 
points(x, 0.5*x,
       type="l", col="green") 
# legend adds a key (position, labels, colors)
legend("topleft", 
       c("sin(x)","cos(x)","0.5x"), 
       fill=c("blue","red","green") )

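11. par: Combining Plots

The function par() is called as follows:

par(..., no.readonly= FALSE)

Here ... stands for any arguments of the form tag=value; these parameters are described below. When no.readonly=TRUE is the only argument, par() returns the current values of just those graphical parameters that can be set by a later par() call on the same device.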
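The behaviour of no.readonly can be inspected directly. A small sketch with illustrative variable names, using the read-only parameter cin as the example: it appears in the full par() listing but not in the no.readonly = TRUE listing. Setting a parameter also returns its previous value, which makes restoring it straightforward.

```r
all_pars      <- par()                     # every parameter, including read-only ones
settable_pars <- par(no.readonly = TRUE)   # only the parameters that can be set

cin_in_all      <- "cin" %in% names(all_pars)
cin_in_settable <- "cin" %in% names(settable_pars)
print(c(cin_in_all, cin_in_settable))

# Setting a parameter returns the old value (invisibly) as a list
old <- par(mfrow = c(1, 2))
during <- par("mfrow")
par(old)                                   # put the previous layout back
after <- par("mfrow")
print(during)
print(after)
```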

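opar<-par(no.readonly = TRUE)
# The first fig= places the scatter plot at 0 to 0.8 horizontally and 0 to 0.8 vertically.
par(fig=c(0,0.8,0,0.8))
plot(mtcars$wt,mtcars$mpg,xlab = "x",ylab = "y")
# The top box plot spans 0 to 0.8 horizontally and 0.55 to 1 vertically; the right-hand box plot spans 0.65 to 1 horizontally and 0 to 0.8 vertically. fig= starts a new plot by default, so set new=TRUE when adding a figure to an existing one.
par(fig=c(0,0.8,0.55,1),new=TRUE)
boxplot(mtcars$wt,horizontal = TRUE,axes=FALSE)
# 0.55 is used rather than 0.8 so that the top plot does not sit too far from the scatter plot. Likewise,
# 0.65 pulls the right-hand box plot closer to the scatter plot (finding good position values takes some trial and error).
par(fig=c(0.65,1,0,0.8),new=TRUE)
boxplot(mtcars$mpg,axes=FALSE)
par(opar) # restore the settings saved at the start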

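The parameters of par() fall into three groups:
1. Read-only parameters that can be queried but not set: cin, cra, csi, cxy, din.

2. Parameters that can only be set through par():

"ask", "fig", "fin", "lheight", "mai", "mar", "mex", "mfcol", "mfrow", "mfg", "new", "oma", "omd", "omi", "pin", "plt", "ps", "pty", "usr", "xlog", "ylog"

3. The remaining parameters can be set through par() and also through the high-level plotting functions, such as plot, points, lines, abline, title, text, axis, image, box, contour, rect and arrows.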

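adj: this parameter sets the justification of strings in text, mtext and title. 0 means left-justified, 0.5 (the default) centered, and 1 right-justified (any value in [0,1] is valid, and most graphics devices also accept values outside that interval). Note that in text() the value can also be given in the form adj=c(x,y) to adjust in both directions. In text() the parameter affects how the label is placed relative to its point, whereas for mtext and title it is relative to the whole plot or device region.

# Plot in a layout of 1 row and 2 columns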
par(mfrow=c(1,2))
plot(1:4)
title("plot(1:4)",adj=0)
plot(1:4)
title("plot(1:4)",adj=1)

Further details are not shown here; the description of par() above follows an external reference.

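12. Histogram + Line Plots

x <- mtcars$mpg
h <- hist(x, col="red", breaks=10, freq=FALSE)
# hist draws a histogram; freq=FALSE puts densities on the y-axis
xfit<-seq(min(x)-10,max(x)+10,length.out=40)
yfit<-dnorm(xfit,mean=mean(x),sd=sd(x)) 
# dnorm: normal density with the sample mean and sd, evaluated on the xfit grid
d <- density(mtcars$mpg) 
# lwd=2 doubles the line width
lines(d, col="green", lwd=2)
lines(xfit, yfit, col="blue", lwd=2)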

13. Plotting Details

abline() adds a straight line to the current graph

lines() takes a vector of x values and a vector of y values and joins the points to each other

points() adds a set of (x,y) points

legend() adds a legend to a multicurve graph (position, labels, colors)
legend("topleft", c("sin(x)","cos(x)"), fill=c("blue","red") )

text() places some text anywhere in the current graph
mtext() adds text in the margins

polygon() draws arbitrary polygonal objects; it can be used to fill a region under a curve with color

abline(a = NULL, b = NULL, h = NULL, v = NULL, reg = NULL, coef = NULL, untf = FALSE, ...)
a, b: intercept and slope; h: y value(s) for horizontal lines; v: x value(s) for vertical lines

# Lines
x <- c(0,2,3)
y <- c(1,3,8)
plot(x,y)
fit <- lm(y ~ x) 
abline(fit) #adds a line to a plot.
abline(h=1, col="red")
abline(v=2, col="blue")
abline(3,4, col="green") # y=4x+3

f <- function(x) return(sin(x))
curve(f,0,2)
polygon(c(1.2,1.4,1.4,1.2),
c(0,0,f(1.3),f(1.3)),
col="gray")
text(1,0.4,"r is cool", col="green")
mtext(123, col="blue")

14. Saving Plots

# Temperature is not defined in these notes; as an assumption, the built-in airquality$Temp is used here
Temperature <- airquality$Temp

# Save as a jpeg
# Saves to the current directory
jpeg(filename="plot1.jpeg") 
hist(Temperature, col="darkgreen") 
dev.off()

# Save as a png
png(filename="plot2.png", width=600, height=350) 
hist(Temperature, col="gold") 
dev.off() 

# Save as a pdf file
pdf(file="saving_plot4.pdf") 
hist(Temperature, col="violet") 
dev.off()
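If a plotting call fails between opening the device and dev.off(), the file device stays open. One way to guard against that, sketched here with a hypothetical helper save_hist_pdf (not part of the original notes) and the built-in airquality data as a stand-in, is to register dev.off() with on.exit().

```r
# Hypothetical helper: write a histogram to a PDF and always close the device
save_hist_pdf <- function(values, path) {
  pdf(file = path)
  on.exit(dev.off())            # runs even if hist() raises an error
  hist(values, col = "violet", main = "Histogram")
  invisible(path)
}

# tempfile() is used only so the example does not clutter the working directory
out_path <- save_hist_pdf(airquality$Temp, tempfile(fileext = ".pdf"))
print(file.exists(out_path))
```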
