A One-of-a-Kind R Tutorial in China: R for Mathematical Statistics (I)

wxy_158

Tags: R, programming languages

Article link: https://blog.csdn.net/wxy_158/article/details/131496024

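This article introduces basic data entry in R, including vector operations, statistical functions such as mean, median, and var, and how to modify and test data. It also covers data visualization (histograms, pie charts, and box plots), the handling of categorical and continuous data, the definition and use of custom functions, and tables and charts built from entered data.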

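Basic operations

1 Entering data (vectors)

General usage

> is the prompt for entering a statement; the command you want to run is typed after it. + means the current line is being continued.
Note: any line that starts with [1] or is shown in blue is output from the computer; everything else is input, and every input line is implicitly preceded by >.
Data is entered with c, a built-in R function used as shown below (don't forget the commas between items, and put every non-numeric item in double quotes "").
Without c you get Error: unexpected ',' in…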
typos = c(2,3,0,3,1,0,0,1)
typos
[1] 2 3 0 3 1 0 0 1

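Data can also be entered with scan; see the section on categorical data for details.
Other functions include the mean (mean), the median (median), and the variance (var).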

median(typos)
[1] 1
var(typos)
[1] 1.642857

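mean also has special variants: to drop 1/10 of the observations from each end before averaging, write mean(typos, trim=1/10).
The IQR and mad functions are similar measures, for spread rather than center.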
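
As a small illustration of the trimmed mean, IQR, and mad mentioned above, here is a sketch on a made-up vector with one extreme value (the vector x and the variable names are assumptions for this example, not part of the original notes):

```r
# Sample vector with one extreme value (made up for illustration)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

plain_mean   <- mean(x)               # pulled upward by the outlier 100
trimmed_mean <- mean(x, trim = 1/10)  # drops 10% of the observations from each end first

spread_iqr <- IQR(x)  # interquartile range: 3rd quartile minus 1st quartile
spread_mad <- mad(x)  # median absolute deviation, scaled by 1.4826 by default

print(c(mean = plain_mean, trimmed = trimmed_mean))
print(c(IQR = spread_iqr, mad = spread_mad))
```

With ten observations, trim = 1/10 removes exactly one value from each end, so the outlier no longer affects the result.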
Modifying a single value in a data set (a data set is simply a vector whose elements are ordered, much like an array)

typos.draft1 = c(2,3,0,3,1,0,0,1)
typos.draft2 = typos.draft1 # make a copy
typos.draft2[1] = 0

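This also shows that comments start with #; anything after # is not executed. Parentheses () are used for function calls, and square brackets [] are used to index a vector.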

typos.draft2
[1] 0 3 0 3 1 0 0 1
typos.draft2[2]
[1] 3
typos.draft2[4]
[1] 3
typos.draft2[-4] # everything except the fourth element
[1] 0 3 0 1 0 0 1
typos.draft2[c(1,2,3)]
[1] 0 3 0

How do we test every value of typos.draft2 to see whether it equals 3? (Note that the comparison operator is ==.)
Method 1 (very brief)

which(typos.draft2 == 3)
[1] 2 4

Method 2 (more elaborate)

n = length(typos.draft2)
pages = 1:n
pages
[1] 1 2 3 4 5 6 7 8
pages[typos.draft2 == 3]
[1] 2 4

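Method 2 uses R's built-in length function.
1:8 # the arithmetic sequence from 1 to 8; (1:8) and c(1:8) work as well
This generalizes to the seq function: seq(from, to, by)
The idea is: how do we get the matching positions? Create pages; then pages[typos.draft2 == 3] amounts to asking
"output page 1? output page 2? …" for each position in turn.
So it can also be written as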

(1:length(typos.draft2))[typos.draft2 == max(typos.draft2)]
[1] 2 4
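
To make the "output page 1? output page 2?" idea concrete, the sketch below prints the intermediate logical vector that does the selecting. The names is_three and odd_pages are introduced here for illustration only:

```r
typos.draft2 <- c(0, 3, 0, 3, 1, 0, 0, 1)

is_three <- typos.draft2 == 3     # logical vector, one TRUE/FALSE per element
print(is_three)

pages <- seq_along(typos.draft2)  # 1, 2, ..., length(typos.draft2)
print(pages[is_three])            # keeps only the positions where is_three is TRUE

# seq(from, to, by) generalizes the colon operator
odd_pages <- seq(1, 7, by = 2)
print(odd_pages)
```

Indexing with a logical vector keeps exactly the elements whose mask entry is TRUE, which is why it gives the same answer as which.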

Try to guess the results of the following by analogy!

sum(typos.draft2)
[1] 8
sum(typos.draft2>0) # How many pages with typos?
[1] 4
typos.draft1 - typos.draft2 # difference between the two
[1] 2 0 0 0 0 0 0 0

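Note the difference between the first and second lines: the first adds up the values, while the second counts how many elements satisfy the condition. The a = c(1:8) example further down shows the difference unambiguously.
Using sort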
a = c(3,9,16,6,7,4,22,5,10,13)

sort(a)
[1] 3 4 5 6 7 9 10 13 16 22
sort(a,decreasing = F)
[1] 3 4 5 6 7 9 10 13 16 22

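The two calls above are exactly equivalent.
Note also that a itself keeps its original, unsorted order; sort only affects the value it returns.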
a=c(1:8)

a
[1] 1 2 3 4 5 6 7 8
sum(a)
[1] 36
sum(a>7)
[1] 1

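The third line (typos.draft1 - typos.draft2) subtracts element by element. The two vectors should normally have the same length; otherwise R recycles the shorter one and warns if the longer length is not a multiple of the shorter.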
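
What happens when the lengths differ can be checked directly. This is a minimal sketch with made-up vectors a and b (not from the original notes); it relies on R's standard recycling behavior for arithmetic on vectors:

```r
a <- 1:6
b <- c(10, 20)

# Lengths differ, but 6 is a multiple of 2, so b is silently recycled
diff_ok <- a - b
print(diff_ok)

# 6 is not a multiple of 4: R still computes a result but issues a warning,
# which tryCatch turns into TRUE here
warned <- tryCatch({ a - 1:4; FALSE }, warning = function(w) TRUE)
print(warned)
```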

x=c(48,49,51,50,49,41,40,38,35,40)
x=c(x,48,49,51,50,49)
length(x)
[1] 15
x[16]=41
x[17:20] = c(40,38,35,40)
length(x)
[1] 20

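x=c(x,48,49,51,50,49) appends five values to x; as in C, = is assignment.
x[17:20] = c(40,38,35,40) assigns several positions at once. Any set of positions, contiguous or not, can be modified the same way, e.g. a[c(1,2,3)]=c(4,5,6).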

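sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25) # the same data used in the cut example later on
fivenum(sals) # minimum, median of the lower half, median, median of the upper half, maximum
[1]  0.25  1.00  3.50  8.00 50.00
summary(sals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.250   1.250   3.500   8.565   7.250  50.000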

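summary and fivenum differ slightly: fivenum reports the medians of the lower and upper halves, while summary reports the 1st and 3rd quartiles, which is why 1.00 and 8.00 above become 1.250 and 7.250.

The median itself depends on whether the number of observations is odd or even: with an odd count it is the middle value, and with an even count it is the mean of the two middle values.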

Custom functions

How easy it is to define your own functions is one of R's most appealing features.

std = function(x) sqrt(var(x))
std(whale)
[1] 71.50789

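The first line defines the standard deviation (the square root of the variance) by hand, on the assumption that R does not provide it. (whale is a numeric data vector that is not defined in this section.)
Note that the denominator is n-1, not n: s = sqrt( sum((x_i - mean(x))^2) / (n-1) )
Equivalently: sqrt( sum( (whale - mean(whale))^2 /(length(whale)-1)))
But how could something this common and important be missing? I got caught out.
In fact the built-in sd function does the job.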
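
A quick check that the hand-written std, the explicit n-1 formula, and the built-in sd agree. Because whale is not defined in this section, the sketch reuses the typos vector from earlier; the name by_formula is introduced here for illustration:

```r
std <- function(x) sqrt(var(x))  # the hand-written version from the text

typos <- c(2, 3, 0, 3, 1, 0, 0, 1)
n <- length(typos)

# Explicit formula with the n - 1 denominator
by_formula <- sqrt(sum((typos - mean(typos))^2) / (n - 1))

print(std(typos))
print(sd(typos))
print(by_formula)
```

All three lines should print the same value, the square root of the var(typos) result shown earlier.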

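2 Entering data (tables and charts)

Univariate data

Univariate data comes in three types: categorical, discrete, and continuous.

Common tools are tables (table), bar charts (barplot), and pie charts (pie). First, tables
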

x=c("Yes","No","No","Yes","Yes")
table(x)
x
 No Yes
  2   3

factor(x)
[1] Yes No No Yes Yes
Levels: No Yes

Bar charts

beer = scan()
1: 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
26:
Read 25 items
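barplot(beer) # this isn't correct
barplot(table(beer)) # yes, call with summarized data; y-axis shows counts
barplot(table(beer)/length(beer)) # divide by n for proportions; y-axis shows proportions

The code above first uses scan: paste the 25 values directly, and when the 26: prompt appears on the next line, press Enter; R then reports that it read 25 items. Next, barplot must be given data that has already been summarized with table. The resulting plot appears in the Plots pane at the lower right.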
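
Because scan waits for interactive input, a script-friendly variant can enter the same 25 values with c. The sketch below only computes what barplot would draw; the names counts and props are introduced here for illustration:

```r
beer <- c(3, 4, 1, 1, 3, 4, 3, 3, 1, 3, 2, 1, 2, 1, 2, 3, 2, 3, 1, 1, 1, 1, 4, 3, 1)

counts <- table(beer)           # how often each category occurs
props  <- counts / length(beer) # the same table as proportions

print(counts)
print(props)
```

Passing counts or props to barplot gives the second and third charts from the text.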

Pie charts

beer.counts = table(beer) # store the table result, so the data type needs no further thought
pie(beer.counts) # first pie, kind of dull (plot 1)
names(beer.counts) = c("domestic\n can","Domestic\n bottle",
"Microbrew","Import") # give names
pie(beer.counts) # prints out names (plot 2)
pie(beer.counts,col=c("purple","green2","cyan","white")) # plot 3

The \n in the names inserts a line break.
names is the naming function: it attaches a name to each element of a vector (the logic is name = value).

There are also stem-and-leaf plots, although they are used less often.

scores = scan()
1: 2 3 16 23 14 12 4 13 2 0 0 0 6 28 31 14 4 8 2 5
21:
Read 20 items

stem(scores)
The decimal point is 1 digit(s) to the right of the |
0 | 000222344568
1 | 23446
2 | 38
3 | 1

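The digit before the bar is the tens digit and the digits after it are the units. stem(scores,scale=2) splits each stem 0-9 into 0-4 and 5-9.
When you come across something you don't know, type ? followed by the name you want to look up, or use help or apropos.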

sals = c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25) # enter data

cats = cut(sals,breaks=c(0,1,5,max(sals))) # specify the breaks
cats # view the values
[1] (5,50] (0,1] (1,5] (1,5] (5,50] (5,50] (1,5] (0,1] (1,5] (0,1]
Levels: (0,1] (1,5] (5,50]
table(cats) # organize
cats
 (0,1]  (1,5] (5,50]
     3      4      3
levels(cats) = c("poor","rich","rolling in it") # change labels
table(cats)
cats
         poor          rich rolling in it
            3             4             3

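Intervals are created with the cut function: breaks=c(0,1,5,max(sals)) lists every cut point, including both ends, and the result is assigned to cats. cats records, for each element, the interval it falls into, and the set of intervals is stored as its levels. The renaming step that follows is easy to understand once the earlier table examples are clear.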
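
As an alternative to renaming the levels afterwards, cut also accepts a labels argument. The sketch below uses the same data and additionally checks which side of each interval is closed:

```r
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, .25)

# Assign the labels in the same call instead of changing levels() later
cats <- cut(sals, breaks = c(0, 1, 5, max(sals)),
            labels = c("poor", "rich", "rolling in it"))
print(table(cats))

# Intervals are right-closed by default: 1 falls in (0,1] and 5 falls in (1,5]
print(as.character(cats[sals == 1]))
print(as.character(cats[sals == 5]))
```

Note that a value exactly equal to the lowest break (0 here) would become NA unless include.lowest = TRUE is passed.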

Histograms

x=scan()
1: 29.6 28.2 19.6 13.7 13.0 7.8 3.4 2.0 1.9 1.0 0.7 0.4 0.4 0.3 0.3
16: 0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1
27:
Read 26 items
hist(x) # frequencies (counts)
hist(x,probability=TRUE) # proportions (or probabilities)
rug(jitter(x)) # add tick marks

hist(x,breaks=10)
hist(x,breaks=c(0,1,2,3,4,5,10,20,max(x)))

Note: when breaks is a single number, it is only a suggested number of bins, and R may not use exactly that many.
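
Which breaks R actually chose can be inspected without drawing anything by calling hist with plot = FALSE. This is a sketch using the same data; h and h2 are names introduced for illustration:

```r
x <- c(29.6, 28.2, 19.6, 13.7, 13.0, 7.8, 3.4, 2.0, 1.9, 1.0, 0.7, 0.4, 0.4,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1)

# breaks = 10 is only a suggestion; look at what R really used
h <- hist(x, breaks = 10, plot = FALSE)
print(h$breaks)
print(length(h$breaks) - 1)  # actual number of bins

# An explicit vector of break points is used exactly as given
h2 <- hist(x, breaks = c(0, 1, 2, 3, 4, 5, 10, 20, max(x)), plot = FALSE)
print(h2$counts)
```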

Box plots

An example

library("Simple") # read in library for these notes
data(movies) # read in data set for gross.
names(movies)
[1] "title" "current" "previous" "gross"
attach(movies) # to access the names above
boxplot(current,main="current receipts",horizontal=TRUE)
boxplot(gross,main="gross receipts",horizontal=TRUE)
detach(movies)

main= sets the title, and horizontal=TRUE draws the box plot horizontally. attach(movies) makes the columns of movies available by name, so the lines that follow do not need the movies$ prefix; detach(movies) at the end undoes this.
Output of boxplot(current,main="current receipts",horizontal=TRUE): figure not preserved.

Density plots

data(faithful)
attach(faithful)
hist(eruptions,15,prob=T)
lines(density(eruptions))
lines(density(eruptions,bw="SJ"),col="red")

Bivariate data

table

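smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
amount = c(1,2,2,3,3,1,2,1,3,2)
table(smokes,amount)
      amount
smokes 1 2 3
     N 0 2 2
     Y 3 2 1

The table above is easy to read: each cell counts the observations with that combination of smokes and amount.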

tmp=table(smokes,amount) # store the table in tmp
old.digits = options("digits") # store the current digits setting (returned as a list)
options(digits=3) # print 3 significant digits
prop.table(tmp,1) # each row sums to 1
      amount
smokes   1     2     3
     N 0.0 0.500 0.500
     Y 0.5 0.333 0.167
prop.table(tmp,2) # each column sums to 1
      amount
smokes 1   2     3
     N 0 0.5 0.667
     Y 1 0.5 0.333
prop.table(tmp) # the whole table sums to 1
      amount
smokes   1   2   3
     N 0.0 0.2 0.2
     Y 0.3 0.2 0.1
options(old.digits) # restore the saved digits setting

Setting the number of digits: options(digits=n) controls how many significant digits are printed. The default is 7, and any value from 1 to 22 is allowed.
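
The save-and-restore pattern for digits can be sketched on its own. It relies on the fact that options returns the previous values as a list, which can be passed straight back; the name old is introduced here for illustration:

```r
old <- options(digits = 3)  # set a new value; the previous one is returned as a list

print(pi)        # shown with 3 significant digits
print(100 * pi)  # still 3 significant digits, not 3 decimal places

options(old)     # pass the saved list back to restore the earlier setting
print(getOption("digits"))
```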

barplot

barplot(table(smokes,amount))
barplot(table(amount,smokes))
smokes=factor(smokes) # for names
barplot(table(smokes,amount),beside=TRUE, legend.text=T)
barplot(table(amount,smokes),main="table(amount,smokes)",beside=TRUE,legend.text=c("less than 5","5-10","more than 10")) # a single statement

boxplot

x = c(5, 5, 5, 13, 7, 11, 11, 9, 8, 9)
y = c(11, 8, 4, 5, 9, 5, 10, 5, 4, 10)
boxplot(x,y)

amount = scan()
1: 5 5 5 13 7 11 11 9 8 9 11 8 4 5 9 5 10 5 4 10
21:
Read 20 items
category = scan()
1: 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
21:
Read 20 items
boxplot(amount ~ category)

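Compare the two uses of boxplot above: boxplot(x,y) draws the two vectors side by side for comparison.
In the formula form amount ~ category, the variable after ~ defines the groups along the x-axis.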
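
That the two calls describe the same grouping can be verified without drawing. The sketch below rebuilds amount and category from x and y with c and rep instead of scan, which is a substitution made for this example, and compares the box-plot statistics of both forms:

```r
x <- c(5, 5, 5, 13, 7, 11, 11, 9, 8, 9)
y <- c(11, 8, 4, 5, 9, 5, 10, 5, 4, 10)
amount   <- c(x, y)                 # all 20 values in one vector
category <- rep(c(1, 2), each = 10) # group label for each value

# The formula amount ~ category splits amount by category, like split() does
groups <- split(amount, category)
print(groups)

# plot = FALSE returns the statistics without drawing
stats_two_vectors <- boxplot(x, y, plot = FALSE)$stats
stats_formula     <- boxplot(amount ~ category, plot = FALSE)$stats
print(stats_two_vectors)
print(all.equal(stats_two_vectors, stats_formula, check.attributes = FALSE))
```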

Scatter plots

data(home);attach(home)
plot(old,new)
detach(home)

plot(density)

plot(density(bmi,na.rm=TRUE),lwd=4,col="blue") # lwd is the line width
plot(sort(bmi),pch=".",cex=4,col="green") # pch is the plotting symbol, cex is its size

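By analogy with boxplot,
plot can also draw two variables: plot(a~b,pima) puts column b of pima on the x-axis and produces a scatter plot.

The key point is that the kind of data you pass in determines the kind of plot you get.
How should this be understood?
If b above is a factor with levels such as negative and positive, plot draws one box plot of a per level instead of a scatter plot.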

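Supplement

pairs

If a data set consists of x1, x2, and x3, pairs(data) draws a matrix of scatter plots (figure not preserved).

The main diagonal holds the three variables, and the panel lying between two variables plots one against the other, so panels that mirror each other across the main diagonal have their x and y axes swapped.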

dataframe

time = 1:3
value1 = c(1, 2, 2)
value2 = c(2, 0, 2)

data = data.frame(time, value1, value2)
data
  time value1 value2
1    1      1      2
2    2      2      0
3    3      2      2

$
Extracts one column from a table (data frame) as a vector, for example data$value1.
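
A short sketch of column extraction on the data frame built above. The [[ ]] form is shown as an equivalent alternative, and the names col_dollar and col_bracket are introduced for illustration:

```r
time   <- 1:3
value1 <- c(1, 2, 2)
value2 <- c(2, 0, 2)
data   <- data.frame(time, value1, value2)

col_dollar  <- data$value1       # extract a column by name with $
col_bracket <- data[["value2"]]  # [[ ]] does the same with a quoted name

print(col_dollar)
print(is.vector(col_dollar))     # the result is a plain vector, not a data frame
print(mean(data$value2))         # so it can be passed to functions like mean
```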

x <- c(1,2,NA,3)

mean(x)
# returns NA

mean(x, na.rm=TRUE)
# returns 2

na.rm=TRUE drops missing values (NA) before computing; it is used very often.

Some important functions

1 Rolling dice: the sample function

Although Einstein said that god does not play dice, R can. -A Stanford Professor

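sample(1:6,10,replace=T) # replace=T or replace=TRUE samples with replacement; omitting it or writing replace=FALSE samples without replacement
[1] 6 4 4 3 5 2 3 3 5 4
The above is method 1 (extremely important).

Method 2 follows.
RollDie = function(n) sample(1:6,n,replace=T) # a function of n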
RollDie(5)
[1] 3 6 1 2 2
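
Since sample is random, the outputs above will differ from run to run. The sketch below shows one common way to make rolls reproducible with set.seed (the seed value 42 is arbitrary), and what happens when sampling without replacement asks for more values than exist:

```r
RollDie <- function(n) sample(1:6, n, replace = TRUE)

# Resetting the same seed reproduces the same sequence of rolls
set.seed(42)
first <- RollDie(10)
set.seed(42)
second <- RollDie(10)
print(identical(first, second))

# Without replacement we cannot draw 10 values from only 6: R raises an error,
# which tryCatch turns into TRUE here
too_many <- tryCatch({ sample(1:6, 10); FALSE }, error = function(e) TRUE)
print(too_many)

# Drawing all 6 without replacement is just a random permutation of 1:6
print(sort(sample(1:6, 6)))
```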