Some commonly used R functions (a partial summary based on Shangxuetang course material)

时间之龙

Last modified 2022-08-28 22:59:22

Tags: R, programming languages

First published 2022-08-07 16:32:07

Article link: https://blog.csdn.net/qq_64265869/article/details/126212010

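mode(x)
Show the storage mode (type) of x; use class(x) for the class of a data structure

na.omit(df)
Remove every row of df that contains an NA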

sum(x, na.rm=TRUE)
Sum x while ignoring any NA values in it
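
As a small illustration of the two NA helpers above, here is a sketch on a made-up data frame (the column names id and score are hypothetical, chosen only for this example):

```r
# Hypothetical data frame with one missing value (invented for illustration)
df <- data.frame(id = 1:4, score = c(90, NA, 75, 60))

clean <- na.omit(df)                  # drops row 2, the only row containing an NA
total <- sum(df$score, na.rm = TRUE)  # 90 + 75 + 60, the NA is skipped

print(nrow(clean))  # 3
print(total)        # 225
```

Without na.rm=TRUE, sum(df$score) would return NA.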

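as.xxxx(x) -> as.Date(df$date, "%Y-%m-%d")
Convert x to another type. The same pattern works for:
numeric, character, vector, matrix, data.frame, factor, logical

is.xxxx(x) -> is.Date(df$date)
Test whether x is of the given type (returns TRUE or FALSE). The same pattern works for:
numeric, character, vector, matrix, data.frame, factor, logical, na
Note: is.Date() is not part of base R; it comes from the lubridate package. In base R use inherits(df$date, "Date").

Sys.Date()
Get the current date, "xxxx-xx-xx"

date()
Get the current date and time as a string, "Sun Aug x xx:xx:xx xxxx"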

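format(Sys.Date(), format="%B %d %Y")
Format the date for output, e.g. "August 02 xxxx" (the month name follows the system locale; under a Chinese locale it prints as "x月 02 xxxx")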
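
The date helpers can be tried on a fixed date so the result does not depend on when the code is run. This is a minimal base-R sketch; the two dates are arbitrary examples, and the numeric format string avoids locale-dependent month names:

```r
# Parse a string into a Date using an explicit format
d <- as.Date("2022-08-07", "%Y-%m-%d")

print(class(d))               # "Date"
print(inherits(d, "Date"))    # TRUE; base-R way to test for a Date
print(format(d, "%d/%m/%Y"))  # "07/08/2022"

# Subtracting two Dates gives a difference in days
gap <- as.numeric(as.Date("2022-08-28") - d)
print(gap)                    # 21
```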

order: sorting
df <- df[ order(df$grade, -df$age) , ] --> sort the rows of df by grade in ascending order, breaking ties by age in descending order (the sqldf package can also sort)
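
A quick way to see the tie-breaking is to run it on a tiny made-up table (the name, grade and age values below are invented for the example):

```r
# Hypothetical data: two rows per grade so the tie-break on age is visible
df <- data.frame(name  = c("a", "b", "c", "d"),
                 grade = c(2, 1, 2, 1),
                 age   = c(20, 22, 25, 19))

# grade ascending; within equal grade, age descending (hence the minus sign)
sorted <- df[order(df$grade, -df$age), ]
print(sorted$name)  # "b" "d" "c" "a"
```

The minus sign only works for numeric columns; for other column types, order(..., decreasing = TRUE) or rev() is the usual alternative.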

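sample: random sampling
sample(df, size, replace=FALSE) -> draw size columns from df without replacement
sample(x, size, replace=FALSE) -> draw size elements from the vector or list x without replacement
df[sample(nrow(df), size, replace=FALSE) , ] -> draw size rows from df without replacement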
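
Row sampling can be sketched as follows, assuming a made-up data frame of 10 rows; nrow(df) supplies the range of valid row indices (R indices start at 1), and set.seed makes the draw repeatable:

```r
set.seed(42)  # fix the seed so the same rows are drawn on every run
df <- data.frame(id = 1:10, value = (1:10)^2)  # hypothetical data

# Draw 3 distinct row indices from 1..nrow(df), then subset the rows
picked <- df[sample(nrow(df), 3, replace = FALSE), ]
print(picked)

# Resetting the same seed reproduces the same draw
set.seed(42)
again <- df[sample(nrow(df), 3, replace = FALSE), ]
print(identical(picked, again))  # TRUE
```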

ceiling(x)
Round up to the nearest integer

floor(x)
Round down to the nearest integer

round(x)
Round to the nearest integer; exact halves go to the even neighbour, so round(2.5) is 2
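
The three rounding functions differ most visibly on negative numbers and on exact halves. A short sketch with arbitrary test values:

```r
x <- c(-1.5, 0.5, 1.5, 2.5, 2.6)

print(ceiling(x))  # -1  1  2  3  3
print(floor(x))    # -2  0  1  2  2
print(round(x))    # -2  0  2  2  3   (exact halves go to the even integer)

# round() also takes a number of decimal places
print(round(3.14159, 2))  # 3.14
```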

mean(x)
Mean of the numeric vector x

length(list)
Number of elements in list

sd(x)
Standard deviation

var(x)
Variance

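scale(x)
Standardize x to z-scores: subtract the mean and divide by the standard deviation (this is not min-max scaling)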
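
To make the difference between the two kinds of scaling concrete, here is a sketch on an arbitrary vector. The min-max formula is written out by hand, since scale() does not do it with its default arguments:

```r
x <- c(2, 4, 6, 8)

# z-score: what scale() does by default, (x - mean(x)) / sd(x)
z <- as.vector(scale(x))
print(round(z, 4))

# min-max scaling to [0, 1], written out explicitly
minmax <- (x - min(x)) / (max(x) - min(x))
print(minmax)
```

After z-scoring, the vector has mean 0 and standard deviation 1; after min-max scaling it runs from 0 to 1.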

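pretty(c(-3, 3), 30)
Return about 30 evenly spaced "round" values covering [-3, 3]

Probability functions (from Shangxuetang)

set.seed(x): set the random number seed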

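String handling

nchar(str)
Number of characters in str

substr(str, begin_index, end_index)
Extract the substring at positions [begin_index, end_index] (1-based, both ends included)

grep(str, list)
Search for the pattern str in list; returns the indices of the elements that match

sub(strA, strB, str)
Replace the first match of strA in str with strB; use gsub(strA, strB, str) to replace every match

strsplit(str, strA)
Split the string str at each occurrence of strA -> strsplit(str,"") splits it into single characters

paste(strA, strB)
Join strA and strB, separated by a space by default; use paste0(strA, strB) or sep="" to append strB directly to the end of strA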

toupper(str)
tolower(str)
Convert to upper or lower case
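
The string functions in this section can be checked on a couple of literal strings. This sketch uses only base R; the sample strings are arbitrary:

```r
s <- "hello world"

print(nchar(s))                          # 11
print(substr(s, 1, 5))                   # "hello"
print(grep("o", c("foo", "bar", "bob"))) # 1 3  (indices of matching elements)
print(sub("o", "0", s))                  # "hell0 world"  (first match only)
print(gsub("o", "0", s))                 # "hell0 w0rld"  (every match)
print(strsplit("a,b,c", ",")[[1]])       # "a" "b" "c"
print(paste("a", "b"))                   # "a b"
print(paste0("a", "b"))                  # "ab"
print(toupper(s))                        # "HELLO WORLD"
```

Note that strsplit returns a list with one element per input string, which is why the example takes [[1]].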

Other

seq(a,b,s)
Generate the sequence that starts at a, steps by s, and does not exceed b

rep(list, n)
Repeat list n times

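apply(data, margin, function)
Apply the function over one margin (1 = rows, 2 = columns) of the data (a matrix, or a data.frame, which is converted to a matrix first); for a list use lapply or sapply instead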
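
A short sketch of seq, rep and apply on small literal inputs, so the results can be checked by hand:

```r
print(seq(1, 10, 3))    # 1 4 7 10
print(rep(c(1, 2), 3))  # 1 2 1 2 1 2

# A 2 x 3 matrix filled column by column:
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
m <- matrix(1:6, nrow = 2)

row_sums <- apply(m, 1, sum)  # margin 1 = over rows
col_sums <- apply(m, 2, sum)  # margin 2 = over columns
print(row_sums)  # 9 12
print(col_sums)  # 3 7 11
```

For plain sums and means, rowSums, colSums, rowMeans and colMeans do the same job and are faster; apply is useful when the function is arbitrary.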

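for( i in 1:5) {xxxx}
for loop

while(condition) {}
while loop

if(condition) {xxx} else {xxx}
if/else; keep else on the same line as the closing brace, otherwise R reports a syntax error at top level

ifelse(condition, stateA, stateB)
The vectorized equivalent of condition?stateA:stateB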

switch( x, caseA=value, ...)
Return a different value depending on the value of x
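
The control-flow helpers can be sketched together as below. The function name describe and the labels are hypothetical, invented for this example; the unnamed last argument of switch acts as the default:

```r
# for loop: add up 1..5
total <- 0
for (i in 1:5) {
  total <- total + i
}
print(total)  # 15

# ifelse is vectorized: one result per element of the condition
x <- c(5, 12, 8)
label <- ifelse(x > 7, "big", "small")
print(label)  # "small" "big" "big"

# switch on a character value; the final unnamed value is the fallback
describe <- function(kind) {
  switch(kind, a = "apple", b = "banana", "unknown")
}
print(describe("a"))  # "apple"
print(describe("z"))  # "unknown"
```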

t(x)
Transpose x

summary(x)
Show summary statistics for x (for a data frame, column by column)

df[ !complete.cases(df) , ]
Show the rows of df that have at least one missing value
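
As an illustration, complete.cases returns one TRUE/FALSE per row, which is then negated to pick out the incomplete rows. The data frame below is made up for the example:

```r
# Hypothetical data frame: row 2 lacks a, row 3 lacks b
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

print(complete.cases(df))  # TRUE FALSE FALSE

incomplete <- df[!complete.cases(df), ]
print(incomplete)          # rows 2 and 3
```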

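manyNAs(df, fraction of NA attributes)
Return the row numbers of the rows of df with a large share of NA values, as a vector (from the DMwR package)

cor(df[, a:b] , use="complete.obs")
Compute the correlation coefficients between the selected columns of df, using only complete rows

symnum( cor(xxx))
Display the cor result as a matrix of symbols, which makes strong correlations easy to spot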

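Multivariate analysis in R

lm(col_a~col_b, data=df)
lm(col_a ~ ., data=x) fits a linear model of col_a against all the other columns
lm(col_a ~ col_b+col_c, data=x) fits a linear model of col_a against col_b and col_c
Note: the fitted model may contain variable names that were not in the data. These are auxiliary (dummy) variables created for the levels of a factor; variables that share a prefix are mutually exclusive.
Diagnostics to watch:
Adjusted R-squared: the closer it is to 1, the more of the variation in the response the model explains.
The output gives the linear coefficients of these two attributes, ordered from low to high by predictor coefficient
(from Shangxuetang)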

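knnImputation(df, k=int)
Fill the missing values in df from the k nearest neighbours; returns a data frame (from the DMwR package)

anova( resOf_lm )
Analysis of variance on a multiple linear regression result, used to decide which variables to eliminate
What to look for:
Sum Sq: how much the variable contributes to reducing the model's fitting error; the smaller it is, the weaker the case for keeping the variable in the analysis
anova( resOf_lm1, resOf_lm2 )
Compare two models
Sum of Sq: the residual sum of squares of the first model minus that of the second; a negative value means the second model has the larger residual error

update( resOf_lm , .~. -a)
Refit the lm result with attribute a removed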

step( resOf_lm )
Eliminate variables from the lm result automatically, step by step, keeping the model with the lowest AIC it finds
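
The elimination workflow (anova, update, step) can be sketched on the built-in mtcars dataset. The choice of mtcars and of the predictors wt, hp and qsec is an assumption made for this example only, not something from the course:

```r
# Full model on a built-in dataset (predictors chosen arbitrarily)
full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
print(anova(full))  # the Sum Sq column shows each term's contribution

# Drop one predictor and compare the two nested models
reduced <- update(full, . ~ . - qsec)
cmp <- anova(full, reduced)
print(cmp)

# Sum of Sq = RSS(first model) - RSS(second model); dropping a term
# can only raise the residual sum of squares, so it is <= 0 here
print(cmp[["Sum of Sq"]][2])

# Automatic stepwise elimination by AIC; trace = 0 silences the log
best <- step(full, trace = 0)
print(formula(best))
```

Whether the dropped term matters is judged from the Pr(>F) column of the comparison, not from the sign of Sum of Sq alone.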

fitted( resOf_lm )
Get the regression model's fitted values on the training data; same as resOf_lm$fit and as predict( resOf_lm )

predict( resOf_lm, xOf_df, interval="confidence" )
Use the model to predict from the predictor values in xOf_df, for example:
Model=lm(y~x,data=…)
predict(Model,newdata=data.frame(x=2),interval="confidence")
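
A runnable version of this pattern, under the assumption of a small made-up training set (y is roughly 2x + 1 with a little hand-written noise), might look like this:

```r
# Hypothetical training data: y = 2x + 1 plus small fixed perturbations
train <- data.frame(x = 1:5)
train$y <- 2 * train$x + 1 + c(0.1, -0.1, 0, 0.1, -0.1)

model <- lm(y ~ x, data = train)
print(coef(model))

# Prediction at x = 2 with a confidence interval:
# the result is a matrix with columns fit, lwr, upr
pred <- predict(model, newdata = data.frame(x = 2), interval = "confidence")
print(pred)

# Without newdata, predict() returns the fitted values
print(all.equal(fitted(model), predict(model)))  # TRUE
```

The column name in newdata must match the predictor name used in the formula, here x.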

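corr.test( df )
Compute the pairwise correlations between the columns of df together with their significance tests (from the psych package)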

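*Finally, although multiple linear regression is only linear, you can add powers of the columns to df as extra columns and so get the effect of multivariate polynomial regression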
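
The trick of adding powers as extra columns can be sketched as follows. The data are generated from an exact quadratic chosen for the example (y = 1 + 2x + 3x^2), and the column name x2 is hypothetical:

```r
# Made-up data that follow an exact quadratic
df <- data.frame(x = seq(-3, 3, by = 0.5))
df$y <- 1 + 2 * df$x + 3 * df$x^2

# Add the squared term as its own column, then fit an ordinary linear model
df$x2 <- df$x^2
fit <- lm(y ~ x + x2, data = df)

print(round(coef(fit), 6))  # intercept 1, x 2, x2 3
```

The model is still linear in its coefficients, which is why lm can fit it. Writing lm(y ~ x + I(x^2), data = df) gives the same fit without adding a column.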

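Principal component analysis (PCA) in R: prcomp()

prcomp(df, scale=FALSE) -> run PCA on the columns of df; with scale=FALSE the columns are centered but not scaled to unit variance (pass scale.=TRUE to standardize them first)

prcomp( ~ xx+yy , df ,scale=FALSE) -> run PCA on the xx and yy columns of df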

Note: it is best to normalize the columns of df before running PCA
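
Why normalization matters can be seen on made-up data in which one column has a much larger range than the others. The column names rank, score and rate below are hypothetical and the values are random:

```r
set.seed(1)
# Hypothetical data: rank spans 1..100, the other columns stay near 0..1
df <- data.frame(rank = 1:100, score = rnorm(100), rate = runif(100))

raw <- prcomp(df, scale. = FALSE)  # centered only
std <- prcomp(df, scale. = TRUE)   # centered and scaled to unit variance

# Share of total variance carried by each component
prop_raw <- raw$sdev^2 / sum(raw$sdev^2)
prop_std <- std$sdev^2 / sum(std$sdev^2)

print(round(prop_raw, 4))  # PC1 takes almost everything
print(round(prop_std, 4))  # variance is spread across components
```

Without scaling, the column with the largest variance dominates PC1 regardless of how informative it is.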

The result is a list-like object, explained below:

Standard deviations (1, .., p=13):
[1] 1.272379e+02 7.307220e-01 1.631651e-01 1.377232e-01 9.427311e-02 8.718071e-02
[7] 6.587986e-02 6.231248e-02 5.026467e-02 3.940015e-02 2.054129e-02 1.902779e-02
[13] 8.591091e-17

Rotation (n x k) = (13 x 13):
PC1 PC2 PC3 PC4
rank 0.9999788557 0.006358575 -0.0003632571 -0.0002588301
total_score -0.0061180870 0.945619472 0.0153983907 -0.0874790832
student_quality -0.0012430645 0.051322553 0.1806667064 0.2689579833
employment_rate -0.0003042158 0.014838783 0.3784966445 -0.8538773131
social_prestige -0.0001359347 0.062218087 0.0168377838 0.0200246079
scientific_scale -0.0006859586 0.158548431 0.0522595793 0.1732400062
scientific_quality -0.0010340273 0.064954615 -0.8866745835 -0.2771439131
top_achievements -0.0005109573 0.130393099 -0.0279210248 0.0881592765
top_talent -0.0003346369 0.115422120 0.0046016639 0.0825661656
science_and_technology_service -0.0004564151 0.115313790 0.1437741077 0.1354073802
achievement_transformation -0.0001568028 0.060661626 0.0661266562 0.0097506208
student_internationalization_rate -0.0002630827 0.024341676 0.0936509082 0.0448743781
class_vertify -0.0009929913 0.147604692 -0.0064200511 0.2205617246
PC5 PC6 PC7 PC8
rank 7.616306e-05 0.0006446037 -5.196425e-05 -5.761073e-05
total_score -8.427594e-02 0.0644729306 -8.367281e-03 2.049632e-02
student_quality -5.784077e-02 0.3277766120 2.443437e-01 -2.448156e-01
employment_rate 1.828609e-01 0.0175812655 1.698198e-02 -3.723634e-02
social_prestige -2.150878e-01 0.0065250559 2.803245e-01 7.484693e-02
scientific_scale 3.226188e-01 -0.0960507292 1.749448e-02 -2.365627e-01
scientific_quality -2.063775e-02 0.0225452924 -7.638285e-02 -6.683350e-02
top_achievements 5.324411e-02 -0.0182032612 1.816142e-01 -2.633010e-01
top_talent -5.554419e-02 0.0608997527 3.676247e-01 -1.820939e-01
science_and_technology_service -1.456663e-01 -0.5585237488 -5.876097e-01 -3.104935e-01
achievement_transformation -4.941258e-01 -0.3617470647 2.025722e-01 5.755558e-01
student_internationalization_rate -3.139874e-01 0.6568279398 -5.350708e-01 1.447587e-01
class_vertify 6.598902e-01 0.0068418161 -1.202598e-01 5.666715e-01
PC9 PC10 PC11 PC12
rank -0.001069498 0.0002411415 -8.164717e-05 -0.000147586
total_score -0.049947229 -0.0078832051 8.001965e-04 -0.015379701
student_quality -0.731560411 0.1735183625 -6.434065e-02 -0.095925363
employment_rate -0.085510363 0.0300660035 -2.066149e-02 0.003895324
social_prestige 0.074368377 -0.6961312017 3.596216e-01 -0.402696939
scientific_scale 0.393419355 0.4265456153 4.006264e-02 -0.592354479
scientific_quality -0.167064409 0.0571738703 -5.292646e-02 -0.065039131
top_achievements 0.185884692 0.1411059205 6.021060e-01 0.607156761
top_talent 0.345732534 -0.2034103661 -6.960707e-01 0.275653201
science_and_technology_service -0.172756085 -0.2076133552 -1.078509e-01 0.070174643
achievement_transformation -0.013863478 0.4003888997 -2.487435e-02 0.008694558
student_internationalization_rate 0.254738978 0.0493125875 1.299347e-02 0.023031044
class_vertify -0.133336419 -0.1788395415 -4.725893e-02 0.152030679
PC13
rank 2.017986e-18
total_score 2.886751e-01
student_quality -2.886751e-01
employment_rate -2.886751e-01
social_prestige -2.886751e-01
scientific_scale -2.886751e-01
scientific_quality -2.886751e-01
top_achievements -2.886751e-01
top_talent -2.886751e-01
science_and_technology_service -2.886751e-01
achievement_transformation -2.886751e-01
student_internationalization_rate -2.886751e-01
class_vertify -2.886751e-01

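In the result, the first part is the standard deviations of the principal components; their purpose: (omitted)

The second part is rotation, a rotation matrix. Each PCxx is one principal component, and its column holds the coefficient of each variable, for example PC1=0.999*rank+(-0.00612)*total_score+...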

> summary(prcomp(res2.order_rank.matrix))
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
Standard deviation 127.2 0.73072 0.1632 0.1377 0.09427 0.08718 0.06588 0.06231
Proportion of Variance 1.0 0.00003 0.0000 0.0000 0.00000 0.00000 0.00000 0.00000
Cumulative Proportion 1.0 1.00000 1.0000 1.0000 1.00000 1.00000 1.00000 1.00000
PC9 PC10 PC11 PC12 PC13
Standard deviation 0.05026 0.0394 0.02054 0.01903 8.591e-17
Proportion of Variance 0.00000 0.0000 0.00000 0.00000 0.000e+00
Cumulative Proportion 1.00000 1.0000 1.00000 1.00000 1.000e+00

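Standard deviation: as above

Proportion of Variance: the share of the total variance in the data that this principal component accounts for.

Cumulative Proportion: the cumsum of Proportion of Variance, i.e. the share of the total variance explained by this component together with all the previous ones.

What does a PCxx itself describe? -> A direction of variation in the data. In this example the columns were not scaled, and rank varies over a far larger range than the other columns (its standard deviation is about 127, against less than 1 for the rest), so PC1 alone accounts for essentially 100% of the variance. That is why the Cumulative Proportion of PC1 in the summary is already 100%.

What does each column of the rotation matrix say? -> It shows how PCxx is obtained, for example PC1=0.999*rank+(-0.00612)*total_score+... (computed on the centered columns). The coefficients also show which attribute or attributes influence PCxx and how strongly. In PC1 here the coefficient of rank is 0.999, so rank plays the decisive role in the value of PC1. Combined with PC1 explaining 100% of the variance, this means that nearly all the variation in df is variation in rank (and in fact the rows of df are sorted by rank).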
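
The relation between rotation and the component values can be verified directly. This sketch uses the built-in USArrests dataset as a stand-in, since the ranking data used above are not available:

```r
p <- prcomp(USArrests)  # unscaled PCA on a built-in dataset

# Each PC score is a weighted sum of the centered columns,
# with the weights taken from the matching column of rotation
centered <- scale(USArrests, center = TRUE, scale = FALSE)
scores <- centered %*% p$rotation

print(all.equal(unname(scores), unname(p$x)))  # TRUE

# Proportion of Variance is sdev^2 divided by the total of sdev^2
prop <- p$sdev^2 / sum(p$sdev^2)
print(round(prop, 5))
print(round(cumsum(prop), 5))  # the Cumulative Proportion row
```

The values printed by summary(p) are these same proportions, rounded.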

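Time series forecasting in R

A simple approach:

Use the auto.arima function to determine the orders
Build the corresponding model
Call forecast to get the predictions
Plot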

> library(tseries)
> library(forecast)
> air <- AirPassengers # built-in dataset

> sair<-ts(as.vector(air[1:132]),frequency=12,start=c(1949,1))

> tsdisplay(air)

> auto.arima(sair)
Series: sair
ARIMA(1,1,0)(0,1,0)[12]

Coefficients:
ar1
-0.2431
s.e. 0.0894

sigma^2 = 109.8: log likelihood = -447.95
AIC=899.9 AICc=900.01 BIC=905.46
> fit2<-arima(sair,order=c(1,1,0),seasonal=list(order=c(0,1,0),period=12))
> f.p2<-forecast(fit2,h=12,level=c(99.5))
> plot(f.p2,ylim=c(100,700))
> lines(f.p2$fitted,col="green")
> lines(air,col="red")

Another way is just as simple: call forecast(ts object) directly, then plot the resulting data

> library(tseries)
> library(forecast)
> air <- AirPassengers # built-in dataset

> sair<-ts(as.vector(air[1:132]),frequency=12,start=c(1949,1))

> forecast(sair)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
Jan 1960 411.9115 391.9778 431.8452 381.4255 442.3974
Feb 1960 406.9694 382.0344 431.9043 368.8346 445.1041
Mar 1960 467.3486 433.5448 501.1524 415.6501 519.0470
Apr 1960 450.7386 413.6375 487.8397 393.9974 507.4798
May 1960 451.5327 410.1879 492.8776 388.3012 514.7642
Jun 1960 513.2314 461.7576 564.7052 434.5091 591.9538
Jul 1960 569.8868 507.9846 631.7890 475.2156 664.5580
Aug 1960 567.5873 501.3865 633.7882 466.3419 668.8328
Sep 1960 496.1471 434.4291 557.8652 401.7575 590.5368
Oct 1960 432.2153 375.1873 489.2432 344.9985 519.4320
Nov 1960 376.6263 324.1565 429.0961 296.3807 456.8719
Dec 1960 424.7869 362.5403 487.0334 329.5889 519.9848
Jan 1961 431.0659 364.8422 497.2895 329.7855 532.3462
Feb 1961 425.4357 357.1091 493.7623 320.9392 529.9322
Mar 1961 488.0432 406.3061 569.7804 363.0371 613.0494
Apr 1961 470.2184 388.2757 552.1611 344.8978 595.5390
May 1961 470.5801 385.4188 555.7414 340.3371 600.8231
Jun 1961 534.3657 434.1164 634.6150 381.0476 687.6838
Jul 1961 592.7970 477.6925 707.9016 416.7599 768.8342
Aug 1961 589.8656 471.4917 708.2395 408.8284 770.9028
Sep 1961 515.1625 408.4559 621.8690 351.9689 678.3560
Oct 1961 448.3914 352.6449 544.1380 301.9597 594.8231
Nov 1961 390.3922 304.5497 476.2347 259.1074 521.6770
Dec 1961 439.9511 340.4348 539.4673 287.7541 592.1481
> plot(forecast(sair),ylim=c(100,700))
> lines(forecast(sair)$fitted,col="green")
> lines(air,col="red")
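
For readers without the forecast package, the first approach can be approximated with base R alone. This is a sketch, not the method used above: it swaps forecast() for stats::predict(), reuses the ARIMA(1,1,0)(0,1,0)[12] orders that auto.arima reported, and adds a simple error measure (MAPE) on the held-out year 1960 that the original post does not compute:

```r
air <- AirPassengers                     # built-in dataset
train <- window(air, end = c(1959, 12))  # first 132 months, as in sair

# Same orders as reported by auto.arima above
fit <- arima(train, order = c(1, 1, 0),
             seasonal = list(order = c(0, 1, 0), period = 12))

# Predict the 12 held-out months; $pred holds the point forecasts
fc <- predict(fit, n.ahead = 12)
actual <- window(air, start = c(1960, 1))

# Mean absolute percentage error against the real 1960 values
mape <- mean(abs(fc$pred - actual) / actual) * 100
print(round(mape, 2))

# Plot the actual series with the forecast overlaid
plot(air, ylim = c(100, 700))
lines(fc$pred, col = "blue")
```

Unlike forecast(), predict() returns standard errors in fc$se rather than ready-made intervals, so any bands have to be built by hand.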