The randomForest function
The documentation for randomForest as shown in RStudio:
randomForest {randomForest} R Documentation
Classification and Regression with Random Forest
Description
randomForest implements Breiman's random forest algorithm (based on Breiman and Cutler's original Fortran code) for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
Usage
## S3 method for class 'formula'
randomForest(formula, data=NULL, ..., subset, na.action=na.fail)
## Default S3 method:
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
             max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
## S3 method for class 'randomForest'
print(x, ...)
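As a minimal sketch of the two calling conventions listed above, the same classification forest can be fitted through the formula method and through the default (x, y) method. The built-in iris data set, the seed, and the object names fit_formula and fit_default are assumptions chosen for illustration; they are not part of the documentation.

```r
library(randomForest)

set.seed(42)  # arbitrary seed, only to make the run repeatable

# Formula interface: response on the left, "." means all other columns
fit_formula <- randomForest(Species ~ ., data = iris, ntree = 500)

# Default interface: predictors and response passed separately
fit_default <- randomForest(x = iris[, 1:4], y = iris$Species, ntree = 500)

# print() dispatches to the 'randomForest' method from the Usage section
print(fit_formula)

# The fitted object records the settings that were actually used
fit_default$type   # "classification", because y is a factor
fit_default$mtry   # default for 4 predictors: floor(sqrt(4))
fit_default$ntree
```

Because y is a factor, both calls run in classification mode; passing a numeric response instead would switch to regression.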
Arguments
data
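an optional data frame containing the variables in the model. By default the variables are taken from the environment which randomForest is called from.
subset
an index vector indicating which rows should be used. (NOTE: If given, this argument must be named.)
na.action
A function to specify the action to be taken if NAs are found. (NOTE: If given, this argument must be named.)
The default is na.action = na.fail, which stops with an error if the data contain any NA. If the data set has no missing values, this argument does not need to be changed.
In practice, if the data set does have missing values, set na.action = na.omit (drop the rows that contain NA) or na.action = na.roughfix (a rough fill: the column median for numeric variables, the most frequent level for factors).
rfImpute() can produce better imputed values, because it refines the rough fill using the proximity matrix of the forest.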
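The na.action options above can be compared side by side. This is a sketch under stated assumptions: the missing values are injected artificially into a copy of iris, and the names d, res, fit_omit, fit_fix, d_imp and fit_imp are made up for the example.

```r
library(randomForest)

set.seed(1)
d <- iris
# Assumption for illustration: blank out 10 values of one predictor
d[sample(nrow(d), 10), "Sepal.Length"] <- NA

# Default na.action = na.fail: the call stops with an error
res <- try(randomForest(Species ~ ., data = d), silent = TRUE)
inherits(res, "try-error")

# na.omit: rows containing NA are dropped before fitting
fit_omit <- randomForest(Species ~ ., data = d, na.action = na.omit)

# na.roughfix: NAs are filled with the column median (numeric) or mode (factor)
fit_fix <- randomForest(Species ~ ., data = d, na.action = na.roughfix)

# rfImpute: proximity-based imputation; it returns a completed data frame
# with the response in the first column, which is then used for fitting
d_imp <- rfImpute(Species ~ ., data = d)
fit_imp <- randomForest(Species ~ ., data = d_imp)
```

Note that rfImpute() needs the response to be complete; only missing predictor values are imputed.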
x, formula
a data frame or a matrix of predictors, or a formula describing the model to be fitted (for the print method, a randomForest object).
y
A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode.
xtest
a data frame or matrix (like x) containing predictors for the test set.
ytest
response for the test set.
ntree
Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
Note that ntree is the total number of trees, i.e. the value of B. The default is 500 in R; in Python (scikit-learn's n_estimators) the default is 100.
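One way to see why ntree should not be too small is to look at oob.times, the component of the fitted object that counts how often each row was out-of-bag and therefore received an OOB prediction. The sketch below assumes the iris data and an arbitrary seed; the names small and large are illustrative.

```r
library(randomForest)

set.seed(123)  # arbitrary seed for repeatability

small <- randomForest(Species ~ ., data = iris, ntree = 5)
large <- randomForest(Species ~ ., data = iris, ntree = 500)

# Rows that were never out-of-bag get no OOB prediction at all
sum(small$oob.times == 0)

# With the default 500 trees, every row is out-of-bag many times
min(large$oob.times)
```

With bootstrap sampling each row is left out of roughly a third of the trees, so a forest of only a few trees can leave some rows without any OOB prediction.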
mtry
Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3).
The mtry here is the RSF size, i.e. the value of M.
With p = ncol(x) predictors, the default for a categorical Y is RSF size = floor(sqrt(p)); for a continuous Y it is RSF size = max(floor(p/3), 1).
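The default can be reproduced in base R directly from the expression in the Usage section. default_mtry is a hypothetical helper written for this note, not a function of the randomForest package.

```r
# Hypothetical helper (not part of randomForest): mirrors the default
# mtry expression shown in the Usage section
default_mtry <- function(x, y = NULL) {
  p <- ncol(x)
  if (!is.null(y) && !is.factor(y)) {
    max(floor(p / 3), 1)   # regression: numeric response
  } else {
    floor(sqrt(p))         # classification, or unsupervised when y is NULL
  }
}

default_mtry(iris[, 1:4], iris$Species)   # 4 predictors, factor response: 2
default_mtry(mtcars[, -1], mtcars$mpg)    # 10 predictors, numeric response: 3
```

The max(..., 1) guard matters only for regression with fewer than three predictors, where floor(p/3) would otherwise be 0.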
replace
Should sampling of cases be done with or