Random Forest
(1). Review of random forest
The random forest algorithm is a classifier based on two primary methods: bagging and the random subspace method.
First, a review of the two main ensemble algorithms, bagging and boosting:
- bagging builds many approximately unbiased models using bootstrap samples and the same features; averaging or voting over these models gives the final prediction. It is used to reduce variance.
- boosting starts with a weak learner and gradually improves it by refitting the data, giving higher weights to the misclassified samples. The final classifier is built by weighted voting.
Random forest is a substantial modification of bagging applied to CART trees. It adds another source of randomness compared with bagging, which only samples cases with replacement (bootstrap): when splitting each node of each CART tree, random forest also samples a subset of features without replacement and tests only this subset to find the best-performing feature to split the data at this node.
As we all know, averaging B i.i.d. random variables, each with variance $\sigma^2$, gives a variance of $\sigma^2/B$. But if these B variables are only identically distributed and not independent (as in bagging), with positive pairwise correlation $\rho$, the variance of the average is
$$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2.$$
proof:
$$\mathrm{Var}\Big(\frac{1}{B}\sum_{i=1}^{B}x_i\Big) = \frac{1}{B^2}\Big(\sum_{i=1}^{B}\mathrm{Var}(x_i) + \sum_{i\neq j}\mathrm{Cov}(x_i,x_j)\Big) = \frac{1}{B^2}\big(B\sigma^2 + B(B-1)\rho\sigma^2\big) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$
($\rho$ is the correlation of each pair of bootstrapped trees), so as B increases, the variance of the average decreases toward $\rho\sigma^2$. As seen, if we can reduce the correlation $\rho$ without increasing $\sigma^2$ too much, we can reduce the variance of the average. Random forest achieves this idea by random selection of input features: before each split, select $m < p$ of the input variables as candidates for splitting; for regression, $m$ is usually set to $\lfloor p/3 \rfloor$, and for classification, $m = \lfloor\sqrt{p}\rfloor$.
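This variance formula is easy to verify empirically. Below is a minimal Python sketch (the constants B, rho, and the number of replications are arbitrary choices of mine, not from the text):

```python
import numpy as np

# Empirical check of Var(average) = rho*sigma^2 + (1 - rho)/B * sigma^2
# for B identically distributed variables with pairwise correlation rho.
rng = np.random.default_rng(0)
B, rho, reps = 20, 0.3, 200_000

z = rng.normal(size=(reps, 1))                 # shared component
e = rng.normal(size=(reps, B))                 # independent components
x = np.sqrt(rho) * z + np.sqrt(1 - rho) * e    # Var(x_i) = 1, Corr(x_i, x_j) = rho

empirical = x.mean(axis=1).var()               # variance of the average
theory = rho + (1 - rho) / B                   # the formula, with sigma^2 = 1
```

With sigma^2 = 1, the theoretical value is 0.3 + 0.7/20 = 0.335, and the simulated variance matches it closely.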
So now we can state that:
- bootstrap sampling and averaging are used to reduce the variance of the ensemble.
- random selection of a feature subset is used to reduce the correlation between each pair of trees.
pseudo-code for random forest:
- For b = 1 to B:
(a) Draw a bootstrap sample $Z^*$ of size N from the training data.
(b) Grow a random-forest tree $T_b$ on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree until the minimum node size $n_{min}$ is reached:
- Select m variables at random from the p variables.
- Pick the best variable/split-point among the m.
- Split the node into two daughter nodes.
- Output the ensemble of trees $\{T_b\}_{b=1}^{B}$.
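The loop above can be written as a short Python sketch (illustrative only: it borrows scikit-learn's DecisionTreeClassifier as the base tree, and the names fit_random_forest and predict_forest are mine, not a standard API):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def fit_random_forest(X, y, B=50, m=None, n_min=1, seed=0):
    """For b = 1..B: draw a bootstrap sample, grow a tree that tests only
    m random features at each split (max_features), stop at node size n_min."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    m = m if m is not None else max(1, int(np.sqrt(p)))  # classification default
    trees = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # bootstrap sample Z* of size N
        tree = DecisionTreeClassifier(max_features=m, min_samples_leaf=n_min,
                                      random_state=b)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    """Majority vote over the ensemble {T_b}."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)  # (B, n)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

X, y = load_iris(return_X_y=True)
forest = fit_random_forest(X, y)
train_acc = (predict_forest(forest, X) == y).mean()
```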
(2). Models random forest can be applied to
Notice that random forest performs well only with non-linear base models such as decision trees; it is not suitable for linear models.
The reason: bagging is an additive ensemble technique, and an average of linear models is itself linear. Fitting a linear model is a convex problem, so we can already find the best possible solution; since bagging again produces a linear model, it cannot beat that solution. Here we give an example using the sample mean, which is linear: suppose
$x_1, x_2, \ldots, x_N$ are i.i.d. with mean $\mu$ and variance $\sigma^2$. Let $\bar{x}^*_i$ be the $i$th bootstrap realization of the sample mean ($i \in 1{:}B$). The bagging model for the sample mean is
$$\bar{x}_{bag} = \frac{1}{B}\sum_{i=1}^{B}\bar{x}^*_i.$$
Here we declare that
$$\mathrm{Var}(\bar{x}^*_i) = \frac{2N-1}{N^2}\sigma^2, \qquad \mathrm{Cov}(\bar{x}^*_i,\bar{x}^*_j) = \frac{\sigma^2}{N}\ (i\neq j), \qquad \rho = \frac{N}{2N-1}.$$
proof: conditioning on the training data $x$, $E[\bar{x}^*_i \mid x] = \bar{x}$ and $\mathrm{Var}(\bar{x}^*_i \mid x) = \hat{\sigma}^2/N$, where $\hat{\sigma}^2 = \frac{1}{N}\sum_{j=1}^{N}(x_j-\bar{x})^2$ has $E[\hat{\sigma}^2] = \frac{N-1}{N}\sigma^2$. Then
$$\mathrm{Var}(\bar{x}^*_i) = \mathrm{Var}\big(E[\bar{x}^*_i \mid x]\big) + E\big[\mathrm{Var}(\bar{x}^*_i \mid x)\big] = \frac{\sigma^2}{N} + \frac{(N-1)\sigma^2}{N^2} = \frac{2N-1}{N^2}\sigma^2,$$
and since the $\bar{x}^*_i$ are conditionally independent given the data,
$$\mathrm{Cov}(\bar{x}^*_i,\bar{x}^*_j) = \mathrm{Var}\big(E[\bar{x}^*_i \mid x]\big) = \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{N}.$$
Then using the formula $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$ (with $\sigma^2$ replaced by $\mathrm{Var}(\bar{x}^*_i)$), we get the variance of bagging:
$$\mathrm{Var}(\bar{x}_{bag}) = \frac{\sigma^2}{N} + \frac{1}{B}\cdot\frac{(N-1)\sigma^2}{N^2} \;\ge\; \frac{\sigma^2}{N} = \mathrm{Var}(\bar{x}),$$
which says bagging of a linear model cannot reduce variance.
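A Monte Carlo sketch of this claim (Python; N, B, and the replication count are arbitrary illustration values):

```python
import numpy as np

# Check that bagging the sample mean (a linear statistic) does not reduce
# its variance below Var(x_bar) = sigma^2 / N; it is in fact slightly larger.
rng = np.random.default_rng(0)
N, B, reps = 50, 100, 3000

plain, bagged = [], []
for _ in range(reps):
    x = rng.normal(0.0, 1.0, size=N)                 # sigma^2 = 1
    plain.append(x.mean())                           # ordinary sample mean
    boot = rng.choice(x, size=(B, N), replace=True)  # B bootstrap samples
    bagged.append(boot.mean(axis=1).mean())          # bagged sample mean

var_plain = np.var(plain)    # ~ sigma^2/N = 0.02
var_bagged = np.var(bagged)  # ~ sigma^2/N + (N-1)*sigma^2/(N^2 * B)
```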
(3). OOB (out-of-bag) error estimation
An important property of random forest is the use of out-of-bag samples: the samples not included in the bootstrap sample used to build a given tree. The OOB error is computed as follows:
For each observation $z_i = (x_i, y_i)$, construct its random forest predictor by averaging only those trees corresponding to bootstrap samples that do not contain $z_i$.
Formally, suppose B trees $\{T_1, T_2, \ldots, T_B\}$ have been built, and for each individual sample $x_i$ there exists a subset of trees $S_i$ built without it. We then predict the label for $x_i$ using only the trees in $S_i$:
$$\hat{y}_i = \arg\max_k \sum_{T\in S_i} \mathbb{1}\{T(x_i) = k\},$$
and the OOB error is the average over all samples:
$$\mathrm{Err}_{OOB} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\{\hat{y}_i \neq y_i\}.$$
The OOB error is close to the N-fold (leave-one-out) cross-validation error, so we do not need to perform cross-validation alongside tree building; once the OOB error stabilizes, training can be terminated. [OOB error can therefore be used to select the number of trees to build.]
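The OOB bookkeeping can be sketched as follows (again a Python illustration using scikit-learn trees; the vote-matrix layout is just one way to organize it):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = X.shape[0]
B, rng = 200, np.random.default_rng(0)
votes = np.zeros((n, len(np.unique(y))))  # OOB vote counts per sample/class

for b in range(B):
    idx = rng.integers(0, n, size=n)         # bootstrap sample for tree b
    oob = np.setdiff1d(np.arange(n), idx)    # samples z_i left out of tree b
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    votes[oob, tree.predict(X[oob])] += 1    # vote only where z_i is OOB

oob_pred = votes.argmax(axis=1)              # majority class over trees in S_i
oob_error = np.mean(oob_pred != y)           # average over all samples
```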
(4). Variable importance
There are two ways to evaluate variable importance:
- Gini importance: at each split in each tree, the improvement in the split criterion is the importance measure attributed to the splitting variable, and it is accumulated over all trees separately for each variable. Formally, in each decision tree $T_b$, $b \in 1{:}B$, the squared importance measure of the $\ell$th variable is defined as
$$\mathrm{VI}^2_\ell(T_b) = \sum_{t=1}^{J-1}\hat{i}^2_t\,\mathbb{1}\{v(t)=\ell\},$$
where $v(t)$ is the index of the variable used to split node $t$ into two daughter nodes, the sum is over the $J-1$ internal nodes of the tree, and $\hat{i}^2_t$ is the improvement in the split criterion (such as squared-error risk or Gini impurity). Averaging over all B trees gives the total squared importance measure of the $\ell$th variable:
$$\mathrm{VI}^2_\ell = \frac{1}{B}\sum_{b=1}^{B}\mathrm{VI}^2_\ell(T_b) = \frac{1}{B}\sum_{b=1}^{B}\sum_{t=1}^{J-1}\hat{i}^2_t\,\mathbb{1}\{v(t)=\ell\}.$$
- Permutation importance: uses the OOB samples to construct a different importance measure. When $T_b$ is built, its OOB samples are passed down the tree and the prediction accuracy is recorded:
$$C_b = \frac{1}{|OOB(b)|}\sum_{i\in OOB(b)}\mathbb{1}\{\hat{y}_i = y_i\}.$$
Then the values of the $\ell$th variable are randomly permuted in the OOB samples, and the prediction accuracy is computed again:
$$C_{b\ell} = \frac{1}{|OOB(b)|}\sum_{i\in OOB(b)}\mathbb{1}\{\hat{y}_{i,\pi_\ell} = y_i\}.$$
The difference $\mathrm{VI}_\ell(T_b) = C_b - C_{b\ell}$ is taken as the permutation importance of the $\ell$th variable in tree $T_b$. Finally, the importance measure of the $\ell$th variable is the average of $\mathrm{VI}_\ell(T_b)$ over the B trees:
$$\mathrm{VI}_\ell = \frac{1}{B}\sum_{b=1}^{B}\mathrm{VI}_\ell(T_b).$$
summary: Gini importance measures how often variable $\ell$ is used to split a node and how useful those splits are. Permutation importance measures how useful variable $\ell$ is for predicting held-out (OOB) data.
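Permutation importance as defined above can be sketched like this (Python with scikit-learn trees; the constants are arbitrary illustration values):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n, p = X.shape
B, rng = 100, np.random.default_rng(0)
vi = np.zeros(p)  # permutation importance per variable

for b in range(B):
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    c_b = (tree.predict(X[oob]) == y[oob]).mean()    # OOB accuracy C_b
    for l in range(p):
        Xp = X[oob].copy()
        Xp[:, l] = rng.permutation(Xp[:, l])         # permute variable l
        c_bl = (tree.predict(Xp) == y[oob]).mean()   # permuted accuracy C_bl
        vi[l] += (c_b - c_bl) / B                    # average accuracy drop
```

Variables whose permutation hurts OOB accuracy the most receive the largest scores.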
(5). Proximity plot
In growing a random forest, an $N \times N$ proximity matrix is accumulated for the training data. For every tree, the proximity of any pair of OOB samples sharing a terminal node is increased by one. That is to say: for any pair of training samples, the proximity is the number of trees in which both samples are out-of-bag and fall into the same terminal node.
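A sketch of the proximity accumulation (Python, using scikit-learn's tree.apply to get terminal-node ids; this is an illustration, not the randomForest package's implementation):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = X.shape[0]
B, rng = 100, np.random.default_rng(0)
prox = np.zeros((n, n))  # N x N proximity matrix

for b in range(B):
    idx = rng.integers(0, n, size=n)
    oob = np.setdiff1d(np.arange(n), idx)    # OOB samples for this tree
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[idx], y[idx])
    leaves = tree.apply(X[oob])              # terminal node of each OOB sample
    same = leaves[:, None] == leaves[None, :]   # OOB pairs sharing a terminal node
    prox[np.ix_(oob, oob)] += same              # increase their proximity by one
```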
(6). R code for random forest
Here is a simple R example of random forest using the randomForest package.
rm(list = ls(all = TRUE))  # remove all objects
# install.packages("randomForest")  # install the randomForest package
library(randomForest)  # load the package
data(iris)  # data to analyze
n <- nrow(iris)  # number of samples
p <- ncol(iris) - 1  # number of predictor variables (column 5 is the class label)
test_inx <- sample(1:n, n/5)  # hold out n/5 samples as test data
iris_train <- iris[-test_inx, ]  # training data
iris_test <- iris[test_inx, ]  # test data
iris_rf <- randomForest(iris_train[, -5], iris_train[, 5], ntree = 1000, mtry = floor(sqrt(p)), replace = TRUE, importance = TRUE, proximity = TRUE)  # fit the model on the training data
print(iris_rf)  # view the fitted model, including the OOB error estimate
iris_rf$importance  # variable importance measures
varImpPlot(iris_rf)  # plot variable importance
iris_predict <- predict(iris_rf, iris_test[, -5], type = "response")  # predict class labels for the test data
table(observed = iris_test[, 5], predicted = iris_predict)  # confusion table of prediction results
error <- sum(iris_predict != iris_test[, 5]) / length(test_inx)  # test prediction error
error