Why Machine Learning (8): Trying Out Random Forest

Article link: https://blog.csdn.net/qq_37477357/article/details/106447824

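Random forest is a typical ensemble learning algorithm. As the name suggests, a forest is made of many trees, and a random forest is made of many decision trees. It works like a medical consultation: several doctors each give an opinion, and the opinion backed by the most doctors wins the vote. In a random forest, every decision tree gives its own opinion, and the opinion with the most votes becomes the prediction.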
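The voting step described above can be sketched in a few lines. This is only an illustration: `majority_vote` is a hypothetical helper name, and the labels in the usage line are made up.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that appears most often; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]

# Three "doctors" (trees) give an opinion; the most common one wins.
print(majority_vote(["flu", "cold", "flu"]))
```

With an even number of voters a tie is possible, so some tie-breaking rule is needed; here the label that was encountered first wins.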

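A decision tree is essentially a greedy algorithm. At each node it scans the candidate split points of every feature, scores each candidate by how much it reduces impurity (measured with Gini impurity, or with entropy, in which case the reduction is the information gain), and splits on the feature and split point with the largest gain. Splitting stops once the largest gain falls below a threshold.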
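As a rough sketch of one greedy step, the snippet below scans the split points of a single feature and keeps the one with the largest decrease in Gini impurity. The helper names `gini` and `best_split` and the toy data are assumptions made for illustration; a real tree repeats this over all features and recurses into the children.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, impurity decrease)."""
    parent = gini(labels)
    n = len(labels)
    best_thr, best_gain = None, 0.0
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= thr]
        right = [l for v, l in zip(values, labels) if v > thr]
        # Weighted impurity of the children, subtracted from the parent's impurity
        gain = parent - len(left) / n * gini(left) - len(right) / n * gini(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain

print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))
```

On this toy input the split at 2.5 separates the two classes perfectly, so it removes all of the parent's impurity.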

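A random forest uses Bootstrap sampling, which is sampling with replacement n times, where n is the number of samples. The probability that a given sample is not picked in one draw is $1-\frac{1}{n}$, so the probability that it is never picked in n draws is $(1-\frac{1}{n})^n$, which tends to $\frac{1}{e}$ as n goes to infinity. Each round of sampling therefore leaves out about 36.8% of the samples; these are called out-of-bag (OOB) data. The OOB data can be used as a validation set: training can stop once the prediction accuracy on this data levels off.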
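The 36.8% figure is easy to check numerically. The sketch below compares the closed form $(1-\frac{1}{n})^n$ with the left-out fraction of one simulated bootstrap draw; `oob_fraction` is a hypothetical helper and the seed is arbitrary.

```python
import math
import numpy as np

def oob_fraction(n, seed=0):
    """Draw n indices with replacement and return the share of indices never drawn."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, n, size=n)
    return 1.0 - np.unique(idx).size / n

for n in (10, 100, 10000):
    # closed-form probability vs. one simulated bootstrap sample
    print(n, (1 - 1 / n) ** n, oob_fraction(n))
print("1/e =", 1 / math.e)
```

For small n the simulated fraction is noisy, but both columns approach 1/e as n grows.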

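The training procedure of a random forest is as follows:
for t = 1, 2, …, T (T is the number of decision trees in the forest):
    draw a Bootstrap sample to obtain a training set
    train one decision tree on that training set
For each sample in the test set, let every decision tree give its answer, and take the answer with the most votes as the final answer.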

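This kind of joint prediction can be seen as a way to reduce variance. If the $x_i$ are independent and each has variance $\sigma^2$, then
$D(\frac{1}{n}\sum_{i=1}^{n}x_i)=\frac{\sigma^2}{n}$
In practice the trees are trained on overlapping Bootstrap samples and are therefore correlated, so the actual reduction is smaller than this ideal case.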
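The formula above can be checked with a small simulation. This is a sketch under the idealized assumption that the averaged quantities are independent draws from a normal distribution; `variance_of_mean` and its default arguments are made up for the illustration.

```python
import numpy as np

def variance_of_mean(n, sigma=1.0, trials=20000, seed=0):
    """Empirical variance of the mean of n independent N(0, sigma^2) draws."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0.0, sigma, size=(trials, n))
    return samples.mean(axis=1).var()

for n in (1, 6, 25):
    # empirical variance of the average vs. the predicted sigma^2 / n
    print(n, variance_of_mean(n), 1.0 / n)
```

The two printed columns should agree closely, which is the independent case; correlated predictors would show a larger variance than $\frac{\sigma^2}{n}$.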

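Taking classification on the MNIST dataset as an example, the code below implements a random forest by hand (using sklearn for the decision tree part) and compares it with a single decision tree. Note that this hand-written version only bootstraps the samples; it does not also pick a random subset of features at each split, as a standard random forest does.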

import numpy as np
from sklearn import tree
from sklearn import datasets

def load_mnist():
    #define the directory where mnist.npz is(Please watch the '\'!)
    path = r'mnist.npz'
    f = np.load(path)
    x_train, y_train = f['x_train'],f['y_train']
    x_test, y_test = f['x_test'],f['y_test']
    f.close()
    return (x_train, y_train), (x_test, y_test)

(x_train,y_train),(x_test,y_test) = load_mnist()
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255.
x_test /= 255.

x_train = x_train.reshape(-1,784)
x_test = x_test.reshape(-1,784)

# Single decision tree
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(x_train,y_train)
score = clf.score(x_test,y_test)
print(score)

# Hand-written random forest
ntree = 6
clfTrees = []
index = np.arange(x_train.shape[0])
# Training: each tree is fit on its own bootstrap sample
for i in range(ntree):
    randomIndex = np.random.choice(index,size=x_train.shape[0],replace=True)
    random_x = x_train[randomIndex]
    random_y = y_train[randomIndex]
    clfTree = tree.DecisionTreeClassifier()
    clfTree = clfTree.fit(random_x,random_y)
    clfTrees.append(clfTree)

# Testing
corrNum = 0
for i in range(x_test.shape[0]):
    answer = []
    # every decision tree casts one vote ([0] takes the scalar label out of the length-1 array)
    for j in clfTrees:
        answer.append(j.predict(x_test[i].reshape(1,-1))[0])
    # the label with the most votes is the prediction
    pred = max(answer,key=answer.count)
    if pred == y_test[i]:
        corrNum += 1
print("Acc:",corrNum/x_test.shape[0])
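The out-of-bag idea mentioned earlier can also be tried on a hand-built ensemble like the one above. The following is a minimal sketch, not part of the original experiment: it assumes sklearn's small `load_digits` dataset instead of `mnist.npz` so that it runs quickly, and `bagging_oob_accuracy` with its parameters is a hypothetical helper. Each tree votes only on the samples that were left out of its own bootstrap draw.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier

def bagging_oob_accuracy(X, y, n_trees=15, seed=0):
    """Train bootstrapped trees and score each sample only with trees that never saw it."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    n_classes = int(y.max()) + 1
    votes = np.zeros((n, n_classes), dtype=int)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (with replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # samples this tree did not see
        clf = DecisionTreeClassifier(random_state=int(rng.integers(1 << 30)))
        clf.fit(X[idx], y[idx])
        votes[oob, clf.predict(X[oob])] += 1    # one vote per out-of-bag sample
    covered = votes.sum(axis=1) > 0             # samples that were out-of-bag at least once
    pred = votes.argmax(axis=1)
    return np.mean(pred[covered] == y[covered])

X, y = load_digits(return_X_y=True)
print("OOB Acc:", bagging_oob_accuracy(X, y))
```

Because no sample is scored by a tree that trained on it, this number behaves like a validation accuracy without setting aside a separate validation split.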