A Deep Dive into RandomForest

Latest recommended article published on 2022-07-14 16:35:09

mmc2015

Category: ML in coding. Tags: A Deep Dive into RandomForest, random forest, randomized implementation of random forest, random forest Python implementation

Article link: https://blog.csdn.net/mmc2015/article/details/51872434

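Why build the trees with GBDT rather than RF?

RF also consists of many trees, but in practice it has been reported to perform worse than GBDT. In GBDT, the splits in the earlier trees mainly use features that discriminate well for the majority of samples, while the later trees mainly address the minority of samples whose residuals are still large after the first N trees. Choosing features that discriminate well overall first, and then features that discriminate for a minority of samples, is a more reasonable approach, and this is probably why GBDT is used.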

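What makes a random forest "random" is its sources of randomness. Prof. Lin from Taiwan described three of them:

1) bootstrapping the samples

2) sampling the features

3) randomly combining (crossing) features

I read the sklearn code: it implements the first two, but not the third.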
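
To make the three kinds of randomness concrete, here is a minimal sketch in plain Python. The function names (`bootstrap_indices`, `sample_features`, `random_weights`, `combine`) are my own illustrative choices and come from neither the lecture nor sklearn; the third one is shown as a random linear projection, which is only one possible way to combine features.

```python
import random

def bootstrap_indices(n_samples, rng):
    """1) Bootstrap: draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def sample_features(n_features, k, rng):
    """2) Feature sampling: pick k distinct column indices."""
    return sorted(rng.sample(range(n_features), k))

def random_weights(n_columns, rng):
    """3) Feature combination: one random weight per chosen column."""
    return [rng.uniform(-1.0, 1.0) for _ in range(n_columns)]

def combine(row, columns, weights):
    """Project the chosen columns of one row onto the random direction."""
    return sum(w * row[c] for w, c in zip(weights, columns))

rng = random.Random(0)
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0]]

rows = bootstrap_indices(len(X), rng)      # some rows may repeat
cols = sample_features(len(X[0]), 2, rng)  # two of the four columns
weights = random_weights(len(cols), rng)   # fixed for the whole feature
new_feature = [combine(X[i], cols, weights) for i in rows]
```

The weights are drawn once and reused for every row, so the combined column is a single consistent feature rather than noise.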

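I also read karpathy's code (https://github.com/karpathy/Random-Forest-Matlab). It does not bootstrap the samples, and the feature-combination part is really quite "random". It uses roughly four methods:

1) decision stump: pick one feature column at random

2) pick two columns at random and combine them linearly, including a bias term

3) pick two columns at random and combine them quadratically, also including the linear terms and a bias term

4) compute the split threshold from the RBF distance between the current sample and all samples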
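
As a rough illustration of methods 1) and 2), here is a small sketch in plain Python. It is not a port of the Matlab repository: the names `random_stump`, `random_linear` and `split_rows`, the Gaussian weights, and the "score below threshold goes left" convention are all assumptions made for this example.

```python
import random

def random_stump(n_features, rng):
    """Method 1: score a row by one randomly chosen column."""
    col = rng.randrange(n_features)
    return lambda row: row[col]

def random_linear(n_features, rng):
    """Method 2: random linear combination of two columns plus a bias."""
    c1, c2 = rng.sample(range(n_features), 2)
    w1, w2, b = (rng.gauss(0.0, 1.0) for _ in range(3))
    return lambda row: w1 * row[c1] + w2 * row[c2] + b

def split_rows(X, score, threshold=0.0):
    """Send rows with score < threshold left, the rest right."""
    left = [i for i, row in enumerate(X) if score(row) < threshold]
    right = [i for i, row in enumerate(X) if score(row) >= threshold]
    return left, right

rng = random.Random(0)
X = [[0.1, 0.9, 0.3],
     [0.7, 0.2, 0.5],
     [0.4, 0.4, 0.8],
     [0.9, 0.1, 0.2]]

stump = random_stump(len(X[0]), rng)
left, right = split_rows(X, stump, threshold=0.5)

linear = random_linear(len(X[0]), rng)
left2, right2 = split_rows(X, linear)
```

A tree builder would generate several such candidates per node and keep the one with the best criterion gain.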

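I wrote my own version, which turned out to be fairly simple. I tried the first two methods; the third is not done yet. I will make the code parallel first and then write a follow-up post.

For now, here is a rough outline of the design of the RandomForest I have finished so far.

First, the data structures

A random forest consists of many trees, so the design of the tree structure is the key part; only the tree design is described below.

To begin with, the basic structure of a TreeClassifier should look roughly like this: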

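class MY_TreeClassifier:
    def __init__(self,
                 criterion="entropy",  # "entropy", "gini"
                 max_depth=None,       # None, int
                 min_leaf_split=None,  # None, int
                 max_feature="all",    # "all", "sqrt", int
                 bootstrap=False,      # True, False
                 ):
        # store the training parameters; the tree itself is built later
        self.criterion = criterion
        self.max_depth = max_depth
        self.min_leaf_split = min_leaf_split
        self.max_feature = max_feature
        self.bootstrap = bootstrap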

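The parameters are: the split criterion (Gini or information entropy); the maximum depth of each tree; min_leaf_split, below which a node stops splitting (that is, when the node holds fewer than min_leaf_split samples); the number of features considered when building the tree; and whether to bootstrap the samples.

Further special-purpose parameters can be added to customize the design.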
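
Since max_feature accepts "all", "sqrt" or an int, it has to be turned into a concrete feature count somewhere. A minimal sketch of that step might look like this; `resolve_max_feature` is a hypothetical helper name, and rounding the square root down is my own assumption.

```python
import math

def resolve_max_feature(max_feature, n_features):
    """Translate the max_feature setting into a number of columns."""
    if max_feature == "all":
        return n_features
    if max_feature == "sqrt":
        # assumption: round down, but always keep at least one feature
        return max(1, int(math.sqrt(n_features)))
    is_int = isinstance(max_feature, int) and not isinstance(max_feature, bool)
    if is_int and 1 <= max_feature <= n_features:
        return max_feature
    raise ValueError("max_feature must be 'all', 'sqrt' or an int in "
                     "[1, %d], got %r" % (n_features, max_feature))

print(resolve_max_feature("all", 10))
print(resolve_max_feature("sqrt", 10))
print(resolve_max_feature(4, 10))
```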

Next, a tree is made up of nodes, so the structure of each node matters as well. The basic structure of a TreeNode should look like this:

class MY_TreeNode:
    def __init__(self, depth=1):
        self.depth=depth
        self.spliter=None
        self.left_child=None
        self.right_child=None
        # for leaf node
        self.y_distribution={}

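The fields are: the depth of the current node (used to decide whether to stop growing); spliter, a split model described by the MY_Spliter data structure; the TreeNode of the left subtree; and the TreeNode of the right subtree. In addition, y_distribution exists only on leaf nodes and stores the leaf's predicted probability for each class.

Splitting is the most important operation while building a tree, so it is worth a class of its own that stores the split information (the best feature to split on, the best threshold, the best criterion gain, and so on). Concretely, the basic structure of a Spliter should look like this: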

class MY_Spliter:
    def __init__(self):
        self.gain=0
        self.feature_index=0
        self.threshold=0
        self.left_index=[]
        self.right_index=[]

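Here left_index and right_index record which samples the spliter sent to the left subtree and which to the right subtree. They are not strictly required and are kept only to make debugging easier.

Summary: MY_TreeClassifier only stores the parameters related to training the tree. MY_TreeNode keeps the classic left-subtree and right-subtree pointers. Because the tree is built for prediction, every MY_TreeNode also has to store its decision rule, which is what the spliter structure holds. MY_Spliter stores the feature_id and the corresponding threshold, which are indispensable at prediction time.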

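Next, the key functions and their inputs and outputs.

First, MY_TreeClassifier needs at least two functions: train and predict.

The Train(X_train, y_train) function

Input: at least three parameters, namely X_train, y_train and the MY_TreeClassifier itself.

Output: a fully built tree (in practice a pointer to root_node is enough).

Rough logic:

1) Preprocess X_train according to the parameters stored in MY_TreeClassifier: bootstrap, feature sampling, and so on.

2) Call tree=build_tree(X_train, y_train) to build the tree recursively.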
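
The preprocessing in step 1) could be sketched as below, in plain Python. `prepare_training_data` and its parameters are hypothetical names for this example; the point it illustrates is that the sampled column indices have to be returned and remembered, because prediction needs the same columns later.

```python
import random

def prepare_training_data(X, y, bootstrap, n_keep, seed=0):
    """Bootstrap the rows (optional) and sample n_keep feature columns."""
    rng = random.Random(seed)
    n_samples, n_features = len(X), len(X[0])
    if bootstrap:
        # rows drawn with replacement; some repeat, some are left out
        rows = [rng.randrange(n_samples) for _ in range(n_samples)]
    else:
        rows = list(range(n_samples))
    feature_index = sorted(rng.sample(range(n_features), n_keep))
    X_sub = [[X[i][j] for j in feature_index] for i in rows]
    y_sub = [y[i] for i in rows]
    return X_sub, y_sub, feature_index

X = [[0.1, 1.0, 10.0],
     [0.2, 2.0, 20.0],
     [0.3, 3.0, 30.0],
     [0.4, 4.0, 40.0]]
y = [0, 0, 1, 1]

X_sub, y_sub, feature_index = prepare_training_data(
    X, y, bootstrap=True, n_keep=2, seed=1)
```

Here the features are sampled once per tree, which follows the design described above; sklearn instead resamples candidate features at every split.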

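The Predict(X_test) function

Input: at least three parameters, namely X_test, the trained tree, and the MY_TreeClassifier (X_test has to go through the same feature preprocessing as X_train, and the parameters of that preprocessing are stored in MY_TreeClassifier).

Output: the probability of each sample belonging to each class.

Rough logic:

1) Preprocess X_test according to the parameters stored in MY_TreeClassifier, selecting the same feature columns that were sampled for training (bootstrap resampling is a training-only step and should not be applied to X_test).

2) Follow the structure of the trained tree (mainly the Spliter information in each TreeNode) to route each sample of X_test down to a leaf node.

3) Use the y_distribution stored in that leaf node to output the sample's predicted probability for each class.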
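
The three prediction steps can be sketched as follows. `Split` and `Node` are stripped-down stand-ins for MY_Spliter and MY_TreeNode, and the rule "feature value < threshold goes left" is an assumption of this example; the post does not say which side the comparison favors.

```python
class Split:
    def __init__(self, feature_index, threshold):
        self.feature_index = feature_index
        self.threshold = threshold

class Node:
    def __init__(self, spliter=None, left_child=None, right_child=None,
                 y_distribution=None):
        self.spliter = spliter                # None on leaf nodes
        self.left_child = left_child
        self.right_child = right_child
        self.y_distribution = y_distribution  # set on leaf nodes only

def predict_proba_one(node, x):
    """Walk from the root to a leaf and return its class distribution."""
    while node.spliter is not None:
        if x[node.spliter.feature_index] < node.spliter.threshold:
            node = node.left_child
        else:
            node = node.right_child
    return node.y_distribution

def predict_proba(root, X_test, feature_index):
    """Select the training-time columns, then route every row."""
    return [predict_proba_one(root, [row[j] for j in feature_index])
            for row in X_test]

# A hand-built tree with one split on the first selected feature.
root = Node(spliter=Split(feature_index=0, threshold=0.5),
            left_child=Node(y_distribution={0: 1.0}),
            right_child=Node(y_distribution={0: 0.25, 1: 0.75}))

# The tree was "trained" on original column 1 only.
probas = predict_proba(root, [[9.0, 0.2], [9.0, 0.8]], feature_index=[1])
```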

Next is the build_tree(X_train, y_train) function called from Train.

The build_tree(X_train, y_train) function

Input: X_train, y_train, MY_TreeClassifier (all the stopping conditions are stored in MY_TreeClassifier)

Output: the finished tree (in practice a pointer to root_node)

Rough logic:

1) First check whether X_train, y_train meet a stopping condition (depth, whether y_train is pure, and so on). If so, X_train, y_train can be treated as a leaf_node: compute the corresponding y_distribution from them and return the leaf_node. Otherwise:

2) Call best_spliter=find_best_spliter(X_train, y_train) to find the best split for X_train, y_train (feature_id, threshold and so on, stored in best_spliter).

3) Use left_index and right_index in best_spliter to divide X_train, y_train into left_X_train, left_y_train and right_X_train, right_y_train.

4) Recursively call build_tree(left_X_train, left_y_train) and build_tree(right_X_train, right_y_train) to finish building the tree.
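
The recursion above might be sketched like this. To keep the example short, nodes are plain dicts instead of MY_TreeNode objects, the stopping parameters are passed as arguments, and the split finder is passed in as a callable; `mean_split` is only a stand-in so the example runs, not the real find_best_spliter.

```python
from collections import Counter

def class_distribution(y):
    """Fraction of samples per class, used as the leaf prediction."""
    return {label: n / len(y) for label, n in Counter(y).items()}

def build_tree(X, y, find_split, depth=1, max_depth=None, min_leaf_split=None):
    # 1) stopping conditions: pure node, depth limit, too few samples
    stop = (len(set(y)) == 1
            or (max_depth is not None and depth >= max_depth)
            or (min_leaf_split is not None and len(y) < min_leaf_split))
    # 2) otherwise ask the split finder for (feature_index, threshold)
    split = None if stop else find_split(X, y)
    if split is None:
        return {"leaf": True, "y_distribution": class_distribution(y)}
    feature_index, threshold = split
    # 3) partition the samples
    left = [i for i, row in enumerate(X) if row[feature_index] < threshold]
    right = [i for i, row in enumerate(X) if row[feature_index] >= threshold]
    if not left or not right:  # degenerate split, fall back to a leaf
        return {"leaf": True, "y_distribution": class_distribution(y)}
    # 4) recurse on both halves
    return {
        "leaf": False,
        "feature_index": feature_index,
        "threshold": threshold,
        "left": build_tree([X[i] for i in left], [y[i] for i in left],
                           find_split, depth + 1, max_depth, min_leaf_split),
        "right": build_tree([X[i] for i in right], [y[i] for i in right],
                            find_split, depth + 1, max_depth, min_leaf_split),
    }

def mean_split(X, y):
    """Stand-in split finder: split feature 0 at its mean value."""
    values = [row[0] for row in X]
    return 0, sum(values) / len(values)

X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 1, 1]
tree = build_tree(X, y, mean_split, max_depth=3)
```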

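Finally, there are find_best_spliter and the entropy functions it relies on. Their logic is simple.

The find_best_spliter(X_train, y_train) function

Input: X_train, y_train, MY_TreeClassifier (criterion="entropy"/"gini" is stored in MY_TreeClassifier)

Output: the best split for X_train, y_train (feature_id, threshold and so on)

Rough logic:

1) Call calculate_entropy(y_train) to compute the entropy of the original data.

2) Iterate over every candidate feature_id and threshold, and call calculate_split_entropy(feature_id, threshold) to compute the split_entropy after the cut.

3) Return the feature_id and threshold that maximize entropy - split_entropy as best_spliter.

I will not go into calculate_entropy and calculate_split_entropy; they simply evaluate the formulas.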
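
For completeness, here is one way these functions could look for the "entropy" criterion, using H(y) = -sum_k p_k log2 p_k and the sample-weighted average of the two child entropies as split_entropy. This is a sketch under my own assumptions: calculate_split_entropy takes the two label lists rather than (feature_id, threshold), candidate thresholds are midpoints between consecutive distinct values, and the result is a plain tuple instead of a MY_Spliter.

```python
import math
from collections import Counter

def calculate_entropy(y):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(y)
    return -sum((c / n) * math.log2(c / n) for c in Counter(y).values())

def calculate_split_entropy(y_left, y_right):
    """Entropy after the split: child entropies weighted by child size."""
    n = len(y_left) + len(y_right)
    return (len(y_left) / n * calculate_entropy(y_left)
            + len(y_right) / n * calculate_entropy(y_right))

def find_best_spliter(X, y):
    """Exhaustive search; returns (gain, feature_index, threshold) or None."""
    base = calculate_entropy(y)
    best = None
    for feature_index in range(len(X[0])):
        values = sorted(set(row[feature_index] for row in X))
        # candidate thresholds: midpoints between neighboring values
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            y_left = [t for row, t in zip(X, y) if row[feature_index] < threshold]
            y_right = [t for row, t in zip(X, y) if row[feature_index] >= threshold]
            gain = base - calculate_split_entropy(y_left, y_right)
            if best is None or gain > best[0]:
                best = (gain, feature_index, threshold)
    return best

X = [[0.1, 5.0], [0.2, 1.0], [0.8, 5.0], [0.9, 1.0]]
y = [0, 0, 1, 1]
gain, feature_index, threshold = find_best_spliter(X, y)
```

In this toy data the first column separates the classes perfectly and the second carries no information, so the search should settle on the first column. The function returns None when every feature is constant, which a caller can treat as "make a leaf".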

Here is some of the code as well:

#!/usr/bin/env python
# -*- coding:utf-8 -*-


from MY_TreeClassifier import MY_TreeClassifier

import numpy as np


from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier





class MY_RandomForestClassifier:
    
    def __init__(self,
                 criterion="entropy", # "entropy", "gini"
                 max_depth=None, # None, int
                 min_leaf_split=None, # None, int
                 max_feature="all", # "all", "sqrt", int
                 bootstrap=False, # True, False
                 seed=0, # int
                 n_jobs=None, # None, int
                 warm_start=False, # True, False
                 n_estimator=10 # int
                 ):
        self.criterion=criterion
        self.max_depth=max_depth
        self.min_leaf_split=min_leaf_split
        self.max_feature=max_feature
        self.bootstrap=bootstrap
        self.seed=seed
        self.n_jobs=n_jobs
        
        self.warm_start=warm_start
        self.n_estimator=n_estimator
    
    def fit(self, X, y):
        self.classes_=np.unique(y)
        
        # start from an empty forest unless warm_start reuses earlier trees
        if not self.warm_start or not hasattr(self, "estimator_"):
            self.estimator_=[]
        n_more_estimator=self.n_estimator-len(self.estimator_)
        if n_more_estimator<0:
            raise ValueError('n_estimator=%d must be larger or equal to '
                             'len(self.estimator_)=%d when warm_start==True'
                             % (self.n_estimator, len(self.estimator_)))
        
        trees=[]
        for i in range(n_more_estimator):
            tree=MY_TreeClassifier(
                criterion=self.criterion,
                max_depth=self.max_depth,
                min_leaf_split=self.min_leaf_split,
                max_feature=self.max_feature,
                bootstrap=self.bootstrap,
                # distinct seed per tree so the trees are not identical
                # (assumes MY_TreeClassifier seeds its RNG from `seed`)
                seed=self.seed+len(self.estimator_)+i,
                n_jobs=self.n_jobs
                )
            tree=tree.fit(X, y)
            trees.append(tree)
        # collect newly grown trees
        self.estimator_.extend(trees)
        
        return self
    
    def predict_proba(self, X):
        probas=np.zeros([X.shape[0], len(self.classes_)])
        for e in self.estimator_:
            probas+=e.predict_proba(X)
        probas /= len(self.estimator_)
        return probas
    
    def predict(self, X):
        probas = self.predict_proba(X)
        return self.classes_.take(np.argmax(probas, axis=1), axis=0)



if __name__=="__main__":
    # smaller two-class alternative:
    # np.random.seed(34567)
    # X=np.random.random((50,5))
    # y=np.array([1]*25+[0]*25)
    np.random.seed(34567)
    X=np.random.random((300,5))
    y=np.array([1]*100+[0]*100+[2]*100)
    
    # our own forest
    rf=MY_RandomForestClassifier(criterion="entropy", max_depth=3, n_estimator=3)
    rf=rf.fit(X,y)
    
    p=rf.predict(X)
    print(metrics.accuracy_score(y, p))
    
    # sklearn's forest with comparable settings, for reference
    sklearn_rf=RandomForestClassifier(n_estimators=3, criterion="entropy",
                 max_depth=3, max_features=None,
                 bootstrap=True, random_state=None, warm_start=False)
    sklearn_rf.fit(X, y)
    p=sklearn_rf.predict(X)
    print(metrics.accuracy_score(y, p))