Random forest is a very widely used machine learning algorithm. The "random" refers to the fact that each tree is trained on a random sample of the rows and a random subset of the features.
The trained decision trees together form the "forest": at prediction time every tree casts a vote (or the outputs are averaged, for regression) and the aggregated result is the final prediction, which is exactly the idea of ensemble learning.
Without further ado, let's analyze it piece by piece through the code. The dataset we pass in has 60 features plus one label (M or R):
0.02,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.066,0.2273,0.31,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.555,0.6711,0.6415,0.7104,0.808,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.051,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.018,0.0084,0.009,0.0032,R
0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.325,0.32,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.184,0.197,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.053,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.014,0.0049,0.0052,0.0044,R
0.0262,0.0582,0.1099,0.1083,0.0974,0.228,0.2431,0.3771,0.5598,0.6194,0.6333,0.706,0.5544,0.532,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.4293,0.3648,0.5331,0.2413,0.507,0.8533,0.6036,0.8514,0.8512,0.5045,0.1862,0.2709,0.4232,0.3043,0.6116,0.6756,0.5375,0.4719,0.4647,0.2587,0.2129,0.2222,0.2111,0.0176,0.1348,0.0744,0.013,0.0106,0.0033,0.0232,0.0166,0.0095,0.018,0.0244,0.0316,0.0164,0.0095,0.0078,R
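These rows come from a plain comma-separated text file. None of the listings below actually defines the loader, but the test code at the end calls a function named loadDataSet, so here is a minimal sketch of such a loader (my own assumption, matching the format shown above: 60 float features followed by a string label):

def loadDataSet(filename):
    # Read a comma-separated file: 60 float features per row, followed by the label 'M' or 'R'
    dataset = []
    with open(filename, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            values = line.split(',')
            row = [float(x) for x in values[:-1]] + [values[-1]]
            dataset.append(row)
    return dataset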
As said above, the randomness of a random forest lies in each tree's random training samples and random features, so the first step is to randomly split the dataset.
The details are explained in the comments of the code below:
# Randomly split the dataset into folds
def cross_validation_split(dataset, n_folds):
    """
    Split the rows of the dataset into n_folds folds by resampling.
    Only the rows are divided here; every fold still keeps all of the features.
    Because the sampling is done with replacement, a row may be drawn repeatedly,
    both within a fold and across folds.
    Args:
        dataset  the original dataset
        n_folds  number of folds to split the dataset into
    Returns:
        dataset_split  a list of the n_folds folds, used later for cross-validation
    """
    dataset_split = []
    dataset_copy = list(dataset)  # copy the dataset so the original is not modified
    fold_size = len(dataset) // n_folds  # integer division: number of rows per fold
    for i in range(n_folds):
        fold = []  # reset the fold on every iteration so rows are not carried over
        while len(fold) < fold_size:
            # Sampling with replacement: some rows are drawn several times and others never appear,
            # i.e. bootstrap sampling. This keeps each decision tree's training set different.
            # randrange behaves like randint, except that it also accepts a step argument.
            index = randrange(len(dataset_copy))
            # pop() removes an element from a list (the last one by default) and returns it.
            # fold.append(dataset_copy.pop(index))  # sampling without replacement
            fold.append(dataset_copy[index])  # sampling with replacement
        dataset_split.append(fold)
    # list of n_folds folds carved out of dataset, used for cross-validation
    return dataset_split
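A quick toy run (not in the original post, and assuming cross_validation_split is defined as above) shows what the folds look like:

from random import seed, randrange

seed(1)
toy = [[i, 'M' if i % 2 else 'R'] for i in range(10)]   # 10 tiny rows: [value, label]
folds = cross_validation_split(toy, 5)
print(len(folds), [len(f) for f in folds])              # 5 folds of 2 rows each (10 // 5)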
The second step is building the trees.
A random forest is, after all, built from individual decision trees.
To grow such a tree we need a feature-selection rule: at each split point we pick the most "informative" of the current candidate features. The three common criteria for decision trees are information gain, information gain ratio, and the Gini index; see reference [1] (认真的聊一聊决策树和随机森林) for details.
Suppose the dataset we pass in looks like the table in the original post's figure, with the label column highlighted in orange. We cannot afford to evaluate every feature at every split: as noted above, this dataset has 60 features, and computing all of them would be expensive and would also defeat the "do more with less" spirit of random forests.
After randomly sub-sampling the features, the data effectively shrinks to small blocks. If we set n_features = 4, for instance, each split only looks at small feature subsets such as ACDF, CEFJ, or DFHI (using the column letters from the figure).
Now we build the tree from those features:
def get_split(dataset, n_features):
    # Collect the set of labels and turn it into a list; here it is simply ['M', 'R']
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    features = []
    while len(features) < n_features:
        # Pick n_features feature indices at random from the dataset
        # (n_features is often chosen around the square root of the total number of features).
        # Our dataset has 60 features; this randomly selects a subset of them.
        index = randrange(len(dataset[0]) - 1)
        if index not in features:
            # store the index of the randomly chosen feature
            features.append(index)
    for index in features:       # iterate over each candidate feature
        for row in dataset:      # try every row's value of that feature as a split point
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                # Keep the split with the smallest Gini index: b_index is the feature index,
                # b_value the split value, b_score the Gini score, b_groups the two subsets.
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    return {'index': b_index, 'value': b_value, 'groups': b_groups}
# Split the dataset by a feature index and a feature value
def test_split(index, value, dataset):
    left, right = [], []
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right
def gini_index(groups,class_values):
gini = 0.0
D = len(groups[0])+len(groups[1])
for class_value in class_values:
for group in groups: # groups = (left, right)
size = len(group)
if size == 0 :
continue
proportion = [row[-1] for row in group].count(class_value)/float(size)
gini += float(size)/D * (proportion*(1.0- proportion))
return gini
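As a quick sanity check (a toy example, not part of the original code): a group with three 'M' rows and one 'R' row has p_M = 0.75 and p_R = 0.25, so it contributes 0.75 × 0.25 + 0.25 × 0.75 = 0.375, which equals 1 − (0.75² + 0.25²). Weighted over two groups, gini_index gives:

# Toy check of gini_index: left is 3 'M' + 1 'R', right is pure 'R'
left = [[0.1, 'M'], [0.2, 'M'], [0.3, 'M'], [0.4, 'R']]
right = [[0.9, 'R'], [0.8, 'R']]
print(gini_index((left, right), ['M', 'R']))   # 4/6 * 0.375 + 2/6 * 0 = 0.25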
Each node of the decision tree we build is a dict with up to five keys:
index: the index of the feature this node splits on.
value: the split threshold for that feature. When a new sample walks down the tree at prediction time, its value at feature index is compared against this threshold: smaller goes to the left child, otherwise to the right child.
groups: the data still waiting to be split below this node, stored as the pair (left, right).
left: after the recursive splitter has run, the left child, which is either another node dict or a terminal class label.
right: likewise, the right child.
For each node, get_split() loops over the randomly chosen features and over every row's value of those features, scores each candidate split with gini_index(), and keeps the (index, value) pair with the lowest Gini index; test_split() divides the remaining rows by that threshold into the left and right subsets stored in groups.
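For example, a freshly created node might look like this (the feature index and threshold are made-up values, shown only to illustrate the shape of the dict before split() replaces groups with left/right children):

left_rows = [[0.1] * 60 + ['R']]       # placeholder rows: 60 features + label
right_rows = [[0.9] * 60 + ['M']]
node = {
    'index': 27,                        # split on feature column 27 (made-up index)
    'value': 0.51,                      # go left if row[27] < 0.51, otherwise right (made-up threshold)
    'groups': (left_rows, right_rows),  # the two row subsets produced by test_split()
}
print(node['index'], node['value'], len(node['groups'][0]), len(node['groups'][1]))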
However, running the code above only gives us one node's value and its left/right subsets; in other words, we have only obtained the root. To build a complete tree we need a recursive splitter that keeps splitting until a stopping condition is reached. The code is as follows:
def split(node, max_depth, min_size, n_features, depth):
    left, right = node['groups']
    del(node['groups'])
    # If either side is empty there is nothing left to split: both children become
    # a terminal node built from whatever rows remain.
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    if depth >= max_depth:
        # Once the maximum depth is reached (e.g. max_depth = 10 allows at most ten levels of
        # recursion), stop early and label each side with its majority class.
        # Cutting the recursion short like this helps prevent overfitting.
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left, n_features)
        # recurse; depth + 1 keeps track of how deep we are
        split(node['left'], max_depth, min_size, n_features, depth + 1)
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right, n_features)
        split(node['right'], max_depth, min_size, n_features, depth + 1)

# When a node cannot be split any further it becomes a leaf, and its output is the
# label that appears most often in its group of rows.
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    # when max() is given a key argument, it uses that function as the comparison criterion,
    # so this returns the most frequent label
    return max(set(outcomes), key=outcomes.count)
The code above grows the tree recursively.
That completes the tree-building machinery. The last step ties it together: first obtain the root with get_split(), then recursively grow the complete tree:
def build_tree(train, max_depth, min_size, n_features):
    """
    build_tree (create one decision tree)
    Args:
        train       training dataset
        max_depth   maximum tree depth; too deep a tree easily overfits
        min_size    minimum number of rows a node must have to be split further
        n_features  number of features to consider at each split
    Returns:
        root        the root node of the decision tree
    """
    root = get_split(train, n_features)
    split(root, max_depth, min_size, n_features, 1)
    return root
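Putting the pieces together on a tiny made-up dataset (assuming all of the functions above are defined in the same file) looks like this:

from random import seed, randrange

seed(1)
# Toy dataset with one feature: rows below 0.5 are labelled 'R', rows above are 'M'
toy = [[0.1, 'R'], [0.2, 'R'], [0.3, 'R'], [0.7, 'M'], [0.8, 'M'], [0.9, 'M']]
tree = build_tree(toy, max_depth=3, min_size=1, n_features=1)
print(tree['index'], tree['value'])   # which feature and threshold the root splits on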
Let's recap the whole process:
First, we pass in a dataset. Because it has too many features, each tree only looks at a random subset of them, and from that subset it is trained into a decision tree.
While training the tree, at every node we compute the Gini index of the candidate features over the rows remaining at that node, find the best feature and its best split value, and use that value to divide the node's data into a left and a right subtree. Repeating this recursively yields one decision tree, in which every node stores a feature index and a threshold; at prediction time a new sample walks down the tree by comparing its feature values against those thresholds.
I drew a rough diagram of the idea (see the figure in the original post):
Once we have a tree we can make predictions with it. Here is the prediction code:
# Predict the class of one row with a trained tree (node)
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']
As explained above, this is easy to follow: walk down the tree comparing the row's feature values against each node's threshold until a leaf is reached, and the leaf's label is the classification result.
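A small usage example (a hand-built one-level tree, purely illustrative):

# A hand-built stump: split on feature 0 at 0.5, 'R' on the left, 'M' on the right
stump = {'index': 0, 'value': 0.5, 'left': 'R', 'right': 'M'}
print(predict(stump, [0.2, None]))   # 'R'
print(predict(stump, [0.8, None]))   # 'M'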
What remains is to turn a single tree into a forest, which should be easy to follow. The code is attached below:
def subsample(dataset, ratio):
    """
    Create a random subsample of the dataset (a bootstrap sample).
    Args:
        dataset  training dataset
        ratio    fraction of the dataset to sample
    Returns:
        sample   the randomly sampled training rows
    """
    sample = []
    n_sample = round(len(dataset) * ratio)
    while len(sample) < n_sample:
        # Sampling with replacement: some rows appear several times in the sample and others never,
        # i.e. bootstrap sampling, which keeps each decision tree's training set different.
        index = randrange(len(dataset))
        sample.append(dataset[index])
    return sample
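Because the sampling is done with replacement, the same row can appear more than once in a subsample. A quick toy check (again assuming the function above is defined):

from random import seed, randrange

seed(1)
toy = [[i, 'R'] for i in range(5)]
print(subsample(toy, 1.0))   # same length as toy, but some rows repeat and some are missing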
Below is the random forest itself. Its output is a list of predictions, one per test row, like this:
['R', 'M', 'R', 'M', 'M', 'M', 'R', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'M', 'R', 'R', 'M', 'R', 'M', 'M', 'R', 'M', 'M', 'M', 'R', 'R', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R', 'M', 'M', 'R', 'R', 'M', 'R', 'M']
def random_forest(train, test, max_depth, min_size, sample_size, n_trees, n_features):
    """
    random_forest (train a forest on the training set and predict the test set)
    Args:
        train        training dataset
        test         test dataset
        max_depth    maximum tree depth; too deep a tree easily overfits
        min_size     minimum number of rows a node must have to be split further
        sample_size  fraction of the training set sampled for each tree
        n_trees      number of decision trees
        n_features   number of features to consider at each split
    Returns:
        predictions  the prediction for every test row, obtained by bagging the trees' votes
    """
    trees = []
    for i in range(n_trees):
        # Draw a bootstrap sample; the random sampling keeps each tree's training set different
        sample = subsample(train, sample_size)
        # and build one decision tree on that sample
        tree = build_tree(sample, max_depth, min_size, n_features)
        trees.append(tree)
    # bagging_predict returns the label that most trees vote for, i.e. the final classification
    predictions = [bagging_predict(trees, row) for row in test]
    return predictions
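random_forest() calls bagging_predict(), which is not listed in this post. Based on the comment above (return the label that receives the most votes across the trees), a minimal sketch would be:

# Majority vote over all trees for a single row (a sketch, not from the original listing)
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)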
Compute the accuracy from our predictions:
# Percentage of predictions that match the actual labels
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100
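For example (toy lists, not real predictions):

print(accuracy_metric(['M', 'R', 'M', 'R'], ['M', 'R', 'R', 'R']))   # 75.0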
Finally, the model evaluation function:
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    """
    evaluate_algorithm (evaluate the algorithm with cross-validation, return the model scores)
    Args:
        dataset    the original dataset
        algorithm  the algorithm to evaluate
        n_folds    number of folds to split the data into
        *args      the remaining parameters, passed through to the algorithm
    Returns:
        scores     the accuracy score for each fold
    """
    # Split the dataset into n_folds folds by resampling with replacement (rows may repeat)
    folds = cross_validation_split(dataset, n_folds)
    scores = []
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])   # flatten the remaining folds into one training set
        test_set = []
        # fold is the part of the original dataset held out as the test set;
        # its labels are blanked out before prediction
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None
            test_set.append(row_copy)
        print('--------------------')
        # algorithm is passed in by the caller; here it will be the random_forest function
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores
The test code is as follows.
Don't forget to add this line at the very top of the file:
from random import seed, randrange, random
if __name__ == '__main__':
    filename = 'your-dataset-file-path'  # replace with the path to your own data file
    dataset = loadDataSet(filename)
    n_folds = 5        # split the data into 5 folds for cross-validation
    max_depth = 20     # tune this yourself; too deep a tree easily overfits
    min_size = 1       # minimum number of rows in a leaf node of the decision tree
    sample_size = 1.0  # fraction of rows sampled when building each tree
    n_features = 15    # trades accuracy against the diversity of the trees
    # n_trees is how many trees are built on each of the n_folds training splits
    for n_trees in [1, 5, 10]:
        # Set the random seed first so that every run of this file produces the same random numbers
        seed(1)
        print('random=', random())
        scores = evaluate_algorithm(dataset, random_forest, n_folds, max_depth, min_size, sample_size, n_trees, n_features)
        print('Number of trees: %d' % n_trees)
        print('Per-fold accuracy: %s' % scores)
        print('Mean accuracy: %.3f%%' % (sum(scores) / float(len(scores))))
        print('------------------------------------------')
Running it prints, for each number of trees, the per-fold accuracies and the mean accuracy.
Try tuning the parameters yourself and compare the different results.
Copy all of the code above into one file, change the dataset path, and it will run.
The complete code, with my own comments, is here: https://github.com/apachecn/AiLearning/blob/master/src/py2.x/ml/7.RandomForest/randomForest.py
References:
[1] 认真的聊一聊决策树和随机森林 - 知乎 (zhihu.com)
[2] 李航, 《统计学习方法》 (Statistical Learning Methods), 2nd edition, Chapter 5: Decision Trees, pp. 67-88.
[3] Peter Harrington, 《机器学习实战》 (Machine Learning in Action).