Boosting Ensemble Methods: AdaBoost Principles and a NumPy Implementation
1. Algorithm Principles
AdaBoost belongs to the boosting family of ensemble learning: it builds a strong classifier by iteratively training and correcting a sequence of weak classifiers.
The core idea is that in each training round, the weights of the samples the weak classifier misclassified are increased, while the weights of correctly classified samples are decreased, so that the next weak classifier focuses more on the previously misclassified samples. Finally, the predictions of all the weak classifiers are combined by a linear weighting, using each weak classifier's weight, to produce the final prediction.
2. Computation Flow
There are three key points in the implementation:
- updating the weights of misclassified samples
- computing the weight of each weak classifier
- combining multiple weak classifiers
2.1 Updating the weights of misclassified samples
In the m-th training round:
Sample weights from the previous round: $w_{m-1}$
Weak classifier weight from the previous round: $\alpha_{m-1}$
Labels predicted by the previous round's weak classifier ($-1$ or $1$): $\hat{y}_{m-1}$
True sample labels ($-1$ or $1$): $y_t$
The sample weight update for this round:
$$w_m=\frac{w_{m-1}}{Z_m}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}$$
where $Z_m$ is the normalization factor:
$$Z_m=\sum_i^n w_{m-1}e^{-\alpha_{m-1}\hat{y}_{m-1}y_t}$$
See the calc_normalization_factor and update_sample_weight methods in the code section; a standalone sketch of the update follows below.
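To make the update concrete, here is a minimal NumPy sketch of the same rule; the weight vector, `alpha`, and the label arrays are made-up illustrative values, not part of the implementation below:

```python
import numpy as np

# illustrative values: 5 samples, uniform previous-round weights
w_prev = np.full(5, 0.2)                # w_{m-1}
alpha = 0.42                            # alpha_{m-1}
y_hat = np.array([1, -1, 1, 1, -1])     # predictions of round m-1
y_true = np.array([1, 1, 1, -1, -1])    # true labels

# unnormalized update: misclassified samples (y_hat * y_true == -1) grow
unnorm = w_prev * np.exp(-alpha * y_hat * y_true)
Z_m = unnorm.sum()                      # normalization factor
w_m = unnorm / Z_m                      # updated weights, summing to 1
print(w_m, w_m.sum())
```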
2.2 Computing the weight of each weak classifier
In the m-th training round:
Classification error rate of the previous round's weak classifier: $e_{m-1}$
$$\alpha_m=\frac{1}{2}\log\frac{1-e_{m-1}}{e_{m-1}}$$
See the calc_estimator_weight method in the code section; a quick numeric check follows below.
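As a quick numeric check of the formula (the error rates here are made up for illustration): a weak classifier with error rate $e=0.3$ gets weight $\frac{1}{2}\log\frac{0.7}{0.3}\approx 0.4236$, one with $e=0.5$ (no better than chance) gets weight 0, and one with $e>0.5$ gets a negative weight, inverting its vote:

```python
import numpy as np

def estimator_weight(e):
    # alpha = 1/2 * log((1 - e) / e), the same form as calc_estimator_weight
    return 0.5 * np.log((1 - e) / e)

print(estimator_weight(0.3))  # ~0.4236: better than chance, positive weight
print(estimator_weight(0.5))  # 0.0: a coin-flip classifier contributes nothing
print(estimator_weight(0.7))  # ~-0.4236: worse than chance, vote is inverted
```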
2.3 Combining multiple weak classifiers
A linear combination is used:
Labels predicted in round $i$: $\hat{y}_i$
Weight of the round-$i$ weak classifier: $\alpha_i$
The strong classifier:
$$G(x)=\mathrm{sign}\left(\sum_{i=1}^n\alpha_i\hat{y}_i\right)$$
See the predict method of the adaboost class in the code section; a standalone sketch follows below.
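A minimal sketch of this combination, assuming a small matrix of per-estimator votes and their weights (both made up for illustration):

```python
import numpy as np

# rows = estimators, columns = samples; entries are each stump's -1/1 votes
preds = np.array([[ 1, -1,  1],
                  [ 1,  1, -1],
                  [-1,  1,  1]])
alphas = np.array([0.8, 0.4, 0.3])  # illustrative estimator weights

# weighted sum of votes per sample, then sign() picks the winning side
G = np.sign(alphas @ preds)
print(G)  # [ 1. -1.  1.], e.g. sample 0: 0.8 + 0.4 - 0.3 = 0.9 > 0
```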
3. NumPy Implementation
I wrote this code quite a while ago. The weak classifier is a one-level decision tree (a decision stump), and the best split point is chosen directly by the misclassification rate, so the procedure is fairly simplified and leaves plenty of room for optimization, but it does implement the main ideas of AdaBoost.
3.1 Code
```python
# -*- coding: utf-8 -*-
"""
Created on Mon Oct 19 11:25:21 2019

author: Irvinfaith
email: Irvinfaith@hotmail.com
"""
import numpy as np


class adaboost_base_estimator():
    def __init__(self, sample_weight):
        """
        Initialize the base estimator.

        Parameters:
        -----------
        sample_weight: list
            A list of weights, one per observation.
        """
        self.sample_weight = sample_weight
        self.tree = {}

    def choose_variable(self, data):
        """
        Randomly return a variable index for this estimator.

        Parameters:
        -----------
        data: array,
            The array of the dataset.

        Returns:
        -------
        int:
            A variable index.
        """
        return np.random.choice(data.shape[1])

    def set_variable_sample(self, combine):
        """
        Generate a list of the median points of a continuous variable.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.

        Returns:
        -------
        node_list: list,
            A list of median points (candidate split points).
        """
        sorted_data = np.sort(combine[:, 0], axis=0)
        sorted_set = sorted(list(set(sorted_data)))
        node_list = [(sorted_set[i] + sorted_set[i + 1]) / 2 for i in range(len(sorted_set)) if
                     i <= len(sorted_set) - 2]
        return node_list

    def get_sample_error_index(self, combine, node):
        """
        Generate a list of indices of the misclassified observations.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.
        node: float,
            The split point.

        Returns:
        -------
        list:
            A list of indices of misclassified observations.
            If the left child label equals the right child label, return None.
        """
        left_count = np.bincount(combine[np.where(combine[:, 0] <= node), 1][0].astype(int))
        right_count = np.bincount(combine[np.where(combine[:, 0] > node), 1][0].astype(int))
        left_label = left_count.argmax()
        right_label = right_count.argmax()
        if left_label == right_label:
            return None
        else:
            left_error_index_array = np.where((combine[:, 0] <= node) & (combine[:, 1] != left_label))[0]
            right_error_index_array = np.where((combine[:, 0] > node) & (combine[:, 1] != right_label))[0]
            return np.append(left_error_index_array, right_error_index_array)

    def calc_error_ratio(self, sample_error_index):
        """
        Calculate the error ratio.

        Parameters:
        -----------
        sample_error_index: list,
            A list of indices of misclassified observations.

        Returns:
        -------
        float:
            Error ratio.
        """
        error_ratio = 0
        for index_ in sample_error_index:
            error_ratio += self.sample_weight[index_]
        return error_ratio

    def find_best_node(self, combine, node_list):
        """
        Find the best split point for the variable.
        Also update the tree dict.

        Parameters:
        -----------
        combine: array,
            The array of the dataset.
        node_list: list,
            A list of the median points of the continuous variable.

        Returns:
        -------
        best_node: float,
            The best split point.
        min_error_ratio: float,
            The error ratio of this split point.
        best_sample_error_index: list
            A list of indices of misclassified observations.
        predict_label: array,
            Predicted labels (-1 or 1).
        """
        min_error_ratio = np.inf
        best_node = None
        for node in node_list:
            sample_error_index = self.get_sample_error_index(combine, node)
            if sample_error_index is not None:
                error_ratio = self.calc_error_ratio(sample_error_index)
                if error_ratio < min_error_ratio:
                    min_error_ratio = error_ratio
                    best_node = node
                    best_sample_error_index = sample_error_index
        try:
            # determine the left-child label at the best split point
            left_count = np.bincount(combine[np.where(combine[:, 0] <= best_node), 1][0].astype(int))
            left_label = left_count.argmax()
            self.tree['node'] = best_node
            self.tree['error_ratio'] = min_error_ratio
            if left_label == 0:
                predict_label = np.piecewise(combine[:, 0], [combine[:, 0] <= best_node, combine[:, 0] > best_node],
                                             [-1, 1])
                self.tree['left'] = -1
                self.tree['right'] = 1
            else:
                predict_label = np.piecewise(combine[:, 0], [combine[:, 0] <= best_node, combine[:, 0] > best_node],
                                             [1, -1])
                self.tree['left'] = 1
                self.tree['right'] = -1
            return best_node, min_error_ratio, best_sample_error_index, predict_label
        except TypeError:
            # best_node is still None: no valid split exists for this variable
            return None

    def fit(self, data, label):
        """
        Train the model.

        Parameters:
        -----------
        data: array,
            The array of the dataset.
        label: array,
            The array of labels.

        Returns:
        -------
        variable_index: int,
            The index of the variable.
        best_node: float,
            The best split point.
        min_error_ratio: float,
            The error ratio of this split point.
        best_sample_error_index: list
            A list of indices of misclassified observations.
        predict_label: array,
            Predicted labels (-1 or 1).
        """
        variable_index = self.choose_variable(data)
        self.tree['variable_index'] = variable_index
        combine = np.column_stack((data[:, variable_index], label))
        node_list = self.set_variable_sample(combine)
        # if the chosen variable yields no valid split, pick another one
        while self.find_best_node(combine, node_list) is None:
            variable_index = self.choose_variable(data)
            self.tree['variable_index'] = variable_index
            combine = np.column_stack((data[:, variable_index], label))
            node_list = self.set_variable_sample(combine)
        best_node, min_error_ratio, best_sample_error_index, predict_label = self.find_best_node(combine, node_list)
        return variable_index, best_node, min_error_ratio, best_sample_error_index, predict_label

    @staticmethod
    def predict(data, tree):
        index = tree['variable_index']
        y_predict = np.piecewise(data[:, index], [data[:, index] <= tree['node'], data[:, index] > tree['node']],
                                 [tree['left'], tree['right']])
        return y_predict


class adaboost():
    def __init__(self, n_estimators=50):
        """
        Initialize the adaboost class.

        Parameters:
        -----------
        n_estimators: int (default=50)
            The total number of estimators.
        """
        self.n_estimators = n_estimators
        self.trees = []

    def get_initial_sample_weight(self, n):
        """
        Initialize the weight list for the observations.

        Parameters:
        -----------
        n: int
            Number of observations.

        Returns:
        -------
        array,
            An array of sample weights.
        """
        return np.array([1 / n] * n)

    def calc_estimator_weight(self, sample_error):
        """
        Calculate the weight of an estimator.

        Parameters:
        -----------
        sample_error: float
            The error ratio of the classification.

        Returns:
        -------
        float,
            Estimator weight.
        """
        return 1 / 2 * np.log((1 - sample_error) / sample_error)

    def get_correct_sample(self, sample_error_index, n):
        """
        Return an array of 1s and -1s:
        1 if the sample was correctly classified, otherwise -1.

        Parameters:
        -----------
        sample_error_index: list,
            A list of indices of misclassified observations.
        n: int
            Number of observations.

        Returns:
        -------
        array,
            An array of 1s and -1s,
            1 if the sample was correctly classified, else -1.
        """
        return np.array([i if index not in sample_error_index else -i for index, i in enumerate(np.ones(n))])

    def calc_normalization_factor(self, sample_weight, correct_sample, estimator_weight):
        """
        Calculate the normalization factor. This is the denominator
        used when updating the observation weights,
        so that the weights sum to 1.

        Parameters:
        -----------
        sample_weight: list,
            A list of weights, one per observation.
        correct_sample: list,
            A list of 1s and -1s,
            1 if the sample was correctly classified, else -1.
        estimator_weight: float,
            Estimator weight.

        Returns:
        -------
        float,
            Normalization factor.
        """
        normalization_factor = np.sum(sample_weight * np.exp(-estimator_weight * correct_sample))
        return normalization_factor

    def update_sample_weight(self, sample_weight, correct_sample, estimator_weight, normalization_factor):
        """
        Update the weight of each observation: the weights of
        misclassified observations are increased, the others decreased.

        Parameters:
        -----------
        sample_weight: list,
            A list of weights, one per observation.
        correct_sample: list,
            A list of 1s and -1s,
            1 if the sample was correctly classified, else -1.
        estimator_weight: float,
            Estimator weight.
        normalization_factor: float,
            Normalization factor.

        Returns:
        -------
        array,
            Updated sample weights.
        """
        return sample_weight / normalization_factor * np.exp(-estimator_weight * correct_sample)

    def base_estimator(self, data, label, _iter):
        """
        Fit a model using the adaboost_base_estimator class.
        (_iter is unused here; it is kept for the call signature.)

        Parameters:
        -----------
        data: array,
            The array of the dataset.
        label: array,
            The array of labels.
        _iter: int,
            The index of the estimator.

        Returns:
        -------
        See adaboost_base_estimator.fit
        """
        return self.abe.fit(data, label)

    def boost(self, data, label):
        """
        Main boost function.

        Parameters:
        -----------
        data: array,
            The array of the dataset.
        label: array,
            The array of labels.

        Returns:
        -------
        variable_index_list: list,
            The list of variable indices.
        best_node_list: list,
            The list of best split points.
        estimator_weight_list: list,
            The list of estimator weights.
        predict_label_list: list,
            The list of predicted labels.
        """
        sample_error = np.inf
        self.sample_weight = self.get_initial_sample_weight(data.shape[0])
        variable_index_list = []
        best_node_list = []
        estimator_weight_list = []
        predict_label_list = []
        _iter = 0
        while sample_error != 0 and _iter < self.n_estimators:
            self.abe = adaboost_base_estimator(self.sample_weight)
            self.tree = self.abe.tree
            variable_index, best_node, sample_error, sample_error_index, predict_label = self.base_estimator(data,
                                                                                                             label,
                                                                                                             _iter)
            estimator_weight = self.calc_estimator_weight(sample_error)
            # append estimator information to the tree list
            self.trees.append({'estimator_num': _iter, 'weight': estimator_weight, 'tree': self.abe.tree})
            variable_index_list.append(variable_index)
            best_node_list.append(best_node)
            estimator_weight_list.append(estimator_weight)
            predict_label_list.append(predict_label)
            correct_sample = self.get_correct_sample(sample_error_index, self.sample_weight.shape[0])
            normalization_factor = self.calc_normalization_factor(self.sample_weight, correct_sample, estimator_weight)
            updated_sample_weight = self.update_sample_weight(self.sample_weight, correct_sample, estimator_weight,
                                                              normalization_factor)
            # update sample weight
            self.sample_weight = updated_sample_weight
            _iter += 1
        return variable_index_list, best_node_list, estimator_weight_list, predict_label_list

    def fit(self, data, label):
        """
        Fit function.

        Parameters:
        -----------
        data: array,
            The array of the dataset.
        label: array,
            The array of labels.
        """
        variable_index_list, best_node_list, estimator_weight_list, predict_label_list = self.boost(data, label)
        self.boost_tree = [variable_index_list, best_node_list, estimator_weight_list, predict_label_list]
        self.variable_index_list = self.boost_tree[0]
        self.best_node_list = self.boost_tree[1]
        self.estimator_weight_list = self.boost_tree[2]

    def predict(self, data):
        """
        Predict function.

        Parameters:
        -----------
        data: array,
            The array of the dataset.

        Returns:
        --------
        strong_predict: array,
            The array of predictions.
        """
        strong_predict_sum = np.zeros(data.shape[0])
        predict_label_list = []
        for tree_dict in self.trees:
            predict_label = adaboost_base_estimator.predict(data, tree_dict['tree'])
            predict_label_list.append(predict_label)
        for estimator_weight, predict_label in zip(self.estimator_weight_list, predict_label_list):
            weak_estimator_predict_ = np.multiply(estimator_weight, predict_label)
            strong_predict_sum += weak_estimator_predict_
        # use the sign function to get the final prediction
        strong_predict = np.sign(strong_predict_sum)
        # map -1 back to the original 0/1 labels
        strong_predict[strong_predict == -1] = 0
        return strong_predict
```
3.2 Testing
3.2.1 Load the test data and split it into training and test sets
```python
import numpy as np
import sklearn.datasets as ds
import pandas as pd

d = ds.load_breast_cancer()
data = d['data']
label = d['target']

def get_train_test_data(data, label, percentile=0.8):
    # stratified split: sample `percentile` of each class for training
    data_df = pd.DataFrame(data)
    label_df = pd.DataFrame(label, columns=['label'])
    combine_df = pd.concat([data_df, label_df], axis=1)
    label_count = label_df.groupby(label).count()
    train_df = pd.DataFrame()
    for label_name in label_count.index.tolist():
        tmp = combine_df[combine_df['label'] == label_name]
        index_list = tmp.index.tolist()
        random_select_index = np.random.choice(index_list, round(len(index_list) * percentile), replace=False)
        tmp_df = tmp.loc[random_select_index]
        train_df = pd.concat([train_df, tmp_df], axis=0)
    test_df = combine_df.drop(train_df.index)
    train_data, train_label, test_data, test_label = train_df[train_df.columns[:-1]], train_df['label'], test_df[test_df.columns[:-1]], test_df['label']
    return np.array(train_data), np.array(train_label), np.array(test_data), np.array(test_label)

def compare_result(predict, test):
    # fraction of predictions matching the true labels (accuracy)
    count = 0
    for i, j in zip(predict.tolist(), test.tolist()):
        if i == j:
            count += 1
    return count / len(predict)

train_data, train_label, test_data, test_label = get_train_test_data(data, label)
```
3.2.2 Training and prediction
Since the computation has not been optimized, it runs fairly slowly: 100 estimators take roughly 6 to 8 seconds.
The accuracy comes out to 0.9292, which is a reasonably good result.
```python
ada = adaboost(100)
ada.fit(train_data, train_label)
y_predict = ada.predict(test_data)
compare_result(y_predict, test_label)
```
3.2.3 Inspecting each weak classifier's weight, error rate, and tree structure
The trees attribute exposes the details of each weak classifier.

```python
trees = ada.trees
```
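Based on how boost() populates self.trees, each entry is a dict of roughly the following shape; the numeric values here are invented for illustration, not actual output:

```python
# one entry of ada.trees (values made up for illustration)
{
    'estimator_num': 0,        # index of the boosting round
    'weight': 0.66,            # alpha_i, this stump's vote weight
    'tree': {
        'variable_index': 22,  # feature the stump splits on
        'node': 105.95,        # chosen split point
        'error_ratio': 0.21,   # weighted error of this stump
        'left': -1,            # label assigned to x <= node
        'right': 1,            # label assigned to x > node
    },
}
```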