Random Forest Algorithm and Its Implementation (2)


The decision-tree part draws on my advisor's article series 日撸 Java 三百行(61-70天,决策树与集成学习) (300 Lines of Java a Day, Days 61-70: Decision Trees and Ensemble Learning).

1. Handling Continuous Values (following the Watermelon Book, i.e. Zhou Zhihua's Machine Learning)

A continuous attribute, unlike a discrete one, does not have a small, well-defined set of possible values, so a node cannot be split directly on each value the attribute can take. We therefore handle continuous attributes with a bi-partition (binary split) strategy.

The following is my understanding of the relevant material in the Watermelon Book, together with the corresponding design of the code:

1.1 Theory

Given a sample set $D$ and a continuous attribute $a$, suppose $a$ takes $n$ distinct values on $D$. Sorting these values in ascending order gives $\{a^1, a^2, \dots, a^n\}$. A split point $t$ partitions $D$ into two subsets $D_t^-$ and $D_t^+$, where $D_t^-$ holds the samples whose value of $a$ is not greater than $t$, and $D_t^+$ holds those whose value is greater than $t$. For two adjacent values $a^i$ and $a^{i+1}$, any $t$ chosen from the interval $[a^i, a^{i+1})$ produces the same partition, so for a continuous attribute $a$ it suffices to consider the set of $n-1$ candidate split points

$$T_a = \left\{ \left. \frac{a^i + a^{i+1}}{2} \,\right|\, 1 \le i \le n-1 \right\},$$

that is, the midpoint $\frac{a^i + a^{i+1}}{2}$ of each interval $[a^i, a^{i+1})$ serves as a candidate. The best split point is then selected exactly as for a discrete attribute, by adapting the discrete information-gain formula:

$$\mathrm{Gain}(D,a) = \max_{t \in T_a} \mathrm{Gain}(D,a,t) = \max_{t \in T_a}\left( \mathrm{Ent}(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^\lambda|}{|D|}\, \mathrm{Ent}(D_t^\lambda) \right),$$

where $\mathrm{Gain}(D,a,t)$ is the information gain obtained by bisecting the sample set $D$ at split point $t$. We evaluate it at every candidate and take the split point with the largest value.
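
As a quick toy illustration (the numbers are mine, not from the original post): if $a$ takes the three values $\{0.2, 0.4, 0.7\}$ on $D$, then

$$T_a = \left\{ \frac{0.2+0.4}{2},\ \frac{0.4+0.7}{2} \right\} = \{0.3,\ 0.55\},$$

so $\mathrm{Gain}(D,a,t)$ only needs to be evaluated at $t = 0.3$ and $t = 0.55$.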

1.2 Code Implementation

/*
* Use binary partition to deal with a continuous attribute.
* param@paraAttribute: The attribute (column index) to evaluate.
* Return: The minimum weighted conditional entropy over all candidate split
*         points; the corresponding split value is stored in continuesSplitValues.
*/
double Tree::binaryPartition(int paraAttribute)
{
	// Sort.
	int length = availableInstances->getLength();
	double* tempValues = new double[length];
	IntArray* tempIndices = new IntArray(length);
	tempIndices->cloneToMe(availableInstances);
	// Bubble sort.
	bool flag;
	double temp;
	int tempIndex;
	for (int i = 0; i < length; i++)
	{
		tempValues[i] = trainingSet->getValue(availableInstances->getValue(i), paraAttribute);
	}// Of for

	for (int i = 0; i < length; i++)
	{
		flag = true;
		for (int j = length - 1; j > i; j--)
		{
			if (tempValues[j] < tempValues[j - 1])
			{
				temp = tempValues[j];
				tempIndex = tempIndices->getValue(j);
				tempValues[j] = tempValues[j - 1];
				tempIndices->setValue(j, tempIndices->getValue(j - 1));
				tempValues[j - 1] = temp;
				tempIndices->setValue(j - 1, tempIndex);
				flag = false;
			}// Of if
		}// Of for
		if (flag)
		{
			break;
		}// Of if
	}// Of for

	// Search for the split point that minimizes the weighted conditional entropy.
	double splitValue = 0;
	double tempSplitValue;
	double result = 1e100; // The best (minimum) entropy found so far.
	// Statistics: the class distribution on each side of a candidate split.
	int tempNumValues = 2;
	int tempNumInstances = length;
	// The number of instances on each side of the split.
	IntArray* tempValueCounts = new IntArray(tempNumValues);
	DoubleMatrix* tempCountMatrix = new DoubleMatrix(tempNumValues, numClasses);


	for (int i = 0; i < length - 1; i++)
	{
		tempSplitValue = (tempValues[i] + tempValues[i + 1]) / 2;
		// Count the instances on each side of the split: the left side takes
		// sorted indices 0..i, the right side takes indices i+1..length-1.
		tempValueCounts->setValue(0, i + 1);
		tempValueCounts->setValue(1, length - i - 1);
		tempCountMatrix->setZero();

		int tempClass;
		for (int j = 0; j < tempValueCounts->getValue(0); j++) {
			tempClass = trainingLables->getValue(tempIndices->getValue(j));
			tempCountMatrix->setValue(0, tempClass, tempCountMatrix->getValue(0, tempClass) + 1);
		}// Of for
		// The right partition starts at sorted index i + 1, i.e. tempValueCounts->getValue(0).
		for (int j = tempValueCounts->getValue(0); j < tempNumInstances; j++) {
			tempClass = trainingLables->getValue(tempIndices->getValue(j));
			tempCountMatrix->setValue(1, tempClass, tempCountMatrix->getValue(1, tempClass) + 1);
		}// Of for

		// Calculate.
		double resultEntropyT = 0;
		double tempEntropyT, tempFractionT;

		for (int j = 0; j < tempNumValues; j++) {
			tempEntropyT = 0;
			for (int k = 0; k < numClasses; k++) {
				// Probability of class k within partition j of the candidate split.
				tempFractionT = tempCountMatrix->getValue(j, k) / tempValueCounts->getValue(j);
				if (tempFractionT == 0) {
					continue;
				}// Of if
				tempEntropyT += -tempFractionT * log2(tempFractionT);
			}// Of for
			resultEntropyT += (double)tempValueCounts->getValue(j) / tempNumInstances * tempEntropyT;
		}// Of for

		if (result > resultEntropyT) {
			splitValue = tempSplitValue;
			result = resultEntropyT;
		}// Of if
	}// Of for
	delete[] tempValues;
	delete tempIndices;
	delete tempValueCounts;
	delete tempCountMatrix;
	continuesSplitValues->setValue(0, paraAttribute, splitValue);
	return result;
}// Of binaryPartition
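
For reference, here is a minimal standalone sketch of the same search, written by me as a cross-check rather than taken from the project. It replaces the bubble sort with std::sort and updates the two class-count tables incrementally, so each candidate split costs O(numClasses) instead of a rescan of all instances:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <utility>
#include <vector>

// Find the binary split minimizing the weighted conditional entropy.
// Returns {bestSplitValue, minEntropy}; labels must lie in [0, numClasses).
std::pair<double, double> bestBinarySplit(const std::vector<double>& values,
	const std::vector<int>& labels, int numClasses)
{
	int n = (int)values.size();
	std::vector<int> order(n);
	for (int i = 0; i < n; i++) order[i] = i;
	// Sort instance indices by attribute value.
	std::sort(order.begin(), order.end(),
		[&](int a, int b) { return values[a] < values[b]; });

	auto entropy = [](const std::vector<int>& counts, int total) {
		double e = 0;
		for (int c : counts)
			if (c > 0) {
				double p = (double)c / total;
				e -= p * std::log2(p);
			}
		return e;
	};

	// Class counts on each side; move one instance leftward per candidate.
	std::vector<int> left(numClasses, 0), right(numClasses, 0);
	for (int i = 0; i < n; i++) right[labels[order[i]]]++;

	double bestSplit = 0, bestEntropy = 1e100;
	for (int i = 0; i < n - 1; i++) {
		int c = labels[order[i]];
		left[c]++; right[c]--;
		double t = (values[order[i]] + values[order[i + 1]]) / 2;
		double e = (double)(i + 1) / n * entropy(left, i + 1)
			+ (double)(n - i - 1) / n * entropy(right, n - i - 1);
		if (e < bestEntropy) { bestEntropy = e; bestSplit = t; }
	}
	return { bestSplit, bestEntropy };
}

int main()
{
	std::vector<double> values = { 0.2, 0.7, 0.4, 0.3 };
	std::vector<int> labels = { 0, 1, 1, 0 };
	std::pair<double, double> best = bestBinarySplit(values, labels, 2);
	// A split at 0.35 separates the two classes perfectly, so entropy is 0.
	printf("best split = %.2f, entropy = %.3f\n", best.first, best.second);
	return 0;
}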

One point to note: in each evaluation of

$$\mathrm{Gain}(D,a) = \max_{t \in T_a} \left( \mathrm{Ent}(D) - \sum_{\lambda \in \{-,+\}} \frac{|D_t^\lambda|}{|D|}\, \mathrm{Ent}(D_t^\lambda) \right),$$

the term $\mathrm{Ent}(D)$ is identical for every candidate split point, so the code computes only

$$\sum_{\lambda \in \{-,+\}} \frac{|D_t^\lambda|}{|D|}\, \mathrm{Ent}(D_t^\lambda).$$

As a result, the value returned by binaryPartition is not the maximum gain but the minimum of this weighted conditional entropy.
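
The two criteria pick the same split point, since $\mathrm{Ent}(D)$ is a constant with respect to $t$:

$$\arg\max_{t \in T_a} \mathrm{Gain}(D,a,t) = \arg\min_{t \in T_a} \sum_{\lambda \in \{-,+\}} \frac{|D_t^\lambda|}{|D|}\, \mathrm{Ent}(D_t^\lambda).$$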

2. Implementing the Random Forest Algorithm

2.1 Randomization: sampling instances with replacement and sampling attributes without replacement

/*
* Obtain a random set of instances by sampling with replacement.
* Duplicate draws are collapsed, so each returned index appears at most once.
* Return: Available instances.
*/
IntArray* RandomForestClassifier::bootStrap()
{
	int count = 0;
	int tempIndex;
	IntArray* resInstances;
	int length = trainingSet->getRows();
	int* tempIndices = new int[length];
	memset(tempIndices, 0, length * sizeof(int));

	for (int i = 0; i < length; i++)
	{
		tempIndex = rand() % length;
		if (tempIndices[tempIndex] == 0) {
			tempIndices[tempIndex] = 1;
			count++;
		}// Of if
	}// Of for

	resInstances = new IntArray(count);

	for (int i = 0, j = 0; i < length; i++)
	{
		if (tempIndices[i] == 1)
		{
			resInstances->setValue(j++, i);
		}// Of if
	}// Of for
#ifdef IF_DEBUG
	std::cout << "bootStrap() : " << resInstances->toString() << std::endl;
#endif // IF_DEBUG

	delete[] tempIndices;
	return resInstances;
}// Of bootStrap
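
Because duplicate draws are collapsed into the 0/1 marker array, each bootstrap bag keeps only the distinct sampled indices: in expectation about $1 - (1 - 1/n)^n \approx 63.2\%$ of the training instances for large $n$. A quick standalone check (my own snippet, not part of the project):

#include <cstdio>
#include <cstdlib>
#include <ctime>

int main()
{
	const int n = 10000;
	static bool drawn[n] = { false };
	srand((unsigned)time(nullptr));
	int unique = 0;
	for (int i = 0; i < n; i++)
	{
		int k = rand() % n;
		if (!drawn[k]) { drawn[k] = true; unique++; }
	}
	// Expect roughly 0.632 of the indices to be drawn at least once.
	printf("unique fraction = %.3f\n", (double)unique / n);
	return 0;
}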


/*
* Random attributes are obtained through sampling without replacement.
* Return: Available attributes.
*/
IntArray* RandomForestClassifier::getAttributes()
{
	
	int tempIndex;
	int numAvailableAttributes;
	if (maxAttributes != -1 && maxAttributes <= numAttributes)
	{
		numAvailableAttributes = maxAttributes;
	}
	else
	{
		// Default: take the square root of the total number of attributes.
		numAvailableAttributes = (int)sqrt((double)numAttributes);
	}
	int* tempAttributes = new int[numAttributes];
	memset(tempAttributes, 0, numAttributes * sizeof(int));
	IntArray* resAttributes = new IntArray(numAvailableAttributes);

	int count = 0;
	while (count < numAvailableAttributes) {
		tempIndex = rand() % numAttributes;
		if (tempAttributes[tempIndex] == 1) {
			continue;
		}// Of if
		tempAttributes[tempIndex] = 1;
		count++;
	}// Of while

	tempIndex = 0;
	for (int i = 0; i < numAttributes; i++)
	{
		if (tempAttributes[i] == 1) {
			resAttributes->setValue(tempIndex++, i);
		}// Of if
	}// Of for

#ifdef IF_DEBUG
	std::cout << "getAttributes() : " << resAttributes->toString() << std::endl;
#endif // IF_DEBUG

	delete[] tempAttributes;
	return resAttributes;
}// Of getAttributes
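
Falling back to the square root of the attribute count when maxAttributes is unset matches the usual default for classification forests. As an aside, sampling without replacement can also be done with a partial Fisher-Yates shuffle, which avoids the retry loop of the rejection approach above; the following is a standalone sketch of that alternative (mine, not the project's code):

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Draw numWanted distinct attribute indices out of numTotal.
std::vector<int> sampleAttributes(int numTotal, int numWanted, std::mt19937& rng)
{
	std::vector<int> indices(numTotal);
	std::iota(indices.begin(), indices.end(), 0); // 0, 1, ..., numTotal - 1
	for (int i = 0; i < numWanted; i++)
	{
		// Swap a random not-yet-chosen element into position i.
		std::uniform_int_distribution<int> pick(i, numTotal - 1);
		std::swap(indices[i], indices[pick(rng)]);
	}
	indices.resize(numWanted);
	std::sort(indices.begin(), indices.end()); // ascending, like getAttributes()
	return indices;
}

int main()
{
	std::mt19937 rng(42);
	for (int a : sampleAttributes(10, 3, rng)) printf("%d ", a);
	printf("\n");
	return 0;
}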

2.2 Building a given number of trees

/*
* Build a single tree on a bootstrap sample and a random attribute subset.
* Return: A trained tree.
*/
Tree* RandomForestClassifier::buildTree()
{
	IntArray* availableInstances = bootStrap();
	IntArray* availableAttributes = getAttributes();

	Tree* tree = new Tree(trainingSet, trainingLables, availableInstances, availableAttributes, numClasses, 0,
		continuesAttributes, maxDepth, minSampleSplit, criterion);
	tree->train();

	// Note (my assumption): the Tree constructor presumably keeps its own copies
	// of these index arrays, since they are freed immediately after training.
	delete availableInstances;
	delete availableAttributes;
	return tree;
}// Of buildTree

/*
* Build a given number of trees.
*/
void RandomForestClassifier::train()
{
	trees = new Tree * [numTrees];
	for (int i = 0; i < numTrees; i++)
	{
		trees[i] = buildTree();
	}// Of for
}// Of train
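
One practical note (my addition, not from the original post): bootStrap() and getAttributes() both rely on rand(), so the generator should be seeded once before training, otherwise every run grows an identical forest:

srand((unsigned)time(nullptr)); // from <cstdlib>/<ctime>; seed once before the first train()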

2.3 Prediction and voting

/*
* Return the most frequent label among the trees' predictions.
* Ties are broken in favor of the label that reached the winning count first.
* param@paraLabels : The set of labels predicted by each tree.
* Return: The voted label.
*/
int RandomForestClassifier::vote(IntArray* paraLabels)
{
	int* tempCountClasses = new int[numClasses];
	memset(tempCountClasses, 0, numClasses * sizeof(int));
	int max = 0;
	for (int i = 0; i < paraLabels->getLength(); i++)
	{
		tempCountClasses[paraLabels->getValue(i)]++;
		if (tempCountClasses[max] < tempCountClasses[paraLabels->getValue(i)])
		{
			max = paraLabels->getValue(i);
		}// Of if
	}// Of for
	delete[] tempCountClasses;
	return max;
}// Of vote

/*
* Predict the label of one instance by majority vote over all the trees.
*/
int RandomForestClassifier::predict(DoubleMatrix* paraInstance)
{
	IntArray* tempLabels = new IntArray(numTrees);

	for (int i = 0; i < numTrees; i++)
	{
		tempLabels->setValue(i, trees[i]->predict(paraInstance));
	}// Of for

	int resLabel = vote(tempLabels);

	delete tempLabels;

	return resLabel;
}// Of predict
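
Putting the pieces together, evaluation is just a loop over the test set. The sketch below is hypothetical glue code: getTestData(), getTestLabels() and getRow() are assumed accessors mirroring the reader API used in Section 3.2, not necessarily the project's real interface.

// Hypothetical evaluation loop; the accessor names are assumptions.
DoubleMatrix* testData = testReader.getTestData();
IntArray* testLabels = testReader.getTestLabels();
int correct = 0;
for (int i = 0; i < testData->getRows(); i++)
{
	DoubleMatrix* instance = testData->getRow(i); // one test instance
	if (rf.predict(instance) == testLabels->getValue(i))
	{
		correct++;
	}// Of if
	delete instance;
}// Of for
std::cout << "accuracy = " << (double)correct / testData->getRows() << std::endl;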

3. Test Results

3.1 Testing with the weather dataset

This dataset is small and all of its attribute values are discrete, so the random forest can process it without the continuous-value machinery. (Screenshot of the test output omitted.)

3.2 Testing with the iris dataset

Since the attribute values in the iris data are continuous, the random forest must enable continuous-value handling, i.e. the binary partition performed inside each decision tree.

Parameter settings:

RandomForestClassifier rf(
	testReader.getTrainData(),   // training feature matrix
	testReader.getTrainLabels(), // training labels
	50,                          // number of trees
	3,                           // number of label classes
	myContinuesAttributes,       // marks which attributes take continuous values
	1000,                        // maximum depth a tree may reach
	5,                           // minimum number of samples required to split a node
	3,                           // number of attributes to sample per tree
	1);                          // split criterion selector (information gain or Gini index)
