The model implementation comes down to two main tasks: training and prediction. Start with training. The function is named TrainSGDOnSampleFile — spelled out, "train the model by stochastic gradient descent on the file containing samples" — so the algorithm is SGD (not batch GD). The implementation:
// the sample format: classid featureId1 featureId2...
bool LogisticRegression::TrainSGDOnSampleFile (
const char * sFileName, int iMaxFeatureNum, // about the samples
double dLearningRate = 0.05, // about the learning
int iMaxLoop = 1, double dMinImproveRatio = 0.01 // about the stop criteria
)
{
ifstream in (sFileName);
if (!in)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
ThetaVec.clear();
ThetaVec.resize (iMaxFeatureNum, 0.0);
double dCost = 0.0;
double dPreCost = 1.0;
for (int iLoop = 0; iLoop < iMaxLoop; iLoop++)
{
int iSampleNum = 0;
string sLine;
while (getline (in, sLine))
{
Sample theSample;
if (ReadSampleFrmLine (sLine, theSample))
{
double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
UpdateThetaVec (theSample, dY, dLearningRate);
// the cost for one sample is : cost = -( iClass * log (dY) + (1.0 - iClass) * log(1.0 - dY) )
// that is: cost = -log(dY) when iClass is 1, and cost = -log(1.0 - dY) when iClass is 0
if (theSample.iClass > 0)
dCost -= log (dY);
else
dCost -= log (1.0 - dY);
iSampleNum++;
}
}
dCost /= iSampleNum;
double dTmpRatio = (dPreCost - dCost) / dPreCost;
// show info on screen
cout << "In loop " << iLoop << ": current cost (" << dCost << ") previous cost (" << dPreCost << ") ratio (" << dTmpRatio << ") "<< endl;
if (dTmpRatio < dMinImproveRatio)
break;
else
{
dPreCost = dCost;
dCost = 0.0;
//reset the current reading position of file
in.clear();
in.seekg (0, ios::beg);
}
}
return true;
}
The function's parameters fall into three groups: (1) sample-related — the sample file path and the maximum number of features (which determines the number of coefficients); (2) learning-related — the learning rate; (3) stopping criteria — the maximum number of iterations and the minimum improvement ratio.
The procedure is roughly: (1) ReadSampleFrmLine reads one sample from the file (one line per sample; see the comment above the function); (2) CalcFuncOutByFeaVec computes the model's output for that sample under the current parameters (add a threshold and it becomes the predicted class); (3) UpdateThetaVec updates the coefficients from the model output and the sample's actual class. Finally, check the stopping criterion; if it is not met, run another pass over the file.
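ReadSampleFrmLine itself is not shown in this post. A minimal sketch of what it might look like, assuming the line format `classid featureId1 featureId2...` from the comment above; the `Sample` struct here is a hypothetical stand-in for the one the class actually uses:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the Sample struct used by LogisticRegression.
struct Sample {
    int iClass;                 // 0 or 1
    std::vector<int> FeaIdVec;  // ids of the features whose value is 1
};

// Parse one line of "classid featureId1 featureId2 ..." into a Sample.
// Returns false on an empty or malformed line.
bool ReadSampleFrmLine(const std::string& sLine, Sample& theSample)
{
    std::istringstream iss(sLine);
    if (!(iss >> theSample.iClass))
        return false;
    theSample.FeaIdVec.clear();
    int iFeaId;
    while (iss >> iFeaId)
        theSample.FeaIdVec.push_back(iFeaId);
    return true;
}
```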
UpdateThetaVec is implemented as follows:
// the update formula is : theta_new = theta_old - dLearningRate * (dY - iClass) * dXi
// in which iClass is 0-1, and dXi is 0-1
void LogisticRegression::UpdateThetaVec(Sample & theSample, double dY, double dLearningRate)
{
double dGradient = dY - theSample.iClass;
double dDelta = dGradient * dLearningRate;
vector<int>::iterator p = theSample.FeaIdVec.begin();
while (p != theSample.FeaIdVec.end())
{
if (*p < (int)ThetaVec.size())
{
ThetaVec[*p] -= dDelta;
}
p++;
}
}
Because the feature inputs are 0-1, the update formula shows that weights of features whose input is 0 need no update at all (recall the restriction in 2-(2)).
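The same sparse update can be sketched as a free function outside the class, to make the point concrete — only the active feature ids move, everything else stays untouched:

```cpp
#include <vector>

// Sparse SGD update for 0-1 features:
//   theta_i -= learningRate * (y - class)   for every active feature id i.
// Features with input 0 contribute nothing to the gradient, so they are skipped.
void UpdateThetaSparse(std::vector<double>& theta,
                       const std::vector<int>& feaIds,
                       int iClass, double dY, double dLearningRate)
{
    double dDelta = dLearningRate * (dY - iClass);
    for (int id : feaIds)
        if (id >= 0 && id < (int)theta.size())
            theta[id] -= dDelta;
}
```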
CalcFuncOutByFeaVec is implemented as follows:
double LogisticRegression::CalcFuncOutByFeaVec(vector<int> & FeaIdVec)
{
double dX = 0.0;
vector<int>::iterator p = FeaIdVec.begin();
while (p != FeaIdVec.end())
{
if (*p < (int)ThetaVec.size()) // all input is evil
dX += ThetaVec[*p]; // actually it is ThetaVec[*p] * 1.0
p++;
}
double dY = Sigmoid (dX);
return dY;
}
This is really a vector inner product, but features with input 0 contribute nothing to the result, so for features with input 1 we just add the weight directly. The Sigmoid function is implemented as follows:
double LogisticRegression::Sigmoid(double x)
{
double dTmpOne = exp (x);
double dTmpTwo = dTmpOne + 1;
return dTmpOne / dTmpTwo;
}
The formula used is e^x / (1 + e^x) rather than the standard 1 / (1 + e^(-x)); for negative x this form tends to behave better numerically. Note, though, that it is not safe over the whole range: for large positive x, exp(x) overflows and e^x / (1 + e^x) evaluates to inf/inf = NaN, which is why robust implementations branch on the sign of x.
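A sketch of the branch-on-sign variant, which picks whichever form only ever calls exp() with a non-positive argument and therefore never overflows:

```cpp
#include <cmath>

// Numerically stable sigmoid: exp() is only ever called with a non-positive
// argument, so its result is always in (0, 1] and cannot overflow to inf.
double SigmoidStable(double x)
{
    if (x >= 0.0) {
        double z = std::exp(-x);   // uses 1 / (1 + e^(-x))
        return 1.0 / (1.0 + z);
    } else {
        double z = std::exp(x);    // uses e^x / (1 + e^x)
        return z / (1.0 + z);
    }
}
```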
Model saving, function SaveLRModelTxt:
bool LogisticRegression::SaveLRModelTxt(const char *sFileName)
{
if (ThetaVec.empty())
{
cerr << "The Theta vector is empty" << endl;
return false;
}
ofstream out (sFileName);
if (!out)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
out << (int)ThetaVec.size() << "\n";
copy (ThetaVec.begin(), ThetaVec.end(), ostream_iterator<double>(out, "\n"));
return true;
}
Model loading, function LoadLRModelTxt:
bool LogisticRegression::LoadLRModelTxt (const char * sFileName)
{
ifstream in (sFileName);
if (!in)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
ThetaVec.clear();
int iNum = 0;
in >> iNum;
ThetaVec.resize (iNum, 0.0);
for (int i=0; i<iNum; i++)
{
in >> ThetaVec[i];
}
return true;
}
Finally, model prediction, function PredictOnSampleFile:
bool LogisticRegression::PredictOnSampleFile(const char *sFileIn, const char *sFileOut, const char *sFileLog)
{
ifstream in (sFileIn);
ofstream out (sFileOut);
ofstream log (sFileLog);
if (!in || !out || !log)
{
cerr << "Can not open the files " << endl;
return false;
}
int iSampleNum = 0;
int iCorrectNum = 0;
string sLine;
while (getline (in, sLine))
{
Sample theSample;
if (ReadSampleFrmLine (sLine, theSample))
{
int iClass = PredictOneSample (theSample);
if (iClass == theSample.iClass)
iCorrectNum++;
out << iClass << " ";
copy (theSample.FeaIdVec.begin(), theSample.FeaIdVec.end(), ostream_iterator<int>(out, " "));
out << endl;
}
else
out << "bad input" << endl;
iSampleNum++;
}
log << "The total number of samples is : " << iSampleNum << endl;
log << "The number of correct predictions is : " << iCorrectNum << endl;
log << "Accuracy : " << (double)iCorrectNum / iSampleNum << endl;
return true;
}
The test file uses the same format as the training file (for large-scale data without labeled classes, the interface will have to change). The output is the predicted results plus summary statistics (accuracy). PredictOneSample is called on each sample; its implementation is simply CalcFuncOutByFeaVec plus a threshold:
int LogisticRegression::PredictOneSample (Sample & theSample)
{
double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
if (dY > 0.5)
return 1;
else
return 0;
}
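Putting the sparse score and the 0.5 threshold together, the whole forward pass can be sketched as one free function (mirroring CalcFuncOutByFeaVec + PredictOneSample outside the class):

```cpp
#include <cmath>
#include <vector>

// Sparse forward pass plus 0.5 threshold. Since every active feature has
// value 1, the inner product is just a sum over the active weights.
int PredictSparse(const std::vector<double>& theta,
                  const std::vector<int>& feaIds)
{
    double dX = 0.0;
    for (int id : feaIds)
        if (id >= 0 && id < (int)theta.size())
            dX += theta[id];                   // x_i is implicitly 1
    double dY = 1.0 / (1.0 + std::exp(-dX));   // sigmoid
    return dY > 0.5 ? 1 : 0;
}
```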
Done.
Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/17881941