Implementing Logistic Regression by Hand (C++): Implementation

Implementing the model comes down to two main tasks: training and prediction. Training first. The function is named TrainSGDOnSampleFile; spelled out, that is "train the model by Stochastic Gradient Descent on the file containing samples", so the algorithm is SGD (not batch GD). The implementation:

// the sample format: classid featureId1 featureId2...
bool LogisticRegression::TrainSGDOnSampleFile (
			const char * sFileName, int iMaxFeatureNum,			// about the samples
			double dLearningRate = 0.05,						// about the learning
			int iMaxLoop = 1, double dMinImproveRatio = 0.01	// about the stop criteria
			)
{
	ifstream in (sFileName);
	if (!in)
	{
		cerr << "Can not open the file of " << sFileName << endl;
		return false;
	}

	ThetaVec.clear();
	ThetaVec.resize (iMaxFeatureNum, 0.0);

	double dCost = 0.0;
	double dPreCost = 1.0;
	for (int iLoop = 0; iLoop < iMaxLoop; iLoop++)
	{
		int iSampleNum = 0;
		string sLine;
		while (getline (in, sLine))
		{
			Sample theSample;
			if (ReadSampleFrmLine (sLine, theSample))
			{
				double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
				UpdateThetaVec (theSample, dY, dLearningRate); 

				// the cost function is : cost = -( iClass * log (dY) + (1.0 - iClass) * log(1.0 - dY) )
				// that is: cost = -log(dY) when iClass is 1, and cost = -log(1.0 - dY) when iClass is 0
				if (theSample.iClass > 0)
					dCost -= log (dY);
				else
					dCost -= log (1.0 - dY);

				iSampleNum++;
			}
		}

		if (iSampleNum > 0)
			dCost /= iSampleNum;	// average cost over this pass; guard against an empty file
		double dTmpRatio = (dPreCost - dCost) / dPreCost;

		// show info on screen
		cout << "In loop " << iLoop << ": current cost (" << dCost << ") previous cost (" << dPreCost << ") ratio (" << dTmpRatio << ") "<< endl;

		if (dTmpRatio < dMinImproveRatio)
			break;
		else
		{
			dPreCost = dCost;
			dCost = 0.0;
			//reset the current reading position of file
			in.clear();
			in.seekg (0, ios::beg);
		}
	}

	return true;
}

The function's parameters fall into three groups: (1) sample-related: the sample file path and the maximum number of features (which determines the number of feature weights); (2) learning-related: the learning rate; (3) stopping criteria: the maximum number of passes and the minimum improvement ratio.

The function proceeds roughly as follows: (1) ReadSampleFrmLine reads one sample from the file (one line per sample; see the comment above the function); (2) CalcFuncOutByFeaVec computes the model's output for that sample under the current parameters (add a threshold on top and you have the predicted class); (3) UpdateThetaVec updates the feature weights from the model output and the sample's actual class. Finally, the loop checks whether the stopping criteria are met; if not, it starts another pass.
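ReadSampleFrmLine itself is not shown in this post. Based on the sample format given in the comment (`classid featureId1 featureId2...`), a minimal sketch might look like the following; the Sample layout here is an assumption, inferred from how UpdateThetaVec and CalcFuncOutByFeaVec use it:

```cpp
#include <sstream>
#include <string>
#include <vector>
using namespace std;

// assumed layout, inferred from how the other functions use Sample
struct Sample
{
	int iClass;				// 0 or 1
	vector<int> FeaIdVec;	// ids of the features whose value is 1
};

// parse one line of "classid featureId1 featureId2..."
bool ReadSampleFrmLine (const string & sLine, Sample & theSample)
{
	istringstream iss (sLine);
	if (!(iss >> theSample.iClass))
		return false;		// empty or malformed line

	int iFeaId;
	while (iss >> iFeaId)
		theSample.FeaIdVec.push_back (iFeaId);

	return true;
}
```

Only the feature ids of the features that are "on" appear on the line, which is what makes the sparse update and scoring below possible.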

The implementation of UpdateThetaVec:

// the update formula is : theta_new = theta_old - dLearningRate * (dY - iClass) * dXi
// in which iClass is 0-1, and dXi is 0-1
void LogisticRegression::UpdateThetaVec(Sample & theSample, double dY, double dLearningRate)
{
	double dGradient = dY - theSample.iClass;
	double dDelta = dGradient * dLearningRate;

	vector<int>::iterator p = theSample.FeaIdVec.begin();
	while (p != theSample.FeaIdVec.end())
	{
		if (*p < (int)ThetaVec.size())
		{
			ThetaVec[*p] -= dDelta;
		}
		p++;
	}
}

Because the feature inputs are 0-1 valued, the update formula shows that a feature whose input is 0 needs no weight update (recall the restriction in 2-(2)).
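The update rule in the comment is the gradient of the per-sample cross-entropy cost: with h = sigmoid(theta . x), d(cost)/d(theta_j) = (h - y) * x_j, which is (dY - iClass) * dXi in the code's names. The standalone helpers below (names are mine, not from the post) let this be checked against a central finite difference of the cost; in particular the gradient is exactly 0 whenever x_j = 0, which is why UpdateThetaVec can safely skip absent features:

```cpp
#include <cmath>
#include <vector>
using namespace std;

static double SigmoidOf (double x) { return 1.0 / (1.0 + exp (-x)); }

// per-sample cross-entropy cost at parameters theta,
// for feature vector x (0-1 valued) and label y (0 or 1)
double SampleCost (const vector<double> & theta, const vector<double> & x, int y)
{
	double dDot = 0.0;
	for (size_t i = 0; i < theta.size(); i++)
		dDot += theta[i] * x[i];
	double h = SigmoidOf (dDot);
	return -(y * log (h) + (1 - y) * log (1.0 - h));
}

// analytic gradient of SampleCost w.r.t. theta[j]: (h - y) * x[j]
double SampleGrad (const vector<double> & theta, const vector<double> & x, int y, size_t j)
{
	double dDot = 0.0;
	for (size_t i = 0; i < theta.size(); i++)
		dDot += theta[i] * x[i];
	return (SigmoidOf (dDot) - y) * x[j];
}
```

Perturbing each theta_j by a small epsilon and differencing SampleCost reproduces SampleGrad to within numerical error, confirming the update formula.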

The implementation of CalcFuncOutByFeaVec:

double LogisticRegression::CalcFuncOutByFeaVec(vector<int> & FeaIdVec)
{
	double dX = 0.0;
	vector<int>::iterator p = FeaIdVec.begin();
	while (p != FeaIdVec.end())
	{
		if (*p < (int)ThetaVec.size())	// all input is evil
			dX += ThetaVec[*p];			// actually it is ThetaVec[*p] * 1.0
		p++;
	}
	double dY = Sigmoid (dX);
	return dY;
}

This is nominally a vector inner product, but features with input 0 contribute nothing to the result, and for features with input 1 the weight is simply added. The Sigmoid function is implemented as:

double LogisticRegression::Sigmoid(double x)
{
	double dTmpOne = exp (x);
	double dTmpTwo = dTmpOne + 1;
	return dTmpOne / dTmpTwo;
}

The formula used is e^x / (1 + e^x), which is algebraically identical to the standard 1 / (1 + e^(-x)). One caveat: for large positive x, exp(x) overflows to infinity and this form returns NaN (inf/inf), whereas the standard form degrades gracefully to 1.0.
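A common hedge against overflow (not part of the original post) is to branch on the sign of x so that exp is only ever called with a non-positive argument, where it cannot overflow:

```cpp
#include <cmath>

// numerically stable sigmoid: exp() is only ever called with a
// non-positive argument, so it can never overflow to infinity
double StableSigmoid (double x)
{
	if (x >= 0.0)
		return 1.0 / (1.0 + exp (-x));	// exp(-x) <= 1 here
	double dTmp = exp (x);				// exp(x) < 1 here
	return dTmp / (dTmp + 1.0);
}
```

Both branches compute the same mathematical function; the branch merely picks whichever of the two algebraic forms is safe for the given sign of x.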


Saving the model, function SaveLRModelTxt:

bool LogisticRegression::SaveLRModelTxt(const char *sFileName)
{
	if (ThetaVec.empty())
	{
		cerr << "The Theta vector is empty" << endl;
		return false;
	}

	ofstream out (sFileName);
	if (!out)
	{
		cerr << "Can not open the file of " << sFileName << endl;
		return false;
	}

	out << (int)ThetaVec.size() << "\n";
	copy (ThetaVec.begin(), ThetaVec.end(), ostream_iterator<double>(out, "\n"));

	return true;
}

Loading the model, function LoadLRModelTxt:

bool LogisticRegression::LoadLRModelTxt (const char * sFileName)
{
	ifstream in (sFileName);
	if (!in)
	{
		cerr << "Can not open the file of " << sFileName << endl;
		return false;
	}

	ThetaVec.clear();

	int iNum = 0;
	in >> iNum;

	ThetaVec.resize (iNum, 0.0);
	for (int i=0; i<iNum; i++)
	{
		in >> ThetaVec[i];
	}

	return true;
}

Finally, prediction, function PredictOnSampleFile:

bool LogisticRegression::PredictOnSampleFile(const char *sFileIn, const char *sFileOut, const char *sFileLog)
{
	ifstream in (sFileIn);
	ofstream out (sFileOut);
	ofstream log (sFileLog);
	if (!in || !out || !log)
	{
		cerr << "Can not open the files " << endl;
		return false;
	}

	int iSampleNum = 0;
	int iCorrectNum = 0;
	string sLine;
	while (getline (in, sLine))
	{
		Sample theSample;
		if (ReadSampleFrmLine (sLine, theSample))
		{
			int iClass = PredictOneSample (theSample);

			if (iClass == theSample.iClass)
				iCorrectNum++;

			out << iClass << " ";
			copy (theSample.FeaIdVec.begin(), theSample.FeaIdVec.end(), ostream_iterator<int>(out, " "));
			out << endl;
		}
		else
			out << "bad input" << endl;

		iSampleNum++;
	}

	log << "The total number of sample is : " << iSampleNum << endl;
	log << "The correct prediction number is : " << iCorrectNum << endl;
	log << "Accuracy : " << (double)iCorrectNum / iSampleNum << endl;

	return true;
}

The input test file has the same format as the training file (for large-scale data without labeled classes, the interface will have to change later); the function writes out the predictions along with summary statistics (the accuracy). It calls PredictOneSample on each sample, which is just CalcFuncOutByFeaVec plus a threshold:

int LogisticRegression::PredictOneSample (Sample & theSample)
{
	double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
	if (dY > 0.5)
		return 1;
	else
		return 0;
}
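Since the sigmoid is monotonically increasing and Sigmoid(0) = 0.5, the test dY > 0.5 is equivalent to asking whether the raw linear score theta . x is positive, so at prediction time the sigmoid evaluation could be skipped entirely. A small standalone check of that equivalence (helper names are mine):

```cpp
#include <cmath>

double SigmoidFn (double x)
{
	return 1.0 / (1.0 + exp (-x));
}

// predict directly from the raw linear score, no sigmoid needed
int PredictByScore (double dX)
{
	return dX > 0.0 ? 1 : 0;
}

// predict by thresholding the sigmoid output, as PredictOneSample does
int PredictBySigmoid (double dX)
{
	return SigmoidFn (dX) > 0.5 ? 1 : 0;
}
```

The sigmoid is still needed during training (the gradient depends on dY itself), but for classification the two predicates always agree.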

Done.


Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/17881941
