Third Implementation of Logistic Regression (C++): Another Attempt

xceman1997

Categories: Machine Learning, NLP

Article link: https://blog.csdn.net/xceman1997/article/details/18566857


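I have now written three implementations of Logistic Regression and published several blog posts about them; I am getting a bit hooked on writing these.

This post goes one step further into the implementation of SGD. The earlier code was structured like this: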

// the sample format: classid feature1_value feature2_value...
bool LogisticRegression::TrainSGDOnSampleFile (
			const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
			double dLearningRate = 0.05,								// about the learning
			int iMaxLoop = 1, double dMinImproveRatio = 0.01			// about the stop criteria
			)
{
	......
	for (int iLoop = 0; iLoop < iMaxLoop; iLoop++)
	{
		......
		while (getline (in, sLine))
		{
			Sample theSample;
			if (ReadSampleFrmLine (sLine, theSample))
			{
				......
			}
		}

		......

		if (dTmpRatio < dMinImproveRatio)
			break;
		else
		{
			......
			//reset the current reading position of file
			in.clear();
			in.seekg (0, ios::beg);
		}
	}

	return true;
}

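During training, the file is read line by line: read one line, parse one sample, update the weights once, and repeat. What makes this approach "dumb" is that disk I/O becomes the efficiency bottleneck. If all the samples were loaded into memory and traversed there, training would be far faster; my rough estimate is a thousandfold or more compared with going through disk I/O. The reason I used the dumb approach anyway is that back in school, when I trained language models, the usual corpus was ten years of People's Daily (around 10 GB), which could not be loaded into memory in one go. Ever since, whenever I write a program, I carry the assumption that the target dataset is large, too large for a single PC to hold in memory.

After talking for a while, I notice I have drifted off topic. No wonder I managed to go off topic even in my college entrance exam essay.

Let's look at the dataset used for training in the previous post. The data is arranged quite regularly: the samples of class 0 come first, then the samples of class 1, then class 2, and so on. I wondered whether this ordering affects training, for example the convergence speed or the accuracy on the test set. So I changed the SGD algorithm as follows: load all samples into memory, and pick one sample at random for each training step. The code is as follows: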

// the sample format: classid feature1_value feature2_value...
bool LogisticRegression::TrainSGDOnSampleFileEx2 (
			const char * sFileName, int iClassNum, int iFeatureNum,		// about the samples
			double dLearningRate = 0.05,								// about the learning
			int iMaxLoop = 1, double dMinImproveRatio = 0.01			// about the stop criteria
			)
{
	ifstream in (sFileName);
	if (!in)
	{
		cerr << "Can not open the file of " << sFileName << endl;
		return false;
	}

	if (!InitThetaMatrix (iClassNum, iFeatureNum))
		return false;

	vector<Sample> SampleVec;
	if (!LoadAllSamples (sFileName, SampleVec))
		return false;

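	double dCost = 0.0;
	double dPreCost = 100.0;
	// seed once, before the loop: reseeding with time(NULL) in every loop
	// repeats the same random sequence when two loops start within the same second
	srand((unsigned)time(NULL));
	for (int iLoop = 0; iLoop < iMaxLoop; iLoop++)
	{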
		int iErrNum = 0;
		int iSampleNum = (int)SampleVec.size();
		for (int i=0; i<iSampleNum; i++)
		{
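			// divide by RAND_MAX + 1.0 so the value lies in [0, 1);
			// with plain RAND_MAX the index could reach iSampleNum and go out of range
			double dRandomFloat = (double)rand() / (RAND_MAX + 1.0);
			int iSampleIndex = (int)(dRandomFloat * iSampleNum);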

			vector<double> ClassProbVec;
			int iPredClassIndex = CalcFuncOutByFeaVecForAllClass (SampleVec[iSampleIndex].FeaValNodeVec, ClassProbVec);
			if (iPredClassIndex != SampleVec[iSampleIndex].iClass)
				iErrNum++;

			dCost += UpdateThetaMatrix (SampleVec[iSampleIndex], ClassProbVec, dLearningRate); 
		}

		dCost /= iSampleNum;
		double dTmpRatio = (dPreCost - dCost) / dPreCost;
		double dTmpErrRate = (double)iErrNum / iSampleNum;

		// show info on screen
		cout << "In loop " << iLoop << ": current cost (" << dCost << ") previous cost (" << dPreCost << ") ratio (" << dTmpRatio << ") "<< endl;
		cout << "And Error rate : " << dTmpErrRate << endl;

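		// the relative-improvement check (dTmpRatio < dMinImproveRatio) is disabled here;
		// training stops early only when the average cost drops below a fixed threshold
		if (dCost < 0.001)
			break;

		dPreCost = dCost;
		dCost = 0.0;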
	}

	return true;
}

The snippet that loads the samples:

vector<Sample> SampleVec;
if (!LoadAllSamples (sFileName, SampleVec))
	return false;
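
The body of LoadAllSamples is not shown in this post. As a rough idea of what such a loader could look like for the "classid feature1_value feature2_value..." format, here is a minimal sketch. DenseSample, ParseDenseSample and LoadAllDenseSamples are hypothetical names made up for this illustration, and the dense std::vector<double> layout is an assumption; the real Sample type stores its features in FeaValNodeVec, whose definition is not given here.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the post's Sample type (an assumption).
struct DenseSample
{
	int iClass;
	std::vector<double> FeatureVec;
};

// Parses one line of the form "classid feature1_value feature2_value ...".
// Returns false if the class id is missing or the line carries no feature value.
bool ParseDenseSample (const std::string & sLine, DenseSample & theSample)
{
	std::istringstream iss (sLine);
	if (!(iss >> theSample.iClass))
		return false;

	theSample.FeatureVec.clear();
	double dValue = 0.0;
	while (iss >> dValue)
		theSample.FeatureVec.push_back (dValue);

	return !theSample.FeatureVec.empty();
}

// Reads every parsable line of the stream into memory.
// Lines that cannot be parsed are skipped; returns false if nothing was loaded.
bool LoadAllDenseSamples (std::istream & in, std::vector<DenseSample> & SampleVec)
{
	SampleVec.clear();
	std::string sLine;
	while (std::getline (in, sLine))
	{
		DenseSample theSample;
		if (ParseDenseSample (sLine, theSample))
			SampleVec.push_back (theSample);
	}
	return !SampleVec.empty();
}
```

Taking a std::istream instead of a file name keeps the sketch easy to try on a std::istringstream; passing a std::ifstream works the same way.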

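The snippet that picks a sample at random:

// divide by RAND_MAX + 1.0 so the value lies in [0, 1) and the index stays below iSampleNum
double dRandomFloat = (double)rand() / (RAND_MAX + 1.0);
int iSampleIndex = (int)(dRandomFloat * iSampleNum);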
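
If C++11 is available, the same kind of uniform pick can also be written with the standard <random> header, which avoids scaling rand() by hand. This is only a sketch of an alternative, not the code used for the results in this post; SampleIndexPicker is a hypothetical helper name introduced for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <random>

// Hypothetical helper: draws sample indices uniformly from [0, iSampleNum).
class SampleIndexPicker
{
public:
	// iSampleNum must be greater than 0; iSeed makes a run reproducible.
	SampleIndexPicker (std::size_t iSampleNum, unsigned int iSeed)
		: m_Engine (iSeed), m_Dist (0, iSampleNum - 1)
	{
		assert (iSampleNum > 0);
	}

	// Returns the next random index; both bounds of the distribution are inclusive,
	// so the result is always a valid index into a vector of iSampleNum samples.
	std::size_t Next ()
	{
		return m_Dist (m_Engine);
	}

private:
	std::mt19937 m_Engine;
	std::uniform_int_distribution<std::size_t> m_Dist;
};
```

In the training loop, one picker would be created once before the outer loop, for example SampleIndexPicker picker (SampleVec.size(), 12345u), and picker.Next() would take the place of the two lines above.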

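Judging from the training log, the modified algorithm does converge somewhat faster. On the test set, its performance is 0.863395. With the same parameters, the method that traverses the samples in their original order reaches 0.851459, so random selection is better by about one percentage point.

From this result, a rough conclusion is that the effect of SGD depends on the order of the training samples: a random order is best, rather than some regular arrangement. Of course, I still stand by my earlier view that assuming all samples can be loaded into memory is unrealistic. If that really were the case, I would just use SVM; why choose LR at all?

The end.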
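
One related variant, when the samples do fit in memory: the code above draws indices with replacement, so within one pass some samples may be picked several times and others not at all. A common alternative is to shuffle the index order once per pass, so that every sample is visited exactly once in random order. The sketch below shows only the ordering step; MakeEpochOrder is a hypothetical name, and this variant was not used for the numbers reported above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Returns a random permutation of the indices 0 .. iSampleNum-1.
// Calling it once per pass gives every sample exactly one update per pass.
std::vector<std::size_t> MakeEpochOrder (std::size_t iSampleNum, std::mt19937 & theEngine)
{
	std::vector<std::size_t> OrderVec (iSampleNum);
	std::iota (OrderVec.begin (), OrderVec.end (), std::size_t (0));	// fill with 0, 1, 2, ...
	std::shuffle (OrderVec.begin (), OrderVec.end (), theEngine);		// random order
	return OrderVec;
}
```

Inside the outer loop, the inner loop would then run over the returned vector and use each element as iSampleIndex. Whether this changes the test-set result compared with the random pick above would have to be measured.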

Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/18566857
