The model implementation comes down to two main tasks: training and prediction. Start with training. The function is named TrainSGDOnSampleFile — spelled out, "train the model by stochastic gradient descent on the file containing samples" — so the algorithm is SGD (not batch GD). The implementation:
// the sample format: classid featureId1 featureId2...
bool LogisticRegression::TrainSGDOnSampleFile (
const char * sFileName, int iMaxFeatureNum, // about the samples
double dLearningRate = 0.05, // about the learning
int iMaxLoop = 1, double dMinImproveRatio = 0.01 // about the stop criteria
)
{
ifstream in (sFileName);
if (!in)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
ThetaVec.clear();
ThetaVec.resize (iMaxFeatureNum, 0.0);
double dCost = 0.0;
double dPreCost = 1.0;
for (int iLoop = 0; iLoop < iMaxLoop; iLoop++)
{
int iSampleNum = 0;
string sLine;
while (getline (in, sLine))
{
Sample theSample;
if (ReadSampleFrmLine (sLine, theSample))
{
double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
UpdateThetaVec (theSample, dY, dLearningRate);
// the cost for one sample is : cost = -( iClass * log (dY) + (1.0 - iClass) * log(1.0 - dY) )
// that is: cost = -log(dY) when iClass is 1, and cost = -log(1.0 - dY) when iClass is 0
if (theSample.iClass > 0)
dCost -= log (dY);
else
dCost -= log (1.0 - dY);
iSampleNum++;
}
}
dCost /= iSampleNum;
double dTmpRatio = (dPreCost - dCost) / dPreCost;
// show info on screen
cout << "In loop " << iLoop << ": current cost (" << dCost << ") previous cost (" << dPreCost << ") ratio (" << dTmpRatio << ") "<< endl;
if (dTmpRatio < dMinImproveRatio)
break;
else
{
dPreCost = dCost;
dCost = 0.0;
//reset the current reading position of file
in.clear();
in.seekg (0, ios::beg);
}
}
return true;
}
The function's parameters fall into three groups: (1) sample-related — the sample file path and the maximum number of features (which determines the number of coefficients); (2) learning-related — the learning rate; (3) stopping criteria — the maximum number of iterations and the minimum improvement ratio.
The procedure is roughly: (1) ReadSampleFrmLine reads one sample from the file (one line per sample; see the comment above the function); (2) CalcFuncOutByFeaVec computes the model's output for that sample under the current parameters (add a threshold and it becomes the predicted class); (3) UpdateThetaVec updates the coefficients from the model output and the sample's actual class. Finally, check the stopping criterion; if it is not met, run another pass over the file.
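ReadSampleFrmLine itself is not shown in this post. A minimal sketch of what it might look like, assuming the line format `classid featureId1 featureId2...` from the comment above; the `Sample` struct here is a hypothetical stand-in for the one the class actually uses:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the Sample struct used by LogisticRegression.
struct Sample {
    int iClass;                 // 0 or 1
    std::vector<int> FeaIdVec;  // ids of the features whose value is 1
};

// Parse one line of "classid featureId1 featureId2 ..." into a Sample.
// Returns false on an empty or malformed line.
bool ReadSampleFrmLine(const std::string& sLine, Sample& theSample)
{
    std::istringstream iss(sLine);
    if (!(iss >> theSample.iClass))
        return false;
    theSample.FeaIdVec.clear();
    int iFeaId;
    while (iss >> iFeaId)
        theSample.FeaIdVec.push_back(iFeaId);
    return true;
}
```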
UpdateThetaVec is implemented as follows:
// the update formula is : theta_new = theta_old - dLearningRate * (dY - iClass) * dXi
// in which iClass is 0-1, and dXi is 0-1
void LogisticRegression::UpdateThetaVec(Sample & theSample, double dY, double dLearningRate)
{
double dGradient = dY - theSample.iClass;
double dDelta = dGradient * dLearningRate;
vector<int>::iterator p = theSample.FeaIdVec.begin();
while (p != theSample.FeaIdVec.end())
{
if (*p < (int)ThetaVec.size())
{
ThetaVec[*p] -= dDelta;
}
p++;
}
}
Because the feature inputs are 0-1, the update formula shows that weights of features whose input is 0 need no update at all (recall the restriction in 2-(2)).
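The same sparse update can be sketched as a free function outside the class, to make the point concrete — only the active feature ids move, everything else stays untouched:

```cpp
#include <vector>

// Sparse SGD update for 0-1 features:
//   theta_i -= learningRate * (y - class)   for every active feature id i.
// Features with input 0 contribute nothing to the gradient, so they are skipped.
void UpdateThetaSparse(std::vector<double>& theta,
                       const std::vector<int>& feaIds,
                       int iClass, double dY, double dLearningRate)
{
    double dDelta = dLearningRate * (dY - iClass);
    for (int id : feaIds)
        if (id >= 0 && id < (int)theta.size())
            theta[id] -= dDelta;
}
```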
CalcFuncOutByFeaVec is implemented as follows:
double LogisticRegression::CalcFuncOutByFeaVec(vector<int> & FeaIdVec)
{
double dX = 0.0;
vector<int>::iterator p = FeaIdVec.begin();
while (p != FeaIdVec.end())
{
if (*p < (int)ThetaVec.size()) // all input is evil
dX += ThetaVec[*p]; // actually it is ThetaVec[*p] * 1.0
p++;
}
double dY = Sigmoid (dX);
return dY;
}
This is really a vector inner product, but features with input 0 contribute nothing to the result, so for features with input 1 we just add the weight directly. The Sigmoid function is implemented as follows:
double LogisticRegression::Sigmoid(double x)
{
double dTmpOne = exp (x);
double dTmpTwo = dTmpOne + 1;
return dTmpOne / dTmpTwo;
}
The formula used is e^x / (1 + e^x) rather than the standard 1 / (1 + e^(-x)); for negative x this form tends to behave better numerically. Note, though, that it is not safe over the whole range: for large positive x, exp(x) overflows and e^x / (1 + e^x) evaluates to inf/inf = NaN, which is why robust implementations branch on the sign of x.
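A sketch of the branch-on-sign variant, which picks whichever form only ever calls exp() with a non-positive argument and therefore never overflows:

```cpp
#include <cmath>

// Numerically stable sigmoid: exp() is only ever called with a non-positive
// argument, so its result is always in (0, 1] and cannot overflow to inf.
double SigmoidStable(double x)
{
    if (x >= 0.0) {
        double z = std::exp(-x);   // uses 1 / (1 + e^(-x))
        return 1.0 / (1.0 + z);
    } else {
        double z = std::exp(x);    // uses e^x / (1 + e^x)
        return z / (1.0 + z);
    }
}
```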
Model saving, function SaveLRModelTxt:
bool LogisticRegression::SaveLRModelTxt(const char *sFileName)
{
if (ThetaVec.empty())
{
cerr << "The Theta vector is empty" << endl;
return false;
}
ofstream out (sFileName);
if (!out)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
out << (int)ThetaVec.size() << "\n";
copy (ThetaVec.begin(), ThetaVec.end(), ostream_iterator<double>(out, "\n"));
return true;
}
Model loading, function LoadLRModelTxt:
bool LogisticRegression::LoadLRModelTxt (const char * sFileName)
{
ifstream in (sFileName);
if (!in)
{
cerr << "Can not open the file of " << sFileName << endl;
return false;
}
ThetaVec.clear();
int iNum = 0;
in >> iNum;
ThetaVec.resize (iNum, 0.0);
for (int i=0; i<iNum; i++)
{
in >> ThetaVec[i];
}
return true;
}
Finally, model prediction, function PredictOnSampleFile:
bool LogisticRegression::PredictOnSampleFile(const char *sFileIn, const char *sFileOut, const char *sFileLog)
{
ifstream in (sFileIn);
ofstream out (sFileOut);
ofstream log (sFileLog);
if (!in || !out || !log)
{
cerr << "Can not open the files " << endl;
return false;
}
int iSampleNum = 0;
int iCorrectNum = 0;
string sLine;
while (getline (in, sLine))
{
Sample theSample;
if (ReadSampleFrmLine (sLine, theSample))
{
int iClass = PredictOneSample (theSample);
if (iClass == theSample.iClass)
iCorrectNum++;
out << iClass << " ";
copy (theSample.FeaIdVec.begin(), theSample.FeaIdVec.end(), ostream_iterator<int>(out, " "));
out << endl;
}
else
out << "bad input" << endl;
iSampleNum++;
}
log << "The total number of samples is : " << iSampleNum << endl;
log << "The number of correct predictions is : " << iCorrectNum << endl;
log << "Accuracy : " << (double)iCorrectNum / iSampleNum << endl;
return true;
}
The test file uses the same format as the training file (for large-scale data without labeled classes, the interface will have to change). The output is the predicted results plus summary statistics (accuracy). PredictOneSample is called on each sample; its implementation is simply CalcFuncOutByFeaVec plus a threshold:
int LogisticRegression::PredictOneSample (Sample & theSample)
{
double dY = CalcFuncOutByFeaVec (theSample.FeaIdVec);
if (dY > 0.5)
return 1;
else
return 0;
}
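Putting the sparse score and the 0.5 threshold together, the whole forward pass can be sketched as one free function (mirroring CalcFuncOutByFeaVec + PredictOneSample outside the class):

```cpp
#include <cmath>
#include <vector>

// Sparse forward pass plus 0.5 threshold. Since every active feature has
// value 1, the inner product is just a sum over the active weights.
int PredictSparse(const std::vector<double>& theta,
                  const std::vector<int>& feaIds)
{
    double dX = 0.0;
    for (int id : feaIds)
        if (id >= 0 && id < (int)theta.size())
            dX += theta[id];                   // x_i is implicitly 1
    double dY = 1.0 / (1.0 + std::exp(-dX));   // sigmoid
    return dY > 0.5 ? 1 : 0;
}
```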
Done.
Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/17881941