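The earlier LR implementation imposed one restriction: every input feature had to be a 0-1 (binary) feature. In practice, when the features are mostly enumerated (categorical) types, this restriction costs little, because converting such features to 0-1 features is straightforward. When most of the input features are real-valued, however, they must first be mapped to a fixed interval, then discretized into enumerated features, and finally mapped to 0-1 features, and this process loses information. The LR implementation presented here instead takes purely real-valued features as input and performs 0-1 classification on them.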
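To make the information loss concrete, here is a minimal sketch of the discretization pipeline described above. It assumes equal-width binning, which is only one possible choice; the post does not say which scheme was used. `BinIndex` and `OneHot` are hypothetical helper names and are not part of the author's code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: map a real value to one of numBins equal-width
// bins over the fixed interval [lo, hi]. Out-of-range values are clamped.
int BinIndex(double value, double lo, double hi, int numBins)
{
    assert(numBins > 0 && hi > lo);
    double clamped = std::min(std::max(value, lo), hi);
    int idx = static_cast<int>((clamped - lo) / (hi - lo) * numBins);
    return std::min(idx, numBins - 1); // value == hi falls into the last bin
}

// Hypothetical helper: encode the bin index as a vector of 0-1 features.
std::vector<int> OneHot(int idx, int numBins)
{
    assert(idx >= 0 && idx < numBins);
    std::vector<int> bits(numBins, 0);
    bits[idx] = 1;
    return bits;
}
```

With four bins on [0, 1], the values 0.52 and 0.74 both land in bin 2 and receive the same 0-1 encoding, so the model can no longer distinguish them. This is the loss that real-valued input avoids.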
The interface is as follows:
#pragma once
#include <vector>
#include <string>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <algorithm>
#include <cmath>
using namespace std;
// The representation of a feature and its value
class FeaValNode
{
public:
    int iFeatureId;
    double dValue;
    FeaValNode (void);
    ~FeaValNode (void);
};
// The representation of a sample
class Sample
{
public:
    // the class label of a sample: 0 or 1, initialized to -1
    int iClass;
    vector<FeaValNode> FeaValNodeVec;
    Sample (void);
    ~Sample (void);
};
// The logistic regression
class LogisticRegression
{
public:
    LogisticRegression(void);
    ~LogisticRegression(void);
    // train by SGD on the sample file
    bool TrainSGDOnSampleFile (
        const char * sFileName, int iMaxFeatureNum, // about the samples
        double dLearningRate, // about the learning
        int iMaxLoop, double dMinImproveRatio // about the stop criteria
        );
    // save the model to txt file: the theta vector with its size
    bool SaveLRModelTxt (const char * sFileName);
    // load the model from txt file: the theta vector with its size
    bool LoadLRModelTxt (const char * sFileName);
    // load the samples from file, predict by the LR model
    bool PredictOnSampleFile (const char * sFileIn, const char * sFileOut, const char * sFileLog);
    // just for test
    void Test (void);
private:
    // the theta vector for each dimension of feature
    vector<double> ThetaVec;
    // read a sample from a line, return false on failure
    bool ReadSampleFrmLine (string & sLine, Sample & theSample);
    // the Sigmoid function: f(x) = 1 / (1 + exp(-x)) = exp (x) / ( 1 + exp(x) )
    double Sigmoid (double x);
    // calculate the model function output by feature vector
    double CalcFuncOutByFeaVec (vector<FeaValNode> & FeaValNodeVec);
    // calculate the gradient and update the theta values; requires the feature indices to be sorted in ascending order
    void UpdateThetaVec (Sample & theSample, double dY, double dLearningRate);
    // predict the class for one single sample
    int PredictOneSample (Sample & theSample);
};
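The interface of class LogisticRegression is essentially unchanged. The sample representation, class Sample, changes slightly: in addition to each feature's index, it now stores the corresponding feature value (as a FeaValNode).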
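The post shows only the declarations, so the following is a rough sketch of how the model output and a single SGD step might look once each feature carries a real value. It is not the author's implementation. `CalcFuncOut` and `SgdStep` are hypothetical free functions with simplified signatures, and the sketch assumes a plain log-loss gradient with no bias term and no regularization. Unlike the declared `UpdateThetaVec`, it does not rely on the feature indices being sorted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Same fields as the FeaValNode class in the header, reduced to a plain struct.
struct FeaValNode
{
    int iFeatureId;
    double dValue;
};

// Sigmoid written so that exp() is only ever called on a non-positive argument,
// which avoids overflow for large |x|.
double Sigmoid(double x)
{
    if (x >= 0.0)
        return 1.0 / (1.0 + std::exp(-x));
    double e = std::exp(x);
    return e / (1.0 + e);
}

// Hypothetical: h(x) = Sigmoid(sum of theta[id] * value) over the stored features.
double CalcFuncOut(const std::vector<double>& theta, const std::vector<FeaValNode>& feas)
{
    double dot = 0.0;
    for (const FeaValNode& node : feas)
    {
        assert(node.iFeatureId >= 0 &&
               static_cast<std::size_t>(node.iFeatureId) < theta.size());
        dot += theta[node.iFeatureId] * node.dValue;
    }
    return Sigmoid(dot);
}

// Hypothetical: one SGD step on one sample, theta_j += rate * (y - h(x)) * x_j.
// Only the features present in the sample are touched.
void SgdStep(std::vector<double>& theta, const std::vector<FeaValNode>& feas,
             int y, double rate)
{
    double err = static_cast<double>(y) - CalcFuncOut(theta, feas);
    for (const FeaValNode& node : feas)
        theta[node.iFeatureId] += rate * err * node.dValue;
}
```

The only difference from the 0-1 feature case is the factor `node.dValue`: with binary features it is always 1, so both the dot product and the update reduce to sums over the active feature indices.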
Please credit the source when reposting: http://blog.csdn.net/xceman1997/article/details/18136117