Implementing Logistic Regression by Hand (C++): The Interface

xceman1997

Categories: Machine Learning, c/c++, NLP

Article link: https://blog.csdn.net/xceman1997/article/details/17881681

1. Motivation

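Back when I was a student, I learned the basics of logistic regression: the theory, where it applies, and so on. Recently I needed it for work, so I dug up some material and reviewed it. I have long held one view: if you have not implemented a machine learning model yourself, line by line, you cannot claim to really understand it. Over a weekend at home, with beer and music for company, I coded up a simple LR implementation in C++, and I am posting it here for reference.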

2. Constraints

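The LR model itself is simple, but applying it in practice involves many tricks, or rather many constraints, that have to be considered. These tricks and constraints determine how the production code is written. My constraints here are: (1) binary classification only, i.e. 0-1 classification; (2) binary inputs only, i.e. every feature value is either 0 or 1. Given constraint (2), what about continuous inputs such as a person's height or the time spent online? You can define intervals and discretize the continuous value. Take height as an example: below 160 counts as "short" (no offense intended), 160-175 as "medium", and above 175 as "tall". This splits the continuous range into three discrete intervals, and correspondingly the one continuous feature becomes three discrete features (is short, is medium, is tall), which are fed into the LR model as three 0-1 features.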
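The height example can be sketched in code as follows. This is only a minimal illustration: the names HeightFeature, HeightToFeatureId and AddHeightFeature are hypothetical and do not appear in the header below, the unit is assumed to be centimeters, and the handling of the exact boundary values 160 and 175 (both assigned to "medium" here) is an assumption, since the text leaves it open.

```cpp
#include <vector>

// Hypothetical ids for the three height bins (not part of the original code).
enum HeightFeature { kShort = 0, kMedium = 1, kTall = 2 };

// Map a height (assumed to be in cm) to the one bin that fires.
// Assumption: the boundary values 160 and 175 both fall into "medium".
int HeightToFeatureId(double heightCm)
{
    if (heightCm < 160.0) return kShort;
    if (heightCm <= 175.0) return kMedium;
    return kTall;
}

// Append the index of the active 0-1 feature to a sparse feature vector.
// baseId is the (hypothetical) index of the first height feature in the
// overall feature space; only the feature whose value is 1 is stored.
void AddHeightFeature(double heightCm, int baseId, std::vector<int>& feaIdVec)
{
    feaIdVec.push_back(baseId + HeightToFeatureId(heightCm));
}
```

For instance, with baseId set to 10, a height of 180 would add index 12 to the vector, while the other two height features (10 and 11) are simply absent, which means their value is 0.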

3. The Interface

I will just paste the header file directly:

#pragma once

#include <vector>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <algorithm>
#include <cmath>

using namespace std;

// The representation of a sample
class Sample
{
public:
	// the class index for a sample: 0-1 value, init with '-1'
	int iClass;
	// the feature index vector
	// note: it requires that the input value of a feature is either 0 or 1, 
	// it stores all the features of a sample whose value is 1. 
	vector<int> FeaIdVec;

	Sample (void);
	~Sample (void);
};

// The logistic regression 
class LogisticRegression
{
public:
	LogisticRegression(void);
	~LogisticRegression(void);

	// train by SGD on the sample file
	bool TrainSGDOnSampleFile (
				const char * sFileName, int iMaxFeatureNum,		// about the samples
				double dLearningRate,							// about the learning 
				int iMaxLoop, double dMinImproveRatio			// about the stop criteria
				);
	// save the model to txt file: the theta vector with its size
	bool SaveLRModelTxt (const char * sFileName);
	// load the model from txt file: the theta vector with its size
	bool LoadLRModelTxt (const char * sFileName);
	// load the samples from file, predict by the LR model
	bool PredictOnSampleFile (const char * sFileIn, const char * sFileOut, const char * sFileLog);

	// just for test
	void Test (void);
	

private:
	// the theta vector for each dimension of feature
	vector<double> ThetaVec;

	// read a sample from a line, return false if fail
	bool ReadSampleFrmLine (string & sLine, Sample & theSample);
	// the Sigmoid function: f(x) = 1 / (1 + exp(-x)) = exp (x) / ( 1 + exp(x) )
	double Sigmoid (double x);
	// calculate the model function output by feature vector
	double CalcFuncOutByFeaVec (vector<int> & FeaIdVec);
	// calculate the gradient and update the theta values; requires the feature indices to be sorted in ascending order
	void UpdateThetaVec (Sample & theSample, double dY, double dLearningRate);
	// predict the class for one single sample
	int PredictOneSample (Sample & theSample);
};

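There are two classes. class Sample is the node class that represents one sample; it holds the sample's class label and its features. Because the input features are 0-1 features, only the indices of the features whose value is 1 are stored. Without constraint (2) from section 2, both the feature index and its actual input value would have to be stored, which would hurt efficiency.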
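To show why storing only the active indices is enough, here is a minimal sketch of how the model output could be computed from such a sparse vector: with 0-1 features, the dot product between the weights and the input reduces to a sum of the weights at the active indices. The function names SigmoidSketch and SparseModelOutput are hypothetical, and this is not necessarily how CalcFuncOutByFeaVec is implemented; in particular, skipping out-of-range indices is an assumption made here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// The sigmoid function: f(x) = 1 / (1 + exp(-x)).
double SigmoidSketch(double x)
{
    return 1.0 / (1.0 + std::exp(-x));
}

// Model output for a sample given only the indices of its features
// whose value is 1. Since every stored feature has value 1, the dot
// product theta . x is just the sum of theta at those indices.
// Assumption: indices outside the weight vector are silently skipped.
double SparseModelOutput(const std::vector<double>& theta,
                         const std::vector<int>& activeIds)
{
    double sum = 0.0;
    for (std::size_t k = 0; k < activeIds.size(); ++k) {
        int id = activeIds[k];
        if (id >= 0 && static_cast<std::size_t>(id) < theta.size()) {
            sum += theta[id];
        }
    }
    return SigmoidSketch(sum);
}
```

If real-valued inputs were allowed, each entry would need an (index, value) pair and the loop would multiply the weight by the value instead of just adding it.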

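class LogisticRegression is the main class that implements LR. It has four public interfaces: training (TrainSGDOnSampleFile), saving (SaveLRModelTxt), loading (LoadLRModelTxt), and prediction (PredictOnSampleFile). The final "Test" method is the dirty corner, used for odd jobs.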

The only data member of LogisticRegression is the theta vector, which stores the weight of each feature.
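As an illustration of how such a weight vector might be updated during SGD training, here is a minimal sketch of one gradient step on a single sample. It uses the standard logistic regression update, theta_i += learningRate * (y - p) * x_i, where x_i is 1 for every stored feature. The name SgdStepSketch and its signature are hypothetical, and the actual UpdateThetaVec may differ; the sketch also assumes the indices are valid and contain no duplicates.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One SGD step for logistic regression on a single 0-1 feature sample.
// theta:        weight vector, updated in place
// activeIds:    indices of the features whose value is 1
//               (assumed valid and free of duplicates)
// label:        the true class, 0 or 1
// learningRate: step size
void SgdStepSketch(std::vector<double>& theta,
                   const std::vector<int>& activeIds,
                   int label, double learningRate)
{
    // Forward pass: sum the weights of the active features.
    double sum = 0.0;
    for (std::size_t k = 0; k < activeIds.size(); ++k) {
        sum += theta[activeIds[k]];
    }
    double predicted = 1.0 / (1.0 + std::exp(-sum));

    // Gradient of the log-likelihood w.r.t. theta_i is (y - p) * x_i.
    // x_i is 1 for active features and 0 otherwise, so only the
    // active weights change.
    double error = static_cast<double>(label) - predicted;
    for (std::size_t k = 0; k < activeIds.size(); ++k) {
        theta[activeIds[k]] += learningRate * error;
    }
}
```

Because inactive features contribute a zero gradient, the cost of one step is proportional to the number of active features in the sample, not to the total number of features.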

Also, if you repost this article, please credit the source: http://blog.csdn.net/xceman1997/article/details/17881681