Using SVMlin, a Semi-Supervised Learning Tool

zt_706

Reposted from Koala++'s blog, with thanks to the original author.

SVMlin implements both supervised and semi-supervised SVM algorithms. It can be downloaded from http://people.cs.uchicago.edu/~vikass/svmlin.html, though googling "svmlin" finds it just as quickly.

SVMlin is software package for linear SVMs. It is well-suited to classification problems involving a large number of examples and features. It is primarily written for sparse datasets (number of non-zero features in an example is typically small). It is written in C++ (mostly C).

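In short: the author says SVMlin can handle datasets with many examples and many features (I am a little skeptical: the data structure he uses is just a plain array, can that really work?), that it is aimed mainly at sparse datasets (that is, many feature values are 0), and that it is written in C++ (mostly C).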

SVMlin can also utilize unlabeled data, in addition to labeled examples. It currently implements two extensions of standard SVMs to incorporate unlabeled examples.

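SVMlin can use unlabeled examples for classification, and it currently implements two extensions of the standard SVM ("currently"? I would say forever: it has not been updated since 2006, so I cannot even be bothered to report the bugs I found).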

For a Reuters text categorization problem with around 804414 labeled examples and 47326 features, SVMlin takes less than two minutes to train a linear SVM on an Intel machine with 3GHz processor and 2GB RAM. Given just 1000 labels, it can utilize the remaining hundreds of thousands of unlabeled examples for training a semi-supervised linear SVM in about 20 minutes. Unlabeled data can be very useful in improving classification performance when labels are relatively few.

The passage above is the numerical evidence for how strong the algorithm is, and it does look quite good.

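Its data format is similar to LibSVM's, except that the author was a bit lazy (though very generous): instead of putting the class label in the first column, he splits features and labels into two separate files (which, admittedly, makes the program a little easier to write).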

For example, the following data matrix with 4 examples and 5 features
0 3 0 0 1
4 1 0 0 0
0 5 9 2 0
6 0 0 5 3

is described in the input file as

2:3 5:1
1:4 2:1
2:5 3:9 4:2
1:6 4:5 5:3

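This is the author's example of the data format. Each line lists only the non-zero entries of one row as index:value pairs, with feature indices starting at 1.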
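To make the mapping concrete, here is a minimal Python sketch that turns the dense matrix above into the sparse lines shown. The function name `dense_to_sparse_lines` is made up for illustration and is not part of SVMlin.

```python
def dense_to_sparse_lines(matrix):
    """Convert a dense matrix (list of rows) into 'index:value' lines.

    Hypothetical helper, not part of SVMlin. Feature indices are 1-based
    and zero entries are dropped, as in the example above.
    """
    lines = []
    for row in matrix:
        pairs = [f"{j}:{v}" for j, v in enumerate(row, start=1) if v != 0]
        lines.append(" ".join(pairs))
    return lines


# The 4 x 5 matrix from the SVMlin documentation quoted above.
matrix = [
    [0, 3, 0, 0, 1],
    [4, 1, 0, 0, 0],
    [0, 5, 9, 2, 0],
    [6, 0, 0, 5, 3],
]

for line in dense_to_sparse_lines(matrix):
    print(line)
```

Writing these lines to a text file, one per example, should give an input file in the format described above.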

The file containing labels is separate since it is routine to use the same inputs with different labels. Each line should contain a label for the corresponding line in the input file with one of the following values:
+1 (labeled positive example)
-1 (labeled negative example)
0 (unlabeled examples)

+1 marks a positive example, -1 a negative example, and 0 an unlabeled example.
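If your data is already in LibSVM style (label in the first column), splitting it into the two files SVMlin expects is straightforward. The sketch below is only an illustration: the helper names and the sample labels are invented, and it assumes the labels are already +1/-1. Rows listed in `unlabeled` get the label 0.

```python
def split_libsvm_line(line):
    """Split 'label idx:val idx:val ...' into (label, 'idx:val ...')."""
    label, _, features = line.strip().partition(" ")
    return label, features


def to_svmlin_files(libsvm_lines, unlabeled=()):
    """Return (example_lines, label_lines) for SVMlin.

    Hypothetical helper. Row indices (0-based) in `unlabeled` have their
    label replaced by 0, which marks them as unlabeled examples.
    """
    hidden = set(unlabeled)
    examples, labels = [], []
    for i, line in enumerate(libsvm_lines):
        label, features = split_libsvm_line(line)
        examples.append(features)
        labels.append("0" if i in hidden else label)
    return examples, labels


# Made-up labels attached to the four example rows from above.
libsvm_lines = [
    "+1 2:3 5:1",
    "-1 1:4 2:1",
    "+1 2:5 3:9 4:2",
    "-1 1:6 4:5 5:3",
]

examples, labels = to_svmlin_files(libsvm_lines, unlabeled=[2, 3])
print("\n".join(examples))
print("\n".join(labels))
```

Each list is then written to its own file, one entry per line, so that line n of the labels file corresponds to line n of the examples file.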

Download the file example.tar.gz or example.zip

Download either archive and try it out first.

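Here are the commands for running the two semi-supervised algorithms (a reminder for the eccentric: enter only one of the commands below at a time).

1: svmlin -A 2 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels

2: svmlin -A 3 -W 0.001 -U 1 -R 0.5 example/training_examples example/training_labels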
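If you would rather drive these runs from a script, a small sketch like the following builds the same two command lines. It uses only the flags and values that appear above. The function name is invented, and the code only constructs the argument list; it does not run anything.

```python
def svmlin_train_command(algorithm, examples, labels,
                         w="0.001", u="1", r="0.5"):
    """Build the svmlin training command as an argument list.

    Hypothetical helper. The defaults simply copy the values used in
    the two commands above.
    """
    return ["svmlin", "-A", str(algorithm), "-W", w, "-U", u, "-R", r,
            examples, labels]


for algorithm in (2, 3):
    cmd = svmlin_train_command(algorithm,
                               "example/training_examples",
                               "example/training_labels")
    print(" ".join(cmd))
    # To actually run it (assuming svmlin is built and on PATH):
    # subprocess.run(cmd, check=True)
```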

Use the following command to check the accuracy:

svmlin -f training_examples.weights example/test_examples example/test_labels
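For reference, here is what an accuracy figure means for a linear classifier, as a minimal Python sketch. It assumes you have one real-valued score per test example and that the predicted class is the sign of the score. The scores and labels below are made up, and this is not necessarily how svmlin computes its own number.

```python
def accuracy(scores, labels):
    """Fraction of examples whose score sign matches the +1/-1 label.

    Illustrative only: a score > 0 is read as +1, anything else as -1.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")
    correct = sum(1 for s, y in zip(scores, labels)
                  if (1 if s > 0 else -1) == y)
    return correct / len(labels)


# Made-up scores and true labels for four test examples.
scores = [0.8, -1.2, 0.1, -0.3]
labels = [1, -1, -1, -1]
print(accuracy(scores, labels))  # 3 of the 4 signs match
```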

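On Linux none of this is a problem, of course: just install the build tools (Ubuntu apparently does not ship with them, which left this Windows fan dizzy for a while; as for how to install them, google it yourself).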

Type

make

This will create an executable

svmlin

As the author says, typing make produces an executable named svmlin.

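The most convenient option is of course Visual Studio on Windows. However, Visual C++ 6.0 does not fully support standard C++: it treats a variable declared in a for statement as belonging to the enclosing scope (which is very annoying: once the loop variables i, j, and k are used up, what are you supposed to use?). As a result, compiling SVMlin under VC 6.0 reports that i and j are redefined. The simplest fix is to delete the duplicate declarations. One more note, on register int: it asks for i to be kept in a register, but with today's compilers that is a wish that usually goes unfulfilled.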