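MALLET ships shell scripts in its bin/ directory. This article instead shows how to run the command-line classification tools from within MyEclipse.
I. Running the Text2Vectors class
In the Run configuration, on the Arguments tab, enter the following under Program arguments: --input e:/mallet/20_newsgroups/talk.politics.* --skip-header --output e:/mallet/news2.vectors
--input is followed by the location of the input files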
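--skip-header means that analysis of each document starts after the first blank line (two consecutive newlines), so the newsgroup header is skipped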
--output gives the name and location of the output file
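To make the effect of --skip-header concrete, here is a minimal, self-contained sketch of the idea. It does not call MALLET; the class name SkipHeaderSketch and the fallback of returning the whole text when no blank line is found are assumptions made for illustration.

```java
final class SkipHeaderSketch {
    /**
     * Returns the text that follows the first blank line.
     * Returning the whole text when there is no blank line is an
     * assumption of this sketch, not documented MALLET behavior.
     */
    static String skipHeader(String document) {
        // Normalize Windows line endings so "\r\n\r\n" also counts as a blank line.
        String normalized = document.replace("\r\n", "\n");
        int blankLine = normalized.indexOf("\n\n");
        return blankLine < 0 ? normalized : normalized.substring(blankLine + 2);
    }

    public static void main(String[] args) {
        // A made-up newsgroup-style post: header lines, a blank line, then the body.
        String post = "From: someone@example.com\nSubject: test\n\nBody line one.\nBody line two.\n";
        System.out.print(skipHeader(post));
    }
}
```

Run on the made-up post above, only the two body lines remain.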
Output:
Labels =
These three labels are the directories matched by e:/mallet/20_newsgroups/talk.politics.*
The file news2.vectors is created under e:/mallet/
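II. Running the Vectors2Classify class
In the Run configuration, on the Arguments tab, enter the following under Program arguments: --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3
--trainer selects the training algorithm; this example uses NaiveBayes
Output: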
-------------------- Trial 0
Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9511111111111111
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8958333333333334
-------------------- Trial 1
Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9522222222222222
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.8891666666666667
-------------------- Trial 2
Trial 2 Training NaiveBayesTrainer with 1800 instances
Trial 2 Training NaiveBayesTrainer finished
Trial 2 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 2 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted
Trial 2 Trainer NaiveBayesTrainer test data accuracy= 0.895
NaiveBayesTrainer
Summary. train accuracy mean = 0.9522222222222222 stddev = 9.072184232530348E-4 stderr = 5.237828008789275E-4
Summary. test accuracy mean = 0.8933333333333334 stddev = 0.002965855070008714 stderr = 0.0017123372230469474
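The Summary lines can be checked by hand from the three per-trial accuracies. The sketch below is a plain-Java calculation, not MALLET code, and the class and method names are made up. It assumes the standard deviation divides by n rather than n - 1 and that the standard error is stddev / sqrt(n); those assumptions reproduce the figures in the log above.

```java
import java.util.Locale;

final class TrialSummarySketch {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double stddev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) sumSq += (x - m) * (x - m);
        return Math.sqrt(sumSq / xs.length);
    }

    /** Standard error of the mean: stddev / sqrt(n). */
    static double standardError(double[] xs) {
        return stddev(xs) / Math.sqrt(xs.length);
    }

    public static void main(String[] args) {
        // Test accuracies copied from Trial 0, Trial 1 and Trial 2 in the log.
        double[] testAccuracy = {0.8958333333333334, 0.8891666666666667, 0.895};
        System.out.printf(Locale.ROOT, "mean=%.6f stddev=%.6f stderr=%.6f%n",
                mean(testAccuracy), stddev(testAccuracy), standardError(testAccuracy));
    }
}
```

To six decimal places this gives a mean of 0.893333, a stddev of 0.002966 and a stderr of 0.001712, in agreement with the test accuracy Summary line.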
The arguments --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 are equivalent to
--input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 --report test:accuracy test:confusion train:accuracy
The --report option can output confusion, accuracy, and f1.
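Discussion 1: Choosing the training and test sets
--training-portion 0.6 randomly selects 60% of the data as the training set; the rest is used as the test set
The default for --training-portion is 1.0, meaning all of the data is used for training and none for testing
A further option, --validation-portion, sets aside part of the data as a validation set
For example: --training-portion 0.6 --validation-portion 0.1
uses 60% for training, 10% for validation, and the remaining 30% for testing.
Although a validation set can be specified for MALLET's classification algorithms, none of the current algorithms makes particularly good use of it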
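The random split described above can be sketched in a few lines of plain Java. This is an illustration of the idea only, not MALLET's implementation: the class name SplitSketch, the explicit seed parameter, and the rounding rule are all assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class SplitSketch {
    /** Returns three lists of instance indices: training, validation, testing. */
    static List<List<Integer>> split(int numInstances, double trainingPortion,
                                     double validationPortion, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) indices.add(i);
        // Shuffle so that each portion is a random sample of the data.
        Collections.shuffle(indices, new Random(seed));

        // Rounding to the nearest integer is an assumption of this sketch.
        int numTrain = Math.min(numInstances, (int) Math.round(numInstances * trainingPortion));
        int numValid = Math.min(numInstances - numTrain,
                (int) Math.round(numInstances * validationPortion));

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, numTrain)));
        parts.add(new ArrayList<>(indices.subList(numTrain, numTrain + numValid)));
        // Whatever is left over becomes the test set.
        parts.add(new ArrayList<>(indices.subList(numTrain + numValid, numInstances)));
        return parts;
    }

    public static void main(String[] args) {
        // 3000 instances, as implied by "1800 instances" at a training portion of 0.6.
        List<List<Integer>> parts = split(3000, 0.6, 0.1, 42L);
        System.out.println(parts.get(0).size() + " " + parts.get(1).size() + " " + parts.get(2).size());
    }
}
```

With 3000 instances and portions 0.6 and 0.1, the three parts contain 1800, 300 and 900 instances.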
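Discussion 2: Separate data files
For separate training and test data, the syntax is vectors2classify --training-file train.vectors --testing-file test.vectors
The data can also be split beforehand; the syntax is vectors2vectors --input news2.vectors --training-portion .6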
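Discussion 3: Classification algorithms
When two --trainer options are given, the two algorithms are trained and tested separately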
III. Running the Vectors2Info class to display various information
1. Word information
0 israel
1 israeli
2 arab
3 turkish
4 gun
5 turks
6 jews
7 armenia
8 muslim
9 armenian
2. Class label information
guns
mideast
misc
3. Word/document matrix
file:/e:/mallet/20_newsgroups/talk.politics.guns/55057 guns
In --print-matrix siw, the letters s, i, and w stand for sparse, integer, and word. The three groups of options are described below.
Print entries for all words in the vocabulary, or just the words that actually occur in the document:
a | all
s | sparse (default)
Print word counts as integers or as binary presence/absence indicators:
b | binary
i | integer (default)
How to indicate the word itself:
n | integer word index
w | word string
c | combination of integer word index and word string (default)
e | empty, don't print anything to indicate the identity of the word
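As a rough illustration of how the three letters combine, the sketch below renders one document row for a given three-letter option string. It is not MALLET code: the class name PrintMatrixSketch, the name=value layout, and the index:word form used for c are hypothetical, and the real Vectors2Info output may be laid out differently.

```java
import java.util.HashMap;
import java.util.Map;

final class PrintMatrixSketch {
    /** Renders one document as a line of text, following a three-letter spec such as "siw". */
    static String formatRow(String spec, String[] vocabulary, Map<Integer, Integer> counts) {
        if (spec.length() != 3) {
            throw new IllegalArgumentException("expected three letters, e.g. siw");
        }
        boolean all = spec.charAt(0) == 'a';     // a = all words, s = sparse
        boolean binary = spec.charAt(1) == 'b';  // b = binary, i = integer
        char identity = spec.charAt(2);          // n, w, c or e

        StringBuilder row = new StringBuilder();
        for (int index = 0; index < vocabulary.length; index++) {
            int count = counts.getOrDefault(index, 0);
            if (!all && count == 0) continue;    // sparse: skip words absent from the document
            int value = binary ? (count > 0 ? 1 : 0) : count;
            String name;
            switch (identity) {
                case 'n': name = Integer.toString(index); break;
                case 'w': name = vocabulary[index]; break;
                case 'c': name = index + ":" + vocabulary[index]; break; // hypothetical layout
                case 'e': name = ""; break;
                default: throw new IllegalArgumentException("unknown identity letter: " + identity);
            }
            if (row.length() > 0) row.append(' ');
            row.append(name.isEmpty() ? Integer.toString(value) : name + "=" + value);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        // The first five words from the word list above; the counts are made up.
        String[] vocabulary = {"israel", "israeli", "arab", "turkish", "gun"};
        Map<Integer, Integer> counts = new HashMap<>();
        counts.put(2, 1);
        counts.put(4, 3);
        System.out.println(formatRow("siw", vocabulary, counts)); // arab=1 gun=3
        System.out.println(formatRow("abn", vocabulary, counts)); // 0=0 1=0 2=1 3=0 4=1
    }
}
```

The first call prints only the two words that occur, with their counts; the second prints every vocabulary index with a 0/1 presence flag.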