MALLET Command-Line Tools


kobe00712


Original article: MALLET Command-Line Tools. Author: 章芝青

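MALLET ships its command-line tools as shell scripts in the bin/ directory. This article instead shows how to run the classification programs from MyEclipse by passing the same command-line arguments to the corresponding classes.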

I. Running the Text2Vectors class

In the Run configuration, under Arguments > Program arguments, enter --input e:/mallet/20_newsgroups/talk.politics.* --skip-header --output e:/mallet/news2.vectors

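--input gives the location of the input data: one directory per class, each holding that class's text files

--skip-header makes the analysis of each document start after the first blank line (two consecutive newlines), so that headers such as those on newsgroup posts are skipped

--output gives the name and location of the output file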

Output:

Labels =
   e:/mallet/20_newsgroups/talk.politics.guns
   e:/mallet/20_newsgroups/talk.politics.mideast
   e:/mallet/20_newsgroups/talk.politics.misc

These three directories are the ones that match e:/mallet/20_newsgroups/talk.politics.*, and each becomes one class label.

The file news2.vectors is created under e:/mallet/.
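To make the --skip-header behaviour concrete, here is a minimal Python sketch. It is not MALLET's own code: it assumes the header is everything up to the first blank line, and the function name `skip_header` and the sample post are made up for illustration.

```python
def skip_header(text: str) -> str:
    """Return the text after the first blank line, or the whole text if there is none."""
    # Normalise Windows line endings so a blank line is always "\n\n".
    normalised = text.replace("\r\n", "\n")
    head, separator, body = normalised.partition("\n\n")
    # Keeping the full text when no blank line exists is a choice of this sketch;
    # MALLET may behave differently in that case.
    return body if separator else normalised


# Hypothetical newsgroup-style post: header lines, one blank line, then the body.
sample_post = "From: someone@example.com\nSubject: test\n\nFirst body line.\nSecond body line.\n"
print(skip_header(sample_post))
```

Only the two body lines survive, so header fields such as the sender address do not leak into the feature vectors.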

II. Running the Vectors2Classify class

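In the Run configuration, under Arguments > Program arguments, enter --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3

--trainer selects the training algorithm; this example uses NaiveBayes

--training-portion 0.6 uses 60% of the data for training and the remaining 40% for testing

--num-trials 3 runs three trials, each with its own random split

Output: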

-------------------- Trial 0 --------------------

Trial 0 Training NaiveBayesTrainer with 1800 instances
Trial 0 Training NaiveBayesTrainer finished
Trial 0 Trainer NaiveBayesTrainer training data accuracy= 0.9511111111111111
Trial 0 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.8958333333333334
    label   0   1   2 |total
0    guns 395   2  18 |415
1 mideast   2 360  33 |395
2    misc  52  18 320 |390

Trial 0 Trainer NaiveBayesTrainer test data accuracy= 0.8958333333333334

-------------------- Trial 1 --------------------

Trial 1 Training NaiveBayesTrainer with 1800 instances
Trial 1 Training NaiveBayesTrainer finished
Trial 1 Trainer NaiveBayesTrainer training data accuracy= 0.9522222222222222
Trial 1 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.8891666666666667
    label   0   1   2 |total
0    guns 392   3  18 |413
1 mideast   5 350  30 |385
2    misc  58  19 325 |402

Trial 1 Trainer NaiveBayesTrainer test data accuracy= 0.8891666666666667

-------------------- Trial 2 --------------------

Trial 2 Training NaiveBayesTrainer with 1800 instances
Trial 2 Training NaiveBayesTrainer finished
Trial 2 Trainer NaiveBayesTrainer training data accuracy= 0.9533333333333334
Trial 2 Trainer NaiveBayesTrainer Test Data Confusion Matrix
Confusion Matrix, row=true, column=predicted accuracy=0.895
    label   0   1   2 |total
0    guns 392   .  21 |413
1 mideast  12 383  30 |425
2    misc  44  19 299 |362

Trial 2 Trainer NaiveBayesTrainer test data accuracy= 0.895

NaiveBayesTrainer
Summary. train accuracy mean = 0.9522222222222222 stddev = 9.072184232530348E-4 stderr = 5.237828008789275E-4
Summary. test accuracy mean = 0.8933333333333334 stddev = 0.002965855070008714 stderr = 0.0017123372230469474
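The test figures in this output can be checked by hand. The Python sketch below recomputes each trial's test accuracy from its confusion matrix (diagonal sum divided by the total) and then the summary line. It assumes the "." in Trial 2 stands for 0 and that the reported stddev is the population standard deviation (dividing by n), which is the choice that reproduces the printed value.

```python
import math

# Test-set confusion matrices copied from the three trials above
# (rows = true label, columns = predicted label; "." in Trial 2 is read as 0).
trials = [
    [[395, 2, 18], [2, 360, 33], [52, 18, 320]],
    [[392, 3, 18], [5, 350, 30], [58, 19, 325]],
    [[392, 0, 21], [12, 383, 30], [44, 19, 299]],
]


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all test instances."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracies = [accuracy(m) for m in trials]
mean = sum(accuracies) / len(accuracies)
# Population standard deviation: divide by n, not n - 1.
stddev = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / len(accuracies))
stderr = stddev / math.sqrt(len(accuracies))

print(accuracies)
print(mean, stddev, stderr)
```

Each matrix sums to 1200 test instances, which together with the 1800 training instances gives 3000 documents in total, consistent with --training-portion 0.6.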

The arguments --input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 are equivalent to

--input e:/mallet/news2.vectors --trainer NaiveBayes --training-portion 0.6 --num-trials 3 --report test:confusion train:accuracy test:accuracy

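The --report option can print confusion, accuracy, f1, and raw values, each prefixed with the data set it applies to (as in test:accuracy); select whichever are needed.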
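As an illustration of what an f1 report measures, the sketch below computes per-label F1 from the Trial 0 test confusion matrix using the standard definition (harmonic mean of precision and recall). It is assumed, not verified here, that MALLET's f1 report follows the same definition; the helper name `f1_for` is made up.

```python
# Trial 0 test confusion matrix from the output above (rows = true, columns = predicted).
labels = ["guns", "mideast", "misc"]
matrix = [[395, 2, 18], [2, 360, 33], [52, 18, 320]]


def f1_for(label):
    """Standard F1 for one label: harmonic mean of precision and recall."""
    k = labels.index(label)
    true_positive = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column sum: everything predicted as this label
    actual = sum(matrix[k])                    # row sum: everything truly of this label
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)


for label in labels:
    print(label, round(f1_for(label), 4))
```

The misc class scores lowest because many misc posts are predicted as guns (52 of 390), which accuracy alone does not reveal.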

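Discussion 1: Choosing the training and test sets

--training-portion 0.6 randomly selects 60% of the data as the training set and uses the rest as the test set.

The default value of --training-portion is 1.0, meaning all data is used for training and none is left for testing.

There is also a --validation-portion option, which sets aside a validation set.

For example: --training-portion 0.6 --validation-portion 0.1

means 60% for training, 10% for validation, and the remaining 30% for testing.

Although a validation set can be passed to MALLET's classification algorithms, at present none of them makes particularly good use of it.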
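The portions described above can be sketched in a few lines of Python. This is an illustration of the idea only, not MALLET's splitting code; the function `split_instances`, its seed parameter, and the use of rounding are assumptions of the sketch.

```python
import random


def split_instances(instances, training_portion, validation_portion=0.0, seed=0):
    """Shuffle, then cut into (training, validation, testing) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * training_portion)
    n_valid = round(len(shuffled) * validation_portion)
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_valid]
    testing = shuffled[n_train + n_valid:]  # whatever remains
    return training, validation, testing


# 3000 placeholder instances, the size implied by 1800 training instances at a 0.6 portion.
train, valid, test = split_instances(range(3000), 0.6, 0.1)
print(len(train), len(valid), len(test))
```

With 3000 instances this prints 1800 300 900, the 60/10/30 split from the example. Changing the seed gives a different random split, which is what distinguishes one trial from another.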

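Discussion 2: Separate training and test data

When the training and test data are already in separate files, the syntax is vectors2classify --training-file train.vectors --testing-file test.vectors

An existing vectors file can also be split into the two files first, with the syntax vectors2vectors --input news2.vectors --training-portion .6 --training-file train.vectors --testing-file test.vectors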
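When running from an IDE, it can help to assemble the two argument strings programmatically before pasting them into Program arguments. The sketch below does that in Python, using only the options shown above; the helper names `split_args` and `classify_args` are made up, and nothing here invokes MALLET itself.

```python
import shlex


def split_args(input_file, training_portion, training_file, testing_file):
    """Arguments for vectors2vectors, using only the options shown above."""
    return ["--input", input_file,
            "--training-portion", str(training_portion),
            "--training-file", training_file,
            "--testing-file", testing_file]


def classify_args(training_file, testing_file):
    """Arguments for vectors2classify on data that is already split."""
    return ["--training-file", training_file, "--testing-file", testing_file]


step1 = split_args("news2.vectors", 0.6, "train.vectors", "test.vectors")
step2 = classify_args("train.vectors", "test.vectors")
print("vectors2vectors " + shlex.join(step1))
print("vectors2classify " + shlex.join(step2))
```

Keeping the file names in variables makes sure the second step reads exactly the files the first step wrote.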

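Discussion 3: Classification algorithms

MALLET's default classification algorithm is Naive Bayes, but Maximum Entropy, Decision Tree, and Winnow are also available. An algorithm is selected with the syntax vectors2classify --input news2.vectors --trainer MaxEnt --training-portion 0.7, which in this case classifies with Maximum Entropy.

Several algorithms can be selected at once, for example: vectors2classify --input news2.vectors --trainer NaiveBayes --trainer MaxEnt --training-portion 0.7
Each of the two algorithms is then trained and tested separately.

Trainers can also be given constructor arguments, for example: vectors2classify --input news2.vectors --trainer "new MaxEntTrainer(0.01)" --training-portion 0.6, which selects Maximum Entropy with a Gaussian prior variance of 0.01.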
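For intuition about the 0.01 in new MaxEntTrainer(0.01): a zero-mean Gaussian prior with variance v on the weights is commonly implemented as a penalty of sum(w^2) / (2v) subtracted from the log-likelihood. The Python sketch below computes that penalty for a made-up weight vector. It assumes this standard form and is not taken from MALLET's source.

```python
def gaussian_prior_penalty(weights, variance):
    """Penalty sum(w^2) / (2 * variance) from a zero-mean Gaussian prior (standard form, assumed)."""
    return sum(w * w for w in weights) / (2.0 * variance)


# Hypothetical weight vector, for illustration only.
weights = [0.5, -1.0, 2.0]

for variance in (1.0, 0.01):
    print(variance, gaussian_prior_penalty(weights, variance))
```

The same weights cost a hundred times more under variance 0.01 than under variance 1.0, so a small variance pulls the weights strongly towards zero, which means heavier regularisation.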

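III. Running the Vectors2Info class to display information

1. Word information

With the arguments --input e:/mallet/news2.vectors --print-infogain 10, the ten words in news2.vectors with the highest information gain are printed. The result is: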

0 israel
1 israeli
2 arab
3 turkish
4 gun
5 turks
6 jews
7 armenia
8 muslim
9 armenian
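Information gain measures how much knowing whether a word occurs reduces uncertainty about the class. The sketch below computes it for a tiny made-up corpus, treating each word as a binary present/absent feature. Both the corpus and that binary treatment are assumptions of the sketch; MALLET's exact computation may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(documents, word):
    """H(class) - H(class | word present or absent), over (label, set_of_words) pairs."""
    all_labels = [label for label, _ in documents]
    with_word = [label for label, words in documents if word in words]
    without_word = [label for label, words in documents if word not in words]
    conditional = 0.0
    for subset in (with_word, without_word):
        if subset:  # an empty subset contributes nothing
            conditional += len(subset) / len(documents) * entropy(subset)
    return entropy(all_labels) - conditional


# Tiny made-up corpus, not the 20 Newsgroups data.
docs = [
    ("mideast", {"israel", "the"}),
    ("mideast", {"israel", "arab", "the"}),
    ("guns", {"gun", "the"}),
    ("guns", {"gun", "law", "the"}),
]
print(information_gain(docs, "israel"))
print(information_gain(docs, "the"))
```

A word that separates the classes perfectly ("israel" here) gets the full class entropy, while a word that appears everywhere ("the") gets zero, which is why topical words rather than stop words top the list above.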

2. Class label information

With the arguments --input e:/mallet/news2.vectors --print-labels, the class labels in news2.vectors are printed. The result is:

guns
mideast
misc

3. Word/document matrix

With the arguments --input e:/mallet/news2.vectors --print-matrix siw, the word/document matrix in news2.vectors is printed. The result is:

file:/e:/mallet/20_newsgroups/talk.politics.guns/55057 guns in 5 writes 1 you 2 ...
file:/e:/mallet/20_newsgroups/talk.politics.guns/54866 guns in 1 c 2 got 1 was 1 tear 1 gas 1 the 34 davidians 6 their 3 or 3 so 1 children 1 to ...

In --print-matrix siw, the three letters s, i, and w select sparse, integer, and word string, one from each of the three option groups described below.

Print entries for all words in the vocabulary, or just print the words that actually occur in the document.
`a`	all
`s`	sparse, (default)
Print word counts as integers or as binary presence/absence indicators.
`b`	binary
`i`	integer, (default)
How to indicate the word itself.
`n`	integer word index
`w`	word string
`c`	combination of integer word index and word string, (default)
`e`	empty, don't print anything to indicate the identity of the word
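The three option groups can be illustrated with a small Python formatter. It is a sketch of the idea, not MALLET's printing code: the function `format_row`, the four-word vocabulary, and the word(index) layout used for `c` are all assumptions.

```python
def format_row(label, counts, vocabulary, flags="sic"):
    """Format one document row. flags = sparsity (a/s) + value (b/i) + word style (n/w/c/e)."""
    sparsity, value, word_style = flags
    parts = [label]
    for index, word in enumerate(vocabulary):
        count = counts.get(word, 0)
        if sparsity == "s" and count == 0:
            continue  # sparse: skip words that do not occur in the document
        if word_style == "n":
            parts.append(str(index))
        elif word_style == "w":
            parts.append(word)
        elif word_style == "c":
            parts.append(f"{word}({index})")  # assumed layout; MALLET's may differ
        # word_style == "e": print nothing to identify the word
        parts.append(str(min(count, 1)) if value == "b" else str(count))
    return " ".join(parts)


vocabulary = ["in", "writes", "you", "gas"]  # hypothetical four-word vocabulary
counts = {"in": 5, "writes": 1, "you": 2}    # counts from the first sample row above

print(format_row("guns", counts, vocabulary, "siw"))
print(format_row("guns", counts, vocabulary, "abn"))
```

With siw the row reads guns in 5 writes 1 you 2, matching the sample output apart from the file name. With abn every vocabulary entry is listed by index with a 0/1 indicator, giving guns 0 1 1 1 2 1 3 0.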