Weka Manual 3.6 Translation: 16.6 Classification

weixin_34054931

Original link: https://my.oschina.net/leopardsaga/blog/92740

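16.6 Classification

In WEKA, classification and regression algorithms are both called "classifiers", and all of them live in the weka.classifiers package. This section covers the following topics:

• Building a classifier - batch and incremental learning.

• Evaluating a classifier - various evaluation techniques and how to obtain the generated statistics.

• Classifying instances - obtaining classifications for unknown data.

The WEKA Examples collection [3] contains example classes for classification in the wekaexamples.classifiers package.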

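16.6.1 Building a classifier

By design, all classifiers in WEKA are batch-trainable, i.e., they are trained on the whole dataset in one go. This works well as long as the training data fits into memory. There are also algorithms that can update their internal model on the fly; these classifiers are called incremental. The following two subsections cover batch and incremental classifiers.

Batch classifiers

Building a batch classifier is very simple:

• Set options - either with the setOptions(String[]) method or with the actual set-methods.

• Train - call buildClassifier(Instances) with the training set. By definition, buildClassifier(Instances) completely resets the internal model, so that subsequent calls with the same data yield the same model ("repeatable experiments").

The following code snippet builds an unpruned J48 on a dataset: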

import weka.core.Instances;
import weka.classifiers.trees.J48;
...
Instances data = ... // from somewhere
String[] options = new String[1];
options[0] = "-U"; // unpruned tree
J48 tree = new J48(); // new instance of tree
tree.setOptions(options); // set the options
tree.buildClassifier(data); // build classifier

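Incremental classifiers

In WEKA, all incremental classifiers implement the UpdateableClassifier interface (located in the weka.classifiers package). The Javadoc of this interface lists the classifiers that implement it. These classifiers can process large amounts of data with a small memory footprint, because the training data does not have to be loaded into memory. ARFF files, for instance, can be read incrementally (see Section 16.2).

Training an incremental classifier happens in two stages:

1. Initialize the model by calling buildClassifier(Instances). The weka.core.Instances object passed in can either contain no actual data or hold an initial set of data.

2. Update the model row by row by calling updateClassifier(Instance).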
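The two stages can also be illustrated without WEKA. The following is a minimal plain-Java sketch, not WEKA code: the class RunningClassCounts and its methods build, update and majorityClass are hypothetical names that only mimic the roles of buildClassifier(Instances) and updateClassifier(Instance). The "model" is nothing more than a per-class counter, which is enough to show that the build step resets all state and that each update needs only the current row.

```java
import java.util.Arrays;

/** Hypothetical stand-in for an incremental model: it only keeps per-class counts. */
final class RunningClassCounts {
    private long[] counts = new long[0];

    /** Stage 1: reset the model completely, then absorb any initial labels (may be empty). */
    void build(int numClasses, int[] initialLabels) {
        counts = new long[numClasses];   // discard whatever was learned before
        for (int label : initialLabels) {
            update(label);
        }
    }

    /** Stage 2: update the model with a single row; earlier rows are not needed. */
    void update(int label) {
        counts[label]++;
    }

    /** Index of the most frequent class seen so far (ties go to the lowest index). */
    int majorityClass() {
        int best = 0;
        for (int i = 1; i < counts.length; i++) {
            if (counts[i] > counts[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Defensive copy of the current counts. */
    long[] counts() {
        return Arrays.copyOf(counts, counts.length);
    }

    public static void main(String[] args) {
        RunningClassCounts model = new RunningClassCounts();
        model.build(2, new int[0]);              // structure only, no data yet
        for (int label : new int[] {1, 0, 1, 1}) {
            model.update(label);                 // one row at a time
        }
        System.out.println(Arrays.toString(model.counts()));
        System.out.println("majority class index: " + model.majorityClass());
    }
}
```

Calling build a second time throws away the counts gathered so far, which mirrors the reset requirement described for batch training.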

The following example shows how to load an ARFF file incrementally with the ArffLoader class and train a NaiveBayesUpdateable classifier one row at a time:

import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ArffLoader;
import weka.classifiers.bayes.NaiveBayesUpdateable;
import java.io.File;
...
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);
// train NaiveBayes
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
  nb.updateClassifier(current);

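16.6.2 Evaluating a classifier

Building a classifier is only part of the job; evaluating how well it performs is another important part. WEKA supports two types of evaluation:

• Cross-validation - use this if only a single dataset is available and a reasonably realistic evaluation is wanted. Setting the number of folds to the number of rows in the dataset yields leave-one-out cross-validation (LOOCV).

• Dedicated test set - the test set is used solely to evaluate the built classifier. It is important that the test set follows the same (or a similar) concept as the training set; otherwise performance will always be poor.

The evaluation step, including the collection of statistics, is performed by the Evaluation class (package weka.classifiers).

Cross-validation

The crossValidateModel method of the Evaluation class performs cross-validation with an untrained classifier and a dataset. Supplying an untrained classifier ensures that no information leaks into the actual evaluation. Although it is an implementation requirement that buildClassifier resets the classifier, there is no guarantee that every implementation actually does so ("leaky" implementations). Using an untrained classifier avoids unwanted side effects, because a copy of the originally supplied classifier is used for each train/test pair.

Before cross-validation is performed, the data is randomized with the supplied random number generator (java.util.Random). It is recommended to give this generator a fixed seed. Otherwise, subsequent cross-validation runs on the same dataset will not produce the same results, because the data is randomized differently (see Section 16.4 for more information on randomization).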
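To make the role of the seed concrete, here is a small plain-Java sketch of how row indices might be shuffled and dealt into test folds. It is an illustration only and does not reproduce WEKA's own fold construction; the class SeededFolds and the method testFolds are hypothetical names. It shows two points made above: the same seed gives the same folds, and choosing as many folds as rows gives test folds of size one (leave-one-out).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: splits row indices 0..numRows-1 into test folds after a seeded shuffle. */
final class SeededFolds {
    static List<List<Integer>> testFolds(int numRows, int numFolds, long seed) {
        if (numFolds < 2 || numFolds > numRows) {
            throw new IllegalArgumentException("need 2 <= numFolds <= numRows");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numRows; i++) {
            indices.add(i);
        }
        // Same seed, same shuffled order, hence repeatable folds.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            folds.add(new ArrayList<>());
        }
        // Deal the shuffled rows round-robin so fold sizes differ by at most one.
        for (int pos = 0; pos < numRows; pos++) {
            folds.get(pos % numFolds).add(indices.get(pos));
        }
        return folds;
    }

    public static void main(String[] args) {
        System.out.println("5 folds over 10 rows: " + testFolds(10, 5, 1L));
        System.out.println("leave-one-out over 4 rows: " + testFolds(4, 4, 1L));
    }
}
```

Each inner list holds the rows used for testing in one round; the remaining rows would form the training set for that round.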

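The following code snippet performs 10-fold cross-validation of a J48 decision tree on the dataset newData, using a random number generator seeded with "1". The summary of the collected statistics is written to standard output.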

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import java.util.Random;
...
Instances newData = ... // from somewhere
Evaluation eval = new Evaluation(newData);
J48 tree = new J48();
eval.crossValidateModel(tree, newData, 10, new Random(1));
System.out.println(eval.toSummaryString("\nResults\n\n", false));

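The Evaluation object in this example is initialized with the dataset that is used in the evaluation. This tells the evaluation what type of data is being evaluated and ensures that all internal data structures are set up correctly.

Train/test set

Evaluating a classifier on a dedicated test set is just as simple as cross-validation. This time, however, a trained classifier is supplied instead of an untrained one. Again, the weka.classifiers.Evaluation class performs the evaluation, this time through the evaluateModel method.

The following code snippet trains a J48 with default options on a training set, evaluates it on a test set, and prints the summary of the collected statistics.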

import weka.core.Instances;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
Instances train = ... // from somewhere
Instances test = ... // from somewhere
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n\n", false));

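Statistics

The previous sections used the toSummaryString method of the Evaluation class in the code examples. There are further summary methods for nominal class attributes:

• toMatrixString - outputs the confusion matrix.

• toClassDetailsString - outputs TP/FP rates, precision, recall, F-measure and AUC (per class).

• toCumulativeMarginDistributionString - outputs the cumulative margin distribution.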
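For readers unfamiliar with what a confusion matrix and the per-class rates contain, the following plain-Java sketch computes them from two arrays of 0-based class indices. It is not how WEKA implements these summaries; the class ConfusionSketch and its methods are hypothetical names chosen for illustration, and the data in main is made up.

```java
/** Hypothetical helper, not part of WEKA: confusion matrix and per-class rates from index arrays. */
final class ConfusionSketch {
    /** matrix[a][p] counts the rows whose actual class is a and whose predicted class is p. */
    static int[][] confusionMatrix(int[] actual, int[] predicted, int numClasses) {
        if (actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        int[][] matrix = new int[numClasses][numClasses];
        for (int i = 0; i < actual.length; i++) {
            matrix[actual[i]][predicted[i]]++;
        }
        return matrix;
    }

    /** Of all rows predicted as cls, the fraction that really belong to cls. */
    static double precision(int[][] matrix, int cls) {
        int predictedAsCls = 0;
        for (int[] row : matrix) {
            predictedAsCls += row[cls];
        }
        return predictedAsCls == 0 ? 0.0 : (double) matrix[cls][cls] / predictedAsCls;
    }

    /** Of all rows that really belong to cls, the fraction predicted as cls (the TP rate). */
    static double recall(int[][] matrix, int cls) {
        int actualCls = 0;
        for (int count : matrix[cls]) {
            actualCls += count;
        }
        return actualCls == 0 ? 0.0 : (double) matrix[cls][cls] / actualCls;
    }

    public static void main(String[] args) {
        int[] actual = {0, 0, 1, 1, 1};      // made-up example labels
        int[] predicted = {0, 1, 1, 1, 0};
        int[][] matrix = confusionMatrix(actual, predicted, 2);
        for (int[] row : matrix) {
            System.out.println(java.util.Arrays.toString(row));
        }
        System.out.println("precision(1) = " + precision(matrix, 1));
        System.out.println("recall(1)    = " + recall(matrix, 1));
    }
}
```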

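If you do not want to use these summary methods, you can access the individual statistical measures directly. Some common measures are listed below:

• Nominal class attribute

- correct() - the number of correctly classified instances. The number of incorrectly classified ones is available through incorrect().

- pctCorrect() - the percentage of correctly classified instances (accuracy). pctIncorrect() returns the percentage of misclassified ones.

- areaUnderROC(int) - the area under the ROC curve (AUC) for the specified class label index (0-based index).

• Numeric class attribute

- correlationCoefficient() - the correlation coefficient.

• General

- meanAbsoluteError() - the mean absolute error.

- rootMeanSquaredError() - the root mean squared error.

- numInstances() - the number of instances with a class value.

- unclassified() - the number of unclassified instances.

- pctUnclassified() - the percentage of unclassified instances.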
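As a reminder of what some of these measures mean, the sketch below computes a percentage of correct predictions, a mean absolute error and a root mean squared error from plain arrays. It only borrows the names of the measures; the class ErrorMeasures is hypothetical, and WEKA's Evaluation class may compute its values differently (in particular for nominal classes), so treat this as the textbook definitions rather than a description of WEKA's code.

```java
/** Hypothetical helper with textbook definitions of three common measures. */
final class ErrorMeasures {
    /** Percentage (0..100) of positions where the predicted class index equals the actual one. */
    static double pctCorrect(int[] actual, int[] predicted) {
        checkLengths(actual.length, predicted.length);
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) {
                correct++;
            }
        }
        return 100.0 * correct / actual.length;
    }

    /** Average of |actual - predicted| over all rows. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkLengths(actual.length, predicted.length);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** Square root of the average squared difference over all rows. */
    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        checkLengths(actual.length, predicted.length);
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = actual[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / actual.length);
    }

    private static void checkLengths(int a, int b) {
        if (a != b || a == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Made-up values, purely for illustration.
        System.out.println(pctCorrect(new int[] {0, 1, 1, 0}, new int[] {0, 1, 0, 0}));
        System.out.println(meanAbsoluteError(new double[] {1, 2, 3}, new double[] {1, 2, 5}));
        System.out.println(rootMeanSquaredError(new double[] {1, 2, 3}, new double[] {1, 2, 5}));
    }
}
```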

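For a complete overview, see the Javadoc page of the Evaluation class. By looking up the source code of the summary methods mentioned above, you can easily determine which methods are used for a particular output.

16.6.3 Classifying instances

Once a classifier has been evaluated and proven useful, it can be used to make predictions and label unlabeled data. Section 16.5.2 already gave a brief description of how to use a classifier's classifyInstance method. This section elaborates a bit more.

The following example uses a trained classifier tree to label all the instances of an unlabeled dataset loaded from disk. After all instances have been labeled, the newly labeled dataset is written back to disk in a new file.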

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSink;
import weka.core.converters.ConverterUtils.DataSource;
...
// load unlabeled data and set class attribute
Instances unlabeled = DataSource.read("/some/where/unlabeled.arff");
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances ("tree" is the already trained classifier)
for (int i = 0; i < unlabeled.numInstances(); i++) {
  double clsLabel = tree.classifyInstance(unlabeled.instance(i));
  labeled.instance(i).setClassValue(clsLabel);
}
// save newly labeled data
DataSink.write("/some/where/labeled.arff", labeled);

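The example above works for classification and regression problems alike, as long as the classifier can handle numeric classes. Why is that? For numeric classes, classifyInstance(Instance) returns the regression value; for nominal classes, it returns the 0-based index into the list of available class labels.

If you are interested in the class distribution instead, use the distributionForInstance(Instance) method (the returned array sums to 1). Of course, this method only makes sense for classification problems. The following code snippet prints the actual label, the predicted label and the class distribution side by side on the console: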

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;
...
// load data
Instances train = DataSource.read(args[0]);
train.setClassIndex(train.numAttributes() - 1);
Instances test = DataSource.read(args[1]);
test.setClassIndex(test.numAttributes() - 1);
// train classifier
J48 cls = new J48();
cls.buildClassifier(train);
// output predictions
System.out.println("# - actual - predicted - distribution");
for (int i = 0; i < test.numInstances(); i++) {
  double pred = cls.classifyInstance(test.instance(i));
  double[] dist = cls.distributionForInstance(test.instance(i));
  System.out.print((i+1) + " - ");
  System.out.print(test.instance(i).toString(test.classIndex()) + " - ");
  System.out.print(test.classAttribute().value((int) pred) + " - ");
  System.out.println(Utils.arrayToString(dist));
}
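The relationship between a class distribution, a 0-based class index and a class label can be shown in isolation. The sketch below is plain Java and does not call WEKA; DistributionSketch, normalize, argMax and labelFor are hypothetical names, and the scores and labels are made up. It assumes, as is typical but not guaranteed for every classifier, that the predicted class is the one with the largest probability.

```java
/** Hypothetical helper showing how a distribution, a class index and a label relate. */
final class DistributionSketch {
    /** Scales non-negative class scores so that they sum to 1. */
    static double[] normalize(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("scores must have a positive sum");
        }
        double[] dist = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            dist[i] = scores[i] / sum;
        }
        return dist;
    }

    /** 0-based index of the largest entry (ties go to the lowest index). */
    static int argMax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++) {
            if (dist[i] > dist[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Turns a prediction stored as a double (a 0-based index) into its label. */
    static String labelFor(String[] classLabels, double prediction) {
        return classLabels[(int) prediction];
    }

    public static void main(String[] args) {
        String[] labels = {"no", "maybe", "yes"};          // made-up class labels
        double[] dist = normalize(new double[] {2, 6, 2}); // made-up class scores
        double prediction = argMax(dist);                  // stored as double, like a class value
        System.out.println(java.util.Arrays.toString(dist));
        System.out.println(labelFor(labels, prediction));
    }
}
```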

Reposted from: https://my.oschina.net/leopardsaga/blog/92740
