Writing New Learning Schemes in WEKA

Latest recommended article published on 2019-05-05 10:02:29

comlc

Article link: https://blog.csdn.net/comlc/article/details/1933775


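1. Writing a new learning scheme

Suppose you need to implement a special-purpose learning algorithm that Weka does not include, or you are doing machine learning research and want to try out a new learning scheme, or you simply want to learn more about the inner workings of an induction algorithm by programming it yourself. This section uses a simple example to show how to make full use of Weka's class hierarchy when writing a classifier.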

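Weka includes the elementary learning schemes listed in Table 15-1, which are mainly intended for educational purposes. None of the schemes in the table takes scheme-specific command-line options. They are all useful for understanding the inner workings of a classifier. We will discuss weka.classifiers.trees.Id3 as an example; it implements the ID3 decision tree learner from Section 4.3.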
Table 15-1 Simple learning schemes in Weka

Scheme	Description
weka.classifiers.bayes.NaiveBayesSimple	Probabilistic learner
weka.classifiers.trees.Id3	Decision tree learner
weka.classifiers.rules.Prism	Rule learner
weka.classifiers.lazy.IB1	Instance-based learner

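2. An example classifier
Figure 15-1 gives the source code of weka.classifiers.trees.Id3, which, as you can see from the code, extends the Classifier class. Every classifier in Weka must extend the Classifier class, whether it predicts a nominal class or a numeric one.
The first method in weka.classifiers.trees.Id3 is globalInfo(); we mention it here before moving on to the more interesting parts. It simply returns a string that is displayed in Weka's graphical user interface when this scheme is selected.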

package weka.classifiers.trees;

import weka.classifiers.*;
import weka.core.*;
import java.io.*;
import java.util.*;

/**
 * Class implementing an Id3 decision tree classifier.
 */
public class Id3 extends Classifier {

  /** The node's successors. */
  private Id3[] m_Successors;

  /** Attribute used for splitting. */
  private Attribute m_Attribute;

  /** Class value if node is leaf. */
  private double m_ClassValue;

  /** Class distribution if node is leaf. */
  private double[] m_Distribution;

  /** Class attribute of dataset. */
  private Attribute m_ClassAttribute;

  /**
   * Returns a string describing the classifier.
   * @return a description suitable for the GUI.
   */
  public String globalInfo() {
    return "Class for constructing an unpruned decision tree based on the ID3 "
      + "algorithm. Can only deal with nominal attributes. No missing values "
      + "allowed. Empty leaves may result in unclassified instances. For more "
      + "information see: \n\n"
      + " R. Quinlan (1986). \"Induction of decision "
      + "trees\". Machine Learning. Vol.1, No.1, pp. 81-106";
  }

  /**
   * Builds Id3 decision tree classifier.
   *
   * @param data the training data
   * @exception Exception if classifier can't be built successfully
   */
  public void buildClassifier(Instances data) throws Exception {
    if (!data.classAttribute().isNominal()) {
      throw new UnsupportedClassTypeException("Id3: nominal class, please.");
    }
    Enumeration enumAtt = data.enumerateAttributes();
    while (enumAtt.hasMoreElements()) {
      if (!((Attribute) enumAtt.nextElement()).isNominal()) {
        throw new UnsupportedAttributeTypeException("Id3: only nominal " +
                                                    "attributes, please.");
      }
    }
    // "enum" is a reserved word since Java 5, so the variable is named enumInst.
    Enumeration enumInst = data.enumerateInstances();
    while (enumInst.hasMoreElements()) {
      if (((Instance) enumInst.nextElement()).hasMissingValue()) {
        throw new NoSupportForMissingValuesException("Id3: no missing values, "
                                                     + "please.");
      }
    }
    data = new Instances(data);
    data.deleteWithMissingClass();
    makeTree(data);
  }

  /**
   * Method for building an Id3 tree.
   *
   * @param data the training data
   * @exception Exception if decision tree can't be built successfully
   */
  private void makeTree(Instances data) throws Exception {
    // Check if no instances have reached this node.
    if (data.numInstances() == 0) {
      m_Attribute = null;
      m_ClassValue = Instance.missingValue();
      m_Distribution = new double[data.numClasses()];
      return;
    }
    // Compute attribute with maximum information gain.
    double[] infoGains = new double[data.numAttributes()];
    Enumeration attEnum = data.enumerateAttributes();
    while (attEnum.hasMoreElements()) {
      Attribute att = (Attribute) attEnum.nextElement();
      infoGains[att.index()] = computeInfoGain(data, att);
    }
    m_Attribute = data.attribute(Utils.maxIndex(infoGains));
    // Make leaf if information gain is zero.
    // Otherwise create successors.
    if (Utils.eq(infoGains[m_Attribute.index()], 0)) {
      m_Attribute = null;
      m_Distribution = new double[data.numClasses()];
      Enumeration instEnum = data.enumerateInstances();
      while (instEnum.hasMoreElements()) {
        Instance inst = (Instance) instEnum.nextElement();
        m_Distribution[(int) inst.classValue()]++;
      }
      Utils.normalize(m_Distribution);
      m_ClassValue = Utils.maxIndex(m_Distribution);
      m_ClassAttribute = data.classAttribute();
    } else {
      Instances[] splitData = splitData(data, m_Attribute);
      m_Successors = new Id3[m_Attribute.numValues()];
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        m_Successors[j] = new Id3();
        m_Successors[j].makeTree(splitData[j]);
      }
    }
  }

  /**
   * Classifies a given test instance using the decision tree.
   *
   * @param instance the instance to be classified
   * @return the classification
   */
  public double classifyInstance(Instance instance)
    throws NoSupportForMissingValuesException {
    if (instance.hasMissingValue()) {
      throw new NoSupportForMissingValuesException("Id3: no missing values, "
                                                   + "please.");
    }
    if (m_Attribute == null) {
      return m_ClassValue;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        classifyInstance(instance);
    }
  }

  /**
   * Computes class distribution for instance using decision tree.
   *
   * @param instance the instance for which distribution is to be computed
   * @return the class distribution for the given instance
   */
  public double[] distributionForInstance(Instance instance)
    throws NoSupportForMissingValuesException {
    if (instance.hasMissingValue()) {
      throw new NoSupportForMissingValuesException("Id3: no missing values, "
                                                   + "please.");
    }
    if (m_Attribute == null) {
      return m_Distribution;
    } else {
      return m_Successors[(int) instance.value(m_Attribute)].
        distributionForInstance(instance);
    }
  }

  /**
   * Prints the decision tree using the private toString method from below.
   *
   * @return a textual description of the classifier
   */
  public String toString() {
    if ((m_Distribution == null) && (m_Successors == null)) {
      return "Id3: No model built yet.";
    }
    return "Id3\n\n" + toString(0);
  }

  /**
   * Computes information gain for an attribute.
   *
   * @param data the data for which info gain is to be computed
   * @param att the attribute
   * @return the information gain for the given attribute and data
   */
  private double computeInfoGain(Instances data, Attribute att)
    throws Exception {
    double infoGain = computeEntropy(data);
    Instances[] splitData = splitData(data, att);
    for (int j = 0; j < att.numValues(); j++) {
      if (splitData[j].numInstances() > 0) {
        infoGain -= ((double) splitData[j].numInstances() /
                     (double) data.numInstances()) *
          computeEntropy(splitData[j]);
      }
    }
    return infoGain;
  }

  /**
   * Computes the entropy of a dataset.
   *
   * @param data the data for which entropy is to be computed
   * @return the entropy of the data's class distribution
   */
  private double computeEntropy(Instances data) throws Exception {
    double[] classCounts = new double[data.numClasses()];
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      classCounts[(int) inst.classValue()]++;
    }
    double entropy = 0;
    for (int j = 0; j < data.numClasses(); j++) {
      if (classCounts[j] > 0) {
        entropy -= classCounts[j] * Utils.log2(classCounts[j]);
      }
    }
    entropy /= (double) data.numInstances();
    return entropy + Utils.log2(data.numInstances());
  }

  /**
   * Splits a dataset according to the values of a nominal attribute.
   *
   * @param data the data which is to be split
   * @param att the attribute to be used for splitting
   * @return the sets of instances produced by the split
   */
  private Instances[] splitData(Instances data, Attribute att) {
    Instances[] splitData = new Instances[att.numValues()];
    for (int j = 0; j < att.numValues(); j++) {
      splitData[j] = new Instances(data, data.numInstances());
    }
    Enumeration instEnum = data.enumerateInstances();
    while (instEnum.hasMoreElements()) {
      Instance inst = (Instance) instEnum.nextElement();
      splitData[(int) inst.value(att)].add(inst);
    }
    for (int i = 0; i < splitData.length; i++) {
      splitData[i].compactify();
    }
    return splitData;
  }

  /**
   * Outputs a tree at a certain level.
   *
   * @param level the level at which the tree is to be printed
   */
  private String toString(int level) {
    StringBuffer text = new StringBuffer();
    if (m_Attribute == null) {
      if (Instance.isMissingValue(m_ClassValue)) {
        text.append(": null");
      } else {
        text.append(": " + m_ClassAttribute.value((int) m_ClassValue));
      }
    } else {
      for (int j = 0; j < m_Attribute.numValues(); j++) {
        text.append("\n");
        for (int i = 0; i < level; i++) {
          text.append("| ");
        }
        text.append(m_Attribute.name() + " = " + m_Attribute.value(j));
        text.append(m_Successors[j].toString(level + 1));
      }
    }
    return text.toString();
  }

  /**
   * Main method.
   *
   * @param args the options for the classifier
   */
  public static void main(String[] args) {
    try {
      System.out.println(Evaluation.evaluateModel(new Id3(), args));
    } catch (Exception e) {
      System.err.println(e.getMessage());
    }
  }
}
Figure 15-1 Source code for the ID3 decision tree learner

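3. buildClassifier()

The buildClassifier() method constructs a classifier from a training dataset. Because the ID3 algorithm cannot handle non-nominal classes, missing attribute values, or any non-nominal attribute, buildClassifier() first checks the data for these. It then makes a copy of the training set (to avoid changing the original data) and calls a method from weka.core.Instances to delete all instances with missing class values, because these instances are useless in the training process. Finally it calls makeTree(), which actually builds the decision tree by recursively generating all subtrees attached to the root node.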
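
The copy-then-filter step can be illustrated without Weka on the classpath. The following is only a rough sketch under stated assumptions: rows are plain int arrays of nominal value indices with the class index last, and the class name TrainingCopySketch, the method copyWithoutMissingClass(), and the MISSING marker are all hypothetical stand-ins, not part of Weka.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for the copy-then-filter step of buildClassifier(); not Weka code. */
class TrainingCopySketch {
  /** Hypothetical marker for a missing class label. */
  static final int MISSING = -1;

  /**
   * Returns a filtered copy of the rows. Each row holds nominal value
   * indices, and its last element is the class index.
   */
  static List<int[]> copyWithoutMissingClass(List<int[]> rows) {
    List<int[]> copy = new ArrayList<int[]>(rows.size());
    for (int[] row : rows) {
      if (row[row.length - 1] != MISSING) {
        copy.add(row.clone()); // clone so later edits cannot reach the caller's rows
      }
    }
    return copy;
  }
}
```

Working on the copy is what lets the training routine drop rows freely while the caller's dataset stays exactly as it was passed in.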

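4. makeTree()

In makeTree(), the first step is to check whether the dataset is empty. If it is, a leaf is created by setting m_Attribute to null. The class value m_ClassValue assigned to this leaf is set to missing, and the estimated probability for each of the dataset's classes in m_Distribution is initialized to 0. If training instances are present, makeTree() finds the attribute that yields the greatest information gain for them. It first creates a Java enumeration of the dataset's attributes. If the index of the class attribute is set, as it is for this dataset, the class attribute is automatically excluded from the enumeration.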

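Inside the enumeration, the information gain of each attribute is computed by computeInfoGain() and stored in an array. We will return to this method later. The index() method from weka.core.Attribute returns the attribute's index in the dataset, which is used to index the array just mentioned. Once the enumeration is complete, the attribute with the greatest information gain is stored in the instance variable m_Attribute. The maxIndex() method from weka.core.Utils returns the index of the greatest value in an array of integers or doubles. (If more than one element has the maximum value, the first is returned.) The index of this attribute is passed to the attribute() method from weka.core.Instances, which returns the corresponding attribute.

You might wonder what happens to the array element that corresponds to the class attribute. There is no need to worry: Java automatically initializes all elements of a numeric array to 0, and the information gain is always greater than or equal to 0. If the maximum information gain is 0, makeTree() creates a leaf. In that case m_Attribute is set to null, and makeTree() computes both the distribution of class probabilities and the class with the greatest probability. (The normalize() method from weka.core.Utils normalizes an array of doubles so that its elements sum to 1.)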
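
To make the tie-breaking and the zero-initialized class slot concrete, here is a small sketch of two helpers that behave the way the text describes maxIndex() and normalize(). They are hand-written stand-ins in a hypothetical class ArrayHelpersSketch, not the Weka implementations, and the behavior on an all-zero array is an assumption of this sketch.

```java
/** Stand-ins for the two Utils helpers described in the text; not Weka code. */
class ArrayHelpersSketch {
  /** Index of the largest element; the first one wins on ties. */
  static int maxIndex(double[] values) {
    int best = 0;
    for (int i = 1; i < values.length; i++) {
      if (values[i] > values[best]) { // strict comparison keeps the first maximum
        best = i;
      }
    }
    return best;
  }

  /** Scales the array in place so that its elements sum to 1. */
  static void normalize(double[] values) {
    double sum = 0;
    for (double v : values) {
      sum += v;
    }
    if (sum == 0) {
      // Assumption of this sketch: refuse rather than divide by zero.
      throw new IllegalArgumentException("Cannot normalize an all-zero array.");
    }
    for (int i = 0; i < values.length; i++) {
      values[i] /= sum;
    }
  }
}
```

If a gains array is allocated with one slot per attribute and the class attribute's slot is never written, that slot stays at 0.0 and can only be selected when every other gain is also 0, which is exactly the case in which a leaf is created anyway.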

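When it makes a leaf with a class value assigned to it, makeTree() stores the class attribute in m_ClassAttribute. This is because the method that outputs the decision tree needs to access it to print the class label.

If an attribute with nonzero information gain is found, makeTree() splits the dataset according to the attribute's values and recursively builds subtrees for each of the new datasets. To make the split it calls the method splitData(). This creates as many empty datasets as there are attribute values and stores them in an array (setting the initial capacity of each dataset to the number of instances in the original dataset). It then iterates through all instances in the original dataset and adds each one to the new dataset that corresponds to its attribute value, and finally compacts the Instances objects to reduce memory use. Returning to makeTree(), the resulting array of datasets is used for building subtrees. The method creates an array of Id3 objects, one for each attribute value, and calls makeTree() on each one, passing it the corresponding dataset.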
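
The bucketing performed by splitData() can be sketched with plain collections. This is an illustrative stand-in, not the Weka method: rows are int arrays of value indices, and SplitSketch, split(), and the numValues parameter are names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in for splitting a dataset on a nominal attribute; not Weka code. */
class SplitSketch {
  /**
   * Partitions rows by the value index stored at position att.
   * numValues is the number of distinct values the attribute can take.
   */
  static List<List<int[]>> split(List<int[]> rows, int att, int numValues) {
    List<List<int[]>> subsets = new ArrayList<List<int[]>>(numValues);
    for (int j = 0; j < numValues; j++) {
      subsets.add(new ArrayList<int[]>()); // one bucket per value, possibly left empty
    }
    for (int[] row : rows) {
      subsets.get(row[att]).add(row); // the value index selects the bucket directly
    }
    return subsets;
  }
}
```

A bucket that stays empty corresponds to the empty dataset that makeTree() turns into a leaf with a missing class value.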

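5. computeInfoGain()

Returning to computeInfoGain(), the information gain associated with an attribute and a dataset is calculated using a straightforward implementation of the formula in Section 4.3. First the entropy of the dataset is computed. Then splitData() is used to divide it into subsets, and computeEntropy() is called on each one. Finally, the difference between the former entropy and the weighted sum of the latter ones, which is the information gain, is returned. The method computeEntropy() uses the log2() method from weka.core.Utils to obtain the logarithm (to base 2) of a number.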
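
The arithmetic can be checked on bare class counts. The sketch below mirrors the form used by computeEntropy() in the listing, which relies on the identity -sum((n_i/N) * log2(n_i/N)) = log2(N) - (1/N) * sum(n_i * log2(n_i)). The class InfoGainSketch and its method names are hypothetical, and the counts in the usage note are illustrative numbers, not data taken from this article.

```java
/** Entropy and information gain on raw class counts; not Weka code. */
class InfoGainSketch {
  static double log2(double x) {
    return Math.log(x) / Math.log(2);
  }

  /** Entropy in bits, computed as log2(N) - (1/N) * sum(n_i * log2(n_i)). */
  static double entropy(double[] classCounts) {
    double total = 0;
    double sum = 0;
    for (double n : classCounts) {
      total += n;
      if (n > 0) {
        sum -= n * log2(n); // empty classes contribute nothing
      }
    }
    if (total == 0) {
      return 0; // assumption of this sketch: an empty set has zero entropy
    }
    return sum / total + log2(total);
  }

  /** Parent entropy minus the size-weighted entropy of each subset. */
  static double infoGain(double[] parentCounts, double[][] subsetCounts) {
    double total = 0;
    for (double n : parentCounts) {
      total += n;
    }
    double gain = entropy(parentCounts);
    for (double[] subset : subsetCounts) {
      double size = 0;
      for (double n : subset) {
        size += n;
      }
      if (size > 0) {
        gain -= (size / total) * entropy(subset);
      }
    }
    return gain;
  }
}
```

For example, with parent counts {9, 5} split into subsets {2, 3}, {4, 0}, and {3, 2}, the parent entropy is about 0.940 bits and the gain is about 0.247 bits.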

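6. classifyInstance()

Having seen how ID3 constructs a decision tree, we now examine how it uses the tree structure to predict class values and probabilities. Every classifier must implement the classifyInstance() method or the distributionForInstance() method (or both). The Classifier superclass contains default implementations of both methods. The default implementation of classifyInstance() calls distributionForInstance(). If the class is nominal, it predicts the class with the greatest probability, or returns a missing value if all probabilities returned by distributionForInstance() are zero. If the class is numeric, distributionForInstance() must return a single-element array that holds the numeric prediction, and this is what classifyInstance() extracts and returns. Conversely, the default implementation of distributionForInstance() wraps the prediction obtained from classifyInstance() into a single-element array. If the class is nominal, distributionForInstance() assigns a probability of 1 to the class predicted by classifyInstance() and a probability of 0 to the others. If classifyInstance() returns a missing value, all probabilities are set to 0. To give you a better feeling for what these methods do, the weka.classifiers.trees.Id3 class overrides both of them.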
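
The way the two defaults lean on each other, for the nominal case only, might look roughly like the sketch below. It is not the Classifier source: DefaultsSketch, classify(), distribution(), and the use of Double.NaN as the missing marker are assumptions made for illustration.

```java
/** Mutually dependent defaults for a nominal class; not Weka code. */
abstract class DefaultsSketch {
  /** Stand-in for a missing prediction (an assumption of this sketch). */
  static final double MISSING = Double.NaN;

  final int numClasses;

  DefaultsSketch(int numClasses) {
    this.numClasses = numClasses;
  }

  /** Default: the most probable class, or MISSING if every probability is zero. */
  double classify(int[] instance) {
    double[] dist = distribution(instance);
    int best = 0;
    double max = 0;
    for (int i = 0; i < dist.length; i++) {
      if (dist[i] > max) {
        max = dist[i];
        best = i;
      }
    }
    return max > 0 ? best : MISSING;
  }

  /** Default: probability 1 for the predicted class, all zeros if it is missing. */
  double[] distribution(int[] instance) {
    double[] dist = new double[numClasses];
    double predicted = classify(instance);
    if (!Double.isNaN(predicted)) {
      dist[(int) predicted] = 1.0;
    }
    return dist;
  }
}
```

A subclass has to override at least one of the two methods. If it overrides neither, each default keeps calling the other and the recursion never terminates, which is why the text says that every classifier must implement one of them.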

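Let's look first at classifyInstance(), which predicts a class value for a given instance. As mentioned in the previous section, nominal class values, like nominal attribute values, are coded and stored in double variables that represent the index of the value's name in the attribute declaration. This representation is used in preference to a more elegant object-oriented one in order to increase execution speed. In the implementation of ID3, classifyInstance() first checks whether the instance to be classified has missing values. If it does, the method throws an exception. Otherwise, it descends the tree recursively, guided by the instance's attribute values, until a leaf is reached. It then returns the class value m_ClassValue stored at that leaf. Note that this might be a missing value, in which case the instance is left unclassified. The distributionForInstance() method works in exactly the same way, returning the probability distribution stored in m_Distribution.

Most machine learning models, and decision trees in particular, give a more or less comprehensive picture of the structure of the data itself. Accordingly, each Weka classifier, like many other Java objects, implements a toString() method that produces a textual representation of itself as a String. ID3's toString() method outputs a decision tree in roughly the same format as J4.8 (Figure 10-5). It recursively writes the tree structure into a String by accessing the attribute information stored at the nodes. To obtain each attribute's name and values, it uses the name() and value() methods from weka.core.Attribute. Empty leaves without a class value are indicated by the string null.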
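
Both recursions, the descent to a leaf and the indented printout, can be tried on a hand-built tree. The sketch below is a simplified stand-in for the Id3 node: TreeSketch and its fields are hypothetical, class labels are stored as strings rather than as double-coded indices, and the attribute names in the usage note are made up for the example.

```java
/** Minimal tree node with recursive classification and printing; not Weka code. */
class TreeSketch {
  String attName;          // null at a leaf
  int attIndex;            // position of the split attribute in an instance
  String[] valueNames;     // names of the split attribute's values
  TreeSketch[] successors; // one subtree per value
  String classLabel;       // leaf label; null marks an empty leaf

  static TreeSketch leaf(String classLabel) {
    TreeSketch t = new TreeSketch();
    t.classLabel = classLabel;
    return t;
  }

  static TreeSketch split(String attName, int attIndex, String[] valueNames,
                          TreeSketch[] successors) {
    TreeSketch t = new TreeSketch();
    t.attName = attName;
    t.attIndex = attIndex;
    t.valueNames = valueNames;
    t.successors = successors;
    return t;
  }

  /** Follows the branch chosen by the instance's value until a leaf is reached. */
  String classify(int[] instance) {
    if (attName == null) {
      return classLabel;
    }
    return successors[instance[attIndex]].classify(instance);
  }

  /** One line per branch, prefixed with "| " for each level of depth. */
  String toString(int level) {
    StringBuilder text = new StringBuilder();
    if (attName == null) {
      text.append(": ").append(classLabel); // a null label prints as "null"
    } else {
      for (int j = 0; j < successors.length; j++) {
        text.append("\n");
        for (int i = 0; i < level; i++) {
          text.append("| ");
        }
        text.append(attName).append(" = ").append(valueNames[j]);
        text.append(successors[j].toString(level + 1));
      }
    }
    return text.toString();
  }
}
```

Building a root that splits on a made-up attribute "outlook" with a nested split on "humidity" and calling toString(0) gives one line per branch, with leaves shown after a colon and the empty leaf shown as null.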

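7. main()

The only method in weka.classifiers.trees.Id3 that has not been described is main(), which is called whenever the class is executed from the command line. As you can see, it is simple: it basically just tells Weka's Evaluation class to evaluate Id3 with the given command-line options and prints the resulting string. The one-line expression that does this is enclosed in a try-catch statement, which catches the various exceptions that can be thrown by Weka's routines or other Java methods.

The evaluateModel() method in weka.classifiers.Evaluation interprets the generic scheme-independent command-line options discussed in Section 13.3 and acts accordingly. For example, it takes the -t option, which gives the name of the training file, and loads the corresponding dataset. If there is no test file, it performs a cross-validation by creating a classifier object and repeatedly calling buildClassifier(), classifyInstance(), and distributionForInstance() on different subsets of the training data. Unless the user suppresses output of the model by setting the corresponding command-line option, it also calls the toString() method to output the model built from the full training dataset.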

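What happens if the scheme needs to interpret a specific option such as a pruning parameter? This is accomplished using the OptionHandler interface in weka.core. A classifier that implements this interface contains three methods, listOptions(), setOptions(), and getOptions(), which can be used to list all the classifier's scheme-specific options, to set some of them, and to get the options that are currently set. The evaluateModel() method in Evaluation automatically calls these methods if the classifier implements the OptionHandler interface. Once the scheme-independent options have been processed, it calls setOptions() to process the remaining options before using buildClassifier() to generate a new classifier. When it outputs the classifier, it uses getOptions() to output a list of the options that are currently set. For a simple example of how to implement these methods, look at the source code for weka.classifiers.rules.OneR.

OptionHandler makes it possible to set options from the command line. To set them from within the graphical user interfaces, Weka uses the Java beans framework. All that is required are set...() and get...() methods for every parameter used by the class. For example, the methods setPruningParameter() and getPruningParameter() would be needed for a pruning parameter. There should also be a pruningParameterTipText() method that returns a description of the parameter for the graphical user interface. Again, see weka.classifiers.rules.OneR for an example.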
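
The two conventions can be seen side by side in a small stand-alone class. This sketch does not implement the real OptionHandler interface (listOptions() is omitted, and Weka is not on the classpath); the class PruningOptionSketch, the -P flag, the default of 0.25, and the tip text are all hypothetical and chosen only to match the pruning-parameter example in the text.

```java
/** Hypothetical classifier with one scheme-specific option; not Weka code. */
class PruningOptionSketch {
  private double pruningParameter = 0.25; // hypothetical default

  // Bean-style accessors: what a GUI property editor would look for.
  public void setPruningParameter(double value) {
    pruningParameter = value;
  }

  public double getPruningParameter() {
    return pruningParameter;
  }

  /** Description shown next to the parameter in a GUI. */
  public String pruningParameterTipText() {
    return "Hypothetical pruning strength used by this sketch.";
  }

  /** Parses "-P <value>" if present; other entries are ignored here. */
  public void setOptions(String[] options) {
    for (int i = 0; i < options.length - 1; i++) {
      if (options[i].equals("-P")) {
        setPruningParameter(Double.parseDouble(options[i + 1]));
      }
    }
  }

  /** Reports the current setting in the same form that setOptions() accepts. */
  public String[] getOptions() {
    return new String[] {"-P", String.valueOf(pruningParameter)};
  }
}
```

Routing the command-line parser through the same setter keeps both entry points consistent, and making getOptions() emit what setOptions() accepts lets a configuration be copied from one object to another.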

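Some classifiers can be updated incrementally as new training instances arrive, so they do not have to process all the data in one batch. In Weka, incremental classifiers implement the UpdateableClassifier interface in weka.classifiers. This interface declares only one method, updateClassifier(), which takes a single training instance as its argument. For an example of how to use this interface, look at the source code for weka.classifiers.lazy.IBk.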
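
The idea of a one-instance update method can be shown with a deliberately trivial model, a majority-class counter. This is only a sketch of the pattern: IncrementalMajoritySketch, update(), build(), and predict() are invented names, an instance is reduced to its class index, and nothing here comes from the IBk source.

```java
import java.util.Arrays;

/** Majority-class model that can absorb one instance at a time; not Weka code. */
class IncrementalMajoritySketch {
  private final double[] counts;

  IncrementalMajoritySketch(int numClasses) {
    counts = new double[numClasses];
  }

  /** Folds a single training instance, given by its class index, into the model. */
  void update(int classIndex) {
    counts[classIndex]++;
  }

  /** Batch training: reset the model, then feed the instances one by one. */
  void build(int[] classIndices) {
    Arrays.fill(counts, 0);
    for (int c : classIndices) {
      update(c);
    }
  }

  /** Index of the most frequent class seen so far (first one on ties). */
  int predict() {
    int best = 0;
    for (int i = 1; i < counts.length; i++) {
      if (counts[i] > counts[best]) {
        best = i;
      }
    }
    return best;
  }
}
```

Because build() is written in terms of update(), training on a batch and then updating with further instances gives the same model as training on all of them at once.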

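If a classifier is able to make use of instance weights, it should implement the WeightedInstancesHandler interface from weka.core. Other algorithms, such as those for boosting, can then make use of this property.

weka.core contains many other interfaces that are useful for classifiers, for example, interfaces for classifiers that are randomizable, summarizable, drawable, and graphable. For more information on these and other interfaces, see the Javadoc for the corresponding classes in weka.core.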

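8. Conventions for implementing classifiers

There are some conventions that you must obey when implementing classifiers in Weka. If you do not, things will go wrong. For example, Weka's evaluation module might not compute the classifier's statistics properly when evaluating it.
The first convention has already been mentioned: each time a classifier's buildClassifier() method is called, it must reset the model. The CheckClassifier class performs tests to ensure that this is the case. When buildClassifier() is called on a dataset, the same result must always be obtained, regardless of how many times the classifier has previously been applied to the same or other datasets. However, buildClassifier() must not reset instance variables that correspond to scheme-specific options, because once these settings have been made they must persist through multiple calls of buildClassifier(). Also, calling buildClassifier() must never change the input data.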
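
A compact way to see the three requirements together (rebuild the model from scratch, leave option variables alone, leave the input alone) is the sketch below. It is a toy model, not Weka code: ResetSketch, its minCount option, and the convention of returning -1 for "no prediction" are hypothetical.

```java
/** Toy model showing what build() resets and what it must keep; not Weka code. */
class ResetSketch {
  private int minCount = 1; // hypothetical scheme-specific option; survives rebuilds
  private double[] counts;  // the model; rebuilt from scratch on every call

  void setMinCount(int minCount) {
    this.minCount = minCount;
  }

  int getMinCount() {
    return minCount;
  }

  /** Rebuilds the model from class indices without touching the array itself. */
  void build(int[] classIndices, int numClasses) {
    counts = new double[numClasses]; // reset: nothing from an earlier call leaks in
    for (int c : classIndices) {
      counts[c]++; // read-only pass over the input
    }
  }

  /** Majority class, or -1 if it was seen fewer than minCount times. */
  int predict() {
    int best = 0;
    for (int i = 1; i < counts.length; i++) {
      if (counts[i] > counts[best]) {
        best = i;
      }
    }
    return counts[best] >= minCount ? best : -1;
  }
}
```

Calling build() twice in a row on different data leaves only the second dataset reflected in the model, while a minCount set beforehand is still in force afterwards.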

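Two other conventions have also been mentioned. One is that when a classifier cannot make a prediction, its classifyInstance() method must return Instance.missingValue() and its distributionForInstance() method must return probabilities of zero for all classes. The ID3 implementation in Figure 15-1 does this. The other is that, for classifiers used for numeric prediction, classifyInstance() returns the numeric value that the classifier predicts. Some classifiers are able to predict nominal classes and their class probabilities as well as numeric class values; weka.classifiers.lazy.IBk is an example. These classifiers implement the distributionForInstance() method, and if the class is numeric it returns a single-element array whose only element contains the predicted numeric value.

Another convention, not absolutely essential but useful nonetheless, is that every classifier implements a toString() method that outputs a textual description of itself.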