Use Weka in your Java code

http://weka.wikispaces.com/Use+Weka+in+your+Java+code

 

The most common components you might want to use are:

  • Instances - your data
  • Filter - for pre-processing the data
  • Classifier/Clusterer - built on the processed data
  • Evaluating - how good is the classifier/clusterer?
  • Attribute selection - removing irrelevant attributes from your data

The following sections explain how to use them in your own code. The classifiers and filters always list their options in the Javadoc API specification (stable and developer versions).

You might also want to check out the Weka Examples collection, which contains examples for the different versions of Weka. Another, more comprehensive source of information is the chapter Using the API of the Weka manual for the stable-3.6 and developer versions (snapshots and releases later than 09/08/2009).

 Instances

 ARFF File

 Pre 3.5.5 and 3.4.x

Reading from an ARFF file is straightforward:
 import weka.core.Instances;
 import java.io.BufferedReader;
 import java.io.FileReader;
 ...
 BufferedReader reader = new BufferedReader(
                              new FileReader("/some/where/data.arff"));
 Instances data = new Instances(reader);
 reader.close();
 // setting class attribute
 data.setClassIndex(data.numAttributes() - 1);

The class index indicates the target attribute used for classification. By default, in an ARFF file, it is assumed to be the last attribute, which is why it is set to numAttributes() - 1.
You must set it if your instances are used as a parameter of a Weka function that requires a class attribute (e.g., weka.classifiers.Classifier.buildClassifier(data)).

 3.5.5 and newer

The DataSource class is not limited to ARFF files; it can also read CSV files and other formats (basically all file formats that Weka can import via its converters).
 import weka.core.converters.ConverterUtils.DataSource;
 ...
 DataSource source = new DataSource("/some/where/data.arff");
 Instances data = source.getDataSet();
 // setting class attribute if the data format does not provide this information
 // E.g., the XRFF format saves the class attribute information as well
 if (data.classIndex() == -1)
   data.setClassIndex(data.numAttributes() - 1);

 Database

Reading from databases is a little bit more complicated, but still very easy. First, you'll have to modify your DatabaseUtils.props file to reflect your database connection. Suppose you want to connect to a MySQL server that is running on the local machine on port 3306 (the default). The MySQL JDBC driver is called Connector/J (the driver class is org.gjt.mm.mysql.Driver), and the database your target data resides in is called some_database. Since you're only reading, you can use the default user nobody without a password. The following lines have to be in your props file:
 jdbcDriver=org.gjt.mm.mysql.Driver
 jdbcURL=jdbc:mysql://localhost:3306/some_database
Secondly, your Java code needs to look something like this, to load the data from the database:
 import weka.core.Instances;
 import weka.experiment.InstanceQuery;
 ...
 InstanceQuery query = new InstanceQuery();
 query.setUsername("nobody");
 query.setPassword("");
 query.setQuery("select * from whatsoever");
 // if your data is sparse, then you can say so too
 // query.setSparseData(true);
 Instances data = query.retrieveInstances();

Notes: 
  • Don't forget to add the JDBC driver to your CLASSPATH.
  • For MS Access, you'll have to use the JDBC-ODBC bridge that is part of the JDK. The Windows databases article explains how to do this.

 Option handling

Weka schemes that implement the weka.core.OptionHandler interface, such as classifiers, clusterers, and filters, offer the following methods for setting and retrieving options:
  • void setOptions(String[] options)
  • String[] getOptions()
There are several ways of setting the options:
  • Manually creating a String array:
 String[] options = new String[2];
 options[0] = "-R";
 options[1] = "1";
  • Using a single commandline string and using the splitOptions method of the weka.core.Utils class to turn it into an array:
 String[] options = weka.core.Utils.splitOptions("-R 1");
  • Using the  OptionsToCode.java class to automatically turn a commandline into code. Especially handy, if the commandline contains nested classes that have their own options, like kernels for SMO:
 java OptionsToCode weka.classifiers.functions.SMO
    This will generate output like this:
 // create new instance of scheme
 weka.classifiers.functions.SMO scheme = new weka.classifiers.functions.SMO();
 // set options
 scheme.setOptions(weka.core.Utils.splitOptions("-C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K \"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0\""));
Also, the OptionTree.java tool allows you to view a nested options string, e.g., one used on the commandline, as a tree. This can help spot nesting errors.

 Filter

A filter has two different properties:
  • supervised or unsupervised
    either takes the class attribute into account or not
  • attribute- or instance-based
    e.g., removing a certain attribute or removing instances that meet a certain condition

Most filters implement the  OptionHandler interface, which means you can set the options via a String array, rather than setting them each manually via  set-methods.
E.g., if you want to remove the  first attribute of a dataset, you need this filter
 weka.filters.unsupervised.attribute.Remove
with this option
 -R 1
If you have an  Instances object, called  data, you can create and apply the filter like this:
 import weka.core.Instances;
 import weka.filters.Filter;
 import weka.filters.unsupervised.attribute.Remove;
 ...
 String[] options = new String[2];
 options[0] = "-R";                                    // "range"
 options[1] = "1";                                     // first attribute
 Remove remove = new Remove();                         // new instance of filter
 remove.setOptions(options);                           // set options
 remove.setInputFormat(data);                          // inform filter about dataset **AFTER** setting options
 Instances newData = Filter.useFilter(data, remove);   // apply filter
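
The Remove filter above is attribute-based. As a sketch of an instance-based filter (the second property listed earlier), the RemoveWithValues filter filters instances according to the value of a chosen attribute; the option values below are illustrative only, and its Javadoc describes the exact matching and inversion semantics:
 import weka.core.Instances;
 import weka.filters.Filter;
 import weka.filters.unsupervised.instance.RemoveWithValues;
 ...
 RemoveWithValues rwv = new RemoveWithValues();
 rwv.setOptions(weka.core.Utils.splitOptions("-C last -L 1"));  // select on the last attribute, first label
 rwv.setInputFormat(data);                                      // again: **AFTER** setting options
 Instances filtered = Filter.useFilter(data, rwv);              // apply filter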

 Filtering on-the-fly

The FilteredClassifier meta-classifier is an easy way of filtering data on-the-fly. It removes the necessity of filtering the data first before the classifier can be trained, and at prediction time one doesn't have to worry about passing the data through the trained filter again. The following is an example of using this meta-classifier with the Remove filter and J48 to get rid of a numeric ID attribute in the data:
 import weka.classifiers.meta.FilteredClassifier;
 import weka.classifiers.trees.J48;
 import weka.filters.unsupervised.attribute.Remove;
 ...
 Instances train = ...         // from somewhere
 Instances test = ...          // from somewhere
 // filter
 Remove rm = new Remove();
 rm.setAttributeIndices("1");  // remove 1st attribute
 // classifier
 J48 j48 = new J48();
 j48.setUnpruned(true);        // using an unpruned J48
 // meta-classifier
 FilteredClassifier fc = new FilteredClassifier();
 fc.setFilter(rm);
 fc.setClassifier(j48);
 // train and make predictions
 fc.buildClassifier(train);
 for (int i = 0; i < test.numInstances(); i++) {
   double pred = fc.classifyInstance(test.instance(i));
   System.out.print("ID: " + test.instance(i).value(0));
   System.out.print(", actual: " + test.classAttribute().value((int) test.instance(i).classValue()));
   System.out.println(", predicted: " + test.classAttribute().value((int) pred));
 }

Other handy meta-schemes in Weka that filter data on-the-fly in the same fashion (a short sketch follows):
  • FilteredClusterer - filters the data on-the-fly for the base clusterer
  • FilteredAssociator - filters the data on-the-fly for the base associator
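
As an illustration, here is a minimal sketch of the FilteredClusterer, again removing the first attribute before the data reaches the base clusterer (SimpleKMeans is just an example choice of base clusterer):
 import weka.clusterers.FilteredClusterer;
 import weka.clusterers.SimpleKMeans;
 import weka.filters.unsupervised.attribute.Remove;
 ...
 Remove rm = new Remove();
 rm.setAttributeIndices("1");             // remove 1st attribute
 FilteredClusterer fc = new FilteredClusterer();
 fc.setFilter(rm);                        // filter applied on-the-fly
 fc.setClusterer(new SimpleKMeans());     // base clusterer
 fc.buildClusterer(data);                 // data is filtered internally before clustering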

 Batch filtering

On the commandline, you can enable a second input/output pair (via -r and -s) with the -b option, in order to process the second file with the same filter setup as the first one. This is necessary if you're using attribute selection or standardization - otherwise you end up with incompatible datasets. In code this is fairly easy to do, since one initializes the filter only once with the setInputFormat(Instances) method, namely with the training set, and then applies the filter subsequently to the training set and the test set. The following example shows how to apply the Standardize filter to a train and a test set.
 Instances train = ...   // from somewhere
 Instances test = ...    // from somewhere
 Standardize filter = new Standardize();
 filter.setInputFormat(train);  // initializing the filter once with training set
 Instances newTrain = Filter.useFilter(train, filter);  // configures the Filter based on train instances and returns filtered instances
 Instances newTest = Filter.useFilter(test, filter);    // create new test set

 Calling conventions

The setInputFormat(Instances) method always has to be the last call before the filter is applied, e.g., with Filter.useFilter(Instances, Filter). Why? First, it is the convention for using filters and, second, lots of filters generate the header of the output format in the setInputFormat(Instances) method based on the currently set options (setting options after this call no longer has any effect).
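
A minimal sketch of the correct ordering, reusing the Remove filter and an Instances object data from above:
 import weka.core.Instances;
 import weka.filters.Filter;
 import weka.filters.unsupervised.attribute.Remove;
 ...
 Remove remove = new Remove();
 remove.setAttributeIndices("1");                     // 1. set all options first
 remove.setInputFormat(data);                         // 2. then inform the filter about the dataset
 Instances newData = Filter.useFilter(data, remove);  // 3. finally apply the filter
 // changing options on 'remove' at this point would not affect newData,
 // since the output format was already determined in setInputFormat(Instances)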

 Classification

The necessary classes can be found in this package:
 weka.classifiers

 Building a Classifier

 Batch

A Weka classifier is rather simple to train on a given dataset. E.g., we can train an unpruned C4.5 tree algorithm on a given dataset data. The training is done via the  buildClassifier(Instances) method.
 import weka.classifiers.trees.J48;
 ...
 String[] options = new String[1];
 options[0] = "-U";            // unpruned tree
 J48 tree = new J48();         // new instance of tree
 tree.setOptions(options);     // set the options
 tree.buildClassifier(data);   // build classifier

 Incremental

Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally. This conserves memory, since the data doesn't have to be loaded into memory all at once. See the Javadoc of this interface to see which classifiers implement it.

The actual process of training an incremental classifier is fairly simple:
  • Call buildClassifier(Instances) with the structure of the dataset (may or may not contain any actual data rows).
  • Subsequently call the updateClassifier(Instance) method to feed the classifier new weka.core.Instance objects, one by one.

Here is an example using data from a  weka.core.converters.ArffLoader to train weka.classifiers.bayes.NaiveBayesUpdateable:
 import weka.classifiers.bayes.NaiveBayesUpdateable;
 import weka.core.converters.ArffLoader;
 import weka.core.Instance;
 import weka.core.Instances;
 import java.io.File;
 ...
 // load data
 ArffLoader loader = new ArffLoader();
 loader.setFile(new File("/some/where/data.arff"));
 Instances structure = loader.getStructure();
 structure.setClassIndex(structure.numAttributes() - 1);
 
 // train NaiveBayes
 NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
 nb.buildClassifier(structure);
 Instance current;
 while ((current = loader.getNextInstance(structure)) != null)
   nb.updateClassifier(current);

A working example is IncrementalClassifier.java.

 Evaluating

 Cross-validation

If you only have a training set and no test set, you might want to evaluate the classifier by using 10 times 10-fold cross-validation. This can be done very easily via the Evaluation class. Here we seed the random selection of our folds for the CV with 1. Check out the Evaluation class for more information about the statistics it produces.
 import weka.classifiers.Evaluation;
 import java.util.Random;
 ...
 Evaluation eval = new Evaluation(newData);
 eval.crossValidateModel(tree, newData, 10, new Random(1));

Note: The classifier (in our example tree) should not be trained when handed over to the crossValidateModel method. Why? If the classifier does not abide by the Weka convention that a classifier must be re-initialized every time the buildClassifier method is called (in other words: subsequent calls to the buildClassifier method always produce the same model), you will get inconsistent and worthless results. The crossValidateModel method takes care of training and evaluating the classifier (it creates a copy of the original classifier for each run of the cross-validation).
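
To obtain the 10 times 10-fold cross-validation mentioned above, one can simply repeat the call with different seeds and average the results. A minimal sketch (averaging pctCorrect is just for illustration):
 import weka.classifiers.Evaluation;
 import weka.classifiers.trees.J48;
 import java.util.Random;
 ...
 int runs = 10;
 double sumPctCorrect = 0;
 for (int seed = 1; seed <= runs; seed++) {
   Evaluation eval = new Evaluation(newData);
   eval.crossValidateModel(new J48(), newData, 10, new Random(seed));  // fresh classifier per run
   sumPctCorrect += eval.pctCorrect();
 }
 System.out.println("avg. percent correct: " + (sumPctCorrect / runs));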

 Train/test set

In case you have a dedicated test set, you can train the classifier and then evaluate it on this test set. In the following example, a J48 is instantiated, trained, and then evaluated. Some statistics are printed to stdout:
 import weka.core.Instances;
 import weka.classifiers.Evaluation;
 import weka.classifiers.trees.J48;
 import weka.classifiers.Classifier;
 ...
 Instances train = ...   // from somewhere
 Instances test = ...    // from somewhere
 // train classifier
 Classifier cls = new J48();
 cls.buildClassifier(train);
 // evaluate classifier and print some statistics
 Evaluation eval = new Evaluation(train);
 eval.evaluateModel(cls, test);
 System.out.println(eval.toSummaryString("\nResults\n======\n", false));

 Statistics

Some methods for retrieving the results from the evaluation (a short usage sketch follows the list):
  • nominal class
    • correct() - number of correctly classified instances (see also incorrect())
    • pctCorrect() - percentage of correctly classified instances (see also pctIncorrect())
    • kappa() - Kappa statistic
  • numeric class
    • correlationCoefficient() - correlation coefficient
  • general
    • meanAbsoluteError() - the mean absolute error
    • rootMeanSquaredError() - the root mean squared error
    • unclassified() - number of unclassified instances
    • pctUnclassified() - percentage of unclassified instances
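
For example, after the train/test evaluation above, some of these statistics can be retrieved from the eval object like this (a minimal sketch for a nominal class):
 System.out.println("Correct: " + eval.correct() + " (" + eval.pctCorrect() + " %)");
 System.out.println("Incorrect: " + eval.incorrect() + " (" + eval.pctIncorrect() + " %)");
 System.out.println("Kappa: " + eval.kappa());
 System.out.println("Mean absolute error: " + eval.meanAbsoluteError());
 System.out.println("Root mean squared error: " + eval.rootMeanSquaredError());
 System.out.println("Unclassified: " + eval.unclassified() + " (" + eval.pctUnclassified() + " %)");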

If you want to have the exact same behavior as from the commandline, use this call:
 import weka.classifiers.trees.J48;
 import weka.classifiers.Evaluation;
 ...
 String[] options = new String[2];
 options[0] = "-t";
 options[1] = "/some/where/somefile.arff";
 System.out.println(Evaluation.evaluateModel(new J48(), options));

 ROC curves/AUC

Since Weka 3.5.1, you can also generate ROC curves/AUC with the predictions Weka recorded during testing. You can access these predictions via the predictions() method of the Evaluation class. See the Generating ROC curve article for a full example of how to generate ROC curves.
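
A minimal sketch of retrieving the AUC for the first class label from an Evaluation object, assuming a reasonably recent Weka version in which these methods are available and that the evaluation recorded its predictions (the default):
 import weka.classifiers.evaluation.ThresholdCurve;
 ...
 // AUC straight from the Evaluation object (for class index 0)
 System.out.println("AUC: " + eval.areaUnderROC(0));
 // or: generate the curve data (e.g., for plotting) and compute the area from it
 ThresholdCurve tc = new ThresholdCurve();
 Instances curve = tc.getCurve(eval.predictions(), 0);
 System.out.println("AUC from curve: " + ThresholdCurve.getROCArea(curve));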

 Classifying instances

In case you have an unlabeled dataset that you want to classify with your newly trained classifier, you can use the following code snippet. It loads the file  /some/where/unlabeled.arff, uses the previously built classifier  tree to label the instances, and saves the labeled data as  /some/where/labeled.arff.
 import java.io.BufferedReader;
 import java.io.BufferedWriter;
 import java.io.FileReader;
 import java.io.FileWriter;
 import weka.core.Instances;
 ...
 // load unlabeled data
 Instances unlabeled = new Instances(
                         new BufferedReader(
                           new FileReader("/some/where/unlabeled.arff")));
 
 // set class attribute
 unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
 
 // create copy
 Instances labeled = new Instances(unlabeled);
 
 // label instances
 for (int i = 0; i < unlabeled.numInstances(); i++) {
   double clsLabel = tree.classifyInstance(unlabeled.instance(i));
   labeled.instance(i).setClassValue(clsLabel);
 }
 // save labeled data
 BufferedWriter writer = new BufferedWriter(
                           new FileWriter("/some/where/labeled.arff"));
 writer.write(labeled.toString());
 writer.newLine();
 writer.flush();
 writer.close();

Note on nominal classes:
  • If you're interested in the distribution over all the classes, use the method distributionForInstance(Instance). This method returns a double array with the probability for each class.
  • The returned double value from classifyInstance (or the index in the array returned by distributionForInstance) is just the index of the string value in the class attribute. I.e., if you want the string representation of the class label clsLabel returned above, you can print it like this (a fuller sketch follows below):
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
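
For completeness, a minimal sketch that prints the full class distribution for each unlabeled instance, reusing the unlabeled data and the tree classifier from above:
 for (int i = 0; i < unlabeled.numInstances(); i++) {
   double[] dist = tree.distributionForInstance(unlabeled.instance(i));  // probability per class
   StringBuilder buf = new StringBuilder();
   for (int n = 0; n < dist.length; n++) {
     if (n > 0)
       buf.append(", ");
     buf.append(unlabeled.classAttribute().value(n) + ": " + dist[n]);
   }
   System.out.println("instance " + (i + 1) + ": " + buf);
 }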

 Clustering

Clustering is similar to Classification. The necessary classes can be found in this package:
 weka.clusterers

 Building a Clusterer

 Batch

A clusterer is built pretty much the same way as a Classifier. But instead of using the  buildClassifier(Instances) method, one uses the  buildClusterer(Instances) one. The following code snippet shows how to build an  EM clusterer with a maximum number of iterations of  100.
 import weka.clusterers.EM;
 ...
 String[] options = new String[2];
 options[0] = "-I";                 // max. iterations
 options[1] = "100";
 EM clusterer = new EM();   // new instance of clusterer
 clusterer.setOptions(options);     // set the options
 clusterer.buildClusterer(data);    // build the clusterer

 Incremental

Clusterers implementing the weka.clusterers.UpdateableClusterer interface can be trained incrementally (available since version 3.5.4). This conserves memory, since the data doesn't have to be loaded into memory all at once. See the Javadoc of this interface to see which clusterers implement it.

The actual process of training an incremental clusterer is fairly simple:
  • Call buildClusterer(Instances) with the structure of the dataset (may or may not contain any actual data rows).
  • Subsequently call the updateClusterer(Instance) method to feed the clusterer new weka.core.Instance objects, one by one.
  • Call updateFinished() after all Instance objects have been processed, for the clusterer to perform additional computations.

Here is an example using data from a  weka.core.converters.ArffLoader to train  weka.clusterers.Cobweb:
 import weka.clusterers.Cobweb;
 import weka.core.converters.ArffLoader;
 import weka.core.Instance;
 import weka.core.Instances;
 import java.io.File;
 ...
 // load data
 ArffLoader loader = new ArffLoader();
 loader.setFile(new File("/some/where/data.arff"));
 Instances structure = loader.getStructure();
 
 // train Cobweb
 Cobweb cw = new Cobweb();
 cw.buildClusterer(structure);
 Instance current;
 while ((current = loader.getNextInstance(structure)) != null)
   cw.updateClusterer(current);
 cw.updateFinished();

A working example is IncrementalClusterer.java.

 Evaluating

For evaluating a clusterer, you can use the ClusterEvaluation class, e.g., for outputting the number of clusters found:
 import weka.clusterers.ClusterEvaluation;
 import weka.clusterers.Clusterer;
 import weka.clusterers.EM;
 ...
 ClusterEvaluation eval = new ClusterEvaluation();
 Clusterer clusterer = new EM();                                 // new clusterer instance, default options
 clusterer.buildClusterer(data);                                 // build clusterer
 eval.setClusterer(clusterer);                                   // the clusterer to evaluate
 eval.evaluateClusterer(newData);                                // data to evaluate the clusterer on
 System.out.println("# of clusters: " + eval.getNumClusters());  // output # of clusters

Or, in the case of density-based clusterers, you can cross-validate the clusterer (note: with MakeDensityBasedClusterer you can turn any clusterer into a density-based one):
 import weka.clusterers.ClusterEvaluation;
 import weka.clusterers.DensityBasedClusterer;
 import weka.core.Instances;
 import java.util.Random;
 ...
 Instances data = ...                                     // from somewhere
 DensityBasedClusterer clusterer = new ...                // the clusterer to evaluate
 double logLikelihood = 
    ClusterEvaluation.crossValidateModel(                 // cross-validate
    clusterer, data, 10,                                  // with 10 folds 
    new Random(1));                                       // and random number generator with seed 1

Or, if you want the same behavior/print-out as from the commandline, use this call:
 import weka.clusterers.EM;
 import weka.clusterers.ClusterEvaluation;
 ...
 String[] options = new String[2];
 options[0] = "-t";
 options[1] = "/some/where/somefile.arff";
 System.out.println(ClusterEvaluation.evaluateClusterer(new EM(), options));

 Clustering instances

The only difference to classification is the method name. Instead of classifyInstance(Instance), it is now clusterInstance(Instance). The method for obtaining the distribution is still the same, i.e., distributionForInstance(Instance).
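
A minimal sketch, assuming a built clusterer (e.g., the EM instance from above) and an Instances object data to cluster:
 for (int i = 0; i < data.numInstances(); i++) {
   int cluster = clusterer.clusterInstance(data.instance(i));            // assigned cluster
   double[] dist = clusterer.distributionForInstance(data.instance(i));  // cluster membership distribution
   System.out.println("instance " + (i + 1) + " -> cluster " + cluster
     + ", distribution: " + weka.core.Utils.arrayToString(dist));
 }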

 Classes to clusters evaluation

If your data contains a class attribute and you want to check how well the generated clusters fit the classes, you can perform a so-called classes to clusters evaluation. The Weka Explorer offers this functionality, and it's quite easy to implement. These are the necessary steps (complete source code: ClassesToClusters.java):
  • load the data and set the class attribute
 Instances data = new Instances(new BufferedReader(new FileReader("/some/where/file.arff")));
 data.setClassIndex(data.numAttributes() - 1);
  • generate the class-less data to train the clusterer with
 weka.filters.unsupervised.attribute.Remove filter = new weka.filters.unsupervised.attribute.Remove();
 filter.setAttributeIndices("" + (data.classIndex() + 1));
 filter.setInputFormat(data);
 Instances dataClusterer = Filter.useFilter(data, filter);
  • train the clusterer, e.g., EM
 EM clusterer = new EM();
 // set further options for EM, if necessary...
 clusterer.buildClusterer(dataClusterer);
  • evaluate the clusterer with the data still containing the class attribute
 ClusterEvaluation eval = new ClusterEvaluation();
 eval.setClusterer(clusterer);
 eval.evaluateClusterer(data);
  • print the results of the evaluation to stdout
 System.out.println(eval.clusterResultsToString());

 Attribute selection

There's no real need to use the attribute selection classes directly in your own code, since there is already a meta-classifier and a filter available for applying attribute selection, but the low-level approach is still listed for the sake of completeness. The following examples all use CfsSubsetEval and GreedyStepwise (backwards). The code listed below is taken from AttributeSelectionTest.java.

 Meta-Classifier

The following meta-classifier performs a pre-processing step of attribute selection before the data gets presented to the base classifier (in the example here, this is  J48).
  Instances data = ...  // from somewhere
  AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
  CfsSubsetEval eval = new CfsSubsetEval();
  GreedyStepwise search = new GreedyStepwise();
  search.setSearchBackwards(true);
  J48 base = new J48();
  classifier.setClassifier(base);
  classifier.setEvaluator(eval);
  classifier.setSearch(search);
  // 10-fold cross-validation
  Evaluation evaluation = new Evaluation(data);
  evaluation.crossValidateModel(classifier, data, 10, new Random(1));
  System.out.println(evaluation.toSummaryString());

 Filter

The filter approach is straightforward: after setting up the filter, one just filters the data through the filter and obtains the reduced dataset.
  Instances data = ...  // from somewhere
  AttributeSelection filter = new AttributeSelection();  // package weka.filters.supervised.attribute!
  CfsSubsetEval eval = new CfsSubsetEval();
  GreedyStepwise search = new GreedyStepwise();
  search.setSearchBackwards(true);
  filter.setEvaluator(eval);
  filter.setSearch(search);
  filter.setInputFormat(data);
  // generate new data
  Instances newData = Filter.useFilter(data, filter);
  System.out.println(newData);

 Low-level

If neither the meta-classifier nor the filter approach is suitable for your purposes, you can use the attribute selection classes themselves.
  Instances data = ...  // from somewhere
  AttributeSelection attsel = new AttributeSelection();  // package weka.attributeSelection!
  CfsSubsetEval eval = new CfsSubsetEval();
  GreedyStepwise search = new GreedyStepwise();
  search.setSearchBackwards(true);
  attsel.setEvaluator(eval);
  attsel.setSearch(search);
  attsel.SelectAttributes(data);
  // obtain the attribute indices that were selected
  int[] indices = attsel.selectedAttributes();
  System.out.println(Utils.arrayToString(indices));

 Note on randomization

Most machine learning schemes, like classifiers and clusterers, are susceptible to the ordering of the data. Using a different seed for randomizing the data will most likely produce a different result. E.g., the Explorer, or a classifier/clusterer run from the commandline, uses only a seeded java.util.Random number generator, whereas weka.core.Instances.getRandomNumberGenerator(int) (which WekaDemo.java uses) also takes the data into account for seeding. Unless one runs 10-fold cross-validation 10 times and averages the results, one will most likely get different results.
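
A minimal sketch of randomizing a dataset before use, either with an explicitly seeded java.util.Random or with the data-dependent generator mentioned above:
 import weka.core.Instances;
 import java.util.Random;
 ...
 Instances data = ...                           // from somewhere
 // plain seeded randomization (as the Explorer/commandline effectively do)
 data.randomize(new Random(42));
 // or: use a random number generator that is seeded based on the data itself
 data.randomize(data.getRandomNumberGenerator(42));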
