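No preamble, straight to the useful part.
Weka's official documentation: https://www.cs.waikato.ac.nz/ml/weka/documentation.html
Its API section covers "creating datasets in memory" (the official material is the more authoritative reference).
You can also take a quick look at a simple demo I wrote.
Start with Attribute. This is code I wrote for 2-mer feature extraction; treat it only as a reference for the general form: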
import java.util.ArrayList;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;

public ArrayList<Attribute> attributes() {
    ArrayList<Attribute> attributes = new ArrayList<>();
    // One numeric attribute per 2-mer (a name-only constructor creates a numeric attribute).
    attributes.add(new Attribute("aa"));
    attributes.add(new Attribute("ag"));
    attributes.add(new Attribute("ac"));
    attributes.add(new Attribute("at"));
    attributes.add(new Attribute("ga"));
    attributes.add(new Attribute("gg"));
    attributes.add(new Attribute("gc"));
    attributes.add(new Attribute("gt"));
    attributes.add(new Attribute("ca"));
    attributes.add(new Attribute("cg"));
    attributes.add(new Attribute("cc"));
    attributes.add(new Attribute("ct"));
    attributes.add(new Attribute("ta"));
    attributes.add(new Attribute("tg"));
    attributes.add(new Attribute("tc"));
    attributes.add(new Attribute("tt"));
    // Nominal class attribute with the two labels "1" and "-1".
    ArrayList<String> labels = new ArrayList<String>();
    labels.add("1");
    labels.add("-1");
    attributes.add(new Attribute("class", labels));
    return attributes;
}
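The sixteen add calls follow a fixed pattern, so the 2-mer names could also be generated in a loop. The following is only a minimal sketch, assuming the same base order as the listing (a, g, c, t); the class name KmerNames and the method twoMers are hypothetical and are not part of Weka or of the original demo.

```java
import java.util.ArrayList;
import java.util.List;

class KmerNames {
    // Base order assumed from the listing above: a, g, c, t.
    static final char[] BASES = {'a', 'g', 'c', 't'};

    // Returns the 16 two-letter combinations in the order aa, ag, ac, at, ga, ...
    static List<String> twoMers() {
        List<String> names = new ArrayList<>();
        for (char first : BASES) {
            for (char second : BASES) {
                names.add("" + first + second);
            }
        }
        return names;
    }
}
```

Each returned name could then be wrapped as new Attribute(name) before the class attribute is appended, which keeps the attribute order identical to the hand-written version.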
Next comes the creation of the Instances and Instance objects.
Note that a[1] in the code below is the label (substitute your own label column):
public Instances atgc1(Instances data) {
    ArrayList<Attribute> l = attributes();
    // Empty dataset named "DNA"; the last attribute ("class") is the class attribute.
    Instances instances = new Instances("DNA", l, 0);
    instances.setClassIndex(instances.numAttributes() - 1);
    int sum = data.numInstances();
    for (int i = 0; i < sum; i++) {
        // Each input row is read as comma-separated text: a[0] is the sequence, a[1] the label.
        String line = String.valueOf(data.instance(i));
        if (line.trim().length() == 0) {
            continue; // skip blank rows instead of adding an all-zero instance
        }
        String[] a = line.split(",");
        double[] num = new double[instances.numAttributes()];
        // stringCount is my own helper (not shown here) that counts a 2-mer in the sequence.
        num[0] = stringCount(a[0], "aa");
        num[1] = stringCount(a[0], "ag");
        num[2] = stringCount(a[0], "ac");
        num[3] = stringCount(a[0], "at");
        num[4] = stringCount(a[0], "ga");
        num[5] = stringCount(a[0], "gg");
        num[6] = stringCount(a[0], "gc");
        num[7] = stringCount(a[0], "gt");
        num[8] = stringCount(a[0], "ca");
        num[9] = stringCount(a[0], "cg");
        num[10] = stringCount(a[0], "cc");
        num[11] = stringCount(a[0], "ct");
        num[12] = stringCount(a[0], "ta");
        num[13] = stringCount(a[0], "tg");
        num[14] = stringCount(a[0], "tc");
        num[15] = stringCount(a[0], "tt");
        // A nominal value is stored as its index in the label list (-1 if a[1] is not a declared label).
        num[16] = instances.attribute(16).indexOfValue(a[1]);
        // Weight 1.0; the instance is created once its values are filled in.
        instances.add(new DenseInstance(1.0, num));
    }
    return instances;
}
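The listing calls stringCount, which the demo does not show. One possible version is sketched below. It assumes that overlapping occurrences should be counted (so "aaa" contains "aa" twice), which is the usual convention for k-mer counts but is not confirmed by the original code. It also assumes the sequence is already lowercase, matching the attribute names. The class name KmerCounter is hypothetical.

```java
class KmerCounter {
    // Counts how often `sub` occurs in `text`, allowing overlapping matches.
    static int stringCount(String text, String sub) {
        if (sub.isEmpty()) {
            return 0; // avoid an endless loop on an empty pattern
        }
        int count = 0;
        int from = 0;
        while ((from = text.indexOf(sub, from)) != -1) {
            count++;
            from++; // step one character forward so overlaps are counted
        }
        return count;
    }
}
```

If non-overlapping counts are wanted instead, advancing by sub.length() rather than by one would be the corresponding change.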
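Before writing this I also consulted write-ups by other users online, though I no longer have the links. The four attribute types, for reference:
- numeric: numeric type, for continuous variables
Attribute numeric = new Attribute("aa");
- date: date type, for date variables
Attribute date = new Attribute("attribute_name", "yyyy-MM-dd");
- nominal: nominal type, with predefined labels
ArrayList<String> labels = new ArrayList<String>();
labels.add("label_a");
labels.add("label_b");
Attribute nominal = new Attribute("attribute_name", labels);
- string: string type, for text data
Attribute string = new Attribute("attribute_name", (ArrayList<String>) null);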