Code executed:
String[] params = new String[]{
"-p","hdfs://192.168.9.72:9000/zhoujianfeng/mahout_test/decisiontree/input/KDDTrain.arff",
"-f","hdfs://192.168.9.72:9000/zhoujianfeng/mahout_test/decisiontree/desc/KDDTrain.info",
"-d","N C C C N N C N N N N C N N N N N N N N C C N N N N N N N N N N N N N N N N N N N L",
};
// Delete any existing descriptor file (the -f path) so Describe can write a fresh one.
HadoopUtil.delete(new Configuration(), new Path(params[Arrays.asList(params).indexOf("-f")+1]));
Describe.main(params);
Error 1:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string: 2. Must be: 42
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
at org.apache.mahout.classifier.df.data.DataLoader.parseString(DataLoader.java:69)
at org.apache.mahout.classifier.df.data.DataLoader.generateDataset(DataLoader.java:188)
at org.apache.mahout.classifier.df.tools.Describe.generateDataset(Describe.java:127)
at org.apache.mahout.classifier.df.tools.Describe.runTool(Describe.java:115)
at org.apache.mahout.classifier.df.tools.Describe.main(Describe.java:100)
at myTesting.decisiontree.Step1Describe.run(Step1Describe.java:24)
at myTesting.decisiontree.Step1Describe.main(Step1Describe.java:33)
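Cause: the ARFF file begins with header and comment lines (declarations starting with "@"). These are not data rows and need no conversion, but the program tries to parse them as records anyway, so parsing fails.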
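The count of 2 in the message can be reproduced with a small sketch. It assumes the loader tokenizes each line on commas and spaces; the class name TokenCountDemo and both sample lines are hypothetical and are not taken from the real KDDTrain.arff file.

```java
import java.util.regex.Pattern;

class TokenCountDemo {
    // Assumption: the loader splits each line on a comma or a space.
    private static final Pattern SEPARATORS = Pattern.compile("[, ]");

    // Number of tokens the loader would see on one line.
    static int countTokens(String line) {
        return SEPARATORS.split(line).length;
    }

    public static void main(String[] args) {
        String header = "@relation KDDTrain"; // hypothetical ARFF header line
        String data = "0,tcp,http,SF";        // hypothetical, shortened data row
        System.out.println(countTokens(header)); // 2 tokens, not the expected attribute count
        System.out.println(countTokens(data));   // 4 tokens
    }
}
```

A header line therefore never matches the 42 attributes of the descriptor, which is why the check fails on the first lines of the file.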
Error 2:
Exception in thread "main" java.lang.IllegalArgumentException: Wrong number of attributes in the string: 31. Must be: 42
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
at org.apache.mahout.classifier.df.data.DataLoader.parseString(DataLoader.java:68)
at org.apache.mahout.classifier.df.data.DataLoader.generateDataset(DataLoader.java:196)
at org.apache.mahout.classifier.df.tools.Describe.generateDataset(Describe.java:126)
at org.apache.mahout.classifier.df.tools.Describe.runTool(Describe.java:114)
at org.apache.mahout.classifier.df.tools.Describe.main(Describe.java:100)
at myTesting.decisiontree.Step1Describe.run(Step1Describe.java:24)
at myTesting.decisiontree.Step1Describe.main(Step1Describe.java:33)
Cause: missing data. Some rows are incomplete and do not contain all 42 fields.
Solution:
Modify the generateDataset method of the DataLoader class as follows:
public static Dataset generateDataset(CharSequence descriptor,
boolean regression,
FileSystem fs,
Path path) throws DescriptorException, IOException {
Attribute[] attrs = DescriptorUtils.parseDescriptor(descriptor);
FSDataInputStream input = fs.open(path);
Scanner scanner = new Scanner(input, "UTF-8");
// used to convert CATEGORICAL attribute to Integer
@SuppressWarnings("unchecked")
Set<String>[] valsets = new Set[attrs.length];
int size = 0;
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
if (!line.isEmpty()) {
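/*
 * ARFF files start with header lines (beginning with "@") that are not
 * data and must not be parsed. Rows that do not have exactly one field
 * per attribute (42 for this descriptor) are incomplete and are skipped too.
 */
if (line.startsWith("@") || line.split(",").length != attrs.length) {
continue;
}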
if (parseString(attrs, valsets, line, regression)) {
size++;
}
}
}
scanner.close();
@SuppressWarnings("unchecked")
List<String>[] values = new List[attrs.length];
for (int i = 0; i < valsets.length; i++) {
if (valsets[i] != null) {
values[i] = Lists.newArrayList(valsets[i]);
}
}
return new Dataset(attrs, values, size, regression);
}
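The filtering rule can also be tried outside Mahout before patching DataLoader. The following is a minimal standalone sketch, not part of Mahout: the class ArffLineFilter, its methods, and the sample lines are hypothetical. It additionally skips lines starting with "%", the ARFF comment marker, and reports how many lines were dropped, since silently skipping incomplete rows can hide how much data is lost.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ArffLineFilter {
    // True only for lines that look like complete data rows.
    static boolean isDataLine(String line, int expectedFields) {
        String trimmed = line.trim();
        // Skip blank lines, ARFF declarations ("@...") and ARFF comments ("%...").
        if (trimmed.isEmpty() || trimmed.startsWith("@") || trimmed.startsWith("%")) {
            return false;
        }
        // Skip rows that do not have one field per attribute.
        return trimmed.split(",").length == expectedFields;
    }

    // Returns the lines that pass isDataLine, in their original order.
    static List<String> keepDataLines(List<String> lines, int expectedFields) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (isDataLine(line, expectedFields)) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical input with 3 attributes instead of 42, to keep the demo short.
        List<String> lines = Arrays.asList(
            "@relation demo", "% a comment", "1,a,x", "2,b", "", "3,c,y");
        List<String> kept = keepDataLines(lines, 3);
        System.out.println("kept=" + kept.size()
            + " skipped=" + (lines.size() - kept.size()));
    }
}
```

Running the same check over the real input first gives a count of dropped rows, which helps decide whether skipping them is acceptable or whether the source data should be repaired instead.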