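LingPipe is a machine learning toolkit implemented in Java. It is open source and free for individual users, and it implements algorithms for clustering, classification, Chinese word segmentation, part-of-speech tagging, spell checking, sentiment analysis, and more. We will publish detailed usage guides step by step in later posts.
This post covers news classification. LingPipe ships with a news classification sample, but that sample targets English text. Here we use the Sogou news corpus instead.
LingPipe documentation: http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
After downloading, import LingPipe directly into MyEclipse. Most of the errors that show up are caused by incorrect package names; locate each one and correct it.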
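Directory structure:
ClassifyNews is the class we will use.
From the Sogou corpus we selected three categories to train the classifier: finance (财经), internet (互联网), and recruitment (招聘).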
import java.io.File;
import java.io.IOException;

import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.classify.JointClassification;
import com.aliasi.classify.JointClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Files;

public class ClassifyNews {

    // Training data: one subdirectory per category
    private static final File TRAINING_DIR
        = new File("./demos/data/搜狗语料库/SogouC.reduced/Reduced");
    // Documents to classify (plain files, no subdirectories)
    private static final File TESTING_DIR
        = new File("./demos/data/搜狗语料库/SogouC.reduced/Test");
    // Category names; each must match a subdirectory of TRAINING_DIR
    private static final String[] CATEGORIES
        = { "财经",     // finance
            "互联网",   // internet
            "招聘" };   // recruitment
    // Maximum character n-gram length used by the language models
    private static final int NGRAM_SIZE = 6;

    public static void main(String[] args)
        throws ClassNotFoundException, IOException {

        // Create the classifier: one character n-gram language model per category
        DynamicLMClassifier<NGramProcessLM> classifier
            = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);

        for (int i = 0; i < CATEGORIES.length; ++i) {
            // Training directory for this category
            File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
            if (!classDir.isDirectory()) { // the directory must exist
                String msg = "Could not find training directory="
                    + classDir
                    + "\nHave you unpacked the Sogou corpus?";
                System.out.println(msg); // in case exception gets lost in shell
                throw new IllegalArgumentException(msg);
            }
            String[] trainingFiles = classDir.list();
            for (int j = 0; j < trainingFiles.length; ++j) {
                File file = new File(classDir, trainingFiles[j]);
                // ISO-8859-1 maps every byte to one char, so a multi-byte
                // Chinese encoding is modeled byte by byte
                String text = Files.readFromFile(file, "ISO-8859-1");
                System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
                Classification classification
                    = new Classification(CATEGORIES[i]);
                Classified<CharSequence> classified
                    = new Classified<CharSequence>(text, classification);
                classifier.handle(classified); // train on this labeled document
            }
        }

        // Compile the trained model into a classifier
        System.out.println("Compiling");
        @SuppressWarnings("unchecked") // we created object so know it's safe
        JointClassifier<CharSequence> compiledClassifier
            = (JointClassifier<CharSequence>)
            AbstractExternalizable.compile(classifier);
        getBestCategory(compiledClassifier);
    }

    // Classifies every file in TESTING_DIR and prints the best category
    public static void getBestCategory(JointClassifier<CharSequence> compiledClassifier)
        throws IOException {
        String[] testingFiles = TESTING_DIR.list();
        if (testingFiles == null) { // list() returns null if the directory is missing
            throw new IllegalArgumentException(
                "Could not find testing directory=" + TESTING_DIR);
        }
        for (int j = 0; j < testingFiles.length; ++j) {
            String text
                = Files.readFromFile(new File(TESTING_DIR, testingFiles[j]), "ISO-8859-1");
            System.out.print("Testing on " + testingFiles[j] + " ");
            JointClassification jc = compiledClassifier.classify(text);
            System.out.println("Got best category of: " + jc.bestCategory());
            System.out.println(jc.toString()); // full ranking with scores
            System.out.println("---------------");
        }
    }
}
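One detail of ClassifyNews worth noting: it reads every file as ISO-8859-1, which turns each byte into exactly one char. If the corpus files use a multi-byte Chinese encoding such as GBK (two bytes per Chinese character), a 6-gram then covers roughly three Chinese characters rather than six. The sketch below only illustrates that effect. The class and method names are made up for this example, and it uses UTF-8 (three bytes per character for this sample text) purely because that charset is always available in Java; it says nothing about the corpus's real encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of LingPipe.
class EncodingSketch {

    // Decodes bytes as ISO-8859-1: every byte becomes exactly one char.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Decodes bytes with the encoding they were written in
    // (UTF-8 here, as an assumption for the example).
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u8d22\u7ecf" is the category name 财经, written with escapes so the
        // source file compiles regardless of its own encoding.
        byte[] bytes = "\u8d22\u7ecf".getBytes(StandardCharsets.UTF_8);
        System.out.println("chars when read as ISO-8859-1: " + asLatin1(bytes).length());
        System.out.println("chars when read as UTF-8:      " + asUtf8(bytes).length());
    }
}
```

Classification still works on bytes, as long as training and testing files are read the same way; the encoding mainly changes how much text one n-gram spans.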
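To adapt ClassifyNews to your own data, the main things to change are the training and testing corpus paths and the category names in CATEGORIES.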
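Since each category corresponds to one subdirectory of the training directory, the category list could also be read from disk instead of being hardcoded. The following is a minimal sketch using only the Java standard library; CategoryDirs and listCategories are hypothetical names introduced here, not part of LingPipe or of the original sample.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of LingPipe.
class CategoryDirs {

    // Returns the names of the subdirectories of trainingDir, sorted,
    // treating each subdirectory as one category. Plain files are ignored.
    static String[] listCategories(File trainingDir) {
        File[] dirs = trainingDir.listFiles(File::isDirectory);
        if (dirs == null) { // listFiles returns null if the path is not a readable directory
            throw new IllegalArgumentException("Not a readable directory: " + trainingDir);
        }
        String[] names = new String[dirs.length];
        for (int i = 0; i < dirs.length; i++) {
            names[i] = dirs[i].getName();
        }
        Arrays.sort(names); // stable order across runs and platforms
        return names;
    }

    public static void main(String[] args) {
        // Pass the training directory as the first argument; defaults to "."
        File trainingDir = new File(args.length > 0 ? args[0] : ".");
        System.out.println(Arrays.toString(listCategories(trainingDir)));
    }
}
```

The returned array could be passed wherever CATEGORIES is used, at the cost of training on every subdirectory that happens to be present.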
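Run ClassifyNews and check the output, shown below.
That gives us the result: the trained classifier can determine which category a new text belongs to.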
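To give a feel for what a character n-gram language-model classifier does, here is a deliberately simplified sketch in plain Java. It is not LingPipe's implementation: it uses a single fixed n-gram length, add-one smoothing over an assumed alphabet of 65536 symbols, and no category priors, whereas LingPipe's smoothing and scoring differ. All names (CharNGramSketch, train, logScore, bestCategory, VOCAB) and the toy English training strings are invented for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of a character n-gram classifier; not LingPipe code.
class CharNGramSketch {

    // Assumed alphabet size for add-one smoothing (all Java char values).
    private static final double VOCAB = 65536.0;

    private final int n;
    // category -> counts of n-grams (length n) and their contexts (length n-1)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    CharNGramSketch(int n) {
        this.n = n;
    }

    // Adds the n-gram and context counts of one training text to a category.
    void train(String category, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(category, k -> new HashMap<>());
        for (int i = 0; i + n <= text.length(); i++) {
            c.merge(text.substring(i, i + n), 1, Integer::sum);     // full n-gram
            c.merge(text.substring(i, i + n - 1), 1, Integer::sum); // its context
        }
    }

    // Sum of log P(last char | previous n-1 chars) over the text,
    // estimated from the category's counts with add-one smoothing.
    double logScore(String category, String text) {
        Map<String, Integer> c = counts.getOrDefault(category, new HashMap<>());
        double sum = 0.0;
        for (int i = 0; i + n <= text.length(); i++) {
            int gram = c.getOrDefault(text.substring(i, i + n), 0);
            int context = c.getOrDefault(text.substring(i, i + n - 1), 0);
            sum += Math.log((gram + 1.0) / (context + VOCAB));
        }
        return sum;
    }

    // Returns the trained category whose model gives the text the highest score.
    String bestCategory(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            double score = logScore(category, text);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        CharNGramSketch model = new CharNGramSketch(2);
        model.train("finance", "stocks bonds market shares");
        model.train("internet", "website search browser online");
        System.out.println(model.bestCategory("stock market"));
    }
}
```

The idea carries over to the real classifier: each category gets its own model of which character sequences are likely, and a new text is assigned to the category whose model finds it least surprising.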