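The People's Daily word segmentation corpus (人民日报分词语料) is used as the sample here. It has 297,959 lines and a little over 4.61 million characters in total, and many download links for it can be found online.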
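The line and character totals for a corpus like this can be checked with a few lines of Java. This is only a minimal sketch, not the tooling used for the figures above: the class name `CorpusStats` and the two-line in-memory sample are assumptions for illustration, and line terminators are not counted as characters.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class CorpusStats {
    /** Returns {lineCount, charCount}; line terminators are not counted. */
    static long[] count(BufferedReader reader) throws IOException {
        long lines = 0;
        long chars = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            lines++;
            // Count Unicode code points so that characters outside the BMP count once.
            chars += line.codePointCount(0, line.length());
        }
        return new long[] {lines, chars};
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical in-memory sample; for the real corpus, open the file with
        // Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8) instead.
        String sample = "今天天气好\n我爱北京\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(sample))) {
            long[] stats = count(reader);
            System.out.println("lines=" + stats[0] + ", chars=" + stats[1]);
        }
    }
}
```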
1. Segmenting each line separately (running time: 8529 ms):
import java.io.BufferedReader;
import java.io.IOException;
import org.ansj.splitWord.analysis.ToAnalysis;
import love.cq.util.IOUtil;

public class GetResult {
    public static void main(String[] args) throws IOException {
        // Warm-up call, so that one-time initialization is not included in the timing.
        ToAnalysis.parse("123");
        try (BufferedReader reader = IOUtil.getReader("files/人民日报分词语料-分词前.txt", "UTF-8")) {
            String line;
            long before = System.currentTimeMillis();
            // Segment the corpus one line at a time.
            while ((line = reader.readLine()) != null) {
                ToAnalysis.parse(line);
            }
            long after = System.currentTimeMillis();
            System.out.println("Elapsed time (ms): " + (after - before));
        }
    }
}
2. Segmenting the whole text in one call (running time: 30822 ms):
import java.io.BufferedReader;
import java.io.IOException;
import org.ansj.splitWord.analysis.ToAnalysis;
import org.apache.commons.io.IOUtils;
import love.cq.util.IOUtil;

public class GetResult {
    public static void main(String[] args) throws IOException {
        String text;
        // Read the whole corpus into one string before the timer starts.
        try (BufferedReader reader = IOUtil.getReader("files/人民日报分词语料-分词前.txt", "UTF-8")) {
            text = IOUtils.toString(reader);
        }
        // Note: there is no warm-up call in this version, so any one-time
        // initialization on the first parse is included in the measured time.
        long before = System.currentTimeMillis();
        ToAnalysis.parse(text);
        long after = System.currentTimeMillis();
        System.out.println("Elapsed time (ms): " + (after - before));
    }
}
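Overall, segmenting line by line is faster than segmenting the whole text in one call: 8529 ms versus 30822 ms, roughly 3.6 times faster on this corpus. The above is only my personal opinion.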
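Each of the two figures comes from a single run, so a small harness that warms up first and takes the median of several runs can make the comparison steadier. The following is a rough sketch under stated assumptions, not the method used for the numbers above: `SegmentBench`, `medianMillis`, the two sample lines, and the placeholder segmenter are all hypothetical. To measure the real segmenter, replace the placeholder with something like `s -> ToAnalysis.parse(s)`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class SegmentBench {
    /**
     * Runs the task once as a warm-up, then returns the median wall-clock
     * time in milliseconds over `runs` timed runs (runs must be at least 1).
     */
    static long medianMillis(Runnable task, int runs) {
        task.run(); // warm-up, not timed
        long[] times = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(times);
        return times[runs / 2];
    }

    public static void main(String[] args) {
        // Hypothetical sample; in practice, load the corpus lines from the file.
        List<String> lines = Arrays.asList("今天天气好", "我爱北京");
        String whole = String.join("\n", lines);

        // Placeholder segmenter (assumption): splits into single characters.
        Consumer<String> segment = s -> s.split("");

        long perLine = medianMillis(() -> lines.forEach(segment), 5);
        long oneCall = medianMillis(() -> segment.accept(whole), 5);
        System.out.println("per line (ms): " + perLine);
        System.out.println("one call (ms): " + oneCall);
    }
}
```

Both strategies go through the same warm-up and the same number of timed runs, so the difference that remains is more likely to reflect the segmentation strategy itself.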