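While working on Chinese text clustering and studying imdict-chinese-analyzer, the tokenizer from the Chinese Academy of Sciences, I could not get correct results with a stop word list I loaded myself. So I traced how Lucene loads its own stop word list. In the source file WordListLoader.java I found this code: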
    public static HashSet getWordSet(Reader reader) throws IOException {
        HashSet result = new HashSet();
        BufferedReader br = null;
        try {
            if (reader instanceof BufferedReader) {
                br = (BufferedReader) reader;
            } else {
                br = new BufferedReader(reader);
            }
            String word = null;
            while ((word = br.readLine()) != null) {
                result.add(word.trim());
            }
        }
        finally {
            if (br != null)
                br.close();
        }
        return result;
    }
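That made the problem clear. I added a print statement to this code so that every stop word loaded from the file was printed out, and this exposed the root cause: the encoding of the stop word file (I had saved it as "Unicode").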
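To make the failure mode concrete, here is a minimal sketch using only the JDK, not Lucene. It assumes that "Unicode" in the editor meant UTF-16LE, which is common but is my assumption, not something checked against the original file. The class `StopWordEncodingDemo` and the method `readWords` are hypothetical names; `readWords` only imitates the "one trimmed word per line" behaviour of `getWordSet`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

final class StopWordEncodingDemo {

    // Decodes the raw file bytes with the given charset, then collects one
    // trimmed word per line, similar in spirit to getWordSet.
    static Set<String> readWords(byte[] fileBytes, Charset charset) {
        Set<String> words = new HashSet<>();
        for (String line : new String(fileBytes, charset).split("\\R")) {
            words.add(line.trim());
        }
        return words;
    }

    public static void main(String[] args) {
        // Simulated stop word file with two words, U+7684 (de) and U+662F (shi),
        // saved as UTF-16LE (assumed meaning of "Unicode").
        byte[] utf16File = "\u7684\n\u662f\n".getBytes(StandardCharsets.UTF_16LE);

        // Reading the same bytes with the wrong and with the right charset.
        Set<String> wrong = readWords(utf16File, StandardCharsets.UTF_8);
        Set<String> right = readWords(utf16File, StandardCharsets.UTF_16LE);

        System.out.println("read as UTF-8, contains U+7684:    " + wrong.contains("\u7684"));
        System.out.println("read as UTF-16LE, contains U+7684: " + right.contains("\u7684"));
    }
}
```

When the bytes are decoded with the wrong charset, the loader still returns a set of "words", but none of them equals a real stop word, so nothing is ever filtered and no error is reported. Printing the loaded words is what makes the mismatch visible.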
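My findings are summarized below:
1. Lucene can read a stop word file saved as UTF-8.
2. The stop word file format is simple: one word per line.
3. Lucene supports stop words in 5 ways (see the 5 constructors of org.apache.lucene.analysis.StopAnalyzer): the default stop words built into StopAnalyzer, a String[], a Set, a File, or a Reader.
4. Reference code for stop word handling: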
    package com.xh.TextClustering;

    import java.io.File;
    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.store.LockObtainFailedException;

    public class StopAnalyzerTestChinese
    {
        // Sample text to analyze; tokens are separated by spaces.
        static String source="我 是 中国人。";

        public static void main(String args[])
        {
            Indexer();
        }

        private static void Indexer()
        {
            //Analyzer analyzer=new StopAnalyzer();
            try {
                // Load the stop words from a file (one word per line, saved as UTF-8).
                Analyzer analyzer=new StopAnalyzer(new File("chinese_stopword.dic"));
                // IndexWriter writer=new IndexWriter(IndexPath,analyzer,true,MaxFieldLength.UNLIMITED);
                // Document document=new Document();
                // Field field_content=new Field("content",source,Field.Store.YES,Field.Index.ANALYZED);
                // document.add(field_content);
                // ArrayList ItemList=new ArrayList();
                TokenStream stream=analyzer.tokenStream("content", new StringReader(source));
                // Print every token that survives stop word removal;
                // next() returns null once the stream is exhausted.
                while(true)
                {
                    Token item=stream.next();
                    if(null==item)break;
                    System.out.println("{"+item.termText()+"}");
                }
                // writer.optimize();
                // writer.close();
            } catch (CorruptIndexException e) {
                e.printStackTrace();
            } catch (LockObtainFailedException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
Note: a reference Chinese stop word list is included in the attachment.
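Findings 1 to 3 can be sketched without Lucene as follows. This is only an illustration under my own assumptions, not Lucene's implementation: `StopWordFilterSketch`, `readStopWords`, `loadStopWords` and `removeStopWords` are hypothetical names. The file is opened as UTF-8 explicitly instead of relying on a default encoding, and a leading byte order mark (U+FEFF) is stripped, because `trim()` does not remove it and some editors write one at the start of UTF-8 files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class StopWordFilterSketch {

    // Reads one stop word per line; skips blank lines and a leading BOM.
    static Set<String> readStopWords(Reader reader) {
        Set<String> words = new HashSet<>();
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            boolean firstLine = true;
            while ((line = br.readLine()) != null) {
                if (firstLine && line.startsWith("\uFEFF")) {
                    line = line.substring(1); // drop the byte order mark
                }
                firstLine = false;
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    // Opens the file with an explicit UTF-8 charset.
    static Set<String> loadStopWords(Path file) {
        try {
            return readStopWords(Files.newBufferedReader(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Keeps only the tokens that are not in the stop word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        // Temporary stop word file holding U+6211 (wo) and U+662F (shi), saved as UTF-8.
        Path file = Files.createTempFile("chinese_stopword", ".dic");
        Files.write(file, "\u6211\n\u662f\n".getBytes(StandardCharsets.UTF_8));

        Set<String> stopWords = loadStopWords(file);
        // The three tokens of the sample sentence used in the reference code.
        List<String> tokens = Arrays.asList("\u6211", "\u662f", "\u4e2d\u56fd\u4eba");
        System.out.println(removeStopWords(tokens, stopWords));

        Files.delete(file);
    }
}
```

Since finding 3 says StopAnalyzer also accepts a Reader, the same idea should carry over to Lucene: open the file through a Reader with an explicit charset and pass that Reader to the constructor, rather than passing the File and depending on how it is decoded.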