Lucene 4.3.1 SpellChecker Source Code Analysis

_荣耀之路_

Category: Lucene. Tags: Lucene, SpellChecker, source code analysis

Article link: https://blog.csdn.net/asty9000/article/details/81299227

Overview

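SpellChecker ships in lucene-suggest, one of Lucene's extension modules, and can be used to correct the search text a user enters. It makes a basic spell-checking feature easy to implement, but the quality of the suggestions still has to be tuned and optimized by the developer.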

Building a SpellChecker

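Constructing a SpellChecker takes three parameters:

org.apache.lucene.store.Directory: the directory that holds the spell-check index. If it does not contain an index yet, an empty one is created.
org.apache.lucene.search.spell.StringDistance: the string-distance interface, used to compute the distance between two strings. Defaults to org.apache.lucene.search.spell.LevensteinDistance.
java.util.Comparator<org.apache.lucene.search.spell.SuggestWord>: the comparator used to sort the suggested words. Defaults to org.apache.lucene.search.spell.SuggestWordScoreComparator.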

  public SpellChecker(Directory spellIndex, StringDistance sd) throws IOException {
    this(spellIndex, sd, SuggestWordQueue.DEFAULT_COMPARATOR);
  }
  
  public SpellChecker(Directory spellIndex) throws IOException {
    this(spellIndex, new LevensteinDistance());
  }

  public SpellChecker(Directory spellIndex, StringDistance sd, Comparator<SuggestWord> comparator) throws IOException {
    setSpellIndex(spellIndex);
    setStringDistance(sd);
    this.comparator = comparator;
  }
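
As a rough illustration of what a StringDistance implementation computes, the sketch below scores two strings with a normalized edit distance in plain Java. It is not the source of LevensteinDistance: the class name EditSimilaritySketch and the normalization 1 - distance / max(length) are assumptions, chosen so that identical strings score 1.0 and strings with nothing in common score 0.0.

```java
// Illustrative sketch only; EditSimilaritySketch is a hypothetical name, not a Lucene class.
final class EditSimilaritySketch {

  // Classic Levenshtein distance computed with two rolling rows.
  static int distance(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j; // turning "" into the first j chars of b takes j insertions
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i; // turning the first i chars of a into "" takes i deletions
      for (int j = 1; j <= b.length(); j++) {
        int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
      }
      int[] tmp = prev; // the row just filled becomes the previous row
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Assumed normalization: 1.0 means identical, 0.0 means nothing in common.
  static float similarity(String a, String b) {
    int maxLen = Math.max(a.length(), b.length());
    if (maxLen == 0) {
      return 1.0f; // two empty strings are treated as identical
    }
    return 1.0f - (float) distance(a, b) / maxLen;
  }

  public static void main(String[] args) {
    System.out.println(similarity("lucene", "lucine"));
  }
}
```

A score of this kind is what the comparator passed as the third constructor argument would sort on, with higher values meaning closer matches.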

Most of the construction work happens in setSpellIndex(spellIndex).

  public void setSpellIndex(Directory spellIndexDir) throws IOException {
    // Acquire the index-modification lock
    synchronized (modifyCurrentIndexLock) {
      // Fail fast if this SpellChecker has already been closed
      ensureOpen();
      // If the directory holds no index yet, create an empty one by opening and closing an IndexWriter
      if (!DirectoryReader.indexExists(spellIndexDir)) {
          IndexWriter writer = new IndexWriter(spellIndexDir,
            new IndexWriterConfig(Version.LUCENE_CURRENT,
                null));
          writer.close();
      }
      // Swap in a searcher on the new directory
      swapSearcher(spellIndexDir);
    }
  }

  private void swapSearcher(final Directory dir) throws IOException {
    // Create the new IndexSearcher before taking the lock
    final IndexSearcher indexSearcher = createSearcher(dir);
    // Acquire the searcher lock
    synchronized (searcherLock) {
      if(closed){
        indexSearcher.getIndexReader().close();
        throw new AlreadyClosedException("Spellchecker has been closed");
      }
      // If there is a previous searcher, close its reader
      if (searcher != null) {
        searcher.getIndexReader().close();
      }
      // Complete the swap
      searcher = indexSearcher;
      this.spellIndex = dir;
    }
  }
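
The pattern used by swapSearcher can be shown without Lucene: build the replacement outside the lock, then, while holding the lock, either reject it if the owner is already closed or install it and close the old one. In the sketch below, SwappableHolder and Resource are hypothetical names, IllegalStateException stands in for AlreadyClosedException, and reference counting on the underlying reader is deliberately left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch; SwappableHolder and Resource are hypothetical names.
final class SwappableHolder {

  // Stand-in for a searcher/reader that must be closed when it is replaced.
  static final class Resource implements AutoCloseable {
    final String name;
    private final AtomicBoolean closed = new AtomicBoolean(false);

    Resource(String name) {
      this.name = name;
    }

    boolean isClosed() {
      return closed.get();
    }

    @Override
    public void close() {
      closed.set(true);
    }
  }

  private final Object lock = new Object();
  private Resource current; // guarded by lock
  private boolean closed;   // guarded by lock

  // Installs a replacement built by the caller and closes the previous resource.
  void swap(Resource replacement) {
    synchronized (lock) {
      if (closed) {
        replacement.close(); // do not leak the resource we were handed
        throw new IllegalStateException("holder has been closed");
      }
      if (current != null) {
        current.close();
      }
      current = replacement;
    }
  }

  Resource current() {
    synchronized (lock) {
      return current;
    }
  }

  void close() {
    synchronized (lock) {
      closed = true;
      if (current != null) {
        current.close();
        current = null;
      }
    }
  }
}
```

Creating the replacement before entering the synchronized block keeps the potentially slow open outside the critical section, so readers are blocked only for the pointer swap itself.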

Data Sources

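SpellChecker supports multiple data sources through the org.apache.lucene.search.spell.Dictionary interface. The interface abstracts a data source as a dictionary, and its getWordsIterator method returns an iterator over the words in that dictionary.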

public interface Dictionary {

  /**
   * Return all words present in the dictionary
   * @return Iterator
   */
  BytesRefIterator getWordsIterator() throws IOException;
}
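
To show the shape of this abstraction without depending on Lucene, here is a minimal sketch of a word source that treats each non-empty line of a text input as one word. WordSource and LineWordSource are hypothetical names, and the sketch returns a plain Iterator<String> instead of a BytesRefIterator; the one-word-per-line format is an assumption made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch; WordSource mirrors the idea of Dictionary but is a hypothetical type.
interface WordSource {
  Iterator<String> words();
}

// Treats every non-empty line of the input as one dictionary word.
final class LineWordSource implements WordSource {
  private final Reader in;

  LineWordSource(Reader in) {
    this.in = in;
  }

  @Override
  public Iterator<String> words() {
    // Reads eagerly for simplicity; a real source would likely stream.
    List<String> result = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(in)) {
      String line;
      while ((line = reader.readLine()) != null) {
        String word = line.trim();
        if (!word.isEmpty()) {
          result.add(word);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return result.iterator();
  }
}
```

Because the consumer only sees an iterator, a file, an existing index, or any other collection of words can be plugged in without changing the indexing code.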

Lucene ships with four built-in data sources.

Loading Data

SpellChecker loads data from a data source through the indexDictionary method.

  public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge) throws IOException {
    // Acquire the index-modification lock
    synchronized (modifyCurrentIndexLock) {
      // Fail fast if this SpellChecker has already been closed
      ensureOpen();
      final Directory dir = this.spellIndex;
      final IndexWriter writer = new IndexWriter(dir, config);
      IndexSearcher indexSearcher = obtainSearcher();
      final List<TermsEnum> termsEnums = new ArrayList<TermsEnum>();
      // Look up the terms already present in the index
      final IndexReader reader = searcher.getIndexReader();
      // If the index already holds documents, collect a TermsEnum per segment
      if (reader.maxDoc() > 0) {
        for (final AtomicReaderContext ctx : reader.leaves()) {
          Terms terms = ctx.reader().terms(F_WORD);
          if (terms != null)
            termsEnums.add(terms.iterator(null));
        }
      }
      
      boolean isEmpty = termsEnums.isEmpty();

      try {
        // Get the word iterator from the dictionary
        BytesRefIterator iter = dict.getWordsIterator();
        BytesRef currentTerm;
        // Iterate over every word
        terms: while ((currentTerm = iter.next()) != null) {
  
          String word = currentTerm.utf8ToString();
          int len = word.length();
          // Ignore words shorter than 3 characters
          if (len < 3) {
            continue; // too short we bail but "too long" is fine...
          }
          // Exact lookup: skip the word if it is already in the index
          if (!isEmpty) {
            for (TermsEnum te : termsEnums) {
              if (te.seekExact(currentTerm, false)) {
                continue terms;
              }
            }
          }
  
          // Create a Document for the word and add it to the index
          Document doc = createDocument(word, getMin(len), getMax(len));
          writer.addDocument(doc);
        }
      } finally {
        // Release the searcher once iteration has finished
        releaseSearcher(indexSearcher);
      }
      if (fullMerge) {
        // Merge the index down to a single segment
        writer.forceMerge(1);
      }
      // Close the writer
      writer.close();
      // Swap in a searcher that sees the new data
      swapSearcher(dir);
    }
  }
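
The filtering done inside the loop above can be reduced to a small Lucene-free sketch: drop words shorter than three characters, drop words already present in the index, and keep the rest. IndexFilterSketch and selectWordsToIndex are hypothetical names, and a Set<String> stands in for the TermsEnum lookups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch; IndexFilterSketch is a hypothetical name, not part of Lucene.
final class IndexFilterSketch {

  // Returns the words that would be handed to the writer.
  // alreadyIndexed is a snapshot taken before indexing starts (stand-in for the TermsEnum lookups).
  static List<String> selectWordsToIndex(Iterable<String> dictionary, Set<String> alreadyIndexed) {
    List<String> toIndex = new ArrayList<>();
    for (String word : dictionary) {
      if (word.length() < 3) {
        continue; // too short to be indexed
      }
      if (alreadyIndexed.contains(word)) {
        continue; // already in the index, skip it
      }
      toIndex.add(word);
    }
    return toIndex;
  }

  public static void main(String[] args) {
    Set<String> existing = new HashSet<>(Arrays.asList("index"));
    System.out.println(selectWordsToIndex(Arrays.asList("ab", "lucene", "index", "lucene"), existing));
  }
}
```

In this sketch a word that appears twice in the dictionary is selected twice, because the snapshot is not updated while iterating. The method above appears to behave the same way, since its TermsEnum instances come from a reader obtained before any document is added, so it may be worth deduplicating the dictionary beforehand.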

Checking and Suggestions

SpellChecker produces suggestions through the suggestSimilar method. There are three suggestion modes:

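public enum SuggestMode {
  // Generate suggestions only for words that are not in the index; this is the default mode
  SUGGEST_WHEN_NOT_IN_INDEX,
  // Return only words that are as frequent as, or more frequent than, the searched word
  SUGGEST_MORE_POPULAR,
  // Always attempt to offer suggestions, although other parameters may still affect the result
  SUGGEST_ALWAYS
}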

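suggestSimilar takes six parameters:

word: the word to spell-check.
numSug: the number of suggestions to return.
ir: an IndexReader over the index; may be null.
field: a field in the index; if not null, suggestions are restricted to words that occur in that field.
suggestMode: the suggestion mode. If ir or field is null, it is reset to SuggestMode.SUGGEST_ALWAYS. Defaults to SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX.
accuracy: the minimum score (accuracy) a suggestion must reach; results scoring below it are filtered out. Defaults to 0.5.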
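
Two of the rules in this parameter list are easy to get wrong, so here is a small Lucene-free sketch of them: the fallback to SUGGEST_ALWAYS when either the reader or the field is missing, and the accuracy cut-off. SuggestParamsSketch, its nested Mode enum, and both method names are hypothetical; a boolean stands in for "ir is not null", and the scores are made-up inputs rather than values produced by Lucene.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; all names here are hypothetical, not Lucene API.
final class SuggestParamsSketch {

  enum Mode { SUGGEST_WHEN_NOT_IN_INDEX, SUGGEST_MORE_POPULAR, SUGGEST_ALWAYS }

  // Without both a reader and a field there is nothing to compare against,
  // so the requested mode is replaced by SUGGEST_ALWAYS.
  static Mode effectiveMode(boolean hasReader, String field, Mode requested) {
    if (!hasReader || field == null) {
      return Mode.SUGGEST_ALWAYS;
    }
    return requested;
  }

  // Keeps only candidates whose score is not below the accuracy threshold.
  static List<String> filterByAccuracy(Map<String, Float> scored, float accuracy) {
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, Float> entry : scored.entrySet()) {
      if (entry.getValue() >= accuracy) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    Map<String, Float> scored = new LinkedHashMap<>();
    scored.put("lucene", 0.83f); // made-up scores for illustration
    scored.put("lunch", 0.33f);
    System.out.println(filterByAccuracy(scored, 0.5f));
    System.out.println(effectiveMode(false, "title", Mode.SUGGEST_MORE_POPULAR));
  }
}
```

Raising the accuracy value trades recall for precision: fewer suggestions come back, but the ones that do are closer to the input word.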

To be continued...