Paoding Analyzer Source Code Analysis (3): Building and Using the Dictionary Files

Latest recommended article published on 2018-11-01 15:00:05

huangyunbin90

Category: Paoding Analyzer Source Code Analysis. Tags: Paoding analyzer, source code analysis

[img]http://dl.iteye.com/upload/attachment/0082/8215/a8ba9baf-71ba-3d7c-9778-d4cd3df14906.png[/img]

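[size=large]As the diagram shows, the Paoding analyzer has two kinds of dictionary objects: one reads the dictionary files before they are compiled, the other reads the compiled result.
To understand either one, it is enough to read its getVocabularyDictionary method, which is the main entry point for obtaining the vocabulary dictionary.

The two dictionary objects are almost identical. The only difference is that the pre-compilation one has to filter out some words, while the compiled one can use its word list directly.

The getVocabularyDictionary method of the compiled dictionary object:[/size]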

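/**
	 * Vocabulary dictionary.
	 * 
	 * @return the vocabulary dictionary, created lazily on first access
	 */
	public synchronized Dictionary getVocabularyDictionary() {
		if (vocabularyDictionary == null) {
			// About 5639 characters begin at least one word, so use 0x2fff (12287):
			// 12287 > 2^13 = 8192 > 8000, and 8000 * 0.75 = 6000 > 5639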
			vocabularyDictionary = new HashBinaryDictionary(
					getVocabularyWords(), 0x2fff, 0.75f);
		}
		return vocabularyDictionary;
	}
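The capacity arithmetic in that comment can be checked with a small sketch. It assumes the capacity and load factor end up in a hash table that grows once its size exceeds capacity * loadFactor; the class and method names below are made up for illustration and are not part of Paoding.

```java
// Hypothetical helper, not Paoding code: checks the capacity reasoning.
final class CapacitySketch {
    /** Number of entries a table of the given capacity holds before it must grow. */
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    /** True if the expected number of keys fits without triggering a resize. */
    static boolean fits(int expectedKeys, int capacity, float loadFactor) {
        return expectedKeys <= threshold(capacity, loadFactor);
    }
}
```

With capacity 0x2fff (12287) and load factor 0.75f the threshold is 9215, comfortably above the roughly 5639 leading characters, so the top-level table should not need to grow while it is being filled.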

[size=large]
The key piece here is HashBinaryDictionary, which is essentially a multiway tree.
The following method builds it:[/size]

protected void createSubDictionaries() {
		if (this.start >= ascWords.length) {
			return;
		}

		// Find where each run of words sharing the same character at hashIndex begins and ends; each run becomes one sub-dictionary
		int beginIndex = this.start;
		int endIndex = this.start + 1;

		char beginHashChar = getChar(ascWords[start], hashIndex);
		char endHashChar;
		for (; endIndex < this.end; endIndex++) {
			endHashChar = getChar(ascWords[endIndex], hashIndex);
			if (endHashChar != beginHashChar) {
				addSubDictionary(beginHashChar, beginIndex, endIndex);
				beginIndex = endIndex;
				beginHashChar = endHashChar;
			}
		}
		addSubDictionary(beginHashChar, beginIndex, this.end);
	}

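[size=large]A note on the input: the large word array passed to the dictionary is already sorted. The test endHashChar != beginHashChar fires when two adjacent words differ at the same character position. Take the words 一一, 一二, 二二: the third string, 二二, differs from the first two in its first character, since the first two both start with 一 while 二二 starts with 二. That is the moment addSubDictionary is called. In other words, words that share the same character at the same position go into the same SubDictionary, which forms one branch of the tree. Each branch is then split again in the same way, one character position deeper.[/size]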
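The grouping step of createSubDictionary can be sketched in isolation. This is a simplified illustration, not the Paoding code: it uses plain String instead of Word, the name groupRanges is made up, and it assumes every word in the range is longer than hashIndex.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the run-detection loop; not part of Paoding.
final class GroupByCharSketch {
    /**
     * Splits the sorted range [start, end) into runs of words that share the
     * same character at position hashIndex. Each result is {begin, end} with
     * end exclusive. Assumes every word in the range is longer than hashIndex.
     */
    static List<int[]> groupRanges(String[] ascWords, int start, int end, int hashIndex) {
        List<int[]> ranges = new ArrayList<>();
        if (start >= end) {
            return ranges;
        }
        int begin = start;
        char beginChar = ascWords[start].charAt(hashIndex);
        for (int i = start + 1; i < end; i++) {
            char c = ascWords[i].charAt(hashIndex);
            if (c != beginChar) {
                // The character changed: close the current run and open a new one.
                ranges.add(new int[] {begin, i});
                begin = i;
                beginChar = c;
            }
        }
        // Close the last run.
        ranges.add(new int[] {begin, end});
        return ranges;
    }
}
```

For the sorted array {"一一", "一二", "二二"} with hashIndex 0, this yields the ranges [0, 2) and [2, 3): one run for the words starting with 一 and one for 二二.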

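/**
	 * Turns the words between beginIndex (inclusive) and endIndex (exclusive)
	 * into one sub-dictionary.
	 * 
	 * @param hashChar the character shared by these words at hashIndex
	 * @param beginIndex index of the first word in the run
	 * @param endIndex index just past the last word in the run
	 */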
	protected void addSubDictionary(char hashChar, int beginIndex, int endIndex) {
		Dictionary subDic = createSubDictionary(ascWords, beginIndex, endIndex);
		SubDictionaryWrap subDicWrap = new SubDictionaryWrap(hashChar,
				subDic, beginIndex);
		subs.put(keyOf(hashChar), subDicWrap);
	}

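[size=large]Note that subDicWrap may wrap either another HashBinaryDictionary or a BinaryDictionary (the "binary tree" case, which looks words up by binary search). A BinaryDictionary is used when the run contains fewer than 16 words.[/size]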

protected Dictionary createSubDictionary(Word[] ascWords, int beginIndex,
			int endIndex) {
		int count = endIndex - beginIndex;
		if (count < 16) {
			return new BinaryDictionary(ascWords, beginIndex, endIndex);
		} else {
			return new HashBinaryDictionary(ascWords, hashIndex + 1,
					beginIndex, endIndex, getCapacity(count), 0.75f);
		}
	}

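[size=large]
So the dictionary as a whole is really a multiway tree.

Using the dictionary amounts to matching a keyword against this tree.
The search method of HashBinaryDictionary:[/size]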

public Hit search(CharSequence input, int begin, int count) {
		SubDictionaryWrap subDic = (SubDictionaryWrap) subs.get(keyOf(input
				.charAt(hashIndex + begin)));
		if (subDic == null) {
			return Hit.UNDEFINED;
		}
		Dictionary dic = subDic.dic;
		// Handle count == hashIndex + 1: the input ends exactly at this level of the tree
		if (count == hashIndex + 1) {
			Word header = dic.get(0);
			if (header.length() == hashIndex + 1) {
				if (subDic.wordIndexOffset + 1 < this.ascWords.length) {
					return new Hit(subDic.wordIndexOffset, header,
							this.ascWords[subDic.wordIndexOffset + 1]);
				} else {
					return new Hit(subDic.wordIndexOffset, header, null);
				}
			} else {
				return new Hit(Hit.UNCLOSED_INDEX, null, header);
			}
		}
		// count > hashIndex + 1
		Hit word = dic.search(input, begin, count);
		if (word.isHit()) {
			int index = subDic.wordIndexOffset + word.getIndex();
			word.setIndex(index);
			if (word.getNext() == null && index < size()) {
				word.setNext(get(index + 1));
			}
		}
		return word;
	}
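To see build and lookup working together, here is a minimal, self-contained sketch of the same idea. It is not the Paoding implementation: it works on String instead of Word, returns a plain boolean instead of a Hit (so it has no UNDEFINED or UNCLOSED states), and the class name MiniHashBinaryDictionary and its methods are made up for illustration. Only the threshold of 16 and the split-by-character scheme are taken from the code above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical simplified model of the multiway tree; not part of Paoding.
final class MiniHashBinaryDictionary {
    // Same threshold as createSubDictionary: smaller ranges are not split further.
    private static final int LEAF_LIMIT = 16;

    private final String[] ascWords; // shared sorted array
    private final int start;         // inclusive
    private final int end;           // exclusive
    private final int hashIndex;     // character position this node splits on
    private final Map<Character, MiniHashBinaryDictionary> subs = new HashMap<>();

    static MiniHashBinaryDictionary of(String... words) {
        String[] sorted = words.clone();
        Arrays.sort(sorted);
        return new MiniHashBinaryDictionary(sorted, 0, sorted.length, 0);
    }

    private MiniHashBinaryDictionary(String[] ascWords, int start, int end, int hashIndex) {
        this.ascWords = ascWords;
        this.start = start;
        this.end = end;
        this.hashIndex = hashIndex;
        if (end - start >= LEAF_LIMIT) {
            split();
        }
    }

    private void split() {
        int begin = start;
        // A word no longer than hashIndex equals the shared prefix and sorts first;
        // skip it here, contains() finds it by binary search.
        while (begin < end && ascWords[begin].length() <= hashIndex) {
            begin++;
        }
        if (begin >= end) {
            return;
        }
        char beginChar = ascWords[begin].charAt(hashIndex);
        for (int i = begin + 1; i < end; i++) {
            char c = ascWords[i].charAt(hashIndex);
            if (c != beginChar) {
                subs.put(beginChar, new MiniHashBinaryDictionary(ascWords, begin, i, hashIndex + 1));
                begin = i;
                beginChar = c;
            }
        }
        subs.put(beginChar, new MiniHashBinaryDictionary(ascWords, begin, end, hashIndex + 1));
    }

    boolean contains(String word) {
        // Small ranges, and words that end at this level, use binary search on the sorted range.
        if (end - start < LEAF_LIMIT || word.length() <= hashIndex) {
            return Arrays.binarySearch(ascWords, start, end, word) >= 0;
        }
        MiniHashBinaryDictionary sub = subs.get(word.charAt(hashIndex));
        return sub != null && sub.contains(word);
    }
}
```

A dictionary built with MiniHashBinaryDictionary.of(...) answers contains(word) by hashing on one character per level until the remaining range holds fewer than 16 words, then finishing with a binary search. That mirrors the split between HashBinaryDictionary and BinaryDictionary described above.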