Lucene supports many analyzers for building an index; here we use StandardAnalyzer.
package cn.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;

public class test1 {

    public static final String[] china_stop = {"着", "的", "之", "式"};

    public static void main(String[] args) throws IOException {
        // Copy the array into a CharArraySet (the second argument makes matching case-insensitive)
        CharArraySet cnstop = new CharArraySet(china_stop.length, true);
        for (String value : china_stop) {
            cnstop.add(value);
        }
        // Merge in StandardAnalyzer's default (English) stop words
        cnstop.addAll(StandardAnalyzer.STOP_WORDS_SET);
        System.out.println(cnstop);

        Analyzer analyzer = new StandardAnalyzer(cnstop);
        TokenStream stream = analyzer.tokenStream("", "中秋be之夜,享受着月华的孤独,享受着爆炸式的思维跃迁");
        // Attach the term attribute so we can read each token's text
        CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.print("[" + cta + "]");
        }
        System.out.println();
        // Follow the TokenStream contract: end and close the stream before closing the analyzer
        stream.end();
        stream.close();
        analyzer.close();
    }
}
The output is as follows.

First, the full stop-word set is printed; you can see the new Chinese stop words have been merged in with the defaults:
[着, but, be, 的, with, such, then, for, 之, 式, no, will, not, are, and, their, if, this, on, into, a, or, there, in, that, they, was, is, it, an, the, as, at, of, by, to, these]
Then the tokenization result: the four stop words "着", "的", "之", "式" have been filtered out:
[中][秋][夜][享][受][月][华][孤][独][享][受][爆][炸][思][维][跃][迁]
From the result above you can see how StandardAnalyzer handles Chinese text: it splits it into individual characters rather than words, then removes any token found in the stop-word set.
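That single-character behavior can be sketched in plain Java. This is an illustrative approximation, not Lucene's actual implementation: the `tokenize` helper and the `CjkSketch` class are hypothetical, and the English stop word "be" is passed in explicitly here, whereas in the Lucene program it comes from STOP_WORDS_SET.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CjkSketch {

    // Rough sketch of StandardAnalyzer's CJK handling: each Han character
    // becomes its own token, runs of Latin letters/digits become one
    // lowercased token, punctuation is dropped, and stop words are filtered.
    static List<String> tokenize(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        StringBuilder latin = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (Character.isLetterOrDigit(cp) && !isHan) {
                latin.appendCodePoint(Character.toLowerCase(cp));
            } else {
                if (latin.length() > 0) {            // flush a pending Latin token
                    addIfKept(tokens, latin.toString(), stopWords);
                    latin.setLength(0);
                }
                if (isHan) {                          // one token per Han character
                    addIfKept(tokens, new String(Character.toChars(cp)), stopWords);
                }
            }
            i += Character.charCount(cp);
        }
        if (latin.length() > 0) {
            addIfKept(tokens, latin.toString(), stopWords);
        }
        return tokens;
    }

    static void addIfKept(List<String> out, String token, Set<String> stop) {
        if (!stop.contains(token)) {
            out.add(token);
        }
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("着", "的", "之", "式", "be"));
        System.out.println(tokenize("中秋be之夜,享受着月华的孤独,享受着爆炸式的思维跃迁", stop));
    }
}
```

Running this prints the same characters, in the same order, as the Lucene output above, which is a useful mental model even though the real analyzer's tokenizer is far more sophisticated.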