lucene分词器介绍

最新推荐文章于 2024-04-10 16:33:28 发布

uestc_lxp

最新推荐文章于 2024-04-10 16:33:28 发布

阅读量469

点赞数

分类专栏： Lucene

本文链接：https://blog.csdn.net/uestc_lxp/article/details/45099221

版权

Lucene 专栏收录该内容

3 篇文章 0 订阅

订阅专栏

1、几种分词器简介

（1）

public class AnalyzerUtils {
	public static void display(String str,Analyzer a) {
		try {
			TokenStream stream = a.tokenStream("content", new StringReader(str));
			CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
			while(stream.incrementToken()) {
				System.out.print(" cta:"+cta);
			}
			System.out.println();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
	
	public static void displayAllTokenInfo(String str,Analyzer a) {
		try {
			TokenStream stream = a.tokenStream("content", new StringReader(str));
			PositionIncrementAttribute pia = stream.addAttribute(PositionIncrementAttribute.class);
			OffsetAttribute oa = stream.addAttribute(OffsetAttribute.class);
			CharTermAttribute cta = stream.addAttribute(CharTermAttribute.class);
			TypeAttribute ta = stream.addAttribute(TypeAttribute.class);
			for(;stream.incrementToken();) {
				System.out.println(pia.getPositionIncrement()+":");
				System.out.println(cta+"["+oa.startOffset()+"-"+oa.endOffset()+"]"+ta.type());
			}
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

（2）测试

@Test
	public void test01() {
		Analyzer a1 = new StandardAnalyzer(Version.LUCENE_35);
		Analyzer a2 = new StopAnalyzer(Version.LUCENE_35);
		Analyzer a3 = new SimpleAnalyzer(Version.LUCENE_35);
		Analyzer a4 = new WhitespaceAnalyzer(Version.LUCENE_35);
		String txt = "this is my house,I am come from China";
		AnalyzerUtils.display(txt, a1);
		AnalyzerUtils.display(txt, a2);
		AnalyzerUtils.display(txt, a3);
		AnalyzerUtils.display(txt, a4);
	}

2、自定义分词器

（1）

public class MyStopAnalyzer extends Analyzer{
	private Set stops;
	public MyStopAnalyzer(String[] sws) {
		//会自动将字符串数组转换为set
		stops = StopFilter.makeStopSet(Version.LUCENE_35, sws, true);
	}
	@Override
	public TokenStream tokenStream(String fieldName, Reader reader) {
		return new StopFilter(Version.LUCENE_35,
			   new LowerCaseFilter(Version.LUCENE_35,
					   new LetterTokenizer(Version.LUCENE_35,reader)),stops);
	}
}

（2）测试

@Test
	public void test04() {
		Analyzer a1 = new MyStopAnalyzer(new String[]{"house","come"});
		String txt = "this is my house,I am come from China";
		AnalyzerUtils.displayAllTokenInfo(txt, a1);
		
	}

3、中文分词器

（1）paoding：庖丁解牛，已经不用

（2）mmseg4j：搜狗的，可以使用

mmseg4j简介

（1）下载mmseg4j-1.8.5.jar（有带dic的，有不带dic的版本，如果使用不带dic的，必须自己指定词库位置

（2）创建的时候使用MMSegAnalyzer分词器，其它代码无需改变

@Test
	public void test02() {
		String txt = "我来自中国，我爱我的祖国，快乐的小天使";
		Analyzer a5 = new MMSegAnalyzer("D:\\lucene\\mmseg4j-1.8.5\\data");
		AnalyzerUtils.display(txt, a5);

<span style="white-space:pre">	</span>}

（3）如果分出来的词有没有的，我们可以通过手动添加的形式添加到mmseg4j根目录/data/words-my.dic进行手动添加

uestc_lxp

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
lucene分词器介绍

1、几种分词器简介（1）public class AnalyzerUtils { public static void display(String str,Analyzer a) { try { TokenStream stream = a.tokenStream("content", new StringReader(str)); CharTermAttribute
复制链接

扫一扫

专栏目录