Using Lucene in Java: how to add custom stop words

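This post covers how to filter custom stop words in addition to the standard English stop words when using Lucene. The approach is to build a CharArraySet that contains the standard English stop word set plus the custom stop words, then wrap the TokenStream in a StopFilter that uses it. Sample code shows how this is done.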

I am using Lucene to remove English stop words, but my requirement is to remove both English stop words and custom stop words. Below is my code, which removes only the English stop words.

My Sample Code:

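import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Stopwords_remove {

    public String removeStopWords(String string) throws IOException {
        TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
        // Drop the default English stop words (static field, no analyzer instance needed).
        tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
        CharTermAttribute token = tokenStream.addAttribute(CharTermAttribute.class);
        StringBuilder sb = new StringBuilder();
        tokenStream.reset(); // the TokenStream contract requires reset() before incrementToken()
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(token.toString());
        }
        tokenStream.end();
        tokenStream.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "this is a java project written by james.";
        Stopwords_remove stopwords = new Stopwords_remove();
        System.out.println(stopwords.removeStopWords(text));
    }
}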

output: java project written james.

required output: java project james.

How can I do this?

Solution

You could add your additional stop words to a copy of the standard English stop word set, or just chain a second StopFilter. For example:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
// Copy the default set first, then add the custom words to the copy.
CharArraySet stopSet = CharArraySet.copy(Version.LUCENE_36, StandardAnalyzer.STOP_WORDS_SET);
stopSet.add("add");
stopSet.add("your");
stopSet.add("stop");
stopSet.add("words");
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, stopSet);

// Or, if you just need the added stop words in a StandardAnalyzer, you could pass this stop set to the StandardAnalyzer constructor:
//analyzer = new StandardAnalyzer(Version.LUCENE_36, stopSet);

or:

TokenStream tokenStream = new StandardTokenizer(Version.LUCENE_36, new StringReader(string));
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StandardAnalyzer.STOP_WORDS_SET);
List stopWords = //your list of stop words.....
tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, StopFilter.makeStopSet(Version.LUCENE_36, stopWords));

If you are trying to create your own Analyzer, you might be better served following a pattern more like the example in the Analyzer documentation.
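
To make the copy-then-extend idea from the first approach concrete without needing Lucene on the classpath, here is a minimal sketch that uses only java.util. Everything in it is an illustrative assumption rather than Lucene API: the class name StopWordSketch, the helper names, the abbreviated default word list, and the simple regex tokenizer (which, unlike the StandardTokenizer code above, also lowercases the input). The unmodifiable default set stands in for a shared default that should not be modified in place, which is why a copy is made before adding words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {

    // Stand-in for a shared, read-only default stop set (abbreviated, illustrative list).
    static final Set<String> DEFAULT_STOP_WORDS = Collections.unmodifiableSet(
            new HashSet<String>(Arrays.asList("a", "an", "and", "by", "is", "the", "this")));

    // Copy the default set, then add the custom words to the copy.
    static Set<String> withCustomWords(String... customWords) {
        Set<String> stopSet = new HashSet<String>(DEFAULT_STOP_WORDS);
        stopSet.addAll(Arrays.asList(customWords));
        return stopSet;
    }

    // Split on non-alphanumeric characters, lowercase, drop stop words, rejoin with spaces.
    static String removeStopWords(String text, Set<String> stopSet) {
        List<String> kept = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !stopSet.contains(token)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        Set<String> stopSet = withCustomWords("written");
        // prints "java project james"
        System.out.println(removeStopWords("this is a java project written by james.", stopSet));
    }
}
```

Passing DEFAULT_STOP_WORDS directly instead of the extended copy leaves "written" in the result, which mirrors the behaviour described in the question.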
