Removing English stop words in Java: stop words and stemming

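This post discusses how to handle stop word removal and stemming in a text similarity program that the asker wants to implement by hand, without a prepackaged library such as Lucene. The answer's sample code uses Lucene's stop word filter and Porter stemmer as a reference, and shows the resulting output for two input strings.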

I'm thinking of adding stop word removal to my similarity program, followed by a stemmer (Porter 1 or Porter 2, depending on which is easier to implement).

I read my text from files as whole lines and save them as one long string, so say I have two strings, e.g.:

String one = "I decided buy something from the shop.";

String two = "Nevertheless I decidedly bought something from a shop.";

Now that I have those strings:

Stemming:

Can I just run the stemmer algorithm directly on the string, save the result as a String, and then continue working on the similarity like I did before adding the stemmer, i.e. something like one.stem();?

Stop words:

How does this work out? O.o

Do I just use one.replaceAll("I", ""); or is there a specific way to do this? I want to keep working with the string and end up with a string again before running the similarity algorithms on it. Wiki doesn't say a lot.

Hope you can help me out! Thanks.

Edit: This is for a school-related project where I'm writing a paper comparing different similarity algorithms, so I don't think I'm allowed to use Lucene or other libraries that do the work for me. Plus I would like to try and understand how it works before I start using libraries like Lucene and co. Hope it's not too much of a bother ^^

Solution

If you're not implementing this for academic reasons you should consider using the Lucene library. In either case it might be good for reference. It has classes for tokenization, stop word filtering, stemming and similarity. Here's a quick example using Lucene 3.0 to remove stop words and stem an input string:

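    // Imports for Lucene 3.0 plus the JDK classes used below.
    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.PorterStemFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.Version;

    public static String removeStopWordsAndStem(String input) throws IOException {
        // Tiny stop list for the example. No lower-casing is applied,
        // so "I" is listed in upper case to match the input.
        Set<String> stopWords = new HashSet<String>();
        stopWords.add("a");
        stopWords.add("I");
        stopWords.add("the");

        // Tokenize, then drop stop words, then stem what is left.
        TokenStream tokenStream = new StandardTokenizer(
                Version.LUCENE_30, new StringReader(input));
        tokenStream = new StopFilter(true, tokenStream, stopWords);
        tokenStream = new PorterStemFilter(tokenStream);

        // Join the surviving terms back into one space-separated string.
        StringBuilder sb = new StringBuilder();
        TermAttribute termAttr = tokenStream.getAttribute(TermAttribute.class);
        while (tokenStream.incrementToken()) {
            if (sb.length() > 0) {
                sb.append(" ");
            }
            sb.append(termAttr.term());
        }
        tokenStream.close();
        return sb.toString();
    }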

Which if used on your strings like this:

    public static void main(String[] args) throws IOException {
        String one = "I decided buy something from the shop.";
        String two = "Nevertheless I decidedly bought something from a shop.";
        System.out.println(removeStopWordsAndStem(one));
        System.out.println(removeStopWordsAndStem(two));
    }

Yields this output:

    decid bui someth from shop
    Nevertheless decidedli bought someth from shop
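If Lucene is off limits, as the edit to the question suggests, the stop word step is small enough to write by hand. Note that one.replaceAll("I", "") is not a safe way to do it: it deletes every capital "I" in the string, including those inside other words (for example "It" becomes "t"). Comparing whole tokens against a set avoids that. The following is only a rough sketch in plain Java, not what Lucene does internally; the class name StopWordSketch, the method removeStopWords, the three-word stop list, and the choice to split on non-letter, non-digit characters are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {

    // Illustrative stop list only (an assumption); a real list would be much longer.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "i", "the"));

    // Splits the input into tokens, drops stop words, and joins the rest with spaces.
    static String removeStopWords(String input) {
        StringBuilder sb = new StringBuilder();
        // Split on runs of characters that are neither letters nor digits;
        // this also discards punctuation such as the trailing full stop.
        for (String token : input.split("[^\\p{L}\\p{Nd}]+")) {
            if (token.isEmpty()) {
                continue; // a leading separator produces one empty token
            }
            // Compare in lower case so "I" and "i" are treated alike;
            // the original casing of kept tokens is preserved.
            if (STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String one = "I decided buy something from the shop.";
        String two = "Nevertheless I decidedly bought something from a shop.";
        System.out.println(removeStopWords(one)); // decided buy something from shop
        System.out.println(removeStopWords(two)); // Nevertheless decidedly bought something from shop
    }
}
```

The result is still a plain String, so it can be passed to the similarity code unchanged. A hand-written Porter stemmer would most naturally be applied per token inside the same loop, just before the token is appended, rather than to the whole string at once.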
