Source: http://hi.baidu.com/liuqiyuan/item/9926018e6e4561d55e0ec1df
More: http://www.liuqiyuan.com/blog/?p=110
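In information retrieval, stemming is the process of reducing an English word to its stem. Stemming differs from lemmatization in that the former simply strips a word down to its stem, whereas the latter uses the surrounding context to convert a word to its lemma. Wikipedia defines the two as follows: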
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base or root form—generally a written word form.
Lemmatization in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. In computational linguistics, lemmatization is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language), it can be a hard task to implement a lemmatiser for a new language.
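To make the contrast concrete, here is a minimal sketch in C#. It is not how Snowball or any real lemmatizer works: the suffix list, the tiny lemma dictionary, and the names StemVsLemmaSketch, NaiveStem, and NaiveLemma are all made up for illustration. A real lemmatizer would also use context and part of speech, which this lookup ignores.

```csharp
using System;
using System.Collections.Generic;

// Illustrative only: nothing here is part of the Snowball library.
static class StemVsLemmaSketch
{
    // Hypothetical suffix list; a real stemmer uses far richer rules.
    private static readonly string[] Suffixes = { "ing", "es", "s" };

    // Hypothetical mini-dictionary standing in for a real lemma lexicon.
    private static readonly Dictionary<string, string> Lemmas =
        new Dictionary<string, string>
        {
            { "likes", "like" },
            { "fishing", "fish" },
            { "cats", "cat" },
            { "went", "go" },
        };

    // Stemming: strip the first matching suffix, knowing nothing about the word.
    public static string NaiveStem(string word)
    {
        string lower = word.ToLowerInvariant();
        foreach (string suffix in Suffixes)
        {
            // Keep at least three characters so short words are left alone.
            if (lower.Length - suffix.Length >= 3 &&
                lower.EndsWith(suffix, StringComparison.Ordinal))
            {
                return lower.Substring(0, lower.Length - suffix.Length);
            }
        }
        return lower;
    }

    // Lemmatization (greatly simplified): dictionary lookup, else the word unchanged.
    public static string NaiveLemma(string word)
    {
        string lower = word.ToLowerInvariant();
        string lemma;
        return Lemmas.TryGetValue(lower, out lemma) ? lemma : lower;
    }

    static void Main()
    {
        foreach (string w in new[] { "likes", "fishing", "cats", "went" })
        {
            Console.WriteLine(w + " | stem: " + NaiveStem(w) + " | lemma: " + NaiveLemma(w));
        }
    }
}
```

In this sketch, "likes" is stemmed to "lik" (not a real word) while the lookup returns the lemma "like", and the irregular form "went" is left untouched by suffix stripping but mapped to "go" by the dictionary. The dictionary is what makes lemmatization more accurate and also more costly to build.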
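Lemmatization is harder to implement, and its reliance on a dictionary makes it comparatively expensive, so a stemmer is naturally the first choice when handling word inflection. For C# on the .NET platform, a port of the well-known Snowball stemmer has already been released.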
Original author's blog post: http://www.iveonik.com/blog/2011/08/snowball-stemmers-on-csharp-free-download/
Stemmer for C# download: http://www.iveonik.com/src/StemmersNet.rar
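Demonstration:
Error when importing the library: after opening the downloaded files, trying to run Program.cs in the Demo produces an error (Figure 1). To fix it, set StemmerDemo as the startup project, as shown in Figure 2.
Code to run: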
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace Iveonik.Stemmers
{
    class Program
    {
        static void Main(string[] args)
        {
            string Liu = "That fishman likes fishing with his cats";
            Console.WriteLine("OriginalString:" + Liu);
            // Stem every word of the sentence with the English Snowball stemmer.
            TestStemmer(new EnglishStemmer(), SplitedStr(Liu));
            Console.ReadKey();
        }

        // Prints each word next to the stem the given stemmer produces for it.
        private static void TestStemmer(IStemmer stemmer, params string[] words)
        {
            Console.WriteLine("Stemmer: " + stemmer);
            foreach (string word in words)
            {
                Console.WriteLine(word + " --> " + stemmer.Stem(word));
            }
        }

        // Splits the sentence into words on single spaces.
        private static string[] SplitedStr(string Liu)
        {
            string[] words = Liu.Split(' ');
            return words;
        }
    }
}
The result is as follows: