Analyzing "The quick brown fox jumped over the lazy dogs"
WhitespaceAnalyzer:
[The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
SimpleAnalyzer:
[the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]
StopAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
StandardAnalyzer:
[quick] [brown] [fox] [jumped] [over] [lazy] [dogs]
Analyzing "XY&Z Corporation - xyz@example.com"
WhitespaceAnalyzer:
[XY&Z] [Corporation] [-] [xyz@example.com]
SimpleAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer:
[xy] [z] [corporation] [xyz] [example] [com]
StandardAnalyzer:
[xy&z] [corporation] [xyz@example.com]
(The code that produced the output above is in the appendix.)
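Comparing how the four analyzers handle the two sentences, we can see that WhitespaceAnalyzer only splits the text on whitespace. SimpleAnalyzer also splits at punctuation (more precisely, at any non-letter character) and converts every letter to lowercase. StopAnalyzer does everything SimpleAnalyzer does and additionally removes stop words such as "the" and "a". StandardAnalyzer is the most powerful of the four: on the surface it appears to split on whitespace and then drop some stop words, but it actually has strong token-recognition capability. In the output above it keeps "XY&Z" as a single token, discards the lone "-", and recognizes a string such as "xyz@example.com" as an email address.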
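
To make the differences concrete, here is a minimal plain-Java sketch that imitates the four behaviours described above. It is not Lucene code and does not use the Lucene API. The class name AnalyzerSketch, the method names, the shortened stop-word list, and the regular expressions are all assumptions made for illustration, and the "standard-like" tokenizer is only a rough approximation of what StandardAnalyzer's grammar does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java approximation of the four behaviours; this is NOT Lucene code.
final class AnalyzerSketch {
    // Abbreviated, illustrative stop-word list (an assumption, not Lucene's full set).
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "to");

    // Hypothetical token patterns, tried in order: e-mail, company name, plain word.
    static final Pattern STANDARD_TOKEN = Pattern.compile(
            "[\\p{L}\\p{N}._]+@[\\p{L}\\p{N}.-]+"
            + "|\\p{L}+&\\p{L}+"
            + "|[\\p{L}\\p{N}]+");

    // Like WhitespaceAnalyzer: split on whitespace only, keep case and punctuation.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Like SimpleAnalyzer: split at every non-letter character, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the simple() tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    // Rough stand-in for StandardAnalyzer: recognize token types, lowercase, drop stop words.
    static List<String> standardLike(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = STANDARD_TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group().toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "XY&Z Corporation - xyz@example.com";
        System.out.println(whitespace(text));
        System.out.println(simple(text));
        System.out.println(stop(text));
        System.out.println(standardLike(text));
    }
}
```

The point of the sketch is the ordering of the alternatives in STANDARD_TOKEN: because the e-mail and company patterns are tried before the plain-word pattern, "xyz@example.com" and "XY&Z" survive as single tokens, while the letter-only split in simple() breaks them apart.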