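Analyzers: Chinese analyzers and English analyzers
Input text -> keyword splitting -> stop-word removal [auxiliary words that do not affect the meaning of the text; removing them speeds up indexing and shrinks the index files] -> morphological reduction [for example, stripping tense] -> conversion to lowercase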
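The steps above can be sketched in plain Java without Lucene. This is only an illustration: the class name, the tiny stop-word list, and the suffix-stripping rules are assumptions made for the example and are not what Lucene's analyzers use. The sketch lowercases early so that stop-word matching is case-insensitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; not part of Lucene.
class AnalysisPipelineSketch {
    // Tiny illustrative stop-word list (an assumption, not Lucene's list).
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "the", "is", "of", "and"));

    // Very naive morphological reduction: strips a few common English suffixes.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) return word.substring(0, word.length() - 3);
        if (word.length() > 4 && word.endsWith("ed")) return word.substring(0, word.length() - 2);
        if (word.length() > 3 && word.endsWith("s")) return word.substring(0, word.length() - 1);
        return word;
    }

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        // Keyword splitting: break on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            // Lowercasing, done first so the stop-word check ignores case.
            String lower = raw.toLowerCase();
            // Stop-word removal.
            if (STOP_WORDS.contains(lower)) continue;
            // Morphological reduction.
            terms.add(stem(lower));
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [indexer, read, index, file]
        System.out.println(analyze("The indexer is reading the indexed files"));
    }
}
```

Fewer and shorter terms reach the index after these steps, which is why stop-word removal reduces both indexing time and index size.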
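// StandardAnalyzer: grammar-based tokenizer that also removes English stop words
private Analyzer analyzer = new StandardAnalyzer();
// SimpleAnalyzer: splits on non-letter characters and lowercases the terms
private Analyzer analyzer1 = new SimpleAnalyzer();
@Test
public void test() throws Exception {
    // Pass `analyzer` to get the StandardAnalyzer result, `analyzer1` for the SimpleAnalyzer result
    analyzer(analyzer1, "javadoc.txt aa bb");
}
// Prints every token that the given analyzer produces for the text
public void analyzer(Analyzer analyzer, String text) throws Exception {
    TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(text));
    for (Token token = new Token(); (token = tokenStream.next(token)) != null;) {
        System.out.println(token);
    }
}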
Testing the standard analyzer (StandardAnalyzer)
The result is:
(javadoc.txt,0,11,type=<HOST>)
(aa,12,14,type=<ALPHANUM>)
(bb,16,18,type=<ALPHANUM>)
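This analyzer does not split on the "." character: javadoc.txt stays a single token of type <HOST>.
The second analyzer (SimpleAnalyzer) treats "." as a delimiter and produces the following result: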
(javadoc,0,7)
(txt,8,11)
(aa,12,14)
(bb,16,18)
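The second result can be reproduced with a few lines of plain Java that split on every non-letter character and record offsets. This is a sketch of the idea only; the class and method names are hypothetical and it is not Lucene's tokenizer code. With the single-space input shown in the test above, bb comes out at (15,17); the listed (16,18) would correspond to two spaces before bb, which may have been lost in these notes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; it imitates letter-based splitting and is not Lucene code.
class LetterSplitSketch {
    // Returns tokens formatted as (term,start,end), where end is exclusive.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        int start = -1; // start offset of the token being read, or -1 if none
        for (int i = 0; i <= text.length(); i++) {
            boolean letter = i < text.length() && Character.isLetter(text.charAt(i));
            if (letter && start < 0) {
                start = i; // a new token begins here
            } else if (!letter && start >= 0) {
                // A non-letter (or the end of the text) closes the current token.
                tokens.add("(" + text.substring(start, i).toLowerCase() + "," + start + "," + i + ")");
                start = -1;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("javadoc.txt aa bb")) {
            System.out.println(token);
        }
    }
}
```

Because "." is not a letter, javadoc.txt is cut into javadoc and txt, whereas a grammar-based tokenizer can recognise the dotted form as one unit.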
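Chinese word segmentation can be done in three ways: single-character segmentation, bigram segmentation [every two adjacent characters form one term], and dictionary-based segmentation [match the text against a word dictionary using some algorithm].
The ideal would be semantic segmentation.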
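The three approaches can be compared on a short string in plain Java. This is a minimal sketch under stated assumptions: the class name, the three-word dictionary, and the choice of forward maximum matching as the dictionary algorithm are all illustrative, and the code does not claim to reproduce what CJKAnalyzer or MMAnalyzer do internally. It also assumes every character fits in one Java char.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; the dictionary and method names are illustrative only.
class ChineseSegmentationSketch {
    // Single-character segmentation: every character is its own term.
    static List<String> unigrams(String text) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < text.length(); i++) out.add(text.substring(i, i + 1));
        return out;
    }

    // Bigram segmentation: overlapping pairs of adjacent characters.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<String>();
        if (text.length() == 1) out.add(text); // a lone character is kept as is
        for (int i = 0; i + 1 < text.length(); i++) out.add(text.substring(i, i + 2));
        return out;
    }

    // Dictionary segmentation by forward maximum matching: at each position take the
    // longest dictionary word (up to maxLen characters), else a single character.
    static List<String> forwardMaxMatch(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<String>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) end--;
            out.add(text.substring(i, end));
            i = end;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "中文分词器";
        Set<String> dict = new HashSet<String>(Arrays.asList("中文", "分词", "分词器"));
        System.out.println(unigrams(text));                 // [中, 文, 分, 词, 器]
        System.out.println(bigrams(text));                  // [中文, 文分, 分词, 词器]
        System.out.println(forwardMaxMatch(text, dict, 3)); // [中文, 分词器]
    }
}
```

Bigram segmentation needs no dictionary but emits meaningless pairs such as 文分, while the dictionary approach yields real words only as far as the dictionary covers them.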
private Analyzer analyzer2 = new CJKAnalyzer(); // bigram segmentation
private Analyzer analyzer3 = new MMAnalyzer(); // JE-Analysis (极易分词), dictionary-based segmentation
The following uses the JE-Analysis segmenter, MMAnalyzer.
// Highlighter handling
@Test
public void search() throws Exception {
    String queryString = "老婆";
    // Search both the "name" and "content" fields
    QueryParser queryParser = new MultiFieldQueryParser(new String[]{"name", "content"}, analyzer);
    Query query = queryParser.parse(queryString);
    IndexReader reader = IndexReader.open(FSDirectory.getDirectory(new File(indexPath)), true);
    IndexSearcher is = new IndexSearcher(reader);
    TopDocs topdocs = is.search(query, null, 10000);
    System.out.println("Found " + topdocs.totalHits + " matching records");
    // Wrap every matched term in a red <font> tag
    Formatter formatter = new SimpleHTMLFormatter("<font color='red'>", "</font>");
    Scorer scorer = new QueryScorer(query);
    List<Document> docList = new ArrayList<Document>();
    Highlighter highlighter = new Highlighter(formatter, scorer); // the highlighter
    Fragmenter fragmenter = new SimpleFragmenter(100); // size of the summary fragment
    highlighter.setTextFragmenter(fragmenter);
    for (ScoreDoc sdoc : topdocs.scoreDocs) {
        int docSn = sdoc.doc;
        Document doc = is.doc(docSn);
        docList.add(doc);
        String content = doc.get("content");
        String hc = highlighter.getBestFragment(analyzer, "content", content);
        if (hc == null) {
            // No highlighted fragment in "content": fall back to its first 50 characters
            hc = content.substring(0, Math.min(50, content.length()));
        }
        doc.getField("content").setValue(hc);
        System.out.println(doc.get("name"));
        System.out.println(doc.get("content"));
        System.out.println(doc.get("size"));
        System.out.println(doc.get("path"));
    }
    System.out.println(topdocs.totalHits);
    is.close();
    reader.close();
}
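The summary logic in the search method (take a highlighted fragment, otherwise the first 50 characters) can be tried out without an index. The sketch below is a simplified stand-in, not Lucene's Highlighter: it matches one literal term with indexOf instead of scoring fragments against a query, and the class and method names are hypothetical.

```java
// Hypothetical helper; a simplified stand-in for the idea, not Lucene's Highlighter.
class HighlightSketch {
    // Returns a fragment that starts at the first occurrence of term and covers at most
    // fragmentSize characters of the original text, with every occurrence of term inside
    // the fragment wrapped in pre/post. Returns null when the term does not occur.
    static String bestFragment(String text, String term, int fragmentSize, String pre, String post) {
        if (term.isEmpty()) return null;
        int hit = text.indexOf(term);
        if (hit < 0) return null;
        // Never cut the fragment in the middle of the matched term.
        int end = Math.min(text.length(), hit + Math.max(fragmentSize, term.length()));
        return text.substring(hit, end).replace(term, pre + term + post);
    }

    // Same fallback as in the notes: no highlighted fragment means the first 50 characters.
    static String summary(String text, String term) {
        String hc = bestFragment(text, term, 100, "<font color='red'>", "</font>");
        if (hc == null) {
            hc = text.substring(0, Math.min(50, text.length()));
        }
        return hc;
    }

    public static void main(String[] args) {
        System.out.println(summary("Lucene is a search library", "search"));
        System.out.println(summary("Lucene is a search library", "index"));
    }
}
```

The fallback matters with a multi-field query: a document can match on name only, in which case the content field has nothing to highlight and the fragment is null.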