Lucene index: four ways to change index scoring (2)

(6) float coord(int overlap, int maxOverlap)

A query may contain several search terms, and a document may likewise contain several of them. This factor rewards term coverage: the more of the query terms a document contains, the higher it scores.

```java
public void testCoord() throws Exception {
    MySimilarity sim = new MySimilarity();
    File indexDir = new File("TestCoord");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);

    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);

    // Ten extra documents containing "world" push up its document frequency,
    // which lowers the idf (and thus the weight) of "world".
    for (int i = 0; i < 10; i++) {
        Document doc3 = new Document();
        Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);
        doc3.add(f3);
        writer.addDocument(doc3);
    }
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common world");
    TopDocs docs = searcher.search(query, 2);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}

class MySimilarity extends Similarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }
}
```

In the example above, when coord always returns 1 and therefore has no effect, the first document (docid 0) contains both query terms, common and world, but world occurs in too many documents, while the second document (docid 1) contains common several times, so the second document scores higher:

docid : 1 score : 1.9059997
docid : 0 score : 1.2936771

When coord does take effect, the first document scores higher because it contains both query terms:

```java
class MySimilarity extends Similarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

docid : 0 score : 1.2936771
docid : 1 score : 0.95299983
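The arithmetic behind the two runs can be checked with a standalone sketch (a plain reimplementation of the overlap/maxOverlap fraction, not Lucene code): for the query "common world", doc 0 matches 2 of 2 terms while doc 1 matches only 1 of 2, so enabling coord halves doc 1's score (1.9059997 × 0.5 ≈ 0.953).

```java
public class CoordSketch {
    // Fraction-of-query-terms-matched coord, as in MySimilarity above.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 2)); // 1.0 -> doc 0 keeps its full score
        System.out.println(coord(1, 2)); // 0.5 -> doc 1's score is halved
        System.out.println(1.9059997f * coord(1, 2)); // ~0.953, matching the run above
    }
}
```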
(7) float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)

Because Lucene supports payloads, you can store information of your own in the index and use it to influence scoring.

What a payload is: as we know, the index is stored as inverted lists. For each term, Lucene keeps a list of the documents containing that term, and to speed up lookups this list is usually organized with skip lists. Payload data is stored inside the inverted list, alongside the document IDs, and is typically used to hold information related to each document. The same information could also be kept in a stored field; functionally the two are largely equivalent, but when there is a lot of such data to consult during search, keeping it in the inverted list, with skip-list access, can greatly improve search speed. Payloads are stored as shown in the figure below.

From this definition, a payload can store information that is related not only to the document but also to the query term. For example, if a term is special in a particular document, you can store payload data at that term's position entry for that document, so that when the term is searched, that document receives a higher score.

Using payloads to influence scoring takes several steps. In the example below, a word marked with `<b></b>` stores 1 in its payload, and any other word stores 0. (The bold tags were stripped by the original page's HTML rendering; they are restored here, since the filter clearly strips `<b>`/`</b>` markers.)

First, implement your own Analyzer so that payload data is attached to each token:

```java
class BoldAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new BoldFilter(result);
        return result;
    }
}

class BoldFilter extends TokenFilter {
    public static int IS_NOT_BOLD = 0;
    public static int IS_BOLD = 1;

    private TermAttribute termAtt;
    private PayloadAttribute payloadAtt;

    protected BoldFilter(TokenStream input) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
        payloadAtt = addAttribute(PayloadAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            final char[] buffer = termAtt.termBuffer();
            final int length = termAtt.termLength();
            String tokenstring = new String(buffer, 0, length);
            if (tokenstring.startsWith("<b>") && tokenstring.endsWith("</b>")) {
                // Strip the markers and flag the term as bold in its payload.
                tokenstring = tokenstring.replace("<b>", "");
                tokenstring = tokenstring.replace("</b>", "");
                termAtt.setTermBuffer(tokenstring);
                payloadAtt.setPayload(new Payload(int2bytes(IS_BOLD)));
            } else {
                payloadAtt.setPayload(new Payload(int2bytes(IS_NOT_BOLD)));
            }
            return true;
        } else {
            return false;
        }
    }

    public static int bytes2int(byte[] b) {
        int mask = 0xff;
        int temp = 0;
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res <<= 8;
            temp = b[i] & mask;
            res |= temp;
        }
        return res;
    }

    public static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }
}
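The int2bytes/bytes2int pair above is a symmetric big-endian encoding. A minimal standalone check of the round trip (no Lucene dependency; the class name here is just for illustration):

```java
public class PayloadBytesCheck {
    // Big-endian int -> 4 bytes, same logic as BoldFilter.int2bytes above.
    static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }

    // 4 bytes -> int, same logic as BoldFilter.bytes2int above.
    static int bytes2int(byte[] b) {
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res = (res << 8) | (b[i] & 0xff);
        }
        return res;
    }

    public static void main(String[] args) {
        // The round trip preserves the value, including negatives.
        System.out.println(bytes2int(int2bytes(1)));   // 1
        System.out.println(bytes2int(int2bytes(-42))); // -42
    }
}
```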
Next, implement your own Similarity that reads the payload back and scores from it:

```java
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
        } else {
            System.out.println("It is not a bold char.");
        }
        return 1;
    }
}
```

Finally, the search must use one of the PayloadXXXQuery classes (here PayloadTermQuery; in Lucene 2.4.1, BoostingTermQuery); otherwise scorePayload is never called.

```java
public void testPayloadScore() throws Exception {
    PayloadSimilarity sim = new PayloadSimilarity();
    File indexDir = new File("TestPayloadScore");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new BoldAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);

    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // The second document marks "hello" as bold (the <b></b> tags were stripped
    // by the original page's rendering and are restored here).
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common <b>hello</b> world", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    PayloadTermQuery query = new PayloadTermQuery(new Term("contents", "hello"),
            new MaxPayloadFunction());
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
```

If scorePayload always returns 1, the payload has no effect and the two documents tie:

It is not a bold char.
It is a bold char.
docid : 0 score : 0.2101998
docid : 1 score : 0.2101998

If scorePayload is instead implemented as follows:

```java
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
            return 10;
        } else {
            System.out.println("It is not a bold char.");
            return 1;
        }
    }
}
```

then, although both documents contain hello, the one with the bold occurrence scores higher:

It is not a bold char.
It is a bold char.
docid : 1 score : 2.101998
docid : 0 score : 0.2101998

Extend and implement your own Collector

The methods above already touch every variable in Lucene's scoring formula. If that still does not meet your needs, you can also extend and implement your own collector.

In Lucene 2.4, HitCollector has a method public abstract void collect(int doc, float score) that collects search results. TopDocCollector implements it as follows:

```java
public void collect(int doc, float score) {
    if (score > 0.0f) {
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
```

This method inserts the docid and score into a PriorityQueue so that the highest-scoring documents are returned first. We can extend HitCollector, adjust the score inside this method before inserting it into the PriorityQueue, or insert it into a data structure of our own. For example, suppose we keep a mapping from docid to document creation time elsewhere, and we want documents less than a day old to score highest, documents within a week somewhat lower, and documents older than a month much lower. We can modify it like this:

```java
public static long millisecondsOneDay = 24L * 3600L * 1000L;
public static long millisecondsOneWeek = 7L * 24L * 3600L * 1000L;
public static long millisecondsOneMonth = 30L * 24L * 3600L * 1000L;

public void collect(int doc, float score) {
    if (score > 0.0f) {
        // Age of the document in milliseconds, looked up externally.
        long time = getTimeByDocId(doc);
        if (time < millisecondsOneDay) {
            score = score * 1.0f;
        } else if (time < millisecondsOneWeek) {
            score = score * 0.8f;
        } else if (time < millisecondsOneMonth) {
            score = score * 0.3f;
        } else {
            score = score * 0.1f;
        }
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
```

In Lucene 3.0, the Collector interface is void collect(int doc), and TopScoreDocCollector implements it as follows:

```java
public void collect(int doc) throws IOException {
    float score = scorer.score();
    totalHits++;
    if (score <= pqTop.score) {
        return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pqTop = pq.updateTop();
}
```

The score can be influenced here in the same way.
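The freshness multiplier used in the modified collect method can be factored into a small standalone helper and checked without Lucene. This is only a sketch of the article's idea; getTimeByDocId and the concrete cutoff multipliers are this article's assumptions, not part of any Lucene API:

```java
public class TimeDecay {
    static final long MS_DAY = 24L * 3600L * 1000L;
    static final long MS_WEEK = 7L * MS_DAY;
    static final long MS_MONTH = 30L * MS_DAY;

    // Multiplier applied to a hit's score based on the document's age in milliseconds.
    static float decay(long ageMillis) {
        if (ageMillis < MS_DAY) return 1.0f;
        if (ageMillis < MS_WEEK) return 0.8f;
        if (ageMillis < MS_MONTH) return 0.3f;
        return 0.1f;
    }

    public static void main(String[] args) {
        System.out.println(decay(3600L * 1000L)); // 1.0 (one hour old)
        System.out.println(decay(3L * MS_DAY));   // 0.8 (three days old)
        System.out.println(decay(60L * MS_DAY));  // 0.1 (two months old)
    }
}
```

A collector would multiply each raw score by decay(age) before inserting the hit into its priority queue, exactly as the collect method above does inline.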