Lucene index: four ways to change index scoring (2)

(6) float coord(int overlap, int maxOverlap)

A query may contain several search terms, and a document may likewise contain several of them. This factor rewards term coverage: the more of the query terms a document contains, the higher it scores.

```java
public void testCoord() throws Exception {
    MySimilarity sim = new MySimilarity();
    File indexDir = new File("TestCoord");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);

    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);

    // Ten extra documents containing "world" push up its document frequency,
    // which lowers the idf (and thus the weight) of "world".
    for (int i = 0; i < 10; i++) {
        Document doc3 = new Document();
        Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);
        doc3.add(f3);
        writer.addDocument(doc3);
    }
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents",
            new StandardAnalyzer(Version.LUCENE_CURRENT));
    Query query = parser.parse("common world");
    TopDocs docs = searcher.search(query, 2);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}

class MySimilarity extends Similarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return 1;
    }
}
```

In the example above, when coord always returns 1 and therefore has no effect, the first document (docid 0) contains both query terms, common and world, but world occurs in too many documents, while the second document (docid 1) contains common several times, so the second document scores higher:

docid : 1 score : 1.9059997
docid : 0 score : 1.2936771

When coord does take effect, the first document scores higher because it contains both query terms:

```java
class MySimilarity extends Similarity {
    @Override
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
}
```

docid : 0 score : 1.2936771
docid : 1 score : 0.95299983
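The arithmetic behind the two runs can be checked with a standalone sketch (a plain reimplementation of the overlap/maxOverlap fraction, not Lucene code): for the query "common world", doc 0 matches 2 of 2 terms while doc 1 matches only 1 of 2, so enabling coord halves doc 1's score (1.9059997 × 0.5 ≈ 0.953).

```java
public class CoordSketch {
    // Fraction-of-query-terms-matched coord, as in MySimilarity above.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        System.out.println(coord(2, 2)); // 1.0 -> doc 0 keeps its full score
        System.out.println(coord(1, 2)); // 0.5 -> doc 1's score is halved
        System.out.println(1.9059997f * coord(1, 2)); // ~0.953, matching the run above
    }
}
```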
(7) float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)

Because Lucene supports payloads, you can store information of your own in the index and use it to influence scoring.

What a payload is: as we know, the index is stored as inverted lists. For each term, Lucene keeps a list of the documents containing that term, and to speed up lookups this list is usually organized with skip lists. Payload data is stored inside the inverted list, alongside the document IDs, and is typically used to hold information related to each document. The same information could also be kept in a stored field; functionally the two are largely equivalent, but when there is a lot of such data to consult during search, keeping it in the inverted list, with skip-list access, can greatly improve search speed. Payloads are stored as shown in the figure below.

From this definition, a payload can store information that is related not only to the document but also to the query term. For example, if a term is special in a particular document, you can store payload data at that term's position entry for that document, so that when the term is searched, that document receives a higher score.

Using payloads to influence scoring takes several steps. In the example below, a word marked with `<b></b>` stores 1 in its payload, and any other word stores 0. (The bold tags were stripped by the original page's HTML rendering; they are restored here, since the filter clearly strips `<b>`/`</b>` markers.)

First, implement your own Analyzer so that payload data is attached to each token:

```java
class BoldAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream result = new WhitespaceTokenizer(reader);
        result = new BoldFilter(result);
        return result;
    }
}

class BoldFilter extends TokenFilter {
    public static int IS_NOT_BOLD = 0;
    public static int IS_BOLD = 1;

    private TermAttribute termAtt;
    private PayloadAttribute payloadAtt;

    protected BoldFilter(TokenStream input) {
        super(input);
        termAtt = addAttribute(TermAttribute.class);
        payloadAtt = addAttribute(PayloadAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {
            final char[] buffer = termAtt.termBuffer();
            final int length = termAtt.termLength();
            String tokenstring = new String(buffer, 0, length);
            if (tokenstring.startsWith("<b>") && tokenstring.endsWith("</b>")) {
                // Strip the markers and flag the term as bold in its payload.
                tokenstring = tokenstring.replace("<b>", "");
                tokenstring = tokenstring.replace("</b>", "");
                termAtt.setTermBuffer(tokenstring);
                payloadAtt.setPayload(new Payload(int2bytes(IS_BOLD)));
            } else {
                payloadAtt.setPayload(new Payload(int2bytes(IS_NOT_BOLD)));
            }
            return true;
        } else {
            return false;
        }
    }

    public static int bytes2int(byte[] b) {
        int mask = 0xff;
        int temp = 0;
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res <<= 8;
            temp = b[i] & mask;
            res |= temp;
        }
        return res;
    }

    public static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }
}
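The int2bytes/bytes2int pair above is a symmetric big-endian encoding. A minimal standalone check of the round trip (no Lucene dependency; the class name here is just for illustration):

```java
public class PayloadBytesCheck {
    // Big-endian int -> 4 bytes, same logic as BoldFilter.int2bytes above.
    static byte[] int2bytes(int num) {
        byte[] b = new byte[4];
        for (int i = 0; i < 4; i++) {
            b[i] = (byte) (num >>> (24 - i * 8));
        }
        return b;
    }

    // 4 bytes -> int, same logic as BoldFilter.bytes2int above.
    static int bytes2int(byte[] b) {
        int res = 0;
        for (int i = 0; i < 4; i++) {
            res = (res << 8) | (b[i] & 0xff);
        }
        return res;
    }

    public static void main(String[] args) {
        // The round trip preserves the value, including negatives.
        System.out.println(bytes2int(int2bytes(1)));   // 1
        System.out.println(bytes2int(int2bytes(-42))); // -42
    }
}
```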
Next, implement your own Similarity that reads the payload back and scores from it:

```java
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
        } else {
            System.out.println("It is not a bold char.");
        }
        return 1;
    }
}
```

Finally, the search must use one of the PayloadXXXQuery classes (here PayloadTermQuery; in Lucene 2.4.1, BoostingTermQuery); otherwise scorePayload is never called.

```java
public void testPayloadScore() throws Exception {
    PayloadSimilarity sim = new PayloadSimilarity();
    File indexDir = new File("TestPayloadScore");
    IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir),
            new BoldAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);

    Document doc1 = new Document();
    Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
    doc1.add(f1);
    writer.addDocument(doc1);

    // The second document marks "hello" as bold (the <b></b> tags were stripped
    // by the original page's rendering and are restored here).
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common <b>hello</b> world", Field.Store.NO, Field.Index.ANALYZED);
    doc2.add(f2);
    writer.addDocument(doc2);
    writer.close();

    IndexReader reader = IndexReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(sim);
    PayloadTermQuery query = new PayloadTermQuery(new Term("contents", "hello"),
            new MaxPayloadFunction());
    TopDocs docs = searcher.search(query, 10);
    for (ScoreDoc doc : docs.scoreDocs) {
        System.out.println("docid : " + doc.doc + " score : " + doc.score);
    }
}
```

If scorePayload always returns 1, the payload has no effect and the two documents tie:

It is not a bold char.
It is a bold char.
docid : 0 score : 0.2101998
docid : 1 score : 0.2101998

If scorePayload is instead implemented as follows:

```java
class PayloadSimilarity extends DefaultSimilarity {
    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
            byte[] payload, int offset, int length) {
        int isbold = BoldFilter.bytes2int(payload);
        if (isbold == BoldFilter.IS_BOLD) {
            System.out.println("It is a bold char.");
            return 10;
        } else {
            System.out.println("It is not a bold char.");
            return 1;
        }
    }
}
```

then, although both documents contain hello, the one with the bold occurrence scores higher:

It is not a bold char.
It is a bold char.
docid : 1 score : 2.101998
docid : 0 score : 0.2101998

Extend and implement your own Collector

The methods above already touch every variable in Lucene's scoring formula. If that still does not meet your needs, you can also extend and implement your own collector.

In Lucene 2.4, HitCollector has a method public abstract void collect(int doc, float score) that collects search results. TopDocCollector implements it as follows:

```java
public void collect(int doc, float score) {
    if (score > 0.0f) {
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
```

This method inserts the docid and score into a PriorityQueue so that the highest-scoring documents are returned first. We can extend HitCollector, adjust the score inside this method before inserting it into the PriorityQueue, or insert it into a data structure of our own. For example, suppose we keep a mapping from docid to document creation time elsewhere, and we want documents less than a day old to score highest, documents within a week somewhat lower, and documents older than a month much lower. We can modify it like this:

```java
public static long millisecondsOneDay = 24L * 3600L * 1000L;
public static long millisecondsOneWeek = 7L * 24L * 3600L * 1000L;
public static long millisecondsOneMonth = 30L * 24L * 3600L * 1000L;

public void collect(int doc, float score) {
    if (score > 0.0f) {
        // Age of the document in milliseconds, looked up externally.
        long time = getTimeByDocId(doc);
        if (time < millisecondsOneDay) {
            score = score * 1.0f;
        } else if (time < millisecondsOneWeek) {
            score = score * 0.8f;
        } else if (time < millisecondsOneMonth) {
            score = score * 0.3f;
        } else {
            score = score * 0.1f;
        }
        totalHits++;
        if (reusableSD == null) {
            reusableSD = new ScoreDoc(doc, score);
        } else if (score >= reusableSD.score) {
            reusableSD.doc = doc;
            reusableSD.score = score;
        } else {
            return;
        }
        reusableSD = (ScoreDoc) hq.insertWithOverflow(reusableSD);
    }
}
```

In Lucene 3.0, the Collector interface is void collect(int doc), and TopScoreDocCollector implements it as follows:

```java
public void collect(int doc) throws IOException {
    float score = scorer.score();
    totalHits++;
    if (score <= pqTop.score) {
        return;
    }
    pqTop.doc = doc + docBase;
    pqTop.score = score;
    pqTop = pq.updateTop();
}
```

The score can be influenced here in the same way.
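The freshness multiplier used in the modified collect method can be factored into a small standalone helper and checked without Lucene. This is only a sketch of the article's idea; getTimeByDocId and the concrete cutoff multipliers are this article's assumptions, not part of any Lucene API:

```java
public class TimeDecay {
    static final long MS_DAY = 24L * 3600L * 1000L;
    static final long MS_WEEK = 7L * MS_DAY;
    static final long MS_MONTH = 30L * MS_DAY;

    // Multiplier applied to a hit's score based on the document's age in milliseconds.
    static float decay(long ageMillis) {
        if (ageMillis < MS_DAY) return 1.0f;
        if (ageMillis < MS_WEEK) return 0.8f;
        if (ageMillis < MS_MONTH) return 0.3f;
        return 0.1f;
    }

    public static void main(String[] args) {
        System.out.println(decay(3600L * 1000L)); // 1.0 (one hour old)
        System.out.println(decay(3L * MS_DAY));   // 0.8 (three days old)
        System.out.println(decay(60L * MS_DAY));  // 0.1 (two months old)
    }
}
```

A collector would multiply each raw score by decay(age) before inserting the hit into its priority queue, exactly as the collect method above does inline.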