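Let's start by agreeing on one thing: "Not everyone is born equal!" The same is true of Documents and Fields.
Suppose you need to index some emails, with the requirement that emails sent by the captain must rank ahead of emails sent by the crew in the search results. How would you do it?
Luckily, Lucene already has a mechanism for this, and it is very simple: boosting. Every document carries a boost factor that defaults to 1.0, and you can change it to meet the requirement above. For important documents (the captain's emails in this example) we can make it greater than 1.0, say 2.0. For less important documents (the crew's emails) we can make it less than 1.0, say 0.5. You could just as well use 3.0 for the important ones and 2.0 for the less important ones; what matters is the relative size, and the design is up to you.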
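To make the idea concrete, here is a minimal sketch in plain Java, with no Lucene involved. It assumes a deliberately simplified model in which the final score is just a raw match score multiplied by the boost. The class `BoostRankingSketch`, the `Mail` type, and the numbers are all made up for illustration; Lucene's real scoring formula has more terms than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model: finalScore = rawScore * boost.
// This is NOT Lucene's actual scoring formula, only an illustration.
final class BoostRankingSketch {

    static final class Mail {
        final String sender;
        final double rawScore; // made-up measure of how well the mail matches the query
        final double boost;    // 1.0 means "no preference"

        Mail(String sender, double rawScore, double boost) {
            this.sender = sender;
            this.rawScore = rawScore;
            this.boost = boost;
        }

        double finalScore() {
            return rawScore * boost;
        }
    }

    // Returns sender names ordered by descending final score.
    static List<String> rank(List<Mail> mails) {
        List<Mail> sorted = new ArrayList<>(mails);
        sorted.sort((a, b) -> Double.compare(b.finalScore(), a.finalScore()));
        List<String> names = new ArrayList<>();
        for (Mail m : sorted) {
            names.add(m.sender);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Mail> mails = new ArrayList<>();
        mails.add(new Mail("Sanji", 1.0, 1.0)); // default boost
        mails.add(new Mail("Luffy", 1.0, 2.0)); // important sender
        mails.add(new Mail("Zoro", 1.0, 0.5));  // less important sender
        System.out.println(rank(mails)); // prints [Luffy, Sanji, Zoro]
    }
}
```

Note that in this model the boost only scales the score. If Zoro's mail matched the query five times better (raw score 5.0), its final score of 2.5 would still beat Luffy's 2.0, so a boost shifts the ranking without dictating it.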
So how do you change this factor? The Lucene 3.x API used in this article provides a dedicated method for it: setBoost(float). It's that simple!
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import junit.framework.TestCase;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
public class IndexBoostingTest extends TestCase {
    private Directory directory;

    protected void setUp() throws Exception {
        directory = new RAMDirectory();
        IndexWriter writer = getWriter();
        List<Email> mails = makeSomeEmails();
        for (Email e : mails) {
            Document doc = new Document();
            doc.add(new Field("senderEmail", e.getSenderEmail(),
                    Field.Store.YES,
                    Field.Index.NOT_ANALYZED));
            doc.add(new Field("senderName", e.getSenderName(),
                    Field.Store.YES,
                    Field.Index.ANALYZED));
            doc.add(new Field("subject", e.getSubject(),
                    Field.Store.YES,
                    Field.Index.ANALYZED));
            doc.add(new Field("body", e.getBody(),
                    Field.Store.NO,
                    Field.Index.ANALYZED));
            // Key code: set the document's boost factor
            if (Email.IMPORTANT.equals(e.getSenderDomain())) {
                doc.setBoost(1.5F);
            } else if (Email.UNIMPORTANT.equals(e.getSenderDomain())) {
                doc.setBoost(0.5F);
            } else {
                // Optional: the default boost factor is already 1.0
                doc.setBoost(1F);
            }
            writer.addDocument(doc);
        }
        writer.close();
    }
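    // A real test case should not be written like this; it is not rigorous.
    // The ranking depends not only on boost but also on how well each
    // document matches the query, among other factors.
    public void testBoostResult() throws IOException {
        IndexSearcher is = new IndexSearcher(directory);
        try {
            // The mail bodies are Chinese text. StandardAnalyzer indexes each
            // Chinese character as its own term, so a single character can be queried.
            Query query = new TermQuery(new Term("body", "团"));
            TopDocs topDocs = is.search(query, 3);
            ScoreDoc[] docs = topDocs.scoreDocs;
            // Expected order: Luffy, Sanji, Zoro
            String luffy = is.doc(docs[0].doc).get("senderName");
            String sanji = is.doc(docs[1].doc).get("senderName");
            String zoro = is.doc(docs[2].doc).get("senderName");
            assertEquals("Luffy", luffy); // Luffy comes first: his boost is 1.5
            assertEquals("Sanji", sanji); // Sanji uses the default boost of 1.0
            assertEquals("Zoro", zoro);   // the one who always gets lost has a boost of 0.5
        } finally {
            // Release the searcher even if an assertion fails
            is.close();
        }
    }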
    private IndexWriter getWriter() throws IOException {
        return new IndexWriter(directory, new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    // Build mock test data
    private List<Email> makeSomeEmails() {
        ArrayList<Email> testData = new ArrayList<Email>();
        // Test data 1: no boost set
        Email mail1 = new Email();
        mail1.setSenderEmail("Sanji@iteye.com");
        mail1.setSenderName("Sanji");
        mail1.setSenderDomain("普通的~~");
        mail1.setSubject("海贼");
        mail1.setBody("草帽海贼团厨师,金发,有着卷曲眉毛,永远遮住半边脸的家伙,其左眼是个迷,香烟不离口,海贼中的绅士");
        testData.add(mail1);
        // Test data 2: higher boost
        Email mail2 = new Email();
        mail2.setSenderEmail("Monkey·D·Luffy@iteye.com");
        mail2.setSenderName("Luffy");
        mail2.setSenderDomain(Email.IMPORTANT);
        mail2.setSubject("海贼");
        mail2.setBody("草帽海贼团船长,特征是头戴草帽,顽强,坚定,喜欢探险,最爱吃肉");
        testData.add(mail2);
        // Test data 3: lower boost (Zoro fans, please don't flame me, I like him too... it's just an example)
        Email mail3 = new Email();
        mail3.setSenderEmail("RoronoaZoro@iteye.com");
        mail3.setSenderName("Zoro");
        mail3.setSenderDomain(Email.UNIMPORTANT);
        mail3.setSubject("海贼");
        mail3.setBody("草帽海贼团剑士,绿色头发,左耳戴三只黄色露珠耳环");
        testData.add(mail3);
        return testData;
    }
}
class Email {
    public static final String IMPORTANT = "important";
    public static final String UNIMPORTANT = "unimportant";

    private String senderEmail;
    private String senderName;
    private String senderDomain;
    private String subject;
    private String body;

    public String getSenderEmail() {
        return senderEmail;
    }

    public void setSenderEmail(String senderEmail) {
        this.senderEmail = senderEmail;
    }

    public String getSenderName() {
        return senderName;
    }

    public void setSenderName(String senderName) {
        this.senderName = senderName;
    }

    public String getSenderDomain() {
        return senderDomain;
    }

    public void setSenderDomain(String senderDomain) {
        this.senderDomain = senderDomain;
    }

    public String getSubject() {
        return subject;
    }

    public void setSubject(String subject) {
        this.subject = subject;
    }

    public String getBody() {
        return body;
    }

    public void setBody(String body) {
        this.body = body;
    }
}
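Here comes the next question: what if we decide that a keyword appearing in the email's subject matters more than one appearing in the body? In other words, the Field named "subject" should take priority over the one named "body".
Well, that is simple too. Just call setBoost(float) on that field: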
// "subject" here is the email's subject string, e.g. e.getSubject() in the listing above
Field subjectField = new Field("subject", subject,
        Field.Store.YES,
        Field.Index.ANALYZED);
subjectField.setBoost(1.2F);
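In fact, setting a boost on the document, as in the earlier example, is equivalent to giving every Field in that document the same boost. Naturally, you can also set a boost on an individual Field!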
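Here is a rough plain-Java sketch of why the two are related. In Lucene 3.x the document boost, the field boost, and a length normalization factor are multiplied into a single per-field value called the norm; with the default similarity the length factor is 1 / sqrt(number of terms in the field). The class `NormSketch` and its method names are made up for illustration, and the sketch leaves out the lossy single-byte encoding that Lucene applies to the stored norm.

```java
// Simplified sketch of how boosts are folded into a field's norm in Lucene 3.x.
// Names are hypothetical; the real value is additionally encoded into one byte,
// which this sketch does not model.
final class NormSketch {

    // Length normalization of the default similarity: 1 / sqrt(numTerms).
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // norm = document boost * field boost * length normalization
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        int numTerms = 4; // pretend the field contains four terms

        // A document boost of 1.5 with no field boost...
        float viaDocument = norm(1.5f, 1.0f, numTerms);
        // ...gives the same norm as a field boost of 1.5 with no document boost.
        float viaField = norm(1.0f, 1.5f, numTerms);

        System.out.println(viaDocument == viaField); // prints true
        System.out.println(viaDocument);             // prints 0.75
    }
}
```

Because the two boosts are multiplied together, setting both a document boost and a field boost compounds them rather than letting one override the other.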
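One thing must be pointed out: to change a boost that has already been set, you have to re-index the whole document with the new boost value. That sounds bad, since customer requirements are always changing. Don't worry: the same effect can be achieved with custom sorting at search time, which is more dynamic and more flexible. Index-time boosts therefore suit documents that should always rank near the top, and for those you can now set a very large boost value! Keep in mind that a boost only scales the score, so it favors a document strongly without strictly guaranteeing first place.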
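As a rough illustration of the search-time alternative, the plain-Java sketch below keeps the weights outside the index, in an ordinary map, and applies them to the hits after the search. Everything in it (`SearchTimeRerankSketch`, `Hit`, the weight map, the scores) is hypothetical and is not part of the Lucene API; it only shows why changing a weight at search time needs no re-indexing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: re-rank search hits with weights that live outside the index.
final class SearchTimeRerankSketch {

    static final class Hit {
        final String sender;
        final String domain; // e.g. "important" / "unimportant"
        final double score;  // score as returned by the search (made-up numbers here)

        Hit(String sender, String domain, double score) {
            this.sender = sender;
            this.domain = domain;
            this.score = score;
        }
    }

    // Domains without an entry in the map keep a neutral weight of 1.0.
    static double weighted(Hit hit, Map<String, Double> weightByDomain) {
        Double weight = weightByDomain.get(hit.domain);
        return hit.score * (weight == null ? 1.0 : weight);
    }

    // Returns sender names ordered by descending weighted score.
    static List<String> rerank(List<Hit> hits, Map<String, Double> weightByDomain) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((a, b) -> Double.compare(
                weighted(b, weightByDomain), weighted(a, weightByDomain)));
        List<String> names = new ArrayList<>();
        for (Hit h : sorted) {
            names.add(h.sender);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("Sanji", "ordinary", 1.0));
        hits.add(new Hit("Luffy", "important", 1.0));
        hits.add(new Hit("Zoro", "unimportant", 1.0));

        Map<String, Double> weights = new HashMap<>();
        weights.put("important", 2.0);
        weights.put("unimportant", 0.5);
        System.out.println(rerank(hits, weights)); // prints [Luffy, Sanji, Zoro]

        // The requirement changed: just edit the map, the index stays untouched.
        weights.put("important", 0.5);
        weights.put("unimportant", 2.0);
        System.out.println(rerank(hits, weights)); // prints [Zoro, Sanji, Luffy]
    }
}
```

The trade-off is that re-ranking only reorders the hits that were actually fetched, so a real application would need to fetch enough of them, or use Lucene's own search-time sorting facilities instead.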