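Lucene does not handle numeric fields especially gracefully, and they often cause unexpected "problems".
The analysis below covers three aspects: indexing, sorting, and range search (rangeQuery).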
Indexing
First, some preparation: build the index.
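import java.io.IOException;
import java.util.Random;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.LockObtainFailedException;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// In-memory index shared by all the methods in this post.
RAMDirectory dir = new RAMDirectory();

public void index() {
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    try {
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
                Version.LUCENE_36, analyzer));
        Random random = new Random();
        // f0 holds the same value in every document, so one TermQuery matches them all.
        Field f0 = new Field("f0", "c", Store.YES, Index.NOT_ANALYZED);
        // f1 and f2 store numbers as plain strings.
        Field f1 = new Field("f1", "", Store.YES, Index.NOT_ANALYZED);
        Field f2 = new Field("f2", "", Store.YES, Index.NOT_ANALYZED);
        // f3 and f4 store numbers as NumericField (stored and indexed).
        NumericField f3 = new NumericField("f3", Store.YES, true);
        NumericField f4 = new NumericField("f4", Store.YES, true);
        for (int i = 0; i < 20; i++) {
            int value = random.nextInt(100);
            // The field instances are reused; only their values change per document.
            f1.setValue(String.valueOf(value));
            f2.setValue(String.valueOf(value + random.nextFloat()));
            f3.setIntValue(value);
            f4.setFloatValue(value + random.nextFloat());
            Document doc = new Document();
            doc.add(f0);
            doc.add(f1);
            doc.add(f2);
            doc.add(f3);
            doc.add(f4);
            writer.addDocument(doc);
        }
        writer.close();
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (LockObtainFailedException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}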
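There are five fields in total:
f0: Field type, holding the constant value "c", so that a single TermQuery matches every document;
f1: Field type, holding the string value of an int;
f2: Field type, holding the string value of a float;
f3: NumericField type, holding an int;
f4: NumericField type, holding a float.
The index contains 20 documents.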
Sorting
According to the Lucene API, SortField defines the following sort types:
Type | Field | Description
---|---|---
static int | BYTE | Sort using term values as encoded Bytes.
static int | CUSTOM | Sort using a custom Comparator.
static int | DOC | Sort by document number (index order).
static int | DOUBLE | Sort using term values as encoded Doubles.
static SortField | FIELD_DOC | Represents sorting by document number (index order).
static SortField | FIELD_SCORE | Represents sorting by document score (relevance).
static int | FLOAT | Sort using term values as encoded Floats.
static int | INT | Sort using term values as encoded Integers.
static int | LONG | Sort using term values as encoded Longs.
static int | SCORE | Sort by document score (relevance).
static int | SHORT | Sort using term values as encoded Shorts.
static int | STRING | Sort using term values as Strings.
static int | STRING_VAL | Sort using term values as Strings, but comparing by value (using String.compareTo) for all comparisons.
Here we only look at STRING, INT, and FLOAT.
public void sort() {
    IndexReader reader;
    try {
        reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        TermQuery query = new TermQuery(new Term("f0", "c"));
        // SortField field = new SortField("f1", SortField.STRING); // problematic
        // SortField field = new SortField("f1", SortField.INT); // OK
        // SortField field = new SortField("f1", SortField.FLOAT); // OK
        // SortField field = new SortField("f2", SortField.STRING); // problematic
        // SortField field = new SortField("f2", SortField.INT); // problematic
        // SortField field = new SortField("f2", SortField.FLOAT); // OK
        // SortField field = new SortField("f3", SortField.STRING); // problematic
        // SortField field = new SortField("f3", SortField.INT); // OK
        // SortField field = new SortField("f3", SortField.FLOAT); // OK
        // SortField field = new SortField("f4", SortField.STRING); // OK
        // SortField field = new SortField("f4", SortField.INT); // OK
        SortField field = new SortField("f4", SortField.FLOAT); // OK
        Sort sort = new Sort(field);
        TopFieldDocs docs = searcher.search(query, 20, sort);
        ScoreDoc[] sds = docs.scoreDocs;
        for (ScoreDoc sd : sds) {
            Document doc = reader.document(sd.doc);
            System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                    + doc.get("f3") + "\t" + doc.get("f4"));
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
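The tests above show the following.
If the index was built with the Field class, sorting works as long as the "correct" data type is specified. SortField.STRING is no good, because the values are then compared as text rather than as numbers. And if the indexed value is the string form of a float, sorting with SortField.INT also fails, with this exception: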
java.lang.NumberFormatException: Invalid shift value in prefixCoded string (is encoded value really an INT?)
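Judging from the exception, Lucene first converts each string term to the requested numeric type when sorting; if the wrong type is specified (for example, converting 1.2 to an int), the conversion fails with an exception. The wording of the message also suggests that, once plain integer parsing has failed, Lucene tries to read the term as a NumericField-encoded (prefix-coded) int, and that attempt fails as well.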
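Both failure modes can be reproduced outside Lucene. The following is a minimal plain-Java sketch, not Lucene code: the class and method names are made up for illustration, and the sample values are arbitrary. It sorts the same values once as strings and once as ints, and then shows that a float string cannot be parsed as an int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; it does not use any Lucene API.
class StringVsNumericSortDemo {

    // Sort the way a plain string sort does: character by character.
    static List<String> sortAsStrings(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy);
        return copy;
    }

    // Sort by the integer each string represents.
    static List<String> sortAsInts(List<String> values) {
        List<String> copy = new ArrayList<String>(values);
        Collections.sort(copy, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(Integer.parseInt(a), Integer.parseInt(b));
            }
        });
        return copy;
    }

    // Returns true only if the text is a valid int literal.
    static boolean parsesAsInt(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("9", "10", "57", "6");
        System.out.println(sortAsStrings(values)); // [10, 57, 6, 9]
        System.out.println(sortAsInts(values));    // [6, 9, 10, 57]
        System.out.println(parsesAsInt("57"));     // true
        System.out.println(parsesAsInt("57.3"));   // false
    }
}
```

The string order puts "10" before "6", which is the kind of result that makes SortField.STRING look wrong on f1, and the failed parse of "57.3" corresponds to sorting f2 with SortField.INT.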
If the index was built with NumericField, sort with the same type that was indexed. Worrying about other combinations is not worth the trouble.
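One likely reason NumericField values sort correctly is that a float is converted into an int whose ordering matches the float's ordering before it is indexed. The sketch below reimplements that idea in plain Java. It is modeled on my recollection of NumericUtils.floatToSortableInt and should be read as an illustration under that assumption, not as Lucene's actual code; the class name is hypothetical.

```java
// Hypothetical demo class; an illustration of an order-preserving float encoding.
class SortableFloatDemo {

    // Map a float to an int so that comparing the ints gives the same
    // result as comparing the floats.
    static int floatToSortableInt(float value) {
        int bits = Float.floatToIntBits(value);
        // For negative floats, flip the lower 31 bits so that a larger
        // magnitude becomes a smaller int.
        if (bits < 0) {
            bits ^= 0x7fffffff;
        }
        return bits;
    }

    // Inverse of floatToSortableInt.
    static float sortableIntToFloat(int sortable) {
        if (sortable < 0) {
            sortable ^= 0x7fffffff;
        }
        return Float.intBitsToFloat(sortable);
    }

    public static void main(String[] args) {
        float[] samples = { -1.5f, -0.5f, 0f, 30.25f, 60f };
        for (float sample : samples) {
            int encoded = floatToSortableInt(sample);
            // Print the value, its encoded form, and the decoded value.
            System.out.println(sample + "\t" + encoded + "\t" + sortableIntToFloat(encoded));
        }
    }
}
```

Because the encoded ints increase in step with the floats, a comparison on the encoded form is enough for both sorting and range checks.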
Range search
public void rangeSearch() {
    IndexReader reader;
    try {
        reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Query query = new TermRangeQuery("f1", "30", "60", true,
        //         true); // problematic
        // Query query = NumericRangeQuery.newIntRange("f3", 30, 60,
        //         true, true); // OK
        // Query query = new TermRangeQuery("f2", "30", "60", true,
        //         true); // problematic
        Query query = NumericRangeQuery.newFloatRange("f4", 30f, 60f, true,
                true); // OK
        TopDocs docs = searcher.search(query, 20);
        ScoreDoc[] sds = docs.scoreDocs;
        for (ScoreDoc sd : sds) {
            Document doc = reader.document(sd.doc);
            System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                    + doc.get("f3") + "\t" + doc.get("f4"));
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
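The trouble with TermRangeQuery on f1 and f2 can be illustrated without Lucene, on the assumption that a term range check boils down to comparing terms as strings. In this plain-Java sketch (hypothetical class and method names, arbitrary sample terms), the range "30" to "60" is evaluated both ways and the terms on which the two checks disagree are collected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it does not use any Lucene API.
class LexicographicRangeDemo {

    // Inclusive range check using string (lexicographic) comparison.
    static boolean inStringRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    // Inclusive range check using numeric comparison.
    static boolean inIntRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    // Collect the terms for which the string check and the numeric check disagree.
    static List<String> disagreements(List<String> terms, int lower, int upper) {
        List<String> result = new ArrayList<String>();
        for (String term : terms) {
            boolean asString = inStringRange(term,
                    String.valueOf(lower), String.valueOf(upper));
            boolean asInt = inIntRange(Integer.parseInt(term), lower, upper);
            if (asString != asInt) {
                result.add(term);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("4", "6", "30", "45", "60", "7", "99");
        System.out.println(disagreements(terms, 30, 60)); // [4, 6]
    }
}
```

As strings, "4" and "6" fall between "30" and "60", so a string-based range returns documents whose numeric value is outside the requested range.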
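For searching we usually use QueryParser, but QueryParser's range syntax does not support numeric fields: Lucene does not record which fields are numeric, so QueryParser gives them no special treatment when parsing and builds an ordinary TermRangeQuery.
In that case we can create a subclass of QueryParser, for example: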
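import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class NumericQueryParser extends QueryParser {

    // Public so that the parser can be created from other packages.
    public NumericQueryParser(Version matchVersion, String field, Analyzer a) {
        super(matchVersion, field, a);
    }

    @Override
    protected Query getRangeQuery(String field, String part1, String part2,
            boolean inclusive) throws ParseException {
        // f3 is indexed as an int NumericField, so build a numeric range for it.
        if ("f3".equals(field)) {
            try {
                return NumericRangeQuery.newIntRange(field,
                        Integer.parseInt(part1), Integer.parseInt(part2),
                        inclusive, inclusive);
            } catch (NumberFormatException e) {
                throw new ParseException("Invalid int range for field "
                        + field + ": " + e.getMessage());
            }
        }
        // All other fields keep the default term-based range query.
        return super.getRangeQuery(field, part1, part2, inclusive);
    }
}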
Using it for a range search:
public void rangeSearch() {
    IndexReader reader;
    try {
        reader = IndexReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        // QueryParser parser = new QueryParser(Version.LUCENE_36, "f0",
        //         analyzer); // problematic
        NumericQueryParser parser = new NumericQueryParser(
                Version.LUCENE_36, "f0", analyzer);
        Query query = parser.parse("f3:[30 TO 60]");
        TopDocs docs = searcher.search(query, 20);
        ScoreDoc[] sds = docs.scoreDocs;
        for (ScoreDoc sd : sds) {
            Document doc = reader.document(sd.doc);
            System.out.println(doc.get("f1") + "\t" + doc.get("f2") + "\t"
                    + doc.get("f3") + "\t" + doc.get("f4"));
        }
    } catch (CorruptIndexException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } catch (ParseException e) {
        e.printStackTrace();
    }
}
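Notes to self:
1. Some questions are not worth overthinking at the surface level. Take the sorting above: if an int was indexed, sorting as int is certainly fine, so there is no need to also try STRING or the other numeric types. It adds little!
2. To really think these problems through, go to the root: start from the source code!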