If the business requires it, you may need to build a custom data type, which calls for some understanding of how Solr implements its field types.
Schema
Many of Solr's configuration items are defined in a configset. For example, as mentioned earlier, adding a custom processor requires declaring it in the solrconfig.xml configuration file (with the actual implementation still written yourself on the backend). Field types work the same way. Let's first look at how Solr itself defines its field types.
Loosely analogous to a table in a database, the type definitions are shown in the figure below.
As you can see, the concrete type definitions ultimately live in the solr.* classes; Solr itself ships with the basic data types. There are also many other attributes: indexed controls whether an index is built for the field, stored controls whether the value is stored, and similarly docValues controls whether a forward (column-oriented) index is used. The rest are not needed for now.
Beyond these, Solr also defines dynamic fields: field names can be matched by a wildcard pattern, with no change to the configuration file.
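Putting these pieces together, a custom type and the fields using it might be declared in managed-schema roughly like this (the class name com.example.MyTextField and its length attribute are hypothetical placeholders for your own implementation):

```xml
<!-- Custom field type backed by your own class (hypothetical names) -->
<fieldType name="myType" class="com.example.MyTextField" length="100"/>

<!-- A concrete field using the custom type -->
<field name="title" type="myType" indexed="true" stored="true"/>

<!-- A dynamic field: any field name ending in "_my" gets this type -->
<dynamicField name="*_my" type="myType" indexed="true" stored="false"/>
```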
Custom field types
Once you understand the relationships above, customization is actually quite simple: write a class that extends the appropriate field type, then configure the custom class in managed-schema. A few points to watch out for:
Attributes
First, attributes are fully customizable; you can freely add something like a length attribute. Note, however, that Solr validates the attributes itself, so it is best to remove a custom attribute from the args map once you have read it.
@Override
protected void init(IndexSchema schema, Map<String,String> args) {
  super.init(schema, args);
  // Override init and remove custom attributes from args here
  // (e.g. a hypothetical "length"), so that Solr's later validation
  // of the remaining args passes.
  String length = args.remove("length");
}
Another point: boolean-typed attributes are trickier. The FieldProperties class hard-codes the set of boolean property names, and a new name added directly will not pass validation, so you have to modify the source to add a custom property.
protected final static int INDEXED = 0x00000001;
protected final static int TOKENIZED = 0x00000002;
protected final static int STORED = 0x00000004;
protected final static int BINARY = 0x00000008;
protected final static int OMIT_NORMS = 0x00000010;
protected final static int OMIT_TF_POSITIONS = 0x00000020;
protected final static int STORE_TERMVECTORS = 0x00000040;
protected final static int STORE_TERMPOSITIONS = 0x00000080;
protected final static int STORE_TERMOFFSETS = 0x00000100;
protected final static int MULTIVALUED = 0x00000200;
protected final static int SORT_MISSING_FIRST = 0x00000400;
protected final static int SORT_MISSING_LAST = 0x00000800;
protected final static int REQUIRED = 0x00001000;
protected final static int OMIT_POSITIONS = 0x00002000;
protected final static int STORE_OFFSETS = 0x00004000;
protected final static int DOC_VALUES = 0x00008000;
protected final static int STORE_TERMPAYLOADS = 0x00010000;
protected final static int USE_DOCVALUES_AS_STORED = 0x00020000;
static final String[] propertyNames = {
"indexed", "tokenized", "stored",
"binary", "omitNorms", "omitTermFreqAndPositions",
"termVectors", "termPositions", "termOffsets",
"multiValued",
"sortMissingFirst","sortMissingLast","required", "omitPositions",
"storeOffsetsWithPositions", "docValues", "termPayloads", "useDocValuesAsStored"
};// These are the only boolean property names; each bit above represents one property.
parseProperties checks every boolean-typed attribute; if the name is not defined above, it throws an error.
static int parseProperties(Map<String,?> properties, boolean which, boolean failOnError) {
int props = 0;
for (Map.Entry<String,?> entry : properties.entrySet()) {
Object val = entry.getValue();
if(val == null) continue;
boolean boolVal = val instanceof Boolean ? (Boolean)val : Boolean.parseBoolean(val.toString());
if (boolVal == which) { // The crudest workaround is to hard-code a skip for your property right here.
props |= propertyNameToInt(entry.getKey(), failOnError);
}
}
return props;
}
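The name-to-bit mapping above can be illustrated in isolation. The following is a self-contained sketch (not Solr's actual code) that mimics how propertyNameToInt derives each bit from the name's position in the array, and why an unknown custom name such as "myFlag" is rejected when failOnError is true:

```java
public class PropertyFlags {
    // Abbreviated name list; in Solr's FieldProperties the bit for a
    // property is 1 shifted left by its index in propertyNames.
    static final String[] PROPERTY_NAMES = {
        "indexed", "tokenized", "stored", "binary"
    };

    static int propertyNameToInt(String name, boolean failOnError) {
        for (int i = 0; i < PROPERTY_NAMES.length; i++) {
            if (PROPERTY_NAMES[i].equals(name)) {
                return 1 << i;
            }
        }
        if (failOnError) {
            throw new IllegalArgumentException("Invalid field property: " + name);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(propertyNameToInt("stored", true));  // 4 == 0x04, matching STORED
        System.out.println(propertyNameToInt("myFlag", false)); // 0: unknown name, silently ignored
        try {
            propertyNameToInt("myFlag", true);                  // unknown name + failOnError
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why a custom boolean attribute must be added to both the constant list and propertyNames before parseProperties will accept it.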
Sort fields
If the custom type is to support sorting, Solr performs a check first, so it is sometimes necessary to override the getSortField method and drop the validation step.
public void checkSortability() throws SolrException {
// If neither of the following two conditions holds, sorry, an error is thrown.
if (! (indexed() || hasDocValues()) ) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"can not sort on a field which is neither indexed nor has doc values: "
+ getName());
}
if ( multiValued() ) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"can not sort on multivalued field: "
+ getName());
}
}
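To bypass these checks, getSortField can be overridden in the custom type so that checkSortability is never invoked. A minimal sketch, assuming the field can be sorted as a plain string (the appropriate SortField.Type depends on your implementation):

```java
@Override
public SortField getSortField(SchemaField field, boolean reverse) {
    // Intentionally skip field.checkSortability(): this assumes our
    // custom type can supply sort values even for a field that is
    // neither indexed nor docValues-backed.
    return new SortField(field.getName(), SortField.Type.STRING, reverse);
}
```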
Suppressing writes into Lucene
Since we are already defining a custom type, the field's storage may well be managed by ourselves, in which case there is no need to store another copy in Lucene and waste space. OK, there is still a way: DocumentBuilder converts a Solr document into a Lucene Document, and it is the only place where that conversion happens.
public static Document toDocument( SolrInputDocument doc, IndexSchema schema )
{
Document out = new Document();
final float docBoost = doc.getDocumentBoost();
Set<String> usedFields = Sets.newHashSet();
// Load fields from SolrDocument to Document
for( SolrInputField field : doc ) { // Once you have the Field here, you can handle it however you like
String name = field.getName();
SchemaField sfield = schema.getFieldOrNull(name);
boolean used = false;
// Make sure it has the correct number
if( sfield!=null && !sfield.multiValued() && field.getValueCount() > 1 ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"multiple values encountered for non multiValued field " +
sfield.getName() + ": " +field.getValue() );
}
float fieldBoost = field.getBoost();
boolean applyBoost = sfield != null && sfield.indexed() && !sfield.omitNorms();
if (applyBoost == false && fieldBoost != 1.0F) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"cannot set an index-time boost, unindexed or norms are omitted for field " +
sfield.getName() + ": " +field.getValue() );
}
// Lucene no longer has a native docBoost, so we have to multiply
// it ourselves
float compoundBoost = fieldBoost * docBoost;
List<CopyField> copyFields = schema.getCopyFieldsList(name);
if( copyFields.size() == 0 ) copyFields = null;
// load each field value
boolean hasField = false;
try {
for( Object v : field ) {
if( v == null ) {
continue;
}
hasField = true;
if (sfield != null) {
used = true;
addField(out, sfield, v, applyBoost ? compoundBoost : 1f);
// record the field as having a value
usedFields.add(sfield.getName());
}
// Check if we should copy this field value to any other fields.
// This could happen whether it is explicit or not.
if( copyFields != null ){
for (CopyField cf : copyFields) {
SchemaField destinationField = cf.getDestination();
final boolean destHasValues = usedFields.contains(destinationField.getName());
// check if the copy field is a multivalued or not
if (!destinationField.multiValued() && destHasValues) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"multiple values encountered for non multiValued copy field " +
destinationField.getName() + ": " + v);
}
used = true;
// Perhaps trim the length of a copy field
Object val = v;
if( val instanceof String && cf.getMaxChars() > 0 ) {
val = cf.getLimitedValue((String)val);
}
// we can't copy any boost unless the dest field is
// indexed & !omitNorms, but which boost we copy depends
// on whether the dest field already contains values (we
// don't want to apply the compounded docBoost more then once)
final float destBoost =
(destinationField.indexed() && !destinationField.omitNorms()) ?
(destHasValues ? fieldBoost : compoundBoost) : 1.0F;
addField(out, destinationField, val, destBoost);
// record the field as having a value
usedFields.add(destinationField.getName());
}
}
// The final boost for a given field named is the product of the
// *all* boosts on values of that field.
// For multi-valued fields, we only want to set the boost on the
// first field.
fieldBoost = compoundBoost = 1.0f;
}
}
catch( SolrException ex ) {
throw ex;
}
catch( Exception ex ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"Error adding field '" +
field.getName() + "'='" +field.getValue()+"' msg=" + ex.getMessage(), ex );
}
// make sure the field was used somehow...
if( !used && hasField ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"unknown field '" +name + "'");
}
  }
  // ... (rest of the method omitted: default-value handling for missing fields)
  return out;
}
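Given that, one blunt but effective approach is to patch this loop so that fields whose type manages its own storage are skipped before anything is added to the Lucene document. A hypothetical sketch (MyExternalFieldType is a placeholder for your custom type):

```java
// Inside DocumentBuilder.toDocument's field loop (hypothetical patch):
SchemaField sfield = schema.getFieldOrNull(name);
if (sfield != null && sfield.getType() instanceof MyExternalFieldType) {
    // This type's storage lives outside Lucene; write nothing for it.
    continue;
}
```

A less invasive alternative is to make the type's createField return null so that no IndexableField is produced at all.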
Defining how values are stored
Solr provides two storage modes: single-valued and multi-valued. Here we only look at single-valued storage, for which you need to override the createField method.
Also note boost: from what I have seen, it acts roughly as a weight applied when computing the score.
public IndexableField createField(SchemaField field, Object value, float boost) {
if (!field.indexed() && !field.stored()) {
if (log.isTraceEnabled())
log.trace("Ignoring unindexed/unstored field: " + field); // There is another check here; remove it if you don't need it.
return null;
}
String val;
try {
val = toInternal(value.toString());
} catch (RuntimeException e) {
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, "Error while creating field '" + field + "' from value '" + value + "'", e);
}
if (val==null) return null;
org.apache.lucene.document.FieldType newType = new org.apache.lucene.document.FieldType();
newType.setTokenized(field.isTokenized());
newType.setStored(field.stored());
newType.setOmitNorms(field.omitNorms());
newType.setIndexOptions(field.indexed() ? getIndexOptions(field, val) : IndexOptions.NONE);
newType.setStoreTermVectors(field.storeTermVector());
newType.setStoreTermVectorOffsets(field.storeTermOffsets());
newType.setStoreTermVectorPositions(field.storeTermPositions());
newType.setStoreTermVectorPayloads(field.storeTermPayloads());
return createField(field.getName(), val, newType, boost);
}
One more note: when importing data via a URL request or via javabin, the code paths I traced never called createField.