If the business requires it, you may need to build a custom data type, which calls for some understanding of how Solr implements its field types.
Schema
Many of Solr's configuration items are defined in a configset. For example, as mentioned earlier, adding a custom processor requires declaring it in the solrconfig.xml configuration file (with the actual implementation still written yourself on the backend). Field types work the same way. Let's first look at how Solr itself defines its field types.
Loosely analogous to a table in a database, the type definitions are shown in the figure below.
As you can see, the concrete type definitions ultimately live in the solr.* classes; Solr itself ships with the basic data types. There are also many other attributes: indexed controls whether an index is built for the field, stored controls whether the value is stored, and similarly docValues controls whether a forward (column-oriented) index is used. The rest are not needed for now.
Beyond these, Solr also defines dynamic fields: field names can be matched by a wildcard pattern, with no change to the configuration file.
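Putting these pieces together, a custom type and the fields using it might be declared in managed-schema roughly like this (the class name com.example.MyTextField and its length attribute are hypothetical placeholders for your own implementation):

```xml
<!-- Custom field type backed by your own class (hypothetical names) -->
<fieldType name="myType" class="com.example.MyTextField" length="100"/>

<!-- A concrete field using the custom type -->
<field name="title" type="myType" indexed="true" stored="true"/>

<!-- A dynamic field: any field name ending in "_my" gets this type -->
<dynamicField name="*_my" type="myType" indexed="true" stored="false"/>
```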
Custom field types
Once you understand the relationships above, customization is actually quite simple: write a class that extends the appropriate field type, then configure the custom class in managed-schema. A few points to watch out for:
Attributes
First, attributes are fully customizable; you can freely add something like a length attribute. Note, however, that Solr validates the attributes itself, so it is best to remove a custom attribute from the args map once you have read it.
@Override
protected void init(IndexSchema schema, Map<String,String> args) {
  super.init(schema, args);
  // Override init and remove custom attributes from args here
  // (e.g. a hypothetical "length"), so that Solr's later validation
  // of the remaining args passes.
  String length = args.remove("length");
}
Another point: boolean-typed attributes are trickier. The FieldProperties class hard-codes the set of boolean property names, and a new name added directly will not pass validation, so you have to modify the source to add a custom property.
protected final static int INDEXED = 0x00000001;
protected final static int TOKENIZED = 0x00000002;
protected final static int STORED = 0x00000004;
protected final static int BINARY = 0x00000008;
protected final static int OMIT_NORMS = 0x00000010;
protected final static int OMIT_TF_POSITIONS = 0x00000020;
protected final static int STORE_TERMVECTORS = 0x00000040;
protected final static int STORE_TERMPOSITIONS = 0x00000080;
protected final static int STORE_TERMOFFSETS = 0x00000100;
protected final static int MULTIVALUED = 0x00000200;
protected final static int SORT_MISSING_FIRST = 0x00000400;
protected final static int SORT_MISSING_LAST = 0x00000800;
protected final static int REQUIRED = 0x00001000;
protected final static int OMIT_POSITIONS = 0x00002000;
protected final static int STORE_OFFSETS = 0x00004000;
protected final static int DOC_VALUES = 0x00008000;
protected final static int STORE_TERMPAYLOADS = 0x00010000;
protected final static int USE_DOCVALUES_AS_STORED = 0x00020000;
static final String[] propertyNames = {
"indexed", "tokenized", "stored",
"binary", "omitNorms", "omitTermFreqAndPositions",
"termVectors", "termPositions", "termOffsets",
"multiValued",
"sortMissingFirst","sortMissingLast","required", "omitPositions",
"storeOffsetsWithPositions", "docValues", "termPayloads", "useDocValuesAsStored"
};// These are the only boolean property names; each bit above represents one property.
parseProperties checks every boolean-typed attribute; if the name is not defined above, it throws an error.
static int parseProperties(Map<String,?> properties, boolean which, boolean failOnError) {
int props = 0;
for (Map.Entry<String,?> entry : properties.entrySet()) {
Object val = entry.getValue();
if(val == null) continue;
boolean boolVal = val instanceof Boolean ? (Boolean)val : Boolean.parseBoolean(val.toString());
if (boolVal == which) { // The crudest workaround is to hard-code a skip for your property right here.
props |= propertyNameToInt(entry.getKey(), failOnError);
}
}
return props;
}
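The name-to-bit mapping above can be illustrated in isolation. The following is a self-contained sketch (not Solr's actual code) that mimics how propertyNameToInt derives each bit from the name's position in the array, and why an unknown custom name such as "myFlag" is rejected when failOnError is true:

```java
public class PropertyFlags {
    // Abbreviated name list; in Solr's FieldProperties the bit for a
    // property is 1 shifted left by its index in propertyNames.
    static final String[] PROPERTY_NAMES = {
        "indexed", "tokenized", "stored", "binary"
    };

    static int propertyNameToInt(String name, boolean failOnError) {
        for (int i = 0; i < PROPERTY_NAMES.length; i++) {
            if (PROPERTY_NAMES[i].equals(name)) {
                return 1 << i;
            }
        }
        if (failOnError) {
            throw new IllegalArgumentException("Invalid field property: " + name);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(propertyNameToInt("stored", true));  // 4 == 0x04, matching STORED
        System.out.println(propertyNameToInt("myFlag", false)); // 0: unknown name, silently ignored
        try {
            propertyNameToInt("myFlag", true);                  // unknown name + failOnError
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

This is why a custom boolean attribute must be added to both the constant list and propertyNames before parseProperties will accept it.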
Sort fields
If the custom type is to support sorting, Solr performs a check first, so it is sometimes necessary to override the getSortField method and drop the validation step.
public void checkSortability() throws SolrException {
// If neither of the following two conditions holds, sorry, an error is thrown.
if (! (indexed() || hasDocValues()) ) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"can not sort on a field which is neither indexed nor has doc values: "
+ getName());
}
if ( multiValued() ) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"can not sort on multivalued field: "
+ getName());
}
}
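To bypass these checks, getSortField can be overridden in the custom type so that checkSortability is never invoked. A minimal sketch, assuming the field can be sorted as a plain string (the appropriate SortField.Type depends on your implementation):

```java
@Override
public SortField getSortField(SchemaField field, boolean reverse) {
    // Intentionally skip field.checkSortability(): this assumes our
    // custom type can supply sort values even for a field that is
    // neither indexed nor docValues-backed.
    return new SortField(field.getName(), SortField.Type.STRING, reverse);
}
```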
Suppressing writes into Lucene
Since we are already defining a custom type, the field's storage may well be managed by ourselves, in which case there is no need to store another copy in Lucene and waste space. OK, there is still a way: DocumentBuilder converts a Solr document into a Lucene Document, and it is the only place where that conversion happens.
public static Document toDocument( SolrInputDocument doc, IndexSchema schema )
{
Document out = new Document();
final float docBoost = doc.getDocumentBoost();
Set<String> usedFields = Sets.newHashSet();
// Load fields from SolrDocument to Document
for( SolrInputField field : doc ) { // Once you have the Field here, you can handle it however you like
String name = field.getName();
SchemaField sfield = schema.getFieldOrNull(name);
boolean used = false;
// Make sure it has the correct number
if( sfield!=null && !sfield.multiValued() && field.getValueCount() > 1 ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"multiple values encountered for non multiValued field " +
sfield.getName() + ": " +field.getValue() );
}
float fieldBoost = field.getBoost();
boolean applyBoost = sfield != null && sfield.indexed() && !sfield.omitNorms();
if (applyBoost == false && fieldBoost != 1.0F) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"cannot set an index-time boost, unindexed or norms are omitted for field " +
sfield.getName() + ": " +field.getValue() );
}
// Lucene no longer has a native docBoost, so we have to multiply
// it ourselves
float compoundBoost = fieldBoost * docBoost;
List<CopyField> copyFields = schema.getCopyFieldsList(name);
if( copyFields.size() == 0 ) copyFields = null;
// load each field value
boolean hasField = false;
try {
for( Object v : field ) {
if( v == null ) {
continue;
}
hasField = true;
if (sfield != null) {
used = true;
addField(out, sfield, v, applyBoost ? compoundBoost : 1f);
// record the field as having a value
usedFields.add(sfield.getName());
}
// Check if we should copy this field value to any other fields.
// This could happen whether it is explicit or not.
if( copyFields != null ){
for (CopyField cf : copyFields) {
SchemaField destinationField = cf.getDestination();
final boolean destHasValues = usedFields.contains(destinationField.getName());
// check if the copy field is a multivalued or not
if (!destinationField.multiValued() && destHasValues) {
throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"multiple values encountered for non multiValued copy field " +
destinationField.getName() + ": " + v);
}
used = true;
// Perhaps trim the length of a copy field
Object val = v;
if( val instanceof String && cf.getMaxChars() > 0 ) {
val = cf.getLimitedValue((String)val);
}
// we can't copy any boost unless the dest field is
// indexed & !omitNorms, but which boost we copy depends
// on whether the dest field already contains values (we
// don't want to apply the compounded docBoost more then once)
final float destBoost =
(destinationField.indexed() && !destinationField.omitNorms()) ?
(destHasValues ? fieldBoost : compoundBoost) : 1.0F;
addField(out, destinationField, val, destBoost);
// record the field as having a value
usedFields.add(destinationField.getName());
}
}
// The final boost for a given field named is the product of the
// *all* boosts on values of that field.
// For multi-valued fields, we only want to set the boost on the
// first field.
fieldBoost = compoundBoost = 1.0f;
}
}
catch( SolrException ex ) {
throw ex;
}
catch( Exception ex ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"Error adding field '" +
field.getName() + "'='" +field.getValue()+"' msg=" + ex.getMessage(), ex );
}
// make sure the field was used somehow...
if( !used && hasField ) {
throw new SolrException( SolrException.ErrorCode.BAD_REQUEST,
"ERROR: "+getID(doc, schema)+"unknown field '" +name + "'");
}
  }
  // ... (rest of the method omitted: default-value handling for missing fields)
  return out;
}
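Given that, one blunt but effective approach is to patch this loop so that fields whose type manages its own storage are skipped before anything is added to the Lucene document. A hypothetical sketch (MyExternalFieldType is a placeholder for your custom type):

```java
// Inside DocumentBuilder.toDocument's field loop (hypothetical patch):
SchemaField sfield = schema.getFieldOrNull(name);
if (sfield != null && sfield.getType() instanceof MyExternalFieldType) {
    // This type's storage lives outside Lucene; write nothing for it.
    continue;
}
```

A less invasive alternative is to make the type's createField return null so that no IndexableField is produced at all.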
Defining how values are stored
Solr provides two storage modes: single-valued and multi-valued. Here we only look at single-valued storage, for which you need to override the createField method.
Also note boost: from what I have seen, it acts roughly as a weight applied when computing the score.
public IndexableField createField(SchemaField field, Object value, float boost) {
if (!field.indexed() && !field.stored()) {
if (log.isTraceEnabled())
log.trace("Ignoring unindexed/unstored field: " + field); // There is another check here; remove it if you don't need it.
return null;
}
String val;
try {
val = toInternal(value.toString());
} catch (RuntimeException e) {
throw new SolrException( SolrException.ErrorCode.SERVER_ERROR, "Error while creating field '" + field + "' from value '" + value + "'", e);
}
if (val==null) return null;
org.apache.lucene.document.FieldType newType = new org.apache.lucene.document.FieldType();
newType.setTokenized(field.isTokenized());
newType.setStored(field.stored());
newType.setOmitNorms(field.omitNorms());
newType.setIndexOptions(field.indexed() ? getIndexOptions(field, val) : IndexOptions.NONE);
newType.setStoreTermVectors(field.storeTermVector());
newType.setStoreTermVectorOffsets(field.storeTermOffsets());
newType.setStoreTermVectorPositions(field.storeTermPositions());
newType.setStoreTermVectorPayloads(field.storeTermPayloads());
return createField(field.getName(), val, newType, boost);
}
One more note: when importing data via a URL request or via javabin, the code paths I traced never called createField.