Clearing up why Field objects no longer take per-field indexing options (Field.Index) as of Lucene 4.x

Before Lucene 4.x, a Field was added to a Document like this:

Field field = new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED);

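From 4.0 onward, however, the Field.Index parameter is deprecated, and it should not be relied on in later releases such as 5.0. The recommendation is to use the Field subclass that matches the kind of field you want, for example:

Field pathField = new StringField("path", filetoIndex.getPath(), Field.Store.YES);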

Let's look at what the source code says:

/** A field that is indexed but not tokenized: the entire
 *  String value is indexed as a single token.  For example
 *  this might be used for a 'country' field or an 'id'
 *  field, or any field that you intend to use for sorting
 *  or access through the field cache. */

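public final class StringField extends Field

In other words, a StringField is not run through the analyzer: its entire value is indexed as a single token. A search on such a field therefore only returns a hit when the query matches the whole value.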

/** A field that is indexed and tokenized, without term
 *  vectors.  For example this would be used on a 'body'
 *  field, that contains the bulk of a document's text. */

public final class TextField extends Field 

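That is the official description of TextField. If you want a passage of text to be tokenized when it is indexed, this is the class to use. Note that the original value is only stored if you ask for it, which this constructor lets you do: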

/** Creates a new TextField with String value.
 * @param name field name
 * @param value string value
 * @param store Store.YES if the content should also be stored
 * @throws IllegalArgumentException if the field name or value is null.
 */
public TextField(String name, String value, Store store) {
  super(name, value, store == Store.YES ? TYPE_STORED : TYPE_NOT_STORED);
}
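
To make the difference between "indexed as a single token" and "indexed and tokenized" concrete, here is a toy model in plain Java. It is not Lucene code: the class ToyFieldIndex, its methods, and the naive lowercase whitespace tokenizer are all invented for illustration and only stand in for a real Analyzer and a real index.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy model only (hypothetical, not part of Lucene). */
final class ToyFieldIndex {
    // term -> ids of the documents that contain that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Models an un-analyzed field: the whole value becomes one term. */
    void addUnanalyzed(int docId, String value) {
        postings.computeIfAbsent(value, k -> new HashSet<>()).add(docId);
    }

    /** Models an analyzed field: the value is lowercased and split on whitespace. */
    void addAnalyzed(int docId, String value) {
        for (String token : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }
    }

    /** Exact term lookup: true if any document contains exactly this term. */
    boolean matches(String term) {
        return postings.containsKey(term);
    }

    public static void main(String[] args) {
        ToyFieldIndex path = new ToyFieldIndex();
        path.addUnanalyzed(1, "docs/User Guide.txt");

        ToyFieldIndex body = new ToyFieldIndex();
        body.addAnalyzed(1, "User Guide for Lucene");

        System.out.println(path.matches("docs/User Guide.txt")); // true: whole value matches
        System.out.println(path.matches("Guide"));               // false: partial value is not a term
        System.out.println(body.matches("guide"));               // true: individual token is a term
    }
}
```

The un-analyzed variant behaves like an identifier lookup, while the analyzed variant lets any single word of the text produce a hit, which is the trade-off the two Javadoc comments describe.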

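Although TextField exists for handling large bodies of text, rich-document formats such as Word or PDF yield nothing but garbage if they are indexed directly. They have to be parsed first, for example with Tika, so that plain text can be extracted.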


The discussion of this API change on Stack Overflow goes as follows:


Question:
Until Lucene version 3.9, we could specify whether or not to index a field by using Field.Index.NO or Field.Index.ANALYZED etc. But in Lucene 4.0 there is no constructor available in which we may define this. How do we control indexing in this version? I mean, if I want a field "name" to be stored in the index but don't want to index it, how can I do that in Lucene 4.0?

Answer:
Constructors taking Field.Index arguments are available, but are deprecated in 4.0, and should not be used. Instead, you should look to subclasses of Field to control how a field is indexed.

StringField is the standard un-analyzed indexed field. The field is indexed as a single token. It is appropriate for things like identifiers, for which you only need to search for exact matches.

TextField is the standard analyzed (and, of course, indexed) field, for textual content. It is an appropriate choice for full-text searching.

StoredField is a stored field that is not indexed at all (and so, is not searchable).

Except StoredField, each of these can be passed a Field.Store value as a constructor argument, similar to Lucene 3.6.

For more information on this change, take a look at the Lucene Migration Guide, particularly the sections titled: "Separate IndexableFieldType from Field instances"
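
The three field types named in the answer can be put side by side in one document. The following is a minimal sketch, assuming a Lucene 4.x or later core jar on the classpath; the class name FieldTypesSketch and the field names "id", "body" and "note" are made up for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexableField;

/** Illustrative sketch; class and field names are hypothetical. */
final class FieldTypesSketch {
    static Document build(String id, String body, String note) {
        Document doc = new Document();
        // Indexed as a single token and stored: exact-match lookups only.
        doc.add(new StringField("id", id, Field.Store.YES));
        // Analyzed and indexed for full-text search, original text not stored.
        doc.add(new TextField("body", body, Field.Store.NO));
        // Stored only: retrievable with a hit, but not searchable.
        doc.add(new StoredField("note", note));
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build("doc-42", "full text goes here", "display only");
        for (IndexableField f : doc.getFields()) {
            // Report whether each field keeps its original value.
            System.out.println(f.name() + " stored=" + f.fieldType().stored());
        }
    }
}
```

For the case raised in the question, a "name" value that should be stored but not indexed, the StoredField line is the relevant one.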


That should be enough to clear up the confusion around this API change.














