LUCENE 3.0 Self-Study, Part 3: Thoughts Prompted by the Lucene Demo


sustbeckham


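The org.apache.lucene.demo.IndexFiles class indexes files recursively. Once an IndexWriter has been constructed, Documents can be added to it, and this is where the index is actually built. Because any directory may contain further directories, the demo walks the tree depth-first, recursing until it reaches the files at the leaf nodes (ordinary files such as my.txt). The method that does this is: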

    static void indexDocs(IndexWriter writer, File file) throws IOException {
      // Only process entries that can be read
      if (file.canRead()) {
        if (file.isDirectory()) {
          // file is a directory; it may contain files, subdirectories, or nothing at all
          String[] files = file.list(); // names of every entry (files and subdirectories) under file
          // list() returns null if the directory could not be listed
          if (files != null) {
            for (int i = 0; i < files.length; i++) {
              // Recurse into each entry (a depth-first traversal)
              indexDocs(writer, new File(file, files[i]));
            }
          }
        } else {
          // Leaf node: an ordinary file rather than a directory, so index it
          System.out.println("adding " + file);
          try {
            writer.addDocument(FileDocument.Document(file));
          } catch (FileNotFoundException fnfe) {
            // The file could not be opened; skip it
            ;
          }
        }
      }
    }
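To see the traversal on its own, without Lucene on the classpath, here is a minimal sketch that follows the same canRead / isDirectory / list() steps but only collects the leaf files. The class name LeafFileWalker and the method collectLeafFiles are my own names for illustration, not part of the demo.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class LeafFileWalker {

    /** Depth-first walk: appends every readable ordinary file under 'file' to 'out'. */
    static void collectLeafFiles(File file, List<File> out) {
        if (!file.canRead()) {
            return; // unreadable entries are skipped, as in the demo
        }
        if (file.isDirectory()) {
            String[] names = file.list(); // null if the directory cannot be listed
            if (names != null) {
                for (String name : names) {
                    collectLeafFiles(new File(file, name), out);
                }
            }
        } else {
            out.add(file); // leaf node: the point where the demo calls addDocument
        }
    }

    public static void main(String[] args) {
        File root = new File(args.length > 0 ? args[0] : ".");
        List<File> leaves = new ArrayList<>();
        collectLeafFiles(root, leaves);
        for (File leaf : leaves) {
            System.out.println("would index " + leaf);
        }
    }
}
```

Each recursive call finishes a whole subdirectory before the loop moves on to the next entry, which is what makes the walk depth-first.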

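This line in indexDocs:

    writer.addDocument(FileDocument.Document(file));

actually does a great deal of work. Each time the recursion reaches a leaf node, it holds an ordinary file rather than a directory, for example myWorld.txt, and that file then goes through several steps.

First, the File object f built for myWorld.txt is used to read the file's details, such as its path and last-modified time. Several Field objects are constructed from those details and aggregated into a Document, and the Document is finally added to the IndexWriter. The IndexWriter can then tokenize and filter the contents of the Fields in the Documents it receives, so that they can be searched.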

The source of the org.apache.lucene.demo.FileDocument class is as follows:

    package org.apache.lucene.demo;

    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class FileDocument {
      public static Document Document(File f) throws java.io.FileNotFoundException {
        // Create an empty Document
        Document doc = new Document();

        // Build several Fields from the File f and add them all to the Document.

        // "path" is the Field's name, by which the Field can be found later.
        // Field.Store.YES stores the value; Field.Index.NOT_ANALYZED indexes the
        // value without tokenizing it, so that it can still be searched.
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

        // A Field holding the last-modified time
        doc.add(new Field("modified",
            DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
            Field.Store.YES, Field.Index.NOT_ANALYZED));

        // A Field whose value is read from a stream; the stream built from f
        // must still be open when the Document is indexed
        doc.add(new Field("contents", new FileReader(f)));

        return doc;
      }

      private FileDocument() {}
    }
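The "modified" Field holds a string, not a number. As far as I know, DateTools.timeToString at MINUTE resolution produces a yyyyMMddHHmm string in GMT, so the strings sort in time order. The sketch below reproduces that layout with java.time only. Treat the layout as my assumption about DateTools; the class name ModifiedStampSketch and the method toMinuteString are made up for this illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ModifiedStampSketch {

    // Assumed layout for minute resolution: yyyyMMddHHmm, rendered in UTC/GMT
    private static final DateTimeFormatter MINUTE =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);

    /** Formats an epoch-millisecond timestamp, dropping seconds and milliseconds. */
    static String toMinuteString(long epochMillis) {
        return MINUTE.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(toMinuteString(now));
        // A later time yields a string that compares greater, so plain
        // string comparison follows chronological order
        System.out.println(toMinuteString(now).compareTo(toMinuteString(now + 3_600_000L)) < 0);
    }
}
```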

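FileDocument shows just how central Field is, so it is worth understanding thoroughly.

The Field class defines two very useful enums, Store and Index, which set how a Field is stored and indexed.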

    /** Specifies whether and how a field should be stored. */
    public static enum Store {

      /** Store the original field value in the index. This is useful for short texts
       * like a document's title which should be displayed with the results. The
       * value is stored in its original form, i.e. no analyzer is used before it is
       * stored.
       */
      YES {
        @Override
        public boolean isStored() { return true; }
      },

      /** Do not store the field value in the index. */
      NO {
        @Override
        public boolean isStored() { return false; }
      };

      public abstract boolean isStored();
    }

    /** Specifies whether and how a field should be indexed. */
    public static enum Index {

      /** Do not index the field value. This field can thus not be searched,
       * but one can still access its contents provided it is
       * {@link Field.Store stored}. */
      NO {
        @Override
        public boolean isIndexed()  { return false; }
        @Override
        public boolean isAnalyzed() { return false; }
        @Override
        public boolean omitNorms()  { return true;  }
      },

      /** Index the tokens produced by running the field's
       * value through an Analyzer.  This is useful for
       * common text. */
      ANALYZED {
        @Override
        public boolean isIndexed()  { return true;  }
        @Override
        public boolean isAnalyzed() { return true;  }
        @Override
        public boolean omitNorms()  { return false; }
      },

      /** Index the field's value without using an Analyzer, so it can be searched.
       * As no analyzer is used the value will be stored as a single term. This is
       * useful for unique Ids like product numbers.
       */
      NOT_ANALYZED {
        @Override
        public boolean isIndexed()  { return true;  }
        @Override
        public boolean isAnalyzed() { return false; }
        @Override
        public boolean omitNorms()  { return false; }
      },

      /** Expert: Index the field's value without an Analyzer,
       * and also disable the storing of norms.  Note that you
       * can also separately enable/disable norms by calling
       * {@link Field#setOmitNorms}.  No norms means that
       * index-time field and document boosting and field
       * length normalization are disabled.  The benefit is
       * less memory usage as norms take up one byte of RAM
       * per indexed field for every document in the index,
       * during searching.  Note that once you index a given
       * field <i>with</i> norms enabled, disabling norms will
       * have no effect.  In other words, for this to have the
       * above described effect on a field, all instances of
       * that field must be indexed with NOT_ANALYZED_NO_NORMS
       * from the beginning. */
      NOT_ANALYZED_NO_NORMS {
        @Override
        public boolean isIndexed()  { return true;  }
        @Override
        public boolean isAnalyzed() { return false; }
        @Override
        public boolean omitNorms()  { return true;  }
      },

      /** Expert: Index the tokens produced by running the
       *  field's value through an Analyzer, and also
       *  separately disable the storing of norms.  See
       *  {@link #NOT_ANALYZED_NO_NORMS} for what norms are
       *  and why you may want to disable them. */
      ANALYZED_NO_NORMS {
        @Override
        public boolean isIndexed()  { return true;  }
        @Override
        public boolean isAnalyzed() { return true;  }
        @Override
        public boolean omitNorms()  { return true;  }
      };

      // Declarations that the @Override methods above implement
      public abstract boolean isIndexed();
      public abstract boolean isAnalyzed();
      public abstract boolean omitNorms();
    }
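The five Index constants are really just the valid combinations of three flags: indexed, analyzed, and omitNorms. To make that easier to see, here is a small sketch with a stand-in enum of my own, IndexMode, that carries the same flag values as the listing, plus a helper choose that picks a constant from the flags. Both names are hypothetical and are not Lucene API.

```java
public class IndexModeSketch {

    /** Hypothetical mirror of the five Field.Index constants, reduced to their flags. */
    enum IndexMode {
        NO(false, false, true),
        ANALYZED(true, true, false),
        NOT_ANALYZED(true, false, false),
        NOT_ANALYZED_NO_NORMS(true, false, true),
        ANALYZED_NO_NORMS(true, true, true);

        final boolean indexed;
        final boolean analyzed;
        final boolean omitNorms;

        IndexMode(boolean indexed, boolean analyzed, boolean omitNorms) {
            this.indexed = indexed;
            this.analyzed = analyzed;
            this.omitNorms = omitNorms;
        }
    }

    /** Picks the constant whose flags match; a field that is not indexed is always NO. */
    static IndexMode choose(boolean indexed, boolean analyzed, boolean omitNorms) {
        if (!indexed) {
            return IndexMode.NO;
        }
        for (IndexMode mode : IndexMode.values()) {
            if (mode.indexed && mode.analyzed == analyzed && mode.omitNorms == omitNorms) {
                return mode;
            }
        }
        throw new AssertionError("all four indexed combinations are covered above");
    }

    public static void main(String[] args) {
        // Print the flag table for all five constants
        for (IndexMode mode : IndexMode.values()) {
            System.out.println(mode + ": indexed=" + mode.indexed
                    + " analyzed=" + mode.analyzed + " omitNorms=" + mode.omitNorms);
        }
    }
}
```

For a path or an ID that should match only as a whole, the flags (indexed, not analyzed, norms kept) lead to NOT_ANALYZED, which is the choice FileDocument makes for its "path" Field.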

The Field class also declares a third nested enum, TermVector:

    public static enum TermVector {

      /** Do not store term vectors. */
      NO {
        @Override
        public boolean isStored()      { return false; }
        @Override
        public boolean withPositions() { return false; }
        @Override
        public boolean withOffsets()   { return false; }
      },

      /** Store the term vectors of each document. A term vector is a list
       * of the document's terms and their number of occurrences in that document. */
      YES {
        @Override
        public boolean isStored()      { return true;  }
        @Override
        public boolean withPositions() { return false; }
        @Override
        public boolean withOffsets()   { return false; }
      },

      /**
       * Store the term vector + token position information
       *
       * @see #YES
       */
      WITH_POSITIONS {
        @Override
        public boolean isStored()      { return true;  }
        @Override
        public boolean withPositions() { return true;  }
        @Override
        public boolean withOffsets()   { return false; }
      },

      /**
       * Store the term vector + Token offset information
       *
       * @see #YES
       */
      WITH_OFFSETS {
        @Override
        public boolean isStored()      { return true;  }
        @Override
        public boolean withPositions() { return false; }
        @Override
        public boolean withOffsets()   { return true;  }
      },

      /**
       * Store the term vector + Token position and offset information
       *
       * @see #YES
       * @see #WITH_POSITIONS
       * @see #WITH_OFFSETS
       */
      WITH_POSITIONS_OFFSETS {
        @Override
        public boolean isStored()      { return true;  }
        @Override
        public boolean withPositions() { return true;  }
        @Override
        public boolean withOffsets()   { return true;  }
      };

      // Declarations that the @Override methods above implement
      public abstract boolean isStored();
      public abstract boolean withPositions();
      public abstract boolean withOffsets();
    }

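This enum is about terms: it controls whether a term vector is stored for a field, and whether token positions and offsets are stored along with it.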
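To make "term vector", "position", and "offset" concrete, here is a rough sketch that builds the three kinds of information for one piece of text. It splits on whitespace and lowercases, which is my simplification; in Lucene the tokens come from an Analyzer. The names TermVectorSketch, TermInfo, and build are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class TermVectorSketch {

    /** What gets recorded for one term of one document. */
    static final class TermInfo {
        int frequency;                                      // what YES stores
        final List<Integer> positions = new ArrayList<>();  // WITH_POSITIONS: token index
        final List<int[]> offsets = new ArrayList<>();      // WITH_OFFSETS: {start, end} chars
    }

    /** Builds term -> info for 'text' using a naive whitespace tokenizer. */
    static Map<String, TermInfo> build(String text) {
        Map<String, TermInfo> vector = new TreeMap<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            TermInfo info = vector.computeIfAbsent(term, k -> new TermInfo());
            info.frequency++;
            info.positions.add(position++);
            info.offsets.add(new int[] {start, i}); // end offset is exclusive
        }
        return vector;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, TermInfo> e : build("hello world hello lucene").entrySet()) {
            TermInfo info = e.getValue();
            StringBuilder offsets = new StringBuilder();
            for (int[] o : info.offsets) {
                offsets.append('[').append(o[0]).append(',').append(o[1]).append(')');
            }
            System.out.println(e.getKey() + " freq=" + info.frequency
                    + " positions=" + info.positions + " offsets=" + offsets);
        }
    }
}
```

Positions count tokens while offsets count characters, which is why the two are stored separately and why WITH_POSITIONS_OFFSETS exists for the case where both are needed.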

In Lucene versions before 3.0, Store, Index, and TermVector were written as static nested classes. From 3.0 on they are enum types.
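The difference between the two styles can be sketched in plain Java. LegacyStore below follows the general pre-Java-5 "type-safe enum" idiom (a final class with a private constructor and a fixed set of static instances), and ModernStore is the enum equivalent. Both are stand-ins I wrote for illustration; they are not copied from any Lucene release.

```java
public class EnumMigrationSketch {

    /** Old style: a final class whose only instances are its static constants. */
    static final class LegacyStore {
        static final LegacyStore YES = new LegacyStore("YES");
        static final LegacyStore NO = new LegacyStore("NO");

        private final String name;

        private LegacyStore(String name) {
            this.name = name;
        }

        boolean isStored() {
            return this == YES;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    /** New style: an enum with a method body per constant. */
    enum ModernStore {
        YES {
            @Override
            boolean isStored() { return true; }
        },
        NO {
            @Override
            boolean isStored() { return false; }
        };

        abstract boolean isStored();
    }

    public static void main(String[] args) {
        System.out.println(LegacyStore.YES + " stored=" + LegacyStore.YES.isStored());
        // values() and valueOf() come for free with an enum
        for (ModernStore s : ModernStore.values()) {
            System.out.println(s + " stored=" + s.isStored());
        }
        System.out.println(ModernStore.valueOf("NO").isStored());
    }
}
```

With the enum form the compiler supplies values(), valueOf(), switch support, and safe serialization, all of which the hand-written class would have to provide itself.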

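A Field's value can also take several types. The Field class supports four: String, Reader, byte[], and TokenStream.

Next comes constructing a Field object, which means looking at its constructors. It has nine of them.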

Also note the declaration of the Field class:

    public final class Field extends AbstractField implements Fieldable, Serializable

This shows that it is worth getting to know its parent class, AbstractField. These are the fields AbstractField declares:

    protected String name = "body";
    protected boolean storeTermVector = false;
    protected boolean storeOffsetWithTermVector = false;
    protected boolean storePositionWithTermVector = false;
    protected boolean omitNorms = false;
    protected boolean isStored = false;
    protected boolean isIndexed = true;
    protected boolean isTokenized = true;
    protected boolean isBinary = false;
    protected boolean isCompressed = false;
    protected boolean lazy = false;
    protected float boost = 1.0f;
    protected Object fieldsData = null;

Field also implements the Fieldable interface, which adds the methods used to manage and query the properties of a Field within its Document.
