Lucene's limits on the content it indexes, and points to watch

iteye_19410

Category: Lucene. Tags: lucene, Google

Article link: https://blog.csdn.net/iteye_19410/article/details/81935559

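Limit on content length:

The limit exists mainly to keep the indexing process from running out of memory. If enough memory is available, the value can be set to Integer.MAX_VALUE, which covers any document size likely to occur in practice.

Reference: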

Documents are truncated by default

The indexer by default truncates documents to IndexWriter.DEFAULT_MAX_FIELD_LENGTH or 10,000 terms in Lucene 2.0.

Rule of thumb: an average page of English text contains about 250 words. (Source: Google Answers.) This means only about 40 pages are indexed by default. If any of your documents are longer than this (and you want them indexed in full), you should raise the limit with IndexWriter.setMaxFieldLength().
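The arithmetic behind the "about 40 pages" figure can be checked with a short sketch. It assumes one word produces roughly one term, which is only an approximation (an analyzer may drop stop words or split tokens), and the class and method names here are hypothetical, not part of Lucene.

```java
// Rough sizing helper based on the rule of thumb quoted above.
// Assumption: one English word yields about one indexed term.
final class PageEstimate {
    // Rule-of-thumb figure from the text: about 250 words per page.
    static final int WORDS_PER_PAGE = 250;

    // How many full pages fit under a given term limit.
    static int pagesCovered(int maxFieldLength) {
        return maxFieldLength / WORDS_PER_PAGE;
    }

    // Approximate term limit needed to index a document of the given length.
    // multiplyExact throws ArithmeticException instead of silently overflowing.
    static int termsNeeded(int pages) {
        return Math.multiplyExact(pages, WORDS_PER_PAGE);
    }

    public static void main(String[] args) {
        System.out.println(pagesCovered(10_000)); // prints 40
        System.out.println(termsNeeded(300));     // prints 75000
    }
}
```

Under these assumptions, a 300-page document would need a limit of roughly 75,000 terms, which is the kind of value one would pass to IndexWriter.setMaxFieldLength().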

public void setMaxFieldLength(int maxFieldLength)

The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. This setting refers to the number of running terms, not to the number of different terms.

Note: this silently truncates large documents, excluding from the index all terms that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.

By default, no more than DEFAULT_MAX_FIELD_LENGTH terms will be indexed for a field.
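To make the truncation behaviour concrete, here is a minimal simulation in plain Java. It is not Lucene's implementation: the whitespace tokenizer, the lower-casing, and the names TruncationSketch and indexedTerms are all assumptions made for illustration. It only mirrors the two properties described above: the limit counts running terms rather than distinct ones, and everything past the limit is dropped without any error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simulation only: mimics a per-field cap on the number of running terms.
final class TruncationSketch {

    // Returns the terms that would survive a cap of maxFieldLength.
    // Tokenization here is a naive whitespace split (an assumption).
    static List<String> indexedTerms(String text, int maxFieldLength) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            if (kept.size() >= maxFieldLength) {
                break; // silent truncation: later terms are never seen
            }
            kept.add(token.toLowerCase(Locale.ROOT));
        }
        return kept;
    }

    public static void main(String[] args) {
        String text = "to be or not to be that is the question";
        List<String> kept = indexedTerms(text, 6);
        // Repeated terms each count toward the limit of 6.
        System.out.println(kept);                      // prints [to, be, or, not, to, be]
        // A term beyond the limit is simply absent, so a search for it would miss.
        System.out.println(kept.contains("question")); // prints false
    }
}
```

With a real index, the corresponding step is to call IndexWriter.setMaxFieldLength() with a sufficiently large value before adding documents, as the quoted documentation describes.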
