How to make indexing faster

Here are some things to try to speed up indexing in your Lucene application. Please see ImproveSearchingSpeed for how to speed up searching.

  • Be sure you really need to speed things up. Many of the ideas here are simple to try, but others will necessarily add some complexity to your application. So be sure your indexing speed is indeed too slow and the slowness is indeed within Lucene.

  • Make sure you are using the latest version of Lucene.

  • Use a local filesystem. Remote filesystems are typically quite a bit slower for indexing. If your index needs to live on a remote filesystem, consider building it first on the local filesystem and then copying it up to the remote filesystem.

  • Get faster hardware, especially a faster IO system. If possible, use a solid-state disk (SSD). These devices have come down substantially in price recently, and their much lower seek cost can yield a very sizable speedup in cases where the index cannot fit entirely in the OS's IO cache.

  • Open a single writer and re-use it for the duration of your indexing session.

  • Flush by RAM usage instead of document count.

    For Lucene <= 2.2: call writer.ramSizeInBytes() after every added doc then call flush() when it's using too much RAM. This is especially good if you have small docs or highly variable doc sizes. You need to first set maxBufferedDocs large enough to prevent the writer from flushing based on document count. However, don't set it too large otherwise you may hit LUCENE-845. Somewhere around 2-3X your "typical" flush count should be OK.

    For Lucene >= 2.3: IndexWriter can flush according to RAM usage itself. Call writer.setRAMBufferSizeMB() to set the buffer size. Be sure you don't also have any leftover calls to setMaxBufferedDocs since the writer will flush "either or" (whichever comes first).
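
    For example, a minimal sketch against the Lucene 2.x API (the 48 MB threshold and nextDocument() are placeholders for your own tuning and document source):

      // Lucene 2.2 and earlier: flush manually based on RAM usage.
      writer.setMaxBufferedDocs(100000);             // large enough that doc-count flushing never triggers
      long maxBufferBytes = 48 * 1024 * 1024;        // flush threshold; tune for your application
      Document doc;
      while ((doc = nextDocument()) != null) {       // nextDocument() is a hypothetical document source
        writer.addDocument(doc);
        if (writer.ramSizeInBytes() > maxBufferBytes) {
          writer.flush();
        }
      }

      // Lucene 2.3 and later: let IndexWriter flush by RAM usage on its own.
      writer.setRAMBufferSizeMB(48.0);               // and make sure setMaxBufferedDocs is not also called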

  • Use as much RAM as you can afford.

    More RAM before flushing means Lucene writes larger segments to begin with which means less merging later. Testing in LUCENE-843 found that around 48 MB is the sweet spot for that content set, but, your application could have a different sweet spot.

  • Turn off compound file format.

    Call setUseCompoundFile(false). Building the compound file format takes time during indexing (7-33% in testing for LUCENE-888). However, note that doing this will greatly increase the number of file descriptors used by indexing and by searching, so you could run out of file descriptors if mergeFactor is also large.
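
    A minimal sketch, assuming dir and analyzer already exist:

      IndexWriter writer = new IndexWriter(dir, analyzer, true);   // create=true
      writer.setUseCompoundFile(false);   // write each segment as separate files instead of one compound file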

  • Re-use Document and Field instances

    As of Lucene 2.3 there are new setValue(...) methods that allow you to change the value of a Field. This allows you to re-use a single Field instance across many added documents, which can save substantial GC cost. It's best to create a single Document instance, then add multiple Field instances to it, but hold onto these Field instances and re-use them by changing their values for each added document. For example you might have an idField, bodyField, nameField, storedField1, etc. After the document is added, you then directly change the Field values (idField.setValue(...), etc.) and re-add your Document instance.

    Note that you cannot re-use a single Field instance within a Document, and, you should not change a Field's value until the Document containing that Field has been added to the index. See Field for details.
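
    A minimal sketch against the Lucene 2.3 API (MyRecord and records are hypothetical stand-ins for your own content source):

      Field idField   = new Field("id",   "", Field.Store.YES, Field.Index.UN_TOKENIZED);
      Field bodyField = new Field("body", "", Field.Store.NO,  Field.Index.TOKENIZED);
      Document doc = new Document();
      doc.add(idField);
      doc.add(bodyField);

      for (MyRecord rec : records) {
        idField.setValue(rec.getId());     // only change values after the previous addDocument has returned
        bodyField.setValue(rec.getBody());
        writer.addDocument(doc);           // re-add the same Document instance
      }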

  • Always add fields in the same order to your Document, when using stored fields or term vectors

    Lucene's merging has an optimization whereby stored fields and term vectors can be bulk-byte-copied, but the optimization only applies if the field name -> number mapping is the same across segments. Future Lucene versions may attempt to assign the same mapping automatically (see LUCENE-1737), but until then the only way to get the same mapping is to always add the same fields in the same order to each document you index.

  • Re-use a single Token instance in your analyzer Analyzers often create a new Token for each term in sequence that needs to be indexed from a Field. You can save substantial GC cost by re-using a single Token instance instead.

  • Use the char[] API in Token instead of the String API to represent token text

    As of Lucene 2.3, a Token can represent its text as a slice into a char array, which saves the GC cost of new'ing and then reclaiming String instances. By re-using a single Token instance and using the char[] API you can avoid new'ing any objects for each term. See Token for details.
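
    The sketch below combines the previous two points: a minimal whitespace-style TokenStream that re-uses the Token handed in by the indexer and points it at a slice of a char[] (written against the Lucene 2.3 reuse API; the class name is made up):

      import org.apache.lucene.analysis.Token;
      import org.apache.lucene.analysis.TokenStream;

      public class ReusingWhitespaceTokenStream extends TokenStream {
        private final char[] text;   // the full field text to tokenize
        private int pos = 0;

        public ReusingWhitespaceTokenStream(char[] text) {
          this.text = text;
        }

        public Token next(Token reusableToken) {
          while (pos < text.length && Character.isWhitespace(text[pos])) pos++;   // skip whitespace
          if (pos >= text.length) return null;
          int start = pos;
          while (pos < text.length && !Character.isWhitespace(text[pos])) pos++;
          reusableToken.setTermBuffer(text, start, pos - start);   // slice into the char[]; no new String
          reusableToken.setStartOffset(start);
          reusableToken.setEndOffset(pos);
          reusableToken.setPositionIncrement(1);                   // reset state left over from the last term
          return reusableToken;
        }

        public Token next() {                // for callers still on the non-reuse API
          return next(new Token());
        }
      }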

  • Use autoCommit=false when you open your IndexWriter

    In Lucene 2.3 there are substantial optimizations for Documents that use stored fields and term vectors, to save merging of these very large index files. You should see the best gains by using autoCommit=false for a single long-running session of IndexWriter. Note however that searchers will not see any of the changes flushed by this IndexWriter until it is closed; if that is important you should stick with autoCommit=true instead or periodically close and re-open the writer.
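
    A minimal sketch against the Lucene 2.3 API, assuming dir and analyzer already exist:

      IndexWriter writer = new IndexWriter(dir, false, analyzer, true);   // autoCommit=false, create=true
      try {
        // ... add all documents here ...
      } finally {
        writer.close();   // searchers see the flushed changes only once the writer is closed
      }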

  • Instead of indexing many small text fields, aggregate the text into a single "contents" field and index only that (you can still store the other fields).
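
    A minimal sketch; the field names and variables are only illustrative:

      StringBuffer contents = new StringBuffer();
      contents.append(title).append(' ').append(author).append(' ').append(body);

      Document doc = new Document();
      doc.add(new Field("title",    title,  Field.Store.YES, Field.Index.NO));
      doc.add(new Field("author",   author, Field.Store.YES, Field.Index.NO));
      doc.add(new Field("contents", contents.toString(), Field.Store.NO, Field.Index.TOKENIZED));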

  • Increase mergeFactor, but not too much.

    A larger mergeFactor defers merging of segments until later, thus speeding up indexing because merging is a large part of indexing. However, this will slow down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.
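
    A minimal sketch; the value is only illustrative and should be tuned for your application:

      writer.setMergeFactor(20);   // default is 10; larger defers merges but uses more file descriptors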

  • Turn off any features you are not in fact using. If you are storing fields but not using them at query time, don't store them. Likewise for term vectors. If you are indexing many fields, turning off norms for those fields may help performance.
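
    A minimal sketch of both (the field names are only illustrative; setOmitNorms is assumed to be available on Field, as in the 2.x releases):

      // An identifier indexed without tokenization and without norms; not stored, no term vectors:
      doc.add(new Field("id", idValue, Field.Store.NO, Field.Index.NO_NORMS, Field.TermVector.NO));

      // A tokenized field with norms turned off:
      Field body = new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.NO);
      body.setOmitNorms(true);
      doc.add(body);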

  • Use a faster analyzer.

    Sometimes analysis of a document takes a lot of time. For example, StandardAnalyzer is quite time consuming, especially in Lucene versions <= 2.2. If you can get by with a simpler analyzer, then try it.
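
    For example, if your content is already clean, whitespace-separated text, something as simple as this sketch may be enough (dir is assumed to exist):

      Analyzer analyzer = new WhitespaceAnalyzer();   // much cheaper than StandardAnalyzer
      IndexWriter writer = new IndexWriter(dir, analyzer, true);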

  • Speed up document construction. Often the process of retrieving a document from somewhere external (database, filesystem, crawled from a Web site, etc.) is very time consuming.

  • Don't optimize... ever. Optimizing merges all segments down to one, which is very costly and does nothing to speed up indexing.

  • Use multiple threads with one IndexWriter. Modern hardware is highly concurrent (multi-core CPUs, multi-channel memory architectures, native command queuing in hard drives, etc.) so using more than one thread to add documents can give good gains overall. Even on older machines there is often still concurrency to be gained between IO and CPU. Test the number of threads to find the best performance point.
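
    A minimal sketch; nextDocument() is a hypothetical, thread-safe source of documents that returns null when exhausted:

      void indexWithThreads(final IndexWriter writer, int numThreads) throws InterruptedException {
        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
          threads[i] = new Thread() {
            public void run() {
              try {
                Document doc;
                while ((doc = nextDocument()) != null) {
                  writer.addDocument(doc);   // IndexWriter.addDocument is safe to call from multiple threads
                }
              } catch (IOException ioe) {
                throw new RuntimeException(ioe);
              }
            }
          };
          threads[i].start();
        }
        for (int i = 0; i < numThreads; i++) {
          threads[i].join();
        }
      }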

  • Index into separate indices then merge. If you have a very large amount of content to index, you can break your content into N "silos", index each silo on a separate machine, then use writer.addIndexesNoOptimize() to merge them all into one final index.
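
    A minimal sketch against the Lucene 2.x API; the paths are only illustrative:

      Directory[] silos = new Directory[] {
        FSDirectory.getDirectory("/index/silo1"),
        FSDirectory.getDirectory("/index/silo2"),
      };
      IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/index/final"), analyzer, true);
      writer.addIndexesNoOptimize(silos);   // merges the silo indexes without forcing a full optimize
      writer.close();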

  • Run a Java profiler.

    If all else fails, profile your application to figure out where the time is going. I've had success with a very simple profiler called JMP. There are many others. Often you will be pleasantly surprised to find some silly, unexpected method is taking far too much time.

