Coreseek real-time index updates: delta (incremental) indexes

Article link: https://blog.csdn.net/ms_x0828/article/details/7679229

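This post looks at real-time index updates in Coreseek, including how disk-based indexes and real-time indexes differ and where each applies. It focuses on how a main index plus a delta index improves indexing efficiency and response time when large numbers of documents are being updated. It walks through defining the main and delta indexes in the config file, and covers how to avoid duplicated index entries and how to handle index merging.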


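Coreseek offers two options for real-time index updates:

1. Use disk-based indexes, partition them manually, and periodically rebuild the smaller partition (called the "delta"). By keeping the rebuilt part as small as possible, the average indexing lag can be brought down to 30-60 seconds. In 0.9.x versions this was the only approach available. On a huge document collection it may still be the most efficient one.

2. Versions 1.x (starting with 1.10-beta) add support for real-time indexes (RT indexes for short), intended for on-the-fly updates of full-text data. Updates to an RT index show up in search results within 1-2 milliseconds (0.001-0.002 seconds). However, RT indexes are not very efficient for bulk indexing of large data volumes.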

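This post is mainly about delta indexes.

The basic idea is to set up two data sources and two indexes: a main index for data that is rarely or never updated, and a delta index for newly added documents.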

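After defining the main and delta indexes in the config file, you cannot simply run indexer --config d:\coreseek\csft.conf --all, add data to the database, and then run indexer --config d:\coreseek\csft.conf main delta --rotate (I actually did it this way twice). Naming main in that second command rebuilds the main index as well, which defeats the purpose of the delta. The correct steps are: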

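1. Build the main index: indexer -c d:\coreseek\csft.conf --all

2. Add data to the database

3. Build the delta index: indexer -c d:\coreseek\csft.conf delta --rotate

4. Merge the indexes: indexer -c d:\coreseek\csft.conf --merge main delta --rotate (to keep multiple keyword entries from pointing at the same document, add --merge-dst-range deleted 0 0, which keeps only documents whose deleted attribute is 0; the index has to define that attribute)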
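The steps above can be sketched as a small Python helper that assembles the indexer command lines. This is only an illustration under a few assumptions: the indexer binary is on PATH, the config lives at the path used in this post, and the function names (build_main_cmd, build_delta_cmd, merge_cmd, update_cycle, run) are made up for the example. By default it only prints the commands and does not execute them.

```python
import subprocess
from typing import List

INDEXER = "indexer"                  # assumption: the Coreseek indexer binary is on PATH
CONFIG = r"d:\coreseek\csft.conf"    # config path used in the steps above


def build_main_cmd() -> List[str]:
    """Step 1: build all indexes defined in the config (run once, up front)."""
    return [INDEXER, "-c", CONFIG, "--all"]


def build_delta_cmd() -> List[str]:
    """Step 3: rebuild only the delta index and ask searchd to rotate it in."""
    return [INDEXER, "-c", CONFIG, "delta", "--rotate"]


def merge_cmd(filter_deleted: bool = False) -> List[str]:
    """Step 4: merge delta into main; optionally apply the deleted=0 filter."""
    cmd = [INDEXER, "-c", CONFIG, "--merge", "main", "delta", "--rotate"]
    if filter_deleted:
        cmd += ["--merge-dst-range", "deleted", "0", "0"]
    return cmd


def update_cycle(filter_deleted: bool = False) -> List[List[str]]:
    """Commands to repeat after each batch of new rows (steps 3 and 4)."""
    return [build_delta_cmd(), merge_cmd(filter_deleted)]


def run(cmd: List[str], dry_run: bool = True) -> None:
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Dry run: show the initial build followed by one update cycle.
for command in [build_main_cmd()] + update_cycle():
    run(command)
```

Note that the main index never appears as a plain index name in the repeated cycle; it is only touched through --merge, which is the point of the procedure.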

The config file for delta indexing looks like this:

# Data sources for delta indexing
source main
{
    type                    = mysql
    sql_host                = localhost
    sql_user                = root
    sql_pass                = 123456
    sql_db                  = hottopic
    sql_port                = 3306
    sql_query_pre           = SET NAMES utf8
    sql_query_pre           = replace into sph_counter select 1,max(id) from st_info
    sql_query_range         = select 1,max(id) from st_info
    sql_range_step          = 1000

    sql_query               = SELECT id, pubDate, title, description,nav_id,rss_id FROM st_info where id>=$start and id <=$end and \
                              id <=(select max_doc_id from sph_counter where counter_id=1)
    sql_attr_uint           = nav_id
    sql_attr_uint           = rss_id
    sql_attr_timestamp      = pubDate
}

source delta : main
{
    sql_query_pre           = SET NAMES utf8
    sql_query               = SELECT id, pubDate, title, description,nav_id,rss_id FROM st_info where id>=$start and id <=$end and \
                              id >(select max_doc_id from sph_counter where counter_id=1)
    sql_query_post_index    = replace into sph_counter select 1,max(id) from st_info
}

# Index definitions
index main
{
    source              = main            
    path                = D:/coreseek/coreseek-4.1-win32/var/data/mysqlInfoSPHMain 
    docinfo             = extern
    mlock               = 0
    morphology          = none
    min_word_len        = 1
    html_strip          = 0
    stopwords		=

    charset_dictpath    =  D:/coreseek/coreseek-4.1-win32/etc    
    charset_type        = zh_cn.utf-8
}

index delta : main
{
    source              = delta
    path                = D:/coreseek/coreseek-4.1-win32/var/data/mysqlInfoSPHDelta
}

# Global indexer settings
indexer
{
    mem_limit            = 128M
}

# searchd service settings
searchd
{
    listen                  = 127.0.0.1:9312
    read_timeout            = 5
    max_children            = 30
    max_matches             = 1000
    seamless_rotate         = 0
    preopen_indexes         = 0
    unlink_old              = 1
    pid_file                = D:/coreseek/coreseek-4.1-win32/var/log/searchd_mysqlInfoSph.pid
    log                     = D:/coreseek/coreseek-4.1-win32/var/log/searchd_mysqlInfoSph.log
    query_log               = D:/coreseek/coreseek-4.1-win32/var/log/query_mysqlInfoSph.log
    binlog_path             =
    compat_sphinxql_magics  = 0
}
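To see how the sph_counter table in the config above splits rows between the main and delta sources, here is a small simulation. It is only a sketch under stated assumptions: SQLite (Python's built-in sqlite3) stands in for MySQL, st_info is cut down to id and title, the $start/$end range clauses are left out, and the helpers add_rows, index_main and index_delta are invented for this example. The SQL statements themselves follow the sql_query_pre, sql_query and sql_query_post_index lines of the config.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE st_info (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE sph_counter (
        counter_id INTEGER PRIMARY KEY,
        max_doc_id INTEGER NOT NULL
    );
""")

# Same statement the config uses to record the highest indexed id.
ADVANCE = "REPLACE INTO sph_counter SELECT 1, MAX(id) FROM st_info"
MAIN_Q = ("SELECT id FROM st_info WHERE id <= "
          "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)")
DELTA_Q = ("SELECT id FROM st_info WHERE id > "
           "(SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)")


def add_rows(n):
    """Insert n new documents; ids are assigned in increasing order."""
    conn.executemany("INSERT INTO st_info (title) VALUES (?)", [("doc",)] * n)


def index_main():
    """Mimic source main: advance the counter first, then fetch ids <= counter."""
    conn.execute(ADVANCE)                      # sql_query_pre
    return [row[0] for row in conn.execute(MAIN_Q)]


def index_delta():
    """Mimic source delta: fetch ids > counter, then advance the counter."""
    ids = [row[0] for row in conn.execute(DELTA_Q)]
    conn.execute(ADVANCE)                      # sql_query_post_index
    return ids


add_rows(5)
main_ids = index_main()     # [1, 2, 3, 4, 5]
add_rows(3)
delta1 = index_delta()      # [6, 7, 8]
add_rows(2)
delta2 = index_delta()      # [9, 10]
print(main_ids, delta1, delta2)
```

Because the counter moves forward after every delta build, the second delta no longer contains the rows from the first one. That only makes sense if each delta is merged into the main index before the next delta build, as in step 4 of the procedure.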

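A problem to watch out for: suppose my main index holds 500,000 rows and was built the day before yesterday. Yesterday I added 100,000 rows, built the delta index, and merged it into the main index. Today I added another 100,000 rows, built the delta index, and merged it into the main index again. I did not rebuild the main index at any point during these two days. Here is the problem: yesterday the delta was built from 100,000 rows, but today it is built from 200,000 rows, and 100,000 of those are already in the main index. That is pretty alarming. Solution: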