Building a Full-Text Search Index with Coreseek

Latest recommended article published on 2021-01-19 00:56:13

混沌极致

Category: solr-sphinx-coreseek

Article link: https://blog.csdn.net/marujunyy/article/details/8466295

First, install Coreseek. For the installation steps, see: GitHub source

Note that the official guide says to install libtool-2.2.6b, but with that version the mmseg installation fails with an error, so install libtool-2.2.10 instead.

Once the installation is finished, test it:

    cd coreseek-3.2.14/testpack
    cat var/test/test.xml    # Chinese text should display correctly here
    /usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
    /usr/local/coreseek/bin/indexer -c etc/csft.conf --all
    /usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索

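Here, mmseg performs word segmentation, and /usr/local/mmseg3/etc is the dictionary path inside the mmseg installation directory.

var/test/test.xml is the path of the XML file to segment. etc/csft.conf is the path of the Coreseek configuration file, which must exist. If indexer or search is run without -c etc/csft.conf, the default configuration is used; when Coreseek is installed following the official guide, the default configuration path is /usr/local/coreseek/etc/.

The "网络搜索" after search is the keyword to search for.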
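
The test commands above can also be driven from a script. The following is a minimal sketch, assuming the /usr/local/coreseek install prefix used in this article; the helper names build_command and run are hypothetical, and only the -c and --all options shown above are used.

```python
import subprocess

CORESEEK_BIN = "/usr/local/coreseek/bin"  # install prefix used in this article


def build_command(tool, *args, config=None):
    """Return the argument list for a Coreseek tool such as indexer or search.

    When config is None, no -c option is added, so the tool falls back
    to its default configuration.
    """
    command = [f"{CORESEEK_BIN}/{tool}"]
    if config is not None:
        command += ["-c", config]
    command += list(args)
    return command


def run(tool, *args, config=None):
    """Run the tool and return its standard output as text."""
    result = subprocess.run(
        build_command(tool, *args, config=config),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,
    )
    return result.stdout
```

Calling run("indexer", "--all", config="etc/csft.conf") followed by run("search", "网络搜索", config="etc/csft.conf") from the testpack directory would repeat the last two test commands.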

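Configuration format for an XML data source (csft.conf):

Set charset_dictpath so that Coreseek can find its word-segmentation dictionary. It only needs to point to the etc directory that holds the dictionary (/usr/local/mmseg3/etc/ in the configuration below).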

# source definition
source xml
{
        type            = xmlpipe2
        xmlpipe_command = cat var/test/test.xml     # any other executable that outputs XML data can be used here
}

# index definition
index xml
{
        source                  = xml             # name of the corresponding source
        path                    = var/data/xml
        docinfo                 = extern
        mlock                   = 0
        morphology              = none
        min_word_len            = 1
        html_strip              = 0
        charset_dictpath        = /usr/local/mmseg3/etc/       # for BSD and Linux; must end with /
        #charset_dictpath       = etc/                         # for Windows; must end with /
        charset_type            = zh_cn.utf-8
}
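
Since charset_dictpath has to end with / and point at an existing directory, a quick sanity check before indexing can save a failed run. This is only a sketch: the function names are hypothetical, and the regular expression assumes the simple "key = value" layout shown in the configuration above, not the full csft.conf syntax.

```python
import os
import re

# Matches an active (uncommented) "charset_dictpath = value" line.
DICTPATH_RE = re.compile(r"^[ \t]*charset_dictpath[ \t]*=[ \t]*([^#\s]+)", re.MULTILINE)


def find_dictpaths(conf_text):
    """Return every active charset_dictpath value found in the config text."""
    return DICTPATH_RE.findall(conf_text)


def check_dictpath(path):
    """Return a list of problems with a charset_dictpath value (empty if none)."""
    problems = []
    if not path.endswith("/"):
        problems.append("path should end with '/'")
    if not os.path.isdir(path):
        problems.append("directory does not exist")
    return problems
```

Reading etc/csft.conf, passing its text to find_dictpaths, and running check_dictpath on each result reports a missing trailing slash or a wrong directory before indexer is started.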

XML data format:

<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
        <sphinx:schema>
                <sphinx:field name="subject"/>
                <sphinx:field name="content"/>
                <sphinx:attr name="published" type="timestamp"/>
                <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
        </sphinx:schema>
        <sphinx:document id="1">
                <subject>愚人节最佳蛊惑爆料 谷歌300亿美元收购百度</subject>
                <published>1270131607</published>
                <content>据国外媒体报道，谷歌将巨资收购百度，涉及金额高达300亿美元。谷歌借此重返大陆市场。......

                </content>
                <author_id>1</author_id>
        </sphinx:document>
        <sphinx:document id="2">
                <subject>Twitter主页改版 推普通用户消息增加趋势话题</subject>
                <published>1270135548</published>
                <content>4月1日消息，据国外媒体报道，Twitter本周二推出新版主页，目的很简单：帮助新用户了解Twitter和增加用户黏稠度。......

                </content>
                <author_id>1</author_id>
        </sphinx:document>
        <sphinx:document id="3">
                <subject>死都要上！Opera Mini 体验版抢先试用</subject>
                <published>1270094460</published>
                <content>Opera一直都被认为是浏览速度飞快，同时在移动平台上更是占有不少的份额。......

                </content>
                <author_id>2</author_id>
        </sphinx:document>
</sphinx:docset>

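<sphinx:schema> describes the data fields.

The name attribute of each <sphinx:field> defines the name of a field inside the <sphinx:document> elements below that should be indexed. The example uses subject and content; these can be renamed as needed, and there can be more than one.

The id attribute of <sphinx:document> is the ID of the corresponding article in the article source, and it must be unique. As long as the names of the tags that describe the fields are exactly the same as in the example, it will work.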
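
Because xmlpipe_command accepts any program that prints XML, the document set can be generated instead of kept in a static file. Below is a minimal sketch of such a generator, assuming records are already available as Python dictionaries; build_docset is a hypothetical helper, and it emits only the name and type attributes of <sphinx:attr>, so bits and default from the example are left out.

```python
from xml.sax.saxutils import escape, quoteattr


def build_docset(fields, attrs, documents):
    """Build an xmlpipe2 document as a string.

    fields: names of the full-text fields to index
    attrs: mapping of attribute name -> attribute type
    documents: iterable of (doc_id, values) pairs, where values maps
               each field/attribute name to its content
    """
    lines = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        lines.append(f"<sphinx:field name={quoteattr(name)}/>")
    for name, attr_type in attrs.items():
        lines.append(f"<sphinx:attr name={quoteattr(name)} type={quoteattr(attr_type)}/>")
    lines.append("</sphinx:schema>")

    seen_ids = set()
    for doc_id, values in documents:
        # Document IDs must be unique, so reject repeats early.
        if doc_id in seen_ids:
            raise ValueError(f"duplicate document id: {doc_id}")
        seen_ids.add(doc_id)
        lines.append(f'<sphinx:document id="{int(doc_id)}">')
        for name in list(fields) + list(attrs):
            # Escape &, < and > so the content cannot break the XML.
            lines.append(f"<{name}>{escape(str(values[name]))}</{name}>")
        lines.append("</sphinx:document>")

    lines.append("</sphinx:docset>")
    return "\n".join(lines) + "\n"
```

A script that prints the result of build_docset to standard output could then be named in xmlpipe_command in place of cat var/test/test.xml.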

With all of this configured, we can run a full-text search:

    /usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索