Sphinx + PHP + scws构建MySQL准实时分布式全文检索

http://www.zrwm.com/?p=5326

http://www.qixing318.com/article/sphinx-php-scws-build-mysql-quasi-realtime-distributed-fulltext-retrieval-of-actual-combat.html

Sphinx全文检索
  • Sphinx安装

    安装Sphinx:

    # yum install expat expat-devel
    # wget -c http://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz
    # tar zxvf sphinx-2.0.7-release.tar.gz
    # cd sphinx-2.0.7-release
    # ./configure --prefix=/usr/local/sphinx  --with-mysql=/usr/local/mysql --with-libexpat --enable-id64
    # make && make install

    安装libsphinxclient:

    # cd api/libsphinxclient
    # ./configure --prefix=/usr/local/sphinx/libsphinxclient
    # make && make install

    安装Sphinx的PHP扩展:

    # wget -c http://pecl.php.net/get/sphinx-1.3.0.tgz
    # tar zxvf sphinx-1.3.0.tgz
    # cd sphinx-1.3.0
    # phpize
    # ./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/local/php5410/bin/php-config
    # make && make install
    # echo "[Sphinx]" >> /usr/local/php5410/etc/php.ini
    # echo "extension = sphinx.so" >> /usr/local/php5410/etc/php.ini
  • 全文检索测试的数据表
    CREATE SCHEMA IF NOT EXISTS `sphinx_items`;
    USE sphinx_items;
    CREATE TABLE IF NOT EXISTS `items` (
        `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY ,
        `title` VARCHAR( 255 ) NOT NULL ,
        `content` TEXT NOT NULL ,
        `created` DATETIME NOT NULL
    ) ENGINE = MYISAM;

    创建表后填充数据…

  • Sphinx配置

    创建增量索引标示的计数表:

    CREATE TABLE sph_counter (
        counter_id INTEGER PRIMARY KEY NOT NULL,
        max_doc_id INTEGER NOT NULL
    );

    以下采用”Main + Delta”(“主索引”+”增量索引”)的索引策略,使用Sphinx自带的一元分词.
    Sphinx配置:

    # cat /usr/local/sphinx/etc/sphinx.conf
    source items {
        type = mysql
        sql_host = 192.168.1.191
        sql_user = root
        sql_pass = 123456
        sql_db = sphinx_items
        
        sql_query_pre = SET NAMES utf8
        sql_query_pre = SET SESSION query_cache_type = OFF
        sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items
        
        sql_query_range = SELECT MIN(id), MAX(id) FROM items \
                                WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
        sql_range_step = 1000
        sql_ranged_throttle = 1000
    
        sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
                        WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) \
                                    AND id >= $start AND id <= $end  
        
        sql_attr_timestamp = created
        sql_attr_bool = deleted
    }
    
    source items_delta : items {
        sql_query_pre = SET NAMES utf8
        sql_query_range = SELECT MIN(id), MAX(id) FROM items \
                                WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
        sql_query = SELECT id, title, content, created, 0 as deleted FROM items \
                        WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) \
                            AND id >= $start AND id <= $end
        sql_query_post_index = set @max_doc_id :=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
        sql_query_post_index = REPLACE INTO sph_counter SELECT 2, IF($maxid, $maxid, @max_doc_id)
    }
    #主索引
    index items {
        source = items
        path = /usr/local/sphinx/var/data/items
        docinfo = extern
        morphology = none
        min_word_len = 1
        min_prefix_len = 0
        html_strip = 1
        html_remove_elements = style, script
        ngram_len = 1
        ngram_chars = U+3000..U+2FA1F
        charset_type = utf-8
        charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F
        preopen = 1   
    }
    
    #增量索引
    index items_delta : items {
        source = items_delta
        path = /usr/local/sphinx/var/data/items-delta
    }
    
    #分布式索引
    index master {
        type = distributed
        local = items
        local = items_delta
    }
    
    indexer {
        mem_limit = 256M
    }
    
    searchd {
        listen                   = 9312
        listen                   = 9306:mysql41 #Used for SphinxQL
        log                      = /usr/local/sphinx/var/log/searchd.log
        #query_log                = /usr/local/sphinx/var/log/query.log
        compat_sphinxql_magics   = 0
        attr_flush_period        = 600
        mva_updates_pool         = 16M
        read_timeout             = 5
        max_children             = 0
        dist_threads             = 2 
        pid_file                 = /usr/local/sphinx/var/log/searchd.pid
        max_matches              = 1000
        seamless_rotate          = 1
        preopen_indexes          = 1
        unlink_old               = 1
        workers                  = threads # for RT to work
        binlog_path              = /usr/local/sphinx/var/data
    }
    
  • Sphinx索引

    第一次/重新索引:

    # indexer -c /usr/local/sphinx/etc/sphinx.conf --all
    Sphinx 2.0.7-release (r3759)
    Copyright (c) 2001-2012, Andrew Aksyonoff
    Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
    
    using config file '/usr/local/sphinx/etc/sphinx.conf'...
    indexing index 'items'...
    collected 5111 docs, 27.1 MB
    sorted 5.2 Mhits, 100.0% done
    total 5111 docs, 27115351 bytes
    total 9.001 sec, 3012302 bytes/sec, 567.79 docs/sec
    indexing index 'items_delta'...
    collected 0 docs, 0.0 MB
    total 0 docs, 0 bytes
    total 1.005 sec, 0 bytes/sec, 0.00 docs/sec
    skipping non-plain index 'master'...
    total 4 reads, 0.004 sec, 3303.6 kb/call avg, 1.1 msec/call avg
    total 40 writes, 0.019 sec, 697.7 kb/call avg, 0.4 msec/call avg
    

    索引完之后启动searchd:

    # searchd -c /usr/local/sphinx/etc/sphinx.conf
    Sphinx 2.0.7-release (r3759)
    Copyright (c) 2001-2012, Andrew Aksyonoff
    Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
    
    using config file '/usr/local/sphinx/etc/sphinx.conf'...
    listening on all interfaces, port=9312
    listening on all interfaces, port=9306
    precaching index 'items'
    precaching index 'items_delta'                              
    precached 2 indexes in 0.002 sec
    

    停止Searchd:

    # searchd -c /usr/local/sphinx/etc/sphinx.conf --stop

    查看Searchd状态:

    # searchd -c /usr/local/sphinx/etc/sphinx.conf --status
  • 索引更新及使用说明

    “增量索引”每N分钟更新一次.通常在每天晚上低负载的时候进行一次索引合并,同时重新建立”增量索引”.当然”主索引”数据不多的话,也可以直接重新建立”主索引”.
    API搜索的时候,同时使用”主索引”和”增量索引”,这样可以获得准实时的搜索数据.本文的Sphinx配置将”主索引”和”增量索引”放到分布式索引master中,因此只需查询分布式索引”master”即可获得全部匹配数据(包括最新数据).

    索引的更新与合并的操作可以放到cron job完成:

    # crontab -l
    */1 * * * *  /usr/local/sphinx/shell/delta_index_update.sh
    0 3 * * *    /usr/local/sphinx/shell/merge_daily_index.sh

    cron job所用的shell脚本例子:
    delta_index_update.sh:

    #!/bin/bash
    /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1

    merge_daily_index.sh:

    #!/bin/bash
    indexer=`which indexer`
    mysql=`which mysql`
    
    QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;"
    index_counter=$($mysql -h192.168.1.191 -uroot -p123456 -sN -e "$QUERY")
    
    ## merge "main + delta" indexes
    $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1
    
    if [ "$?" -eq 0 ]; then
        ##update sphinx counter
        if [ ! -z $index_counter ]; then
            $mysql -h192.168.1.191 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"
        fi
        ##rebuild delta index to avoid confusion with main index
        $indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1
    fi
    
中文分词
  • PHP中文分词scws

    scws安装:

    # wget -c http://www.xunsearch.com/scws/down/scws-1.2.1.tar.bz2
    # tar jxvf scws-1.2.1.tar.bz2
    # cd scws-1.2.1
    # ./configure --prefix=/usr/local/scws
    # make && make install

    scws的PHP扩展安装:

    # cd ./phpext
    # phpize 
    # ./configure --with-php-config=/usr/local/php5410/bin/php-config
    # make && make install
    # echo "[scws]" >> /usr/local/php5410/etc/php.ini 
    # echo "extension = scws.so" >> /usr/local/php5410/etc/php.ini
    # echo "scws.default.charset = utf-8" >> /usr/local/php5410/etc/php.ini
    # echo "scws.default.fpath = /usr/local/scws/etc/" >> /usr/local/php5410/etc/php.ini

    词库安装:

    # wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2
    # tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/
    # chown www:www /usr/local/scws/etc/dict.utf8.xdb
PHP使用Sphinx+scws
  • PHP使用Sphinx+scws测试例子

    在Sphinx源码API中,有好几种语言的api调用.其中有一个是sphinxapi.php.
    不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分.

    测试用的搜索类Search.php:

    <?php
    class Search {
        /**
         * @var SphinxClient 
        **/
        protected $client;
        /**
         * @var string
        **/
        protected $keywords;
        /**
         * @var resource
        **/
        private static $dbconnection = null;
        
        /**
         * Constructor
         **/
        public function __construct($options = array()) {
            $defaults = array(
                'query_mode' => SPH_MATCH_EXTENDED2,
                'sort_mode' => SPH_SORT_EXTENDED,
                'ranking_mode' => SPH_RANK_PROXIMITY_BM25,
                'field_weights' => array(),
                'max_matches' => 1000,
                'snippet_enabled' => true,
                'snippet_index' => 'items',
                'snippet_fields' => array(), 
            );
            $this->options = array_merge($defaults, $options);
            $this->client = new SphinxClient();
            $this->client->setMatchMode($this->options['query_mode']);
            if ($this->options['field_weights'] !== array()) {
                $this->client->setFieldWeights($this->options['field_weights']);
            }
            /*
            if ( in_array($this->options['query_mode'], [SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED]) ) {
                $this->client->setRankingMode($this->options['ranking_mode']);
            }
            */
        }
       
        /**
         * Query
         *
         * @param string  $keywords
         * @param integer $offset
         * @param integer $limit
         * @param string  $index
         * @return array
         **/
        public function query($keywords, $offset = 0, $limit = 10, $index = '*') {
            $this->keywords = $keywords;
            $max_matches = $limit > $this->options['max_matches'] ? $limit : $this->options['max_matches'];
            $this->client->setLimits($offset, $limit, $max_matches);
            $query_results = $this->client->query($keywords, $index);
    
            if ($query_results === false) {
                $this->log('error:' . $this->client->getLastError());
            }
                   
            $res = [];
            if ( empty($query_results['matches']) ) {
                return $res;
            }
            $res['total'] = $query_results['total'];
            $res['total_found'] = $query_results['total_found'];
            $res['time'] = $query_results['time'];
            $doc_ids = array_keys($query_results['matches']);
            unset($query_results);
            $res['data'] = $this->fetch_data($doc_ids);
            if ($this->options['snippet_enabled']) {
                $this->buildExcerptRows($res['data']);
            }
            
            return $res;
        }
        
        /**
         * custom sorting 
         * 
         * @param string $sortBy
         * @param int $mode
         * @return bool
         **/
        public function setSortBy($sortBy = '', $mode = 0) {
            if ($sortBy) {
                $mode = $mode ?: $this->options['sort_mode'];
                $this->client->setSortMode($mode, $sortBy);
            } else {
                $this->client->setSortMode(SPH_SORT_RELEVANCE);
            }
        }
        
        /**
         * fetch data based on matched doc_ids
         * 
         * @param array $doc_ids
         * @return array
         **/    
        protected function fetch_data($doc_ids) {
            $ids = implode(',', $doc_ids);
            $queries = self::getDBConnection()->query("SELECT * FROM items WHERE id in ($ids)", PDO::FETCH_ASSOC);
            return iterator_to_array($queries);
        }
        
        /**
         * build excerpts for data
         * 
         * @param array $rows
         * @return array
         **/ 
        protected function buildExcerptRows(&$rows) {
            $options = array(
                'before_match' => '<b style="color:red">',
                'after_match'  => '</b>',
                'chunk_separator' => '...',
                'limit'    => 256,
                'around'   => 3,
                'exact_phrase' => false,
                'single_passage' => true,
                'limit_words' => 5,
            );
            $scount = count($this->options['snippet_fields']);
            foreach ($rows as &$row) {
                foreach ($row as $fk => $item) {
                    if (!is_string($item) || ($scount && !in_array($fk, $this->options['snippet_fields'])) ) continue;
                    $item = preg_replace('/[\r\t\n]+/', '', strip_tags($item));
                    $res = $this->client->buildExcerpts(array($item), $this->options['snippet_index'], $this->keywords, $options);
                    $row[$fk] = $res === false ? $item : $res[0];
                }
            }        
            return $rows;
        }
        
        /**
         * database connection
         *
         * @return resource
         **/ 
        private static function getDBConnection() {
            $dsn = 'mysql:host=192.168.1.191;dbname=sphinx_items';
            $user = 'root';
            $pass = '123456';
            if (!self::$dbconnection) {
                try {
                    self::$dbconnection = new PDO($dsn, $user, $pass);
                } catch (PDOException $e) {
                    die('Connection failed: ' . $e->getMessage());
                }
            }
            return self::$dbconnection;
        }
    
        /**
         * Chinese words segmentation
         *
         **/
        public function wordSplit($keywords) {
            $fpath = ini_get('scws.default.fpath');
            $so = scws_new();
            $so->set_charset('utf-8');
            $so->add_dict($fpath . '/dict.utf8.xdb');
            //$so->add_dict($fpath .'/custom_dict.txt', SCWS_XDICT_TXT);
            $so->set_rule($fpath . '/rules.utf8.ini');
            $so->set_ignore(true);
            $so->set_multi(false);
            $so->set_duality(false);
            $so->send_text($keywords);
            $words = [];
            $results =  $so->get_result();
            foreach ($results as $res) {
                $words[] = '(' . $res['word'] . ')';
            }
            $words[] = '(' . $keywords . ')';
            return join('|', $words);
        }
        
        /**
         * get current sphinx client
         *
         * @return resource
         **/
        public function getClient() {
            return $this->client;
        }
        /**
         * log error
         *
         **/
        public function log($msg) {
            // log errors here
            //echo $msg;
            
        }    
        /**
         * magic methods
         **/
        public function __call($method, $args) {
            $rc = new ReflectionClass('SphinxClient');
            if ( !$rc->hasMethod($method) ) {
                throw new Exception('invalid method :' . $method);
            }
            return call_user_func_array(array($this->client, $method), $args);
        }
    }
    

    测试文件test.php:

    <?php
    require('Search.php');
    $s = new Search([
            'snippet_fields' => ['title', 'content'],
            'field_weights' => ['title' => 20, 'content' => 10],
        ]);
    $s->setSortMode(SPH_SORT_EXTENDED, 'created desc,@weight desc');
    //$s->setSortBy('created desc,@weight desc');
    $words = $s->wordSplit("MySQL复制");
    $res = $s->query($words, 0, 10, 'master');
    var_dump($res);
SphinxQL
  • SphinxQL

    要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置).

    [root@centos190 etc]# mysql -h127.0.0.1 -P9306 -uroot -p
    Enter password: 
    Welcome to the MySQL monitor.  Commands end with ; or \g.
    Your MySQL connection id is 1
    Server version: 2.0.7-release (r3759)-->当前使用的Sphinx版本
    
    Copyright (c) 2000, 2012, Oracle and/or its affiliates. All rights reserved.
    
    Oracle is a registered trademark of Oracle Corporation and/or its
    affiliates. Other names may be trademarks of their respective
    owners.
    
    Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
    
    mysql> show global variables;
    +----------------------+---------+
    | Variable_name        | Value   |
    +----------------------+---------+
    | autocommit           | 1       |
    | collation_connection | libc_ci |
    | query_log_format     | plain   |
    | log_level            | info    |
    +----------------------+---------+
    4 rows in set (0.01 sec)
    mysql> show tables;
    +-------------+-------------+
    | Index       | Type        |
    +-------------+-------------+
    | items       | local       |
    | items_delta | local       |
    | master      | distributed |
    +-------------+-------------+
    3 rows in set (0.00 sec)
    
    mysql> desc master;
    ERROR 1146 (42S02): no such index 'master'
    mysql> desc items;
    +---------+-----------+
    | Field   | Type      |
    +---------+-----------+
    | id      | integer   |
    | title   | field     |
    | content | field     |
    | created | timestamp |
    +---------+-----------+
    4 rows in set (0.00 sec)
    
    mysql> select * from master where match ('MySQL复制') limit 10;
    +------+---------+
    | id   | created |
    +------+---------+
    | 2831 |    2013 |
    | 2834 |    2013 |
    | 2844 |    2013 |
    | 2845 |    2013 |
    | 2846 |    2013 |
    | 2847 |    2013 |
    | 2848 |    2013 |
    | 2849 |    2013 |
    | 2850 |    2013 |
    | 2851 |    2013 |
    +------+---------+
    10 rows in set (0.01 sec)
    
    mysql> show meta;
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | total         | 1000  |
    | total_found   | 1144  |
    | time          | 0.005 |
    | keyword[0]    | mysql |
    | docs[0]       | 2375  |
    | hits[0]       | 60076 |
    | keyword[1]    | 复    |
    | docs[1]       | 1346  |
    | hits[1]       | 6077  |
    | keyword[2]    | 制    |
    | docs[2]       | 1845  |
    | hits[2]       | 7607  |
    +---------------+-------+
    12 rows in set (0.00 sec)
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值