scws sphinx mysql_php+中文分词scws+sphinx+mysql打造千万级数据全文搜索

Sphinx是由俄罗斯人Andrew Aksyonoff开发的一个全文检索引擎。意图为其他应用提供高速、低空间占用、高结果 相关度的全文搜索功能。Sphinx可以非常容易的与SQL数据库和脚本语言集成。当前系统内置MySQL和PostgreSQL 数据库数据源的支持,也支持从标准输入读取特定格式 的XML数据。

Sphinx创建索引的速度为:创建100万条记录的索引只需3~4分钟,创建1000万条记录的索引可以在50分钟内完成,而只包含最新10万条记录的增量索引,重建一次只需几十秒。

Sphinx的特性如下:

a) 高速的建立索引(在当代CPU上,峰值性能可达到10 MB/秒);

b) 高性能的搜索(在2 – 4GB 的文本数据上,平均每次检索响应时间小于0.1秒);

c) 可处理海量数据(目前已知可以处理超过100 GB的文本数据, 在单一CPU的系统上可处理100 M 文档);

d) 提供了优秀的相关度算法,基于短语相似度和统计(BM25)的复合Ranking方法;

e) 支持分布式搜索;

f) 支持短语搜索

g) 提供文档摘要生成

h) 可作为MySQL的存储引擎提供搜索服务;

i) 支持布尔、短语、词语相似度等多种检索模式;

j) 文档支持多个全文检索字段(最大不超过32个);

k) 文档支持多个额外的属性信息(例如:分组信息,时间戳等);

l) 支持断词;

虽然mysql的MYISAM提供全文索引,但是性能却不敢让人恭维

开始搭建

系统环境:centos6.5+php5.6+apache+mysql

1、安装依赖包

1 yum -y install make gcc g++ gcc-c++ libtool autoconf automake imake php-devel mysql-devel libxml2-devel expat-devel

2、安装Sphinx

1 yum install expat expat-devel2 wget -c http://sphinxsearch.com/files/sphinx-2.0.7-release.tar.gz

3 tar zxvf sphinx-2.0.7-release.tar.gz4 cd sphinx-2.0.7-release5 ./configure --prefix=/usr/local/sphinx --with-mysql --with-libexpat --enable-id646 make && make install

3、安装libsphinxclient,php扩展用到

1 cd api/libsphinxclient2 ./configure --prefix=/usr/local/sphinx/libsphinxclient3 make && make install

4、安装Sphinx的PHP扩展:我的是5.6需装sphinx-1.3.3.tgz,如果是php5.4以下可sphinx-1.3.0.tgz

wget-c http://pecl.php.net/get/sphinx-1.3.3.tgz

tar zxvf sphinx-1.3.3.tgz

cd sphinx-1.3.3phpize./configure --with-sphinx=/usr/local/sphinx/libsphinxclient/ --with-php-config=/usr/bin/php-config

make&&make install

成功后会提示:

Installing shared extensions: /usr/lib64/php/modules/

echo "[Sphinx]" >> /etc/php.iniecho "extension = sphinx.so" >> /etc/php.ini#重启apache

service httpd restart

5、创建测试数据

CREATE TABLE IFNOT EXISTS `items` (

`id` int(11) NOT NULL AUTO_INCREMENT,`title` varchar(255) NOT NULL,`content` text NOTNULL,`created` datetime NOTNULL,PRIMARYKEY(`id`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='全文检索测试的数据表' AUTO_INCREMENT=11;

INSERT INTO `items` (`id`, `title`, `content`,`created`) VALUES

(1, 'linux mysql集群安装', 'MySQL Cluster 是MySQL 适合于分布式计算环境的高实用、可拓展、高性能、高冗余版本', '2016-09-07 00:00:00'),(2, 'mysql主从复制', 'mysql主从备份(复制)的基本原理 mysql支持单向、异步复制,复制过程中一个服务器充当主服务器,而一个或多个其它服务器充当从服务器', '2016-09-06 00:00:00'),(3, 'hello', 'can you search me?', '2016-09-05 00:00:00'),(4, 'mysql', 'mysql is the best database?', '2016-09-03 00:00:00'),(5, 'mysql索引', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '2016-09-01 00:00:00'),(6, '集群', '关于MySQL索引的好处,如果正确合理设计并且使用索引的MySQL是一辆兰博基尼的话,那么没有设计和使用索引的MySQL就是一个人力三轮车', '0000-00-00 00:00:00'),(9, '复制原理', 'redis也有复制', '0000-00-00 00:00:00'),(10, 'redis集群', '集群技术是构建高性能网站架构的重要手段,试想在网站承受高并发访问压力的同时,还需要从海量数据中查询出满足条件的数据,并快速响应,我们必然想到的是将数据进行切片,把数据根据某种规则放入多个不同的服务器节点,来降低单节点服务器的压力', '0000-00-00 00:00:00');

CREATE TABLEIFNOT EXISTS `sph_counter` (

`counter_id` int(11) NOT NULL,`max_doc_id` int(11) NOT NULL,PRIMARYKEY(`counter_id`)

) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='增量索引标示的计数表';

以下采用"Main + Delta" ("主索引"+"增量索引")的索引策略,使用Sphinx自带的一元分词。

6、Sphinx配置:注意修改数据源配置信息

vi /usr/local/sphinx/etc/sphinx.conf

source items {

type= mysqlsql_host=localhost

sql_user=root

sql_pass= 123456sql_db=sphinx_items

sql_query_pre=SET NAMES utf8

sql_query_pre= SET SESSION query_cache_type =OFF

sql_query_pre= REPLACE INTO sph_counter SELECT 1, MAX(id) FROM items

sql_query_range= SELECT MIN(id), MAX(id) FROM items \

WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

sql_range_step= 1000sql_ranged_throttle= 1000sql_query= SELECT id, title, content, created, 0 asdeleted FROM items \

WHERE id<=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1) \

AND id>= $start AND id <= $endsql_attr_timestamp=created

sql_attr_bool=deleted

}

source items_delta:items {

sql_query_pre=SET NAMES utf8

sql_query_range= SELECT MIN(id), MAX(id) FROM items \

WHERE id> (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

sql_query= SELECT id, title, content, created, 0 asdeleted FROM items \

WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1) \

AND id>= $start AND id <= $endsql_query_post_index= set @max_doc_id :=(SELECT max_doc_id FROM sph_counter WHERE counter_id=1)

sql_query_post_index= REPLACE INTO sph_counter SELECT 2, IF($maxid, $maxid,@max_doc_id)

}#主索引

index items {

source=items

path= /usr/local/sphinx/var/data/items

docinfo=extern

morphology=none

min_word_len= 1min_prefix_len= 0html_strip= 1html_remove_elements= style,script

ngram_len= 1ngram_chars= U+3000..U+2FA1F

charset_type= utf-8charset_table= 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F

preopen= 1min_infix_len= 1}#增量索引

index items_delta :items {

source=items_delta

path= /usr/local/sphinx/var/data/items-delta

}#分布式索引

index master {

type=distributed

local=items

local=items_delta

}

indexer {

mem_limit=256M

}

searchd {

listen= 9312listen= 9306:mysql41 #Used for SphinxQL

log = /usr/local/sphinx/var/log/searchd.logquery_log= /usr/local/sphinx/var/log/query.logcompat_sphinxql_magics= 0attr_flush_period= 600mva_updates_pool=16M

read_timeout= 5max_children= 0dist_threads= 2pid_file= /usr/local/sphinx/var/log/searchd.pid

max_matches= 1000seamless_rotate= 1preopen_indexes= 1unlink_old= 1workers= threads #for RT to work

binlog_path = /usr/local/sphinx/var/data

}

保存退出

7、Sphinx创建索引

#第一次需重建索引:

[root@localhost bin]#./indexer -c /usr/local/sphinx/etc/sphinx.conf --all

Sphinx 2.0.7-id64-release (r3759)

Copyright (c)2001-2012,Andrew Aksyonoff

Copyright (c)2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using configfile '/usr/local/sphinx/etc/sphinx.conf'...indexing index'items'...collected8 docs, 0.0MB

sorted0.0 Mhits, 100.0% donetotal8 docs, 1121bytes

total1.017 sec, 1101 bytes/sec, 7.86 docs/sec

indexing index'items_delta'...collected0 docs, 0.0MB

total0 docs, 0bytes

total1.007 sec, 0 bytes/sec, 0.00 docs/sec

skipping non-plain index 'master'...total4 reads, 0.000 sec, 0.7 kb/call avg, 0.0 msec/call avg

total14 writes, 0.001 sec, 0.5 kb/call avg, 0.1 msec/call avg#启动sphinx

[root@localhost bin]#./searchd -c /usr/local/sphinx/etc/sphinx.conf

Sphinx 2.0.7-id64-release (r3759)

Copyright (c)2001-2012,Andrew Aksyonoff

Copyright (c)2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using configfile '/usr/local/sphinx/etc/sphinx.conf'...listening on all interfaces, port=9312listening on all interfaces, port=9306precaching index'items'precaching index'items_delta'rotating index'items_delta':success

precached2 indexes in 0.012 sec

#查看进程

[root@localhost bin]#ps -ef | grep searchd

root 30431 1 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf

root30432 30431 0 23:59 ? 00:00:00 ./searchd -c /usr/local/sphinx/etc/sphinx.conf

root30437 1490 0 23:59 pts/0 00:00:00grep searchd#停止Searchd:

./searchd -c /usr/local/sphinx/etc/sphinx.conf --stop#查看Searchd状态:

./searchd -c /usr/local/sphinx/etc/sphinx.conf --status

索引更新及使用说明

"增量索引"每N分钟更新一次.通常在每天晚上低负载的时进行一次索引合并,同时重新建立"增量索引"。当然"主索引"数据不多的话,也可以直接重新建立"主索引"。

API搜索的时,同时使用"主索引"和"增量索引",这样可以获得准实时的搜索数据.本文的Sphinx配置将"主索引"和"增量索引"放到分布式索引master中,因此只需查询分布式索引"master"即可获得全部匹配数据(包括最新数据)。

索引的更新与合并的操作可以放到cron job完成:

crontab -e*/1 * * * * /usr/local/sphinx/shell/delta_index_update.sh0 3 * * * /usr/local/sphinx/shell/merge_daily_index.sh

crontab-l

cron job所用的shell脚本例子:delta_index_update.sh:[php] view plaincopy

#!/bin/bash

/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta > /dev/null 2>&1merge_daily_index.sh:[php] view plaincopy

#!/bin/bash

indexer=`which indexer`mysql=`which mysql`

QUERY="use sphinx_items;select max_doc_id from sph_counter where counter_id = 2 limit 1;"index_counter=$($mysql -h192.168.1.198 -uroot -p123456 -sN -e "$QUERY")#merge "main + delta" indexes

$indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate --merge items items_delta --merge-dst-range deleted 0 0 >> /usr/local/sphinx/var/index_merge.log 2>&1

if [ "$?" -eq 0]; then##update sphinx counter

if [ ! -z $index_counter]; then$mysql -h192.168.1.198 -uroot -p123456 -Dsphinx_items -e "REPLACE INTO sph_counter VALUES (1, '$index_counter')"fi##rebuild delta index to avoid confusion with main index

$indexer -c /usr/local/sphinx/etc/sphinx.conf --rotate items_delta >> /usr/local/sphinx/var/rebuild_deltaindex.log 2>&1fi

8、php中文分词scws安装:注意扩展的版本和php的版本

wget -c http://www.xunsearch.com/scws/down/scws-1.2.3.tar.bz2

tar jxvf scws-1.2.3.tar.bz2

cd scws-1.2.3

./configure --prefix=/usr/local/scws

make&& make install

9、scws的PHP扩展安装:

cd ./phpext

phpize./configure

make&&make installecho "[scws]" >> /etc/php.iniecho "extension = scws.so" >> /etc/php.iniecho "scws.default.charset = utf-8" >> /etc/php.iniecho "scws.default.fpath = /usr/local/scws/etc/" >> /etc/php.ini

10、词库安装:

wget http://www.xunsearch.com/scws/down/scws-dict-chs-utf8.tar.bz2

tar jxvf scws-dict-chs-utf8.tar.bz2 -C /usr/local/scws/etc/

chown www:www /usr/local/scws/etc/dict.utf8.xdb

11、php使用Sphinx+scws测试例子

在Sphinx源码API中,有好几种语言的API调用.其中有一个是sphinxapi.php。

不过以下的测试使用的是Sphinx的PHP扩展.具体安装见本文开头的Sphinx安装部分。

测试用的搜索类Search.php:注意修改getDBConnection()信息为自己的

* @var SphinxClient

**/

protected $client;/**

* @var string

**/

protected $keywords;/**

* @var resource

**/

private static $dbconnection = null;/**

* Constructor

**/

public function __construct($options = array()) {$defaults = array('query_mode' => SPH_MATCH_EXTENDED2,

'sort_mode' => SPH_SORT_EXTENDED,

'ranking_mode' => SPH_RANK_PROXIMITY_BM25,

'field_weights' => array(),

'max_matches' => 1000,

'snippet_enabled' => true,

'snippet_index' => 'items',

'snippet_fields' => array(),);$this->options = array_merge($defaults, $options);$this->client = newSphinxClient();//$this->client->setServer("192.168.1.198", 9312);

$this->client->setMatchMode($this->options['query_mode']);if ($this->options['field_weights'] !== array()) {$this->client->setFieldWeights($this->options['field_weights']);

}/*if ( in_array($this->options['query_mode'], [SPH_MATCH_EXTENDED2,SPH_MATCH_EXTENDED]) ) {

$this->client->setRankingMode($this->options['ranking_mode']);

}*/}/**

* Query

*

* @param string $keywords

* @param integer $offset

* @param integer $limit

* @param string $index

* @return array

**/

public function query($keywords, $offset = 0, $limit = 10, $index = '*') {$this->keywords = $keywords;$max_matches = $limit > $this->options['max_matches'] ? $limit : $this->options['max_matches'];$this->client->setLimits($offset, $limit, $max_matches);$query_results = $this->client->query($keywords, $index);if ($query_results === false) {$this->log('error:' . $this->client->getLastError());

}$res =[];if ( empty($query_results['matches']) ) {return $res;

}$res['total'] = $query_results['total'];$res['total_found'] = $query_results['total_found'];$res['time'] = $query_results['time'];$doc_ids = array_keys($query_results['matches']);unset($query_results);$res['data'] = $this->fetch_data($doc_ids);if ($this->options['snippet_enabled']) {$this->buildExcerptRows($res['data']);

}return $res;

}/**

* custom sorting

*

* @param string $sortBy

* @param int $mode

* @return bool

**/

public function setSortBy($sortBy = '', $mode = 0) {if ($sortBy) {$mode = $mode ?: $this->options['sort_mode'];$this->client->setSortMode($mode, $sortBy);

}else{$this->client->setSortMode(SPH_SORT_RELEVANCE);

}

}/**

* fetch data based on matched doc_ids

*

* @param array $doc_ids

* @return array

**/

protected function fetch_data($doc_ids) {$ids = implode(',', $doc_ids);$queries = self::getDBConnection()->query("SELECT * FROM items WHERE id in ($ids)", PDO::FETCH_ASSOC);return iterator_to_array($queries);

}/**

* build excerpts for data

*

* @param array $rows

* @return array

**/

protected function buildExcerptRows(&$rows) {$options = array('before_match' => '',

'after_match' => '',

'chunk_separator' => '...',

'limit' => 256,

'around' => 3,

'exact_phrase' => false,

'single_passage' => true,

'limit_words' => 5,);$scount = count($this->options['snippet_fields']);foreach ($rows as &$row) {foreach ($row as $fk => $item) {if (!is_string($item) || ($scount && !in_array($fk, $this->options['snippet_fields'])) ) continue;$item = preg_replace('/[\r\t\n]+/', '', strip_tags($item));$res = $this->client->buildExcerpts(array($item), $this->options['snippet_index'], $this->keywords, $options);$row[$fk] = $res === false ? $item : $res[0];

}

}return $rows;

}/**

* database connection

*

* @return resource

**/

private static functiongetDBConnection() {$dsn = 'mysql:host=192.168.1.198;dbname=sphinx_items';$user = 'root';$pass = '123456';if (!self::$dbconnection) {try{

self::$dbconnection = new PDO($dsn, $user, $pass);

}catch (PDOException $e) {die('Connection failed: ' . $e->getMessage());

}

}return self::$dbconnection;

}/**

* Chinese words segmentation

*

**/

public function wordSplit($keywords) {$fpath = ini_get('scws.default.fpath');$so =scws_new();$so->set_charset('utf-8');$so->add_dict($fpath . '/dict.utf8.xdb');//$so->add_dict($fpath .'/custom_dict.txt', SCWS_XDICT_TXT);

$so->set_rule($fpath . '/rules.utf8.ini');$so->set_ignore(true);$so->set_multi(false);$so->set_duality(false);$so->send_text($keywords);$words =[];$results = $so->get_result();foreach ($results as $res) {$words[] = '(' . $res['word'] . ')';

}$words[] = '(' . $keywords . ')';return join('|', $words);

}/**

* get current sphinx client

*

* @return resource

**/

public functiongetClient() {return $this->client;

}/**

* log error

**/

public function log($msg) {//log errors here

//echo $msg;

}/**

* magic methods

**/

public function __call($method, $args) {$rc = new ReflectionClass('SphinxClient');if ( !$rc->hasMethod($method) ) {throw new Exception('invalid method :' . $method);

}return call_user_func_array(array($this->client, $method), $args);

}

}

测试文件test.php:

<?phprequire ('Search.php');$s = newSearch(['snippet_fields' => ['title', 'content'],

'field_weights' => ['title' => 20, 'content' => 10],]);$s->setSortMode(SPH_SORT_EXTENDED, 'created desc,@weight desc');//$s->setSortBy('created desc,@weight desc');

$words = $s->wordSplit("mysql集群");//先分词 结果:(mysql)|(mysql集群)

//print_r($words);exit;

$res = $s->query($words, 0, 10, 'master');echo '

';print_r($res); 
 

测试结果:

0c7bfefabfd20c5d9d871722029024c3.png

12、SphinxQL测试

要使用SphinxQL需要在Searchd的配置里面增加相应的监听端口(参考上文配置)。

[root@localhost bin]#mysql -h127.0.0.1 -P9306 -uroot -p

Enter password:Welcome to theMySQL monitor. Commands end with ; or \g.YourMySQL connection id is 1Server version: 2.0.7-id64-release (r3759)

Copyright (c)2000, 2016, Oracle and/or its affiliates. All rights reserved.Oracle is a registered trademark of Oracle Corporation and/or its

affiliates.Other names may be trademarks of their respective

owners.Type'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show globalvariables;+----------------------+---------+

| Variable_name | Value |

+----------------------+---------+

| autocommit | 1 |

| collation_connection | libc_ci |

| query_log_format | plain |

| log_level | info |

+----------------------+---------+

4 rows in set (0.00sec)mysql>desc items;+---------+-----------+

| Field | Type |

+---------+-----------+

| id | bigint |

| title | field |

| content | field |

| created | timestamp |

| deleted | bool |

+---------+-----------+

5 rows in set (0.00sec)mysql> select * from master where match ('mysql集群') limit 10;+------+---------+---------+

| id | created | deleted |

+------+---------+---------+

| 1 | 2016 | 0 |

| 6 | 0 | 0 |

+------+---------+---------+

2 rows in set (0.00sec)mysql>show meta;+---------------+-------+

| Variable_name | Value |

+---------------+-------+

| total | 2 |

| total_found | 2 |

| time | 0.006 |

| keyword[0] | mysql |

| docs[0] | 5 |

| hits[0] | 15 |

| keyword[1] | 集 |

| docs[1] | 3 |

| hits[1] | 4 |

| keyword[2] | 群 |

| docs[2] | 3 |

| hits[2] | 4 |

+---------------+-------+

12 rows in set (0.00sec)mysql>

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值