How to Implement High-Performance Chinese Text Search

Preface

Baidu is the most powerful Chinese search engine. Imagine what life would be like without it: we would struggle to quickly find the pop songs, sports events, Hollywood blockbusters, travel destinations, doctors and hospitals, books and courses that daily life runs on; and a low-grade coder like me would have no tech blogs to copy ready-made code from, and would surely be out of a job. Baidu is a high-performance retrieval system: it has to crawl and store as much of the web as possible, yet still answer a user's search in a fraction of a second, which is anything but simple. We can't build anything as formidable as Baidu, but how far can we get toward high-performance search with open-source technology? What follows are my notes from studying MySQL, MongoDB, ElasticSearch, and PostgreSQL.

MySQL Implementation

Installing MySQL
docker pull hub.c.163.com/library/mysql
docker run -di --name mysql -p 33306:3306 -e MYSQL_ROOT_PASSWORD=root hub.c.163.com/library/mysql

Enter the container and grant remote access
docker exec -it mysql /bin/bash
mysql -u root -p
grant all privileges on *.* to root@"%" identified by "root" with grant option;
flush privileges;
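
With the grant in place, you should be able to connect from the host through the mapped port. A quick check (127.0.0.1 stands in for whatever your docker host address is):
mysql -h 127.0.0.1 -P 33306 -u root -proot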

With an ordinary B+ tree index, a LIKE query whose pattern starts with % cannot use the index at all, so it falls back to a full table scan, which is far too slow for high-performance fuzzy matching. Furthermore, a B+ tree index on a TEXT column must be declared with a prefix length; a pattern longer than that prefix cannot be fully resolved by the index, and the longer the prefix, the slower retrieval should presumably be.

CREATE TABLE `article` (
	`id` INT(11) NOT NULL AUTO_INCREMENT,
	`title` VARCHAR(200) NULL DEFAULT NULL,
	`body` TEXT NULL,
	PRIMARY KEY (`id`),
	INDEX `title` (`title`),
	INDEX `body` (`body`(100))
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
INSERT INTO `article` (`id`, `title`, `body`) VALUES (1, '靓仔', '凌章是一个靓仔');

-- uses the index (ref); row is found
explain select * from article where title = '靓仔';
-- uses the index (range); found (a pattern longer than the 100-character prefix presumably behaves differently; see the sketch below)
explain select * from article where body like '凌章是一个靓仔';
-- uses the index (range); found
explain select * from article where body like '凌章是%';
-- uses the index (range); not found
explain select * from article where body like '一个靓仔%';
-- does not use the index (ALL); found
explain select * from article where body like '%一个靓仔';
-- does not use the index (ALL); found
explain select * from article where body like '%一个靓仔%';
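
A hedged aside on the prefix-length point: with the 100-character prefix index on body, the optimizer can presumably still range-scan on the first 100 characters of a longer prefix pattern, but each candidate row then has to be re-checked against the full pattern, so the index no longer answers the query on its own. The 101-character pattern below is purely illustrative:

explain select * from article where body like CONCAT(REPEAT('靓', 101), '%');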

For fast fuzzy matching, MySQL has to use a FULLTEXT index, which is likewise built on an inverted index and can deliver real performance. It handles English input reasonably well, but the default parser cannot tokenize Chinese (MySQL 5.7.6+ does ship a built-in ngram parser for CJK; see the sketch after the examples below). Note also that before MySQL 5.6.4, FULLTEXT indexes were only available on MyISAM tables.

CREATE TABLE `article_fulltext` (
	`id` INT(11) NOT NULL AUTO_INCREMENT,
	`title` VARCHAR(200) NULL DEFAULT NULL,
	`body` TEXT NULL,
	PRIMARY KEY (`id`),
	INDEX `title` (`title`),
	FULLTEXT INDEX `body` (`body`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
INSERT INTO `article_fulltext` (`id`, `title`, `body`) VALUES (1, '靓仔', '凌章是一个靓仔');
INSERT INTO `article_fulltext` (`id`, `title`, `body`) VALUES (2, 'goodlooking', 'ling is goodlooking boy');

-- uses the index (fulltext); found
explain select * from article_fulltext where match(body) AGAINST('凌章是一个靓仔' IN BOOLEAN MODE);
-- does not use the index (ALL); found
explain select * from article_fulltext where body like '凌章是%';
-- uses the index (fulltext); found
explain select * from article_fulltext where match(body) AGAINST('+goodlooking' IN BOOLEAN MODE);
-- uses the index (fulltext); not found: poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('+凌章' IN BOOLEAN MODE);
-- uses the index (fulltext); found
explain select * from article_fulltext where match(body) AGAINST('>goodlooking' IN BOOLEAN MODE);
-- uses the index (fulltext); not found: poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('>凌章' IN BOOLEAN MODE);
-- uses the index (fulltext); found
explain select * from article_fulltext where match(body) AGAINST('goodlooking hello' IN BOOLEAN MODE);
explain select * from article_fulltext where match(body) AGAINST('hello boy' IN BOOLEAN MODE);
-- uses the index (fulltext); not found: poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('凌章 靓仔' IN BOOLEAN MODE);
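
The ngram caveat mentioned above, as a minimal sketch. MySQL 5.7.6+ ships a built-in ngram full-text parser that can tokenize CJK text, so the "no Chinese tokenization" limitation only applies to the default parser. This assumes a 5.7.6+ server (the hub.c.163.com image above may be older):

CREATE TABLE `article_ngram` (
	`id` INT(11) NOT NULL AUTO_INCREMENT,
	`body` TEXT NULL,
	PRIMARY KEY (`id`),
	FULLTEXT INDEX `body` (`body`) WITH PARSER ngram
) ENGINE=InnoDB;
INSERT INTO `article_ngram` (`id`, `body`) VALUES (1, '凌章是一个靓仔');
-- with the ngram parser, a two-character Chinese term should now be found
explain select * from article_ngram where match(body) AGAINST('+凌章' IN BOOLEAN MODE);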

MongoDB Implementation

MongoDB also offers full-text indexes. MongoDB 3.2 reportedly added Chinese support for text indexes, but when I tried it, it didn't seem to work; as far as I can tell, the Chinese text-search languages are only available in MongoDB Enterprise, not in the community image used here. MongoDB can also match with regular expressions, which is essentially the database's equivalent of a LIKE query, and a non-anchored regex presumably cannot use an index to speed up the query either (see the sketch after the examples below).

docker pull mongo
docker run -p 27017:27017  --name mongo -d mongo
docker exec -it mongo /bin/bash
Enter the mongo shell
mongo
# switch to the admin database
use admin
# create an administrator user
db.createUser(
   {
     user: "admin",
     pwd: "admin",
     roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
   }
 )
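
A quick hedged check that the new user works (authentication is only enforced if the container was started with --auth; the credentials are the ones created above):
mongo -u admin -p admin --authenticationDatabase admin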

use chinese_search
db.article_fulltext.insert({title:"靓仔",body:"凌章是一个靓仔"});
db.article_fulltext.insert({title:"goodlooking",body:"ling is a goodlooking body"});
db.article_fulltext.createIndex({body:"text"});
// uses the index (IXSCAN); found
db.article_fulltext.find({$text:{$search:"goodlooking"}}).explain();
// uses the index (IXSCAN); but nothing is found: Chinese is not tokenized
db.article_fulltext.find({$text:{$search:"靓仔"}}).explain();
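
To back up the regex claim above, a short sketch (the plain ascending index on body is added here just for this test): a case-sensitive regex anchored at the start of the string can use an ordinary index, while a non-anchored regex scans the whole collection, mirroring LIKE '凌章%' versus LIKE '%靓仔' in MySQL.

db.article_fulltext.createIndex({body: 1});
// anchored prefix regex: can use the index (IXSCAN)
db.article_fulltext.find({body: /^凌章/}).explain();
// non-anchored regex: full collection scan (COLLSCAN)
db.article_fulltext.find({body: /一个靓仔/}).explain();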

PostgreSQL Implementation

PostgreSQL supports both ordinary MySQL-style indexes and full-text indexes (GIN), and a Chinese tokenizer plugin (zhparser) can be installed. PostgreSQL really is the most capable open-source database, with geospatial search, window functions, and more; when writing this post I hadn't expected it to support tokenized Chinese search as well. Even so, for site-wide vertical search we should still reach for ElasticSearch: the database is usually the biggest performance bottleneck and shouldn't be burdened with this extra work.

Installing PostgreSQL
docker pull hub.c.163.com/library/postgres
docker run --name postgres1 -e POSTGRES_PASSWORD=password -p 5432:5432 -d hub.c.163.com/library/postgres
Connect with a client
create database chinese_search;
-- connect to the new database first (psql: \c chinese_search), then create the extension in it
create extension pg_trgm;
-- the sequence referenced by the id default must exist before the table
CREATE SEQUENCE article_id_seq;
CREATE TABLE public.article
(
    id integer NOT NULL DEFAULT nextval('article_id_seq'::regclass),
    title character varying(200) COLLATE pg_catalog."default",
    body tsvector,
    CONSTRAINT article_pkey PRIMARY KEY (id)
)
WITH (
    OIDS = FALSE
)
TABLESPACE pg_default;

ALTER TABLE public.article
    OWNER to postgres;

CREATE INDEX gin_body
    ON public.article USING gin
    (body)
    TABLESPACE pg_default;

-- note: an untyped string literal inserted into a tsvector column is parsed as
-- space-separated lexemes without normalization; in practice you would use to_tsvector()
INSERT INTO article (id, title, body) VALUES (1, '靓仔', '凌章是一个靓仔');
INSERT INTO article (id, title, body) VALUES (2, 'goodlooking', 'ling is a goodlooking boy');

-- uses the GIN index; row 2 is found
EXPLAIN SELECT * FROM article WHERE body @@ to_tsquery('ling');
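
Two follow-ups the setup above hints at but never shows. First, the pg_trgm extension created earlier can make even unanchored LIKE '%...%' indexable through a trigram GIN index. Second, a hedged sketch of zhparser, assuming the extension has been compiled and installed (the configuration name "chinese" is my own example):

-- pg_trgm: a trigram GIN index accelerates unanchored LIKE on a plain text column
CREATE TABLE article_trgm (id serial PRIMARY KEY, body text);
CREATE INDEX trgm_body ON article_trgm USING gin (body gin_trgm_ops);
EXPLAIN SELECT * FROM article_trgm WHERE body LIKE '%goodlooking%';

-- zhparser: Chinese tokenization for PostgreSQL full-text search
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION chinese (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION chinese ADD MAPPING FOR n,v,a,i,e,l WITH simple;
SELECT to_tsvector('chinese', '凌章是一个靓仔') @@ to_tsquery('chinese', '靓仔');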

ElasticSearch Implementation

Finally, the main event: ElasticSearch. It is an open-source search engine built on Lucene that hides away the rather complex and specialised internals of information retrieval. It, too, gets its query performance from inverted indexes, and a Chinese analyzer plugin gives it tokenized Chinese search, which makes it the ideal technology here.

Building an ES sharded cluster and installing the IK Chinese analyzer
Install ES
docker pull hub.c.163.com/library/elasticsearch

es1.yml configuration (es2.yml and es3.yml below differ only in node name and ports)
cluster.name: elasticsearch-cluster
node.name: es-node1
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9200
transport.tcp.port: 9300
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true 
node.data: true  
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2

es2.yml configuration
cluster.name: elasticsearch-cluster
node.name: es-node2
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9201
transport.tcp.port: 9301
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true 
node.data: true  
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2

es3.yml configuration
cluster.name: elasticsearch-cluster
node.name: es-node3
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9202
transport.tcp.port: 9302
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true 
node.data: true  
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2

Start the containers with docker
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 -v /root/es-cluster/es1.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data1:/usr/share/elasticsearch/data --name es01 hub.c.163.com/library/elasticsearch
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9201:9201 -p 9301:9301 -v /root/es-cluster/es2.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data2:/usr/share/elasticsearch/data --name es02 hub.c.163.com/library/elasticsearch
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9202:9202 -p 9302:9302 -v /root/es-cluster/es3.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data3:/usr/share/elasticsearch/data --name es03 hub.c.163.com/library/elasticsearch
docker run --name kibana -e ELASTICSEARCH_URL=http://192.168.198.141:9200 -p 5601:5601 -d hub.c.163.com/library/kibana

Check the node status
http://192.168.198.141:9200/_cat/nodes?pretty
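
Once the cluster has formed, all three nodes should appear in the _cat/nodes output, with the elected master marked by a * in the master column.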

Enter each container and install the IK analyzer
docker exec -it es01 /bin/bash
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip
You can also download the zip directly and unzip it into the plugins directory; restart each node after installing.
PUT article_ik
{
  "mappings": {
    "properties": {
      "title": {
        "type":     "text",
        "analyzer": "ik_smart"
      },
      "body": {
        "type":     "text",
        "analyzer": "ik_smart"
      }
    }
  }
}

PUT /article_ik/_doc/1
{
  "title":"靓仔",
  "body":"凌章是一个靓仔" 
}

PUT /article_ik/_doc/2
{
  "title":"goodlooking",
  "body":"ling is a goodlooking boy" 
}

GET /article_ik/_search
{
  "query": {
    "match": {
      "body": "凌章"
    }
  }
}

GET /article_ik/_search
{
  "query": {
    "match": {
      "body": "靓仔"
    }
  }
}

Inspecting the tokenization
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "凌章是一个靓仔"
}
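
If the IK plugin is installed correctly, the _analyze response should contain word-level tokens (roughly 凌章 / 是 / 一个 / 靓仔, depending on the IK dictionary) rather than the single-character tokens the standard analyzer produces for Chinese.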