Preface
Baidu is the dominant Chinese search engine. Imagine life without it: it would be much harder to quickly find the things we look up every day — pop songs, sports events, Hollywood movies, travel destinations, doctors and hospitals, books and courses. A low-level coder like me, no longer able to copy ready-made snippets from tech blogs, would be out of a job. Baidu is a high-performance retrieval system: it has to crawl and store as much of the web as it can, yet still answer a query in a fraction of a second, which is no small feat. We cannot build a Baidu, but how far can we get toward high-performance search with open-source tools? What follows are my notes from studying MySQL, MongoDB, ElasticSearch, and PostgreSQL.
MySQL Implementation
Install MySQL
docker pull hub.c.163.com/library/mysql
docker run -di --name mysql -p 33306:3306 -e MYSQL_ROOT_PASSWORD=root hub.c.163.com/library/mysql
Enter the container and grant privileges
docker exec -it mysql /bin/bash
mysql -u root -p
grant all privileges on *.* to root@"%" identified by "root" with grant option;
flush privileges;
With a B+ tree index, a LIKE query whose pattern starts with % cannot use the index, and the resulting full table scan is slow, so leading-wildcard fuzzy matching cannot be made fast this way. Moreover, a B+ tree index on a TEXT column must specify a prefix length; once the search string exceeds that prefix the index stops finding matches, and the longer the indexed prefix, the slower lookups should be.
CREATE TABLE `article` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`title` VARCHAR(200) NULL DEFAULT NULL,
`body` TEXT NULL,
PRIMARY KEY (`id`),
INDEX `title` (`title`),
INDEX `body` (`body`(100))
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
INSERT INTO `article` (`id`, `title`, `body`) VALUES (1, '靓仔', '凌章是一个靓仔');
-- uses the index (type: ref); the row is found
explain select * from article where title = '靓仔';
-- uses the index (type: range); the row is found, but a pattern longer than the 100-character prefix would not be
explain select * from article where body like '凌章是一个靓仔';
-- uses the index (type: range); the row is found
explain select * from article where body like '凌章是%';
-- uses the index (type: range); no row is found
explain select * from article where body like '一个靓仔%';
-- full table scan (type: ALL); the row is found
explain select * from article where body like '%一个靓仔';
-- full table scan (type: ALL); the row is found
explain select * from article where body like '%一个靓仔%';
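The contrast above can be sketched in Python. This is a toy model of a sorted index, not MySQL internals: a B+ tree keeps keys in order, so a prefix pattern maps to one contiguous key range that binary search can locate, while a leading wildcard matches keys scattered anywhere and forces a scan of every entry.

```python
import bisect

# Toy model of a B+ tree leaf level: index entries kept in sorted order.
keys = sorted(["apple", "apricot", "banana", "grape", "grapefruit"])

def prefix_range(keys, prefix):
    """LIKE 'prefix%': O(log n) to locate the range, then read it off."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # just past the prefix block
    return keys[lo:hi]

def suffix_scan(keys, suffix):
    """LIKE '%suffix': the sort order does not help, every key is examined."""
    return [k for k in keys if k.endswith(suffix)]

print(prefix_range(keys, "grape"))   # ['grape', 'grapefruit']
print(suffix_scan(keys, "fruit"))    # ['grapefruit']
```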
For fast fuzzy search MySQL needs a FULLTEXT index, which is also built on an inverted index and performs well. It handles English input reasonably, but the default parser cannot segment Chinese text into words (MySQL 5.7.6 and later ship an ngram parser for CJK, not used here). Also, before MySQL 5.6, FULLTEXT indexes were only supported on MyISAM tables.
CREATE TABLE `article_fulltext` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`title` VARCHAR(200) NULL DEFAULT NULL,
`body` TEXT NULL,
PRIMARY KEY (`id`),
INDEX `title` (`title`),
FULLTEXT INDEX `body` (`body`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;
INSERT INTO `article_fulltext` (`id`, `title`, `body`) VALUES (1, '靓仔', '凌章是一个靓仔');
INSERT INTO `article_fulltext` (`id`, `title`, `body`) VALUES (2, 'goodlooking', 'ling is goodlooking boy');
-- uses the fulltext index; the row is found
explain select * from article_fulltext where match(body) AGAINST('凌章是一个靓仔' IN BOOLEAN MODE);
-- no index (type: ALL); the row is found
explain select * from article_fulltext where body like '凌章是%';
-- uses the fulltext index; the row is found
explain select * from article_fulltext where match(body) AGAINST('+goodlooking' IN BOOLEAN MODE);
-- uses the fulltext index; no row is found — poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('+凌章' IN BOOLEAN MODE);
-- uses the fulltext index; the row is found
explain select * from article_fulltext where match(body) AGAINST('>goodlooking' IN BOOLEAN MODE);
-- uses the fulltext index; no row is found — poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('>凌章' IN BOOLEAN MODE);
-- uses the fulltext index; the row is found
explain select * from article_fulltext where match(body) AGAINST('goodlooking hello' IN BOOLEAN MODE);
explain select * from article_fulltext where match(body) AGAINST('hello boy' IN BOOLEAN MODE);
-- uses the fulltext index; no row is found — poor Chinese support
explain select * from article_fulltext where match(body) AGAINST('凌章 靓仔' IN BOOLEAN MODE);
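Why the Chinese queries fail can be sketched in Python. This is an illustration, not MySQL's actual parser: the default fulltext parser splits on whitespace and punctuation, so an unsegmented Chinese sentence is indexed as one opaque token; a character-bigram tokenizer (the idea behind MySQL's ngram parser and, more elaborately, analyzers like IK) makes the words findable.

```python
# Toy sketch of why MySQL's default fulltext parser fails on Chinese.

def whitespace_tokens(text):
    # Default-parser behavior in miniature: split on whitespace only.
    return text.split()

def bigram_tokens(text):
    # Character bigrams, the idea behind an ngram parser with n=2.
    return [text[i:i + 2] for i in range(len(text) - 1)]

doc = "凌章是一个靓仔"
print(whitespace_tokens(doc))            # ['凌章是一个靓仔'] - one opaque token
print("凌章" in whitespace_tokens(doc))  # False: AGAINST('凌章') cannot match
print("凌章" in bigram_tokens(doc))      # True: bigrams make the word findable
```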
MongoDB Implementation
MongoDB also offers text indexes. Chinese support was reportedly added in version 3.2, but when I tried it, it did not seem to work — the built-in text index does not segment Chinese. MongoDB can also match with regular expressions, but that is essentially a LIKE query: except for an anchored prefix, it cannot use an index to speed up the search.
docker pull mongo
docker run -p 27017:27017 --name mongo -d mongo
docker exec -it mongo /bin/bash
Enter the client
mongo
# switch to the admin database
use admin
# create an administrator user
db.createUser(
{
user: "admin",
pwd: "admin",
roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
}
)
use chinese_search
db.article_fulltext.insert({title:"靓仔",body:"凌章是一个靓仔"});
db.article_fulltext.insert({title:"goodlooking",body:"ling is a goodlooking boy"});
db.article_fulltext.createIndex({body:"text"});
Uses the index (IXSCAN); the document is found
db.article_fulltext.find({$text:{$search:"goodlooking"}}).explain();
Uses the index (IXSCAN); no document is found
db.article_fulltext.find({$text:{$search:"靓仔"}}).explain()
PostgreSQL Implementation
PostgreSQL supports both ordinary B-tree indexes and full-text (GIN) indexes, and a Chinese word-segmentation extension (zhparser) can be installed. PostgreSQL really is the most capable open-source database — geospatial search, window functions, and so on; when I started this post I did not expect it to support segmented Chinese search as well. Still, for site-internal vertical search we should use ElasticSearch: the database is usually the biggest performance bottleneck, and we should not pile extra work onto it.
Install PostgreSQL
docker pull hub.c.163.com/library/postgres
docker run --name postgres1 -e POSTGRES_PASSWORD=password -p 5432:5432 -d hub.c.163.com/library/postgres
Connect with a client, then:
create extension pg_trgm;
create database chinese_search;
CREATE TABLE public.article
(
id serial NOT NULL,
title character varying(200) COLLATE pg_catalog."default",
body tsvector,
CONSTRAINT article_pkey PRIMARY KEY (id)
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
ALTER TABLE public.article
OWNER to postgres;
CREATE INDEX gin_body
ON public.article USING gin
(body)
TABLESPACE pg_default;
-- note: the default text search configuration will not segment Chinese; that requires zhparser
INSERT INTO article (id, title, body) VALUES (1, '靓仔', to_tsvector('凌章是一个靓仔'));
INSERT INTO article (id, title, body) VALUES (2, 'goodlooking', to_tsvector('ling is a goodlooking boy'));
-- uses the index
EXPLAIN SELECT * FROM article WHERE body @@ to_tsquery('ling');
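A simplified sketch of the @@ semantics in Python: a tsvector is essentially a set of lexemes, and `body @@ to_tsquery('ling')` asks whether the query lexeme appears in that set. (This is a simplification — real tsvectors also carry positions and weights, and to_tsvector stems words — but it shows why the English query matches while the Chinese one does not without zhparser.)

```python
# Toy model of to_tsvector: lowercase and split on whitespace.
# No stemming and, crucially, no Chinese segmentation.
def to_tsvector(text):
    return set(text.lower().split())

row1 = to_tsvector("凌章是一个靓仔")              # one opaque lexeme
row2 = to_tsvector("ling is a goodlooking boy")  # five lexemes

print("ling" in row2)  # True  - the English query matches
print("凌章" in row1)   # False - needs zhparser to segment first
```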
ElasticSearch Implementation
Finally, the main event: ElasticSearch. ElasticSearch is an open-source search engine built on Lucene that hides a great deal of complex, specialized information-retrieval machinery. It too gets its query performance from an inverted index, and with a Chinese analysis plugin installed it supports segmented Chinese search — the ideal choice here.
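The inverted index at the heart of this can be sketched in a few lines of Python. A toy illustration, with documents pre-segmented the way an analyzer such as IK would emit them: each term maps to the set of documents containing it, so a query reads a few posting lists instead of scanning every document.

```python
from collections import defaultdict

# Documents as term lists, as an analyzer would produce them.
docs = {
    1: ["凌章", "是", "一个", "靓仔"],
    2: ["ling", "is", "a", "goodlooking", "boy"],
}

# Build the inverted index: term -> posting list (set of doc ids).
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

def search(*terms):
    """AND query: intersect the posting lists of all query terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("凌章"))                # {1}
print(search("goodlooking", "boy"))  # {2}
```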
Build a sharded ES cluster and install the IK Chinese analyzer
Install ES
docker pull hub.c.163.com/library/elasticsearch
es1.yml configuration; the other nodes use the same file with the node name and ports changed
cluster.name: elasticsearch-cluster
node.name: es-node1
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9200
transport.tcp.port: 9300
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2
es2.yml configuration
cluster.name: elasticsearch-cluster
node.name: es-node2
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9201
transport.tcp.port: 9301
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2
es3.yml configuration
cluster.name: elasticsearch-cluster
node.name: es-node3
network.bind_host: 0.0.0.0
network.publish_host: 192.168.198.141
http.port: 9202
transport.tcp.port: 9302
http.cors.enabled: true
http.cors.allow-origin: "*"
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["192.168.198.141:9300","192.168.198.141:9301","192.168.198.141:9302"]
discovery.zen.minimum_master_nodes: 2
Start the containers with docker
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9200:9200 -p 9300:9300 -v /root/es-cluster/es1.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data1:/usr/share/elasticsearch/data --name es01 hub.c.163.com/library/elasticsearch
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9201:9201 -p 9301:9301 -v /root/es-cluster/es2.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data2:/usr/share/elasticsearch/data --name es02 hub.c.163.com/library/elasticsearch
docker run -e ES_JAVA_OPTS="-Xms256m -Xmx256m" -d -p 9202:9202 -p 9302:9302 -v /root/es-cluster/es3.yml:/usr/share/elasticsearch/config/elasticsearch.yml -v /root/es-cluster/data3:/usr/share/elasticsearch/data --name es03 hub.c.163.com/library/elasticsearch
docker run --name kibana -e ELASTICSEARCH_URL=http://192.168.198.141:9200 -p 5601:5601 -d hub.c.163.com/library/kibana
Check the node status
http://192.168.198.141:9200/_cat/nodes?pretty
Enter each container and install the IK analyzer (the plugin version must match your ES version)
docker exec -it es01 /bin/bash
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.5.1/elasticsearch-analysis-ik-5.5.1.zip
Alternatively, download the zip yourself and unpack it into the plugins directory
PUT article_ik
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "body": {
        "type": "text",
        "analyzer": "ik_smart"
      }
    }
  }
}
PUT /article_ik/_doc/1
{
"title":"靓仔",
"body":"凌章是一个靓仔"
}
PUT /article_ik/_doc/2
{
"title":"goodlooking",
"body":"ling is a goodlooking boy"
}
GET /article_ik/_search
{
"query": {
"match": {
"body": "凌章"
}
}
}
GET /article_ik/_search
{
"query": {
"match": {
"body": "靓仔"
}
}
}
Inspect how the analyzer tokenizes the text
GET _analyze
{
"analyzer": "ik_smart",
"text": "凌章是一个靓仔"
}