Tokenization
A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens.
For example, the whitespace tokenizer splits text whenever it sees whitespace: it would split the text "Quick brown fox!" into the terms [Quick, brown, fox!].
The tokenizer is also responsible for recording the order or position of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word each term represents (used for highlighting search snippets).
Elasticsearch ships with a number of built-in tokenizers that can be used to build custom analyzers.
Tokenizer reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis.html
POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
Result:
{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
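As noted above, the built-in tokenizers can be combined with token filters into a custom analyzer defined in the index settings. Below is a minimal sketch; the index name my_custom_index and the analyzer name my_analyzer are made up for illustration:
PUT my_custom_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
The custom analyzer can then be exercised with the same _analyze API, for example:
GET my_custom_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}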
Installing the IK tokenizer
By default every language is tokenized with the "Standard Analyzer", but that analyzer does not handle Chinese well, so a dedicated Chinese tokenizer needs to be installed.
Note: do not install it automatically with the default elasticsearch-plugin install xxx.zip; instead download the release that matches the ES version from https://github.com/medcl/elasticsearch-analysis-ik/tags
When Elasticsearch was installed earlier, the container's "/usr/share/elasticsearch/plugins" directory was mapped to "/mydata/elasticsearch/plugins" on the host, so the easiest approach is to download "elasticsearch-analysis-ik-7.6.2.zip" and unzip it into that directory. After installation, restart the Elasticsearch container.
The steps are as follows:
- Check the Elasticsearch version:
http://192.168.56.10:9200/
Result:
{
  "name": "8970f0253f14",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "LGyOo9JCR1GicCnxaOyzIw",
  "version": {
    "number": "7.6.2",
    "build_flavor": "default",
    "build_type": "docker",
    "build_hash": "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
    "build_date": "2020-03-26T06:34:37.794943Z",
    "build_snapshot": false,
    "lucene_version": "8.4.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
- Download the matching version and upload it to the plugins directory, for example:
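A sketch of the commands run on the host; it assumes wget and unzip are available (see the supplement below) and uses the usual GitHub release URL pattern for the 7.6.2 package:
## download the release that matches the ES version (URL assumed from the project's releases page)
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.2/elasticsearch-analysis-ik-7.6.2.zip
## unzip into its own folder under the mapped plugins directory
mkdir -p /mydata/elasticsearch/plugins/ik
unzip elasticsearch-analysis-ik-7.6.2.zip -d /mydata/elasticsearch/plugins/ik
## make sure the container can read the plugin files
chmod -R 755 /mydata/elasticsearch/plugins/ik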
- Enter the container and check whether the plugin was installed successfully
## Enter the container
[root@localhost ~]# docker exec -it 8970 /bin/bash
## Print the working directory
[root@8970f0253f14 elasticsearch]# pwd
/usr/share/elasticsearch
## List the directories and files in the current directory
[root@8970f0253f14 elasticsearch]# ls
LICENSE.txt NOTICE.txt README.asciidoc bin config data jdk lib logs modules plugins
## Enter the bin directory
[root@8970f0253f14 elasticsearch]# cd bin/
## List the scripts
[root@8970f0253f14 bin]# ls
elasticsearch elasticsearch-cli elasticsearch-env-from-file elasticsearch-node elasticsearch-setup-passwords elasticsearch-sql-cli-7.6.2.jar x-pack-env
elasticsearch-certgen elasticsearch-croneval elasticsearch-keystore elasticsearch-plugin elasticsearch-shard elasticsearch-syskeygen x-pack-security-env
elasticsearch-certutil elasticsearch-env elasticsearch-migrate elasticsearch-saml-metadata elasticsearch-sql-cli elasticsearch-users x-pack-watcher-env
## Show the plugin tool's help
[root@8970f0253f14 bin]# elasticsearch-plugin -h
A tool for managing installed elasticsearch plugins
Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch
Non-option arguments:
command
Option Description
------ -----------
-E <KeyValuePair> Configure a setting
-h, --help Show help
-s, --silent Show minimal output
-v, --verbose Show verbose output
## List all installed plugins
[root@8970f0253f14 bin]# elasticsearch-plugin list
[root@8970f0253f14 bin]#
- Restart the Elasticsearch container
docker restart elasticsearch
- Test the tokenizers
- Using the default analyzer (the standard analyzer breaks Chinese text into single characters)
GET my_index/_analyze
{
  "text": "我是中国人"
}
- Using ik_smart (coarsest-grained segmentation)
GET my_index/_analyze
{
  "analyzer": "ik_smart",
  "text": "我是中国人"
}
- Using ik_max_word (finest-grained segmentation)
GET my_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "我是中国人"
}
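In practice the ik analyzers are set on text fields in the index mapping. The combination recommended by the ik plugin documentation is ik_max_word at index time and ik_smart at search time; a sketch with a made-up index name my_index2:
PUT my_index2
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}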
Supplement: modifying the Linux network settings & enabling root password login
## Check which NIC the IP is bound to; here it is eth1
[root@localhost network-scripts]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
valid_lft 76287sec preferred_lft 76287sec
inet6 fe80::5054:ff:fe4d:77d3/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 08:00:27:40:58:7e brd ff:ff:ff:ff:ff:ff
inet 192.168.56.10/24 brd 192.168.56.255 scope global noprefixroute eth1
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fe40:587e/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:af:67:85:51 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
inet6 fe80::42:afff:fe67:8551/64 scope link
valid_lft forever preferred_lft forever
6: veth71675e9@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 7a:08:eb:2a:48:c7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet6 fe80::7808:ebff:fe2a:48c7/64 scope link
valid_lft forever preferred_lft forever
8: veth2aa7a7c@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 16:7f:5d:16:ee:ff brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::147f:5dff:fe16:eeff/64 scope link
valid_lft forever preferred_lft forever
10: vethea63ce0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether d2:fb:17:1e:93:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 2
inet6 fe80::d0fb:17ff:fe1e:93e3/64 scope link
valid_lft forever preferred_lft forever
12: veth0bbed57@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default
link/ether 0e:d9:aa:ed:0d:4f brd ff:ff:ff:ff:ff:ff link-netnsid 3
inet6 fe80::cd9:aaff:feed:d4f/64 scope link
valid_lft forever preferred_lft forever
## Go to the /etc/sysconfig/network-scripts directory and edit the NIC configuration
[root@localhost network-scripts]# pwd
/etc/sysconfig/network-scripts
[root@localhost network-scripts]# ls
ifcfg-eth0 ifdown ifdown-ippp ifdown-post ifdown-sit ifdown-tunnel ifup-bnep ifup-ipv6 ifup-plusb ifup-routes ifup-TeamPort init.ipv6-global
ifcfg-eth1 ifdown-bnep ifdown-ipv6 ifdown-ppp ifdown-Team ifup ifup-eth ifup-isdn ifup-post ifup-sit ifup-tunnel network-functions
ifcfg-lo ifdown-eth ifdown-isdn ifdown-routes ifdown-TeamPort ifup-aliases ifup-ippp ifup-plip ifup-ppp ifup-Team ifup-wireless network-functions-ipv6
## Edit with vi
[root@localhost network-scripts]# vi ifcfg-eth1
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=yes
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.56.10
NETMASK=255.255.255.0
GATEWAY=192.168.56.1 ## configure the gateway
DNS1=114.114.114.114 ## configure the DNS server
DNS2=8.8.8.8 ## configure the backup DNS server
DEVICE=eth1
PEERDNS=no
#VAGRANT-END
[root@localhost network-scripts]# service network restart
Restarting network (via systemctl): [ OK ]
[root@localhost network-scripts]# ping baidu.com
PING baidu.com (220.181.38.148) 56(84) bytes of data.
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=1 ttl=45 time=7.19 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=2 ttl=45 time=6.98 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=3 ttl=45 time=7.00 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=4 ttl=45 time=7.97 ms
^C
--- baidu.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 6.985/7.290/7.978/0.418 ms
[root@localhost network-scripts]#
Install wget
yum install wget
## Check that the installation succeeded
[root@localhost network-scripts]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.
Install unzip
## With -y there is no need to type y at the prompts; it answers yes by default
yum install -y unzip
Custom dictionary
Installing nginx
- Start a throwaway nginx instance, just to copy out its default configuration
Run the command below; if the nginx image does not exist locally, it is pulled first and then started.
[root@localhost nginx]# docker run -p 80:80 --name nginx -d nginx:1.10
Unable to find image 'nginx:1.10' locally
1.10: Pulling from library/nginx
6d827a3ef358: Pull complete
1e3e18a64ea9: Pull complete
556c62bb43ac: Pull complete
Digest: sha256:6202beb06ea61f44179e02ca965e8e13b961d12640101fca213efbfd145d7575
Status: Downloaded newer image for nginx:1.10
f667981b1e63fa47cddd1b05f2c42ec84814f1b5c94dddbe594a70fd0c887381
[root@localhost nginx]#
- Copy the configuration files from inside the container into the current directory
Do not forget the trailing dot
docker container cp nginx:/etc/nginx .
- Rename the copied directory to conf so that the configuration ends up under /mydata/nginx
mv nginx conf
- Stop and remove the original container
docker stop nginx
docker rm nginx
- Create the new nginx container by running the command below
docker run -p 80:80 --name nginx \
-v /mydata/nginx/html:/usr/share/nginx/html \
-v /mydata/nginx/logs:/var/log/nginx \
-v /mydata/nginx/conf:/etc/nginx \
-d nginx:1.10
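To check that the new container serves content from the mapped directories, drop a test page into the html volume and request it (a quick sanity check; the page content is arbitrary and curl is assumed to be available):
echo '<h1>nginx is working</h1>' > /mydata/nginx/html/index.html
curl http://192.168.56.10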
Edit IKAnalyzer.cfg.xml in /usr/share/elasticsearch/plugins/ik/config:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- users can configure their own extension dictionary here -->
    <entry key="ext_dict"></entry>
    <!-- users can configure their own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- users can configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry>
    <!-- users can configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
After making these changes, restart the Elasticsearch container, otherwise they will not take effect.
Once the dictionary is updated, ES applies the new tokenization only to newly indexed data; historical data is not re-tokenized. To re-tokenize historical data, run:
POST my_index/_update_by_query?conflicts=proceed
http://192.168.56.10/es/fenci.txt is the path of the resource served by nginx.
Before running the example below, nginx must be installed; then create the "fenci.txt" file under the es directory (so that it matches the remote_ext_dict URL configured above):
echo "尚硅谷谷粒学院" > /mydata/nginx/html/fenci.txt
Run a test:
GET my_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": "尚硅谷谷粒学院"
}