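122-124: Full-text search - Elasticsearch - Tokenization - Tokenization & installing the IK analyzer; Supplement - Changing Linux network settings & enabling root password login; Tokenization - Custom extension dictionary

Tokenization

A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and outputs a stream of tokens.

For example, the whitespace tokenizer splits text whenever it encounters a whitespace character. It splits the text "Quick brown fox!" into [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term (used for phrase and word proximity queries), as well as the start and end character offsets of the original word that each term represents (used for highlighting matched content in search results).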
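
To make the position and offset bookkeeping concrete, here is a minimal Python sketch of a whitespace tokenizer. It is not Elasticsearch's implementation; the `Token` type and the `whitespace_tokenize` function are hypothetical names used only for illustration, with field names borrowed from the `_analyze` response format.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """One emitted term plus the bookkeeping a tokenizer records for it."""
    token: str
    start_offset: int  # index of the first character in the original text
    end_offset: int    # index one past the last character
    position: int      # ordinal of the term in the token stream


def whitespace_tokenize(text: str) -> List[Token]:
    """Split on whitespace, keeping character offsets and positions."""
    return [
        Token(match.group(), match.start(), match.end(), position)
        for position, match in enumerate(re.finditer(r"\S+", text))
    ]


for token in whitespace_tokenize("Quick brown fox!"):
    print(token)
# Token(token='Quick', start_offset=0, end_offset=5, position=0)
# Token(token='brown', start_offset=6, end_offset=11, position=1)
# Token(token='fox!', start_offset=12, end_offset=16, position=2)
```

Note that the punctuation stays attached to "fox!", because this tokenizer only looks at whitespace.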

Elasticsearch provides many built-in tokenizers, which can be used to build custom analyzers.

Analysis reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis.html

Standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Result:

{
  "tokens" : [
    {
      "token" : "The",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "2",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<NUM>",
      "position" : 1
    },
    {
      "token" : "QUICK",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "Brown",
      "start_offset" : 12,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "Foxes",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "jumped",
      "start_offset" : 24,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "over",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "lazy",
      "start_offset" : 40,
      "end_offset" : 44,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "dog's",
      "start_offset" : 45,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "bone",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}
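
The offsets in the response above index into the original text, which is what makes highlighting possible. The following Python sketch illustrates the idea under the assumption that the spans do not overlap; the `highlight` helper and the `<em>` markers are illustrative choices, not part of Elasticsearch's API.

```python
from typing import Iterable, Tuple


def highlight(text: str, spans: Iterable[Tuple[int, int]],
              pre: str = "<em>", post: str = "</em>") -> str:
    """Wrap each (start_offset, end_offset) span of text in pre/post markers.

    Assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(text[cursor:start])            # untouched text before the span
        pieces.append(pre + text[start:end] + post)  # the highlighted term
        cursor = end
    pieces.append(text[cursor:])                     # remainder after the last span
    return "".join(pieces)


text = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
# Offsets copied from the "QUICK" and "lazy" tokens in the response above.
print(highlight(text, [(6, 11), (40, 44)]))
# The 2 <em>QUICK</em> Brown-Foxes jumped over the <em>lazy</em> dog's bone.
```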

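Installing the IK analyzer

By default, text in every language is analyzed with the "Standard Analyzer", which is not friendly to Chinese: it splits Chinese text into individual characters. A Chinese analyzer therefore needs to be installed.

Note: do not use the default `elasticsearch-plugin install xxx.zip` for automatic installation. Download the release that matches your ES version from https://github.com/medcl/elasticsearch-analysis-ik/tags

When Elasticsearch was installed earlier, the container's "/usr/share/elasticsearch/plugins" directory was mapped to the host's "/mydata/elasticsearch/plugins" directory, so the convenient approach is to download "elasticsearch-analysis-ik-7.6.2.zip" and unpack it into a subdirectory of that folder (plugins/ik, the path used later in this article). After installation, restart the Elasticsearch container.

Steps:

  1. Check the Elasticsearch version:
http://192.168.56.10:9200/

Response: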

{
	"name": "8970f0253f14",
	"cluster_name": "elasticsearch",
	"cluster_uuid": "LGyOo9JCR1GicCnxaOyzIw",
	"version": {
		"number": "7.6.2",
		"build_flavor": "default",
		"build_type": "docker",
		"build_hash": "ef48eb35cf30adf4db14086e8aabd07ef6fb113f",
		"build_date": "2020-03-26T06:34:37.794943Z",
		"build_snapshot": false,
		"lucene_version": "8.4.0",
		"minimum_wire_compatibility_version": "6.8.0",
		"minimum_index_compatibility_version": "6.0.0-beta1"
	},
	"tagline": "You Know, for Search"
}
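  2. Download the matching version and upload it to the plugins directory
  3. Enter the container and check whether the plugin was installed
## Enter the container
[root@localhost ~]# docker exec -it 8970 /bin/bash
## Show the current directory
[root@8970f0253f14 elasticsearch]# pwd
/usr/share/elasticsearch
## List the directories and files in the current directory
[root@8970f0253f14 elasticsearch]# ls
LICENSE.txt  NOTICE.txt  README.asciidoc  bin  config  data  jdk  lib  logs  modules  plugins
## Enter the bin directory
[root@8970f0253f14 elasticsearch]# cd bin/
## List the available scripts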
[root@8970f0253f14 bin]# ls
elasticsearch           elasticsearch-cli       elasticsearch-env-from-file  elasticsearch-node           elasticsearch-setup-passwords  elasticsearch-sql-cli-7.6.2.jar  x-pack-env
elasticsearch-certgen   elasticsearch-croneval  elasticsearch-keystore       elasticsearch-plugin         elasticsearch-shard            elasticsearch-syskeygen          x-pack-security-env
elasticsearch-certutil  elasticsearch-env       elasticsearch-migrate        elasticsearch-saml-metadata  elasticsearch-sql-cli          elasticsearch-users              x-pack-watcher-env
## Show the help text of the plugin tool
[root@8970f0253f14 bin]# elasticsearch-plugin -h  
A tool for managing installed elasticsearch plugins

Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch

Non-option arguments:
command              

Option             Description        
------             -----------        
-E <KeyValuePair>  Configure a setting
-h, --help         Show help          
-s, --silent       Show minimal output
-v, --verbose      Show verbose output
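## List all installed plugins; the IK plugin should appear here once it has been unpacked under plugins/ (the capture below shows an empty list)
[root@8970f0253f14 bin]# elasticsearch-plugin list
[root@8970f0253f14 bin]# 
  4. Restart the Elasticsearch container
docker restart elasticsearch
  5. Test the analyzers
  • Using the default analyzer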
GET my_index/_analyze
{
   "text":"我是中国人"
}
  • Using ik_smart
GET my_index/_analyze
{
   "analyzer": "ik_smart", 
   "text":"我是中国人"
}
  • Using ik_max_word
GET my_index/_analyze
{
   "analyzer": "ik_max_word", 
   "text":"我是中国人"
}
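
The three requests above differ only in the `analyzer` field, so they are easy to script for a side-by-side comparison. The following is a rough Python sketch using only the standard library. It assumes Elasticsearch is reachable at the address used in this article and that `my_index` already exists; `build_analyze_request` and `analyze` are hypothetical helper names.

```python
import json
import urllib.request
from typing import List, Optional, Tuple

ES_URL = "http://192.168.56.10:9200"  # address used in this article; adjust as needed


def build_analyze_request(index: str, text: str, analyzer: Optional[str] = None,
                          base_url: str = ES_URL) -> Tuple[str, bytes]:
    """Return the URL and JSON body for an _analyze call.

    When analyzer is None the field is omitted, so the index default is used.
    """
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    url = f"{base_url}/{index}/_analyze"
    return url, json.dumps(body, ensure_ascii=False).encode("utf-8")


def analyze(index: str, text: str, analyzer: Optional[str] = None,
            base_url: str = ES_URL) -> List[str]:
    """Call _analyze over HTTP and return just the token strings."""
    url, data = build_analyze_request(index, text, analyzer, base_url)
    request = urllib.request.Request(
        url, data=data, method="POST",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    return [item["token"] for item in payload["tokens"]]


# Example usage (requires a running cluster with the IK plugin installed):
# for name in (None, "ik_smart", "ik_max_word"):
#     print(name or "default", analyze("my_index", "我是中国人", name))
```

Keeping the request construction separate from the network call makes the body easy to inspect without a running cluster.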

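Supplement - Changing Linux network settings & enabling root password login


## Find the network interface that holds the IP address; here it is eth1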
[root@localhost network-scripts]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 52:54:00:4d:77:d3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.15/24 brd 10.0.2.255 scope global noprefixroute dynamic eth0
       valid_lft 76287sec preferred_lft 76287sec
    inet6 fe80::5054:ff:fe4d:77d3/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:40:58:7e brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.10/24 brd 192.168.56.255 scope global noprefixroute eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe40:587e/64 scope link 
       valid_lft forever preferred_lft forever
4: docker0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default 
    link/ether 02:42:af:67:85:51 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
    inet6 fe80::42:afff:fe67:8551/64 scope link 
       valid_lft forever preferred_lft forever
6: veth71675e9@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 7a:08:eb:2a:48:c7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet6 fe80::7808:ebff:fe2a:48c7/64 scope link 
       valid_lft forever preferred_lft forever
8: veth2aa7a7c@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 16:7f:5d:16:ee:ff brd ff:ff:ff:ff:ff:ff link-netnsid 1
    inet6 fe80::147f:5dff:fe16:eeff/64 scope link 
       valid_lft forever preferred_lft forever
10: vethea63ce0@if9: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether d2:fb:17:1e:93:e3 brd ff:ff:ff:ff:ff:ff link-netnsid 2
    inet6 fe80::d0fb:17ff:fe1e:93e3/64 scope link 
       valid_lft forever preferred_lft forever
12: veth0bbed57@if11: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker0 state UP group default 
    link/ether 0e:d9:aa:ed:0d:4f brd ff:ff:ff:ff:ff:ff link-netnsid 3
    inet6 fe80::cd9:aaff:feed:d4f/64 scope link 
       valid_lft forever preferred_lft forever
## Go to the /etc/sysconfig/network-scripts directory to edit the interface configuration
[root@localhost network-scripts]# pwd
/etc/sysconfig/network-scripts
[root@localhost network-scripts]# ls
ifcfg-eth0  ifdown       ifdown-ippp  ifdown-post    ifdown-sit       ifdown-tunnel  ifup-bnep  ifup-ipv6  ifup-plusb  ifup-routes  ifup-TeamPort  init.ipv6-global
ifcfg-eth1  ifdown-bnep  ifdown-ipv6  ifdown-ppp     ifdown-Team      ifup           ifup-eth   ifup-isdn  ifup-post   ifup-sit     ifup-tunnel    network-functions
ifcfg-lo    ifdown-eth   ifdown-isdn  ifdown-routes  ifdown-TeamPort  ifup-aliases   ifup-ippp  ifup-plip  ifup-ppp    ifup-Team    ifup-wireless  network-functions-ipv6
## Edit the file with vi
[root@localhost network-scripts]# vi ifcfg-eth1
#VAGRANT-BEGIN
# The contents below are automatically generated by Vagrant. Do not modify.
NM_CONTROLLED=yes
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.56.10
NETMASK=255.255.255.0
GATEWAY=192.168.56.1 ## gateway
DNS1=114.114.114.114 ## primary DNS
DNS2=8.8.8.8         ## backup DNS
DEVICE=eth1
PEERDNS=no
#VAGRANT-END
[root@localhost network-scripts]# service network restart
Restarting network (via systemctl):  [  OK  ]
[root@localhost network-scripts]# ping baidu.com
PING baidu.com (220.181.38.148) 56(84) bytes of data.
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=1 ttl=45 time=7.19 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=2 ttl=45 time=6.98 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=3 ttl=45 time=7.00 ms
64 bytes from 220.181.38.148 (220.181.38.148): icmp_seq=4 ttl=45 time=7.97 ms
^C
--- baidu.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 6.985/7.290/7.978/0.418 ms
[root@localhost network-scripts]# 

Install wget

yum install wget
## Check that the installation succeeded
[root@localhost network-scripts]# wget
wget: missing URL
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

Install unzip

## With -y there is no need to type y at the prompts; every answer defaults to yes
yum install -y unzip

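Custom dictionary

Install nginx
  1. Start a throwaway nginx instance, just to copy its configuration out.
    If the nginx image is not available locally, this command downloads it first and then starts the container.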
[root@localhost nginx]# docker run  -p 80:80 --name nginx -d nginx:1.10
Unable to find image 'nginx:1.10' locally
1.10: Pulling from library/nginx
6d827a3ef358: Pull complete 
1e3e18a64ea9: Pull complete 
556c62bb43ac: Pull complete 
Digest: sha256:6202beb06ea61f44179e02ca965e8e13b961d12640101fca213efbfd145d7575
Status: Downloaded newer image for nginx:1.10
f667981b1e63fa47cddd1b05f2c42ec84814f1b5c94dddbe594a70fd0c887381
[root@localhost nginx]# 
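  2. Copy the configuration files from the container into the current directory (assumed to be /mydata/nginx, as the prompt above suggests).
    Don't forget the trailing dot.
docker container cp nginx:/etc/nginx . 
  3. Rename the copied directory to conf, so that it ends up at /mydata/nginx/conf
mv nginx conf 
  4. Stop and remove the original container
docker stop nginx
docker rm nginx
  5. Create the new nginx container with the following command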
docker run -p 80:80 --name nginx \
-v /mydata/nginx/html:/usr/share/nginx/html \
-v /mydata/nginx/logs:/var/log/nginx \
-v /mydata/nginx/conf:/etc/nginx \
-d nginx:1.10

Edit IKAnalyzer.cfg.xml in the /usr/share/elasticsearch/plugins/ik/config directory:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
	<comment>IK Analyzer extension configuration</comment>
	<!-- Configure your own extension dictionary here -->
	<entry key="ext_dict"></entry>
	 <!-- Configure your own extension stopword dictionary here -->
	<entry key="ext_stopwords"></entry>
	<!-- Configure a remote extension dictionary here -->
	<entry key="remote_ext_dict">http://192.168.56.10/es/fenci.txt</entry> 
	<!-- Configure a remote extension stopword dictionary here -->
	<!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

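After changing the configuration, restart the Elasticsearch container; otherwise the change does not take effect.

After the dictionary is updated, ES uses the new tokenization only for newly indexed data. Existing documents are not re-tokenized. To re-tokenize existing data, run:

POST my_index/_update_by_query?conflicts=proceed

http://192.168.56.10/es/fenci.txt is the URL at which nginx serves the dictionary.

Before running the test below, nginx must be installed and the "fenci.txt" file created. Because the URL path is /es/fenci.txt, the file goes under the es subdirectory of the mounted html directory:

mkdir -p /mydata/nginx/html/es
echo "尚硅谷谷粒学院" > /mydata/nginx/html/es/fenci.txt

Test it: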

GET my_index/_analyze
{
   "analyzer": "ik_max_word", 
   "text":"尚硅谷谷粒学院"
}
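
As far as the IK plugin's README describes it, a remote dictionary such as fenci.txt is a UTF-8 text file with one word per line, and the plugin fetches it again when the `Last-Modified` or `ETag` response header changes. The Python sketch below only mirrors that description to make the format and the change check concrete; it is not the plugin's code, and `parse_dictionary` and `dictionary_changed` are hypothetical names.

```python
from typing import Dict, List


def parse_dictionary(raw: bytes) -> List[str]:
    """Decode a dictionary file: UTF-8, one word per line, blank lines ignored."""
    text = raw.decode("utf-8-sig")  # "utf-8-sig" also tolerates a leading BOM
    return [line.strip() for line in text.splitlines() if line.strip()]


def dictionary_changed(previous: Dict[str, str], current: Dict[str, str]) -> bool:
    """Report a change when either Last-Modified or ETag differs."""
    return any(
        previous.get(header) != current.get(header)
        for header in ("Last-Modified", "ETag")
    )


words = parse_dictionary("尚硅谷\n谷粒学院\n".encode("utf-8"))
print(words)  # ['尚硅谷', '谷粒学院']

before = {"Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT", "ETag": '"a1"'}
after = {"Last-Modified": "Mon, 01 Jan 2024 00:00:00 GMT", "ETag": '"b2"'}
print(dictionary_changed(before, after))  # True
```

The header values here are made-up sample data. In practice each word to be recognized should sit on its own line of fenci.txt.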