Full-text search for this blog with ElasticSearch (unfinished)

Latest recommended article published 2024-04-21 20:26:29

weixin_39578899


Article tags: ES full-text search

Article link: https://blog.csdn.net/weixin_39578899/article/details/112881252


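Installing ElasticSearch

For installing ElasticSearch, see the related article on this blog.

Installing the IK Analysis tokenizer (not yet verified in practice; skip for now)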

wget -c https://github.com/medcl/elasticsearch-analysis-ik/archive/v1.9.0.zip
unzip v1.9.0.zip
cd elasticsearch-analysis-ik-1.9.0
mvn package
# Take the built elasticsearch-analysis-ik-1.9.0.zip and unpack it into the ES plugins directory
unzip target/releases/elasticsearch-analysis-ik-1.9.0.zip -d /opt/app/blog/elasticsearch/plugins/ik/

Configuring synonyms

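Elasticsearch ships with a synonym token filter named synonym. To make IK and synonym work together, we need to define new analyzers that use IK as the tokenizer and synonym as the filter.

Open ~/es_root/config/elasticsearch.yml and add the following configuration: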

index:
  analysis:
    analyzer:
      ik_syno:
        type: custom
        tokenizer: ik_max_word
        filter: [my_synonym_filter]
      ik_syno_smart:
        type: custom
        tokenizer: ik_smart
        filter: [my_synonym_filter]
    filter:
      my_synonym_filter:
        type: synonym
        synonyms_path: analysis/synonym.txt

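The configuration above defines two new analyzers, ik_syno and ik_syno_smart, which correspond to IK's two tokenization strategies, ik_max_word and ik_smart. According to the IK documentation, they differ as follows:

ik_max_word: splits the text at the finest granularity. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、中华人民、中华、华人、人民共和国、人民、人、民、共和国、共和、和、国国、国歌」, exhausting every possible combination.

ik_smart: splits the text at the coarsest granularity. For example, 「中华人民共和国国歌」 is split into 「中华人民共和国、国歌」.

Both ik_syno and ik_syno_smart use the synonym filter to perform synonym conversion.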
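To make the difference between the two granularities concrete, here is a minimal Python sketch. It is not IK's actual algorithm: it assumes a toy dictionary containing exactly the words listed above, enumerates every dictionary word for the fine-grained case, and uses greedy forward maximum matching for the coarse-grained case. The function names finest_split and coarsest_split are hypothetical.

```python
# Toy dictionary (an assumption for illustration): the words from the example above.
DICTIONARY = {
    "中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民",
    "人", "民", "共和国", "共和", "和", "国国", "国歌",
}


def finest_split(text, dictionary):
    """Emit every dictionary word found in text, by start offset, longest first."""
    tokens = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                tokens.append(text[start:end])
    return tokens


def coarsest_split(text, dictionary):
    """Greedy forward maximum matching: take the longest word at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            # No dictionary word starts here: fall back to a single character.
            tokens.append(text[pos])
            pos += 1
    return tokens


fine = finest_split("中华人民共和国国歌", DICTIONARY)
coarse = coarsest_split("中华人民共和国国歌", DICTIONARY)
print(len(fine), fine)
print(len(coarse), coarse)
```

With this toy dictionary the fine-grained split yields 13 tokens and the coarse-grained split yields 2, matching the example. The real plugin works from a much larger dictionary and resolves ambiguities in its own way.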

Create the synonym configuration file (it must be saved as UTF-8):

mkdir -p ~/es_root/config/analysis/
vi ~/es_root/config/analysis/synonym.txt

The contents of synonym.txt are as follows:

ua,user-agent,userAgent
js,javascript
谷歌=>google
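The two rule forms in this file behave differently: a comma-separated line declares its terms equivalent, while a line with => replaces the left side with the right side. The following Python sketch mimics that behaviour on a list of tokens. It is a simplification under stated assumptions: it handles single-token entries only, ignores token positions, and the helper names parse_synonyms and apply_synonyms are hypothetical, not part of Elasticsearch.

```python
def parse_synonyms(lines):
    """Build a token -> output-tokens table from Solr-style synonym rules."""
    table = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            # Explicit mapping: the left-hand terms are replaced by the right-hand terms.
            left, right = line.split("=>", 1)
            sources = [term.strip() for term in left.split(",")]
            targets = [term.strip() for term in right.split(",")]
        else:
            # Equivalent terms: each term expands to the whole group.
            sources = targets = [term.strip() for term in line.split(",")]
        for source in sources:
            outputs = table.setdefault(source, [])
            for target in targets:
                if target not in outputs:
                    outputs.append(target)
    return table


def apply_synonyms(tokens, table):
    """Replace each token by its synonym outputs; unknown tokens pass through."""
    result = []
    for token in tokens:
        result.extend(table.get(token, [token]))
    return result


rules = ["ua,user-agent,userAgent", "js,javascript", "谷歌=>google"]
table = parse_synonyms(rules)
print(apply_synonyms(["谷歌", "js", "blog"], table))
```

In the real filter the expanded terms are emitted at the same position as the original token, and the rule entries themselves pass through the analysis chain, so the actual output may differ in detail.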

Checking cluster health

curl -s 'http://localhost:9200/_cluster/health' | python -mjson.tool

{
    "active_primary_shards": 0,
    "active_shards": 0,
    "cluster_name": "elasticsearch",
    "delayed_unassigned_shards": 0,
    "initializing_shards": 0,
    "number_of_data_nodes": 1,
    "number_of_in_flight_fetch": 0,
    "number_of_nodes": 1,
    "number_of_pending_tasks": 0,
    "relocating_shards": 0,
    "status": "green",
    "timed_out": false,
    "unassigned_shards": 0
}
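If the health check is going to be scripted, the response can be reduced to a simple verdict. The sketch below parses an abridged copy of the response above; the function name summarize_health and the decision to treat yellow as usable are assumptions made for illustration.

```python
import json

# Abridged copy of the health response shown above.
SAMPLE_RESPONSE = """
{
    "cluster_name": "elasticsearch",
    "number_of_nodes": 1,
    "status": "green",
    "timed_out": false,
    "unassigned_shards": 0
}
"""


def summarize_health(response_text):
    """Parse a _cluster/health response and decide whether the cluster is usable."""
    health = json.loads(response_text)
    # Assumption: green and yellow are both good enough for a single-node blog setup.
    usable = (not health["timed_out"]) and health["status"] in ("green", "yellow")
    return {
        "cluster": health["cluster_name"],
        "status": health["status"],
        "unassigned_shards": health["unassigned_shards"],
        "usable": usable,
    }


summary = summarize_health(SAMPLE_RESPONSE)
print(summary)
```

The status field takes the values green, yellow, or red. On a single-node cluster, an index created with one replica cannot place that replica anywhere, so the status is expected to drop to yellow once such an index exists.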

ElasticSearch

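The table below maps a few basic ES concepts to their counterparts in a traditional database. Each record stored in ES is called a document; it can be an object with many fields, and by default every field is searchable.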

| Database  | ElasticSearch |
|-----------|---------------|
| Databases | Indices       |
| Tables    | Types         |
| Rows      | Documents     |
| Columns   | Fields        |
| Schema    | Mapping       |
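One practical consequence of this correspondence is that, in ES versions that still have types, a document is addressed by its index, type, and ID, much like database, table, and primary key. A small sketch, with the hypothetical helper document_path and an arbitrary example ID:

```python
def document_path(index, doc_type, doc_id):
    """Return the REST path of a document: /<index>/<type>/<id>."""
    return "/{}/{}/{}".format(index, doc_type, doc_id)


# Index "blog" and type "post" are the names used in this article; the ID 1 is made up.
path = document_path("blog", "post", 1)
print(path)
```

Later ES releases removed mapping types, so this three-part addressing applies only to the older versions this article targets.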

Operations

Basic operations

Creating an index

POST /blog
{
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
    }
}

Creating a type and its mapping

POST /blog/post/_mapping
{
    "post": {
        "properties": {
            "id": {
                "type": "long"
            },
            "title": {
                "type": "string",
                "term_vector": "with_positions_offsets"
            },
            "published": {
                "type": "date"
            },
            "hidden": {
                "type": "boolean"
            },
            "category": {
                "type": "string"
            },
            "markdown": {
                "type": "string"
            },
            "content": {
                "type": "string",
                "term_vector": "with_positions_offsets"
            }
        }
    }
}
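The index-creation and mapping requests above can also be prepared from a script. The following is a minimal sketch using only the Python standard library; it assumes ES listens on http://localhost:9200 as in the health check, and the helper names build_request and send are hypothetical. Nothing is sent until send is called.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: same host and port as the health check

INDEX_SETTINGS = {"settings": {"number_of_shards": 1, "number_of_replicas": 1}}

# Same fields as the mapping in this article ("string" is the type name used there).
POST_MAPPING = {
    "post": {
        "properties": {
            "id": {"type": "long"},
            "title": {"type": "string", "term_vector": "with_positions_offsets"},
            "published": {"type": "date"},
            "hidden": {"type": "boolean"},
            "category": {"type": "string"},
            "markdown": {"type": "string"},
            "content": {"type": "string", "term_vector": "with_positions_offsets"},
        }
    }
}


def build_request(path, body):
    """Prepare a POST request with a JSON body; does not touch the network."""
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        ES_URL + path,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


create_index = build_request("/blog", INDEX_SETTINGS)
create_mapping = build_request("/blog/post/_mapping", POST_MAPPING)
print(create_index.get_method(), create_index.full_url)
print(create_mapping.get_method(), create_mapping.full_url)
```

Calling send(create_index) followed by send(create_mapping) would submit the two requests in order against a running node. Separating request construction from sending keeps the payloads easy to inspect before anything is changed on the cluster.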
