Full-text search over the contents of Word, PDF, and TXT files with Elasticsearch 7.9.1

Latest recommended article published on 2024-06-20 14:35:17

梁晓山（ben）


Article link: https://blog.csdn.net/oThrowsException/article/details/120223454


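1. Install the plugins

# Ingest preprocessing: extracts text from attachments
./bin/elasticsearch-plugin install ingest-attachment
# Chinese word segmentation (IK analyzer); pick the release that matches your Elasticsearch version
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...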
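
Plugins only take effect after each node has been restarted. As a rough sanity check, the rows returned by GET /_cat/plugins?format=json can be compared against the two plugins installed above. The sketch below is not part of the original setup: the helper name missing_plugins is hypothetical, and it assumes the plugins are reported under the component names ingest-attachment and analysis-ik.

```python
# Component names assumed to be reported by GET /_cat/plugins for the
# two plugins installed above.
REQUIRED_PLUGINS = ("ingest-attachment", "analysis-ik")


def missing_plugins(cat_plugins_rows, required=REQUIRED_PLUGINS):
    """Return required plugins that at least one node does not report.

    cat_plugins_rows is the parsed JSON of GET /_cat/plugins?format=json:
    a list of objects with "name" (node name) and "component" (plugin name).
    """
    by_node = {}
    for row in cat_plugins_rows:
        by_node.setdefault(row["name"], set()).add(row["component"])
    if not by_node:
        # No node reported any plugin at all.
        return sorted(required)
    missing = set()
    for components in by_node.values():
        missing |= set(required) - components
    return sorted(missing)


# Example rows shaped like the _cat/plugins JSON output (values are made up).
rows = [
    {"name": "node-1", "component": "ingest-attachment", "version": "7.9.1"},
]
print(missing_plugins(rows))
```

An empty list means every node that reported plugins has both of them.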

2. Define the text-extraction pipeline

PUT /_ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "content",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "content"
            }
        }
    ]
}

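The attachment processor is configured to read from the content field, so when writing to Elasticsearch the file data must be placed in the content field. The remove processor then drops the original content field once the text has been extracted.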
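
One way to try the pipeline before indexing real files is the ingest simulate API, which runs a pipeline against sample documents without storing anything. The sketch below only builds the request; the function name build_simulate_request is hypothetical, and the sample text stands in for real file bytes.

```python
import base64
import json


def build_simulate_request(file_bytes, pipeline="attachment"):
    """Build the path and JSON body for POST /_ingest/pipeline/<id>/_simulate."""
    # The attachment processor expects Base64-encoded data in "content".
    encoded = base64.b64encode(file_bytes).decode("ascii")
    path = f"/_ingest/pipeline/{pipeline}/_simulate"
    body = {"docs": [{"_source": {"content": encoded}}]}
    return path, body


# Plain text is used here as a stand-in for the bytes of a real file.
path, body = build_simulate_request("Hello Elasticsearch".encode("utf-8"))
print("POST", path)
print(json.dumps(body, indent=2))
```

Sending the printed request to the cluster should return the simulated document with an attachment object in place of the original content field.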

3. Create the index mapping

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "ik_max_word"
      },
      "type":{
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content":{
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

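The mapping declares an attachment field. This is the field that the pipeline named attachment adds automatically after extracting text from the uploaded file. It is an object field with several subfields, including the extracted text in content and some document metadata.

The file name field, name, is also given an analyzer, ik_max_word, so that Elasticsearch applies Chinese word segmentation to both fields when building the full-text index.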
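
With this mapping in place, a search can match either the file name or the extracted text. The original post does not show a query, so the following is only a minimal sketch of one possible request body for POST /docwrite/_search; the function name build_search_body and the choice of multi_match with highlighting are assumptions made for illustration.

```python
import json


def build_search_body(keyword, size=10):
    """Build a _search body that matches the file name or the extracted text."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": keyword,
                # Both fields are analyzed with IK analyzers in the mapping.
                "fields": ["name", "attachment.content"],
            }
        },
        # Ask for highlighted fragments around the matched terms.
        "highlight": {
            "fields": {"name": {}, "attachment.content": {}}
        },
    }


print(json.dumps(build_search_body("elasticsearch"), indent=2, ensure_ascii=False))
```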

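4. Upload
Convert the file stream to Base64 and submit it to Elasticsearch in the content field, indexing through the attachment pipeline (for example with the pipeline=attachment request parameter).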
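
The upload step could look roughly like the sketch below, which uses only the Python standard library. The original post does not include upload code, so the function names, the base_url default of http://localhost:9200, and the use of the file's base name for the name field are all assumptions. The index name docwrite, the pipeline name attachment, and the id, name, type, and content fields come from the steps above.

```python
import base64
import json
import os
import urllib.parse
import urllib.request


def build_index_request(doc_id, name, file_type, file_bytes,
                        index="docwrite", pipeline="attachment"):
    """Build the path and JSON body for indexing one file through the pipeline."""
    path = f"/{index}/_doc/{urllib.parse.quote(doc_id, safe='')}?pipeline={pipeline}"
    body = {
        "id": doc_id,
        "name": name,
        "type": file_type,
        # The attachment processor reads Base64 data from "content".
        "content": base64.b64encode(file_bytes).decode("ascii"),
    }
    return path, body


def upload_file(file_path, doc_id, file_type, base_url="http://localhost:9200"):
    """Read a local file and PUT it to Elasticsearch (base_url is an assumption)."""
    with open(file_path, "rb") as f:
        file_bytes = f.read()
    path, body = build_index_request(
        doc_id, os.path.basename(file_path), file_type, file_bytes
    )
    request = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, upload_file("report.pdf", "1", "pdf") would index a hypothetical local file report.pdf under document id 1. Large files produce large request bodies, so the cluster's HTTP request size limit may need to be considered.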
