elasticsearch-ik

Tags: big data

Original link: http://www.cnblogs.com/DennyZhao/p/9442853.html


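Lucene's default analysis targets English, where words are separated by spaces, so sentences can be split on whitespace. Chinese is written as multi-character words with no separators. Without a Chinese dictionary or plugin, the text is split into single characters rather than words, so a Chinese analysis plugin has to be installed.

This is how Chinese text is tokenized when no Chinese analyzer is installed: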

POST _analyze
{
   "text": "中国人民" 
}

The text is split one character at a time:

{
   "tokens": [
      {
         "token": "中",
         "start_offset": 0,
         "end_offset": 1,
         "type": "<IDEOGRAPHIC>",
         "position": 0
      },
      {
         "token": "国",
         "start_offset": 1,
         "end_offset": 2,
         "type": "<IDEOGRAPHIC>",
         "position": 1
      },
      {
         "token": "人",
         "start_offset": 2,
         "end_offset": 3,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "民",
         "start_offset": 3,
         "end_offset": 4,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      }
   ]
}
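
A rough Python sketch of what the output above amounts to: with no Chinese dictionary available, every ideograph becomes its own token. The function name `split_ideographs` is made up for illustration, and this only models the observed result; it is not how Lucene's standard tokenizer is implemented.

```python
def split_ideographs(text):
    """Return one token per character, mimicking the _analyze output above."""
    return [
        {
            "token": ch,
            "start_offset": i,
            "end_offset": i + 1,
            "type": "<IDEOGRAPHIC>",
            "position": i,
        }
        for i, ch in enumerate(text)
    ]


if __name__ == "__main__":
    # Print each token with its character offsets.
    for tok in split_ideographs("中国人民"):
        print(tok["token"], tok["start_offset"], tok["end_offset"])
```

A search for the word "中国" against such an index can only match on the individual characters, which is why a dictionary-based analyzer is needed.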

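Plugin download: https://github.com/medcl/elasticsearch-analysis-ik/releases

https://github.com/medcl/elasticsearch-analysis-ik (read the readme for installation steps)

Copy the downloaded contents into the plugins/ik folder of the Elasticsearch installation, creating the folder if it does not exist. The plugin takes effect after Elasticsearch is restarted.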

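Scope of ik

standard

    Needs no special definition (the default).

system

    In early ES versions, ik could be made the default by setting index.analysis.analyzer.default.type: ik in the yml file.

    On newer versions this fails with the error "node settings must not contain any index level settings".

    Since 5.x, Elasticsearch no longer allows settings that start with index in the yml file. They must be passed through the API after ES has started.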
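
As a hedged illustration of passing the setting through the API instead of the yml file, the Python sketch below builds (but does not send) an index-creation request that sets the index's default analyzer. The host `http://localhost:9200`, the index name `my_index`, and the helper `build_create_index_request` are assumptions made for this example. Whether `ik_max_word` is accepted as the default analyzer type this way depends on the installed ES and ik versions, so treat it as something to verify.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, analyzer):
    """Build a PUT request that creates an index with a default analyzer."""
    body = {
        "settings": {
            "index": {
                "analysis": {
                    "analyzer": {
                        # "default" is the analyzer used by fields that name none.
                        "default": {"type": analyzer}
                    }
                }
            }
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    # Host and index name are placeholders, not values from the article.
    req = build_create_index_request("http://localhost:9200", "my_index", "ik_max_word")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it against a running node: urllib.request.urlopen(req)
```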

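index

    First create the index, then set its mapping, and finally check the result. Sense is used here.

    Currently an ik analyzer can only be attached at the index level and applied to specific fields. To use ik on a newly added field, first add that field, together with the analyzer it should use, to the mapping.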

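// Create the index

PUT /testindex

// Set the analyzer. Note: once a field's mapping has been created in an index's _mapping it cannot be modified; only new fields can be added.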

POST /testindex/fulltext/_mapping
{
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "ik_max_word",
                "search_analyzer": "ik_max_word"
            }
        }
}

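As shown above, only the content field is analyzed with ik; other fields are not.

Error when adding address to the mapping: Mapper for [address] conflicts with existing mapping in other types:\n[mapper [address] has different [analyzer]]

This happens because a document containing this field had already been indexed, and at that point the field was automatically assigned a mapping. The mapping update therefore reports that the field already exists with a different analyzer. Deleting the document does not help, because field mappings can only be added, not modified.

Keep this in mind!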
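
The add-only rule can be modeled in a few lines of Python. This is a simplified sketch of the behavior described above, not Elasticsearch's actual mapping-merge logic; `find_mapping_conflicts` and the sample `address` and `title` mappings are hypothetical.

```python
def find_mapping_conflicts(existing, proposed):
    """Return the names of fields that already exist with a different definition."""
    conflicts = []
    for field, new_def in proposed.items():
        old_def = existing.get(field)
        # New fields are fine; redefining an existing field is a conflict.
        if old_def is not None and old_def != new_def:
            conflicts.append(field)
    return conflicts


existing = {
    "content": {"type": "text", "analyzer": "ik_max_word", "search_analyzer": "ik_max_word"},
    # Assumed: address was mapped automatically when a document was indexed.
    "address": {"type": "text"},
}
proposed = {
    "address": {"type": "text", "analyzer": "ik_max_word"},  # tries to change the analyzer
    "title": {"type": "text", "analyzer": "ik_max_word"},    # brand-new field
}

print(find_mapping_conflicts(existing, proposed))  # ['address']
```

In this model `title` is accepted because it is new, while `address` is rejected because it was already mapped without the ik analyzer.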

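ik_max_word: splits the text into as many terms as possible (fine granularity)

ik_smart: performs the coarsest split (coarse granularity)

Note: the ik analyzer of older versions has been replaced by ik_max_word and ik_smart.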

// When testing, do not put a type in the path; otherwise the request is treated as creating a document.

POST testindex/_analyze
{
   "analyzer": "ik_max_word",
   "text": "中国人民" 
}

// Result

{
   "tokens": [
      {
         "token": "中国人民",
         "start_offset": 0,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 0
      },
      {
         "token": "中国人",
         "start_offset": 0,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 1
      },
      {
         "token": "中国",
         "start_offset": 0,
         "end_offset": 2,
         "type": "CN_WORD",
         "position": 2
      },
      {
         "token": "国人",
         "start_offset": 1,
         "end_offset": 3,
         "type": "CN_WORD",
         "position": 3
      },
      {
         "token": "人民",
         "start_offset": 2,
         "end_offset": 4,
         "type": "CN_WORD",
         "position": 4
      }
   ]
}
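
To make the difference between fine and coarse granularity concrete, here is a toy dictionary-based segmenter in Python. It is only a sketch under stated assumptions: the five-word dictionary is taken from the tokens in the result above, the function names are invented, and longest-match-first is a stand-in technique for coarse segmentation, not the algorithm ik actually uses.

```python
# Toy dictionary, assumed for illustration (the words seen in the result above).
DICTIONARY = {"中国人民", "中国人", "中国", "国人", "人民"}


def all_dictionary_words(text, dictionary):
    """Fine-grained: emit every dictionary word found at every start offset."""
    tokens = []
    for start in range(len(text)):
        # Try longer candidates first so output is ordered like the result above.
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                tokens.append({"token": word, "start_offset": start, "end_offset": end})
    return tokens


def longest_match_first(text, dictionary):
    """Coarse-grained: greedily take the longest dictionary word, then move on."""
    tokens = []
    start = 0
    while start < len(text):
        for end in range(len(text), start, -1):
            # Fall back to a single character when no dictionary word matches.
            if text[start:end] in dictionary or end == start + 1:
                tokens.append(text[start:end])
                start = end
                break
    return tokens


fine = [t["token"] for t in all_dictionary_words("中国人民", DICTIONARY)]
coarse = longest_match_first("中国人民", DICTIONARY)
print(fine)    # ['中国人民', '中国人', '中国', '国人', '人民']
print(coarse)  # ['中国人民']
```

With this toy dictionary the fine-grained pass reproduces the five tokens listed above, while the greedy pass keeps only the single longest word.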

An index template can also be used so that new indices are created with a uniform format.

DELETE _template/temp_ik

POST _template/temp_ik
{
  "index_patterns": ["ik_*", "*_ik"],
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "type1": {
      "_source": {
        "enabled": true
      },
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "name": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word",
          "search_analyzer": "ik_max_word"
        },
        "create_date": {
          "type": "date",
          "format": "EEE MMM dd HH:mm:ss Z YYYY"
        }
      }
    }
  }
}
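
The template is applied to any new index whose name matches one of the index_patterns. The Python sketch below approximates that check with the standard library's `fnmatchcase`. The helper `template_applies` and the sample names `news_ik` and `testindex_2` are assumptions for illustration, and `fnmatch` supports a few extra wildcard forms beyond `*`, so this is an approximation rather than Elasticsearch's own matcher.

```python
from fnmatch import fnmatchcase

# Patterns copied from the temp_ik template above.
INDEX_PATTERNS = ["ik_*", "*_ik"]


def template_applies(index_name, patterns=INDEX_PATTERNS):
    """Return True if the index name matches at least one wildcard pattern."""
    return any(fnmatchcase(index_name, pattern) for pattern in patterns)


for name in ["ik_test", "news_ik", "testindex_2"]:
    print(name, template_applies(name))
```

Under this approximation `ik_test` matches the first pattern, so an index created with that name would pick up the ik mappings from the template.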

PUT ik_test

POST ik_test/type1/
{
    "title": "人民银行",
    "name":"7月金融数据传政策暖意",
    "content":"社会融资规模增速等数据也传递出这样的信息"
}

POST ik_test/type1/_search?pretty=true
{
    "query": {"match": {
       "title": "人民"
    }}
}