Elasticsearch Chinese documentation

Elasticsearch: The Definitive Guide (Chinese edition)
Kibana User Guide
Elasticsearch Chinese community
Elasticsearch reference API
Elasticsearch client API

Analyzer debugging

The following walks through the full process of debugging an analyzer.

Getting started

  1. Inspecting tokenization results
// myindex: the index
// _analyze: the operation that returns tokenization results
// whitespace: the tokenizer to use
// You're the 1st runner home!: the sample text
GET myindex/_analyze
{
  "tokenizer": "whitespace",
  "text": "You're the 1st runner home!"
}
  2. How analysis works
    Full-text search requires analyzing documents and building an index. The algorithm that extracts tokens from a document is called a tokenizer (Tokenizer); preprocessing before tokenization is done by character filters (Character Filter); tokens are further processed by token filters (Token Filter); the results are terms (Term). The whole analysis pipeline is called an analyzer (Analyzer).
    The number of times a term appears in a document is its frequency (Frequency). The search engine builds a term-to-document index, called an inverted index (Inverted Index).
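The inverted index described above can be sketched in a few lines of Python. This is an illustrative toy that assumes a trivial analyzer (lowercase plus whitespace split); Lucene's on-disk index structure is far more sophisticated:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # toy analyzer: lowercase, then split on whitespace
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

docs = {1: "Quick brown fox", 2: "Brown dog"}
index = build_inverted_index(docs)
print(sorted(index["brown"]))  # [1, 2] -- both documents contain "brown"
```

A search for a term is then a single dictionary lookup, which is why the analyzers applied at index time and at query time must agree.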
Analysis pipeline: Input (String) → CharacterFilters (String) → Tokenizer (Tokens) → TokenFilters (Tokens) → Output

An Analyzer does three things, in order:

filter characters with CharacterFilters
tokenize with a Tokenizer
filter tokens with TokenFilters

Each stage can be configured with multiple components.
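The three-stage pipeline above can be sketched as plain function composition. The filter and tokenizer implementations here are illustrative stand-ins, not Elasticsearch components:

```python
def analyze(text, char_filters, tokenizer, token_filters):
    """Run text through char filters, then a tokenizer, then token filters."""
    for cf in char_filters:        # 1. filter characters
        text = cf(text)
    tokens = tokenizer(text)       # 2. tokenize
    for tf in token_filters:       # 3. filter tokens
        tokens = tf(tokens)
    return tokens

strip_exclaim = lambda s: s.replace("!", "")          # toy character filter
whitespace_tokenizer = str.split                      # toy tokenizer
lowercase = lambda toks: [t.lower() for t in toks]    # toy token filter

result = analyze("You're the 1st runner home!",
                 [strip_exclaim], whitespace_tokenizer, [lowercase])
print(result)
```

Note that each stage accepts a list of components, mirroring how an Elasticsearch analyzer definition takes multiple char_filter and filter entries.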
Elasticsearch ships with a variety of CharacterFilters, Tokenizers, TokenFilters, and Analyzers out of the box, and you can also install third-party Analyzer components.
Analyzers usually expose settings; for example, the standard Analyzer accepts a stopwords (stop word) configuration.

The following example defines an analyzer named standard, of type standard, with a stop word list:

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "standard": {
          "type": "standard",
          "stopwords": [ "it", "is", "a" ]
        }
      }
    }
  }
}

You can also combine components into a custom Analyzer through the settings API. For example:

PUT /my-index/_settings
{
  "index": {
    "analysis": {
      "analyzer": {
        "custom": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball" ]
        }
      }
    }
  }
}

This defines an analyzer named custom, which:

uses the html_strip character filter to strip HTML tags
uses the standard tokenizer to split the text into tokens
uses the lowercase token filter to lowercase each token
uses the stop token filter to drop stop words
uses the snowball token filter to stem tokens with the Snowball algorithm

Use the Analyze API on a sample document to check that the configuration behaves as expected. For example:

POST /my-index/_analyze?analyzer=standard
quick brown

Returns:

{
  "tokens": [
    {
      "token": "quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

When creating the mapping for the target index, set our analyzer on each field that should be analyzed. For example:

PUT /my-index/_mapping/my-type
{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "custom"
      }
    }
  }
}

If you want different tokenizations from several analyzers, use the multi-fields feature to declare multiple sub-fields:

PUT /my-index/_mapping/my-type
{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "analyzer": "standard",
        "fields": {
          "custom1": {
            "type": "string",
            "analyzer": "custom1"
          },
          "custom2": {
            "type": "string",
            "analyzer": "custom2"
          }
        }
      }
    }
  }
}

You can then query name, name.custom1, and name.custom2 to use the tokens produced by the different analyzers.

An analyzer can also be specified at query time. For example:

POST /my-index/my-type/_search
{
  "query": {
    "match": {
      "name": {
        "query": "it's brown",
        "analyzer": "standard"
      }
    }
  }
}

Or specify the index-time and search-time analyzers separately in the mapping. For example:

PUT /my-index/_mapping/my-type
{
  "my-type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "custom",
        "search_analyzer": "standard" 
      }
    }
  }
}

Then index some documents and check the results with a simple match query; if something looks wrong, inspect it with the Validate API. For example:

POST /my-index/my-type/_validate/query?explain
{
  "query": {
    "match": {
      "name": "it's brown"
    }
  }
}
  3. Partially updating a document

Execute:

POST /website/blog/1/_update
{
  "doc" : {
     "tags" : [ "testing" ],
     "views": 0
  }
}

// Response:

{
   "_index" :   "website",
   "_id" :      "1",
   "_type" :    "blog",
   "_version" : 3
}

// The _source field after the update:

{
   "_index":    "website",
   "_type":     "blog",
   "_id":       "1",
   "_version":  3,
   "found":     true,
   "_source": {
      "title":  "My first blog entry",
      "text":   "Starting to get the hang of this...",
      "tags": [ "testing" ], 
      "views":  0 
   }
}
// If the document does not exist yet, it is created instead
// retry_on_conflict: number of times to retry on a version conflict
// The script runs when the document exists; otherwise the upsert body is indexed
// The script can also be written as ctx._source.views = "value"; special characters are not supported in the value
POST /website/pageviews/1/_update?retry_on_conflict=5
{
   "script" : "ctx._source.views+=1",
   "upsert": {
       "views": 0
   }
}
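The script-plus-upsert semantics can be illustrated with an in-memory sketch. This is only an analogy for how Elasticsearch applies the request, not its implementation:

```python
# A toy document store keyed by id; values play the role of _source.
store = {}

def update_with_upsert(store, doc_id, script, upsert):
    """Apply the script if the document exists; otherwise index the upsert body."""
    if doc_id in store:
        script(store[doc_id])        # like "ctx._source.views += 1"
    else:
        store[doc_id] = dict(upsert)

def increment_views(source):
    source["views"] += 1

update_with_upsert(store, "1", increment_views, {"views": 0})
print(store["1"])  # {'views': 0} -- document did not exist, upsert body indexed
update_with_upsert(store, "1", increment_views, {"views": 0})
print(store["1"])  # {'views': 1} -- document existed, script ran
```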
  4. Creating a document

// Let Elasticsearch auto-generate the id

POST /website/blog/
{ ... }

// Custom id; fails if the document already exists

PUT /website/blog/123?op_type=create
{ ... }

// Equivalent _create endpoint with a custom id

PUT /website/blog/123/_create
{ ... }
  5. Deleting a document
DELETE /website/blog/123
  6. Retrieving multiple documents (mget)

// Request

GET /_mget
{
  "docs" : [
     {
        "_index" : "website",
        "_type" :  "blog",
        "_id" :    2
     },
     {
        "_index" : "website",
        "_type" :  "pageviews",
        "_id" :    1,
        "_source": "views"
     }
  ]
}

// Response body

{
   "docs" : [
      {
         "_index" :   "website",
         "_id" :      "2",
         "_type" :    "blog",
         "found" :    true,
         "_source" : {
            "text" :  "This is a piece of cake...",
            "title" : "My first external blog entry"
         },
         "_version" : 10
      },
      {
         "_index" :   "website",
         "_id" :      "1",
         "_type" :    "pageviews",
         "found" :    true,
         "_version" : 2,
         "_source" : {
            "views" : 2
         }
      }
   ]
}

// When documents share the same index and type

GET /website/blog/_mget
{
   "docs" : [
      { "_id" : 2 },
      { "_type" : "pageviews", "_id" :   1 }
   ]
}
GET /website/blog/_mget
{
   "ids" : [ "2", "1" ]
}
  7. Custom analyzer with custom stop words
{
  "settings": {
    "index": {
      "number_of_shards": "8",
      "number_of_replicas": "1"
    },
    "analysis": {
      "analyzer": {
        // custom analyzer wiring up the filter below
        "my_lowercaser": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "param_stop" ]
        }
      },
      "filter": {
        // custom stop word filter
        "param_stop": {
          "type": "stop",
          "stopwords": [ ",", "-", "_" ]
        }
      }
    }
  },
  "mappings": {
    "typeName": {
      "_ttl": {
        "enabled": false
      },
      "dynamic": "false",
      "_timestamp": {
        "enabled": true
      },
      "_all": {
        "enabled": false
      },
      "properties": {
        "updateDate": {
          "format": "strict_date_optional_time||epoch_millis",
          "type": "date"
        },
        "cId": {
          // not analyzed
          "index": "not_analyzed",
          "type": "string"
        },
        "name": {
          // analyzer used at search time
          "search_analyzer": "my_lowercaser",
          // analyzer used when building the index
          "analyzer": "my_lowercaser",
          "store": true,
          "type": "string"
        }
      }
    }
  }
}
  8. Querying
GET /_search
{
   "query": {
       "match": {
           "tweet": "elasticsearch"
       }
   }
}

  9. Combining queries

// Leaf clauses (such as match) compare the query string against a
// field (or several fields). Compound clauses combine other query
// clauses: for example, a bool clause lets you mix must, must_not,
// and should matches as needed, and it can also hold non-scoring
// filters.
{
    "bool": {
        "must":     { "match": { "tweet": "elasticsearch" }},
        "must_not": { "match": { "name":  "mary" }},
        "should":   { "match": { "tweet": "full text" }},
        "filter":   { "range": { "age" : { "gt" : 30 }} }
    }
}

Nested compound query

// Find starred emails whose body contains business opportunity, or
// non-spam emails in the inbox whose body contains business opportunity
{
    "bool": {
        "must": { "match":   { "email": "business opportunity" }},
        "should": [
            { "match":       { "starred": true }},
            { "bool": {
                "must":      { "match": { "folder": "inbox" }},
                "must_not":  { "match": { "spam": true }}
            }}
        ],
        "minimum_should_match": 1
    }
}

Querying with SQL

POST /_xpack/sql?format=txt
{
  "query": "SELECT * FROM my_index"
}
// Translate the SQL into the equivalent Query DSL
POST _xpack/sql/translate
{
  "query": "SELECT * FROM my_index"
}

Highlighting setup and query

PUT my_index
{
 "settings": {
   "analysis": {
     "analyzer": {
       "my_analyzer": {
         "tokenizer": "standard",
         "char_filter": [
           "my_char_filter"
         ],
         "filter": [
           "lowercase"
         ]
       }
     },
     "char_filter": {
       "my_char_filter": {
         "type": "pattern_replace",
         "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
         "replacement": " "
       }
     }
   }
 },
 "mappings": {
   "my_type": {
     "properties": {
       "name": {
         "type": "text",
         "analyzer": "my_analyzer",
         "index_options" : "offsets"
       },
       "cname": {
         "type": "text",
         "analyzer": "my_analyzer",
         "index_options" : "offsets"
       }
     }
   }
 }
}
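The pattern_replace char filter above inserts a space at every lower-to-upper-case boundary, splitting camelCase words before tokenization. Python's re module lacks \p{Lower}/\p{Upper}, so this sketch approximates them with ASCII character classes:

```python
import re

def split_camel(text):
    """Insert a space wherever a lowercase letter is followed by an uppercase one."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(split_camel("fooBarBaz"))  # 'foo Bar Baz'
```

After this substitution, the standard tokenizer sees three separate words, so "bar" can match inside "fooBarBaz".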

GET my_index/_search
{
   "query" : {
     "bool": {
       "should": [
         {
           "match": {
             "name": "fox"
           }
         },
         {
           "match": {
             "cname": "thousand"
           }
         }
       ]
     }
   },
   "highlight" : {
       "fields" : {
           "name":{
             "type": "plain",
               "fragment_size" : 150,
               "number_of_fragments" : 2,
               "fragmenter": "simple"
           },
           "cname":{
             "type": "plain",
               "fragment_size" : 500,
               "number_of_fragments" : 3,
               "fragmenter": "simple"
           }
       }
   }
}

PUT my_index/my_type/1
{
 "name" : "For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other. You'll be the only boy in the world for me. I'll be the only fox in the world for you.",
 "cname":"For you I'm only a fox like a hundred thousand other foxes. But if you tame me, we'll need each other."
}
