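Inverted Index
An inverted index has two parts: the term dictionary and the posting lists.
Term Dictionary
Records every term that appears in the documents, and maps each term to its posting list.
The term dictionary is usually large; it can be implemented with a B+ tree or a hash table with separate chaining to keep insertion and lookup fast.
Posting List
Records the set of documents that contain a term; it is made up of postings.
Posting
- Document ID
- Term frequency (TF): the number of times the term occurs in the document, used for relevance scoring
- Position: the position of the term among the document's tokens, used for phrase search
- Offset: the start and end character positions of the term, used for highlighting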
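The structure described above can be sketched in a few lines of Python. This is only an illustration, not how Lucene stores its index: the names `Posting` and `build_index` are made up for this example, the term dictionary is a plain dict rather than a B+ tree, and tokenization is a simple lowercase word split.

```python
import re
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One entry of a posting list (hypothetical layout for illustration)."""
    doc_id: int
    term_freq: int = 0                             # TF: occurrences in this document
    positions: list = field(default_factory=list)  # token positions, for phrase search
    offsets: list = field(default_factory=list)    # (start, end) char offsets, for highlighting


def build_index(docs):
    """Map each lowercase term to its posting list (one Posting per document)."""
    index = defaultdict(dict)  # term -> {doc_id: Posting}
    for doc_id, text in docs.items():
        for position, match in enumerate(re.finditer(r"\w+", text)):
            term = match.group().lower()
            posting = index[term].setdefault(doc_id, Posting(doc_id))
            posting.term_freq += 1
            posting.positions.append(position)
            posting.offsets.append((match.start(), match.end()))
    return {term: list(by_doc.values()) for term, by_doc in index.items()}


docs = {1: "Steve Jobs", 2: "Jobs for Steve and more jobs"}
index = build_index(docs)
for posting in index["jobs"]:
    print(posting)
```

Looking up `index["jobs"]` returns one posting per document that contains the term, each carrying the frequency, positions, and offsets listed above.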
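Analysis
Text analysis is the process of converting full text into a series of terms (tokens); it is also called tokenization. Analysis is performed by an analyzer, either one of the analyzers built into ES or a custom one. Terms are produced when data is written, and the query string normally has to be analyzed with the same analyzer when a query is matched.
Analyzer components
An analyzer is the component dedicated to tokenization. It consists of three parts:
- Character Filter: processes the raw text, for example stripping HTML
- Tokenizer: splits the text into terms according to rules
- Token Filter: post-processes the terms, for example lowercasing, removing stopwords, or adding synonyms
IK analyzer: analysis (tokenization) runs as: character filtering (strips special symbols, plus measure words, particles such as 的, and stop words) -> tokenization (based on the word dictionary) -> token filtering (token conversion, stemming)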
Custom analyzer
PUT /my_index
{
"settings":{
"analysis":{
"char_filter":{
"&_to_and":{
"type":"mapping",
"mappings":[
"&=> and "
]
}
},
"filter":{
"my_stopwords":{
"type":"stop",
"stopwords":[
"the",
"a"
]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "standard"
}
},
"analyzer":{
"my_analyzer":{
"type":"custom",
"char_filter":[
"html_strip",
"&_to_and"
],
"tokenizer":"my_tokenizer",
"filter":[
"lowercase",
"my_stopwords"
]
}
}
}
}
}
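To make the three stages concrete, here is a rough Python imitation of what `my_analyzer` does to a piece of text. It is a sketch under simplifying assumptions: the regular expressions only approximate the real `html_strip` char filter and `standard` tokenizer, and the function names are invented for this example.

```python
import re

STOPWORDS = {"the", "a"}  # same list as my_stopwords


def char_filters(text):
    """Stage 1: crude stand-ins for html_strip and the &_to_and mapping."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop anything that looks like a tag
    return text.replace("&", " and ")


def tokenizer(text):
    """Stage 2: split into word tokens (rough approximation of the standard tokenizer)."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Stage 3: lowercase, then remove stopwords, in the order the analyzer lists them."""
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOPWORDS]


def analyze(text):
    return token_filters(tokenizer(char_filters(text)))


tokens = analyze("<p>The Quick & Brown fox</p>")
print(tokens)
```

The order matters: the char filters see the raw string, the tokenizer sees the cleaned string, and the token filters see individual tokens. On a real cluster, the `_analyze` API with `"analyzer": "my_analyzer"` shows the actual output.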
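Multi-fields
What multi-fields are used for:
- Exact matching on a manufacturer name (add a keyword sub-field)
- Using different analyzers (different languages, searching on a pinyin sub-field; separate analyzers can also be specified for indexing and for searching)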
PUT products
{
"mappings": {
"properties": {
"company": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"comment": {
"type": "text",
"fields": {
"english_comment": {
"type": "text",
"analyzer": "english",
"search_analyzer": "english"
}
}
}
}
}
}
TMDB example
Create the mapping
# Create the index for the TMDB data
PUT movie
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 1
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "english"
},
"tagline": {
"type": "text",
"analyzer": "english"
},
"release_date": {
"type": "date",
"format": "8yyyy/MM/dd||yyyy/M/dd||yyyy/MM/d||yyyy/M/d"
},
"popularity": {
"type": "double"
},
"overview": {
"type": "text",
"analyzer": "english"
},
"cast": {
"type": "object",
"properties": {
"character": {
"type": "text",
"analyzer": "standard"
},
"name": {
"type": "text",
"analyzer": "standard"
}
}
}
}
}
}
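Queries: single-field queries
Difference between match and term queries: a term query does not analyze the query text; it looks the value up in the index exactly as given, so its case must match the indexed terms. A match query analyzes the query text with the analyzer defined on the field, then searches the index for the resulting terms.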
GET /movie/_analyze
{
"field": "title",
"text": "steve jobs"
}
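# Exact match, comparable to WHERE field = '...' in SQL
# title is an analyzed text field whose indexed terms are lowercase single words, so this unanalyzed value matches nothing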
GET movie/_search
{
"query": {
"term": {
"title": "Steve Zissou"
}
}
}
# AND/OR logic after analysis: match uses OR by default
GET /movie/_search
{
"query": {
"match": {
"title": "basketball with cartoon aliens"
}
}
}
# AND/OR logic after analysis: match with operator set to and
GET /movie/_search
{
"query": {
"match": {
"title": {
"query": "basketball with cartoon aliens love",
"operator": "and"
}
}
}
}
# minimum_should_match: at least 2 of the terms must match
GET movie/_search
{
"query": {
"match": {
"title": {
"query": "basketball with cartoon aliens love",
"operator": "or",
"minimum_should_match": 2
}
}
}
}
# Phrase query: the terms must appear adjacent and in order, loosely comparable to LIKE '%x%' in SQL
GET movie/_search
{
"query": {
"match_phrase": {
"title": "steve zissou"
}
}
}
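The matching rules used in this section can be mimicked on plain token lists. The sketch below is a toy model, not Elasticsearch behavior in full: `analyze` only lowercases and splits on whitespace (no stemming or stopword removal as the english analyzer would do), scoring is ignored, and all function names are invented for the example.

```python
def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def term(doc_tokens, value):
    """term query: the value is NOT analyzed, it must equal an indexed token."""
    return value in doc_tokens


def match(doc_tokens, query, operator="or", minimum_should_match=1):
    """match query: analyze the query, then combine the terms with OR or AND."""
    terms = set(analyze(query))
    hits = sum(1 for t in terms if t in doc_tokens)
    if operator == "and":
        return hits == len(terms)
    return hits >= minimum_should_match


def match_phrase(doc_tokens, query):
    """match_phrase: the analyzed terms must appear adjacent and in order."""
    terms = analyze(query)
    n = len(terms)
    return any(doc_tokens[i:i + n] == terms for i in range(len(doc_tokens) - n + 1))


title = analyze("The Life Aquatic with Steve Zissou")

print(term(title, "Steve Zissou"))                      # unanalyzed value, no such token
print(term(title, "steve"))
print(match(title, "steve jobs"))                       # OR: one term is enough
print(match(title, "steve jobs", operator="and"))       # AND: "jobs" is missing
print(match(title, "steve jobs zissou", minimum_should_match=2))
print(match_phrase(title, "steve zissou"))
print(match_phrase(title, "zissou steve"))              # wrong order
```

The first call shows why the term query earlier in this section finds nothing: the indexed tokens are lowercase single words, while the query value is one capitalized two-word string.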
Queries: multi-field queries
# Multi-field query using multi_match
# title^10 boosts the score of the title field by a factor of 10
# Without tie_breaker, the final score is max(title score, overview score)
# With tie_breaker, the final score is the best field's score + 0.3 * the other field's score
GET movie/_search
{
"query": {
"multi_match": {
"query": "basketball with cartoon aliens",
"fields": ["title^10", "overview"],
"tie_breaker": 0.3
}
}
}
# Show the structure of the query that is actually executed and scored
GET movie/_validate/query?explain
{
"query": {
"multi_match": {
"query": "basketball with cartoon aliens",
"fields": ["title^10", "overview"],
"type": "best_fields"
}
}
}
# cross_fields works term by term: max(basketball score in title, basketball score in overview) + max(cartoon score in title, cartoon score in overview)
GET movie/_validate/query?explain
{
"query": {
"multi_match": {
"query": "basketball cartoon",
"fields": ["title^10", "overview"],
"type": "cross_fields"
}
}
}
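The score combinations described in the comments above can be written out as small functions. This is a simplified sketch that takes per-field scores as given inputs; the numbers are invented, the function names are made up for the example, and real cross_fields scoring also blends term statistics across fields, which is not modeled here.

```python
def best_fields(field_scores, tie_breaker=0.0):
    """best_fields: best field's score plus tie_breaker times the other fields' scores."""
    scores = list(field_scores.values())
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


def most_fields(field_scores):
    """most_fields: the scores of all matching fields are added up."""
    return sum(field_scores.values())


def cross_fields(term_field_scores):
    """cross_fields (simplified): for each term take the best field, then sum over terms."""
    return sum(max(per_field.values()) for per_field in term_field_scores.values())


# Hypothetical scores, with the title boost already applied
scores = {"title": 8.0, "overview": 2.0}
print(best_fields(scores))                    # only the best field counts
print(best_fields(scores, tie_breaker=0.3))   # the other field contributes 30%
print(most_fields(scores))

per_term = {
    "basketball": {"title": 5.0, "overview": 1.0},
    "cartoon": {"title": 0.0, "overview": 2.0},
}
print(cross_fields(per_term))
```

A tie_breaker of 0 reduces best_fields to a plain maximum, and a tie_breaker of 1 makes it equal to the most_fields sum.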
# bool query
# must: every clause must match
# must_not: no clause may match
# should: at least one clause should match
# The more clauses match, the higher the score (the clause scores are added up)
GET movie/_search
{
"explain": true,
"query": {
"bool": {
"should": [
{"match": {"title": "basketball with cartoon aliens"}},
{"match": {"overview": "basketball with cartoon aliens"}}
]
}
}
}
# filter: clauses in filter context do not contribute to the score
# Typically used like a WHERE clause for filtering, together with term queries
GET /movie/_search
{
"query":{
"bool":{
"filter":[
{"term":{"title":"steve"}},
{"term":{"cast.name":"gaspard"}},
{"range": { "release_date": { "lte": "2015/01/01" }}},
{"range": { "popularity": { "gte": "25" }}}
]
}
},
"sort":[
{"popularity":{"order":"desc"}}
]
}
# function_score: adjust the score
# In detail: log10(10 * popularity + 2) * log10(5 * popularity + 2) * query score
# Other operations can be chosen, e.g. log10(10 * popularity + 2) + log10(5 * popularity + 2) + query score
GET movie/_search
{
"explain": true,
"query": {
"function_score": {
"query": {
"multi_match": {
"query": "steve job",
"fields": ["title", "overview"],
"operator": "or",
"type": "most_fields"
}
},
"functions": [
{
"field_value_factor": {
"field": "popularity",
"modifier": "log2p",
"factor": 10
}
},
{
"field_value_factor": {
"field": "popularity",
"modifier": "log2p",
"factor": 5
}
}
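],
"score_mode": "multiply", # how the outputs of the entries in functions are combined; the default is multiply
"boost_mode": "multiply" # how the combined function score is combined with the query score; the default is multiply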
}
}
}
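The arithmetic behind this request can be reproduced by hand. The sketch below assumes the documented meaning of the log2p modifier (add 2 to factor * value, then take the base-10 logarithm) and takes the query score as an input; the function names and sample numbers are invented for illustration.

```python
import math


def log2p(value, factor):
    """field_value_factor with modifier log2p: log10(factor * value + 2)."""
    return math.log10(factor * value + 2)


def function_score(query_score, popularity):
    """Combine the two functions and the query score as in the request above."""
    f1 = log2p(popularity, factor=10)
    f2 = log2p(popularity, factor=5)
    combined = f1 * f2               # score_mode: multiply
    return query_score * combined    # boost_mode: multiply


# Hypothetical documents with the same query score but different popularity
for popularity in (0.0, 5.0, 50.0):
    print(popularity, function_score(query_score=3.0, popularity=popularity))
```

Because log2p adds 2 before taking the logarithm, a popularity of 0 still yields a positive factor instead of an undefined log(0).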
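How search scoring works
TF: term frequency, the number of times a term occurs in the document field being searched
IDF: inverse document frequency, the inverse of how frequently a term occurs across all the documents in the index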
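Both quantities are easy to compute on a toy corpus. The sketch below uses the IDF formula from Lucene's BM25 similarity, ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n the number containing the term; the corpus and function names are invented, and the full BM25 score (length normalization, the k1 and b parameters) is left out.

```python
import math


def term_frequency(term, tokens):
    """TF: how many times the term occurs in one document's field."""
    return tokens.count(term)


def bm25_idf(term, corpus):
    """IDF as in Lucene's BM25: rarer terms get a larger value."""
    doc_count = len(corpus)
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical analyzed titles
corpus = [
    ["steve", "jobs"],
    ["steve", "zissou", "steve"],
    ["space", "jam"],
]

print(term_frequency("steve", corpus[1]))
print(bm25_idf("steve", corpus))  # appears in 2 of 3 documents
print(bm25_idf("jam", corpus))    # appears in 1 of 3 documents, so it weighs more
```

A term that appears in many documents says little about any one of them, which is why its IDF is lower than that of a rare term.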