原
Elasticsearch 之(31)fielddata原理初探
版权声明: https://blog.csdn.net/wuzhiwei549/article/details/80479738
1、
《
Elasticsearch 之(6)kibana嵌套聚合,下钻分析,聚合分析》提到 对于分词的field执行aggregation,发现报错
-
GET /test_index/test_type/_search
-
{
-
”aggs”: {
-
”group_by_test_field”: {
-
”terms”: {
-
”field”:
”test_field”
-
}
-
}
-
}
-
}
-
{
-
"error": {
-
"root_cause": [
-
{
-
"type":
"illegal_argument_exception",
-
"reason":
"Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
-
}
-
],
-
"type":
"search_phase_execution_exception",
-
"reason":
"all shards failed",
-
"phase":
"query",
-
"grouped":
true,
-
"failed_shards": [
-
{
-
"shard":
0,
-
"index":
"test_index",
-
"node":
"4onsTYVZTjGvIj9_spWz2w",
-
"reason": {
-
"type":
"illegal_argument_exception",
-
"reason":
"Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
-
}
-
}
-
],
-
"caused_by": {
-
"type":
"illegal_argument_exception",
-
"reason":
"Fielddata is disabled on text fields by default. Set fielddata=true on [test_field] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."
-
}
-
},
-
"status":
400
-
}
对分词的field,直接执行聚合操作,会报错,大概意思是说,你必须要打开fielddata,然后将正排索引数据加载到内存中,才可以对分词的field执行聚合操作,而且会消耗很大的内存
2、
如果要对分词的field执行聚合操作,必须将fielddata设置为true
-
POST /test_index/_mapping/test_type
-
{
-
”properties”: {
-
”test_field”: {
-
”type”:
”text”,
-
”fielddata”:
true
-
}
-
}
-
}
-
-
{
-
”test_index”: {
-
”mappings”: {
-
”test_type”: {
-
”properties”: {
-
”test_field”: {
-
”type”:
”text”,
-
”fields”: {
-
”keyword”: {
-
”type”:
”keyword”,
-
”ignore_above”:
256
-
}
-
},
-
”fielddata”:
true
-
}
-
}
-
}
-
}
-
}
-
}
-
GET /test_index/test_type/_search
-
{
-
"size":
0,
-
"aggs": {
-
"group_by_test_field": {
-
"terms": {
-
"field":
"test_field"
-
}
-
}
-
}
-
}
-
-
{
-
"took":
23,
-
"timed_out":
false,
-
"_shards": {
-
"total":
5,
-
"successful":
5,
-
"failed":
0
-
},
-
"hits": {
-
"total":
2,
-
"max_score":
0,
-
"hits": []
-
},
-
"aggregations": {
-
"group_by_test_field": {
-
"doc_count_error_upper_bound":
0,
-
"sum_other_doc_count":
0,
-
"buckets": [
-
{
-
"key":
"test",
-
"doc_count":
2
-
}
-
]
-
}
-
}
-
}
3、使用内置field不分词,对string field进行聚合
如果对不分词的field执行聚合操作,直接就可以执行,不需要设置fieldata=true(keyword在256字符内忽略分词)
-
GET /test_index/test_type/_search
-
{
-
”size”:
0,
-
”aggs”: {
-
”group_by_test_field”: {
-
”terms”: {
-
”field”:
”test_field.keyword”
-
}
-
}
-
}
-
}
-
{
-
"took":
3,
-
"timed_out":
false,
-
"_shards": {
-
"total":
5,
-
"successful":
5,
-
"failed":
0
-
},
-
"hits": {
-
"total":
2,
-
"max_score":
0,
-
"hits": []
-
},
-
"aggregations": {
-
"group_by_test_field": {
-
"doc_count_error_upper_bound":
0,
-
"sum_other_doc_count":
0,
-
"buckets": [
-
{
-
"key":
"test",
-
"doc_count":
2
-
}
-
]
-
}
-
}
-
}
4、分词field+fielddata的工作原理
doc value –> 不分词的所有field,可以执行聚合操作 –> 如果你的某个field不分词,那么在index-time,就会自动生成doc value –> 针对这些不分词的field执行聚合操作的时候,自动就会用doc value来执行
分词field,是没有doc value的。在index-time,如果某个field是分词的,那么是不会给它建立doc value正排索引的,因为分词后,占用的空间过于大,所以默认是不支持分词field进行聚合的
分词field默认没有doc value,所以直接对分词field执行聚合操作,是会报错的
对于分词field,必须打开和使用fielddata,完全存在于纯内存中,结构和doc value类似。如果是ngram或者是大量term,那么必将占用大量的内存。
如果一定要对分词的field执行聚合,那么必须将fielddata=true,然后es就会在执行聚合操作的时候,现场将field对应的数据,建立一份fielddata正排索引,fielddata正排索引的结构跟doc value是类似的,但是只会讲fielddata正排索引加载到内存中来,然后基于内存中的fielddata正排索引执行分词field的聚合操作
如果直接对分词field执行聚合,报错,才会让我们开启fielddata=true,告诉我们,会将fielddata uninverted index,正排索引,加载到内存,会耗费内存空间
为什么fielddata必须在内存?因为大家自己思考一下,分词的字符串,需要按照term进行聚合,需要执行更加复杂的算法和操作,如果基于磁盘和os cache,那么性能会很差