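Baseline: Elasticsearch 6.0.
text is an analyzed type, and by default aggregations on it are not allowed. To aggregate on a text field you have to set fielddata to true on that field. That only makes the aggregation possible, though; it does not make it exact. A text field still goes through analysis (tokenization), so the buckets are built from the individual tokens rather than from the original values, and the result may not be what you expect. To aggregate on the exact values, also give the field a sub-field of type keyword through the fields mapping parameter; aggregating on that sub-field returns the original values.
The test procedure follows.
First, create an index my_index with a type named my_type (in 6.x each index supports only one type), and give it a mapping in which the testText field has type text: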
PUT my_index
{
  "mappings" : {
    "my_type" : {
      "properties" : {
        "testText" : {
          "type" : "text"
        }
      }
    }
  }
}
Next, index one document:
POST my_index/my_type
{
  "testText": "v1/v2"
}
Then try a bucket (terms) aggregation on the field:
POST /my_index/my_type/_search
{
  "size" : 0,
  "aggs" : {
    "buk" : {
      "terms" : {
        "field" : "testText"
      }
    }
  }
}
The request fails with the following error:
{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    ],
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    "grouped": true,
    "failed_shards": [
      {
        "shard": 0,
        "index": "my_index",
        "node": "WsOTyQlISXKvOkxoqlqAJA",
        "reason": {
          "type": "illegal_argument_exception",
          "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
        }
      }
    ],
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
      "caused_by": {
        "type": "illegal_argument_exception",
        "reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [testText] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
      }
    }
  },
  "status": 400
}
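The error says that fielddata is disabled on text fields by default because it can use a significant amount of memory. If you really want to aggregate on a text field, you can set fielddata=true on it, which loads fielddata into memory by uninverting the inverted index. The message also suggests the alternative: use a keyword field instead of text.
Leaving the memory cost aside for now, let's follow the hint and set fielddata=true: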
POST /my_index/_mapping/my_type
{
  "properties": {
    "testText": {
      "type": "text",
      "fielddata": true
    }
  }
}
With that in place, running the same aggregation again returns:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "buk": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "v1",
          "doc_count": 1
        },
        {
          "key": "v2",
          "doc_count": 1
        }
      ]
    }
  }
}
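The aggregation on testText produced two buckets, one with key "v1" and one with key "v2", each containing one document. Yet we only indexed a single document, whose testText value is "v1/v2". The reason is that a text field is analyzed: the default analyzer splits on "/" and drops it, so the inverted index holds the two tokens "v1" and "v2". The aggregation therefore buckets on those two tokens rather than on the original input "v1/v2". This is what the introduction meant by saying that aggregating on a text field may be inaccurate.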
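The tokenization can also be checked directly instead of being inferred from the buckets. A minimal sketch using the _analyze API, assuming the index and field names from this walkthrough:

```console
GET my_index/_analyze
{
  "field": "testText",
  "text": "v1/v2"
}
```

The response lists the tokens that the analyzer configured for testText produces for the given text, which should match the bucket keys seen above.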
Now let's fix the inaccuracy.
Add a sub-field of type keyword to the testText field:
POST /my_index/_mapping/my_type
{
  "properties": {
    "testText": {
      "type": "text",
      "fielddata": true,
      "fields": {
        "subField": {
          "type": "keyword",
          "ignore_above": 256
        }
      }
    }
  }
}
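Now test the aggregation again, this time on subField, the sub-field of testText. The sub-field receives the same value as its parent field, just indexed without analysis, so it can stand in for the parent in aggregations. Note that if you do not index any new documents, the aggregation returns no buckets: the existing document was indexed under the old mapping and is not re-indexed automatically when a sub-field is added, so it has no value in subField. Here I simply index a new document and then run the aggregation.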
POST my_index/my_type
{
  "testText": "v3/v4"
}

POST /my_index/my_type/_search
{
  "size" : 0,
  "aggs" : {
    "buk" : {
      "terms" : {
        "field" : "testText.subField"
      }
    }
  }
}
After indexing the new value "v3/v4" and aggregating on subField, the result is:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "buk": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "v3/v4",
          "doc_count": 1
        }
      ]
    }
  }
}
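There is a single bucket whose key is the original input "v3/v4", so the aggregation is now exact. The older "v1/v2" document is counted in hits.total but gets no bucket, because it was indexed before the sub-field existed.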
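If the older document should appear in the aggregation as well, one option (not used in the test above) is to re-index the existing documents in place so that they pick up the new sub-field. A minimal sketch with the _update_by_query API, assuming the same my_index; the conflicts=proceed parameter is an optional choice here so that version conflicts do not abort the run:

```console
POST my_index/_update_by_query?conflicts=proceed
```

Sent without a request body, this re-indexes each document from its stored _source without changing it, which applies the current mapping. After it completes, the aggregation on testText.subField should also include a bucket for "v1/v2".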
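Keep in mind that enabling fielddata on a text field consumes heap memory, while the aggregation on the keyword sub-field does not depend on it. For string fields that need to be aggregated, it is better to map them as keyword in the first place.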
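As a sketch of that recommendation, the mapping below declares the field as keyword from the start, so neither fielddata nor a sub-field is needed. The index name my_keyword_index is hypothetical and only used for illustration; the type and field names follow the ones used above:

```console
PUT my_keyword_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "testText": {
          "type": "keyword"
        }
      }
    }
  }
}
```

With this mapping a terms aggregation can target testText directly. The trade-off is that the field is no longer analyzed, so full-text queries on it only match the whole value.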