Mapping - Field datatypes

Latest recommended article published on 2024-03-05 22:05:42
姓氏弓长张
Category: elasticsearch  Tags: elasticsearch mapping Field datatypes
 
  1.array datatype 
 
   By default, any field can contain zero or more values. All values in an array must be of the same datatype.
 
 an array of strings: [ "one", "two" ]
 an array of integers: [ 1, 2 ]
 an array of arrays: [ 1, [ 2, 3 ]] which is the equivalent of [ 1, 2, 3 ]
 an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]
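
 The equivalence in the third example can be illustrated with a small Python sketch. The helper name flatten_values is hypothetical; it only mimics how nested arrays collapse into one flat list of values and is not Elasticsearch code.

```python
def flatten_values(value):
    """Collapse arbitrarily nested lists into one flat list of values."""
    if not isinstance(value, list):
        return [value]
    flat = []
    for item in value:
        # Recurse so that inner arrays contribute their values directly.
        flat.extend(flatten_values(item))
    return flat


print(flatten_values([1, [2, 3]]))  # [1, 2, 3]
```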

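   Note: arrays of objects do not work the way you might expect: you cannot query each object independently of the other objects in the array. If you need to query the inner data of each object, use the nested datatype.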
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/nested.html 
 
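   When a field is added dynamically, the first value in the array determines the field type. All subsequent values must be of the same datatype, or at least be coercible to it. Searches then operate on that datatype.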
 
  2.binary datatype 
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/binary.html 
 
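   Accepts a binary value as a Base64-encoded string (for example, the bytes read from an image or file, encoded as Base64). A field of this type is not stored by default and is not searchable.

   Note: the Base64-encoded binary value must not contain embedded newlines (\n).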
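
   A minimal Python sketch of producing such a single-line value. The function name and the sample payload are made up for illustration; only the standard library base64 module is used.

```python
import base64


def encode_for_binary_field(raw: bytes) -> str:
    """Return a single-line Base64 string suitable for a binary field."""
    # b64encode never inserts line breaks, unlike base64.encodebytes,
    # which adds a newline after every 76 characters of output.
    return base64.b64encode(raw).decode("ascii")


payload = bytes(range(256))  # stand-in for image or file bytes
encoded = encode_for_binary_field(payload)
print("\n" in encoded)  # False
```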
 
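   Two parameters can be set (my reading of them follows):

   doc_values: whether the field is also stored on disk in a column-stride fashion so that it can be used for sorting, aggregations, or scripting. Defaults to false for binary fields.

   store: defaults to false. It controls whether the field value is stored and can be retrieved separately from the _source field.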
 
  3.Boolean datatype 
 
   False values: false, "false", "off", "no", "0", "" (empty string), 0, 0.0
   True values: anything that isn't false.
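
   The two lists can be written down as a small Python sketch. The helper es5_boolean is hypothetical and only mirrors the lists above; it is not the parser Elasticsearch itself uses.

```python
FALSE_STRINGS = ("false", "off", "no", "0", "")


def es5_boolean(value) -> bool:
    """Interpret a value according to the false/true lists above."""
    if isinstance(value, str):
        return value not in FALSE_STRINGS
    # Covers false, 0 and 0.0; every other non-string value counts as true.
    return value not in (False, 0, 0.0)


for sample in (False, "off", "", 0.0, 1.2, "yes"):
    print(repr(sample), es5_boolean(sample))
```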
 
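   A boolean field accepts values of several types. A plain search through the API returns the original value in _source: if 1.2 was indexed, the hit shows _source {"property": 1.2, ...}. Aggregations, however, bucket the values as true and false.

   With script_fields, the field 'is_published' can be returned directly as true or false: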
 
   {
     "script_fields": {
       "is_published": {
         "script": {
           "lang": "painless",
           "inline": "doc['is_published'].value"
         }
       }
     }
   }
 
   For details on using scripts, see
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/modules-scripting.html 
 
  4.Date datatype 
 
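   JSON has no date type, so a date in Elasticsearch can be:

   a string containing a formatted date, e.g. "2015-01-01" or "2015/01/01 12:10:30"

   a long number representing milliseconds since 1970-01-01

   an integer representing seconds since 1970-01-01

   Internally, dates are converted to UTC and stored as a long number of milliseconds.

   Date formats can be customised. If no format is set, the default is used:

  "strict_date_optional_time||epoch_millis"

   The default means a date can be supplied either in strict_date_optional_time format or as milliseconds since the epoch.

   For details of strict_date_optional_time, see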
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/mapping-date-format.html#strict-date-time 
 
   The exact format follows the Joda-Time parser:
 
  http://www.joda.org/joda-time/apidocs/org/joda/time/format/ISODateTimeFormat.html#dateOptionalTimeParser-- 
 
  date-opt-time = date-element ['T' [time-element] [offset]] 
 
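   offset is the time zone offset. China, for example, is in UTC+08:00, written +0800.

   A local Chinese time written as "2016-1-7 10:21" therefore really denotes "2016-1-7T10:21+0800". If no offset is given, Elasticsearch interprets the value as UTC.

   Experiment: index dates with different time zone offsets or date formats.

   When sorting on the date field, the sort values in the response show the internally stored milliseconds. Alternatively, use a painless script in script_fields to return the milliseconds and inspect them: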
 
   "date": "2015-01-01T12:10:30Z" == 1420114230000 
 
   "date": "2015-01-01T12:10:30+0800" == 1420085430000 
 
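   The two millisecond values can be reproduced with a short Python sketch. The function name is made up for illustration, and the UTC sample is written with an explicit +0000 offset instead of Z so that the same pattern parses both strings.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(text: str) -> int:
    """Parse 'YYYY-MM-DDTHH:MM:SS+hhmm' and return UTC epoch milliseconds."""
    moment = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S%z")
    # Subtracting two aware datetimes compares them on the UTC timeline.
    return (moment - EPOCH) // timedelta(milliseconds=1)


print(to_epoch_millis("2015-01-01T12:10:30+0000"))  # 1420114230000
print(to_epoch_millis("2015-01-01T12:10:30+0800"))  # 1420085430000
```

   The +0800 value is 28,800,000 ms (eight hours) smaller, because 12:10:30 in UTC+8 is 04:10:30 UTC.
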
   Multiple formats can be set for a date field, separated by '||'. Each format is tried in turn until one matches the value.

   PUT my_index
   {
     "mappings": {
       "my_type": {
         "properties": {
           "date": {
             "type": "date",
             "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
           }
         }
       }
     }
   }
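
   The try-each-format behaviour can be sketched in Python. This is an illustration only: the strptime patterns are hand-written equivalents of the Joda patterns in the mapping, parse_date is a hypothetical name, and values without an offset are assumed to be UTC.

```python
from datetime import datetime, timezone

# Python strptime equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd".
PATTERNS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_date(value) -> datetime:
    """Try each pattern in order, then fall back to epoch milliseconds."""
    for pattern in PATTERNS:
        try:
            parsed = datetime.strptime(str(value), pattern)
        except ValueError:
            continue  # this pattern did not match, try the next one
        return parsed.replace(tzinfo=timezone.utc)  # assumption: no offset means UTC
    return datetime.fromtimestamp(int(value) / 1000, tz=timezone.utc)


print(parse_date("2015-01-01 12:10:30"))
print(parse_date("2015-01-01"))
print(parse_date(1420114230000))
```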
 
  5.geo_point datatype (geographic geodesic) 
 
   A geo_point field stores a latitude/longitude pair and can be used to:

   ① find geo-points within a bounding box, within a certain distance of a central point, or within a polygon

   Bounding box, see
   https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-geo-bounding-box-query.html 

   Distance, see
   https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-geo-distance-query.html 

   Polygon, see
   https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-geo-polygon-query.html 

   ② aggregate documents geographically or by distance from a central point

  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/search-aggregations-bucket-geohashgrid-aggregation.html 

   ③ integrate distance into a document's relevance score

  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/query-dsl-function-score-query.html 

   ④ sort documents by distance
 
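   There are four ways to specify a geo-point:

   ① as an object with lat and lon keys

   "location": { "lat": 41.12, "lon": -71.34 }

   ② as a string in the format "lat,lon"

   "location": "41.12,-71.34"

   ③ as a geohash

   "location": "drm3btev3e86"

   ④ as an array in the format [ lon, lat ]

   "location": [ -71.34, 41.12 ]

   Note: a geo-point given as a string is ordered lat,lon (latitude first), whereas a geo-point given as an array is ordered the other way round, lon,lat (longitude first).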
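
   The ordering difference is easy to get wrong, so here is a small Python sketch that normalises the object, string and array forms to a (lat, lon) tuple. The helper to_lat_lon is hypothetical and deliberately does not decode geohashes.

```python
def to_lat_lon(location):
    """Return (lat, lon) for the object, string and array forms."""
    if isinstance(location, dict):
        return float(location["lat"]), float(location["lon"])
    if isinstance(location, str):
        # String form is "lat,lon". Geohash strings are not handled here.
        lat, lon = location.split(",")
        return float(lat), float(lon)
    # Array form is [lon, lat], the reverse of the string form.
    lon, lat = location
    return float(lat), float(lon)


print(to_lat_lon({"lat": 41.12, "lon": -71.34}))
print(to_lat_lon("41.12,-71.34"))
print(to_lat_lon([-71.34, 41.12]))
```

   All three calls describe the same point and return the same tuple.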
 
   When a geo_point is accessed in a script, the value is returned as a GeoPoint object, from which the latitude and longitude can be read as follows:

  geopoint = doc['location'].value;
  lat      = geopoint.lat;
  lon      = geopoint.lon;

  For performance reasons, it is better to access the lat/lon values directly:

  lat = doc['location'].lat;
  lon = doc['location'].lon;
 
  6.geo_shape 
 
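   The geo_shape datatype facilitates indexing and searching of arbitrary geo shapes such as rectangles and polygons. It should be used when the data being indexed or the queries being run contain shapes other than plain points.

   mapping options

   The geo_shape mapping maps GeoJSON geometry objects to the geo_shape type. To use it, the field must be mapped explicitly as geo_shape. The following options are available: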
 
   tree: geohash 
 
   Name of the PrefixTree implementation to use: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree. Defaults to geohash.
 
   precision: 50m 
 
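   This parameter can be used instead of tree_levels to arrive at a suitable tree_levels value. It specifies the desired precision, and Elasticsearch calculates the best tree_levels (number of prefix tree layers) to honour it. The value should be a number followed by an optional distance unit. Valid distance units are: in, inch, yd, yard, mi, miles, km, kilometers, m, meters, cm, centimeters, mm, millimeters.

   The default unit is meters (m).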
 
   tree_levels: 
 
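   Maximum number of layers used by the PrefixTree. It controls the precision of shape representations and therefore how many terms are indexed. The default depends on the chosen PrefixTree implementation. Because setting it requires some understanding of the underlying implementation, users can set the precision parameter instead. Elasticsearch nevertheless only uses tree_levels internally, and that is the value returned by the mapping API even if you set precision.

   The documented default is 50m. This appears to be a precision, i.e. the default tree_levels is whatever a precision of 50m yields.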
 
   strategy:recursive 
 
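   The strategy parameter defines how shapes are represented at index and search time. It also influences the available capabilities, so it is recommended to let Elasticsearch set it automatically. Two strategies are available: recursive and term. The term strategy supports point types only (points_only is then set to true), while the recursive strategy supports all shape types.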
 
   distance_error_pct: 
 
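   A hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%), with 0.5 as the maximum supported value. If precision or tree_levels is set explicitly, this value defaults to 0, which guarantees spatial precision at the level defined in the mapping. That can lead to significant memory usage for high-resolution shapes with low error (e.g. large shapes at 1m with < 0.001 error). To improve indexing performance (at the cost of query accuracy), define tree_levels or precision explicitly together with a reasonable distance_error_pct, bearing in mind that large shapes will produce more false positives.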
 
   orientation: 
 
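   Optionally defines how to interpret the vertex order of polygons. The parameter selects one of two coordinate system rules (right-hand or left-hand), each of which can be written in three ways: right-hand rule (counterclockwise): right, ccw, counterclockwise; left-hand rule (clockwise): left, cw, clockwise. The default orientation is counterclockwise, in line with the OGC (Open Geospatial Consortium) standard, which defines outer ring vertices in counterclockwise order and inner ring vertices (holes) in clockwise order. Setting this parameter in the geo_shape mapping fixes the vertex order for the field's coordinate lists, but it can be overridden in each individual GeoJSON document.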
 
   points_only:false 
 
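   Setting this option to true (default false) configures the geo_shape field for point shapes only. This optimises index and search performance for the geohash and quadtree implementations when it is known that only points will be indexed. geo_shape queries cannot currently be run against geo_point fields; points_only closes that gap by making a point-only geo_shape field efficient to query. (So, as far as I understand it, geo_shape with points_only does not conflict with geo_point: it is an alternative way to index points so that geo_shape queries can be used on them.)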
 
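   Prefix trees

   To represent shapes efficiently in the index, shapes are converted into a series of hashes representing grid squares (commonly called "rasters") using a PrefixTree implementation. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision, to represent the Earth. This can be thought of as the increasing level of detail of a map or image at higher zoom levels (similar to zooming in Baidu Maps).

   The available PrefixTree implementations are:

   GeohashPrefixTree - uses geohashes for grid squares. Geohashes are Base32-encoded strings of the interleaved bits of the latitude and longitude. The longer the hash, the more precise it is. Each character added to a geohash represents another tree level and adds 5 bits of precision. A geohash represents a rectangular area and has 32 sub-rectangles. The maximum number of levels is 24.

   QuadPrefixTree - uses a quadtree for grid squares. Similar to geohash, a quadtree interleaves the bits of the latitude and longitude, and the resulting hash is a bit set. One tree level takes 2 bits in this bit set, one for each coordinate. The maximum number of levels is 50.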
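
   To make the "interleaved bits, 5 bits per character" description concrete, here is a minimal Python sketch of the standard geohash encoding. It is an illustration, not the GeohashPrefixTree code; the function name is made up.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, length: int) -> str:
    """Encode a point as a geohash of the given length (5 bits per character)."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one Base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(41.12, -71.34, 4))  # drm3
```

   Each extra character halves the cell a further five times, which is why longer hashes are more precise and why each character corresponds to one tree level.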
 
   Spatial strategies

   See
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/geo-shape.html#spatial-strategy 
 
   recursive 
 
   term 
 
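   The rest is not translated here. It mostly concerns precision and index settings; the recommendation is to choose a compromise precision so that the index does not grow too large.

   For how to specify each kind of shape, see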
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/geo-shape.html#point 
 
   Mapping: the geo_shape field has to be added as a new field that does not exist yet.

   It can be set with PUT index/_mapping/type and the following body:

  {
    "properties": {
      "location": {
        "type": "geo_shape",
        "tree": "quadtree",
        "precision": "1m"
      }
    }
  }
 
  Point

  {
    "location": {
      "type": "point",
      "coordinates": [-77.03653, 38.897676]
    }
  }
 
  LineString

  {
    "location": {
      "type": "linestring",
      "coordinates": [[-77.03653, 38.897676], [-77.009051, 38.889939]]
    }
  }
 
  Circle

  {
    "location": {
      "type": "circle",
      "coordinates": [-45.0, 45.0],
      "radius": "100m"
    }
  }
 
  7.ip datatype 
 
   An ip field can index and store either IPv4 or IPv6 addresses.

   Querying ip fields

   Use CIDR notation; see
 
  https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing#CIDR_notation 
 
   [ip_address]/[prefix_length] 
 
   How it works

  The number of addresses of a subnet may be calculated as 2^(address size − prefix size), in which the address size is 128 for IPv6 and 32 for IPv4. For example, in IPv4, the prefix size /29 gives: 2^(32 − 29) = 2^3 = 8 addresses.

   Common prefix lengths and their subnet masks:

   /8     255.0.0.0
   /16    255.255.0.0
   /24    255.255.255.0
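
  The arithmetic and the mask table can be checked with Python's standard ipaddress module. The 10.0.0.0 network and the 192.168.3.7 address are arbitrary examples chosen for this sketch.

```python
import ipaddress

subnet = ipaddress.ip_network("192.168.0.0/29")
print(subnet.num_addresses)  # 8, i.e. 2 ** (32 - 29)

for prefix in (8, 16, 24):
    network = ipaddress.ip_network(f"10.0.0.0/{prefix}")
    print(f"/{prefix}", network.netmask)

# Membership test: does an address fall inside a CIDR range?
in_range = ipaddress.ip_address("192.168.3.7") in ipaddress.ip_network("192.168.0.0/16")
print(in_range)  # True
```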
 
  This can be used to filter by subnet:

  POST my_index/_search
  {
    "query": {
      "term": {
        "ip_addr": "192.168.0.0/16"
      }
    }
  }
 
  8.keyword datatype 
 
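   Indexed fields often hold values such as email addresses, hostnames, status codes, zip codes or tags. These are typically used for filtering, aggregations and similar operations. The keyword type ensures they can only be searched by their exact value, not by analysed, scored full-text matching.

   If you need full-text indexing of content such as email bodies or product descriptions, use the text type instead.

   keyword was introduced in 5.0; earlier versions do not have it.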
 
  9.nested datatype 
 
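   The nested type is a specialised version of the object datatype that allows arrays of objects to be indexed and queried independently of each other.

   How arrays of objects are flattened

   Arrays of inner object fields do not work the way you might expect. Lucene has no concept of inner objects, so Elasticsearch flattens the object hierarchy into a simple list of field names and values.

   For example

  "user": [
    {
      "first": "John",
      "last": "Smith"
    },
    {
      "first": "Alice",
      "last": "White"
    }
  ]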
 
   is transformed internally into a document that looks more like this:

  "user.first": [ "alice", "john" ],
  "user.last": [ "smith", "white" ]
 
  Note: after this flattening, user.first and user.last are both multi-value fields, and the original associations (John with Smith, Alice with White) are lost. A search for a user by full name can therefore match incorrectly: the real name is Alice White, yet the following query for alice AND smith also matches the document.

  GET my_index/_search
  {
    "query": {
      "bool": {
        "must": [
          { "match": { "user.first": "Alice" }},
          { "match": { "user.last": "Smith" }}
        ]
      }
    }
  }
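
  The lost association can be reproduced outside Elasticsearch with a few lines of Python. Both function names are hypothetical; the first mimics matching against the flattened fields, the second mimics matching each object on its own, which is what the nested datatype preserves.

```python
users = [
    {"first": "John", "last": "Smith"},
    {"first": "Alice", "last": "White"},
]


def flattened_match(objects, first, last):
    """Object-style matching: each field is checked against all values."""
    firsts = {o["first"].lower() for o in objects}
    lasts = {o["last"].lower() for o in objects}
    return first.lower() in firsts and last.lower() in lasts


def per_object_match(objects, first, last):
    """Nested-style matching: both fields must match within one object."""
    return any(
        o["first"].lower() == first.lower() and o["last"].lower() == last.lower()
        for o in objects
    )


print(flattened_match(users, "Alice", "Smith"))   # True: the pairing is lost
print(per_object_match(users, "Alice", "Smith"))  # False
print(per_object_match(users, "Alice", "White"))  # True
```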
 
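   Using nested fields for arrays of objects

   If you need to index arrays of objects while keeping each object in the array independent (that is, keeping the fields of one object associated with each other), use the nested datatype instead of the object datatype. Internally, each object in a nested array is indexed as a separate hidden document, which means each nested object can be queried independently of the others.

   Map user as a nested type (under properties):

   "user": {
     "type": "nested"
   }

   After indexing the data, query it with a nested query. The query below does not match, because Alice and Smith are not in the same nested object:

  GET my_index/_search
  {
    "query": {
      "nested": {
        "path": "user",
        "query": {
          "bool": {
            "must": [
              { "match": { "user.first": "Alice" }},
              { "match": { "user.last": "Smith" }}
            ]
          }
        }
      }
    }
  }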
 
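   Note: because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested aggregations, or nested inner hits. For example, highlighting that relies on offsets set in a nested document is not available on the outer document; it has to be requested through nested inner hits. As the statement below shows, inner_hits sits inside the nested query. This query matches, because Alice and White are in the same nested object.

  GET my_index/_search
  {
    "query": {
      "nested": {
        "path": "user",
        "query": {
          "bool": {
            "must": [
              { "match": { "user.first": "Alice" }},
              { "match": { "user.last": "White" }}
            ]
          }
        },
        "inner_hits": {
          "highlight": {
            "fields": {
              "user.first": {}
            }
          }
        }
      }
    }
  }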
 
   10.numeric datatypes 
 
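   The following numeric types are supported:

   long: signed 64-bit integer, from -2^63 to 2^63-1

   integer: signed 32-bit integer, from -2^31 to 2^31-1

   short: signed 16-bit integer, from -32768 to 32767

   byte: signed 8-bit integer, from -128 to 127

   double: 64-bit floating point

   float: 32-bit floating point

   half_float: 16-bit floating point

   scaled_float: a floating point value backed by a long and a fixed scaling factor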
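
   The integer ranges follow directly from the bit widths. A short Python sketch (the names are illustrative only):

```python
INTEGER_BITS = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def signed_range(bits: int):
    """Return (min, max) for a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


for name, bits in INTEGER_BITS.items():
    print(name, signed_range(bits))
```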
 
   11.object datatype 
 
  PUT my_index/my_type/1
  {
    "region": "US",
    "manager": {
      "age": 30,
      "name": {
        "first": "John",
        "last": "Smith"
      }
    }
  }

  Internally, the document is stored as a flat list of key-value pairs:

  {
    "region": "US",
    "manager.age": 30,
    "manager.name.first": "John",
    "manager.name.last": "Smith"
  }
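
  The same flattening can be sketched in Python. The function name flatten_object is hypothetical; it only reproduces the dotted-key layout shown above.

```python
def flatten_object(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into inner objects, carrying the path so far.
            flat.update(flatten_object(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "region": "US",
    "manager": {"age": 30, "name": {"first": "John", "last": "Smith"}},
}
print(flatten_object(doc))
```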
 
   12.string datatype 
 
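   The string type is no longer supported as of Elasticsearch 5; it has been replaced by text and keyword, and string fields are converted to them automatically.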
 
   13.text datatype 
 
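   This type indexes full-text values, such as the body of an email or the detailed description of a product. The field is passed through an analyzer, which converts the string into individual terms that can be indexed and that Elasticsearch then searches. Text fields cannot be used for sorting and are seldom used for aggregations.

   If you need to index the complete value exactly, use the keyword type.

   If you want one field to support full-text search as well as exact search and aggregations, see multi-fields: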
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/multi-fields.html 
 
   For the analyzers used internally, see
 
  https://www.elastic.co/guide/en/elasticsearch/reference/5.0/analysis.html 
 
   For the IK analyzer, see the elasticsearch-ik analyzer plugin.

   ik_smart, ik_max_word
 
   14.token count datatype 
 
   The token_count datatype records, as an integer, the number of tokens the analyzer produces for a string. It is added under the fields key of a property (see multi-fields; the link is in the text datatype section).

   PUT my_index
   {
     "mappings": {
       "my_type": {
         "properties": {
           "name": {
             "type": "text",
             "fields": {
               "length": {
                 "type": "token_count",
                 "analyzer": "standard"
               }
             }
           }
         }
       }
     }
   }

   PUT my_index/my_type/1
   { "name": "John Smith" }

   PUT my_index/my_type/2
   { "name": "Rachel Alice Williams" }

   GET my_index/_search
   {
     "query": {
       "term": {
         "name.length": 3
       }
     }
   }
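
   Why the query for name.length 3 finds only document 2 can be seen with a rough Python approximation. The regular expression is an assumption standing in for the standard analyzer; it gives the same counts for simple names like these, but it is not the analyzer itself.

```python
import re


def approx_token_count(text: str) -> int:
    """Count runs of letters and digits as a stand-in for analyzer tokens."""
    return len(re.findall(r"\w+", text))


print(approx_token_count("John Smith"))             # 2
print(approx_token_count("Rachel Alice Williams"))  # 3
```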
 
   15.percolator type: I do not fully understand it yet, so it is not covered for now.