INTRODUCTION
what is Elasticsearch?
- document oriented
- schema free
- distributed
- multi-tenancy
- api centric & RESTful
- unstructured & structured search
- analytics & combine
- realtime
glossary
- cluster
- node
- index: collection of documents, logical namespace which maps to shards
- shard: single Apache Lucene instance
- primary: number of primary shards cannot be changed after index creation
- replica: increase high availability, increase read throughput
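Why the primary shard count is fixed can be illustrated with a small sketch. It assumes the commonly documented routing rule, shard = hash(routing value) % number of primary shards. The hash used here (zlib.crc32) is only a stand-in for illustration, not the hash Elasticsearch uses, and pick_shard is a hypothetical helper.

```python
import zlib

def pick_shard(routing_value: str, num_primary_shards: int) -> int:
    """Map a routing value (by default the document _id) to a primary shard."""
    # Stand-in hash for illustration only (assumption, not the real one).
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

doc_ids = [f"doc-{i}" for i in range(1000)]
before = {d: pick_shard(d, 5) for d in doc_ids}   # index created with 5 primaries
after = {d: pick_shard(d, 6) for d in doc_ids}    # hypothetical change to 6 primaries
moved = sum(1 for d in doc_ids if before[d] != after[d])
print(f"{moved} of {len(doc_ids)} documents would map to a different shard")
```

Under this rule, changing the number of primaries would send most lookups to a shard that does not hold the document, which is why replicas can be adjusted later but primaries cannot.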
in a distribution
- executable scripts
- node config file
- data storage
- data path
- internal libs
- log
setting
- discovery: multicast / unicast
- network
- network.host: bind address, publish address
- HTTP: ports [9200-9300)
- transport: ports [9300-9400)
plugin
DATA IN DATA OUT
data structure
- document as JSON object
- metadata fields: _id, _type, _source, _timestamp, _size…
- field data type: core data type, complex type, others
APIs
- Create index API: -XPUT + target index name
- Index API: create & update doc, -XPUT + … document type
- operation response: 201 or 200
- _id in url is optional: auto generated when omitted
- timestamp
- TTL: expiration time
- distributed execution: 1) index request, 2) reroute, 3) replicate
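A minimal sketch of how an index request could be assembled, assuming the type-based URL layout the outline describes (index / type / optional _id). The host, the index name "blog", and the type "post" are placeholders, build_index_request is a hypothetical helper, and the request is only built, never sent.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9200"  # assumption: local node on the default HTTP port

def build_index_request(index: str, doc_type: str, doc: dict,
                        doc_id: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) an index request for one JSON document."""
    if doc_id is None:
        # No _id in the URL: POST lets the server generate one.
        url, method = f"{BASE_URL}/{index}/{doc_type}", "POST"
    else:
        # Explicit _id: PUT creates the document or replaces an existing one.
        url, method = f"{BASE_URL}/{index}/{doc_type}/{doc_id}", "PUT"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(url, data=body, method=method,
                                  headers={"Content-Type": "application/json"})

req = build_index_request("blog", "post", {"title": "Hello"}, doc_id="1")
print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) would be expected to answer 201 when the document is created and 200 when an existing one is replaced, as noted above.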
- Get API: -XGET + retrieve doc from index using _type and _id
- realtime
- operation response: 200 or 404
- request specific fields using _source
- distributed execution: 1) get, 2) execute
- Exists API: -XHEAD
- Delete API: -XDELETE
- Document Versioning:
- concurrency control: read-then-write
- creation, reindex, update, delete ops
- can be from external system
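The read-then-write pattern can be sketched with a toy in-memory store. This is an illustration of optimistic concurrency control in general, not Elasticsearch's implementation. VersionedStore, VersionConflict, and expected_version are hypothetical names.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""

class VersionedStore:
    """In-memory stand-in for an index that versions every document."""

    def __init__(self):
        self._docs = {}  # _id -> (version, source)

    def get(self, doc_id):
        return self._docs[doc_id]

    def index(self, doc_id, source, expected_version=None):
        current_version = self._docs.get(doc_id, (0, None))[0]
        # Reject the write if someone else changed the document since the read.
        if expected_version is not None and expected_version != current_version:
            raise VersionConflict(
                f"expected {expected_version}, found {current_version}")
        self._docs[doc_id] = (current_version + 1, source)
        return current_version + 1

store = VersionedStore()
store.index("1", {"views": 0})                              # creation: version 1
version, source = store.get("1")                            # read ...
store.index("1", {"views": 1}, expected_version=version)    # ... then write: version 2
try:
    store.index("1", {"views": 99}, expected_version=version)  # stale version
except VersionConflict as exc:
    print("conflict:", exc)
```

A caller that hits the conflict would typically re-read the document and retry, which is the idea behind retrying updates on conflict.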
- Update API: -XPOST + partial data or scripts
- internal get-then-reindex ops
- minimize conflicts: retry_on_conflict
- named parameters, index related parameter, upsert
- Multi Get API
- get multiple doc: _mget
- Bulk API
- minimize round trips when bulk ops
- line break: one JSON object per line, every line ends with a newline
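A short sketch of how a bulk body might be serialized: one action line, optionally followed by a source line, each terminated by a line break. build_bulk_body and the index/type/id values are illustrative assumptions.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a bulk request body."""
    lines = []
    for action, metadata, source in operations:
        lines.append(json.dumps({action: metadata}))
        if source is not None:      # delete actions carry no source line
            lines.append(json.dumps(source))
    # Every line, including the last one, ends with a line break.
    return "\n".join(lines) + "\n"

body = build_bulk_body([
    ("index", {"_index": "blog", "_type": "post", "_id": "1"}, {"title": "Hello"}),
    ("delete", {"_index": "blog", "_type": "post", "_id": "2"}, None),
])
print(body, end="")
```

Packing many operations into one body is what saves the round trips; the line-oriented format lets the receiver split the body without parsing it as one large JSON document.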
- Search API:
- finding doc based on free text search: -XGET
- query DSL
- control search context
ELASTIC SEARCH & LUCENE
Lucene index
- memory buffer
- flush: issue Lucene commit & clear translog
- refresh
- N segments: immutable inverted index
- segments API
- transaction log
- delete doc/seg
- merge segments: throttle
- optimize API: explicit merge
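The life cycle above (buffer, refresh, immutable segments, translog, flush, merge) can be sketched as a toy model. ToyShard is a hypothetical class for illustration and greatly simplifies what Lucene does; in particular, segments here are plain tuples of documents rather than inverted indexes.

```python
class ToyShard:
    """Toy model of buffer -> refresh -> segments -> merge; not Lucene itself."""

    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.translog = []   # operations not yet covered by a commit
        self.segments = []   # immutable tuples of documents, searchable

    def index(self, doc):
        self.translog.append(doc)   # sequential append for durability
        self.buffer.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable, searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Make everything durable (commit) and clear the transaction log."""
        self.refresh()
        self.translog = []

    def merge(self):
        """Rewrite all segments into a single larger segment."""
        merged = tuple(doc for seg in self.segments for doc in seg)
        self.segments = [merged] if merged else []

    def search(self, term):
        return [d for seg in self.segments for d in seg if term in d.split()]

shard = ToyShard()
shard.index("quick brown fox")
print(shard.search("fox"))    # nothing yet: the document is still in the buffer
shard.refresh()
print(shard.search("fox"))    # visible after refresh
```

Because segments are never modified in place, a merge writes a new segment and drops the old ones, which is also where deleted documents would finally be removed.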
Detour
- writes are sequential
TEXT ANALYSIS
- need: stop word, uppercase, plural form, synonym…
- anatomy of analyzer: tokenizer, token filter
- ICU plugins
- pre-built
- Analyze API
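The anatomy of an analyzer (one tokenizer followed by a chain of token filters) can be sketched as below. The stop word list and the naive plural stripping are illustrative assumptions, far simpler than the pre-built analyzers.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "of", "to"}  # illustrative list

def tokenize(text):
    """Tokenizer: split text into word tokens."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def plural_filter(tokens):
    """Very naive plural stripping; real stemmers are far more careful."""
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def analyze(text, filters=(lowercase_filter, stop_filter, plural_filter)):
    """Run the tokenizer, then each token filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("The Quick Cats and the lazy dogs"))
```

The order of filters matters: lowercasing runs before the stop filter here so that "The" is recognized as a stop word.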
MAPPINGS
mapping
- index based on doc and fields
- dynamic mapping
- when
- config
mapping API
- basic mapping
- type: text, numeric, date, boolean, object, common attribute
- dynamic nature
- multi field, metadata
- customize
SEARCH
- pagination
- sorting
- search types
- query then fetch
- dfs query then fetch
- count
- scan
- query DSL
- query: match, multi-match, bool, range, match all, query string …
- filters: warmer API
- highlighting
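A sketch of a search request body combining several of the items above: a bool query with match and range clauses, from/size pagination, sorting, and highlighting. The field names (title, views, published) and the helper build_search_body are assumptions for illustration.

```python
import json

def build_search_body(text, page=0, page_size=10):
    """Build a query DSL body for a free text search with paging and sorting."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                "must_not": [{"range": {"views": {"lt": 10}}}],
            }
        },
        "from": page * page_size,    # pagination offset
        "size": page_size,           # hits per page
        "sort": [{"published": {"order": "desc"}}, "_score"],
        "highlight": {"fields": {"title": {}}},
    }

body = build_search_body("elasticsearch", page=2)
print(json.dumps(body, indent=2))
```

Deep pagination with large from values gets expensive because every shard has to produce from + size candidates, which is one motivation for the scan search type.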
SUGGEST
suggester: term, phrase, completion, fuzzy
RELEVANCY & BOOSTING
relevancy
- vector space model
- TF-IDF
- lucene similarity
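A simplified TF-IDF weighting, as a sketch. The square-root term frequency and the 1 + ln(N / (df + 1)) inverse document frequency follow the general shape of Lucene's classic similarity as I understand it; length norms, coordination, and boosts are deliberately left out, so these numbers are not what Lucene would report.

```python
import math
from collections import Counter

def tf(term_freq):
    """Term frequency: more occurrences score higher, with diminishing returns."""
    return math.sqrt(term_freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rare terms weigh more than common ones."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(query_terms, doc_tokens, corpus):
    counts = Counter(doc_tokens)
    total = 0.0
    for term in query_terms:
        doc_freq = sum(1 for d in corpus if term in d)
        total += tf(counts[term]) * idf(len(corpus), doc_freq)
    return total

corpus = [
    ["quick", "brown", "fox"],
    ["lazy", "dog"],
    ["quick", "dog", "dog"],
]
scores = [score(["dog"], doc, corpus) for doc in corpus]
print(scores)
```

The third document ranks highest for "dog" because the term occurs twice in it, while the first document scores zero because the term is absent.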
boosting
function score: boost factor, decay function, script score, random score
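A sketch of a Gaussian decay function of the kind used in function score queries. The parameter names (origin, scale, offset, decay) mirror the documented ones, but the formula here is written from memory of the documentation and should be treated as an assumption: the score is 1 within offset of the origin and falls to decay at a distance of scale beyond the offset.

```python
import math

def gauss_decay(value, origin, scale, offset=0.0, decay=0.5):
    """Score in (0, 1] that shrinks as value moves away from origin."""
    distance = max(0.0, abs(value - origin) - offset)
    # Choose sigma^2 so that the score equals `decay` at distance == scale.
    sigma_sq = -(scale ** 2) / (2.0 * math.log(decay))
    return math.exp(-(distance ** 2) / (2.0 * sigma_sq))

for km in (0, 5, 10, 20):
    print(km, round(gauss_decay(km, origin=0, scale=10), 3))
```

Such a score is typically multiplied into the query score, so documents close to the origin (a location, a date, a price) keep their relevance while distant ones are pushed down.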
AGGREGATIONS
- facet
- scope
- query scope
- filtered_query, post_filter
- categories of aggregation
- Buckets
- filter
- sub-aggregations
- missing
- terms
- range, *_range
- histogram, date_histogram
- Metrics
- extended_stats
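The split between buckets and metrics can be sketched on a handful of in-memory documents: a bucket aggregation groups documents, and a metric sub-aggregation computes a number per bucket. The documents, the field names, and the "__missing__" key (a toy stand-in that folds the missing bucket into the terms grouping) are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

docs = [
    {"category": "book", "price": 10.0},
    {"category": "book", "price": 30.0},
    {"category": "music", "price": 8.0},
    {"price": 5.0},                      # no category: goes to the missing bucket
]

def terms_buckets(documents, field):
    """Bucket aggregation: group documents by the value of one field."""
    buckets = defaultdict(list)
    for doc in documents:
        buckets[doc.get(field, "__missing__")].append(doc)
    return buckets

def avg_metric(bucket_docs, field):
    """Metric aggregation: one number computed over a bucket's documents."""
    return mean(d[field] for d in bucket_docs)

result = {
    key: {"doc_count": len(bucket), "avg_price": avg_metric(bucket, "price")}
    for key, bucket in terms_buckets(docs, "category").items()
}
print(result)
```

Nesting works the same way at any depth: each bucket is just a set of documents that another bucket or metric aggregation can run over.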
DOCUMENT RELATIONS
inner objects, nested, parent/child
GEO LOCATION
geo point
geo shape: geohashes, quad tree
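A sketch of standard geohash encoding, to show how a point turns into a string whose prefixes identify ever larger grid cells. This is the generic public algorithm, not code taken from Elasticsearch; geohash_encode is a hypothetical helper.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, bit_count, chars = 0, 0, []
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid    # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid    # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(57.64911, 10.40744, precision=3))
```

Nearby points usually share a prefix, so a prefix match on the hash acts as a coarse bounding-box filter; points on either side of a cell boundary are the known exception.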
PERCOLATOR
- registering a query
- _percolate API, _mpercolate API
- routing, filtering, scoring and sorting, highlighting, aggregation, storing
DISTRIBUTED MODEL
- finding nodes, elect master node
- cluster meta API
- cluster state
- cluster state API
- shard allocation: create index, add nodes, remove node, filtering, awareness
- node types: data node, master node, client node, tribe node
- routing
- replication
INDEX MANAGEMENT
create index, update index settings, deleting index, open/close index
index template
snapshot/restore - backup mechanism for indices
index alias
DATA MANAGEMENT
- overallocation
- replica
- multiple indices
- capacity planning
- user data flow
- time data flow