Near Real-Time Search
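Lucene, the library that Elasticsearch is built on, introduced the concept of per-segment search. A segment is a fully functional inverted index. New documents are first written to the in-memory indexing buffer before they are written to an on-disk segment.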
The English documentation is fairly straightforward, so it is quoted directly here.
Sitting between Elasticsearch and the disk is the filesystem cache. Documents in the in-memory indexing buffer are written to a new segment. The new segment is written to the filesystem cache first, which is cheap, and only later is it flushed to disk, which is expensive. Once a file is in the cache, however, it can be opened and read just like any other file.
refresh
In Elasticsearch, this lightweight process of writing and opening a new segment is called a refresh. By default, every shard is refreshed automatically once every second. This is why we say that Elasticsearch has near real-time search: document changes are not visible to search immediately, but will become visible within 1 second.
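The refresh behavior described above can be illustrated with a small toy model. This is only a sketch, not how Elasticsearch or Lucene is implemented: the `ToyShard` class and its methods are hypothetical names made up for illustration, and the timer that triggers the refresh every second is left out.

```python
from collections import defaultdict


class ToyShard:
    """Toy model of one shard: an in-memory buffer plus searchable segments.

    Hypothetical illustration only; not the Elasticsearch/Lucene implementation.
    """

    def __init__(self):
        self.buffer = []    # documents indexed but not yet searchable
        self.segments = []  # each segment is a small, immutable inverted index

    def index(self, doc_id, text):
        # New documents go to the in-memory indexing buffer first.
        self.buffer.append((doc_id, text))

    def refresh(self):
        # Turn the buffered documents into a new segment and open it for search.
        if not self.buffer:
            return
        inverted = defaultdict(set)
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                inverted[term].add(doc_id)
        self.segments.append(dict(inverted))
        self.buffer = []

    def search(self, term):
        # Per-segment search: query every open segment and merge the hits.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return sorted(hits)


shard = ToyShard()
shard.index(1, "near real-time search")
before = shard.search("search")  # document is still only in the buffer
shard.refresh()
after = shard.search("search")   # the new segment is now searchable
print(before, after)
```

In this model the document only becomes visible to `search` after `refresh` is called, which mirrors the "visible within 1 second" behavior when a refresh runs once per second.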
commit
A refresh alone is not enough; the data also has to be persisted to disk.
The action of performing a commit and truncating the translog is known in Elasticsearch as a flush. Shards are flushed automatically every 30 minutes, or when the translog becomes too big.
To guarantee data reliability, Elasticsearch introduces a transaction log, the translog. Between two commit points, the translog records every data change.
New documents are added to the in-memory buffer and appended to the transaction log.
Every so often, such as when the translog is getting too big, the index is flushed: a new translog is created and a full commit is performed. The filesystem cache is flushed with an fsync. The old translog is deleted.
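The interplay between refresh, flush, and the translog can be sketched with another toy model. Everything here is an assumption made for illustration: `ToyShardWithTranslog` and its methods are hypothetical, the translog is assumed to survive a crash, and real segments and commit points are far more involved.

```python
class ToyShardWithTranslog:
    """Toy model of refresh, flush, and translog replay (hypothetical names)."""

    def __init__(self):
        self.buffer = []              # in-memory indexing buffer
        self.cached_segments = []     # searchable, but only in the filesystem cache
        self.committed_segments = []  # fsync'ed to disk at the last commit point
        self.translog = []            # operations recorded since the last commit

    def index(self, doc):
        # Every write goes to the buffer and is appended to the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # Make buffered documents searchable without committing them.
        if self.buffer:
            self.cached_segments.append(list(self.buffer))
            self.buffer = []

    def flush(self):
        # Full commit: persist all segments, then start a new, empty translog.
        self.refresh()
        self.committed_segments.extend(self.cached_segments)
        self.cached_segments = []
        self.translog = []

    def crash_and_recover(self):
        # Anything not committed is lost; replay the (assumed durable) translog.
        self.buffer = []
        self.cached_segments = []
        for doc in self.translog:
            self.buffer.append(doc)
        self.refresh()

    def visible_docs(self):
        docs = []
        for segment in self.committed_segments + self.cached_segments:
            docs.extend(segment)
        return docs


shard = ToyShardWithTranslog()
shard.index("a")
shard.flush()      # "a" is committed; the translog is emptied
shard.index("b")
shard.refresh()    # "b" is searchable but not yet committed
shard.index("c")   # "c" is only in the buffer and the translog
shard.crash_and_recover()
recovered = shard.visible_docs()
print(recovered)
```

In this sketch, "b" and "c" were never committed, yet both come back after the simulated crash because they were still recorded in the translog since the last commit point.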
The translog itself is also durable.
By default, the translog is fsync’ed every 5 seconds and after a write request completes (e.g. index, delete, update, bulk). This process occurs on both the primary and replica shards. Ultimately, that means your client won’t receive a 200 OK response until the entire request has been fsync’ed in the translog of the primary and all replicas.
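To make the cost of "fsync after every write request" concrete, here is a minimal sketch of an append-only log in plain Python. It is not Elasticsearch code: the file name `translog.jsonl`, the JSON-lines format, and the `append_operation` and `read_operations` helpers are all assumptions chosen for illustration. Only `os.fsync` is the real operating-system call that forces cached file data to the storage device.

```python
import json
import os
import tempfile


def append_operation(path, operation, durable=True):
    """Append one operation as a JSON line; fsync before returning if durable."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(operation) + "\n")
        log.flush()                 # push Python's buffer into the filesystem cache
        if durable:
            os.fsync(log.fileno())  # ask the OS to write the cached data to disk


def read_operations(path):
    """Read the log back, as a recovery step would before replaying it."""
    with open(path, encoding="utf-8") as log:
        return [json.loads(line) for line in log]


# Hypothetical log location in a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "translog.jsonl")
append_operation(log_path, {"op": "index", "id": 1})
append_operation(log_path, {"op": "delete", "id": 1})
operations = read_operations(log_path)
print(operations)
```

With `durable=True` the caller does not get control back until the operation has been fsync'ed, which is the same trade-off as acknowledging a request only after the translog is synced. Passing `durable=False` would leave the data in the filesystem cache, which is faster but can lose recent writes on a crash.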
For details, see the chapter: https://www.elastic.co/guide/en/elasticsearch/guide/current/inside-a-shard.html