A First Look at Loki


Grafana Loki is a set of components that can be composed into a fully featured logging stack.

Unlike other logging systems, Loki is built around the idea of only indexing metadata about your logs: labels (just like Prometheus labels). The log data itself is then compressed and stored in chunks in an object store such as S3 or GCS, or even locally on the filesystem. A small index and highly compressed chunks simplify operation and significantly lower Loki's cost.

I. Processing Flow

1.1 Write Path

(Figure: Loki write path diagram)

The rough flow is as follows (a sketch of the corresponding push request follows the list):

  1. The distributor receives an HTTP/1 request to store data for streams.
  2. Each stream is hashed using the hash ring.
  3. The distributor sends each stream to the appropriate ingesters and their replicas (based on the configured replication factor).
  4. Each ingester will create a chunk or append to an existing chunk for the stream's data. A chunk is unique per tenant and per labelset.
  5. The distributor responds with a success code over the HTTP/1 connection.
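
To make step 1 concrete, here is a minimal Go sketch that pushes one log line to the distributor through Loki's HTTP push endpoint (/loki/api/v1/push). The localhost:3100 address and the label set are illustrative assumptions.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// pushRequest mirrors the JSON body accepted by /loki/api/v1/push: a list of
// streams, each carrying a label set and a list of [timestamp, line] pairs.
type pushRequest struct {
	Streams []stream `json:"streams"`
}

type stream struct {
	Stream map[string]string `json:"stream"` // label set, e.g. {job="demo"}
	Values [][2]string       `json:"values"` // [unix nanosecond timestamp, log line]
}

func main() {
	body := pushRequest{
		Streams: []stream{{
			Stream: map[string]string{"job": "demo", "host": "node-1"}, // illustrative labels
			Values: [][2]string{{fmt.Sprintf("%d", time.Now().UnixNano()), "hello from the write path"}},
		}},
	}
	buf, _ := json.Marshal(body)

	// The distributor answers 204 No Content once the write has been acknowledged.
	resp, err := http.Post("http://localhost:3100/loki/api/v1/push", "application/json", bytes.NewReader(buf))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}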

1.2 Read Path

(Figure: Loki read path diagram)

The rough flow is as follows (a sketch of the corresponding query request follows the list):

  1. The querier receives an HTTP/1 request for data.
  2. The querier passes the query to all ingesters for in-memory data.
  3. The ingesters receive the read request and return data matching the query, if any.
  4. The querier lazily loads data from the backing store and runs the query against it if no ingesters returned data.
  5. The querier iterates over all received data and deduplicates, returning a final set of data over the HTTP/1 connection.
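
As a counterpart on the read path, the sketch below issues a range query against the querier's /loki/api/v1/query_range endpoint; the address, label selector, and time range are illustrative.

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Query the last hour of logs for an illustrative label selector.
	end := time.Now()
	start := end.Add(-1 * time.Hour)

	params := url.Values{}
	params.Set("query", `{job="demo"}`)                      // LogQL stream selector (assumed labels)
	params.Set("start", fmt.Sprintf("%d", start.UnixNano())) // nanosecond timestamps
	params.Set("end", fmt.Sprintf("%d", end.UnixNano()))
	params.Set("limit", "100")

	resp, err := http.Get("http://localhost:3100/loki/api/v1/query_range?" + params.Encode())
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := ioutil.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON result: the matching streams and their entries
}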

II. Architecture

1. Multi-tenancy

All data - both in memory and in long-term storage - is partitioned by a tenant ID, pulled from the X-Scope-OrgID HTTP header in the request when Loki is running in multi-tenant mode. When Loki is not in multi-tenant mode, the header is ignored and the tenant ID is set to "fake", which will appear in the index and in stored chunks.

2. Components

2.1 Distributor

The distributor service is responsible for handling logs written by clients. It's essentially the "first stop" in the write path for log data. Once the distributor receives log data, it splits them into batches and sends them to multiple ingesters in parallel.

Distributors communicate with ingesters via gRPC. They are stateless and can be scaled up and down as needed.

Hashing

Distributors use consistent hashing in conjunction with a configurable replication factor to determine which instances of the ingester service should receive log data.

The hash is based on a combination of the log's labels and the tenant ID.

A hash ring stored in Consul is used to achieve consistent hashing; all ingesters register themselves into the hash ring with a set of tokens they own. Each token is a random unsigned 32-bit number. Along with its set of tokens, each ingester registers its state into the hash ring. Ingesters in the JOINING and ACTIVE states may receive write requests, while ingesters in the ACTIVE and LEAVING states may receive read requests. When doing a hash lookup, distributors only use tokens for ingesters who are in the appropriate state for the request. Distributors find the token that most closely matches the value of the log's hash and send the data to that token's owner.

To do the hash lookup, distributors find the smallest appropriate token whose value is larger than the hash of the stream. When the replication factor is larger than 1, the next subsequent tokens (clockwise in the ring) that belong to different ingesters will also be included in the result.

The effect of this hash set up is that each token that an ingester owns is responsible for a range of hashes. If there are three tokens with values 0, 25, and 50, then a hash of 3 would be given to the ingester that owns the token 25; the ingester owning token 25 is responsible for the hash range of 1-25.
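
The following is a simplified sketch of that lookup, assuming the ring is just a sorted slice of tokens with their owners; the real ring additionally filters ingesters by state, as described above.

package main

import (
	"fmt"
	"sort"
)

type tokenEntry struct {
	token uint32
	owner string // ingester ID
}

// lookup returns the owners of the first `replicas` tokens clockwise from the
// stream hash, skipping tokens whose owner has already been picked.
func lookup(ring []tokenEntry, streamHash uint32, replicas int) []string {
	sort.Slice(ring, func(i, j int) bool { return ring[i].token < ring[j].token })

	// Find the smallest token strictly larger than the hash; wrap around if none.
	start := sort.Search(len(ring), func(i int) bool { return ring[i].token > streamHash })

	seen := map[string]bool{}
	var owners []string
	for i := 0; len(owners) < replicas && i < len(ring); i++ {
		e := ring[(start+i)%len(ring)]
		if !seen[e.owner] {
			seen[e.owner] = true
			owners = append(owners, e.owner)
		}
	}
	return owners
}

func main() {
	ring := []tokenEntry{{0, "ingester-a"}, {25, "ingester-b"}, {50, "ingester-c"}}
	// As in the example above: a hash of 3 lands on the owner of token 25.
	fmt.Println(lookup(ring, 3, 2)) // [ingester-b ingester-c]
}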

Quorum consistency

Since all distributors share access to the same hash ring, write requests can be sent to any distributor.

To ensure consistent query results, Loki uses Dynamo-style quorum consistency on reads and writes. This means that the distributor will wait for a positive response of at least one half plus one of the ingesters to send the sample to before responding to the user.

The quorum mechanism is a voting algorithm commonly used in distributed systems to guarantee data redundancy and eventual consistency; its core mathematical idea comes from the pigeonhole principle.

Quorum mechanism: https://zh.wikipedia.org/wiki/Quorum_(%E5%88%86%E5%B8%83%E5%BC%8F%E7%B3%BB%E7%BB%9F)
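
A small sketch of the quorum rule the distributor applies: with a replication factor of 3 it waits for floor(3/2)+1 = 2 positive responses before acknowledging the write.

package main

import "fmt"

// quorum returns the minimum number of successful ingester responses the
// distributor waits for before acknowledging the client: half plus one.
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

// writeSucceeded reports whether enough replicas acknowledged the write.
func writeSucceeded(acks, replicationFactor int) bool {
	return acks >= quorum(replicationFactor)
}

func main() {
	fmt.Println(quorum(3))            // 2: two of three ingesters must respond
	fmt.Println(writeSucceeded(1, 3)) // false: the write is reported as failed
	fmt.Println(writeSucceeded(2, 3)) // true
}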

2.2 Ingester

The ingester service is responsible for writing log data to long-term storage backends (DynamoDB, S3, Cassandra, etc.) on the write path and returning log data for in-memory queries on the read path.

Ingesters contain a lifecycler which manages the lifecycle of an ingester in the hash ring.

Each ingester has a state of either PENDING, JOINING, ACTIVE, LEAVING, or UNHEALTHY:

  1. PENDING is an Ingester's state when it is waiting for a handoff from another ingester that is LEAVING.
  2. JOINING is an Ingester's state when it is currently inserting its tokens into the ring and initializing itself. It may receive write requests for tokens it owns.
  3. ACTIVE is an Ingester's state when it is fully initialized. It may receive both write and read requests for tokens it owns.
  4. LEAVING is an Ingester's state when it is shutting down. It may receive read requests for data it still has in memory.
  5. UNHEALTHY is an Ingester's state when it has failed to heartbeat to Consul. UNHEALTHY is set by the distributor when it periodically checks the ring.

Each log stream that an ingester receives is built up into a set of many "chunks" in memory and flushed to the backing storage backend at a configurable interval.

Chunks are compressed and marked as read-only when:

  1. The current chunk has reached capacity (a configurable value).
  2. Too much time has passed without the current chunk being updated.
  3. A flush occurs.

Whenever a chunk is compressed and marked as read-only, a writable chunk takes its place.
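
A sketch of that cut-over decision follows, with illustrative thresholds standing in for Loki's configurable chunk capacity and idle period.

package main

import (
	"fmt"
	"time"
)

// chunk is a toy in-memory chunk: just its current size and last-append time.
type chunk struct {
	bytes      int
	lastUpdate time.Time
}

const (
	maxChunkBytes = 1 << 20          // illustrative capacity threshold (1 MiB)
	maxIdle       = 30 * time.Minute // illustrative idle threshold
)

// shouldCut reports whether the current chunk must be sealed (compressed and
// marked read-only) so that a fresh writable chunk can take its place.
func shouldCut(c chunk, now time.Time, flushing bool) bool {
	switch {
	case c.bytes >= maxChunkBytes: // 1. the chunk has reached capacity
		return true
	case now.Sub(c.lastUpdate) > maxIdle: // 2. the chunk has been idle too long
		return true
	case flushing: // 3. a flush occurs
		return true
	}
	return false
}

func main() {
	c := chunk{bytes: 512, lastUpdate: time.Now().Add(-time.Hour)}
	fmt.Println(shouldCut(c, time.Now(), false)) // true: idle for more than 30 minutes
}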


If an ingester process crashes or exits abruptly, all the data that has not yet been flushed will be lost. Loki is usually configured to keep multiple replicas (usually 3) of each log to mitigate this risk.

When a flush occurs to a persistent storage provider, the chunk is hashed based on its tenant, labels, and contents. This means that multiple ingesters with the same copy of data will not write the same data to the backing store twice, but if any write failed to one of the replicas, multiple differing chunk objects will be created in the backing store.
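
A conceptual sketch of that idempotent write: the object key is derived from tenant, label set, and chunk contents, so identical replicas map to the same key. Loki's actual key layout is more elaborate; the scheme below is only illustrative.

package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
	"strings"
)

// chunkKey derives a deterministic object-store key from the tenant, the
// label set, and the chunk bytes. Two ingesters holding identical replicas of
// a chunk therefore compute the same key and do not store it twice.
func chunkKey(tenant string, labels map[string]string, contents []byte) string {
	// Serialize labels in a stable order so equal label sets hash equally.
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	var sb strings.Builder
	sb.WriteString(tenant)
	for _, name := range names {
		sb.WriteString("," + name + "=" + labels[name])
	}

	h := sha256.New()
	h.Write([]byte(sb.String()))
	h.Write(contents)
	return fmt.Sprintf("%s/%x", tenant, h.Sum(nil))
}

func main() {
	labels := map[string]string{"job": "demo", "host": "node-1"}
	fmt.Println(chunkKey("fake", labels, []byte("compressed chunk bytes")))
}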

See Querier for how data is deduplicated.

The ingesters validate that the timestamps of the log lines they receive maintain strict ordering. See the Loki Overview for detailed documentation on the rules of timestamp ordering.


Handoff

By default, when an ingester is shutting down and tries to leave the hash ring, it will wait to see if a new ingester tries to enter before flushing and will try to initiate a handoff. The handoff will transfer all of the tokens and in-memory chunks owned by the leaving ingester to the new ingester.

Before joining the hash ring, ingesters will wait in PENDING state for a handoff to occur. After a configurable timeout, ingesters in the PENDING state that have not received a transfer will join the ring normally, inserting a new set of tokens.

This process is used to avoid flushing all chunks when shutting down, which is a slow process.

2.3 Query frontend

The query frontend is an optional service providing the querier’s API endpoints and can be used to accelerate the read path. When the query frontend is in place, incoming query requests should be directed to the query frontend instead of the queriers. The querier service will be still required within the cluster, in order to execute the actual queries.

The query frontend internally performs some query adjustments and holds queries in an internal queue. In this setup, queriers act as workers which pull jobs from the queue, execute them, and return them to the query-frontend for aggregation. Queriers need to be configured with the query frontend address (via the -querier.frontend-address CLI flag) in order to allow them to connect to the query frontends.

Query frontends are stateless. However, due to how the internal queue works, it’s recommended to run a few query frontend replicas to reap the benefit of fair scheduling. Two replicas should suffice in most cases.

2.3.1 Queueing

The query frontend queuing mechanism is used to:

  • Ensure that large queries, that could cause an out-of-memory (OOM) error in the querier, will be retried on failure. This allows administrators to under-provision memory for queries, or optimistically run more small queries in parallel, which helps to reduce the TCO.
  • Prevent multiple large requests from being convoyed on a single querier by distributing them across all queriers using a first-in/first-out queue (FIFO).
  • Prevent a single tenant from denial-of-service-ing (DOSing) other tenants by fairly scheduling queries between tenants.

2.3.2 Splitting

The query frontend splits larger queries into multiple smaller queries, executing these queries in parallel on downstream queriers and stitching the results back together again.

This prevents large (multi-day, etc.) queries from causing out-of-memory issues in a single querier and helps to execute them faster.
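
A sketch of that splitting step, assuming queries are broken up along a fixed interval; the interval is configurable in the query frontend, and 24 hours is used here purely for illustration.

package main

import (
	"fmt"
	"time"
)

type timeRange struct {
	start, end time.Time
}

// splitByInterval cuts a [start, end) query range into interval-aligned
// sub-ranges that can be executed in parallel by downstream queriers.
func splitByInterval(start, end time.Time, interval time.Duration) []timeRange {
	var out []timeRange
	for cur := start; cur.Before(end); {
		next := cur.Truncate(interval).Add(interval) // align to the interval boundary
		if next.After(end) {
			next = end
		}
		out = append(out, timeRange{cur, next})
		cur = next
	}
	return out
}

func main() {
	start := time.Date(2020, 7, 17, 6, 0, 0, 0, time.UTC)
	end := time.Date(2020, 7, 19, 12, 0, 0, 0, time.UTC)
	for _, r := range splitByInterval(start, end, 24*time.Hour) {
		fmt.Println(r.start, "->", r.end) // three sub-queries for this multi-day range
	}
}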

2.3.3 Caching
2.3.3.1 Metric Queries

The query frontend supports caching metric query results and reuses them on subsequent queries. If the cached results are incomplete, the query frontend calculates the required subqueries and executes them in parallel on downstream queriers. The query frontend can optionally align queries with their step parameter to improve the cacheability of the query results. The result cache is compatible with any loki caching backend (currently memcached, redis, and an in-memory cache).

2.3.3.2 Log Queries - Coming soon!

Caching of log (filter, regexp) queries is under active development.

2.4 Querier

The querier service handles queries using the LogQL query language, fetching logs both from the ingesters and long-term storage.

Queriers query all ingesters for in-memory data before falling back to running the same query against the backend store. Because of the replication factor, it is possible that the querier may receive duplicate data. To resolve this, the querier internally deduplicates data that has the same nanosecond timestamp, label set, and log message.
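
A sketch of that deduplication, treating each entry as a (timestamp, label set, line) triple and keeping only the first occurrence:

package main

import "fmt"

// entry is one log line as seen by the querier.
type entry struct {
	tsNanos int64  // nanosecond timestamp
	labels  string // canonical label-set string, e.g. `{host="node-1", job="demo"}`
	line    string
}

// dedupe drops entries whose timestamp, label set, and log message all match
// an entry already seen (replicas of the same stream produce such duplicates).
func dedupe(entries []entry) []entry {
	seen := map[entry]bool{}
	var out []entry
	for _, e := range entries {
		if !seen[e] {
			seen[e] = true
			out = append(out, e)
		}
	}
	return out
}

func main() {
	in := []entry{
		{1, `{job="demo"}`, "hello"},
		{1, `{job="demo"}`, "hello"}, // duplicate from a second replica
		{2, `{job="demo"}`, "world"},
	}
	fmt.Println(len(dedupe(in))) // 2
}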

2.5 Chunk Format

  -------------------------------------------------------------------
  |                               |                                 |
  |        MagicNumber(4b)        |           version(1b)           |
  |                               |                                 |
  -------------------------------------------------------------------
  |         block-1 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-2 bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |         block-n bytes         |          checksum (4b)          |
  -------------------------------------------------------------------
  |                        #blocks (uvarint)                        |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  | #entries(uvarint) | mint, maxt (varint) | offset, len (uvarint) |
  -------------------------------------------------------------------
  |                      checksum(from #blocks)                     |
  -------------------------------------------------------------------
  |                    #blocks section byte offset                  |
  -------------------------------------------------------------------

mint and maxt describe the minimum and maximum Unix nanosecond timestamp, respectively.

2.5.1 Block Format

A block is comprised of a series of entries, each of which is an individual log line.

Note that the bytes of a block are stored compressed using Gzip. The following is their form when uncompressed:

  -------------------------------------------------------------------
  |    ts (varint)    |     len (uvarint)    |     log-1 bytes      |
  -------------------------------------------------------------------
  |    ts (varint)    |     len (uvarint)    |     log-2 bytes      |
  -------------------------------------------------------------------
  |    ts (varint)    |     len (uvarint)    |     log-3 bytes      |
  -------------------------------------------------------------------
  |    ts (varint)    |     len (uvarint)    |     log-n bytes      |
  -------------------------------------------------------------------

ts is the Unix nanosecond timestamp of the logs, while len is the length in bytes of the log entry.
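
A sketch of decoding one such block after gunzipping it, using the varint/uvarint helpers in Go's encoding/binary package. The toy block is built by hand and error handling is trimmed for brevity.

package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"io/ioutil"
	"time"
)

// decodeBlock walks an uncompressed block: repeated (ts varint, len uvarint, log bytes).
func decodeBlock(b []byte) {
	for len(b) > 0 {
		ts, n := binary.Varint(b) // nanosecond timestamp
		b = b[n:]
		l, n := binary.Uvarint(b) // length of the log line in bytes
		b = b[n:]
		line := string(b[:l])
		b = b[l:]
		fmt.Println(time.Unix(0, ts).UTC(), line)
	}
}

func main() {
	// In a real chunk the block bytes come from the chunk file; here we build
	// a tiny gzipped block by hand just to exercise the decoder.
	var raw bytes.Buffer
	tmp := make([]byte, binary.MaxVarintLen64)
	for _, e := range []struct {
		ts   int64
		line string
	}{
		{time.Date(2020, 7, 19, 15, 12, 0, 0, time.UTC).UnixNano(), "first entry"},
		{time.Date(2020, 7, 19, 15, 12, 1, 0, time.UTC).UnixNano(), "second entry"},
	} {
		raw.Write(tmp[:binary.PutVarint(tmp, e.ts)])
		raw.Write(tmp[:binary.PutUvarint(tmp, uint64(len(e.line)))])
		raw.WriteString(e.line)
	}

	var gz bytes.Buffer
	w := gzip.NewWriter(&gz)
	w.Write(raw.Bytes())
	w.Close()

	r, _ := gzip.NewReader(&gz)
	plain, _ := ioutil.ReadAll(r)
	decodeBlock(plain)
}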

2.6 Chunk Store

The chunk store is Loki's long-term data store, designed to support interactive querying and sustained writing without the need for background maintenance tasks. It consists of an index for the chunks (backed by a store such as DynamoDB, Bigtable, or Cassandra) and a key-value store for the chunk data itself, which can be DynamoDB, Bigtable, Cassandra again, or an object store such as Amazon S3 or GCS.

Unlike the other core components of Loki, the chunk store is not a separate service, job, or process, but rather a library embedded in the two services that need to access Loki data: the ingester and querier.

The chunk store relies on a unified interface to the “NoSQL” stores (DynamoDB, Bigtable, and Cassandra) that can be used to back the chunk store index. This interface assumes that the index is a collection of entries keyed by:

  • A hash key. This is required for all reads and writes.
  • A range key. This is required for writes and can be omitted for reads, which can be queried by prefix or range.

The interface works somewhat differently across the supported databases:

  • DynamoDB supports range and hash keys natively. Index entries are thus modelled directly as DynamoDB entries, with the hash key as the distribution key and the range as the DynamoDB range key.
  • For Bigtable and Cassandra, index entries are modelled as individual column values. The hash key becomes the row key and the range key becomes the column key.

A set of schemas are used to map the matchers and label sets used on reads and writes to the chunk store into appropriate operations on the index. Schemas have been added as Loki has evolved, mainly in an attempt to better load balance writes and improve query performance.
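
A minimal sketch of that index abstraction, assuming it boils down to hash-key/range-key entries plus a prefix query on reads. The names and the in-memory backend below are illustrative, not Loki's actual Go interfaces.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// IndexEntry is one row in the chunk-store index.
type IndexEntry struct {
	HashKey  string // required for all reads and writes
	RangeKey string // required for writes; reads may query by prefix or range
	Value    []byte // e.g. a chunk ID
}

// IndexClient is the unified contract the chunk store expects from its
// backing "NoSQL" stores (DynamoDB, Bigtable, Cassandra, ...).
type IndexClient interface {
	Write(entries []IndexEntry) error
	QueryPrefix(hashKey, rangeKeyPrefix string) ([]IndexEntry, error)
}

// memIndex is an in-memory stand-in used only to illustrate the contract.
type memIndex struct{ rows map[string][]IndexEntry }

func (m *memIndex) Write(entries []IndexEntry) error {
	for _, e := range entries {
		m.rows[e.HashKey] = append(m.rows[e.HashKey], e)
	}
	return nil
}

func (m *memIndex) QueryPrefix(hashKey, prefix string) ([]IndexEntry, error) {
	var out []IndexEntry
	for _, e := range m.rows[hashKey] {
		if strings.HasPrefix(e.RangeKey, prefix) {
			out = append(out, e)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].RangeKey < out[j].RangeKey })
	return out, nil
}

func main() {
	idx := &memIndex{rows: map[string][]IndexEntry{}}
	idx.Write([]IndexEntry{
		{HashKey: "fake:d18500:job=demo", RangeKey: "chunk-0001", Value: []byte("c1")},
		{HashKey: "fake:d18500:job=demo", RangeKey: "chunk-0002", Value: []byte("c2")},
	})
	hits, _ := idx.QueryPrefix("fake:d18500:job=demo", "chunk-")
	fmt.Println(len(hits)) // 2
}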

III. Comparison with Alternatives

The EFK (Elasticsearch, Fluentd, Kibana) stack is used to ingest, visualize, and query logs from a variety of sources.

Storage:

Data in Elasticsearch is stored on disk as unstructured JSON objects. Both the keys of each object and the contents of each key are indexed. The data can then be queried using JSON objects that define a query, with a Lucene-based query language known as the Query DSL.

In contrast, Loki in single-binary mode can store data on disk, but in its horizontally scalable mode data is stored in a cloud storage system such as S3, GCS, or Cassandra. Logs are stored in plaintext and tagged with a set of label names and values, where only the label pairs are indexed. This trade-off makes Loki cheaper to operate than a full index and allows developers to log aggressively from their applications. Logs in Loki are queried with LogQL. Because of this design trade-off, however, LogQL queries that filter on content (that is, the text within the log lines) must load all chunks within the search window that match the labels defined in the query.

Log collection:

Fluentd is typically used to collect logs and forward them to Elasticsearch. Fluentd is referred to as a data collector: it can ingest logs from many sources, process them, and forward them to one or more destinations.

In contrast, Promtail's use case is tailored specifically to Loki. Its main mode of operation is to discover log files stored on disk and forward them, associated with a set of labels, to Loki. Promtail can perform service discovery for Kubernetes pods running on the same node as Promtail, act as a container sidecar or a Docker logging driver, read logs from specified folders, and tail the systemd journal.

Loki's way of representing logs with a set of label pairs is similar to how Prometheus represents metrics. When deployed alongside Prometheus in an environment, Promtail's logs usually end up with the same labels as the application's metrics, thanks to the shared service discovery mechanisms. Having logs and metrics with matching labels lets users seamlessly switch context between metrics and logs, which helps with root cause analysis.

Visualization:

Kibana is used to visualize and search Elasticsearch data, and it is very powerful for analytics on that data. Kibana provides many visualization tools for data analysis, such as location maps, machine learning for anomaly detection, and graphs for discovering relationships in data. Alerts can be configured to notify users when something unexpected occurs.

In contrast, Grafana is tailored specifically to time series data from sources such as Prometheus and Loki. Dashboards can be set up to visualize metrics (log support coming soon), and an Explore view can be used for ad hoc queries against the data. Like Kibana, Grafana supports alerting based on your metrics.

Loki's first goal is to minimize the cost of switching between metrics and logs, which helps shorten incident response times and improves the user experience.

Loki's second goal is to strike a balance between ease of use and expressive power in its query language.

Loki's third goal is to provide a more cost-effective solution.

IV. Installation

Install a single-node deployment with docker-compose:

vi docker-compose.yaml

version: "3"

networks:
  loki:

services:
  loki:
    image: grafana/loki:1.5.0
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    networks:
      - loki

  promtail:
    image: grafana/promtail:1.5.0
    volumes:
      - /var/log:/var/log
    command: -config.file=/etc/promtail/docker-config.yaml
    networks:
      - loki

  grafana:
    image: grafana/grafana:master
    ports:
      - "3000:3000"
    networks:
      - loki

Start the stack:
docker-compose up -d
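
Once the containers are up, Loki should report readiness at http://localhost:3100/ready, and Grafana should be reachable at http://localhost:3000 (default login admin/admin), where Loki can be added as a data source using the URL http://loki:3100 (the service name resolves inside the compose network).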

V. References

https://blog.csdn.net/qq_42046105/article/details/107328512

https://github.com/grafana/loki/blob/master/docs/overview/README.md

https://github.com/grafana/loki/blob/master/docs/architecture.md
