IIoT platform databases – how Mail.ru Cloud Solutions handles petabytes of data coming from a multitude of devices

Hello, my name is Andrey Sergeyev and I work as Head of IoT Solution Development at Mail.ru Cloud Solutions. We all know there is no such thing as a universal database, especially when the task is to build an IoT platform capable of processing millions of events from various sensors in near real-time.

Our product Mail.ru IoT Platform started as a Tarantool-based prototype. I’m going to tell you about our journey, the problems we faced, and the solutions we found. I will also show you the current architecture of a modern Industrial Internet of Things platform. In this article we will look into:

  • our requirements for the database, universal solutions, and the CAP theorem
  • whether the “database + application server in one” approach is a silver bullet
  • the evolution of the platform and the databases used in it
  • how many Tarantools we use and how we came to this

Mail.ru IoT Platform today

Our product Mail.ru IoT Platform is a scalable and hardware-independent platform for building Industrial Internet of Things solutions. It enables us to collect data from hundreds of thousands of devices and process this stream in near real-time by using user-defined rules (scripts in Python and Lua), among other tools.

The platform can store an unlimited amount of raw data from the sources. It also has a set of ready-made components for data visualization and analysis as well as built-in tools for predictive analysis and platform-based app development.

Mail.ru IoT Platform set-up

The platform is currently available for on-premise installation at customers’ facilities. In 2020 we are planning to release it as a public cloud service.

Tarantool-based prototype: how we started

Our platform started as a pilot project – a prototype built on a single Tarantool instance. Its primary functions were receiving a data stream from an OPC server, processing the events with Lua scripts in real-time, monitoring key indicators based on that data, and generating events and alerts for upstream systems.

Flowchart of the Tarantool-based prototype

The prototype even proved itself in field conditions at a multi-well pad in Iraq. It worked on an oil platform in the Persian Gulf, monitoring key indicators and sending data to the visualization system and the event log. The pilot was deemed successful, but then, as it often happens with prototypes, it was put into cold storage until we picked it up again.

Our aims in developing the IoT platform

Along with the prototype, we got ourselves the challenge of creating a fully functional, scalable, and failsafe IoT platform that could later be released as a public cloud service.

We had to build a platform with the following specifications:

  1. Simultaneous connection of hundreds of thousands of devices
  2. Receiving millions of events every second
  3. Datastream processing in near real-time
  4. Storing several years of raw data
  5. Analytics tools for both streaming and historical data
  6. Support for deployment in multiple data centers to maximize disaster tolerance

Pros and cons of the platform prototype

At the start of active development the prototype had the following structure:

  • Tarantool used as a database + application server
  • all the data stored in Tarantool’s memory
  • a Lua app in this Tarantool that received and processed the data and called the user scripts on incoming data

This type of app structure has its advantages (a minimal sketch follows the list):
  1. The code and the data are stored in one place, which lets us manipulate the data right in the application memory and avoid the extra network round-trips typical for traditional apps
  2. Tarantool uses a JIT (Just-In-Time) compiler for Lua. It compiles Lua code into machine code, allowing simple Lua scripts to execute at C-like speed (40,000 RPS per core and even higher!)
  3. Tarantool is based upon cooperative multitasking. Every stored procedure call runs in its own coroutine-like fiber, which gives a further performance boost for tasks with I/O operations, e.g. network calls
  4. Efficient use of resources: tools capable of handling 40,000 RPS per core are quite rare
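
To make the “database + application server in one” idea concrete, here is a minimal sketch of what such a stored procedure could look like. The space layout, the process_event entry point, and the threshold rule are illustrative assumptions, not the platform’s actual code:

    -- Hypothetical sketch of the prototype's approach: data and code live in
    -- one Tarantool process, so an event is handled without extra network hops.
    box.cfg{}

    -- In-memory space keeping the latest value per tag.
    box.schema.space.create('tags', {if_not_exists = true})
    box.space.tags:create_index('primary', {parts = {1, 'string'}, if_not_exists = true})

    -- Illustrative user rule: raise an alert when a value crosses a threshold.
    local function user_rule(tag, value)
        if value > 100 then
            return {alert = true, tag = tag, value = value}
        end
    end

    -- Called for every incoming event from the OPC server; each call runs in
    -- its own fiber thanks to Tarantool's cooperative multitasking.
    function process_event(tag, value, ts)
        box.space.tags:replace{tag, value, ts}
        return user_rule(tag, value)
    end
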
There are also significant disadvantages:
  1. We need to store several years of raw data from the devices, but Tarantool keeps everything in memory, and we don’t have hundreds of petabytes of RAM for it
  2. This follows directly from advantage #1. All of the platform code consists of procedures stored in the database, which means that any codebase update is effectively a database update, and that sucks
  3. Dynamic scaling is difficult because the whole system’s performance depends on the memory it uses. Long story short, you can’t just add another Tarantool to increase throughput without setting aside 24 to 32 GB of memory (on start-up, Tarantool allocates all the memory for its data) and resharding the existing data. Besides, with sharding we lose advantage #1: the data and the code may no longer live in the same Tarantool
  4. Performance degrades as the code grows more complex along with the platform. This happens not only because Tarantool executes all the Lua code in a single system thread, but also because LuaJIT falls back to interpreter mode instead of compiling when dealing with complex code
Conclusion: Tarantool is a good choice for creating an MVP, but it doesn’t work for a fully functional, easily maintained, and failsafe IoT platform capable of receiving, processing, and storing data from hundreds of thousands of devices.

Two primary problems that we wanted to solve

First of all, there were two main issues we wanted to sort out:

  1. Ditching the “database + application server” concept. We wanted to update the app code independently of the database.
  2. Simplifying dynamic scaling under load. We wanted easy, independent horizontal scaling of as many functions as possible.

To solve these problems, we took an approach that was innovative and not yet widely tested at the time – a microservice architecture split into Stateless (the applications) and Stateful (the database) parts.

To make maintaining and scaling out the Stateless services even simpler, we containerized them and adopted Kubernetes.

Having figured out the Stateless services, we had to decide what to do with the data.

Basic requirements for the IoT platform database

At first, we tried not to overcomplicate things – we wanted to store all the platform data in one single universal database. Having analyzed our goals, we came up with the following list of requirements for the universal database:

  1. ACID transactions – the clients will keep a register of their devices on the platform, so we wouldn’t want to lose any of them upon data modification
  2. Strict consistency – we have to get the same responses from all of the database nodes
  3. Horizontal scaling for writing and reading – the devices send a huge stream of data that has to be processed and saved in near real-time
  4. Fault tolerance – the platform has to be capable of handling data across multiple data centers to maximize fault tolerance
  5. Availability – no one would use a cloud platform that shuts down whenever one of the nodes fails
  6. Storage volume and good compression – we have to store several years (petabytes!) of raw data, and it needs to be compressed
  7. Performance – quick access to raw data and to stream analytics tools, including access from the user scripts (tens of thousands of read requests per second!)
  8. SQL – we want to let our clients run analytical queries in a familiar language

Checking our requirements against the CAP theorem

Before examining all the available databases to see if they met our requirements, we decided to check whether the requirements themselves were reasonable, using a well-known tool – the CAP theorem.

The CAP theorem states that a distributed system cannot simultaneously have more than two of the following qualities:

  1. Consistency – the data in all of the nodes is free of contradictions at any point in time
  2. Availability – any request to the distributed system results in a correct response, though without a guarantee that the responses of all system nodes match
  3. Partition tolerance – even when the nodes lose connectivity to each other, they continue working independently

For instance, a Master-Slave PostgreSQL cluster with synchronous replication is a classic example of a CA system, and Cassandra is a classic AP system.

Let’s get back to our requirements and classify them with the CAP theorem:

  1. ACID transactions and strict (or at least not eventual) consistency are C.
  2. Horizontal scaling for writing and reading plus availability is A (multi-master).
  3. Fault tolerance is P: if one data center goes down, the system should keep running.
Conclusion: the universal database we require has to offer all of the CAP theorem qualities, which means that none of the existing databases can fulfill all of our needs.

Choosing the database based on the data the IoT platform works with

Being unable to pick a universal database, we decided to split the data into two types and choose the most suitable database for each.

As a first approximation, we subdivided the data into two types:

  1. Metadata – the world model, the devices, the rules, the settings: practically all the data except what comes from the end devices
  2. Raw data from the devices – sensor readings, telemetry, and technical information from the devices; these are time series of messages containing a value and a timestamp

Choosing the database for the metadata

Our requirements

Metadata is inherently relational. This data is typically small in volume and rarely modified, but it is quite important. We can’t afford to lose it, so consistency matters – at least in terms of asynchronous replication – as do ACID transactions and horizontal read scaling.

Since this data is comparatively small in volume and changes rather infrequently, you can ditch horizontal write scaling and put up with the database possibly being unavailable in case of failure. That is why, in the language of the CAP theorem, we need a CA system.

What usually works. With the question put like this, any classic relational database with asynchronous replication cluster support would do, e.g. PostgreSQL or MySQL.

Our platform aspects. We also needed support for trees with specific requirements. The prototype had a feature borrowed from RTDB-class systems (real-time databases) – modeling the world using a tag tree. Tag trees enable us to combine all the client devices in one tree structure, which makes managing and displaying a large number of devices much easier.
This is what the device tree looks like

This tree enables linking the end devices with their environment. For example, we can put devices physically located in the same room into one subtree, which makes it easier to work with them later. This function is very convenient; besides, we wanted to work with RTDBs in the future, and this functionality is basically an industry standard there.

For a full implementation of tag trees, a candidate database must meet the following requirements:

  1. Support for trees with arbitrary width and depth.
  2. Modification of tree elements in ACID transactions.
  3. High performance when traversing a tree.

Classic relational databases can handle small trees quite well, but they don’t do as well with arbitrary trees.

Possible solution. Using two databases: a graph database for the tree and a relational one for all the other metadata.

This approach has major disadvantages:

  1. To ensure consistency between the two databases, you need to add an external transaction coordinator.
  2. This design is difficult to maintain and not very reliable.
  3. As a result, we get two databases instead of one, while the graph database is only needed to support limited functionality.
A possible, but not a perfect solution with two databases

Our solution for storing metadata. We thought a little longer and remembered that this functionality was originally implemented in the Tarantool-based prototype, and it had turned out very well.

Before we continue, I would like to give an unorthodox definition of Tarantool: Tarantool is not a database, but a set of primitives for building a database for your specific case.

Available primitives out of the box:

  • Spaces – an equivalent of tables for storing data in the database.
  • Full-fledged ACID transactions.
  • Asynchronous replication using WAL logs.
  • A sharding tool that supports automatic resharding.
  • Ultrafast LuaJIT for stored procedures.
  • A large standard library.
  • The LuaRocks package manager with even more packages.

Our CA solution was a relational + graph database built on Tarantool. We assembled perfect metadata storage out of Tarantool primitives:

  • Spaces for storage.
  • ACID transactions – already in place.
  • Asynchronous replication – already in place.
  • Relations – built upon stored procedures.
  • Trees – built upon stored procedures too (sketched below).
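
Here is a rough sketch of how trees can be built from these primitives: a space for nodes, a secondary index on the parent id, and stored procedures for traversal and modification. The schema and the names (tree_nodes, subtree, move_node) are our illustrative assumptions, not the platform’s actual implementation:

    box.cfg{}

    -- Node tuples: {id, parent_id, name}; the layout is an assumption.
    box.schema.space.create('tree_nodes', {if_not_exists = true})
    box.space.tree_nodes:create_index('primary', {parts = {1, 'unsigned'}, if_not_exists = true})
    -- A non-unique secondary index on parent_id makes child lookups cheap.
    box.space.tree_nodes:create_index('parent', {parts = {2, 'unsigned'}, unique = false, if_not_exists = true})

    -- Depth-first traversal of a subtree, entirely in memory.
    function subtree(root_id)
        local result = {}
        local function walk(id)
            table.insert(result, box.space.tree_nodes:get(id))
            for _, child in box.space.tree_nodes.index.parent:pairs(id) do
                walk(child[1])
            end
        end
        walk(root_id)
        return result
    end

    -- Re-parenting a node inside a transaction; several tree elements could
    -- be modified between begin/commit without the tree ever being seen
    -- half-modified.
    function move_node(id, new_parent_id)
        box.begin()
        box.space.tree_nodes:update(id, {{'=', 2, new_parent_id}})
        box.commit()
    end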

Our cluster installation is classic for systems like these – one Master for writing and several Slaves with asynchronous replication for read scaling.
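
For reference, a sketch of how such a read replica could be configured; the URI and credentials are placeholders, not our real topology:

    -- Replica side: a read-only instance pulling asynchronous replication
    -- from the master.
    box.cfg{
        listen = 3302,
        replication = {'replicator:secret@master.example:3301'},
        read_only = true,
    }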

As a result, we have a fast scalable hybrid of relational and graph databases.

One Tarantool instance is able to process thousands of read requests, including those with active tree traversals.

Choosing the database for storing the data from the devices

Our requirements

This type of data is characterized by frequent writes and a large volume: millions of devices, several years of storage, petabytes of incoming messages and stored data. High availability is very important, since the sensor readings drive both the user-defined rules and our internal services.

It is important that the database offers horizontal scaling for reading and writing, availability, and fault tolerance, as well as ready-made analytics tools for working with this data array, preferably SQL-based. We can sacrifice consistency and ACID transactions, so in terms of the CAP theorem we need an AP system.

Additional requirements. We had a few more requirements for the solution that would store these gigantic amounts of data:

  1. Time series – sensor data that we wanted to keep in a specialized database.
  2. Open source – the advantages of open source code are self-explanatory.
  3. Free clustering – clustering available only behind a paid license is a common problem among modern databases.
  4. Good compression – given the amount of data and its homogeneity, we wanted to compress the stored data efficiently.
  5. Proven operation – to minimize risks, we wanted to start with a database that someone was already actively running at loads similar to ours.
Our solution. The only database that suited our requirements was ClickHouse – a columnar time-series database with replication, multi-master mode, sharding, SQL support, and free clustering. Moreover, Mail.ru has many years of successful experience operating one of the largest ClickHouse clusters.

But ClickHouse, however good it may be, didn’t work for us as-is.

Problems with the database for device data and their solutions

Problem with write performance. We immediately ran into a problem with write performance for the large data stream. The data needs to reach the analytical database as soon as possible, so that the rules analyzing the flow of events in real-time can look at the history of a particular device and decide whether to raise an alert.

Solution. ClickHouse copes poorly with numerous single-row inserts, but works well with large packets of data, easily handling batch writes of millions of rows. We decided to buffer the incoming data stream and then insert the buffered data in batches, as sketched below.
This is how we dealt with poor write performance
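
A minimal sketch of that buffering logic, assuming a hypothetical flush_to_clickhouse() that performs one bulk INSERT; the batch size, the interval, and the client code are illustrative, not taken from the article:

    local fiber = require('fiber')

    local buffer = {}
    local MAX_BATCH = 100000  -- flush when this many events accumulate (illustrative)
    local MAX_WAIT = 1        -- ...or at least once a second (illustrative)

    -- Assumed helper: sends one large batch INSERT to ClickHouse
    -- (client code omitted in this sketch).
    local function flush_to_clickhouse(batch)
    end

    local function flush()
        if #buffer > 0 then
            local batch = buffer
            buffer = {}
            flush_to_clickhouse(batch)
        end
    end

    -- Called for every incoming event.
    function on_event(event)
        table.insert(buffer, event)
        if #buffer >= MAX_BATCH then
            flush()
        end
    end

    -- A background fiber bounds the delivery lag: this is where the few
    -- seconds between arrival and appearance in ClickHouse come from.
    fiber.create(function()
        while true do
            fiber.sleep(MAX_WAIT)
            flush()
        end
    end)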

The write problems were solved, but it cost us several seconds of lag between data entering the system and appearing in our database.

This is critical for various algorithms that react to the sensor readings in real-time.

Problem with read performance. Stream analytics for real-time data processing constantly needs information from the database – tens of thousands of small queries. On average, one ClickHouse node handles only about a hundred analytical queries at any given time, since it was built to infrequently process heavy analytical queries over large amounts of data. Naturally, this is not suitable for computing trends over the data stream from hundreds of thousands of sensors.
ClickHouse doesn’t handle a large number of queries well

Solution. We decided to place a cache in front of ClickHouse. The cache was meant to store the hot data that had been requested most often within the last 24 hours.

24 hours of data is not a year, but still quite a lot – so we need an AP system with horizontal scaling for reading and writing, one focused on performance both when writing single events and when serving numerous reads. We also need high availability, analytics tools for time series, persistence, and built-in TTL.

So, we needed a sort of fast ClickHouse that could store everything in memory. Being unable to find any suitable solution, we decided to build one based on the Tarantool primitives:

  1. Persistence – check (WAL logs + snapshots).
  2. Performance – check; all the data is in memory.
  3. Scaling – check; replication + sharding.
  4. High availability – check.
  5. Analytics tools for time series (grouping, aggregation, etc.) – built upon stored procedures.
  6. TTL – built upon stored procedures with one background fiber (coroutine), as sketched below.
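
A sketch of point 6, the TTL mechanism: one background fiber periodically sweeps the cache and evicts tuples older than 24 hours. The space layout and the sweep interval are assumptions for illustration:

    local fiber = require('fiber')

    box.cfg{}

    -- Hot cache tuples: {device_id, timestamp, value}; the layout is assumed.
    box.schema.space.create('hot_data', {if_not_exists = true})
    box.space.hot_data:create_index('primary', {parts = {1, 'unsigned', 2, 'unsigned'}, if_not_exists = true})

    local TTL = 24 * 3600  -- keep 24 hours of data

    -- Background fiber: evict expired tuples once a minute. A production
    -- sweeper would batch deletions; this is kept short for clarity.
    fiber.create(function()
        while true do
            local deadline = os.time() - TTL
            for _, t in box.space.hot_data:pairs() do
                if t[2] < deadline then
                    box.space.hot_data:delete{t[1], t[2]}
                end
            end
            fiber.sleep(60)
        end
    end)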

The solution turned out to be powerful and easy to use. One instance handled 10,000 read requests per second, including analytical ones.

Here is the architecture we came up with:

Final architecture: ClickHouse as the analytical database and a Tarantool cache storing the last 24 hours of data.

A new type of data – the state and how it’s stored

We had found a specific database for each type of data, but as the platform developed, one more type appeared – the status. The status consists of the current statuses of sensors and devices, as well as some global variables for the stream analytics rules.

Let’s say we have a lightbulb. The light may be either on or off, and we always need access to its current state, including from within the rules. Another example is a variable in stream rules – e.g., a counter of some sort.

This type of data needs frequent writing and fast access but doesn’t take up much space.

The metadata storage doesn’t suit this type of data well, because a status may change quite often and we only have one Master for writing. The device data storage doesn’t work well either: a status may have last been changed three years ago, yet we still need quick read access to it.

This means that the status database needs horizontal scaling for reading and writing, high availability, fault tolerance, and consistency at the value/document level. We can sacrifice global consistency and ACID transactions.

Any key-value or document database would work: a Redis sharded cluster, MongoDB, or, once again, Tarantool.

Tarantool advantages:

  1. It is the most popular way of using Tarantool.
  2. Horizontal scaling – check; asynchronous replication + sharding.
  3. Consistency at the document level – check (see the sketch below).
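
A minimal sketch of the status store: a plain key-value space where replace() gives an atomic per-document update – enough for the lightbulb example above. The schema and function names are illustrative:

    box.cfg{}

    box.schema.space.create('status', {if_not_exists = true})
    box.space.status:create_index('primary', {parts = {1, 'string'}, if_not_exists = true})

    -- Atomic upsert of a single device's state; consistency is only needed
    -- at the level of one document (key), not globally.
    function set_status(device_id, state)
        box.space.status:replace{device_id, state, os.time()}
    end

    function get_status(device_id)
        return box.space.status:get{device_id}
    end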

As a result, we have three Tarantools used in different ways: one stores metadata, one serves as a cache for fast reads of recent device data, and one stores status data.

How to choose a database for your IoT platform

  1. There is no such thing as a universal database.
  2. Each type of data should have its own database – the one that suits it best.
  3. There is a chance you may not find a fitting database on the market.
  4. Tarantool can serve as the basis for a specialized database.

Translated from: https://habr.com/en/company/mailru/blog/514766/
