Learning to Build an Autocomplete System

This is part of the series Learning System Design

Credits for the original implementation go to Pedro Lopes.

What is an Autocomplete System?

It is a feature that phones or browsers have. When a user starts typing a sentence, they are given a short list of recommended phrases. While this feature may look simple or uninteresting, making it scale to millions of users with minimal latency is a real challenge.

After looking over the source code of the project, I told myself that it is a lot to digest. I just wasn't able to follow all the connections. I was able to follow some of the links from the Gateway to the Collector Backend and to the Assembler Frontend, but I was lost in understanding how the phrases ended up in the Trie Builder or how some of the shell scripts were interconnected.

However, I was still intrigued by how somebody could code and glue together such a system. Without planning too much, I decided to take the following steps.

One important thing to mention is that I didn't want to reproduce the original implementation exactly, nor to make it better. As a consequence, some of the choices in the steps below may look superficial, but this is only because I wanted to learn how to glue every component together.

Step 1. Remove everything and start from scratch.

I decided to start simple. What can be removed from the original diagram?

Well, almost everything can be removed. So this is the result, and the code is on GitHub.

[Diagram: the monolith]

The system is now a simple monolith. It is inefficient and unreliable.

Every time top phrases are requested, the app reads the trie from disk and computes the results for a prefix.

Every time a search is requested, the app constructs a new trie and saves it to disk.
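The core data structure behind both operations can be sketched like this. This is a simplified in-memory trie with phrase counts cached at every prefix node, not the original repository's implementation; the disk (de)serialization step is omitted.

```python
import heapq
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}
        # phrase -> how many times it was searched; cached at every
        # prefix node so lookups don't need to traverse subtrees.
        self.counts = defaultdict(int)

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase):
        # Walk character by character, bumping the phrase's count
        # at the root and at every prefix node along the way.
        node = self.root
        node.counts[phrase] += 1
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.counts[phrase] += 1

    def top_phrases(self, prefix, k=3):
        # Follow the prefix down the trie, then pick the k most
        # frequent phrases recorded at that node.
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return heapq.nlargest(k, node.counts, key=node.counts.get)
```

For example, after inserting "how to cook" twice and "how to code" once, `top_phrases("how to c")` returns `["how to cook", "how to code"]`. Caching full phrase counts at every node trades memory for lookup speed; it is one of many possible trade-offs here.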

Step 2. Break the monolith into smaller pieces

What can be done better? Instead of a monolith, the system can be composed of three services, as in the next diagram. Here is the code for this version.

[Diagram: the system split into three services]

Nothing is improved in terms of efficiency. Data is still saved on the local file system, and the services share a local volume. With every /search request, the app still constructs a new trie, and with every /top-phrases request, it still reads the trie from disk and computes the result for a prefix.

The services are now grouped using Docker, which will make it easier to add new components.
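A grouping like this might be described with a docker-compose file along the following lines. This is a hypothetical sketch: the service names, build paths, and port are illustrative, not the original repo's.

```yaml
# Hypothetical sketch: three services sharing one local volume.
version: "3"
services:
  gateway:
    build: ./gateway
    ports: ["8080:8080"]
    volumes: ["shared-data:/data"]
  collector:
    build: ./collector
    volumes: ["shared-data:/data"]
  distributor:
    build: ./distributor
    volumes: ["shared-data:/data"]
volumes:
  shared-data:
```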

Step 3. Add a Trie Builder service

In the versions above, one very expensive and redundant operation sits in the Assembler Collector, which constructs a new trie with every /search request. Instead, a Trie Builder component can be introduced to build the trie at fixed intervals.

The component can be a service that is called explicitly, or a simple script that checks the file system for new phrases at regular intervals. I decided to make it a service, because that follows the nice story toward building the original implementation.

[Diagram: the system with a Trie Builder service]

With this architecture, the problem of building the trie every time a new phrase is submitted is solved.

The Collector Backend is still responsible for deciding whether a new trie should be constructed. It does that by listing the file system and comparing the current phrase's timestamp with the timestamp of the last file. Each file contains the phrases for a 30-minute (or, in the toy version, 30-second) interval. When the Collector Backend detects that the current phrase belongs to a new window, it sends a signal to the Trie Builder to build the trie from the available files.
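The window logic above can be sketched in a few lines. The file-naming scheme (`phrases_<window start>.txt`) is an assumption for illustration, not necessarily the one the original code uses.

```python
WINDOW_SECONDS = 30 * 60  # 30-minute windows (a toy version could use 30 seconds)

def window_start(ts, window=WINDOW_SECONDS):
    """Map a Unix timestamp to the start of its fixed window."""
    return int(ts) - int(ts) % window

def phrase_file_name(ts, window=WINDOW_SECONDS):
    """Phrases are appended to a file named after their window start
    (hypothetical naming scheme)."""
    return f"phrases_{window_start(ts, window)}.txt"

def should_trigger_build(current_ts, last_file_ts, window=WINDOW_SECONDS):
    """A phrase falling into a new window means the previous file is
    complete, so a trie can be built from the files written so far."""
    return window_start(current_ts, window) > window_start(last_file_ts, window)
```

For example, with 30-minute windows, timestamps 1799 and 1800 fall into different windows, so a phrase arriving at 1800 triggers a build.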

One major problem that remains in this implementation is that the Distributor Backend still loads the trie with every request.

Here is the code for the new version.

Step 4. Quick solution for the Distributor Backend

So far, the system has been saving each phrase into timestamp-named files, and every new trie was saved by overwriting the previous one. Instead of overwriting the previous trie file, every new trie now carries the name of the last phrase file, just for the purpose of distinguishing them.

Now, the Distributor Backend can list the available trie files and load the trie from disk only if a new trie is available. Here is the code for this version.

Step 5. Add signaling between the Trie Builder and the Distributor

Instead of listing the available trie files, the Distributor Backend can reload the trie file when it receives a signal from the Trie Builder, sent once the new trie has been built and saved. Here is the code.

[Diagram: signaling between the Trie Builder and the Distributor Backend]

This is fine, but one problem that can arise from this way of communicating is that resources and events are not separated. This can lead to mixed responsibilities and can slow down development.

Step 6. Add Zookeeper

[Diagram: the system with Zookeeper]

To solve the communication problem described above, Zookeeper was introduced into the system. Now that Zookeeper is in place, the Collector Backend no longer makes any calls to the Trie Builder; it simply dumps the phrases into files on the file system.

To trigger the Trie Builder, a new Tasks component has been added. It is implemented as a simple script that selects a set of phrase files from the disk and creates a so-called "target" that the Trie Builder should build. The Tasks component notifies the Trie Builder via Zookeeper.

Once the Trie Builder finishes building, it notifies the Backend via Zookeeper to load a particular trie file.
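The event flow (Tasks → Trie Builder → Distributor Backend) can be illustrated with plain queues standing in for Zookeeper watches. This is only an in-process analogy: the real system uses Zookeeper znodes and watches, and the target name and file names below are made up.

```python
import queue
import threading

# Toy stand-ins for Zookeeper znode watches: each component blocks on
# a queue and reacts when an event arrives.
build_requests = queue.Queue()   # Tasks -> Trie Builder
trie_ready = queue.Queue()       # Trie Builder -> Distributor Backend

def tasks_component(phrase_files):
    # Select a set of phrase files and publish a build "target".
    build_requests.put({"target": "trie_0001", "files": phrase_files})

def trie_builder():
    target = build_requests.get()
    # ... build the trie from target["files"] and save it ...
    trie_ready.put(target["target"])  # announce which trie file to load

def distributor_backend(loaded):
    loaded.append(trie_ready.get())   # reload the announced trie file

loaded = []
builder = threading.Thread(target=trie_builder)
backend = threading.Thread(target=distributor_backend, args=(loaded,))
builder.start()
backend.start()
tasks_component(["phrases_0.txt", "phrases_1800.txt"])
builder.join()
backend.join()
# loaded is now ["trie_0001"]
```

The point of the indirection is the same as with Zookeeper: no component calls another directly; each only reacts to events.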

To set up Zookeeper properly, the "setup" target in the Makefile must be executed, and the "do_tasks" target must be executed to trigger the Trie Builder.

Code is here

Step 7. Add HDFS and a Message Broker

Until now, the data was saved to and loaded from the local disk using shared volumes between Docker containers. A better way to do this is to introduce HDFS as storage. At the same time, a message broker is introduced: it receives phrases from the Collector and sends them to Kafka Connect, which in turn saves them to HDFS.

[Diagram: the system with HDFS and a message broker]

After HDFS was set up, the following snippets represent the information flow from the moment a phrase is received in the Collector until it is dumped on HDFS.

Instead of interacting with the file system, do_tasks.sh now interacts with HDFS. After do_tasks.sh has finished creating a target, the Trie Builder is triggered and reads the files for the specified target from HDFS.

Code here

Final thoughts

It took me about three days to understand all of this, and I will probably stop here for now. The components missing from the diagram compared to the original implementation are the Load Balancers, Caches, Partitioning, and MapReduce. To me, these were less important, as I was more interested in the core architecture. Maybe I will allocate some time later to look over the Partitioning and MapReduce parts.

Part two, where I add partitioning, load balancers, and caching, is now available.

Translated from: https://medium.com/@iftimiealexandru/learning-to-build-an-autocomplete-system-2c2e9f423537
