System Design of a URL Shortening Service

weixin_26728245

Original article: https://towardsdatascience.com/system-design-of-url-shortening-service-b325b18c8f88

System design is one of the most important and feared aspects of software engineering. This opinion comes from my own learning experience in an associate architecture course. When I started my associate architecture course, I had a hard time understanding the idea of designing a system.

One of the main reasons was that the terms used in software architecture books are hard to understand at first, and there are no clear step-by-step guidelines. Everybody seems to have a different approach.

So, I set out to design a system based on what I learned in my architecture courses. This is part of a series on system design for beginners (link is given below). For this one, let’s design a URL shortening service.

On Medium, URLs can be quite long, especially the friend links, so we tend to shorten a URL before sharing an article. Some well-known URL shortening services are TinyURL, bit.ly, goo.gl, rb.gy, etc. We are going to design such a URL shortening service.

★ Definition of the System:

We need to clarify the goal of the system. System design is a vast topic; if we don’t narrow it down to a specific purpose, designing the system becomes complicated, especially for newbies. A URL shortening service provides shorter aliases for long URLs. When users hit a shortened link, they are redirected to the original URL.

Figure: URL shortening service system design (Image by Author)

★ Requirements of the System:

In this segment, we decide the features of the system. What are the requirements we need to focus on? We may divide the system requirements into two parts:

Functional requirements:

The user gives a URL as an input; our service should generate a shorter and unique alias of that URL. When users hit the shorter link, our system should redirect them to the original link. Links may expire after a duration. Users can specify the expiration time. We are not considering custom links by the user here.

These are the requirements that the system has to deliver. They are the main goal of the system.

Non-functional requirements:

Now for the more critical requirements that need to be analyzed. If we don’t fulfill these requirements, it might harm the business plan of the project. So, let’s define our non-functional requirements (NFRs):

The system should be highly available. If the service is down, all the URL redirections will fail. URL redirection should happen in real-time. Nobody should be able to predict the shortened links.

Performance, modifiability, availability, scalability, reliability, etc. are some important quality requirements in system design. These ‘ilities’ are what we need to analyze for a system and determine if our system is designed properly.

In this system, availability is the main quality attribute. Security is another important attribute. Normally, availability and scalability are important attributes in system design. Performance is important by default; nobody wants to build a system with poor performance, right?!

★ How many requests does the system need to handle?

Let’s assume that a user requests one new short URL and then uses it 100 times for redirection. The ratio between writes and reads is then 1:100, so the system is read-heavy.

How many URL requests do we need to handle in the service? Let’s say we get 200 new URL requests per second. Over a month, that is 30 days * 24 hours * 3600 seconds * 200 = ~500M requests.

So, there can be almost 500M new URL shortening requests per month. The redirection requests would then be 500M * 100 = 50 billion per month.

For a yearly count, multiply these numbers by 12.
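
The traffic arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch; the constant names are made up for illustration, and the values are the assumptions stated in the text.

```python
# Back-of-the-envelope traffic estimate using the article's assumed numbers.
WRITES_PER_SECOND = 200             # new short URLs per second (assumption from the text)
READS_PER_WRITE = 100               # 1:100 write-to-read ratio (assumption from the text)
SECONDS_PER_MONTH = 30 * 24 * 3600  # a 30-day month

writes_per_month = WRITES_PER_SECOND * SECONDS_PER_MONTH  # 518,400,000, i.e. ~500M
reads_per_month = writes_per_month * READS_PER_WRITE      # ~50 billion
reads_per_second = WRITES_PER_SECOND * READS_PER_WRITE    # 20,000

print(f"new URLs per month:    {writes_per_month:,}")
print(f"redirections per month: {reads_per_month:,}")
print(f"redirections per second: {reads_per_second:,}")
print(f"new URLs per year:     {writes_per_month * 12:,}")
```

The exact monthly figure is slightly above 500M; the article rounds it down for easier arithmetic.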

★ How much storage do we need?

Let’s assume the system stores every URL shortening request and its shortened link for 5 years. As we expect 500M new URLs every month, the total number of objects we expect to store is 500M * (5 * 12) months = 30B.

Now let’s assume that each stored object will be approximately 100 bytes. We will need total storage of 30 billion * 100 bytes = 3 TB.

We may want to cache some of the popular URLs that are frequently accessed. If we follow the 80–20 rule, we assume that 20% of the URLs generate 80% of the traffic, so we keep that hot 20% in the cache.

Since we have 20K redirection requests/second (200 new URLs per second * 100 reads each), we will be getting

20K * 60 seconds * 60 minutes * 24 hours = ~1.7 billion requests per day

If we plan to cache 20% of these requests, we will need

0.2 * 1.7 billion * 100 bytes = ~34 GB of memory.
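
As a quick check of the storage and cache numbers, here is a small sketch. The constant names are hypothetical; the values (500M URLs per month, 5 years of retention, 100 bytes per object, 20K reads per second, 20% cached) are the article's assumptions, and decimal units (1 GB = 10^9 bytes) are used.

```python
# Storage and cache sizing with the article's assumed numbers.
NEW_URLS_PER_MONTH = 500_000_000
RETENTION_MONTHS = 5 * 12
BYTES_PER_OBJECT = 100
READS_PER_SECOND = 20_000
CACHED_PERCENT = 20

total_objects = NEW_URLS_PER_MONTH * RETENTION_MONTHS   # 30 billion objects
storage_bytes = total_objects * BYTES_PER_OBJECT        # 3 * 10**12 bytes = 3 TB

reads_per_day = READS_PER_SECOND * 24 * 3600            # ~1.7 billion
cache_bytes = reads_per_day * CACHED_PERCENT // 100 * BYTES_PER_OBJECT

print(f"objects stored: {total_objects:,}")
print(f"storage:        {storage_bytes / 10**12:.1f} TB")
print(f"cache memory:   {cache_bytes / 10**9:.1f} GB")
```

Because many of the daily requests hit the same URLs, the real cache footprint would likely be smaller than this upper estimate.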

★ Data flow:

For newbies to system design, please remember, “If you are confused about where to start for the system design, try to start with the data flow.”

Now, one of the main tasks of the server-side components is generating a unique key for an input URL. Here, our input data is only a URL, so we store it as a string. The output is a shortened version of that URL. If somebody clicks on the shortened URL, they are redirected to the original URL. Each output URL needs to be unique.

★ Generate a short, unique key for a given URL

For example, take a random shortened URL, “rb.gy/ln9zeb”. The characters after the slash (“ln9zeb”) form the unique key. Our input is a long URL given by a user.

We need to compute a unique hash of the input URL. If we use base64 encoding, a 6-character key gives us 64^6 = ~68.7 billion possible strings, which should be enough for our system.
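
One possible way to turn a URL into a 6-character key is sketched below. The article does not name a hash function, so MD5 and the URL-safe Base64 alphabet are assumptions made for illustration, and `short_key` is a hypothetical helper name. Truncating a hash to 6 characters can produce collisions, so a real service would still need a uniqueness check.

```python
import base64
import hashlib

KEY_LENGTH = 6  # characters in the short key, as in the article


def short_key(long_url: str, length: int = KEY_LENGTH) -> str:
    """Hash the URL and keep the first `length` Base64 characters."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    # The URL-safe alphabet uses '-' and '_' instead of '+' and '/'.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:length]


key_space = 64 ** KEY_LENGTH  # number of distinct 6-character keys

print(short_key("https://towardsdatascience.com/some-long-article-url"))
print(f"possible keys: {key_space:,}")
```

The same input always yields the same key, which is exactly the duplicate-URL problem discussed next.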

Problem: If multiple users enter the same URL, the system should not give them the same shortened URL. How should the system behave when the input strings are duplicated?

Solution: We may append an increasing sequence number to each input URL before hashing it, which makes each hashed string unique. However, the sequence number might eventually overflow. Alternatively, we may append the user-id to the input URL, assuming the user-id is unique.
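
Both variants of this solution can be sketched as follows. The helper names, the `|` separator, and the use of MD5 are illustrative assumptions, not part of the original design. Python integers do not overflow, so the overflow concern applies to fixed-width counters in other languages or databases.

```python
import base64
import hashlib
import itertools

_sequence = itertools.count(1)  # in-process counter; a real system would need a shared one


def _encode(text: str, length: int = 6) -> str:
    """Hash the text and keep the first `length` URL-safe Base64 characters."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")[:length]


def key_with_user(long_url: str, user_id: str) -> str:
    """Same URL, different users: the user-id changes the hashed string."""
    return _encode(f"{long_url}|{user_id}")


def key_with_sequence(long_url: str) -> str:
    """Same URL, repeated requests: the sequence number changes the hashed string."""
    return _encode(f"{long_url}|{next(_sequence)}")


url = "https://example.com/article"
print(key_with_user(url, "alice"), key_with_user(url, "bob"))
print(key_with_sequence(url), key_with_sequence(url))
```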

★ Unique Key Generation:

In the system, the user-id should be unique so that we can compute a unique hash. We can also have a standalone Unique-key Generation Service (UGS) that generates random IDs beforehand and stores them in a database.

Figure: UGS service for unique key generation (Image by Author)

Whenever we need a new key, we can take one of the already generated IDs. This makes things faster: when a new request comes in, we don’t need to create an ID and check its uniqueness on the spot. The UGS ensures all the IDs are unique and stores them in a database, so IDs don’t need to be generated for every request.

As we need one byte to store one character, we can store all these keys in:

6 (characters) * 68.7B (unique keys) ~= 412 GB.
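
A minimal in-memory sketch of the UGS idea follows. The class name, the two sets standing in for the key database, and the 64-symbol alphabet are assumptions for illustration; a real service would persist the keys and coordinate between replicas.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + "-_"  # 64 URL-safe symbols (assumed)


class KeyGenerationService:
    """Pre-generates unique random keys and hands each one out once."""

    def __init__(self, pool_size: int, key_length: int = 6):
        self.unused = set()  # stands in for the "unused keys" table
        self.used = set()    # stands in for the "used keys" table
        while len(self.unused) < pool_size:
            # The set silently drops duplicates, so every stored key is unique.
            key = "".join(secrets.choice(ALPHABET) for _ in range(key_length))
            self.unused.add(key)

    def take_key(self) -> str:
        key = self.unused.pop()  # raises KeyError when the pool is empty
        self.used.add(key)
        return key


key_db_bytes = 6 * 64 ** 6  # one byte per character for every possible key

ugs = KeyGenerationService(pool_size=1000)
print(ugs.take_key(), ugs.take_key())
print(f"key database size: {key_db_bytes / 10**9:.0f} GB")
```

Using `secrets` rather than a sequential counter also helps with the requirement that shortened links should not be predictable.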

★ Availability & Reliability:

If we keep one copy of UGS, it’s a single point of failure. So, we need to make a replica of UGS. If the primary server dies, the secondary one can handle the requests of the users.

Each UGS server can cache some keys from the key-DB to speed things up. But we have to be careful: if a server dies before consuming all of its cached keys, we lose those keys. We may assume this is acceptable, since we have about 68B unique six-character keys.

To ensure availability, we need to remove every single point of failure in the system. Replicating the data removes a single point of failure and provides a backup. We can keep multiple replicas to ensure database server reliability. For uninterrupted service, the other servers need replicas as well.

★ Data Storage:

In this system, we need to store billions of records. Each object we keep is probably smaller than 1 KB, and one URL record is not related to another. So, we can use a NoSQL database like Cassandra, DynamoDB, etc. A NoSQL database is also easier to scale, which is one of our requirements.

★ Scalability:

For supporting billions of URLs, we need to partition our database to divide and store our data into different DB servers.

i) We can partition the database based on the first letter of the hash key. We can put keys starting with ‘A’ in one server, ‘B’ in another server. This is called Range Based Partitioning.

The problem with this approach is that it can lead to unbalanced partitioning. For example, there are very few words starting with ‘Z’. On the other hand, we may have too many URLs that begin with the letter ‘E’.

We may combine less frequently occurring letters into one database partition.

ii) We can also partition based on the hash of the objects we are storing. We may take the hash of the key to determine the partition in which we can store the data object. The hash function will generate a server number, and we will store the key in that server. This process can make the distribution more random. This is Hash-Based Partitioning.
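
The two partitioning schemes can be contrasted in a short sketch. The function names and the choice of SHA-256 are assumptions for illustration. A stable hash from `hashlib` is used instead of Python's built-in `hash()`, because the latter is randomized per process for strings.

```python
import hashlib


def range_partition(key: str) -> str:
    """Range-based: the first character of the key names the partition."""
    return key[0].upper()


def hash_partition(key: str, num_servers: int) -> int:
    """Hash-based: a stable hash of the key picks the server number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


for key in ["ln9zeb", "lq3k0a", "Zx81pp"]:
    print(key, range_partition(key), hash_partition(key, num_servers=4))
```

One drawback of the plain modulo scheme is that changing `num_servers` remaps most keys to different servers.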

If this approach still leads to overloaded partitions, we can use Consistent Hashing.
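
For readers who have not met consistent hashing before, here is a small sketch of a hash ring with virtual nodes. The class name, the server names, and the number of virtual nodes are illustrative assumptions; production systems such as Cassandra have their own, more elaborate implementations.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    def __init__(self, servers, virtual_nodes: int = 100):
        # Each server is placed on the ring many times to even out the load.
        self._ring = sorted(
            (_point(f"{server}#{i}"), server)
            for server in servers
            for i in range(virtual_nodes)
        )
        self._points = [point for point, _ in self._ring]

    def server_for(self, key: str) -> str:
        # Walk clockwise to the first server position after the key's position.
        index = bisect.bisect_right(self._points, _point(key)) % len(self._ring)
        return self._ring[index][1]


ring = ConsistentHashRing(["db1", "db2", "db3"])
print(ring.server_for("ln9zeb"))
```

The useful property is that adding a server only moves the keys that now belong to the new server; all other keys stay where they were.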

★ Cache:

We can cache URLs that are frequently accessed by the users. Before querying the database, the UGS servers may check whether the cache has the desired URL. If it does, they do not need to query the database at all.

What will happen when the cache is full? We may replace an older, unused link with a newer or more popular URL. We may choose the Least Recently Used (LRU) cache eviction policy for our system. With this policy, we remove the least recently used URL first.
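
The LRU policy described above can be sketched with `collections.OrderedDict` from the standard library. The class name and the tiny capacity are only for illustration; a real deployment would more likely use a dedicated cache server.

```python
from collections import OrderedDict


class LRUCache:
    """Maps short keys to long URLs and evicts the least recently used entry."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None                   # cache miss: caller falls back to the database
        self._items.move_to_end(key)      # mark as most recently used
        return self._items[key]

    def put(self, key, long_url):
        self._items[key] = long_url
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.put("aaa111", "https://example.com/a")
cache.put("bbb222", "https://example.com/b")
cache.get("aaa111")                           # "aaa111" is now the most recent
cache.put("ccc333", "https://example.com/c")  # evicts "bbb222"
print(cache.get("bbb222"), cache.get("aaa111"))
```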

★ Load balancer:

We can add a load balancing layer at different places in our system, in front of the URL shortening server, database, and cache servers.

We may use a simple Round Robin approach for request distribution. In this approach, the load balancer (LB) distributes incoming requests equally among the backend servers. It is simple to implement, and if a server is dead, the LB stops sending traffic to it.

Problem: With this approach, the LB keeps sending new requests to a server even when that server is overloaded. We might need a more intelligent LB later.
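
A minimal sketch of Round Robin with dead-server skipping is shown below. The class and method names are hypothetical, and health checks are reduced to explicit `mark_down` calls. Note that the sketch has exactly the weakness described above: it knows whether a server is alive, not how loaded it is.

```python
class RoundRobinBalancer:
    """Hands out backend servers in turn, skipping servers marked as down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        # Try each server at most once per call, starting where we left off.
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


lb = RoundRobinBalancer(["app1", "app2", "app3"])
print([lb.pick() for _ in range(4)])
lb.mark_down("app2")
print([lb.pick() for _ in range(3)])
```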

★ Link expiration after a duration:

If the expiration time is reached for a URL, what would happen to the link?

We can search our datastore for expired links and remove them. The problem is that actively searching for expired links would put a lot of pressure on our database.

We can do it another way: remove expired links slowly and periodically. Even if some expired links live a little longer, they should never be returned to users.

If a user tries to access an expired link, we can remove the link and return an error to the user. A periodic cleanup process can also run to remove expired links from our database. As storage is getting cheaper, it is acceptable if some expired links stay there because the cleanup missed them.

After removing an expired link, we can put its key back into the key database for reuse.
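
The lazy and periodic clean-up described in this section could look roughly like this. The `LinkStore` class, the in-memory dictionary, and the `free_keys` list are stand-ins made up for illustration; the `now` parameter exists only so the behavior can be tried without waiting.

```python
import time


class LinkStore:
    """In-memory stand-in for the URL database with expiring links."""

    def __init__(self):
        self._links = {}      # key -> (long_url, expires_at)
        self.free_keys = []   # keys released for reuse

    def add(self, key, long_url, ttl_seconds, now=None):
        now = time.time() if now is None else now
        self._links[key] = (long_url, now + ttl_seconds)

    def resolve(self, key, now=None):
        """Lazy expiration: an expired link is removed when someone asks for it."""
        now = time.time() if now is None else now
        entry = self._links.get(key)
        if entry is None:
            return None
        long_url, expires_at = entry
        if now >= expires_at:
            self._remove(key)
            return None       # the caller returns an error to the user
        return long_url

    def cleanup(self, batch_size, now=None):
        """Periodic clean-up: remove at most `batch_size` expired links per run."""
        now = time.time() if now is None else now
        expired = [k for k, (_, exp) in self._links.items() if now >= exp][:batch_size]
        for key in expired:
            self._remove(key)
        return len(expired)

    def _remove(self, key):
        del self._links[key]
        self.free_keys.append(key)  # the key can be handed out again


store = LinkStore()
store.add("ln9zeb", "https://example.com/article", ttl_seconds=60, now=0)
print(store.resolve("ln9zeb", now=30))   # still valid
print(store.resolve("ln9zeb", now=61))   # expired, removed on access
```

Limiting each clean-up run to a small batch is one way to keep the pressure on the database low.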

★ Security:

We can store the access type (public/private) with each URL in the database. If a user tries to access a URL that they do not have permission for, the system can send an error (HTTP 401) back.
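
A tiny sketch of such a permission check is given below. The record layout (`access`, `allowed_users`) and the function name are assumptions for illustration, as is the choice of a 302 redirect for the success case; only the 401 error comes from the text.

```python
from http import HTTPStatus


def check_access(record: dict, user_id: str) -> HTTPStatus:
    """Return the HTTP status for a lookup of one stored URL record."""
    if record["access"] == "public" or user_id in record["allowed_users"]:
        return HTTPStatus.FOUND         # 302: redirect to the original URL
    return HTTPStatus.UNAUTHORIZED      # 401: the error mentioned in the article


private_link = {"access": "private", "allowed_users": {"alice"}}
print(check_access(private_link, "alice"), check_access(private_link, "bob"))
```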

Figure: Final design of URL Shortening Service (Image by Author)

Conclusion:

In this system, we did not consider the UI part. And as this is a web service, the client part is not discussed either. Unique key generation is an important part of this system, so we added an extra service to create and store unique keys for URLs. To ensure the availability of services, we replicated the servers so that if one goes down, the others can still serve requests. Databases are also replicated to ensure data reliability. The cache server stores popular queries to reduce latency. And a load balancer is added to distribute incoming requests equally among the backend servers.

Source: Grokking the System Design Interview Course.

Thank you for reading the article. Have a good day 😃

This article is part of a series on system design for beginners. Here is the link.
