WebSockets: A Lesser-Known Pattern in Data Engineering

Learn how to use the API approach to enable asynchronous, full-duplex data transfer between a client and a server using WebSockets, a protocol reached by upgrading an HTTP connection, with working Python code.

Let’s Catch Up on APIs

The typical data integration pipeline is generally one way, i.e. a client such as a mobile application makes a call to the RESTful API running on a server somewhere. The API does all the work and sends the results back to the client. If the client needs the same data again, it has to call the API once again. If it needs the data continuously, e.g. to update its own dashboard, it has to call the API repeatedly to get the most recent data.

This model works for several reasons. Requests and responses travel over HTTP (or HTTPS, if needed), which is a well-established standard. The payload can vary to accommodate all types of data, and the response is predictable: if everything goes well, the client gets a response code of 200. The request-response pairs are also asynchronous, i.e. an API may receive 1000 calls at any point, process them in parallel and send the responses in parallel as well.

The Problem with HTTP

There are two problems. First, the server or the API in this case is passive. It waits for the call from the client and only then sends the data. The communication is only in one direction at any point. The request comes from the client; the API processes it, responds and that’s the end of the communication. Notice how this is a purely one-direction-at-a-time movement of data; it’s not bi-directional, or chatty. In other words, it’s not full duplex and not in real time.

The second problem is that the API does not send anything else back to the client on its own. It’s a corollary of the first problem: since the communication is driven by the client, not the server, the server cannot use the time it is already sending data to the client to send something of its own. Hence the client continuously polls the API to get the most recent data. Not only does this waste bandwidth; it also adds latency, since the data is only as fresh as the last poll.
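To make the cost of polling concrete, here is a minimal sketch that simulates a client polling at a fixed interval. The function name, the update times and the interval are all made up for illustration; nothing here talks to a real API.

```python
def simulate_polling(update_times, poll_interval, duration):
    """Simulate a fixed-interval poller.

    update_times must be sorted. Returns a tuple:
    (polls made, polls that found nothing new, worst delay before an update was seen).
    """
    polls = 0
    empty_polls = 0
    worst_delay = 0
    seen = 0  # number of updates the client has already observed
    for now in range(poll_interval, duration + 1, poll_interval):
        polls += 1
        fresh = [t for t in update_times[seen:] if t <= now]
        if not fresh:
            empty_polls += 1  # a wasted round trip
            continue
        worst_delay = max(worst_delay, now - fresh[0])
        seen += len(fresh)
    return polls, empty_polls, worst_delay


# The server total changes at t=3, t=4 and t=27; the client polls every 5 seconds.
print(simulate_polling(update_times=[3, 4, 27], poll_interval=5, duration=30))
# (6, 4, 3)
```

In this made-up timeline four of the six polls return nothing new, and the client still sees one update three seconds late.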

For instance, consider a client application that counts users’ clicks on a webpage and sends the counts to an API, which calculates the running total from all the clients and stores it in a database. But the client also needs to know the overall total from all the clients to display on its webpage. In a normal API design it works like this:

Sequence of events:

  1. Client calls the API with the parameter num_of_clicks = 6.
  2. API receives it, updates its database and responds with a “Got it”. In API terms, that is a response code of 200. The running total of clicks from all clients was 10 earlier. With the new count of 6 coming from the client, the running total is now 10 + 6 = 16. Communication closed.
  3. But the client still needs to know the overall count to display on its page. So it sends another request to the API. This time it is a GET request.
  4. API responds with 16. Communication closed.
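The four steps above can be sketched with an in-memory stand-in for the API. The class ClickApi and its method names are hypothetical and no HTTP is involved; the point is only that the client needs two separate calls to end up with the total.

```python
class ClickApi:
    """In-memory stand-in for the REST API (hypothetical, not a real server)."""

    def __init__(self, running_total=0):
        self.running_total = running_total
        self.requests_served = 0

    def post_clicks(self, num_of_clicks):
        # Steps 1 and 2: accept the count, update the total, answer with a status only.
        self.requests_served += 1
        self.running_total += num_of_clicks
        return 200

    def get_total(self):
        # Steps 3 and 4: a second, separate request just to read the total.
        self.requests_served += 1
        return 200, self.running_total


api = ClickApi(running_total=10)
status = api.post_clicks(6)
status, total = api.get_total()
print(status, total, api.requests_served)  # 200 16 2
```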

There were two separate exchanges, both initiated by the client. Wouldn’t it have been nice to have the server push the running total to the client during the first exchange itself, since the connection was already open? For instance, in addition to the 200 code, it could have sent the running total, eliminating the second exchange. Even better, that would have been a real-time data update at the client rather than the result of a separate request.

Consider a real-life example where multiple clients report their counts and each benefits from getting the running total from the API as it reports. For instance, another client reports its own count of 5 clicks and gets the updated running total back in the same exchange.

Problem with Directly Coupled Systems

Can’t we just make the process bi-directional by directly coupling the two systems (the client and the server)? Sure, we can; but we are hit with two problems:

First, it will no longer be a RESTful API. It will be just two programs communicating synchronously, which reduces scalability. In the previous example, when two clients make a synchronous call at the same time, one has to wait. Thousands of clients making synchronous calls will make it worse.

Second, it will not communicate over HTTP, which is an accepted standard for data communication: firewalls are configured for it, proxy servers understand it and clients (such as web browsers and SDKs) are readily available for it.

When you come to think of it, we really don’t need a sustained, dedicated or synchronous connection. We just want to make the data exchange bi-directional so that the server just gets a chance to push something to the client without the client explicitly asking for it. We love the API approach; but we need to do something to make the process more chatty.

Let’s state the problem a bit more clearly:

We want to let the server push some data to the client over http (or https) without the client explicitly requesting it.

So, how do we solve it?

There are multiple options. One option is HTTP long polling: the client makes a request and the server holds it open until it has an update to send, so the client gets the update in near real time. This may be desirable in some cases, but not all, and it does not scale well either. For full-duplex data transfer, a set of techniques collectively called Comet emerged (https://en.wikipedia.org/wiki/Comet_(programming)). It lacked wide adoption, but it did prompt thinking about the need for a standardized way to enable full-duplex data transfer over HTTP. This standardization effort resulted in two patterns: WebSockets and Server-Sent Events (SSE). WebSocket is a protocol that allows full-duplex data transfer. In this blog you will learn about WebSockets, leaving SSE to a future blog.

Do not confuse WebSockets with a seemingly similar but different concept called the 2-way API call, often called webhooks. We will explore that in a different blog.

Let’s explore WebSockets

I will leave the complete history and evolution of WebSockets to its Wikipedia article (https://en.wikipedia.org/wiki/WebSocket). In short, the WebSocket protocol lets the server send data back to the client over a TCP connection. It is not the same as HTTP but is compatible with it, so the communication occurs over ports 80 and 443, the same ports used for HTTP and HTTPS respectively. This helps tremendously with firewall and proxy configurations, which already account for these ports.

Wait! This is confusing. WebSockets is a different protocol, not http; but compatible with it? What does that actually mean, you may ask.

A Complimentary Upgrade

This is where a feature of HTTP helps. HTTP/1.1 allows a client to ask that the connection it has opened be upgraded to a different protocol, which it names in the Upgrade header. A small set of protocol tokens is registered for this purpose; h2c and websocket are two of them. The request arrives as a normal HTTP request, and the connection is then switched to the requested protocol. WebSocket is one such approved upgrade, and the switch happens during the opening handshake.

A typical HTTP request URI (uniform resource identifier) has the following structure:

http://host[:port]path[?query]

A WebSocket request URI looks exactly like that, with a different scheme (“ws”):

ws://host[:port]path[?query]
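Because a ws URI has the same shape as an http one, the standard library URL parser can take it apart. The following is a minimal sketch; the helper name describe and the example.com URI are made up, and the default ports are the usual HTTP and HTTPS ports, with wss being the secure counterpart of ws.

```python
from urllib.parse import urlsplit

# Default ports: ws shares port 80 with http, wss shares port 443 with https.
DEFAULT_PORTS = {"ws": 80, "wss": 443}


def describe(uri):
    """Return (scheme, host, port, path, query) for a ws:// or wss:// URI."""
    parts = urlsplit(uri)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.scheme, parts.hostname, port, parts.path or "/", parts.query


print(describe("ws://localhost:8181/"))
print(describe("wss://example.com/clicks?client=42"))
```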

And just like https for http, there is a wss scheme for secure WebSocket connections. A client that wants a WebSocket connection sends the following:

GET ws://www.proligence.com:5678/ HTTP/1.1
Host: localhost:5678
Connection: Upgrade
Pragma: no-cache
Cache-Control: no-cache
Upgrade: websocket
Sec-WebSocket-Version: 13
Sec-WebSocket-Key: q4xkcO32u266gldTuKaSOw==

Note the lines

Connection: Upgrade
Upgrade: websocket

This is where the client asks for this connection request to be upgraded to a WebSocket connection. Assuming the server is WS capable and is willing, it responds with the following:

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: fA9dggdnMPU79lJgAE3W4TRnyDM=

The line HTTP/1.1 101 Switching Protocols is critical: it tells the client that the protocol has been switched from HTTP to WS. The client now has a WebSocket connection upgraded from an HTTP connection. Nothing like a free upgrade.
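As a rough illustration of what a client looks for in that response, the sketch below parses the raw response text shown above and checks the status code and the two upgrade headers. The function names are made up, and a real client library does considerably more validation than this.

```python
def parse_handshake_response(raw):
    """Split a raw HTTP response head into (status_code, headers)."""
    head = raw.split("\r\n\r\n", 1)[0]
    status_line, *header_lines = head.split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # header names are case-insensitive
    return status_code, headers


def is_websocket_upgrade(status_code, headers):
    """True when the server agreed to switch this connection to WebSocket."""
    return (
        status_code == 101
        and headers.get("upgrade", "").lower() == "websocket"
        and "upgrade" in headers.get("connection", "").lower()
    )


RESPONSE = (
    "HTTP/1.1 101 Switching Protocols\r\n"
    "Upgrade: websocket\r\n"
    "Connection: Upgrade\r\n"
    "Sec-WebSocket-Accept: fA9dggdnMPU79lJgAE3W4TRnyDM=\r\n"
    "\r\n"
)

status, headers = parse_handshake_response(RESPONSE)
print(status, is_websocket_upgrade(status, headers))  # 101 True
```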

By the way, if you want to read up on the frame format of the WebSocket protocol, you are welcome to visit the official IETF page https://tools.ietf.org/html/rfc6455#section-5.1. But rest assured that, for the most part, you don’t have to learn it to leverage the power of WebSockets.

The Sec- Parameters

What about the Sec-WebSocket-Key and Sec-WebSocket-Accept values in the request and response above, you may ask.

Remember, WebSocket establishes two-way, full-duplex communication between the client and the server. Both sides repurpose an HTTP connection, but the protocol they then speak is not HTTP. This opens up a risk: what if one of them misinterprets WebSocket data as a normal HTTP request? Both have to know definitively that the other supports WebSockets and that the communication is WebSocket based, not HTTP. So they establish a validation step to ensure WS traffic is not misinterpreted as HTTP. This is why the handshake is validated with a key, which plain HTTP does not require. The client sends a randomly generated key (called a “nonce”) in its request:

Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==

The server takes this value, concatenates the static value 258EAFA5-E914-47DA-95CA-C5AB0DC85B11 (defined in RFC 6455), computes the SHA-1 hash of the result and base-64 encodes that hash. The encoded value is what it sends back in its own response:

Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

The client computes the same value on its side, compares it with this header and accepts WebSocket frames only if the values match. This way the client and the server reassure themselves that WS traffic is not misinterpreted as HTTP.
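The key exchange can be reproduced in a few lines with the standard library. This is a sketch of the computation described in RFC 6455 (SHA-1 over the key plus the fixed GUID, then base-64); the function names make_key and compute_accept are made up, and the sample nonce is the one used in the RFC.

```python
import base64
import hashlib
import os

# Fixed GUID defined in RFC 6455.
GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def make_key():
    """Client side: a random 16-byte nonce, base-64 encoded."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def compute_accept(key):
    """Server side: SHA-1 of key + GUID, then base-64 of the raw digest."""
    digest = hashlib.sha1((key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample nonce from RFC 6455.
print(compute_accept("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=

# The client repeats the computation and compares it with the header it received.
key = make_key()
header_from_server = compute_accept(key)
print(header_from_server == compute_accept(key))
```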

Code in Action

Enough theory. If you are like me, you must be itching for action. How easily can you use WebSockets? Do you need to know the entire frame layout, how the base-64 values are computed, etc.?

Fortunately, no. Most modern languages have libraries that give you a jump start on WebSockets. Let’s consider the example described above, i.e. the client needs to send its count to an API and get the running total from the server in the same full-duplex exchange. We will implement it in Python.

We will create two programs:

  1. The server, which is an API that listens for WebSocket requests from clients. When a connection is established, it receives the count from the client, updates its running total across all clients and sends the current running total back to the client, all over the same connection.

  2. The client, which is just a simple microservice that sends the count (which, to keep it simple, we will let the user enter at runtime) and receives the total count from the server, displays it and closes the connection.

Here is the code:

server.py

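# This follows the handler style of the websockets releases available when
# this was written: the handler receives the connection and the request path.
import asyncio as aio
import websockets as wss

sum = 0  # running total across all clients (shadows the built-in sum)
print(f"Waiting for clients to connect. Running total = {sum}")

async def add(wss, ev):
    # wss is the client connection (inside this function it shadows the
    # module alias); ev is the request path and is not used.
    global sum
    event = await wss.recv()      # count sent by the client
    print(f"Debug: Received {event}")
    sum = sum + int(event)
    await wss.send(str(sum))      # push the new total back to the client
    print(f"Running total = {sum}")

ws_server = wss.serve(add, "localhost", 8181)
aio.get_event_loop().run_until_complete(ws_server)
aio.get_event_loop().run_forever()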

Let’s dissect the code.

Since this is an API call, we need the call to be asynchronous (you will learn about it in just a bit), hence we import the asyncio module, which is part of the Python standard library. We also import the websockets module, which encapsulates everything needed to use WebSockets. If you don’t have it, install it using:

pip install websockets

Next, we define a variable called “sum” and set it to 0 when the server starts. Our main function, where we compute the running total, is named “add”. Inside this function we add the counts we get from clients; but the value has to persist across calls and be visible to the entire program, so the function declares the variable as global. This is vital in Python. Otherwise the assignment would make the variable local to the function, and reading it before it is assigned would raise an error instead of updating the module-level total.
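The effect of the global declaration is easy to see in isolation. The sketch below uses a variable named total and two made-up functions, one without the declaration and one with it; it has nothing to do with WebSockets itself.

```python
total = 0  # module-level running total


def add_without_global(count):
    # The assignment makes 'total' a local name, so reading it first fails.
    total = total + count
    return total


def add_with_global(count):
    global total  # refer to the module-level variable instead
    total = total + count
    return total


try:
    add_without_global(6)
except UnboundLocalError as exc:
    print("without global:", type(exc).__name__)

print("with global:", add_with_global(6))
```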

Next you will see declarations that are probably not so common:

async def add(wss, ev)
await wss.send(str(sum))

These are necessary to define an asynchronous process in Python. In brief, async def declares a coroutine, and await pauses it until the awaited operation completes, letting other work run in the meantime. It’s impossible to do justice to these keywords in this small blog, so I will leave the details for another time.
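For a small taste of what the two keywords buy you, here is a sketch that uses only asyncio. The coroutine handle and its delays are invented for the demonstration; asyncio.sleep stands in for waiting on a network message.

```python
import asyncio


async def handle(client_id, delay, finished):
    # await hands control back to the event loop while this "client" is waiting.
    await asyncio.sleep(delay)
    finished.append(client_id)
    return f"client {client_id} done"


async def main():
    finished = []  # records the order in which the coroutines complete
    results = await asyncio.gather(
        handle(1, 0.2, finished),
        handle(2, 0.05, finished),
    )
    return results, finished


results, finished = asyncio.run(main())
print(results)   # results follow the order of the gather() arguments
print(finished)  # client 2 finishes first because it waited less
```

Both coroutines wait at the same time, so the slower client does not hold up the faster one.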

The rest of server.py is self-explanatory. We define a WebSocket server called ws_server on port 8181. When a request is received, it calls the add() function to update the sum variable and then sends the value of the variable to the client over the same connection.
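Library APIs move on, so here is a hedged variant of the same server. It assumes a recent release of the websockets library in which the handler is called with the connection only and serve() can be used as an async context manager; check the documentation for the version you have installed. The name running_total replaces sum to avoid shadowing the built-in.

```python
import asyncio

running_total = 0  # shared by every connection handled by this process


async def add(connection):
    """Receive one count, update the total and push the total back."""
    global running_total
    count = int(await connection.recv())
    running_total += count
    await connection.send(str(running_total))


async def main():
    # Imported here so the handler above can be exercised without the library.
    import websockets

    # Assumption: serve() works as an async context manager in the installed release.
    async with websockets.serve(add, "localhost", 8181):
        await asyncio.Future()  # never completes, so the server keeps running


# To start the server, call: asyncio.run(main())
```

Keeping the handler free of any library import also makes it easy to test with a stand-in connection object that offers recv() and send().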

Now let’s write a simple client program that sends its count to the server via an API call and accepts whatever the server sends over the same connection. The count will be entered by the user.

client.py

import asyncio as aio
import websockets as ws

async def send_event():
    uri = "ws://localhost:8181"
    # Open the connection, send one count and wait for the server's reply.
    async with ws.connect(uri) as wss:
        event = input("What is the number you want to send? ")
        await wss.send(event)
        recv_event = await wss.recv()   # running total pushed by the server
        print(f"Running total so far (received): {recv_event}")
    # Leaving the async with block closes the connection.

aio.run(send_event())

With these two in place, in one window, run python server.py from the command line. The program will display the following and wait for the clients to connect.

Server> python server.py
Waiting for clients to connect. Running total = 0

In a different window, run python client.py from the command line.

Client> python client.py
What is the number you want to send? 1
Running total so far (received): 1

On the server window you can see the messages:

Debug: Received 1
Running total = 1

The client program has exited. From yet another window, run the same client.py and this time enter “3”:

Client> python client.py
What is the number you want to send? 3
Running total so far (received): 4

On the server window you can see the following:

Debug: Received 3
Running total = 4

The server is keeping the running total, which in itself is not that interesting. The key thing is that the server is pushing it to the client; the client is not pulling it from the server. And all this is happening over an API call, not a synchronous transfer of data between two directly coupled systems.

This was merely an example to illustrate the concept of WebSockets. It is not as restrictive as it may seem. The server can send any data it wants, not just the running total of the counts shown in this example. In real-life use, it will likely send a JSON object that the client parses to extract the meaningful data. For instance, a client running on a machine in a factory could be sending the power consumption recorded by its sensors, and the server could be keeping track of the maximum power consumption across all the clients. If the maximum consumption is reached anywhere, the server pushes a warning, or even a mandate to shut down some machine, to reduce the overall power consumption and the risk to the power grid. This warning is sent over the same connection, in real time from the server to the client, without the client explicitly asking for it.
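The factory scenario might look roughly like this on the message level. Everything in the sketch is an assumption made for illustration: the 50 kW limit, the client names, the field names in the JSON object and both function names. Sending the string over a WebSocket connection is left out; the sketch only shows how the server could build the message and how the client could act on it.

```python
import json

LIMIT_KW = 50.0  # hypothetical per-machine limit


def build_push_message(readings_kw, limit_kw=LIMIT_KW):
    """Server side: turn the latest readings into a JSON message to push."""
    peak_client = max(readings_kw, key=readings_kw.get)
    peak_kw = readings_kw[peak_client]
    if peak_kw >= limit_kw:
        payload = {"type": "warning", "client": peak_client,
                   "peak_kw": peak_kw, "action": "shutdown"}
    else:
        payload = {"type": "status", "peak_kw": peak_kw}
    return json.dumps(payload)


def handle_push_message(raw):
    """Client side: parse the JSON and decide what to do."""
    message = json.loads(raw)
    if message["type"] == "warning":
        return f"shut down {message['client']}"
    return "keep running"


raw = build_push_message({"press-1": 42.0, "oven-2": 55.5})
print(raw)
print(handle_push_message(raw))  # shut down oven-2
```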

In Summary

  1. WebSocket is a new protocol, distinct from HTTP
  2. It is compatible with HTTP
  3. The connection starts as a request over the normal HTTP ports
  4. The request includes a request to upgrade the connection
  5. If the connection is upgraded from HTTP to WS, the connection becomes full duplex
  6. This allows full bi-directional transfer of data while keeping the familiar API style
  7. The server can push data to the client over the same connection without the client explicitly asking for it

Translated from: https://medium.com/@arupnanda/websockets-lesser-known-pattern-in-data-engineering-200329e90331
