How I Made Porn 20x More Efficient with Python

Intro

Porn is a big industry. There aren’t many sites on the Internet that can rival the traffic of its biggest players.

And juggling this immense traffic is tough. To make things even harder, much of the content served from porn sites is made up of low-latency live streams rather than simple static video content. But for all of the challenges involved, rarely have I read about the developers who take them on. So I decided to write about my own experience on the job.

What’s the problem?

A few years ago, I was working for what was then the 26th most visited website in the world. Not just in the porn industry: in the world.

At the time, the site served up porn streaming requests with the Real-Time Messaging Protocol (RTMP). More specifically, it used a Flash Media Server (FMS) solution, built by Adobe, to provide users with live streams. The basic process was as follows:

  1. The user requests access to some live stream
  2. The server replies with an RTMP session playing the desired footage

For a couple of reasons, FMS wasn’t a good choice for us, starting with its costs, which included purchasing both:

  1. Windows licenses for every machine on which we ran FMS.
  2. ~$4k FMS-specific licenses, of which we had to purchase several hundred (and more every day) due to our scale.

All of these fees began to rack up. And costs aside, FMS was a lacking product, especially in its functionality (more on this in a bit). So I decided to scrap FMS and write my own RTMP parser from scratch.

In the end, I managed to make our service roughly 20x more efficient.

Getting started

There were two core problems involved. Firstly, RTMP and other Adobe protocols and formats were not open (i.e., publicly available), which made them hard to work with. How can you reverse-engineer or parse files in a format about which you know nothing? Luckily, there were some reversing efforts available in the public sphere (not produced by Adobe, but by osflash.org, who have since taken them down) on which we based our work.

Note: Adobe later released “specifications” which contained no more information than what was already disclosed in the non-Adobe-produced reversing wiki and documents. These specifications were of absurdly low quality and made it nearly impossible to actually use their libraries. Moreover, the protocol itself seemed intentionally misleading at times. For example:

  1. They used 29-bit integers.
  2. They included protocol headers with big-endian formatting everywhere, except for one specific (yet unmarked) field, which was little-endian (both this and the 29-bit integers are sketched in code below).
  3. They squeezed data into less space at the cost of computational power when transporting 9k video frames, which made little to no sense: they were earning back bits or bytes at a time, an insignificant gain for frames of that size.
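
To make those first two quirks concrete, here is a minimal Python sketch with names of my own choosing. The 29-bit integer follows the variable-length encoding described in the reverse-engineered AMF3 notes, and I am assuming the unmarked little-endian field is the message stream ID of the Type-0 message header, the one field the later spec admits is little-endian:

```python
def read_u29(buf, pos=0):
    """Decode AMF3's variable-length 29-bit unsigned integer.

    The first three bytes carry 7 payload bits each (the high bit is a
    continuation flag); if all three flags are set, a fourth byte
    contributes a full 8 bits, for 29 bits total.
    Returns (value, new_position).
    """
    value = 0
    for _ in range(3):
        byte = buf[pos]
        pos += 1
        if byte & 0x80:                      # continuation bit set
            value = (value << 7) | (byte & 0x7F)
        else:
            return (value << 7) | byte, pos
    return (value << 8) | buf[pos], pos + 1  # 4th byte: all 8 bits count


def parse_type0_header(buf):
    """Parse the 11-byte RTMP Type-0 message header. Every field is
    big-endian except the message stream ID, which is little-endian."""
    timestamp = int.from_bytes(buf[0:3], "big")
    length = int.from_bytes(buf[3:6], "big")
    type_id = buf[6]
    stream_id = int.from_bytes(buf[7:11], "little")  # the odd one out
    return timestamp, length, type_id, stream_id
```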

And secondly: RTMP is highly session-oriented, which made it virtually impossible to multicast an incoming stream. Ideally, if multiple users wanted to watch the same live stream, we could just hand them back pointers to a single session in which that stream was being aired (this would be multicasting). But with RTMP, we had to create an entirely new instance of the stream for every user who wanted access. This was a complete waste.


My solution

With that in mind, I decided to re-package/parse the typical response stream into FLV ‘tags’ (where a ‘tag’ is just some video, audio, or metadata). These FLV tags could travel within the RTMP stream with little issue.
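
For illustration, here is roughly what packing one such tag looks like. This is a sketch based on the public FLV format, with a helper name of my own, not the production code:

```python
import struct

AUDIO, VIDEO, SCRIPT = 8, 9, 18   # the three FLV tag types

def make_flv_tag(tag_type, timestamp_ms, payload):
    """Package one FLV tag: type (1 byte), payload size (3 bytes),
    timestamp (3 low bytes plus 1 extended high byte), stream ID
    (always 0), then the payload and a 4-byte PreviousTagSize trailer."""
    header = (bytes([tag_type])
              + len(payload).to_bytes(3, "big")
              + (timestamp_ms & 0xFFFFFF).to_bytes(3, "big")
              + bytes([(timestamp_ms >> 24) & 0xFF])
              + b"\x00\x00\x00")              # stream ID, always zero
    tag = header + payload
    return tag + struct.pack(">I", len(tag))  # PreviousTagSize
```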

The benefits of such an approach:

  1. We only needed to repackage a stream once (repackaging was a nightmare due to the lack of specifications and protocol quirks outlined above).
  2. We could re-use any stream between clients with very few problems by simply handing each of them an FLV header, while an internal pointer to the FLV tags (along with some sort of offset to indicate where each client is in the stream) allowed access to the content (see the sketch below).
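
A minimal sketch of that sharing model, with hypothetical names; the real buffering layer added frame-dropping heuristics on top, described further down:

```python
FLV_HEADER = (b"FLV\x01\x05\x00\x00\x00\x09"  # 'FLV', v1, audio+video
              + b"\x00\x00\x00\x00")          # PreviousTagSize0

class Broadcast:
    """One live stream, repackaged into FLV tags exactly once."""
    def __init__(self):
        self.tags = []        # rolling window of packed FLV tags
        self.base = 0         # stream index of tags[0]

class Viewer:
    """A client is just an FLV header plus a cursor into shared tags."""
    def __init__(self, broadcast):
        self.bc = broadcast
        self.sent_header = False
        self.cursor = broadcast.base + len(broadcast.tags)  # join live

    def next_chunk(self):
        """Return the bytes this client still needs; advance the cursor."""
        if not self.sent_header:
            self.sent_header = True
            return FLV_HEADER
        start = max(self.cursor, self.bc.base) - self.bc.base
        chunk = b"".join(self.bc.tags[start:])
        self.cursor = self.bc.base + len(self.bc.tags)
        return chunk
```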

I began development in the language I knew best at the time: C. Over time, this choice became cumbersome, so I started learning Python while porting over my C code. The development process sped up, but after a few demos I quickly ran into the problem of exhausting resources. Python's socket handling was not meant for these types of situations: specifically, in Python we found ourselves making multiple system calls and context switches per action, adding a huge amount of overhead.

Improving performance: mixing Python and C

After profiling the code, I chose to move the performance-critical functions into a Python module written entirely in C. This was fairly low-level stuff: specifically, it made use of the kernel’s epoll mechanism to provide a logarithmic order-of-growth.

In asynchronous socket programming, there are facilities that tell you whether a given socket is readable, writable, or in an error state. In the past, developers used the select() system call to get this information, but select() scales badly. poll() is a better version of select(), but it's still not great, since you have to pass in the whole set of socket descriptors on every call.

Epoll is amazing: all you have to do is register a socket, and the system will remember it, handling all the gritty details internally. So there's no argument-passing overhead with each call. It also scales far better and returns only the sockets you care about, rather than making you run through a list of 100k socket descriptors checking event bitmasks, as you must with the other mechanisms.
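
The production module was written in C, but Python's standard library exposes the same kernel mechanism as select.epoll (Linux only). A minimal sketch of the register-once, poll-forever pattern:

```python
import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 1935))    # RTMP's default port
server.listen(128)
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
conns = {}

while True:
    for fd, events in ep.poll():       # returns only the ready sockets
        if fd == server.fileno():
            conn, _ = server.accept()
            conn.setblocking(False)
            conns[conn.fileno()] = conn
            ep.register(conn.fileno(), select.EPOLLIN)
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if not data:               # peer closed the connection
                ep.unregister(fd)
                conns.pop(fd).close()
            # else: hand `data` up to the RTMP layer
```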

But for the increase in performance, we paid a price: this approach followed a completely different design pattern than before. The site’s previous approach was (if I recall correctly) one monolithic process which blocked on receiving and sending; I was developing an event-driven solution, so I had to refactor the rest of the code as well to fit this new model.

Specifically, in our new approach, we had a main loop, which handled receiving and sending as follows:


  1. The received data was passed (as messages) up to the RTMP layer.
  2. The RTMP messages were dissected and the FLV tags extracted.
  3. The FLV data was sent to the buffering and multicasting layer, which organized the streams and filled the low-level buffers of the sender.
  4. The sender kept a struct for every client, with a last-sent index, and tried to send as much data as possible to the client.

This was a rolling window of data, and included some heuristics to drop frames when the client was too slow to receive. Things worked pretty well.
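
A sketch of that per-client sender state; the names and the backlog threshold are hypothetical, and the real heuristics were smarter about what to drop:

```python
class ClientSender:
    """Per-client send state: pending bytes plus a high-water mark.
    pump() runs when epoll reports the socket writable and pushes as
    much as the kernel will accept without blocking."""
    def __init__(self, sock, max_backlog=2 * 1024 * 1024):
        self.sock = sock                 # a non-blocking socket
        self.buf = bytearray()
        self.max_backlog = max_backlog   # hypothetical threshold

    def queue(self, data):
        if len(self.buf) > self.max_backlog:
            self.buf.clear()             # client too slow: drop frames
        self.buf += data

    def pump(self):
        while self.buf:
            try:
                n = self.sock.send(self.buf)
            except BlockingIOError:      # kernel buffer full; try later
                break
            del self.buf[:n]
```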

Systems-level, architectural, and hardware issues

But we ran into another problem: the kernel's context switches were becoming a burden. As a result, we chose to write only every 100 milliseconds, rather than instantaneously. This aggregated the smaller packets and prevented a burst of context switches.
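
A sketch of that batching, assuming a timer driven from the main loop (the names are mine):

```python
import time

FLUSH_INTERVAL = 0.1    # 100 milliseconds

class CoalescingWriter:
    """Batch everything queued during the interval into a single write,
    so many small packets cost one syscall instead of one each."""
    def __init__(self, sender):
        self.sender = sender             # e.g., a per-client sender
        self.pending = bytearray()
        self.last_flush = time.monotonic()

    def write(self, data):
        self.pending += data

    def maybe_flush(self):
        """Called on every main-loop tick."""
        now = time.monotonic()
        if self.pending and now - self.last_flush >= FLUSH_INTERVAL:
            self.sender.queue(bytes(self.pending))
            self.pending.clear()
            self.last_flush = now
```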

Perhaps a larger problem lay in the realm of server architecture: we needed a load-balancing and failover-capable cluster, because losing users to server malfunctions is not fun. At first, we went with a separate-director approach, in which a designated ‘director’ would try to create and destroy broadcaster feeds by predicting demand. This failed spectacularly. In fact, everything we tried failed pretty substantially. In the end, we opted for a relatively brute-force approach of sharing broadcasters among the cluster’s nodes randomly, evening out the traffic.
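
The winning approach fits in a few lines. This sketch (hypothetical names) just parks each new broadcaster on a random node and remembers the choice so viewers can find the stream:

```python
import random

class Cluster:
    """Brute-force placement: each new broadcaster lands on a random
    node; with enough broadcasters, traffic evens out on its own."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}              # broadcaster id -> node

    def node_for(self, broadcaster_id):
        if broadcaster_id not in self.placement:
            self.placement[broadcaster_id] = random.choice(self.nodes)
        return self.placement[broadcaster_id]
```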

This worked, but with one drawback: although the general case was handled pretty well, we saw terrible performance when everyone on the site (or a disproportionate number of users) watched a single broadcaster. The good news: this never happens outside a marketing campaign. We implemented a separate cluster to handle this scenario, but in truth we reasoned that jeopardizing the paying user's experience for a marketing effort was senseless; in practice, it wasn't really a genuine scenario (although it would have been nice to handle every imaginable case).

Conclusion

Some statistics from the end result: daily traffic on the cluster was about 100k users at peak (60% load), ~50k on average. I managed two clusters (HUN and US), each of about 40 machines sharing the load. The aggregated bandwidth of the clusters was around 50 Gbps, of which they used around 10 Gbps at peak load. In the end, I managed to push out 10 Gbps/machine easily; theoretically,[1] this number could've gone as high as 30 Gbps/machine, which translates to about 300k users watching streams concurrently from one server (that works out to roughly 100 kbps per stream).

The existing FMS cluster contained more than 200 machines, which could've been replaced by 15 of mine, only 10 of which would do any real work. This gave us roughly a 200/10 = 20x improvement.

Probably my greatest takeaway from the project was that I shouldn't let myself be stopped by the prospect of having to learn a new skill set. In particular, Python, transcoding, and object-oriented programming were all concepts with which I had very sub-professional experience before taking on this project.

That, and that rolling your own solution can pay big.

[1] Later, when we put the code into production, we ran into hardware issues: the older Intel sr2500 servers we used could not handle 10 Gbit Ethernet cards because of their low PCI bandwidth. Instead, we used them in 1-4x 1 Gbit Ethernet bonds (aggregating the performance of several network interface cards into a virtual card). Eventually, we got some of the newer Intel sr2600 i7 machines, which served 10 Gbps over optics without any performance kinks. All the projected calculations refer to this hardware.
