How I Made Porn 20x More Efficient with Python

Intro

Porn is a big industry. There aren’t many sites on the Internet that can rival the traffic of its biggest players.

And juggling this immense traffic is tough. To make things even harder, much of the content served from porn sites is made up of low-latency live streams rather than simple static video content. But for all of the challenges involved, rarely have I read about the developers who take them on. So I decided to write about my own experience on the job.

What’s the problem?

A few years ago, I was working for what was then the 26th most visited website in the world. Not just in the porn industry: in the world.

At the time, the site served up porn streaming requests with the Real-Time Messaging Protocol (RTMP). More specifically, it used a Flash Media Server (FMS) solution, built by Adobe, to provide users with live streams. The basic process was as follows:

  1. The user requests access to some live stream
  2. The server replies with an RTMP session playing the desired footage

For a couple of reasons, FMS wasn’t a good choice for us, starting with its costs, which included purchasing both:

  1. Windows licenses for every machine on which we ran FMS.
  2. ~$4k FMS-specific licenses, of which we had to purchase several hundred (and more every day) due to our scale.

All of these fees began to rack up. And costs aside, FMS was a lacking product, especially in its functionality (more on this in a bit). So I decided to scrap FMS and write my own RTMP parser from scratch.

In the end, I managed to make our service roughly 20x more efficient.

Getting started

There were two core problems involved. Firstly, RTMP and other Adobe protocols and formats were not open (i.e., publicly available), which made them hard to work with. How can you reverse-engineer or parse files in a format about which you know nothing? Luckily, there were some reversing efforts available in the public sphere (not produced by Adobe, but by osflash.org, who have since taken them down) on which we based our work.

Note: Adobe later released “specifications” which contained no more information than what was already disclosed in the non-Adobe-produced reversing wiki and documents. These specifications were of absurdly low quality and made it nearly impossible to actually use their libraries. Moreover, the protocol itself seemed intentionally misleading at times. For example:

  1. They used 29-bit integers.
  2. They included protocol headers with big-endian formatting everywhere, except for one specific (yet unmarked) field, which was little-endian (both this and the 29-bit integers are sketched in code below).
  3. They squeezed data into less space at the cost of computational power when transporting 9k video frames, which made little to no sense: they were earning back bits or bytes at a time, an insignificant gain for frames of that size.
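
To make those first two quirks concrete, here is a minimal Python sketch with names of my own choosing. The 29-bit integer follows the variable-length encoding described in the reverse-engineered AMF3 notes, and I am assuming the unmarked little-endian field is the message stream ID of the Type-0 message header, the one field the later spec admits is little-endian:

```python
def read_u29(buf, pos=0):
    """Decode AMF3's variable-length 29-bit unsigned integer.

    The first three bytes carry 7 payload bits each (the high bit is a
    continuation flag); if all three flags are set, a fourth byte
    contributes a full 8 bits, for 29 bits total.
    Returns (value, new_position).
    """
    value = 0
    for _ in range(3):
        byte = buf[pos]
        pos += 1
        if byte & 0x80:                      # continuation bit set
            value = (value << 7) | (byte & 0x7F)
        else:
            return (value << 7) | byte, pos
    return (value << 8) | buf[pos], pos + 1  # 4th byte: all 8 bits count


def parse_type0_header(buf):
    """Parse the 11-byte RTMP Type-0 message header. Every field is
    big-endian except the message stream ID, which is little-endian."""
    timestamp = int.from_bytes(buf[0:3], "big")
    length = int.from_bytes(buf[3:6], "big")
    type_id = buf[6]
    stream_id = int.from_bytes(buf[7:11], "little")  # the odd one out
    return timestamp, length, type_id, stream_id
```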

And secondly: RTMP is highly session-oriented, which made it virtually impossible to multicast an incoming stream. Ideally, if multiple users wanted to watch the same live stream, we could just hand them back pointers to a single session in which that stream was being aired (this would be multicasting). But with RTMP, we had to create an entirely new instance of the stream for every user who wanted access. This was a complete waste.


My solution

With that in mind, I decided to re-package/parse the typical response stream into FLV ‘tags’ (where a ‘tag’ is just some video, audio, or metadata). These FLV tags could travel within the RTMP stream with little issue.
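
For illustration, here is roughly what packing one such tag looks like. This is a sketch based on the public FLV format, with a helper name of my own, not the production code:

```python
import struct

AUDIO, VIDEO, SCRIPT = 8, 9, 18   # the three FLV tag types

def make_flv_tag(tag_type, timestamp_ms, payload):
    """Package one FLV tag: type (1 byte), payload size (3 bytes),
    timestamp (3 low bytes plus 1 extended high byte), stream ID
    (always 0), then the payload and a 4-byte PreviousTagSize trailer."""
    header = (bytes([tag_type])
              + len(payload).to_bytes(3, "big")
              + (timestamp_ms & 0xFFFFFF).to_bytes(3, "big")
              + bytes([(timestamp_ms >> 24) & 0xFF])
              + b"\x00\x00\x00")              # stream ID, always zero
    tag = header + payload
    return tag + struct.pack(">I", len(tag))  # PreviousTagSize
```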

The benefits of such an approach:

  1. We only needed to repackage a stream once (repackaging was a nightmare due to the lack of specifications and protocol quirks outlined above).
  2. We could re-use any stream between clients with very few problems by simply handing each of them an FLV header, while an internal pointer to the FLV tags (along with some sort of offset to indicate where each client is in the stream) allowed access to the content (see the sketch below).
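
A minimal sketch of that sharing model, with hypothetical names; the real buffering layer added frame-dropping heuristics on top, described further down:

```python
FLV_HEADER = (b"FLV\x01\x05\x00\x00\x00\x09"  # 'FLV', v1, audio+video
              + b"\x00\x00\x00\x00")          # PreviousTagSize0

class Broadcast:
    """One live stream, repackaged into FLV tags exactly once."""
    def __init__(self):
        self.tags = []        # rolling window of packed FLV tags
        self.base = 0         # stream index of tags[0]

class Viewer:
    """A client is just an FLV header plus a cursor into shared tags."""
    def __init__(self, broadcast):
        self.bc = broadcast
        self.sent_header = False
        self.cursor = broadcast.base + len(broadcast.tags)  # join live

    def next_chunk(self):
        """Return the bytes this client still needs; advance the cursor."""
        if not self.sent_header:
            self.sent_header = True
            return FLV_HEADER
        start = max(self.cursor, self.bc.base) - self.bc.base
        chunk = b"".join(self.bc.tags[start:])
        self.cursor = self.bc.base + len(self.bc.tags)
        return chunk
```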

I began development in the language I knew best at the time: C. Over time, this choice became cumbersome, so I started learning Python while porting over my C code. The development process sped up, but after a few demos I quickly ran into the problem of exhausting resources. Python's socket handling was not meant for these types of situations: specifically, in Python we found ourselves making multiple system calls and context switches per action, adding a huge amount of overhead.

Improving performance: mixing Python and C

After profiling the code, I chose to move the performance-critical functions into a Python module written entirely in C. This was fairly low-level stuff: specifically, it made use of the kernel’s epoll mechanism to provide a logarithmic order-of-growth.

In asynchronous socket programming, there are facilities that tell you whether a given socket is readable, writable, or in an error state. In the past, developers used the select() system call to get this information, but select() scales badly. poll() is a better version of select(), but it's still not great, since you have to pass in the whole set of socket descriptors on every call.

Epoll is amazing: all you have to do is register a socket, and the system will remember it, handling all the gritty details internally. So there's no argument-passing overhead with each call. It also scales far better and returns only the sockets you care about, rather than making you run through a list of 100k socket descriptors checking event bitmasks, as you must with the other mechanisms.
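
The production module was written in C, but Python's standard library exposes the same kernel mechanism as select.epoll (Linux only). A minimal sketch of the register-once, poll-forever pattern:

```python
import select
import socket

server = socket.socket()
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("0.0.0.0", 1935))    # RTMP's default port
server.listen(128)
server.setblocking(False)

ep = select.epoll()
ep.register(server.fileno(), select.EPOLLIN)
conns = {}

while True:
    for fd, events in ep.poll():       # returns only the ready sockets
        if fd == server.fileno():
            conn, _ = server.accept()
            conn.setblocking(False)
            conns[conn.fileno()] = conn
            ep.register(conn.fileno(), select.EPOLLIN)
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if not data:               # peer closed the connection
                ep.unregister(fd)
                conns.pop(fd).close()
            # else: hand `data` up to the RTMP layer
```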

But for the increase in performance, we paid a price: this approach followed a completely different design pattern than before. The site’s previous approach was (if I recall correctly) one monolithic process which blocked on receiving and sending; I was developing an event-driven solution, so I had to refactor the rest of the code as well to fit this new model.

Specifically, in our new approach, we had a main loop, which handled receiving and sending as follows:


  1. The received data was passed (as messages) up to the RTMP layer.
  2. The RTMP messages were dissected and the FLV tags extracted.
  3. The FLV data was sent to the buffering and multicasting layer, which organized the streams and filled the low-level buffers of the sender.
  4. The sender kept a struct for every client, with a last-sent index, and tried to send as much data as possible to the client.

This was a rolling window of data, and included some heuristics to drop frames when the client was too slow to receive. Things worked pretty well.
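
A sketch of that per-client sender state; the names and the backlog threshold are hypothetical, and the real heuristics were smarter about what to drop:

```python
class ClientSender:
    """Per-client send state: pending bytes plus a high-water mark.
    pump() runs when epoll reports the socket writable and pushes as
    much as the kernel will accept without blocking."""
    def __init__(self, sock, max_backlog=2 * 1024 * 1024):
        self.sock = sock                 # a non-blocking socket
        self.buf = bytearray()
        self.max_backlog = max_backlog   # hypothetical threshold

    def queue(self, data):
        if len(self.buf) > self.max_backlog:
            self.buf.clear()             # client too slow: drop frames
        self.buf += data

    def pump(self):
        while self.buf:
            try:
                n = self.sock.send(self.buf)
            except BlockingIOError:      # kernel buffer full; try later
                break
            del self.buf[:n]
```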

Systems-level, architectural, and hardware issues

But we ran into another problem: the kernel's context switches were becoming a burden. As a result, we chose to write only every 100 milliseconds, rather than instantaneously. This aggregated the smaller packets and prevented a burst of context switches.
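
A sketch of that batching, assuming a timer driven from the main loop (the names are mine):

```python
import time

FLUSH_INTERVAL = 0.1    # 100 milliseconds

class CoalescingWriter:
    """Batch everything queued during the interval into a single write,
    so many small packets cost one syscall instead of one each."""
    def __init__(self, sender):
        self.sender = sender             # e.g., a per-client sender
        self.pending = bytearray()
        self.last_flush = time.monotonic()

    def write(self, data):
        self.pending += data

    def maybe_flush(self):
        """Called on every main-loop tick."""
        now = time.monotonic()
        if self.pending and now - self.last_flush >= FLUSH_INTERVAL:
            self.sender.queue(bytes(self.pending))
            self.pending.clear()
            self.last_flush = now
```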

Perhaps a larger problem lay in the realm of server architecture: we needed a load-balancing and failover-capable cluster, because losing users to server malfunctions is not fun. At first, we went with a separate-director approach, in which a designated ‘director’ would try to create and destroy broadcaster feeds by predicting demand. This failed spectacularly. In fact, everything we tried failed pretty substantially. In the end, we opted for a relatively brute-force approach of sharing broadcasters among the cluster’s nodes randomly, evening out the traffic.
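
The winning approach fits in a few lines. This sketch (hypothetical names) just parks each new broadcaster on a random node and remembers the choice so viewers can find the stream:

```python
import random

class Cluster:
    """Brute-force placement: each new broadcaster lands on a random
    node; with enough broadcasters, traffic evens out on its own."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}              # broadcaster id -> node

    def node_for(self, broadcaster_id):
        if broadcaster_id not in self.placement:
            self.placement[broadcaster_id] = random.choice(self.nodes)
        return self.placement[broadcaster_id]
```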

This worked, but with one drawback: although the general case was handled pretty well, we saw terrible performance when everyone on the site (or a disproportionate number of users) watched a single broadcaster. The good news: this never happens outside a marketing campaign. We implemented a separate cluster to handle this scenario, but in truth we reasoned that jeopardizing the paying user's experience for a marketing effort was senseless; in practice, it wasn't really a genuine scenario (although it would have been nice to handle every imaginable case).

Conclusion

Some statistics from the end result: daily traffic on the cluster was about 100k users at peak (60% load), ~50k on average. I managed two clusters (HUN and US), each of about 40 machines sharing the load. The aggregated bandwidth of the clusters was around 50 Gbps, of which they used around 10 Gbps at peak load. In the end, I managed to push out 10 Gbps/machine easily; theoretically,[1] this number could've gone as high as 30 Gbps/machine, which translates to about 300k users watching streams concurrently from one server (that works out to roughly 100 kbps per stream).

The existing FMS cluster contained more than 200 machines, which could've been replaced by 15 of mine, only 10 of which would do any real work. This gave us roughly a 200/10 = 20x improvement.

Probably my greatest takeaway from the project was that I shouldn't let myself be stopped by the prospect of having to learn a new skill set. In particular, Python, transcoding, and object-oriented programming were all concepts with which I had very sub-professional experience before taking on this project.

That, and that rolling your own solution can pay big.

[1] Later, when we put the code into production, we ran into hardware issues: the older Intel sr2500 servers we used could not handle 10 Gbit Ethernet cards because of their low PCI bandwidth. Instead, we used them in 1-4x 1 Gbit Ethernet bonds (aggregating the performance of several network interface cards into a virtual card). Eventually, we got some of the newer Intel sr2600 i7 machines, which served 10 Gbps over optics without any performance kinks. All the projected calculations refer to this hardware.
