Threading and latency in x264

scjohnson

12th September 2008, 21:20

I am interested in more fully understanding the current x264 threading mechanism and the expected roadmap. I was using an old (2 1/2 year) version of x264 in a multi-CPU environment (>>1 CPU). In those days, threading was handled by slicing the frame. My understanding is that the current algorithm threads by frames instead. Unfortunately, if your goal is streaming compression, you worry about the latencies this approach introduces.

Naively, if I ran a streaming compression on 15 processors in real time, that would introduce a half-second latency right off the top in a 30fps environment (for example). Is my naive view of the x264 approach correct? Any thoughts on the best way to tackle this (other than waiting for faster processors)?
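To spell out the naive arithmetic I have in mind (assuming roughly one frame in flight per frame-thread, which may not be exactly how the current code behaves):

```latex
\[
\text{latency} \;\approx\; \frac{N_{\text{threads}}}{\text{fps}} \;=\; \frac{15}{30\ \text{fps}} \;=\; 0.5\ \text{s}
\]
```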

I did some searching on this and other forums, but only found this (http://mailman.videolan.org/pipermail/x264-devel/2006-December/002484.html) as the most recent comment. My apologies if I've missed this being discussed at length here or elsewhere.

Thanks.


fields_g

12th September 2008, 23:45

There have been vast changes in speed and quality over the last 2.5 years. It is true that x264 is frame-based now. OK, let me get some clarification from you.
1) Is 30fps what you are able to achieve right now with the old version?
2) 15 (processors) is not a very "computer-like" number. What is your setup?
3) What command line parameters are you using?

I might not be able to give you a good answer myself, but with the above answers the community will have a little more info to help you out.


Dark Shikari

13th September 2008, 01:02

Yes, the current threading method has that latency, which could be problematic for applications that need extremely low latency, such as videoconferencing. However, x264 doesn't need more than one thread to do realtime SD encoding, so you can still get realtime with near-zero latency. HD encoding can be done with just two or three threads on a top-end CPU, for just a couple of frames of latency.
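For reference, a minimal sketch of that kind of low-thread setup through the libx264 API; the specific values (720p30, CQP 24, subme 2, two threads) are illustrative, not tuned recommendations:

```c
#include <stdio.h>
#include <x264.h>

int main(void)
{
    x264_param_t param;
    x264_param_default(&param);

    /* Illustrative low-latency settings: few frame-threads, no B-frames. */
    param.i_width   = 1280;
    param.i_height  = 720;
    param.i_fps_num = 30;
    param.i_fps_den = 1;
    param.i_threads = 2;                 /* each extra frame-thread adds ~1 frame of delay */
    param.i_bframe  = 0;                 /* B-frames add reordering delay on top of that */
    param.rc.i_rc_method   = X264_RC_CQP;
    param.rc.i_qp_constant = 24;
    param.analyse.i_subpel_refine = 2;   /* low subme to stay realtime */

    x264_t *h = x264_encoder_open(&param);
    if (!h) {
        fprintf(stderr, "x264_encoder_open failed\n");
        return 1;
    }
    /* ... feed frames with x264_encoder_encode() here ... */
    x264_encoder_close(h);
    return 0;
}
```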


scjohnson

15th September 2008, 16:55

Thanks for the prompt responses.

What I'm interested in is real-time 1080p/30 encoding. For HD applications I've primarily seen parallel encoding, but doing this in real time requires a number of processors. I'm in a low-power environment where you wouldn't choose a top-end CPU, but would opt instead for a lower clock rate and more processors.

I haven't ported the most recent x264 code base to this kind of platform, but my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast but uses frame-based threading, this would introduce a half-second delay (unacceptable in some environments).

That's running 720p/30 with qp=24, resulting in a 5 Mb/s output for my data -- about as low a resolution as I can stand. For 1080p it's worse, of course. As you may have guessed, I'm not using any standard off-the-shelf platform, and porting the most recent x264 to it will take some effort. I'm trying to evaluate whether it's worth it.

If the current code is significantly faster, perhaps a reduction from 1/2 second to something below 1/4 second would work.

I was also curious to understand why the choice was made to be frame-based rather than slice-based threading. (Ease of programming? Something more subtle?)

I'm not sure I've clarified my questions at all, but do appreciate the thoughts.

Thanks again.


Dark Shikari

15th September 2008, 17:58

What I'm interested in is real-time 1080p/30 encoding. [...] my results using someone else's port of the x264 code from 2006 required about 16 processors for real-time on an HD stream. If the current code is approximately as fast but uses frame-based threading, this would introduce a half-second delay (unacceptable in some environments).

You shouldn't need nearly such a system to do that. I've clocked x264 running 1080p24 in realtime on a single-core CPU (well, one core of a top-end Penryn) with absolute max speed settings. Avail Media does realtime 1080i30 with multiref and RDO on 8 cores; only about 4-6 of them are used by x264.

I was also curious to understand why the choice was made to be frame-based rather than slice-based threading. (Ease of programming? Something more subtle?)

Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.


Inventive Software

15th September 2008, 17:58

The "thread-pool" patch (search for it) might be of benefit in that case, but you'd need fast, minimal search settings, and I'm assuming a CRF mode, to do 1080p30 real-time.


Manao

15th September 2008, 18:24

Slice-based threading is intolerably inefficient; it caps out very quickly in terms of overall performance increase and does not effectively utilize large numbers of cores.

No, not in a realtime & low latency environment. Slice-based threading is inefficient for offline encoding, but it has the same worst-case encoding time as frame-based threading, which is what matters here.

Put differently, if you adapt the subme setting according to CPU load, frame-based threading will give a better average subme quality than slice-based, but the worst case will be the same. Since low latency prevents you from buffering enough frames to adapt subme, you are forced to use a constant subme setting. Which means, in realtime, that you gain nothing by going from slice-based to frame-based (except slightly better coding efficiency for frame-based).
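For concreteness, this is the kind of adaptation I mean by "adapt subme according to CPU load" -- a purely hypothetical sketch, not the actual speedcontrol patch. The point is that without a frame buffer to smooth these decisions, you are stuck at the worst-case constant setting:

```c
#include <stdio.h>

/* Hypothetical per-frame subme controller: raise quality while under
 * budget, drop it when a frame blows the realtime deadline.
 * Not the real speedcontrol patch, just an illustration of the idea. */
static int adapt_subme(int subme, double encode_ms, double budget_ms)
{
    if (encode_ms > budget_ms && subme > 1)
        subme--;                          /* over budget: trade quality for speed */
    else if (encode_ms < 0.8 * budget_ms && subme < 7)
        subme++;                          /* headroom: spend it on quality */
    return subme;
}

int main(void)
{
    double budget = 1000.0 / 30.0;        /* 33.3 ms per frame at 30 fps */
    double sample_ms[] = { 30.0, 36.0, 38.0, 25.0, 24.0 };
    int subme = 5;

    for (int i = 0; i < 5; i++) {
        subme = adapt_subme(subme, sample_ms[i], budget);
        printf("frame %d: %.0f ms -> subme %d\n", i, sample_ms[i], subme);
    }
    return 0;
}
```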


Dark Shikari

15th September 2008, 18:27

Slice-based threading is inefficient for offline encoding, but it has the same worst-case encoding time as frame-based threading, which is what matters here.

Of course not. With slice-based threading, if a frame has dramatically differing encoding times per slice, you will spend the vast majority of your time waiting for the last slice to finish. With frame-based threading, no such problem exists; there's only a problem if an entire frame takes a huge amount longer than other frames.

From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.
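To put a number on why it caps out, a toy example (the slice times are invented purely for illustration): if one slice of a frame is much harder than the rest, the whole frame waits on it.

```latex
\[
t_{\text{slices}} = \{10, 10, 10, 40\}\ \text{ms}
\;\Rightarrow\;
t_{\text{frame}} = \max_i t_i = 40\ \text{ms},
\qquad
\text{speedup} = \frac{\sum_i t_i}{\max_i t_i} = \frac{70}{40} = 1.75\times\ \text{on 4 cores}
\]
```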

Also, realtime does not inherently imply "low latency" at all. And if you need speedcontrol, the current speedcontrol patch probably works just fine with 30 or even 15 frames of buffer. Furthermore, since it acts at the thread level rather than the frame level, it can reuse the buffer that already exists for threading; i.e. it shouldn't add any new delay.


Manao

15th September 2008, 18:42

I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.

And the worst-case scenario, in both cases, is a whole frame taking a huge amount of time. Since subme is constant due to low latency, and since the worst-case scenario has the same speed for both slice-based and frame-based threading, you end up with the same subme for both. And that is achieved at the same CPU usage (since both do the same amount of work).


Dark Shikari

15th September 2008, 18:46

I didn't say realtime == low latency. I said in a realtime & low latency environment, which is his case.

You propose a 15 frame buffer for speed control, which isn't low latency, so you admit low latency forces a constant subme.

What do you define as low latency -- any number less than the number I used in my post? I never said speedcontrol wouldn't work with a smaller buffer; that was just a suggestion. Also, I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing.

And the worst-case scenario, in both cases, is a whole frame taking a huge amount of time.

No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits. Therefore, in any case where you consistently get many frames that take way too long, you will already have violated VBV anyway. That is the only possible case in which frame-based threading could reach your rather hypothetical "worst-case scenario." Slice-based threading, on the other hand, can consistently fail even when VBV never has any problems.


Manao

15th September 2008, 19:05

No, it isn't, because the time that the frame takes is going to be roughly proportional to its size in bits.

No, the time is something like A x (macroblock count) x (complexity) + B x (bitrate), with a clearly non-negligible first part. Example: foreman with -8 -q 20 -m 6 -b 3 gives 22 fps at 1 Mbps, while -8 -q 40 -m 6 -b 3 gives 44 fps at 60 kbps, so the bitrate term at q20 accounts for about half the encoding time on foreman.
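Spelling out that arithmetic (assuming the per-macroblock term is about the same at q20 and q40, and that the 60 kbps bitrate term is negligible):

```latex
% t \approx A \cdot (\text{macroblock count}) \cdot (\text{complexity}) + B \cdot \text{bitrate}
\[
t_{q20} = \tfrac{1}{22}\ \text{s} \approx 45\ \text{ms},
\qquad
t_{q40} = \tfrac{1}{44}\ \text{s} \approx 23\ \text{ms}
\;\Rightarrow\;
B \cdot \text{bitrate}_{q20} \approx t_{q20} - t_{q40} \approx 23\ \text{ms} \approx \tfrac{1}{2}\, t_{q20}
\]
```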


scjohnson

15th September 2008, 19:43

Thanks again for all the great insight.

[...] I do consider 15 frames to be low latency. 300 frames is high latency, which is what Avail Media uses for television broadcast. I cannot imagine a case in which lower than 15 frames is absolutely necessary except perhaps for videoconferencing. [...]

That's a big one ... and, one might argue, one of the largest poorly tapped markets in the real-time encoding field.

For telepresence applications, 15 frames is not low latency. You'd never stand for a half-second delay on your cell phone, and that's why many telepresence applications are very awkward for the general user.

Low latency is 100 ms, which means you need to hold <3 frames in a buffer @30fps.
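Spelling out that frame budget at 30 fps:

```latex
\[
\frac{100\ \text{ms}}{1000/30\ \text{ms per frame}} = \frac{100\ \text{ms}}{33.3\ \text{ms}} \approx 3\ \text{frames}
\]
```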

For a television broadcast, a 10 second delay is no big deal, and that's where the threading choice of x264 makes perfect sense. However, if we want x264 to be applicable to other problems, it needs to approach single-frame latency, which naively lends itself to slice-based threading.


fields_g

15th September 2008, 20:04

I like this thread even more! I'm involved in videoconferencing.

In the entire system, latency comes from many places: drivers, buffering, encoding, transmission, distance, routing/queuing, decoding, etc. Latency is inescapable! Theoretically, even talking to someone face-to-face has latency (distance apart / speed of sound). You need to determine the overall acceptable latency of the system, then budget for each part of the process.

My definition of "realtime" encoding is being able to indefinitely sustain an encoding rate equal to or surpassing the input frame rate. Not including buffering, realtime encoding at 30 fps can still add up to 33 ms to the conversation.

I believe that 150 ms is generally considered the maximum one-way latency for a 2-way voice conversation. This can vary depending on the format of the conversation and the personalities of the people on each end. It should be approximately the same for videoconferencing.

I guess an important question is: is the application 2-way videoconferencing? If not, the budget can be expanded greatly. Either way, we need to know how much latency we can afford for encoding/buffering. This will dictate the per-processor performance, the number of processors, and the encoding options that are needed.
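As an illustration of that budgeting (every number below is a made-up placeholder, not a measurement of any real system):

```c
#include <stdio.h>

/* Hypothetical one-way latency budget for 2-way HD videoconferencing.
 * All figures are illustrative placeholders, not measurements. */
int main(void)
{
    const char *stage[] = { "capture/driver", "encode (2 frame-threads)",
                            "network + jitter buffer", "decode + display" };
    int ms[] = { 15, 67, 40, 25 };       /* 67 ms is roughly 2 frames at 30 fps */
    int n = 4, total = 0;

    for (int i = 0; i < n; i++) {
        printf("%-26s %3d ms\n", stage[i], ms[i]);
        total += ms[i];
    }
    printf("%-26s %3d ms (target: <= 150 ms one-way)\n", "total", total);
    return 0;
}
```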


Shinigami-Sama

15th September 2008, 21:54

If you're doing videoconferencing you should be able to do SD in a single thread, and if you're doing it in HD you should give your head a shake...


BlackSharkfr

16th September 2008, 07:52

If you're doing videoconferencing you should be able to do SD in a single thread, and if you're doing it in HD you should give your head a shake...

I think scjohnson's product makes sense.
HD videoconferencing should be available soon.

People already have FullHD camcorders and FullHD TVs. Although individuals don't yet have the bandwidth required to transfer realtime FullHD streams, many businesses can afford it.
I can't wait to have a fiber connection at home...


foxyshadis

16th September 2008, 09:44

If you need HD videoconferencing, you pay for the CPU needed to minimize latency. Do note that there's a slice-based patch floating around, which combines both threading methods - I have no way of even testing the latency benefit of combining both, but who knows, it might work.


Dark Shikari

16th September 2008, 09:46

Do note that there's a slice-based patch floating around, which combines both threading methods [...]

The slices patch doesn't use slice-based threading; it uses slices within frame-based threading. It only affects syntax elements, not the threading model.


fields_g

16th September 2008, 11:58

I think scjohnson's product makes sense. HD videoconferencing should be available soon. [...]

HD videoconferencing has been on the market for quite a while now. It relies on H.264 to get quality at 720p resolutions. Lifesize was a startup and the first company to debut HD "telepresence", followed by the two already-established players, Tandberg and Polycom. Cisco came later, but pushed the envelope to include 1080p.

All these companies rely on hardware to encode their streams. I no longer have access to these machines, and when I did, I didn't try too hard to get an H.264 stream to analyze. I'm sure it would be interesting though.

Many of these companies lock much of their hardware so that it can't set up calls faster than 1.5-2 Mbit/s unless you purchase key codes for higher bitrates. This is unfortunate, since we can all imagine what the quality of an image encoded at that rate by a hasty low-latency encoder would look like.


scjohnson

16th September 2008, 14:45

From the benchmarks I have of realtime encoding with slice-based threading, the maximum performance increase of threads capped out at about 200%, a pathetic value. Frame-based can achieve more than double or triple that. And what really matters is what happens in practice, not some theoretical situation that doesn't actually exist.


My benchmarks of an x264 port from almost three years ago, with slice-based threading on multiple cores, scale quite well up to the first 9 threads (88% efficiency) and cap out around 25 threads on HD video, where I see a 16x improvement over 1 thread.

I don't consider that pathetic.


Sagekilla

16th September 2008, 15:38

Would it theoretically be possible to have a hybrid frame-slice based threading? Something like: Frame 0 goes to thread group (0,1,2,3) and Frame 1 goes to thread group (4,5,6,7).

I'm guessing it's very difficult because of temporal dependencies but I was curious.


akupenguin

16th September 2008, 16:01

It's possible to have hybrid xvid-threading and frame-threading, with the caveat that both kinds of threads use up the same spatial buffer, so you're only trading off scaling efficiency vs latency, not increasing the max number of threads. And the caveat that xvid-threading is incompatible with accurate RDO (actually I don't know how inaccurate it would be, only that it can't keep track of the bitstream state).
Slice-threading can't mix with frame-threading.
GOP-threading can mix with everything, but that's not what you want for latency ;)
Pipeline-threading can also mix with everything, but it reduces compression efficiency and scaling efficiency.


Manao

16th September 2008, 18:19

I wonder what pipeline threading is.


akupenguin

17th September 2008, 01:59

I wonder what pipeline threading is.
doing different things in different threads; me, mode decision, entropy coding, hpel/deblock, ...


foxyshadis

17th September 2008, 05:40

Slice-threading can't mix with frame-threading.

In that case, what about splitting slices into separate instances of the encoder, then combining and packaging the output as each NAL is output in the control program? Not necessarily x264, but it's an avenue to explore if anyone wants unique ways to lower latency.

Videoconferencing won't really live or die from the efficiency loss of using slices, anyway, it's largely talking heads on static backgrounds.


Shinigami-Sama

17th September 2008, 05:45

doing different things in different threads; me, mode decision, entropy coding, hpel/deblock, ...

I thought that's what was already done?

Seems like the best way to do it, to me, but I'm not a programmer...


Manao

17th September 2008, 05:57

Currently, each thread encodes a whole frame (i.e. does ME, mode decision, entropy coding, hpel, deblock), so each thread has roughly the same amount of work. If you were to give each thread a specific task instead, the thread doing hpel (for example) would finish its task faster than the thread doing ME, so it would unbalance the CPU load and thus generate threading inefficiency.
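A toy calculation of that imbalance (the per-stage times are invented for illustration, not measurements of x264):

```c
#include <stdio.h>

/* Toy pipeline-threading model: one thread per stage
 * (ME, mode decision, entropy coding, hpel/deblock), so a frame leaves
 * the pipeline once per "slowest stage". Stage times are made up. */
int main(void)
{
    double ms[] = { 20.0, 10.0, 5.0, 5.0 };   /* hypothetical per-frame cost of each stage */
    int n = 4;
    double serial = 0.0, slowest = 0.0;

    for (int i = 0; i < n; i++) {
        serial += ms[i];
        if (ms[i] > slowest)
            slowest = ms[i];
    }
    /* Throughput is set by the slowest thread; the others sit idle part of the time. */
    printf("serial: %.0f ms/frame, pipelined: one frame every %.0f ms\n", serial, slowest);
    printf("speedup: %.1fx on %d threads, average utilization %.0f%%\n",
           serial / slowest, n, 100.0 * serial / (n * slowest));
    return 0;
}
```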


akupenguin

17th September 2008, 08:28

what about splitting slices into separate instances of the encoder
Then you can't have mvs span the middle of the frame (nor even near the boundary, i.e. edge mbs can't have small subpel mvs). This is exactly why slice-threading is incompatible with frame-threading.


Quark.Fusion

17th September 2008, 12:28

Pipeline-threading can also mix with everything, but it reduces compression efficiency and scaling efficiency.
Why does it reduce compression efficiency?


Manao

17th September 2008, 12:46

If you thread decision and entropy, RD costs during the decision will be slightly off, because you won't have the exact cabac context.


Quark.Fusion

17th September 2008, 16:30

Maybe then thread everything that doesn't hurt? Thread balancing is already broken by source decoding, b-adapt, 1.5 threads per core, etc. (Don't forget the other processes on the system that also disturb x264's thread balancing.)
If you want perfect threading, you must make multi-task threads (or many single-task threads, one per core) and a thread server (with higher priority) that splits the source task into smaller parts and distributes them in a smart way to keep all cores busy.


Sagekilla

17th September 2008, 16:47

So basically you're proposing a client/server model where the server assigns the tasks and the clients do the grunt work. If x264 switched to that threading model, you could probably even span it across multiple systems more easily.

Edit: If you could get it to work, you could get close to 35 fps (1080p frames) running at the max of a gigabit network, or 85 fps if you're doing 720p frames.
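For reference, the arithmetic behind those figures, assuming uncompressed 8-bit 4:2:0 frames have to cross the wire (protocol overhead pulls the theoretical numbers down toward the ones above):

```latex
\[
1080\text{p}:\ 1920 \times 1080 \times 1.5\ \text{B} \approx 24.9\ \text{Mbit}
\;\Rightarrow\; \tfrac{1000}{24.9} \approx 40\ \text{fps raw};
\qquad
720\text{p}:\ 1280 \times 720 \times 1.5\ \text{B} \approx 11.1\ \text{Mbit}
\;\Rightarrow\; \tfrac{1000}{11.1} \approx 90\ \text{fps raw}
\]
```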


Manao

17th September 2008, 17:52

Source decoding isn't part of x264 and will limit any threading scheme, so it can be ignored. b-adapt can be threaded without losing anything. 1.5 threads per core are needed because frames don't systematically take the same time to encode (and because motion-vector restrictions sometimes make one thread wait for another, in which case it's nice to have an extra thread that can run during that time); it's not quite the same thing as load balancing.

Also, don't forget that the more threads you have, the more overhead you get from context switching. So a thread server, though it looks nice on paper, can cost a lot.

In any case, without consideration for latency, slice-type decision is currently the only limiting factor, and it is fairly easy to thread it.


Quark.Fusion

17th September 2008, 19:57

Things that aren't part of x264 can't be ignored, since x264 isn't the only thing running on a computer. It's pointless to split the task into equal parts per thread, as other processes will break the balance anyway. Threads must be dynamically fed small tasks from a task buffer.
A thread server doesn't need many context switches. But more than 1 thread per core produces contention. There was a thread-pool patch somewhere -- did it provide better efficiency?


Ranguvar

17th September 2008, 20:25

It did for large numbers of threads (> 4) and worsened efficiency for fewer threads, IIRC.

http://komisar.gin.by/x.patch/BugMaster/20080908/03_bm_x264_thread_pool.r965.diff