Optimizing Shader Info Loading, or Look at Yer Data!

A story about a million shader variants, optimizing using Instruments and looking at the data to optimize some more.

The Bug Report

The bug report I was looking into was along the lines of “when we put these shaders into our project, then building a game becomes much slower – even if shaders aren’t being used”.

Indeed it was. A quick look revealed that for ComplicatedReasons(tm) we load information about all shaders during the game build – that explains why the slowdown was happening even if the shaders weren't actually used.

This issue must be fixed! There's probably no really good reason we must know about all the shaders for a game build. But to fix it, I'll need to pair up with someone who knows something about the game data build pipeline, our data serialization and so on. So that will be someday in the future.

Meanwhile… another problem was that loading the “information for a shader” was slow in this project. Did I say slow? It was very slow.

That’s a good thing to look at. Shader data is not only loaded while building the game; it’s also loaded when the shader is needed for the first time (e.g. clicking on it in Unity’s project view); or when we actually have a material that uses it etc. All these operations were quite slow in this project.

Turns out this particular shader had a massive internal variant count. In Unity, what looks like "a single shader" to the user often has many variants inside (to handle different lights, lightmaps, shadows, HDR and whatnot – a typical ubershader setup). Usually shaders have from a few dozen to a few thousand variants. This shader had 1.9 million. And there were about ten shaders like that in the project.

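To get a feel for where numbers like that come from: the variant count is the product of the option counts of every keyword set in the shader, so it explodes combinatorially. A quick back-of-the-envelope illustration (the axis sizes below are made up; they just happen to multiply out to roughly this shader's 1.9 million):

```cpp
// Illustrative arithmetic only: the variant count is the product of
// the option counts on each keyword axis (each multi_compile line).
#include <cstdio>

int main()
{
    const int axes[] = { 2, 2, 2, 2, 3, 3, 4, 5, 5, 11, 12 }; // invented sizes
    long long count = 1;
    for (int options : axes)
        count *= options;
    std::printf("%lld variants\n", count); // prints "1900800 variants"
    return 0;
}
```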

The Setup

Let's create several shaders with different variant counts for testing: 27 thousand, 111 thousand, 333 thousand and 1 million variants. I'll call them 27k, 111k, 333k and 1M respectively. For reference, the new "Standard" shader in Unity 5.0 has about 33 thousand internal variants. I'll do the tests on a MacBook Pro (2.3 GHz Core i7) using a 64 bit Release build.

Things I’ll be measuring:

  • Import time. How much time it takes to reimport the shader in the Unity editor. Since Unity 4.5 this doesn't do much actual shader compilation; it just extracts information about the shader snippets that need compiling, the variants that are there, and so on.

  • Imported data size. How large the imported shader data is (the serialized representation of the actual shader asset, i.e. the files that live in the Library/metadata folder of a Unity project).

  • Load time. How much time it takes to load that imported shader data in the editor (e.g. the first time the shader is needed, or when a material uses it).

So the data is:

```
Shader   Import    Load    Size
   27k    420ms   120ms    6.4MB
  111k   2013ms   492ms   27.9MB
  333k   7779ms  1719ms   89.2MB
    1M  16192ms  4231ms  272.4MB
```

Enter Instruments

Last time we used xperf to do some profiling. We're on a Mac this time, so let's use Apple Instruments. Just like xperf, Instruments can show a lot of interesting data. We're looking at the simplest one, "Time Profiler" (though profiling Zombies is very tempting!). You pick that instrument, attach to the executable, start recording, and get some results out.

[Image: shaderopt01-instruments]
[Image: shaderopt02-attach]
[Image: shaderopt03-timeprofile]

You then select the time range you're interested in, and expand the stack trace. Protip: Alt-Click (ok ok, Option-Click you Mac peoples) expands the full tree.

[Image: shaderopt04-expand]

So far the whole stack is just going deep into Cocoa stuff. “Hide System Libraries” is very helpful with that:

[Image: shaderopt05-hidesystem]

Another very useful feature is inverting the call tree, where the results are presented from the heaviest “self time” functions (we won’t be using that here though).

[Image: shaderopt06-inverttree]

When hovering over an item, an arrow is shown on the right (see image above). Clicking on it does "focus on subtree", i.e. ignores everything outside of that item, and time percentages are shown relative to the item. Here we've focused on ShaderCompilerPreprocess (which does the majority of the shader "importing" work).

[Image: shaderopt08-focused]

Looks like we're spending a lot of time appending to strings. That usually means the strings did not have enough storage reserved and are causing a lot of memory allocations. Code change:

[Image: shaderopt09-reserve]
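
A minimal sketch of that kind of change (function and variable names here are illustrative, not Unity's actual code):

```cpp
// Sketch: compute the final size first and reserve it, so repeated
// += calls append into one buffer instead of reallocating over and over.
#include <string>
#include <vector>

std::string JoinMacros(const std::vector<std::string>& macros)
{
    size_t total = 0;
    for (const std::string& m : macros)
        total += m.size() + 1; // +1 for the separator

    std::string result;
    result.reserve(total); // one allocation up front
    for (const std::string& m : macros)
    {
        result += m;
        result += ' ';
    }
    return result;
}
```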

This small change has cut down shader importing time by 20-40%! Very nice!

I did a couple of other small tweaks from looking at this profiling data – none of them resulted in any significant benefit though.

Profiling shader load time also shows that most of the time ends up being spent loading editor-related data that is arrays of arrays of strings and so on:

[Image: shaderopt10-loadprofile]

I could have picked functions from the profiler results, gone through each of them and optimized, and perhaps achieved a solid 2-3x improvement over the initial results. Very often that's enough to be proud of!

However…

Taking a step back

Or as Mike Acton would say, "look at your data!" (check his CppCon2014 slides or video). Another saying is also applicable: "think!"

Why do we have this problem to begin with?

For example, in the 333k variant shader case, we end up sending 610560 lines of shader variant information between the shader compiler process and the editor, with macro strings in each of them. In total we're sending 91 megabytes of data over the RPC pipe during shader import.

One possible area for improvement: the data we send over and store in the imported shader data is a small set of macro strings repeated over and over and over again. Instead of sending or storing the strings, we could send the set of strings used by a shader just once, assign numbers to them, and then send & store the full set as lists of numbers (or fixed size bitmasks). This should cut down the number of string operations we do (massively reducing the number of small allocations), the size of the data we send, and the size of the data we store.

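A minimal sketch of that idea, with illustrative types and names of my own: intern each macro string once into a table, then refer to it by a small index – or pack a whole variant into a single bitmask.

```cpp
// Sketch: store each macro string once; a variant then becomes a bitmask
// of keyword indices instead of a list of repeated strings.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct KeywordTable
{
    std::vector<std::string> names;                // sent & stored once
    std::unordered_map<std::string, uint32_t> ids; // name -> bit index

    uint32_t Intern(const std::string& name)
    {
        auto it = ids.find(name);
        if (it != ids.end())
            return it->second;
        uint32_t id = static_cast<uint32_t>(names.size());
        names.push_back(name);
        ids.emplace(name, id);
        return id;
    }
};

using VariantKey = uint64_t; // assumes at most 64 distinct keywords

VariantKey MakeVariantKey(KeywordTable& table,
                          const std::vector<std::string>& macros)
{
    VariantKey key = 0;
    for (const std::string& m : macros)
        key |= VariantKey(1) << table.Intern(m);
    return key;
}
```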

Another possible approach: right now we have source data in the shader that indicates which variants to generate. This data is very small: just a list of on/off features, plus some built-in variant lists ("all variants to handle lighting in forward rendering"). We do the full combinatorial explosion of that in the shader compiler process, send the full set over to the editor, and the editor stores it in the imported shader data.

But the way we do the "explosion of source data into full set" is always the same. We could just send the source data from the shader compiler to the editor (a very small amount!), and furthermore, just store that in the imported shader data. We can rebuild the full set whenever it's needed.

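Here's a sketch of that second idea (again my own illustrative code, not the actual implementation): keep only the per-axis keyword lists, and regenerate the cross product whenever the full set is wanted.

```cpp
// Sketch: each axis is one set of mutually exclusive keywords; the full
// variant set is the cross product of all axes, rebuilt on demand.
#include <cstdint>
#include <vector>

using VariantKey = uint64_t; // bitmask of enabled keyword indices

std::vector<VariantKey> ExpandVariants(
    const std::vector<std::vector<uint32_t>>& axes)
{
    std::vector<VariantKey> result { 0 }; // start with the empty variant
    for (const std::vector<uint32_t>& options : axes)
    {
        std::vector<VariantKey> next;
        next.reserve(result.size() * options.size());
        for (VariantKey base : result)
            for (uint32_t bit : options)
                next.push_back(base | (VariantKey(1) << bit));
        result.swap(next);
    }
    return result; // size == product of all the axis sizes
}
```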

Changing the data

So let's try to do that. First let's deal with the RPC part only, without changing the serialized shader data. A few commits later…

[Image: shaderopt12-optimizerpc]

This made shader importing over twice as fast!

```
Shader   Import
   27k    419ms ->  200ms
  111k   1702ms ->  791ms
  333k   5362ms -> 2530ms
    1M  16784ms -> 8280ms
```

Let's do the other part too, where we change the serialized shader variant data representation. Instead of storing the full set of possible variants, we only store the data needed to generate that full set:

[Image: shaderopt14-optimizestorage]

```
Shader   Import              Load                 Size
   27k    200ms ->   285ms    103ms ->    396ms     6.4MB -> 55kB
  111k    791ms ->  1229ms    426ms ->   1832ms    27.9MB -> 55kB
  333k   2530ms ->  3893ms   1410ms ->   5892ms    89.2MB -> 56kB
    1M   8280ms -> 12416ms   4498ms ->  18949ms   272.4MB -> 57kB
```

Everything seems to work, and the serialized file size decreased massively. But both importing and loading got slower?! Clearly I did something stupid. Profile!

[Image: shaderopt15-rebuildprofile]

Right. So after importing or loading the shader (now a small file on disk), we generate the full set of shader variant data. Which right now results in a lot of string allocations, since it is generating arrays of arrays of strings or some such.

But we don't really need the strings at this point; for example, after loading the shader we only need the internal representation of the "shader variant key", which is a fairly small bitmask.

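To show what "only need the variant key" buys (illustrative names again): once variants are identified by bitmask keys, looking one up involves no string work at all.

```cpp
// Sketch: look up a variant directly by its bitmask key; the only string
// work happened earlier, when keywords were interned to bit indices.
#include <cstdint>
#include <unordered_map>

using VariantKey = uint64_t;

struct VariantData { /* compiled programs, reflection info, etc. */ };

const VariantData* FindVariant(
    const std::unordered_map<VariantKey, VariantData>& variants,
    VariantKey enabledKeywords)
{
    auto it = variants.find(enabledKeywords);
    return it == variants.end() ? nullptr : &it->second;
}
```

A couple of tweaks to fix that, and we're at: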

```
Shader  Import    Load
   27k    42ms     7ms
  111k    47ms    27ms
  333k    94ms    76ms
    1M   231ms   225ms
```

Look at that! Importing a 333k variant shader got 82 times faster; loading its metadata got 22 times faster, and the imported file size got over a thousand times smaller!

One final look at the profiler, just because:

[Image: shaderopt18-profileimport]

Weird – time is spent in memory allocation, but there shouldn't be any at this point in that function; we aren't creating any new strings there. Ahh, implicit std::string to UnityStr (our own string class with better memory reporting) conversion operators (long story…).

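For illustration, here's the shape of that problem (UnityStr's real definition is internal to Unity; this stand-in just shows why the allocation was invisible at the call site):

```cpp
// Sketch: an implicit converting constructor silently copies the whole
// buffer every time a std::string crosses the API boundary.
#include <string>

class UnityStr
{
public:
    UnityStr(const std::string& s) : m_Data(s) {} // implicit => hidden copy
    // Marking this 'explicit' makes each conversion (and its allocation)
    // visible and deliberate at the call site.
private:
    std::string m_Data;
};

void UseName(const UnityStr&) {}

void Caller(const std::string& name)
{
    UseName(name); // compiles silently, but allocates a copy on every call
}
```

Fix that, and we've got another 2x improvement: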

```
Shader  Import    Load
   27k    42ms     5ms
  111k    44ms    18ms
  333k    53ms    46ms
    1M   130ms   128ms
```

The code could still be optimized further, but there ain’t no easy fixes left I think. And at this point I’ll have more important tasks to do…

What we've got

So in total, here’s what we have so far:

```
Shader   Import                Load                 Size
   27k    420ms-> 42ms (10x)    120ms->  5ms (24x)    6.4MB->55kB (119x)
  111k   2013ms-> 44ms (46x)    492ms-> 18ms (27x)   27.9MB->55kB (519x)
  333k   7779ms-> 53ms (147x)  1719ms-> 46ms (37x)   89.2MB->56kB (this is getting)
    1M  16192ms->130ms (125x)  4231ms->128ms (33x)  272.4MB->57kB (ridiculous!)
```

And it took a fairly small pull request to achieve all this (~400 lines of code changed, ~400 new lines added – half of which were new unit tests I wrote to feel safer before I started changing things):

[Image: shaderopt22-pr]
[Image: shaderopt23-pr]

Overall I've probably spent something like 8 hours on this – hard to say exactly since I took some breaks and did other things. I was also writing down notes & making screenshots for the blog :) The fix/optimization is already in Unity 5.0 beta 20, by the way.

Conclusion

Apple’s Instruments is a nice profiling tool (and unlike xperf, the UI is not intimidating…).

However, Profiler Is Not A Replacement For Thinking! I could have just looked at the profiling results and tried to optimize "what's at the top of the profiler" one by one, and maybe achieved 2-3x better performance. But by thinking about the actual problem and why it happens, I got a way, way better result.

Happy thinking!

Translated from: https://blogs.unity3d.com/2015/01/18/optimizing-shader-info-loading-or-look-at-yer-data/
