Try-catch speeding up my code?

This article is translated from: Try-catch speeding up my code?

I wrote some code to test the impact of try-catch, but I'm seeing some surprising results.

static void Main(string[] args)
{
    Thread.CurrentThread.Priority = ThreadPriority.Highest;
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.RealTime;

    long start = 0, stop = 0, elapsed = 0;
    double avg = 0.0;

    long temp = Fibo(1);

    for (int i = 1; i < 100000000; i++)
    {
        start = Stopwatch.GetTimestamp();
        temp = Fibo(100);
        stop = Stopwatch.GetTimestamp();

        elapsed = stop - start;
        avg = avg + ((double)elapsed - avg) / i;
    }

    Console.WriteLine("Elapsed: " + avg);
    Console.ReadKey();
}

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}

On my computer, this consistently prints out a value around 0.96.

When I wrap the for loop inside Fibo() with a try-catch block like this:

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}

    return fibo;
}

Now it consistently prints out 0.69... -- it actually runs faster! But why?

Note: I compiled this using the Release configuration and directly ran the EXE file (outside Visual Studio).
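
For reference, the snippet above is not self-contained: it assumes it sits inside a class, with roughly these using directives at the top of the file.

using System;                 // Console
using System.Diagnostics;     // Stopwatch, Process, ProcessPriorityClass
using System.Threading;       // Thread, ThreadPriority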

EDIT: Jon Skeet's excellent analysis shows that try-catch is somehow causing the x86 CLR to use the CPU registers in a more favorable way in this specific case (and I think we have yet to understand why). I confirmed Jon's finding that the x64 CLR doesn't have this difference, and that it was faster than the x86 CLR. I also tested using int types inside the Fibo method instead of long types, and then the x86 CLR was equally as fast as the x64 CLR.
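
For reference, a minimal sketch of the int-based variant mentioned above (the name FiboInt is mine; the original edit did not include the code):

static int FiboInt(int n)
{
    // Same shape as Fibo, but with int locals instead of long.
    // Overflow simply wraps in the default unchecked context; the loop is only a timing workload.
    int n1 = 0, n2 = 1, fibo = 0;
    n++;

    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}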


UPDATE: It looks like this issue has been fixed by Roslyn. Same machine, same CLR version -- the issue remains as above when compiled with VS 2013, but the problem goes away when compiled with VS 2015.


#1

Reference: https://stackoom.com/question/bSgV/尝试抓住加速我的代码


#2

This looks like a case of inlining gone bad. On an x86 core, the jitter has the ebx, edx, esi and edi registers available for general-purpose storage of local variables. The ecx register becomes available in a static method, since it doesn't have to store this. The eax register is often needed for calculations. But these are 32-bit registers; for variables of type long it must use a pair of registers: edx:eax for calculations and edi:ebx for storage.

Which is what stands out in the disassembly for the slow version: neither edi nor ebx is used.

When the jitter can't find enough registers to store local variables, it must generate code to load and store them from the stack frame. That slows down code; it prevents a processor optimization named "register renaming", an internal processor core optimization trick that uses multiple copies of a register and allows super-scalar execution, which permits several instructions to run concurrently even when they use the same register. Not having enough registers is a common problem on x86 cores, addressed in x64, which has 8 extra registers (r8 through r15).

The jitter will do its best to apply another code generation optimization: it will try to inline your Fibo() method. In other words, it won't make a call to the method but will generate the code for the method inline in the Main() method. A pretty important optimization that, for one, makes properties of a C# class free, giving them the perf of a field. It avoids the overhead of making the method call and setting up its stack frame, saving a couple of nanoseconds.

There are several rules that determine exactly when a method can be inlined. They are not exactly documented but have been mentioned in blog posts. One rule is that it won't happen when the method body is too large; that defeats the gain from inlining, since it generates too much code that doesn't fit as well in the L1 instruction cache. Another hard rule that applies here is that a method won't be inlined when it contains a try/catch statement. The background behind that one is an implementation detail of exceptions: they piggy-back onto Windows' built-in support for SEH (Structured Exception Handling), which is stack-frame based.
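
As a rough illustration of that last rule (my sketch, not from the original answer): a tiny method like the first one below is a candidate for inlining, while the second one, identical except for the try/catch, is not.

static long AddPlain(long a, long b)
{
    return a + b;                  // small body: the jitter may inline this at the call site
}

static long AddGuarded(long a, long b)
{
    try { return a + b; }          // the try/catch alone disqualifies this method from inlining
    catch { return 0; }
}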

One behavior of the register allocation algorithm in the jitter can be inferred from playing with this code. It appears to be aware of when the jitter is trying to inline a method. One rule it appears to use is that only the edx:eax register pair can be used for inlined code that has local variables of type long, but not edi:ebx. No doubt because that would be too detrimental to the code generation for the calling method; both edi and ebx are important storage registers.

So you get the fast version because the jitter knows up front that the method body contains a try/catch statement. It knows the method can never be inlined, so it readily uses edi:ebx for storage of the long variable. You got the slow version because the jitter didn't know up front that inlining wouldn't work; it only found out after generating the code for the method body.

The flaw then is that it didn't go back and re-generate the code for the method, which is understandable given the time constraints it has to operate under.

This slow-down doesn't occur on x64 because, for one, it has 8 more registers; for another, it can store a long in just one register (like rax). And the slow-down doesn't occur when you use int instead of long because the jitter has a lot more flexibility in picking registers.


#3

Well, the way you're timing things looks pretty nasty to me. It would be much more sensible to just time the whole loop:

var stopwatch = Stopwatch.StartNew();
for (int i = 1; i < 100000000; i++)
{
    Fibo(100);
}
stopwatch.Stop();
Console.WriteLine("Elapsed time: {0}", stopwatch.Elapsed);

That way you're not at the mercy of tiny timings, floating-point arithmetic and accumulated error.

Having made that change, see whether the "non-catch" version is still slower than the "catch" version.

EDIT: Okay, I've tried it myself - and I'm seeing the same result. Very odd. I wondered whether the try/catch was disabling some bad inlining, but using [MethodImpl(MethodImplOptions.NoInlining)] instead didn't help...
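
For reference, this is roughly what that attempt looks like (a sketch; per the edit above, the attribute alone did not reproduce the speed-up):

using System.Runtime.CompilerServices;

[MethodImpl(MethodImplOptions.NoInlining)]   // forbid inlining without introducing a try/catch
static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    for (int i = 1; i < n; i++)
    {
        n1 = n2;
        n2 = fibo;
        fibo = n1 + n2;
    }

    return fibo;
}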

Basically you'll need to look at the optimized JITted code under cordbg, I suspect...

EDIT: A few more bits of information:

  • Putting the try/catch around just the n++; line still improves performance, but not by as much as putting it around the whole block
  • If you catch a specific exception (ArgumentException in my tests) it's still fast
  • If you print the exception in the catch block it's still fast
  • If you rethrow the exception in the catch block it's slow again
  • If you use a finally block instead of a catch block it's slow again
  • If you use a finally block as well as a catch block, it's fast

Weird...
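
For reference, a minimal sketch of the last (fast) variation in the list above, with both a catch block and a finally block around the same loop:

static long Fibo(int n)
{
    long n1 = 0, n2 = 1, fibo = 0;
    n++;

    try
    {
        for (int i = 1; i < n; i++)
        {
            n1 = n2;
            n2 = fibo;
            fibo = n1 + n2;
        }
    }
    catch {}
    finally {}      // catch plus finally: still fast, per the observations above

    return fibo;
}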

EDIT: Okay, we have disassembly...

This is using the C# 2 compiler and .NET 2 (32-bit) CLR, disassembling with mdbg (as I don't have cordbg on my machine). I still see the same performance effects, even under the debugger. The fast version uses a try block around everything between the variable declarations and the return statement, with just a catch{} handler. Obviously the slow version is the same except without the try/catch. The calling code (i.e. Main) is the same in both cases, and has the same assembly representation (so it's not an inlining issue).

Disassembled code for the fast version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        edi
 [0004] push        esi
 [0005] push        ebx
 [0006] sub         esp,1Ch
 [0009] xor         eax,eax
 [000b] mov         dword ptr [ebp-20h],eax
 [000e] mov         dword ptr [ebp-1Ch],eax
 [0011] mov         dword ptr [ebp-18h],eax
 [0014] mov         dword ptr [ebp-14h],eax
 [0017] xor         eax,eax
 [0019] mov         dword ptr [ebp-18h],eax
*[001c] mov         esi,1
 [0021] xor         edi,edi
 [0023] mov         dword ptr [ebp-28h],1
 [002a] mov         dword ptr [ebp-24h],0
 [0031] inc         ecx
 [0032] mov         ebx,2
 [0037] cmp         ecx,2
 [003a] jle         00000024
 [003c] mov         eax,esi
 [003e] mov         edx,edi
 [0040] mov         esi,dword ptr [ebp-28h]
 [0043] mov         edi,dword ptr [ebp-24h]
 [0046] add         eax,dword ptr [ebp-28h]
 [0049] adc         edx,dword ptr [ebp-24h]
 [004c] mov         dword ptr [ebp-28h],eax
 [004f] mov         dword ptr [ebp-24h],edx
 [0052] inc         ebx
 [0053] cmp         ebx,ecx
 [0055] jl          FFFFFFE7
 [0057] jmp         00000007
 [0059] call        64571ACB
 [005e] mov         eax,dword ptr [ebp-28h]
 [0061] mov         edx,dword ptr [ebp-24h]
 [0064] lea         esp,[ebp-0Ch]
 [0067] pop         ebx
 [0068] pop         esi
 [0069] pop         edi
 [006a] pop         ebp
 [006b] ret

Disassembled code for the slow version:

 [0000] push        ebp
 [0001] mov         ebp,esp
 [0003] push        esi
 [0004] sub         esp,18h
*[0007] mov         dword ptr [ebp-14h],1
 [000e] mov         dword ptr [ebp-10h],0
 [0015] mov         dword ptr [ebp-1Ch],1
 [001c] mov         dword ptr [ebp-18h],0
 [0023] inc         ecx
 [0024] mov         esi,2
 [0029] cmp         ecx,2
 [002c] jle         00000031
 [002e] mov         eax,dword ptr [ebp-14h]
 [0031] mov         edx,dword ptr [ebp-10h]
 [0034] mov         dword ptr [ebp-0Ch],eax
 [0037] mov         dword ptr [ebp-8],edx
 [003a] mov         eax,dword ptr [ebp-1Ch]
 [003d] mov         edx,dword ptr [ebp-18h]
 [0040] mov         dword ptr [ebp-14h],eax
 [0043] mov         dword ptr [ebp-10h],edx
 [0046] mov         eax,dword ptr [ebp-0Ch]
 [0049] mov         edx,dword ptr [ebp-8]
 [004c] add         eax,dword ptr [ebp-1Ch]
 [004f] adc         edx,dword ptr [ebp-18h]
 [0052] mov         dword ptr [ebp-1Ch],eax
 [0055] mov         dword ptr [ebp-18h],edx
 [0058] inc         esi
 [0059] cmp         esi,ecx
 [005b] jl          FFFFFFD3
 [005d] mov         eax,dword ptr [ebp-1Ch]
 [0060] mov         edx,dword ptr [ebp-18h]
 [0063] lea         esp,[ebp-4]
 [0066] pop         esi
 [0067] pop         ebp
 [0068] ret

In each case the * shows where the debugger entered in a simple "step-into".

EDIT: Okay, I've now looked through the code and I think I can see how each version works... and I believe the slower version is slower because it uses fewer registers and more stack space. For small values of n that's possibly faster - but when the loop takes up the bulk of the time, it's slower.

Possibly the try/catch block forces more registers to be saved and restored, so the JIT uses those for the loop as well... which happens to improve the performance overall. It's not clear whether it's a reasonable decision for the JIT to not use as many registers in the "normal" code.

EDIT: Just tried this on my x64 machine. The x64 CLR is much faster (about 3-4 times faster) than the x86 CLR on this code, and under x64 the try/catch block doesn't make a noticeable difference.


#4

Jon's disassemblies show that the difference between the two versions is that the fast version uses a pair of registers (esi, edi) to store one of the local variables where the slow version doesn't.

The JIT compiler makes different assumptions regarding register use for code that contains a try-catch block vs. code that doesn't. This causes it to make different register allocation choices. In this case, this favors the code with the try-catch block. Different code may lead to the opposite effect, so I would not count this as a general-purpose speed-up technique.

In the end, it's very hard to tell which code will end up running the fastest. Something like register allocation and the factors that influence it are such low-level implementation details that I don't see how any specific technique could reliably produce faster code.

For example, consider the following two methods. They were adapted from a real-life example:

interface IIndexed { int this[int index] { get; set; } }
struct StructArray : IIndexed { 
    public int[] Array;
    public int this[int index] {
        get { return Array[index]; }
        set { Array[index] = value; }
    }
}

static int Generic<T>(int length, T a, T b) where T : IIndexed {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}
static int Specialized(int length, StructArray a, StructArray b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i] * b[i];
    return sum;
}

One is a generic version of the other. Replacing the generic type with StructArray would make the methods identical. Because StructArray is a value type, it gets its own compiled version of the generic method. Yet the actual running time is significantly longer than the specialized method's, but only for x86. For x64, the timings are pretty much identical. In other cases, I've observed differences for x64 as well.
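
A minimal sketch of how the two methods might be timed against each other (this harness is mine, not from the original answer; it assumes the definitions above plus using System and using System.Diagnostics, and the array size and iteration count are arbitrary):

static void CompareIndexers()
{
    var a = new StructArray { Array = new int[1000] };
    var b = new StructArray { Array = new int[1000] };

    var sw = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++)
        Generic(1000, a, b);           // JIT-specialized generic version (T = StructArray)
    sw.Stop();
    Console.WriteLine("Generic:     " + sw.Elapsed);

    sw = Stopwatch.StartNew();
    for (int i = 0; i < 100000; i++)
        Specialized(1000, a, b);       // hand-specialized version
    sw.Stop();
    Console.WriteLine("Specialized: " + sw.Elapsed);
}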


#5

I'd have put this in as a comment, as I'm really not certain that this is likely to be the case, but as I recall, doesn't a try/catch statement involve a modification to the way the compiler's garbage disposal mechanism works, in that it clears up object memory allocations recursively off the stack? There may not be an object to be cleared up in this case, or the for loop may constitute a closure that the garbage collection mechanism recognizes as sufficient to enforce a different collection method. Probably not, but I thought it worth a mention as I hadn't seen it discussed anywhere else.


#6

One of the Roslyn engineers who specializes in understanding optimization of stack usage took a look at this and reports to me that there seems to be a problem in the interaction between the way the C# compiler generates local variable stores and the way the JIT compiler does register scheduling in the corresponding x86 code. The result is suboptimal code generation on the loads and stores of the locals.

For some reason unclear to all of us, the problematic code generation path is avoided when the JITter knows that the block is in a try-protected region.

This is pretty weird. We'll follow up with the JITter team and see whether we can get a bug entered so that they can fix this.

Also, we are working on improvements for Roslyn to the C# and VB compilers' algorithms for determining when locals can be made "ephemeral" -- that is, just pushed and popped on the stack, rather than allocated a specific location on the stack for the duration of the activation. We believe the JITter will be able to do a better job of register allocation and whatnot if we give it better hints about when locals can be made "dead" earlier.

Thanks for bringing this to our attention, and apologies for the odd behaviour.
