ARM Floating-Point Arithmetic Explained

Latest recommended article published on 2024-08-13 19:05:25
nancygreen
 
I: Floating-point emulators on early ARM
 
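Early ARM cores had no floating-point coprocessor, so floating-point arithmetic was emulated by the CPU: every floating-point operation ran in a floating-point math emulator. A single operation often took thousands of cycles to complete, which made floating-point code very slow.
Even today, the ARM kernel configuration still offers the following options: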
Floating point emulation  --->
[ ] NWFPE math emulation
[ ] FastFPE math emulation (EXPERIMENTAL) 
This is where the ARM floating-point emulator is configured.
 
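The emulator works through the undefined instruction handler: each floating-point instruction encountered at run time raises an exception, and the handler emulates it. The consequence is a very high exception rate, which greatly increases interrupt latency and degrades the real-time behavior of the system.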
 
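II: Soft-float
Soft-float support is provided by the cross toolchain and has nothing to do with the Linux kernel. When a soft-float toolchain compiles floating-point operations, the compiler replaces them with calls to floating-point library routines shipped with the toolchain, so the generated machine code contains no floating-point instructions at all yet still computes correct floating-point results.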
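To make the idea concrete: a soft-float routine has to work on the bit pattern of a value using integer instructions only. The sketch below shows just that first step, splitting a `float` into its fields. It is not the toolchain's library code; `struct f32_fields` and `f32_unpack` are hypothetical names, and it assumes `float` is an IEEE 754 single-precision value.

```c
#include <stdint.h>
#include <string.h>

/* Fields of an IEEE 754 single-precision value (assumed float format). */
struct f32_fields {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, biased by 127 */
    uint32_t mantissa;  /* 23 bits, without the implicit leading 1 */
};

/* Hypothetical helper: split a float using integer operations only. */
struct f32_fields f32_unpack(float value)
{
    uint32_t bits;
    struct f32_fields f;

    memcpy(&bits, &value, sizeof bits);   /* reinterpret the bits safely */
    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

A real soft-float addition would go on to align the two mantissas, add them, then renormalize and round, all with integer arithmetic.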
 
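III: Floating-point coprocessors
Newer versions of the ARM architecture allow coprocessors to be attached. To handle floating-point workloads better, some ARM CPUs add a floating-point coprocessor
and define a floating-point instruction set for it. If the hardware is not actually present, these instructions are trapped and executed by the floating-point emulator module (FPEmulator).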
 
 
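IV: Using the hardware floating-point coprocessor and its instruction set
To have the hardware floating-point coprocessor perform an application's floating-point arithmetic, two prerequisites must be met:
1. The kernel is configured to support the hardware coprocessor.
2. The compiler can translate floating-point operations into hardware floating-point instructions, or the program invokes those instructions by hand wherever floating-point arithmetic is needed.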
 
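1. Kernel support:
If the kernel does not support the floating-point coprocessor, coprocessor instructions cannot run, because of issues such as access permissions on the coprocessor registers.
As one expert online points out:
The CP15 c1 Coprocessor Access Control Register defines the access rights that user mode and privileged modes have to the coprocessors. To use VFP, user mode must of course be allowed to access CP10 and CP11.
 The other register is the VFP's FPEXC; bit 30 is the VFP enable bit.
 Once the operating system has done these two things, user programs can use VFP. The kernel does, of course, handle some other matters besides these two.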
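The two register settings described above can be sketched as plain bit arithmetic. The code below only computes the values and runs anywhere; it does not touch hardware. The macro and function names are hypothetical. The field positions (cp10 in bits [21:20] and cp11 in bits [23:22] of the access control register, with 0b11 meaning full access, and EN in bit 30 of FPEXC) are stated here as my understanding of the ARMv7 layout and should be checked against the reference manual for your core.

```c
#include <stdint.h>

/* Assumed ARMv7 field layout; verify against the architecture manual. */
#define CPACR_CP10_FULL  (0x3u << 20)  /* cp10: full access (user + privileged) */
#define CPACR_CP11_FULL  (0x3u << 22)  /* cp11: full access (user + privileged) */
#define FPEXC_EN         (1u << 30)    /* VFP enable bit */

/* New access-control value that lets user mode reach CP10 and CP11. */
uint32_t cpacr_allow_vfp(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}

/* New FPEXC value with the VFP unit switched on. */
uint32_t fpexc_enable(uint32_t fpexc)
{
    return fpexc | FPEXC_EN;
}

/*
 * On real hardware, privileged code would read and write these registers
 * with coprocessor instructions, for example:
 *   mrc p15, 0, r0, c1, c0, 2   ; read the access control register
 *   mcr p15, 0, r0, c1, c0, 2   ; write it back
 *   vmsr fpexc, r0              ; write FPEXC
 */
```

Starting from zero, `cpacr_allow_vfp(0)` yields `0x00F00000` and `fpexc_enable(0)` yields `0x40000000`.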
 
Floating point emulation  --->
 [*] VFP-format floating point maths
Include VFP support code in the kernel. This is needed IF your hardware includes a VFP unit.
 
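2. Selecting the floating-point instructions in the compiler:
The compiler can be told explicitly which floating-point instructions to generate for floating-point operations.

If the compiler is set up for soft-float, it may translate floating-point operations into calls to the floating-point library bundled with the compiler, in which case no real floating-point instructions are emitted.
Otherwise, it can translate them into FPA (Floating Point Accelerator) instructions. When FPA instructions run on a system without FPA hardware, they depend on a floating-point emulator being present to execute them.
Floating-point operations can also be compiled to VFP (Vector Floating Point) instructions or to NEON vector floating-point instructions.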
 
 
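V: Telling the compiler to generate hardware floating-point instructions
Measure how long floating-point addition, subtraction, multiplication, and division take: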
 
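float src_mem_32[1024] = {1.024f};   /* only element 0 is set; the rest are 0 */
float dst_mem_32[1024] = {0.933f};   /* likewise */
float src_32;                        /* global result of each addition */

void test_F32bit_addition(void)
{
      int i, j;

      for (j = 0; j < 1024; j++)
      {
           for (i = 0; i < 1024; i++)
           {
                src_32 = src_mem_32[i] + dst_mem_32[i];
           }
      }
}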
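Print the millisecond timestamp before and after the loops with printf, and use the difference to gauge the computing performance.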
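The original timing code is not shown, so the following is only a sketch of one way to take the measurement. It assumes a POSIX system with `clock_gettime` and `CLOCK_MONOTONIC`; `elapsed_ms` and `time_f32_additions_ms` are hypothetical names, and `src_32` is declared `volatile` here so an optimizing build keeps every store.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

float src_mem_32[1024] = {1.024f};
float dst_mem_32[1024] = {0.933f};
volatile float src_32;   /* volatile: keep each store even with -O2 */

/* Difference between two timestamps, in milliseconds. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1000.0
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}

/* Run the 1024 x 1024 additions and return the elapsed time in ms. */
double time_f32_additions_ms(void)
{
    struct timespec start, end;
    int i, j;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (j = 0; j < 1024; j++) {
        for (i = 0; i < 1024; i++) {
            src_32 = src_mem_32[i] + dst_mem_32[i];
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_ms(start, end);
}
```

A caller could report the result with `printf("%.3f ms\n", time_f32_additions_ms());`. With older C libraries, linking may additionally need `-lrt` for `clock_gettime`.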
 
Compile:
arm-hisiv200-linux-gcc -c   -Wall fcpu.c -o fcpu.o
arm-hisiv200-linux-gcc fcpu.o -o FCPU -L./
Running the program gives the time needed for the 1024 × 1024 32-bit floating-point additions.
 
What if we want to use VFP?
arm-hisiv200-linux-gcc -c   -Wall -mfpu=vfp -mfloat-abi=softfp  fcpu.c -o fcpu.o
arm-hisiv200-linux-gcc -Wall -mfpu=vfp -mfloat-abi=softfp   fcpu.o -o FCPU -L./
Running this build shows that the time required drops by almost half, so VFP is clearly effective.
For an explanation of -mfpu and -mfloat-abi, see Appendix 2.
 
How can we check directly whether VFP is actually being used?
Look at the disassembly of the compiled object file.
 
#arm-hisiv200-linux-objdump -d fcpu.o
00000000 <test_F32bit_addition>:
    0:   e52db004        push    {fp}            ; (str fp, [sp, #-4]!)
    4:   e28db000        add     fp, sp, #0
    8:   e24dd00c        sub     sp, sp, #12
    c:   e3a03000        mov     r3, #0
   10:   e50b300c        str     r3, [fp, #-12]
   14:   e3a03000        mov     r3, #0
   18:   e50b3008        str     r3, [fp, #-8]
   1c:   e3a03000        mov     r3, #0
   20:   e50b3008        str     r3, [fp, #-8]
   24:   ea000017        b       88 <test_F32bit_addition+0x88>
   28:   e3a03000        mov     r3, #0
   2c:   e50b300c        str     r3, [fp, #-12]
   30:   ea00000d        b       6c <test_F32bit_addition+0x6c>
   34:   e51b200c        ldr     r2, [fp, #-12]
   38:   e59f3064        ldr     r3, [pc, #100]  ; a4 <test_F32bit_addition+0xa4>
   3c:   e0831102        add     r1, r3, r2, lsl #2
   40:   ed917a00        vldr    s14, [r1]
   44:   e51b200c        ldr     r2, [fp, #-12]
   48:   e59f3058        ldr     r3, [pc, #88]   ; a8 <test_F32bit_addition+0xa8>
   4c:   e0831102        add     r1, r3, r2, lsl #2
   50:   edd17a00        vldr    s15, [r1]
   54:   ee777a27        vadd.f32        s15, s14, s15
   58:   e59f304c        ldr     r3, [pc, #76]   ; ac <test_F32bit_addition+0xac>
   5c:   edc37a00        vstr    s15, [r3]
   60:   e51b300c        ldr     r3, [fp, #-12]
   64:   e2833001        add     r3, r3, #1
   68:   e50b300c        str     r3, [fp, #-12]
   6c:   e51b200c        ldr     r2, [fp, #-12]
   70:   e59f3038        ldr     r3, [pc, #56]   ; b0 <test_F32bit_addition+0xb0>
   74:   e1520003        cmp     r2, r3
   78:   daffffed        ble     34 <test_F32bit_addition+0x34>
   7c:   e51b3008        ldr     r3, [fp, #-8]
   80:   e2833001        add     r3, r3, #1
   84:   e50b3008        str     r3, [fp, #-8]
   88:   e51b2008        ldr     r2, [fp, #-8]
   8c:   e59f301c        ldr     r3, [pc, #28]   ; b0 <test_F32bit_addition+0xb0>
   90:   e1520003        cmp     r2, r3
   94:   daffffe3        ble     28 <test_F32bit_addition+0x28>
   98:   e28bd000        add     sp, fp, #0
   9c:   e49db004        pop     {fp}            ; (ldr fp, [sp], #4)
   a0:   e12fff1e        bx      lr
 
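The listing clearly contains VFP instructions (vldr, vadd.f32, vstr), so this build does use VFP:
arm-hisiv200-linux-gcc -c   -Wall -mfpu=vfp -mfloat-abi=softfp  fcpu.c -o fcpu.o
Note: the VFP instructions are covered in Appendix 1.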
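The VFP instructions in the listing can also be recognized from their encodings rather than their mnemonics. The sketch below is a rough classifier, not a full decoder: it assumes the classic ARM (A32) coprocessor encoding, where the coprocessor number sits in bits [11:8] and VFP uses coprocessors 10 and 11. `is_vfp_coproc_insn` is a hypothetical name.

```c
#include <stdint.h>

/*
 * Rough check: is this A32 instruction word a coprocessor instruction
 * addressed to CP10 or CP11 (the VFP coprocessors)?
 * Returns 1 if so, 0 otherwise.
 */
int is_vfp_coproc_insn(uint32_t insn)
{
    uint32_t cond   = insn >> 28;          /* condition field */
    uint32_t op     = (insn >> 24) & 0xFu; /* bits [27:24] */
    uint32_t coproc = (insn >> 8) & 0xFu;  /* coprocessor number */

    if (cond == 0xFu)
        return 0;                          /* unconditional space: not handled here */
    if (coproc != 10u && coproc != 11u)
        return 0;
    /* 0b110x: coprocessor load/store; 0b1110: data processing / register transfer */
    return (op & 0xEu) == 0xCu || op == 0xEu;
}
```

Applied to words from the listing above, `0xee777a27` (vadd.f32), `0xed917a00` (vldr), and `0xedc37a00` (vstr) are classified as VFP, while `0xe3a03000` (mov) is not.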
 
 
If we instead build with:
arm-hisiv200-linux-gcc -c   -Wall fcpu.c -o fcpu.o
 
#arm-hisiv200-linux-objdump -d fcpu.o
00000000 <test_F32bit_addition>:
    0:   e92d4800        push    {fp, lr}
    4:   e28db004        add     fp, sp, #4
    8:   e24dd008        sub     sp, sp, #8
    c:   e3a03000        mov     r3, #0
   10:   e50b300c        str     r3, [fp, #-12]
   14:   e3a03000        mov     r3, #0
   18:   e50b3008        str     r3, [fp, #-8]
   1c:   e3a03000        mov     r3, #0
   20:   e50b3008        str     r3, [fp, #-8]
   24:   ea000019        b       90 <test_F32bit_addition+0x90>
   28:   e3a03000        mov     r3, #0
   2c:   e50b300c        str     r3, [fp, #-12]
   30:   ea00000f        b       74 <test_F32bit_addition+0x74>
   34:   e51b200c        ldr     r2, [fp, #-12]
   38:   e59f3068        ldr     r3, [pc, #104]  ; a8 <test_F32bit_addition+0xa8>
   3c:   e7932102        ldr     r2, [r3, r2, lsl #2]
   40:   e51b100c        ldr     r1, [fp, #-12]
   44:   e59f3060        ldr     r3, [pc, #96]   ; ac <test_F32bit_addition+0xac>
   48:   e7933101        ldr     r3, [r3, r1, lsl #2]
   4c:   e1a00002        mov     r0, r2
   50:   e1a01003        mov     r1, r3
   54:   ebfffffe        bl      0 <__aeabi_fadd>
   58:   e1a03000        mov     r3, r0
   5c:   e1a02003        mov     r2, r3
   60:   e59f3048        ldr     r3, [pc, #72]   ; b0 <test_F32bit_addition+0xb0>
   64:   e5832000        str     r2, [r3]
   68:   e51b300c        ldr     r3, [fp, #-12]
   6c:   e2833001        add     r3, r3, #1
   70:   e50b300c        str     r3, [fp, #-12]
   74:   e51b200c        ldr     r2, [fp, #-12]
   78:   e59f3034        ldr     r3, [pc, #52]   ; b4 <test_F32bit_addition+0xb4>
   7c:   e1520003        cmp     r2, r3
   80:   daffffeb        ble     34 <test_F32bit_addition+0x34>
   84:   e51b3008        ldr     r3, [fp, #-8]
   88:   e2833001        add     r3, r3, #1
   8c:   e50b3008        str     r3, [fp, #-8]
   90:   e51b2008        ldr     r2, [fp, #-8]
   94:   e59f3018        ldr     r3, [pc, #24]   ; b4 <test_F32bit_addition+0xb4>
   98:   e1520003        cmp     r2, r3
   9c:   daffffe1        ble     28 <test_F32bit_addition+0x28>
   a0:   e24bd004        sub     sp, fp, #4
   a4:   e8bd8800        pop     {fp, pc}
then the output contains no VFP instructions,
and the addition is done by calling __aeabi_fadd instead.
 
 
 
 
 
 
 
 
 
Appendix 1: VFP instructions
   See the ARM RealView documentation.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0204ic/Bcffbdga.html
 
Appendix 2:
       -mfpu=name
        -mfpe=number
        -mfp=number
            This specifies what floating point hardware (or hardware emulation) is available on the target.  Permissible names are: fpa, fpe2, fpe3, maverick, vfp.  -mfp and -mfpe are synonyms for -mfpu=fpenumber, for compatibility with older versions of GCC.
 
 
 
-mfloat-abi=name
            Specifies which ABI to use for floating point values.  Permissible values are: soft, softfp and hard.
           soft and hard are equivalent to -msoft-float and -mhard-float respectively.  softfp allows the generation of floating point instructions, but still uses the soft-float calling conventions.
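
Which floating-point ABI a given build actually used can, to my knowledge, be probed from GCC's predefined macros. The sketch below assumes that `__ARM_PCS_VFP` is defined for the hard-float calling convention and `__SOFTFP__` for pure soft-float; an old vendor toolchain such as arm-hisiv200-linux-gcc may not define them the same way, so treat the result as a hint only. `float_build_mode` is a hypothetical name.

```c
/* Report the floating-point mode this translation unit was built with. */
const char *float_build_mode(void)
{
#if defined(__ARM_PCS_VFP)
    return "hard";                 /* FP arguments passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";                 /* library calls, no FP instructions */
#elif defined(__arm__)
    return "softfp (presumably)";  /* FP instructions, soft-float calling convention */
#else
    return "not an ARM build";
#endif
}
```

Printing the returned string from a test program built with each set of flags shows which mode took effect.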