搞定“超超超难”剑桥面试数学题番外篇：ARM64汇编

大熊猫侯佩

已于 2023-06-11 20:32:22 修改

阅读量1.2k

点赞数 7

分类专栏：极客文章标签：算法 ARM64 汇编 Apple Silicon Swift SIMD

于 2023-06-09 19:35:40 首次发布

本文链接：https://blog.csdn.net/mydo/article/details/131121094

版权

极客专栏收录该内容

67 篇文章 8 订阅

订阅专栏

文章讲述了如何在M2芯片的Mac上使用ARM64汇编语言实现一道算法题，对比了ARM64汇编、Swift、C和Ruby的执行效率，并探讨了SIMD指令对性能的影响。实验结果显示，尽管ARM64汇编在这次测试中速度较慢，但展示了其潜在的优化空间。

摘要由CSDN通过智能技术生成

在这里插入图片描述

0. 概览

在有趣的小实验：四种语言搞定“超超超难”剑桥面试数学题那篇博文中，我们使用 4 种语言（x64汇编、C、Swift 以及 Ruby）实现了一道算法题。

不过，其中的汇编语言对应的是 intel CPU 上的 x64 指令集，那么能不能在 Apple Silicon 芯片（M1，M2）上使用汇编来原生实现相同的功能呢？

答案是肯定的！

在本篇博文中，我们将会在 M2 芯片的 Mac 上使用 ARM64 汇编语言完成算法题的挑战。

本文测试环境为：M2 芯片 + 16GB 内存的 MBA，macOS Ventura 13.4。

闲言少叙，让我们躁起来 😉

Let’s go！！！

1. 情景重现

让我们回顾一下原题，这是一道剑桥面试题（如下图所示），图中用了 4 个 ‘Super’ 来形容它的“超超超超级难”：

在这里插入图片描述

题目的意思很简单：

如果 a + b + c + d = 63；
求 ab + bc + cd 的最大值；
其中 a、b、c、d 都为自然数；

如果说用用数学证明它“难得离谱”的话，那么编程来解决则恰恰相反：没有比写代码搞定它更容易的事了。

2. M2 芯片上的重测结果

首先，我们用之前博文中相同的 Ruby、Swift 和 C 代码生成可执行文件，在 M2 上重新运行一下，看看速度如何。

为了方便小伙伴们阅览，下面把每种语言原来的代码一并奉上。

2.1 Ruby 语言

#!/usr/bin/ruby

max = 0
r = 1..63
for a in r do
    for b in r do
        for c in r do
            for d in r do
                if a + b + c + d == 63
                    rlt = a*b + b*c + c*d
                    if rlt >= max 
                        max = rlt
                    end
                end
            end
        end
    end
end

print("max is #{max}\n")

连续执行 3 次，平均耗时大约 0.7 秒左右：

% time ./test.rb
max is 991
./test.rb  0.67s user 0.03s system 99% cpu 0.701 total
% time ./test.rb
max is 991
./test.rb  0.67s user 0.03s system 99% cpu 0.706 total
% time ./test.rb
max is 991
./test.rb  0.67s user 0.03s system 99% cpu 0.702 total

2.2 Swift 语言

import Foundation

typealias GroupNumbers = (a: Int, b: Int, c: Int, d: Int, rlt: Int)

@inline(__always) func value(_ g: GroupNumbers) -> Int {
    g.a * g.b + g.b * g.c + g.c * g.d
}

var max = 0
let r = 1...63
for a in r {
    for b in r {
        for c in r {
            for d in r {
                if a + b + c + d == 63 {
                    let v = (a: a, b: b, c: c, d: d, rlt: 0)
                    let rlt = value(v)
                    
                    if rlt >= max {
                        max = rlt
                    }
                }
            }
        }
    }
}

print("max is \(max)")

照旧使用 Xcode 创建 Command Line Tool 类型项目，将源代码贴入 main.swift 文件中，不过这次为我们干苦力的是最新的 Xcode 15 beta。

我们直接切换至 Release 模式，运行结果如下：

% time ./test
max is 991
./test  0.02s user 0.00s system 86% cpu 0.023 total
% time ./test
max is 991
./test  0.02s user 0.00s system 86% cpu 0.024 total
% time ./test
max is 991
./test  0.02s user 0.00s system 89% cpu 0.021 total

可以看到，Swift 优化还是一如既往的棒，只需 0.02 秒多一点。

2.3 C 语言

#include <stdio.h>

int main() {
    int start = 1;
    int end = 63;
    int max = 0;

    for (int a = start; a <= end; a++) {
        for (int b = start; b <= end; b++) {
            for (int c = start; c <= end; c++) {
                for (int d = start; d < end; d++) {
                    if(a + b + c + d == 63) {
                        int rlt = a*b + b*c + c*d;
                        if(rlt >= max) {
                            max = rlt;
                        }
                    }
                }
            }
        }
    }

    printf("max is %d\n", max);
    return 0;
}

同样，我们使用clang 的 O2 优化选项来编译 C 代码，耗时稍微比 Swift 慢一丢丢：

% time ./test
max is 991
./test  0.02s user 0.00s system 93% cpu 0.027 total
% time ./test
max is 991
./test  0.02s user 0.00s system 93% cpu 0.027 total
% time ./test
max is 991
./test  0.02s user 0.00s system 93% cpu 0.027 total

ARM64 汇编

因为无论是 C、Swift 还是 Ruby 的源代码都可以“原封不动”的在 intel 或 M2 芯片上编译执行，所以它们的兼容性非常高！

不过，汇编语言可就“拖后腿”了。

x64 汇编是无法直接在 M2 上执行的。于是乎，我们呼唤 M2 芯片原生指令集汇编火速抵达战场。

x64 汇编生成的可执行文件能不能通过 Rosetta2 在 M2 上执行呢？这是一个有趣的问题。

我们把之前在 intel mac 上创建的可执行文件拷贝到 M2 mac 中，直接在终端中运行，提示错误：

在这里插入图片描述

在访达（Finder）中选中 x64 文件，点击鼠标右键弹出菜单中的打开选项，在新弹出的窗口中选择打开：

在这里插入图片描述

再回到终端中，我们发现可以运行 x64 程序了。

测试发现，由于 M2 上 Rosetta2 的转译，intel x64 指令执行速度大打折扣：

% time ./x64
max is 992
./x64  0.03s user 0.01s system 87% cpu 0.043 total
% time ./x64
max is 992
./x64  0.03s user 0.01s system 88% cpu 0.042 total
% time ./x64
max is 992
./x64  0.03s user 0.01s system 89% cpu 0.041 total

所以现在，轮到我们主角 ARM64 闪亮登场了。

Apple Silicon（M1，M2等）是苹果公司开发的基于 ARM 架构的系列芯片，用于替代之前使用的英特尔芯片，旨在提高 Mac 电脑的性能和能效。

Apple Silicon 芯片兼容 ARMv8 指令集，并以 64 位模式（AArch64）运行。所以，我们可以使用 ARM64 汇编语言来构建 M2 上的可执行文件。

ARM 是 risc 架构的 cpu，它的指令是等长（32位）的，不像 intel 是变长指令集。

关于更多 64 位汇编语言的相关知识，感谢兴趣的小伙伴们可以移步到我的汇编（Asm）专栏中观赏相关文章：

大熊猫侯佩的 Asm 系列专栏

下面的 ARM64 汇编代码在关键处做了详细的注释，大家可以轻松读懂：

# as test_arm64.s -o ta64.o
# ld ta64.o -lSystem -L `xcrun --show-sdk-path -sdk macosx`/usr/lib -o ta64
# test_a64.s

.equ        total, 63

    .text
    .globl  _main
    .p2align    2
_main:
    sub     sp,sp,#32
    stp     x29,x30,[sp,#16]
    add     x29,sp,#16

    mov     w0,1    // a in w0
    mov     w1,w0   // b in w1
    mov     w2,w0   // c in w2
    mov     w3,w0   // d in w3
    mov     w11,wzr // max in w11
start_a_loop:
    cmp     w0,total
    b.hi    end_a_loop
start_b_loop:
    cmp     w1,total
    b.hi    end_b_loop
start_c_loop:
    cmp     w2,total
    b.hi    end_c_loop
start_d_loop:
    cmp     w3,total
    b.hi    end_d_loop
    // 计算 a + b + c + d 的值
    add     w4,w0,w1
    add     w4,w4,w2
    add     w4,w4,w3
    cmp     w4,total
    b.ne    not_equ_63
    // 若等于 a + b + c + d = 63，则计算 ab + bc + cd 的值 x
    mul     w4,w0,w1
    mul     w5,w1,w2
    mul     w6,w2,w3
    add     w5,w5,w6
    add     w4,w4,w5
    // 若 x > max ，则需要更新 max 为 x 值
    cmp     w4,w11
    b.ls    not_equ_63
    mov     w11,w4
not_equ_63:
    add     w3,w3,#1
    b       start_d_loop
end_d_loop:
    mov     w3,wzr
    add     w2,w2,#1
    b start_c_loop
end_c_loop:
    mov     w2,wzr
    add     w1,w1,#1
    b       start_b_loop
end_b_loop:
    mov     w1,wzr
    add     w0,w0,#1
    b       start_a_loop
end_a_loop:
    adr     x0,string
    str     x11,[sp]
    bl      _printf

    ldp     x29,x30,[sp,#16]
    add     sp,sp,#32
    # eor x0,x0,x0
    mov     x0,#0
    ret
string: .asciz  "max is %ld\n"

使用 as 汇编器和 ld 链接器生成可执行文件，那么它的执行速度如何呢？

% time ./ta64
max is 992
./ta64  0.03s user 0.00s system 91% cpu 0.034 total
% time ./ta64
max is 992
./ta64  0.03s user 0.00s system 94% cpu 0.031 total
% time ./ta64
max is 992
./ta64  0.03s user 0.00s system 94% cpu 0.031 total

可能大家会对上面的结果失望了，它比 Swift 和 C 还要慢。

至于是什么原因，我们在之前的博文中已经谈过了。

虽然上面汇编代码运行速度只能排在第 3 名，但它也是可以再被优化的。不过，限于篇幅这里不再展开说明了，有兴趣的童鞋可以关注我的其它博文。

另一种乘法的实现

上面我们在实现 ab + bc + cd 的计算时，使用的是标准的乘法指令： mul。M2 芯片还支持 SIMD （单指令流多数据流）指令集，它们可以在一条指令中同时完成多个数据的计算。

下面，我们就用 SIMD 指令来重写之前 mul 乘法计算的逻辑：

// 其余代码不变，省略之...
start_d_loop:
    cmp     w3,total
    b.hi    end_d_loop
    // 计算 a + b + c + d 的值
    add     w4,w0,w1
    add     w4,w4,w2
    add     w4,w4,w3
    cmp     w4,total
    b.ne    not_equ_63
    // 若等于 a + b + c + d = 63，则计算 ab + bc + cd 的值 x
    // 将 a,b,c 依次放入 v1 SIMD 寄存器的 3 个 32 位通道中
    mov     v1.s[0],w0
    mov     v1.s[1],w1
    mov     v1.s[2],w2
	// 同样将 b,c,d 依次放入 v2 寄存器的对应通道中
    mov     v2.s[0],w1
    mov     v2.s[1],w2
    mov     v2.s[2],w3
	// 用 mul.4s 指令同时计算 a*b，b*c，c*d 的值
    mul.4s  v0,v1,v2
    // 计算 ab + bc + cd 的值，并将结果放入 w4 寄存器
    mov     w4,v0.s[0]
    mov     w5,v0.s[1]
    mov     w6,v0.s[2]
    add     w5,w5,w6
    add     w4,w4,w5
    /* 原来的 mul 指令...
    mul     w4,w0,w1
    mul     w5,w1,w2
    mul     w6,w2,w3
    add     w5,w5,w6
    add     w4,w4,w5
	*/
    // 若 x > max ，则需要更新 max 为 x 值
    cmp     w4,w11
    b.ls    not_equ_63
    mov     w11,w4
not_equ_63:

运行发现，结果并没有太大变化，可能该场景不是 SIMD 指令发挥最大威力的领域：

% time ./ta64
max is 992
./ta64  0.03s user 0.00s system 93% cpu 0.031 total
% time ./ta64
max is 992
./ta64  0.03s user 0.00s system 93% cpu 0.031 total