How ARMv8 CPUs handle physical-memory ECC errors under linux-5.4.18

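Contents

1 Introduction

2 Information from the ARMv8 manual

2.1 Synchronous exception types

2.2 ARMv8 External aborts and ECC errors

2.3 ESR_ELx, Exception Syndrome Register (ELx)

2.4 Summary of the manual information

3 The physical-memory ECC error handling flow in kernel-5.4.18

3.1 Synchronous exception handling at Exception level 1

3.2 The data abort handler: el1_da

3.3 The function that handles Synchronous external abort (SEA): do_sea()

4 Summary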


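1 Introduction

The ECC function of physical memory hardware can detect memory errors.

Single-bit errors can be corrected, so they need no special handling by the kernel. Multi-bit errors, by contrast, cannot be corrected and can affect a running program in unpredictable ways.

This article analyzes how the ARMv8 architecture handles multi-bit errors under linux-5.4.18.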

2 Information from the ARMv8 manual

2.1 Synchronous exception types

......

In some implementations, External aborts. External aborts are failed memory accesses, and include accesses to those parts of the memory system that occur during the address translation. The ARMv8 architecture permits, but does not require, implementations to treat such exceptions synchronously.

                                                                《ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile》Page D1-1477

2.2 ARMv8 External aborts and ECC errors

The ARM architecture defines external aborts as errors that occur in the memory system, other than those that are detected by the MMU or debug logic. External aborts include parity or ECC errors detected by the caches or other parts of the memory system.

                                                                《ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile》Page D3-1638

The ARM architecture supports the reporting of both synchronous and asynchronous parity or ECC errors from the cache system. It is IMPLEMENTATION DEFINED what parity or ECC errors in the cache systems, if any, result in synchronous or asynchronous parity or ECC errors.

A fault code is defined for reporting parity or ECC errors, see Use of the ESR_EL1, ESR_EL2, and ESR_EL3 on page D1-1453.

                                                                《ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile》Page D3-1639

2.3 ESR_ELx, Exception Syndrome Register (ELx)

Field descriptions

EC, bits [31:26]
Exception Class. Indicates the reason for the exception that this register holds information about.

  Value (binary)   Meaning
  100100           Data Abort that caused entry from a lower Exception level, where that Exception level could be using AArch64 or using AArch32.
                   Used for MMU faults generated by data accesses, alignment faults other than those caused by the Stack Pointer misalignment, and Synchronous external aborts, including synchronous parity or ECC errors. Not used for debug related exceptions.
  100101           Data Abort that caused entry from a current Exception level, where the current Exception level must be using AArch64.
                   Used for MMU faults generated by data accesses, alignment faults other than those caused by the Stack Pointer misalignment, and Synchronous external aborts, including synchronous parity or ECC errors. Not used for debug related exceptions.
  ......

ISS, bits [24:0]
Instruction Specific Syndrome.

  IFSC, bits [5:0] (binary)   Meaning
  010000                      Synchronous external abort, other than synchronous parity or ECC error, not on translation table walk
  011000                      Synchronous parity or ECC error on memory access, not on translation table walk
  010100                      Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 0
  010101                      Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 1
  010110                      Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 2
  010111                      Synchronous external abort, other than synchronous parity or ECC error, on translation table walk, level 3
  011100                      Synchronous parity or ECC error on memory access on translation table walk, level 0
  011101                      Synchronous parity or ECC error on memory access on translation table walk, level 1
  011110                      Synchronous parity or ECC error on memory access on translation table walk, level 2
  011111                      Synchronous parity or ECC error on memory access on translation table walk, level 3

Note: IFSC is the fault status field of an Instruction Abort. For a Data Abort, the corresponding field is DFSC, also in bits [5:0], and it uses the same encodings for the entries listed here.

                                                                《ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile》Page D7-1850

                                                                《ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile》Page D7-1869
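The field layout above can be exercised with a small decoder. The following is a minimal user-space sketch in C, not kernel code: it extracts EC and the 6-bit fault status code from an ESR value and classifies the code using only the encodings listed in the table. The names esr_ec, esr_fsc, classify_fsc and enum abort_kind are made up for this illustration.

```c
#include <stdint.h>

/* Field positions taken from the ESR_ELx description above. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3Fu   /* EC is 6 bits wide, bits [31:26] */
#define ESR_FSC_MASK     0x3Fu   /* fault status code, ISS bits [5:0] */

#define EC_DABT_LOWER    0x24u   /* binary 100100 */
#define EC_DABT_CURRENT  0x25u   /* binary 100101 */

/* Hypothetical classification used only in this sketch. */
enum abort_kind { ABORT_OTHER, ABORT_SEA, ABORT_ECC };

/* Exception Class, bits [31:26]. */
static unsigned esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Fault status code, bits [5:0] of the ISS. */
static unsigned esr_fsc(uint32_t esr)
{
    return esr & ESR_FSC_MASK;
}

/* Map a fault status code onto the two groups from the table. */
static enum abort_kind classify_fsc(unsigned fsc)
{
    /* 010000, and 010100..010111 for translation table walk levels 0..3 */
    if (fsc == 0x10 || (fsc >= 0x14 && fsc <= 0x17))
        return ABORT_SEA;
    /* 011000, and 011100..011111 for translation table walk levels 0..3 */
    if (fsc == 0x18 || (fsc >= 0x1C && fsc <= 0x1F))
        return ABORT_ECC;
    return ABORT_OTHER;
}
```

For example, the value 0x96000018 decodes to EC 0x25 (Data Abort taken from the current Exception level) and fault status code 0x18 (synchronous parity or ECC error on memory access, not on a translation table walk).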

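2.4 Summary of the manual information

The ARMv8 architecture has four Exception levels (EL0 to EL3). The exceptions taken to each level fall into four types:

  1. Synchronous exception
  2. Interrupt (IRQ)
  3. Fast interrupt (FIQ)
  4. System error (SError)

Synchronous exceptions are further divided into:

  1. System calls
       • Exception level 0 uses the svc (Supervisor Call) instruction to trap into Exception level 1
       • Exception level 1 uses the hvc (Hypervisor Call) instruction to trap into Exception level 2
       • Exception level 2 uses the smc (Secure Monitor Call) instruction to trap into Exception level 3
  2. Data abort
  3. Instruction abort
  4. Misaligned stack pointer or instruction address
  5. Undefined instruction
  6. Debug exception

                                                                《Linux内核深度解析》p. 405

A physical-memory error detected by ECC on a data access, when reported synchronously, is handled as a Data Abort, which is one of the synchronous exceptions.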

3 The physical-memory ECC error handling flow in kernel-5.4.18

3.1 Synchronous exception handling at Exception level 1

//arch/arm64/kernel/entry.S
el1_sync:
    kernel_entry 1
    mrs x1, esr_el1         // read the syndrome register
    lsr x24, x1, #ESR_ELx_EC_SHIFT  // exception class
    cmp x24, #ESR_ELx_EC_DABT_CUR   // data abort in EL1; ESR_ELx_EC_DABT_CUR is 0x25, binary 100101
    b.eq    el1_da
    cmp x24, #ESR_ELx_EC_IABT_CUR   // instruction abort in EL1
    b.eq    el1_ia
    cmp x24, #ESR_ELx_EC_SYS64      // configurable trap
    b.eq    el1_undef
    cmp x24, #ESR_ELx_EC_PC_ALIGN   // pc alignment exception
    b.eq    el1_pc
    cmp x24, #ESR_ELx_EC_UNKNOWN    // unknown exception in EL1
    b.eq    el1_undef
    cmp x24, #ESR_ELx_EC_BREAKPT_CUR    // debug exception in EL1
    b.ge    el1_dbg
    b   el1_inv
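To make the branch chain in el1_sync easier to follow, here is a rough C rendering of the same dispatch. It is only an illustrative model and not code from the kernel: the function el1_sync_target and enum el1_target are invented for this sketch. Of the constants, only 0x25 for ESR_ELx_EC_DABT_CUR is confirmed by the listing above; the other values are the architectural EC encodings as I understand them and should be checked against arch/arm64/include/asm/esr.h.

```c
#include <stdint.h>

#define ESR_ELx_EC_SHIFT        26
/* Assumed values, except DABT_CUR which the listing above confirms. */
#define ESR_ELx_EC_UNKNOWN      0x00
#define ESR_ELx_EC_SYS64        0x18
#define ESR_ELx_EC_IABT_CUR     0x21
#define ESR_ELx_EC_PC_ALIGN     0x22
#define ESR_ELx_EC_DABT_CUR     0x25
#define ESR_ELx_EC_BREAKPT_CUR  0x31

/* Hypothetical labels standing in for the assembly branch targets. */
enum el1_target { EL1_DA, EL1_IA, EL1_UNDEF, EL1_PC, EL1_DBG, EL1_INV };

/* Mirrors the cmp / b.eq sequence of el1_sync, in the same order. */
static enum el1_target el1_sync_target(uint32_t esr)
{
    uint32_t ec = esr >> ESR_ELx_EC_SHIFT;   /* lsr x24, x1, #ESR_ELx_EC_SHIFT */

    if (ec == ESR_ELx_EC_DABT_CUR)           /* data abort in EL1 */
        return EL1_DA;
    if (ec == ESR_ELx_EC_IABT_CUR)           /* instruction abort in EL1 */
        return EL1_IA;
    if (ec == ESR_ELx_EC_SYS64)              /* configurable trap */
        return EL1_UNDEF;
    if (ec == ESR_ELx_EC_PC_ALIGN)           /* pc alignment exception */
        return EL1_PC;
    if (ec == ESR_ELx_EC_UNKNOWN)            /* unknown exception in EL1 */
        return EL1_UNDEF;
    if (ec >= ESR_ELx_EC_BREAKPT_CUR)        /* b.ge: all debug classes */
        return EL1_DBG;
    return EL1_INV;                          /* anything else is invalid */
}
```

An ESR value whose EC field is 0x25, which is the case for a synchronous ECC error hit while running in the kernel, therefore goes to el1_da.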

3.2 The data abort handler: el1_da

3.3 The function that handles Synchronous external abort (SEA): do_sea()

4 Summary

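On an ARMv8 CPU running the linux-5.4.18 kernel, when the ECC of physical memory detects an uncorrectable multi-bit error, the action taken depends on whether the error occurred in user mode or in kernel mode:

  •  If the error occurred in user mode, the corresponding process is killed.
  •  If the error occurred in kernel mode, the result is a panic.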
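The two outcomes above can be modeled as one decision on the mode the CPU was in when the abort was taken. The sketch below is a simplified user-space model, not the kernel's do_sea(): the names sea_outcome, from_user_mode and decide_sea_outcome are hypothetical, and the mode test assumes that the low four bits of the saved PSTATE are 0 for EL0, which is how I understand the kernel's user_mode() check to work.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: mode field of the saved PSTATE, with 0 meaning EL0t. */
#define MODE_MASK  0xFu
#define MODE_EL0t  0x0u
#define MODE_EL1h  0x5u

/* Hypothetical outcomes matching the two bullets in the summary. */
enum sea_outcome {
    SEA_KILL_PROCESS,   /* error hit in user mode: the process is killed */
    SEA_KERNEL_PANIC    /* error hit in kernel mode: the kernel panics */
};

/* Stand-in for a user_mode(regs) style check on the saved PSTATE. */
static bool from_user_mode(uint64_t pstate)
{
    return (pstate & MODE_MASK) == MODE_EL0t;
}

/* Pick the outcome for an uncorrectable synchronous ECC error. */
static enum sea_outcome decide_sea_outcome(uint64_t pstate)
{
    return from_user_mode(pstate) ? SEA_KILL_PROCESS : SEA_KERNEL_PANIC;
}
```

Calling decide_sea_outcome(MODE_EL0t) selects the kill-the-process path, while decide_sea_outcome(MODE_EL1h) selects the panic path. How the process is actually killed (which signal is sent) and how the kernel reaches the panic are outside the scope of this sketch.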
