Notes on the RAS (Reliability, Availability and Serviceability) features of the Linux kernel

1 Introduction

Reliability, Availability and Serviceability (RAS) — The Linux Kernel documentation

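In fields such as servers and satellites, equipment must be highly stable, so software and hardware errors need to be detected and handled promptly. RAS features make it possible to detect hardware errors promptly.

RAS features require hardware support.

The RAS features I currently know of in the Linux kernel fall into the following categories:

  • EDAC: mainly used to detect physical memory and PCI hardware errors
  • APEI: ACPI-based RAS
  • RAS in the ARMv8 architecture: few CPUs use this feature; the only one I know of so far is the Phytium D2000V.
  • RAS for AMDGPU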

2 EDAC(Error Detection And Correction)

2.1 Introduction

The ``edac`` kernel module's goal is to detect and report hardware errors that occur within the computer system running under linux.
                                《<kernel_src>/Documentation/admin-guide/ras.rst》

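2.2 The EDAC core module: edac_core.ko

2.2.1 Obtaining hardware error information in interrupt or polling mode

The global variable edac_op_state selects interrupt or polling mode. Its value can be set through a module parameter, for example: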

drivers/edac/amd64_edac.c:3753:module_param(edac_op_state, int, 0444);
drivers/edac/x38_edac.c:523:module_param(edac_op_state, int, 0444);

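Polling mode is the default. In polling mode, the kernel creates a dedicated workqueue, edac-poller, to fetch hardware error information periodically.

2.2.2 Creating the dedicated workqueue: edac-poller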

edac_init();
    -> edac_workqueue_setup();
        -> alloc_ordered_workqueue("edac-poller", WQ_MEM_RECLAIM);

Correspondingly, a workqueue worker thread is visible on the running system:

# ps aux | grep edac-
root         124  0.0  0.0      0     0 ?        I<   10:09   0:00 [edac-poller]

2.2.3 Adding work items to the dedicated workqueue (edac-poller)

bool edac_queue_work(struct delayed_work *work, unsigned long delay)
{
    return queue_delayed_work(wq, work, delay);
}
EXPORT_SYMBOL_GPL(edac_queue_work);

2.2.4 Module parameters

# ls /sys/module/edac_core/parameters/
check_pci_errors  edac_mc_log_ue       edac_mc_poll_msec
edac_mc_log_ce    edac_mc_panic_on_ue  edac_pci_panic_on_pe
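
The parameters listed above are ordinary sysfs files, so they can also be read from a script. Below is a minimal Python sketch, assuming the parameters directory is present on the system; the helper name read_edac_params is my own and is not part of the kernel or of edac-utils.

```python
import os

EDAC_PARAM_DIR = "/sys/module/edac_core/parameters"

def read_edac_params(param_dir=EDAC_PARAM_DIR):
    """Return {parameter name: value string} for every readable file in param_dir."""
    params = {}
    if not os.path.isdir(param_dir):
        return params  # directory absent, e.g. EDAC is not enabled on this system
    for name in sorted(os.listdir(param_dir)):
        path = os.path.join(param_dir, name)
        try:
            with open(path) as f:
                params[name] = f.read().strip()
        except OSError:
            continue  # skip entries this user cannot read
    return params

if __name__ == "__main__":
    for name, value in read_edac_params().items():
        print(f"{name} = {value}")
```

On a machine without EDAC the function simply returns an empty dictionary.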

2.3 Using EDAC to obtain hardware ECC errors in physical memory

2.3.1 Overview of ECC

How ECC works

As mentioned on the previous section, ECC memory has extra bits to be
used for error correction. So, on 64 bit systems, a memory module
has 64 bits of *data width*, and 72 bits of *total width*. So, there are
8 bits extra bits to be used for the error detection and correction
mechanisms. Those extra bits are called *syndrome*.

So, when the cpu requests the memory controller to write a word with 
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
                                《<kernel_src>/Documentation/admin-guide/ras.rst》
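
To make the CE/UE distinction concrete, here is a toy SECDED sketch in Python. It is only an illustration under my own assumptions: it protects 4 data bits with a Hamming(7,4) code plus one overall parity bit (8 bits in total), whereas a real memory controller typically protects 64 data bits with 8 check bits and may use a different code. The function names encode and decode are hypothetical.

```python
def encode(data):
    """Encode a 4-bit value into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 are the classic
    Hamming positions (1, 2 and 4 are check bits; 3, 5, 6, 7 are data bits).
    """
    d = [(data >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the parity of the whole word even
    return code

def decode(code):
    """Return (data, status), where status is 'OK', 'CE' or 'UE'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2
    if syndrome == 0 and overall == 0:
        status = "OK"
    elif overall == 1:
        # Odd overall parity: a single bit flipped, and the syndrome is its
        # position (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        status = "UE"
    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status

word = encode(0b1011)
print(decode(word))   # no bit flipped: status is OK
word[5] ^= 1
print(decode(word))   # one bit flipped: corrected, status is CE
word[6] ^= 1
print(decode(word))   # two bits flipped: detected only, status is UE
```

As in the quoted text, the data word is returned even in the UE case; it is the status that tells the caller the data cannot be trusted.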

2.3.2 Data structure: struct mem_ctl_info

struct mem_ctl_info {
    ......
    /* pointer to edac checking routine */
    void (*edac_check) (struct mem_ctl_info * mci);
    ......
};

2.3.3 Creating the files under /sys/devices/system/edac/mc/ and the work item

edac_mc_add_mc_with_groups();
    -> edac_create_sysfs_mci_device();
    -> INIT_DELAYED_WORK(&mci->work, edac_mc_workq_function);

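2.3.4 Work item handler: edac_mc_workq_function()

edac_mc_workq_function();
    -> mci->edac_check(mci);      // driver-specific routine that reads the hardware errors
    -> edac_queue_work(&mci->work, msecs_to_jiffies(edac_mc_get_poll_msec()));    // re-queues itself, so it keeps running periodically

The handler above runs periodically; the period, in milliseconds, is given by the module parameter /sys/module/edac_core/parameters/edac_mc_poll_msec.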
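
The self-re-queuing pattern above can be modelled in a few lines of user-space Python. This is only a simulation with a virtual clock, not kernel code: the classes DelayedWorkQueue and FakeMci and all their members are hypothetical names I chose to mirror the call chain.

```python
import heapq

class DelayedWorkQueue:
    """Toy stand-in for the edac-poller workqueue, driven by a virtual clock."""

    def __init__(self):
        self.now_ms = 0
        self._pending = []  # heap of (due time in ms, sequence number, callback)
        self._seq = 0

    def queue_delayed_work(self, func, delay_ms):
        heapq.heappush(self._pending, (self.now_ms + delay_ms, self._seq, func))
        self._seq += 1

    def run_until(self, end_ms):
        """Run every work item that becomes due up to and including end_ms."""
        while self._pending and self._pending[0][0] <= end_ms:
            due, _, func = heapq.heappop(self._pending)
            self.now_ms = due
            func()
        self.now_ms = end_ms

class FakeMci:
    """Plays the role of struct mem_ctl_info with its edac_check callback."""

    def __init__(self, wq, poll_msec):
        self.wq = wq
        self.poll_msec = poll_msec
        self.check_times = []

    def edac_check(self):
        # A real driver would read the controller's error registers here.
        self.check_times.append(self.wq.now_ms)

    def workq_function(self):
        self.edac_check()
        # Re-arm the work item, as edac_mc_workq_function() does.
        self.wq.queue_delayed_work(self.workq_function, self.poll_msec)

wq = DelayedWorkQueue()
mci = FakeMci(wq, poll_msec=1000)
wq.queue_delayed_work(mci.workq_function, mci.poll_msec)
wq.run_until(3500)
print(mci.check_times)  # [1000, 2000, 3000]
```

Because each run schedules the next one, changing poll_msec takes effect from the next re-queue onwards, which matches how a single delayed work item gives periodic behaviour without a dedicated timer.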

2.3.5 Debugging

2.3.5.1 /sys/devices/system/edac/mc/

See 《<kernel_src>/Documentation/admin-guide/ras.rst》.
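
As a quick check, the per-controller error counters can be read from this directory. The sketch below assumes each memory controller appears as a directory mcN containing the ce_count and ue_count files described in ras.rst; the helper name read_mc_counts is my own.

```python
import glob
import os

EDAC_MC_DIR = "/sys/devices/system/edac/mc"

def read_mc_counts(mc_dir=EDAC_MC_DIR):
    """Return {"mc0": {"ce_count": int or None, "ue_count": int or None}, ...}."""
    counts = {}
    for path in sorted(glob.glob(os.path.join(mc_dir, "mc[0-9]*"))):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(path, name)) as f:
                    entry[name] = int(f.read().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing, unreadable or not a number
        counts[os.path.basename(path)] = entry
    return counts

if __name__ == "__main__":
    for mc, entry in read_mc_counts().items():
        print(mc, entry)
```

A non-zero ue_count is the case that usually deserves immediate attention; a slowly growing ce_count is more of a trend to watch.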

2.3.5.2 User-space tools: edac-util and edac-ctl

For usage, see the man pages (man edac-util and man edac-ctl).

2.3.6 A real-world example (Freescale MPC8572 processor)

For the DDR memory controllers of the MPC8572, see 《MPC8572E PowerQUICC™ III Integrated Host Processor Family Reference Manual》, page 9-1.

Device tree

//arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
    memory-controller@2000 {
        compatible = "fsl,mpc8572-memory-controller";
        reg = <0x2000 0x1000>;
        interrupts = <18 2 0 0>;
    };  

    memory-controller@6000 {
        compatible = "fsl,mpc8572-memory-controller";
        reg = <0x6000 0x1000>;
        interrupts = <18 2 0 0>;
    };

EDAC driver

fsl_mc_err_probe();    // drivers/edac/mpc85xx_edac.c
    -> mci->edac_check = fsl_mc_check;    // the key function that reads physical memory error information
    -> edac_mc_add_mc_with_groups(mci, fsl_ddr_dev_groups);

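2.4 Using EDAC to obtain PCI hardware errors

See the CSDN blog post "Linux下通过EDAC功能检测PCIE硬件错误" (Detecting PCIe hardware errors with EDAC on Linux).

2.5 Using EDAC to obtain errors from other kinds of hardware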

edac_device_add_device();
    -> edac_device_create_sysfs();
    -> edac_device_workq_setup();
        -> INIT_DELAYED_WORK(&edac_dev->work, edac_device_workq_function);

3 APEI(ACPI Platform Error Interface)

3.1 Introduction

APEI allows to report errors (for example from the chipset) to the operating system. This improves NMI handling especially. In addition it supports error serialization and error injection.
                                《<kernel_src>/drivers/acpi/apei/Kconfig》

ACPI Platform Error Interfaces (APEI), which provide a means for a computer platform to convey error information to OSPM.
APEI consists of four separate tables:

  • Error Record Serialization Table (ERST)
  • Boot Error Record Table (BERT)
  • Hardware Error Source Table (HEST)
  • Error Injection Table (EINJ)

                                《Advanced Configuration and Power Interface (ACPI) Specification》P793

3.2 APEI Generic Hardware Error Source(GHES)

Kernel config: CONFIG_ACPI_APEI_GHES

Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset). It works in so called "Firmware First" mode, that is, hardware errors are reported to firmware firstly, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware link can be checked by firmware to produce more valuable hardware error information for Linux.
                                《drivers/acpi/apei/Kconfig》

3.3 APEI PCIe AER logging/recovering support

Kernel config: CONFIG_ACPI_APEI_PCIEAER

PCIe AER errors may be reported via APEI firmware first mode. Turn on this option to enable the corresponding support.
                                《drivers/acpi/apei/Kconfig》

Debugging

/sys/kernel/debug/tracing/events/ras/aer_event/

3.4 APEI memory error recovering support

Kernel config: CONFIG_ACPI_APEI_MEMORY_FAILURE

Memory errors may be reported via APEI firmware first mode. Turn on this option to enable the memory recovering support.
                                《drivers/acpi/apei/Kconfig》

Debugging

/sys/kernel/debug/tracing/events/ras/mc_event/

3.5 APEI Error INJection (EINJ)

3.5.1 Introduction

Kernel config: CONFIG_ACPI_APEI_EINJ

EINJ provides a hardware error injection mechanism, it is mainly used for debugging and testing the other parts of APEI and some other RAS features.
                                《drivers/acpi/apei/Kconfig》

3.5.2 Error Injection Table

The Error Injection (EINJ) table provides a generic interface mechanism through which OSPM can inject hardware errors to the platform without requiring platform specific OSPM software. System firmware is responsible for building this table, which is made up of Injection Instruction entries.
                                《Advanced Configuration and Power Interface (ACPI) Specification》P832

3.5.3 Checking whether EINJ is supported

Check whether /sys/firmware/acpi/tables/EINJ exists.

go into BIOS setup to see if the BIOS has an option to enable error injection. Look for something called WHEA or similar. Often, you need to enable an ACPI5 support option prior, in order to see the APEI,EINJ,... functionality supported and exposed by the BIOS menu.
                                《Documentation/firmware-guide/acpi/apei/einj.rst》

3.5.4 /sys/kernel/debug/apei/einj/

Usage: see the article "服务器内存故障预测居然可以这样做" (roughly: "Server memory failure prediction can actually be done this way").
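
Before injecting anything it helps to know which error types the platform offers. The sketch below parses the text of the available_error_type file in this directory. The line format (a hexadecimal value, whitespace, then a description) and the sample text are assumptions based on the example in Documentation/firmware-guide/acpi/apei/einj.rst, and the function name is my own.

```python
def parse_available_error_types(text):
    """Map each error type value (int) to its description string."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue
        try:
            types[int(parts[0], 16)] = parts[1].strip()
        except ValueError:
            continue  # not a "<hex value> <description>" line
    return types

# Sample text in the assumed format; on a real system, read it from
# /sys/kernel/debug/apei/einj/available_error_type (requires root).
sample = (
    "0x00000002\tProcessor Uncorrectable non-fatal\n"
    "0x00000008\tMemory Correctable\n"
    "0x00000010\tMemory Uncorrectable non-fatal\n"
)
for value, description in parse_available_error_types(sample).items():
    print(hex(value), description)
```

Only a value reported here should be written to the error_type file when setting up an injection.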

3.6 APEI interrupt support on the ARMv8 architecture

APEI requires the equivalent of an SCI and an NMI on ARMv8. The SCI is used to notify the OSPM of errors that have occurred but can be corrected and the system can continue correct operation, even if possibly degraded. The NMI is used to indicate fatal errors that cannot be corrected, and require immediate attention.

Since there is no direct equivalent of the x86 SCI or NMI, arm64 handles these slightly differently. The SCI is handled as a high priority interrupt; given that these are corrected (or correctable) errors being reported, this is sufficient. The NMI is emulated as the highest priority interrupt possible. This implies some caution must be used since there could be interrupts at higher privilege levels or even interrupts at the same priority as the emulated NMI. In Linux, this should not be the case but one should be aware it could happen.
                                《<kernel_src>/Documentation/arm64/acpi_object_usage.rst》

3.7 Key function: ghes_do_proc()

3.8 Debugging

# ls /sys/kernel/debug/tracing/events/ras/ -l
total 0
drwxr-x--- 2 root root 0 May 13 10:09 aer_event
drwxr-x--- 2 root root 0 May 13 10:09 arm_event
-rw-r----- 1 root root 0 May 13 10:09 enable
drwxr-x--- 2 root root 0 May 13 10:09 extlog_mem_event
-rw-r----- 1 root root 0 May 13 10:09 filter
drwxr-x--- 2 root root 0 May 13 10:09 mc_event
drwxr-x--- 2 root root 0 May 13 10:09 memory_failure_event
drwxr-x--- 2 root root 0 May 13 10:09 non_standard_event
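
Each event directory listed above contains an enable file; writing 1 turns the tracepoint on and writing 0 turns it off. A minimal Python sketch, assuming debugfs is mounted at /sys/kernel/debug and the script runs as root; the helper name set_ras_event is my own.

```python
import os

TRACING_DIR = "/sys/kernel/debug/tracing"

def set_ras_event(event, enabled, tracing_dir=TRACING_DIR):
    """Enable or disable one ras trace event, e.g. "mc_event" or "aer_event".

    Returns the path of the enable file that was written.
    """
    path = os.path.join(tracing_dir, "events", "ras", event, "enable")
    with open(path, "w") as f:
        f.write("1" if enabled else "0")
    return path

# Example (needs root and a mounted debugfs, so it is left commented out):
# set_ras_event("mc_event", True)
```

Once an event is enabled, the records it produces can be read from the trace file in the same tracing directory.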

4 ARMv8 RAS

For the RAS System Architecture, see 《Arm Architecture Reference Manual for A-profile architecture》, page 11593.

Among the ARM processors I have worked with so far, only the Phytium D2000V uses the RAS features described in the ARMv8 manual.

5 AMDGPU RAS Support

https://www.kernel.org/doc/html/latest/gpu/amdgpu/ras.html
