Notes on Analyzing RAS (Reliability, Availability and Serviceability) Features in the Linux Kernel

u010936265

Last modified: 2024-05-14 14:48:48

First published: 2024-05-13 18:07:52

Article link: https://blog.csdn.net/u010936265/article/details/138770295

1 Introduction

Reference: "Reliability, Availability and Serviceability (RAS)", The Linux Kernel documentation

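In fields such as servers and satellites, equipment must be highly stable, so software and hardware errors need to be detected and handled promptly. RAS features can be used to detect hardware errors in time.

RAS features require hardware support.

The RAS features in the Linux kernel that I am aware of so far fall into the following categories:

EDAC: mainly used to detect physical memory and PCI hardware errors
APEI: ACPI-based RAS
ARMv8 architecture RAS: very few CPUs use this feature; so far the only one I know of is the Phytium D2000V.
AMDGPU RAS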

2 EDAC(Error Detection And Correction)

2.1 Introduction

The ``edac`` kernel module's goal is to detect and report hardware errors that occur within the computer system running under linux.
《<kernel_src>/Documentation/admin-guide/ras.rst》

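2.2 The EDAC core module: edac_core.ko

2.2.1 Obtaining hardware error information in interrupt or polling mode

The global variable edac_op_state controls whether interrupt mode or polling mode is used. Its value can be set through a module parameter, for example: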

drivers/edac/amd64_edac.c:3753:module_param(edac_op_state, int, 0444);
drivers/edac/x38_edac.c:523:module_param(edac_op_state, int, 0444);

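Polling mode is the default. In polling mode, the kernel uses a dedicated workqueue, edac-poller, to fetch hardware error information periodically.

2.2.2 Creating the dedicated workqueue: edac-poller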

edac_init();
    -> edac_workqueue_setup();
        -> alloc_ordered_workqueue("edac-poller", WQ_MEM_RECLAIM);

Correspondingly, a thread serving this workqueue is visible on the system:

# ps aux | grep edac-
root         124  0.0  0.0      0     0 ?        I<   10:09   0:00 [edac-poller]
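The same check can be scripted. The following is a minimal sketch, assuming a Linux /proc layout in which every task has a /proc/<pid>/comm file; find_threads is a name made up for this illustration. It matches on a substring, like the grep above, because the exact thread name may differ between kernel versions.

```python
from pathlib import Path


def find_threads(needle, proc_root=Path("/proc")):
    """Return sorted [(pid, comm)] for every task whose comm contains needle."""
    matches = []
    if not proc_root.is_dir():
        return matches  # not a Linux system, or /proc is not mounted
    for entry in proc_root.iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if needle in comm:
            matches.append((int(entry.name), comm))
    return sorted(matches)


if __name__ == "__main__":
    for pid, comm in find_threads("edac-"):
        print(pid, comm)
```

An empty result only means that no matching thread name was found; it does not by itself prove that EDAC is disabled.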

2.2.3 Adding work items to the dedicated workqueue (edac-poller)

bool edac_queue_work(struct delayed_work *work, unsigned long delay)
{
    return queue_delayed_work(wq, work, delay);
}
EXPORT_SYMBOL_GPL(edac_queue_work);

2.2.4 Module parameters

# ls /sys/module/edac_core/parameters/
check_pci_errors  edac_mc_log_ue       edac_mc_poll_msec
edac_mc_log_ce    edac_mc_panic_on_ue  edac_pci_panic_on_pe

2.3 Obtaining hardware ECC errors of physical memory through EDAC

2.3.1 Introduction to ECC

How ECC works

As mentioned on the previous section, ECC memory has extra bits to be
used for error correction. So, on 64 bit systems, a memory module
has 64 bits of *data width*, and 72 bits of *total width*. So, there are
8 bits extra bits to be used for the error detection and correction
mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.

So, when the cpu requests the memory controller to write a word with
*data width*, the memory controller calculates the *syndrome* in real time,
using Hamming code, or some other error correction code, like SECDED+,
producing a code with *total width* size. Such code is then written
on the memory modules.

At read, the *total width* bits code is converted back, using the same
ECC code used on write, producing a word with *data width* and a *syndrome*.
The word with *data width* is sent to the CPU, even when errors happen.

The memory controller also looks at the *syndrome* in order to check if
there was an error, and if the ECC code was able to fix such error.
If the error was corrected, a Corrected Error (CE) happened. If not, an
Uncorrected Error (UE) happened.
《<kernel_src>/Documentation/admin-guide/ras.rst》
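The quoted text describes 64 data bits protected by 8 extra bits. That is too wide to trace by hand, so the sketch below uses a toy SECDED code with 4 data bits and 4 check bits (Hamming(7,4) plus one overall parity bit). It is my own illustration of the principle, not the code that any real memory controller uses; encode, decode and flip are names invented here.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of 0/1).

    Index 0 is the overall parity bit. Indices 1..7 form a Hamming(7,4)
    codeword with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def flip(code, *positions):
    """Return a copy of code with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


def decode(code):
    """Return (nibble, status) with status "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_broken = sum(code) % 2    # 1 when an odd number of bits flipped
    if syndrome == 0 and not parity_broken:
        status = "OK"
    elif parity_broken:
        code[syndrome] ^= 1          # single-bit error; syndrome 0 means bit 0
        status = "CE"
    else:
        status = "UE"                # two bits flipped: detected, not fixable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
print(decode(word))              # clean read
print(decode(flip(word, 5)))     # one flipped bit
print(decode(flip(word, 5, 6)))  # two flipped bits
```

As in the quoted description, the data word is still returned in the uncorrected case; only the status tells the caller that it cannot be trusted.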

2.3.2 Data structure: struct mem_ctl_info

struct mem_ctl_info {
    ......
    /* pointer to edac checking routine */
    void (*edac_check) (struct mem_ctl_info * mci);
    ......
};

2.3.3 Creating the files under /sys/devices/system/edac/mc/ and the work item

edac_mc_add_mc_with_groups();
    -> edac_create_sysfs_mci_device();
    -> INIT_DELAYED_WORK(&mci->work, edac_mc_workq_function);

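2.3.4 Work item handler: edac_mc_workq_function()

edac_mc_workq_function();
    -> mci->edac_check(mci);      // driver-specific function that fetches the hardware errors
    -> edac_queue_work(&mci->work, msecs_to_jiffies(edac_mc_get_poll_msec()));    // re-queues itself, so it keeps running periodically

The code above runs periodically. The period, in milliseconds, is given by the module parameter /sys/module/edac_core/parameters/edac_mc_poll_msec.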
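The self re-arming pattern can be modelled in user space. The sketch below is a simplified simulation with a virtual clock, not kernel code: DelayedWorkQueue, MemCtl and run_demo are hypothetical names, and the locking and state checks that the real handler performs are left out.

```python
import heapq


class DelayedWorkQueue:
    """Toy stand-in for the edac-poller workqueue, driven by a virtual clock (ms)."""

    def __init__(self):
        self.now = 0
        self._seq = 0
        self._pending = []  # heap of (due_time_ms, sequence, callback)

    def queue_delayed_work(self, work, delay_ms):
        heapq.heappush(self._pending, (self.now + delay_ms, self._seq, work))
        self._seq += 1

    def run_until(self, deadline_ms):
        """Run every work item that is due at or before deadline_ms."""
        while self._pending and self._pending[0][0] <= deadline_ms:
            self.now, _, work = heapq.heappop(self._pending)
            work()
        self.now = deadline_ms


class MemCtl:
    """Miniature model of a memory controller with an edac_check hook."""

    def __init__(self, wq, edac_check, poll_msec):
        self.wq = wq
        self.edac_check = edac_check
        self.poll_msec = poll_msec
        self.checks = []  # virtual timestamps at which edac_check ran

    def workq_function(self):
        self.edac_check(self)  # driver-specific error check
        # Re-arm: this is what makes the check periodic.
        self.wq.queue_delayed_work(self.workq_function, self.poll_msec)


def run_demo(poll_msec=1000, duration_ms=3500):
    wq = DelayedWorkQueue()
    mci = MemCtl(wq, lambda m: m.checks.append(wq.now), poll_msec)
    wq.queue_delayed_work(mci.workq_function, poll_msec)  # first arming
    wq.run_until(duration_ms)
    return mci.checks


print(run_demo())  # → [1000, 2000, 3000]
```

The point of the design is that nothing outside the handler needs to schedule the next check: each run queues the following one with the current polling period.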

2.3.5 Debugging

2.3.5.1 /sys/devices/system/edac/mc/

See 《<kernel_src>/Documentation/admin-guide/ras.rst》.
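As a quick way to look at the counters in that directory, the following minimal sketch reads ce_count and ue_count for each controller. It assumes the layout described in ras.rst, with one mcN subdirectory per memory controller; read_mc_counters is a hypothetical helper name.

```python
from pathlib import Path

MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_mc_counters(mc_root=MC_ROOT):
    """Return {"mcN": {"ce_count": int, "ue_count": int}} per controller."""
    result = {}
    if not mc_root.is_dir():
        return result  # no EDAC memory controller driver is loaded
    for mc_dir in sorted(mc_root.glob("mc[0-9]*")):
        counters = {}
        for name in ("ce_count", "ue_count"):
            try:
                counters[name] = int((mc_dir / name).read_text())
            except (OSError, ValueError):
                counters[name] = None  # file missing or not a number
        result[mc_dir.name] = counters
    return result


if __name__ == "__main__":
    for mc, counters in read_mc_counters().items():
        print(mc, counters)
```

A ce_count that keeps growing indicates corrected errors accumulating on that controller, while a non-zero ue_count means uncorrected errors were seen.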

2.3.5.2 User-space tools: edac-util and edac-ctl

For usage, see the man pages (man edac-util and man edac-ctl).

2.3.6 Real-world example (Freescale MPC8572 processor)

For the DDR Memory Controllers description in the MPC8572 manual, see 《MPC8572E PowerQUICC™ III Integrated Host Processor Family Reference Manual》, page 9-1.

Device tree

//arch/powerpc/boot/dts/fsl/mpc8572si-post.dtsi
    memory-controller@2000 {
        compatible = "fsl,mpc8572-memory-controller";
        reg = <0x2000 0x1000>;
        interrupts = <18 2 0 0>;
    };  

    memory-controller@6000 {
        compatible = "fsl,mpc8572-memory-controller";
        reg = <0x6000 0x1000>;
        interrupts = <18 2 0 0>;
    };

EDAC driver

fsl_mc_err_probe();    //drivers/edac/mpc85xx_edac.c
    -> mci->edac_check = fsl_mc_check;    // key function that fetches the physical memory error information
    -> edac_mc_add_mc_with_groups(mci, fsl_ddr_dev_groups);

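2.4 Obtaining PCI hardware errors through EDAC

See the CSDN blog post "Linux下通过EDAC功能检测PCIE硬件错误" (Detecting PCIe hardware errors through EDAC on Linux).

2.5 Obtaining errors from other types of hardware through EDAC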

edac_device_add_device();
    -> edac_device_create_sysfs();
    -> edac_device_workq_setup();
        -> INIT_DELAYED_WORK(&edac_dev->work, edac_device_workq_function);

3 APEI(ACPI Platform Error Interface)

3.1 Introduction

APEI allows to report errors (for example from the chipset) to the operating system. This improves NMI handling especially. In addition it supports error serialization and error injection.
《<kernel_src>/drivers/acpi/apei/Kconfig》

ACPI Platform Error Interfaces (APEI), which provide a means for a computer platform to convey error information to OSPM.
APEI consists of four separate tables:

Error Record Serialization Table (ERST)
Boot Error Record Table (BERT)
Hardware Error Source Table (HEST)
Error Injection Table (EINJ)

《Advanced Configuration and Power Interface (ACPI) Specification》P793

3.2 APEI Generic Hardware Error Source(GHES)

Kernel config: CONFIG_ACPI_APEI_GHES

Generic Hardware Error Source provides a way to report platform hardware errors (such as that from chipset). It works in so called "Firmware First" mode, that is, hardware errors are reported to firmware firstly, then reported to Linux by firmware. This way, some non-standard hardware error registers or non-standard hardware link can be checked by firmware to produce more valuable hardware error information for Linux.
《drivers/acpi/apei/Kconfig》

3.3 APEI PCIe AER logging/recovering support

Kernel config: CONFIG_ACPI_APEI_PCIEAER

PCIe AER errors may be reported via APEI firmware first mode. Turn on this option to enable the corresponding support.
《drivers/acpi/apei/Kconfig》

Debugging

/sys/kernel/debug/tracing/events/ras/aer_event/

3.4 APEI memory error recovering support

Kernel config: CONFIG_ACPI_APEI_MEMORY_FAILURE

Memory errors may be reported via APEI firmware first mode. Turn on this option to enable the memory recovering support.
《drivers/acpi/apei/Kconfig》

Debugging

/sys/kernel/debug/tracing/events/ras/mc_event/

3.5 APEI Error INJection (EINJ)

3.5.1 Introduction

Kernel config: CONFIG_ACPI_APEI_EINJ

EINJ provides a hardware error injection mechanism, it is mainly used for debugging and testing the other parts of APEI and some other RAS features.
《drivers/acpi/apei/Kconfig》

3.5.2 Error Injection Table

The Error Injection (EINJ) table provides a generic interface mechanism through which OSPM can inject hardware errors to the platform without requiring platform specific OSPM software. System firmware is responsible for building this table, which is made up of Injection Instruction entries.
《Advanced Configuration and Power Interface (ACPI) Specification》P832

3.5.3 Checking whether EINJ is supported

Check whether /sys/firmware/acpi/tables/EINJ exists.

go into BIOS setup to see if the BIOS has an option to enable error injection. Look for something called WHEA or similar. Often, you need to enable an ACPI5 support option prior, in order to see the APEI,EINJ,... functionality supported and exposed by the BIOS menu.
《Documentation/firmware-guide/acpi/apei/einj.rst》
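The existence check is easy to script. The following is a minimal sketch; besides EINJ it also looks for the other three APEI tables named in section 3.1, on the assumption that the firmware exposes them in the same directory. apei_tables_present is a hypothetical helper name.

```python
from pathlib import Path

# Table names from the ACPI specification excerpt in section 3.1.
APEI_TABLES = ("ERST", "BERT", "HEST", "EINJ")


def apei_tables_present(tables_dir=Path("/sys/firmware/acpi/tables")):
    """Return {table_name: True/False} depending on whether the file exists."""
    return {name: (tables_dir / name).exists() for name in APEI_TABLES}


if __name__ == "__main__":
    for name, present in apei_tables_present().items():
        print(f"{name}: {'present' if present else 'missing'}")
```

If EINJ is reported as missing, the BIOS setting mentioned in the quote above is the first thing to look at.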

3.5.4 /sys/kernel/debug/apei/einj/

Usage: see the article "服务器内存故障预测居然可以这样做" (roughly, "Server memory failure prediction can actually be done this way").
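Before injecting anything, it helps to know which error types the platform offers. According to einj.rst, the available_error_type file in this directory lists one hexadecimal value and one description per line. The sketch below only parses that listing and injects nothing; parse_available_error_types is a hypothetical helper name, and SAMPLE is illustrative text in that format, not output from a real machine.

```python
from pathlib import Path

EINJ_DIR = Path("/sys/kernel/debug/apei/einj")

# Illustrative sample in the documented "value<TAB>description" format.
SAMPLE = "0x00000008\tMemory Correctable\n0x00000010\tMemory Uncorrectable non-fatal\n"


def parse_available_error_types(text):
    """Parse 'hex-value description' lines into {int_value: description}."""
    types = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        try:
            types[int(parts[0], 16)] = parts[1].strip()
        except ValueError:
            continue  # first field was not a hexadecimal number
    return types


if __name__ == "__main__":
    try:
        text = (EINJ_DIR / "available_error_type").read_text()
    except OSError:
        text = SAMPLE  # debugfs not mounted, not root, or EINJ unsupported
    for value, description in parse_available_error_types(text).items():
        print(f"{value:#010x}  {description}")
```

Reading this file is harmless. Writing to the other files in the directory triggers a real hardware error injection, so that step should be done only on a test machine.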

3.6 APEI interrupt support on the ARMv8 architecture

APEI requires the equivalent of an SCI and an NMI on ARMv8. The SCI is used to notify the OSPM of errors that have occurred but can be corrected and the system can continue correct operation, even if possibly degraded. The NMI is used to indicate fatal errors that cannot be corrected, and require immediate attention.

Since there is no direct equivalent of the x86 SCI or NMI, arm64 handles these slightly differently. The SCI is handled as a high priority interrupt; given that these are corrected (or correctable) errors being reported, this is sufficient. The NMI is emulated as the highest priority interrupt possible. This implies some caution must be used since there could be interrupts at higher privilege levels or even interrupts at the same priority as the emulated NMI. In Linux, this should not be the case but one should be aware it could happen.
《<kernel_src>/Documentation/arm64/acpi_object_usage.rst》

3.7 Key function: ghes_do_proc()

3.8 Debugging

# ls /sys/kernel/debug/tracing/events/ras/ -l
total 0
drwxr-x--- 2 root root 0 May 13 10:09 aer_event
drwxr-x--- 2 root root 0 May 13 10:09 arm_event
-rw-r----- 1 root root 0 May 13 10:09 enable
drwxr-x--- 2 root root 0 May 13 10:09 extlog_mem_event
-rw-r----- 1 root root 0 May 13 10:09 filter
drwxr-x--- 2 root root 0 May 13 10:09 mc_event
drwxr-x--- 2 root root 0 May 13 10:09 memory_failure_event
drwxr-x--- 2 root root 0 May 13 10:09 non_standard_event
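Each event directory in this listing can be switched on and off through its enable file, which is the usual ftrace convention. The following is a minimal sketch, assuming debugfs is mounted at /sys/kernel/debug and the script runs as root; list_ras_events and set_ras_event are hypothetical helper names.

```python
from pathlib import Path

RAS_EVENTS = Path("/sys/kernel/debug/tracing/events/ras")


def list_ras_events(events_dir=RAS_EVENTS):
    """Return the sorted names of the event subdirectories."""
    if not events_dir.is_dir():
        return []  # tracing not available or insufficient privileges
    return sorted(p.name for p in events_dir.iterdir() if p.is_dir())


def set_ras_event(event, enabled, events_dir=RAS_EVENTS):
    """Write 1 or 0 to <events_dir>/<event>/enable and return that path."""
    enable_file = events_dir / event / "enable"
    enable_file.write_text("1" if enabled else "0")
    return enable_file


if __name__ == "__main__":
    # Listing only; call set_ras_event("mc_event", True) to start tracing.
    try:
        print(list_ras_events())
    except OSError:
        print("cannot read", RAS_EVENTS, "(root privileges are usually required)")
```

Once an event such as mc_event is enabled, the recorded entries should be readable from the trace and trace_pipe files under /sys/kernel/debug/tracing/.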