CXL RAS

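Preface

This is my first encounter with this chapter, so what follows is still only a plain machine translation of the specification text, without deeper analysis. I will update it as my understanding improves.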

12.0 RAS (Reliability, Availability and Serviceability)

CXL RAS capabilities are built on top of PCI Express. Additional capabilities are introduced to address cache coherency and memory as listed below.

12.1 Supported RAS Features

The table below lists the RAS features supported by CXL and their applicability to CXL.io vs. CXL.cache and CXL.mem.

12.2 CXL Error Handling

As shown in Figure 180, CXL can simultaneously carry three protocols: CXL.io, CXL.cache, and CXL.mem. CXL.io carries PCIe-like semantics and must be supported by all CXL Endpoints. All RAS capabilities must address all of these protocols and usages. For details of the CXL architecture and all protocols, please refer to the other sections in this document.

Figure 180 below is an illustration of CXL and the focus areas for CXL RAS. Namely, Link & protocol RAS, which applies to the CXL component to component communication mechanism and Device RAS which applies exclusively to the device itself. All CXL protocol errors are reflected to the OS via PCIe AER mechanisms as “Correctable Internal Error” (CIE) or “Uncorrectable Internal Error” (UIE). Errors may also be reflected to Platform software if so configured.

Referring to Figure 180, the CXL 1.1 Host/Root Complex is located on the north side and contains all the usual error handling mechanisms such as Core error handling (e.g. MCA architecture), PCIe AER, RCEC and other platform level error reporting and handling mechanisms. CXL.mem and CXL.cache protocol errors encountered by the device are communicated to the CPU across CXL.io to be logged in PCIe AER registers. The following sections will focus on the link layer and transaction layer error handling mechanisms as well as CXL device error handling.

Errors detected by CXL 2.0 ports are escalated and reported using standard PCIe error reporting mechanisms over CXL.io as UIE/CIE.

12.2.1 Protocol and Link Layer Error Reporting

Protocol and Link errors are detected and communicated to the Host where they can be exposed and handled. There are no error pins connecting CXL devices to the Host. Errors are communicated between the Host and the CXL device via messages over CXL.io.

12.2.1.1 CXL 1.1 Downstream Port (DP) Detected Errors

Errors detected by the CXL 1.1 DP are escalated and reported via the Root Complex error reporting mechanisms as UIE/CIE. The various signaling and logging steps are listed below and illustrated in Figure 181.

  1. DPA CXL.io detected errors are logged in the local AER Extended Capability in the DPA RCRB. Software must ensure that the Root Port Control register in the DPA AER Extended Capability is not configured to generate interrupts.
  2. DPA CXL.cache and CXL.mem log errors in the CXL RAS Capability Structure.
  3. DPA CXL.cache, CXL.mem, or CXL.io sends an error message to the RCEC.
  4. The RCEC logs UIE/CIE.
  5. The RCEC generates an MSI if enabled.

OS error handler may begin by inspecting the RCEC AER Extended Capability and follow PCI Express rules to discover the source of the error. Platform Software Error Handler may interrogate the Platform specific error logs in addition to the error logs defined in PCI Express Base Specification and this specification.

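The signaling and logging steps for DP-detected errors can be sketched as a toy model. This is only an illustration under stated assumptions: the class names `Rcec` and `DownstreamPort`, their fields, and the string severities are hypothetical and do not come from the specification or any real driver API.

```python
from dataclasses import dataclass, field


@dataclass
class Rcec:
    """Hypothetical model of a Root Complex Event Collector."""
    msi_enabled: bool = True
    aer_log: list = field(default_factory=list)
    msi_count: int = 0

    def receive(self, source: str, severity: str) -> None:
        self.aer_log.append((source, severity))  # step 4: RCEC logs UIE/CIE
        if self.msi_enabled:                     # step 5: MSI only if enabled
            self.msi_count += 1


@dataclass
class DownstreamPort:
    """Hypothetical model of a CXL 1.1 DP (all names are illustrative)."""
    rcec: Rcec
    aer_log: list = field(default_factory=list)  # local AER Extended Capability
    ras_log: list = field(default_factory=list)  # CXL RAS Capability Structure

    def detect(self, protocol: str, severity: str) -> None:
        if severity not in ("CIE", "UIE"):
            raise ValueError(f"unknown severity: {severity}")
        if protocol == "CXL.io":
            self.aer_log.append(severity)              # step 1: log in local AER
        elif protocol in ("CXL.cache", "CXL.mem"):
            self.ras_log.append((protocol, severity))  # step 2: log in RAS capability
        else:
            raise ValueError(f"unknown protocol: {protocol}")
        self.rcec.receive(protocol, severity)          # step 3: message to RCEC


rcec = Rcec(msi_enabled=True)
dp = DownstreamPort(rcec)
dp.detect("CXL.mem", "UIE")
dp.detect("CXL.io", "CIE")
```

In this sketch the RCEC ends up with one entry per detected error, which is where an OS error handler would start its inspection.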

12.2.1.2 CXL 1.1 Upstream Port (UP) Detected Errors

Errors detected by the CXL 1.1 UP are also escalated and reported via the Root Complex Event Collector. The various signaling and logging steps are listed below and illustrated in Figure 182.

  1. If the CXL.cache or CXL.mem block in UPZ detects a protocol or link error, it shall log it in the CXL RAS Capability Structure (Section 8.2.5.9).
  2. The UP RCRB shall not implement the AER Extended Capability.
  3. The UP sends an error message to all CXL.io Functions that are affected by this error. (This example shows a device with a single function.) The message must include all the details the CXL.io Function needs for constructing an AER record.
  4. The CXL.io Functions log the received message in their respective AER Extended Capability.
  5. Each affected CXL.io Function sends an ERR_ message to UPZ with its own Requester ID.
  6. The UP forwards this error message across the Link without logging it.
  7. The DP forwards the error message to the RCEC.
  8. The RCEC logs the error and signals an interrupt if enabled, in accordance with the PCIe Base Specification.
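As a rough illustration of the UP flow above, the following sketch fans an error out to every affected CXL.io Function and has each Function report with its own Requester ID. The names `IoFunction` and `report_up_error` are hypothetical, and the Link, the DP, and the RCEC are reduced to plain Python lists.

```python
from dataclasses import dataclass, field


@dataclass
class IoFunction:
    """Hypothetical CXL.io Function with its own AER log."""
    requester_id: int
    aer_log: list = field(default_factory=list)


def report_up_error(error: str, affected: list, up_ras_log: list, rcec_log: list) -> None:
    """Walk one UP-detected error through the eight steps (illustrative only)."""
    up_ras_log.append(error)                # step 1: log in CXL RAS Capability Structure
    # step 2: the UP RCRB has no AER Extended Capability, so there is no UP AER log here
    for fn in affected:                     # step 3: notify every affected Function
        fn.aer_log.append(error)            # step 4: Function logs in its own AER
        err_msg = (fn.requester_id, error)  # step 5: ERR_ message with own Requester ID
        # steps 6-7: UP and DP forward the message without logging it
        rcec_log.append(err_msg)            # step 8: RCEC logs the error


functions = [IoFunction(0x0100), IoFunction(0x0101)]
up_ras_log, rcec_log = [], []
report_up_error("CXL.mem link error", functions, up_ras_log, rcec_log)
```

With two affected Functions, the RCEC log in this model receives two messages, one per Requester ID.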

12.2.1.3 CXL 1.1 RCiEP Detected Errors

Errors detected by the CXL 1.1 RCiEP are also escalated and reported via the Root Complex Event Collector. The various signaling and logging steps are listed below and also illustrated in Figure 183.

  1. CXL.cache (or CXL.mem) notifies all affected CXL.io Functions of the error.
  2. All affected CXL.io Functions log UIE/CIE in their respective AER Extended Capability.
  3. The CXL.io Functions generate a PCIe ERR_ message on the Link with Tag = 0.
  4. The DP forwards the ERR_ message to the RCEC.
  5. The RCEC logs UIE/CIE and generates an MSI if enabled, in accordance with the PCIe Base Specification.

12.2.2 CXL 2.0 Root Ports, Downstream Switch Ports, and Upstream Switch Ports

Errors detected by these ports are escalated and reported using PCIe error reporting mechanisms as UIE/CIE.

OS error handler may begin by inspecting the Root Port AER Extended Capability and follow PCI Express rules to discover the source of the error. Platform Software Error Handler may interrogate the Platform specific error logs in addition to the error logs defined in PCI Express Base Specification and this specification.

12.2.3 CXL Device Error Handling

Whenever a CXL device returns data that is either known to be bad or suspect, it must ensure the consumer of the data is made aware of the nature of the data either at the time of consumption or prior to the consumption of data. This allows the consumer to take the appropriate containment actions.

CXL defines two containment mechanisms - poison and viral

  1. Poison: Return data on CXL.io and CXL.cachemem may be tagged as poisoned.
  2. Viral: CXL.cachemem supports viral, which is generally used to indicate more severe error conditions at the device. See section . Any data returned by a device on CXL.cachemem after it has communicated Viral is considered suspect even if it is not explicitly poisoned.

A device must set the MetaField to No-op in CXL.cachemem return response when the MetaData is suspect.

If a CXL component is not in the Viral condition, it shall poison all data responses on CXL interface if the data being returned is known to be bad or suspect.

If Viral is enabled and a CXL component is in the Viral condition, it is recommended that the component not poison the subsequent data responses on CXL.cachemem interface to avoid error pollution.

The Host may send poisoned data to the CXL connected device. How the CXL device responds to Poison is device specific but must follow PCIe guidelines. The device must consciously make a decision about what to make of poisoned data. In some cases, simply ignoring poisoned data may lead to SDC (Silent Data Corruption). A CXL 2.0 device is required to keep track of any poisoned data it receives at a 64-byte granularity.
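One way to picture poison tracking at 64-byte granularity is a set of poisoned cache-line indices. This is a minimal sketch, not how any particular device implements it: the `PoisonTracker` class is hypothetical, and clearing poison when a line is overwritten with good data is an assumption made for the example.

```python
CACHE_LINE = 64  # tracking granularity in bytes


class PoisonTracker:
    """Hypothetical per-device record of poisoned 64-byte lines."""

    def __init__(self) -> None:
        self._poisoned = set()  # indices of poisoned cache lines

    def record_poisoned_write(self, addr: int, length: int = CACHE_LINE) -> None:
        # Mark every 64-byte line touched by the poisoned write.
        first = addr // CACHE_LINE
        last = (addr + length - 1) // CACHE_LINE
        self._poisoned.update(range(first, last + 1))

    def record_clean_write(self, addr: int) -> None:
        # Assumption: a full-line write of good data clears the poison.
        self._poisoned.discard(addr // CACHE_LINE)

    def is_poisoned(self, addr: int) -> bool:
        return addr // CACHE_LINE in self._poisoned


tracker = PoisonTracker()
tracker.record_poisoned_write(0x1040)
```

A read response for any address inside the line starting at 0x1040 would then be returned with poison, while neighboring lines stay clean.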

Any device errors that cannot be handled with Poison indication shall be signaled by the device back to the Host as messages since there are no error pins. To that end, Table 224 below shows a summary of the error types, their mappings and error reporting guidelines for devices that do not implement Memory Error Logging and Signaling Enhancements (Section 12.2.3.2).

For devices that implement Memory Error Logging and Signaling Enhancements, Section 12.2.3.2 describes how memory errors are logged and signaled. Such devices should follow Table 224 for dealing with all non-memory errors.

12.2.3.1 CXL.mem and CXL.cache Errors

If a demand access to memory results in an uncorrected data error, the CXL device must return data with poison. The requester (processor core or a peer device) is responsible for dealing with the poison indication. The CXL device should not signal an uncorrected error along with the poison. If the processor core consumes the poison, the error will be logged and signaled by the Host.

Any non-demand uncorrected errors detected by a CXL 1.1 device (e.g., by memory scrub logic in the CXL device memory controller) will be signaled to the device driver via device MSI or MSI-X. Any corrected memory errors will also be signaled to the device driver via device MSI or MSI-X. The driver may choose to deallocate memory pages with repeated errors. Neither the platform firmware nor the OS deals with these errors directly. A CXL 1.1 device may implement the capabilities described in Section 12.2.3.2, in which case a device driver is not required.

If a CXL 2.0 component is not able to positively decode a CXL.mem address, the handling is described in Section 8.2.5.12.2. If a component does not implement HDM Decoders (Section 8.2.5.12), it shall drop such a write transaction and return all 1s response to such a read transaction.

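The fallback behavior for a component without HDM Decoders (drop the write, return all 1s for the read) can be sketched as follows. The class `ComponentWithoutHdmDecoders` is hypothetical, and modeling "positive decode" as a single fixed address window is a simplifying assumption for the example.

```python
class ComponentWithoutHdmDecoders:
    """Hypothetical component that owns one fixed address window."""

    def __init__(self, base: int, size: int) -> None:
        self.base = base
        self.size = size
        self.mem = bytearray(size)

    def _decodes(self, addr: int, length: int) -> bool:
        # Assumption: an access decodes only if it lies fully inside the window.
        return self.base <= addr and addr + length <= self.base + self.size

    def write(self, addr: int, data: bytes) -> None:
        if not self._decodes(addr, len(data)):
            return  # undecoded write: silently dropped
        off = addr - self.base
        self.mem[off:off + len(data)] = data

    def read(self, addr: int, length: int = 64) -> bytes:
        if not self._decodes(addr, length):
            return b"\xff" * length  # undecoded read: all 1s response
        off = addr - self.base
        return bytes(self.mem[off:off + length])


component = ComponentWithoutHdmDecoders(base=0x1000, size=0x100)
component.write(0x2000, b"\x01" * 64)  # outside the window, so dropped
```

A reader of an undecoded address therefore sees 0xFF bytes rather than stale or random data.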

12.2.3.2 Memory Error Logging and Signaling Enhancements

Errors in memory may be encountered during a demand access or independent of any request issued to it and it is important to log enough data about such errors to enable the use of host platform-level RAS features, such as page retirement, without dependence on a driver.

In addition, general device events unrelated to the media, including changes in the device's health or environmental conditions detected by the device, need to be reported using the same general event logging facility.

Figure 184 illustrates a use case where the two methods of signaling supported by a CXL.mem device, VDM and MSI/MSI-X, are used by a host to implement Firmware-first and OS-first error handling.

A CXL device that supports the Memory Error Logging and Signaling Enhancements capability must log such errors locally and expose the error log to system software via the MMIO Mailbox (Section 8.2.8.4.3). Reading an error record from the mailbox will not automatically result in deletion of the error record on the device. An explicit clear operation is required to delete an error record from the device. To support error record access and deletion, the device shall implement the Get Event Records and Clear Event Records commands.

Both operations must execute atomically. Furthermore, all writes or updates to the error records by the CXL.mem device must also execute atomically.

Using these two operations, a host can retrieve an error record as follows:

  1. The host reads a number of event records using the Get Event Records command.
  2. When complete, the host clears the event records from the device with the Clear Event Records command, supplying one or more event record handles to clear.

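The read-then-clear sequence above can be sketched with a small in-memory model. Everything here is hypothetical: `EventLog`, its two methods, and the integer handles only mimic the shape of the Get Event Records and Clear Event Records commands and are not the real mailbox interface or payload layout.

```python
class EventLog:
    """Hypothetical device-side event log keyed by record handle."""

    def __init__(self, records: list) -> None:
        self._records = dict(enumerate(records, start=1))  # handle -> record

    def get_event_records(self, max_records: int) -> list:
        # Reading does not delete anything on the device.
        return list(self._records.items())[:max_records]

    def clear_event_records(self, handles: list) -> None:
        # Deletion requires an explicit clear with the record handles.
        for handle in handles:
            self._records.pop(handle, None)


def drain(log: EventLog, batch: int = 2) -> list:
    """Host-side loop: read a batch, then clear exactly what was read."""
    collected = []
    while True:
        batch_records = log.get_event_records(batch)
        if not batch_records:
            break
        collected.extend(record for _, record in batch_records)
        log.clear_event_records([handle for handle, _ in batch_records])
    return collected


log = EventLog(["media error A", "media error B", "health event C"])
records = drain(log)
```

Clearing only the handles that were actually read keeps records that arrive between the two commands from being lost in this model.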

The error records will be owned by the host firmware or OS so that all logged errors are made available to the host to support platform-level RAS features.

Error records stored on the CXL device must be sticky across device resets. The records must not be initialized or modified by a hot reset or FLR or Enhanced FLR. Devices that consume auxiliary power must preserve the error records when auxiliary power consumption is enabled. In these cases, the error records are neither initialized nor modified by hot, warm, or cold reset.

12.2.3.3 CXL Device Error Handling Flows

CXL 1.1 device errors may be sourced from a Root Port (RP) or an Endpoint (RCiEP). To differentiate them, RCiEP-sourced errors shall use a tag value of zero, whereas RP-sourced errors shall use a non-zero tag value. CXL 2.0 device errors may be sourced from a CXL 2.0 Endpoint (EP).

Errors detected by the CXL device shall be communicated to the host via PCIe Error messages across the CXL.io link. Errors that are not related to any specific Function within the device (Non-Function errors) and not reported via MSI/MSI-X are reported to the Host via PCIe error messages where they can be escalated to the platform. The UP reports non-function errors to all EPs/RCiEPs where they are logged. Each EP/RCiEP reports the non-function specific errors to the host via PCIe error messages. Software should be aware that even though an RCiEP does not have a software-visible link, it may still log link-related errors.

At most one error message of a given severity is generated for a multi-function device. The error message must include the Requester ID of a function that is enabled to send the error message. Error messages with the same Requester ID may be merged for different errors with the same severity. No error message is sent if no function is enabled to do so. If different functions are enabled to send error messages of different severity, at most one error of each severity level is sent. An error generated by a CXL 1.1 RCiEP will be sent to the corresponding RCEC. Each RCiEP must be associated with no more than one RCEC. An error generated by a CXL 2.0 component will be logged in the CXL 2.0 Root Port.
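The "at most one message per severity" rule for a multi-function device can be illustrated with a short helper. The function `build_error_messages` is hypothetical, and choosing the first function that is enabled for a severity is an assumption of this sketch; the rule only requires that the Requester ID belong to some enabled function.

```python
def build_error_messages(detected_severities: list, functions: list) -> list:
    """Return at most one (severity, requester_id) message per severity.

    functions is a list of (requester_id, enabled_severities) pairs.
    """
    messages = []
    for severity in sorted(set(detected_severities)):  # merge same-severity errors
        sender = next(
            (rid for rid, enabled in functions if severity in enabled),
            None,
        )
        if sender is not None:  # no enabled function means no message is sent
            messages.append((severity, sender))
    return messages


functions = [
    (0x0100, {"ERR_COR"}),
    (0x0101, {"ERR_COR", "ERR_FATAL"}),
]
messages = build_error_messages(["ERR_COR", "ERR_COR", "ERR_FATAL"], functions)
```

Here the two correctable errors collapse into a single ERR_COR message, and the fatal error is reported by the only function enabled for ERR_FATAL.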

12.3 CXL Link Down Handling

There is no expectation of a graceful Link Down. A Link Down condition will most likely result in a timeout in the Host since it is quite possible that there are transactions headed to or from the CXL device that will end up not making progress.

Software may configure a CXL Downstream Port to trigger Downstream Port Containment (DPC) upon certain classes of errors. eDPC may enable predictable containment in certain scenarios but would generally not be a recoverable event.

12.4 CXL Viral Handling

CXL links and CXL devices are expected to be Viral compliant. Viral is an error containment mechanism. A platform must choose to enable Viral at boot time. The Host implementation of Viral allows the platform to enable the Viral feature by writing into a register. Similarly, a BIOS accessible control register on the device is written to enable Viral behavior (both receiving and sending) on the device. Viral support capability and control for enabling are reflected in DVSEC.

When enabled, a Viral indication is generated whenever an Uncorrected_Fatal error is detected. Viral is not a replacement for existing error reporting mechanisms. Instead, it serves as an additional error containment mechanism. The detector of the error is responsible for reporting the error through AER and generating a Viral indication. Any entity that is capable of reporting Uncorrected_Fatal errors must also be capable of generating a Viral indication.

CXL.mem and CXL.cache come enabled with the Viral concept. Viral needs to be communicated in both directions. When Viral is enabled and the Host runs into a Viral condition, it shall communicate Viral across CXL.mem and/or CXL.cache to all downstream components. The Viral indication must arrive before any data that may have been affected by the error (general Viral requirement). If the host receives a Viral indication from any CXL components, it shall propagate Viral to all downstream components.

All types of Conventional Resets shall clear viral condition. CXL Reset shall have no effect on viral condition. FLR shall have no effect on viral condition.

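The reset rules for the viral condition reduce to a small lookup. In this sketch the enum `Reset` and the function `viral_after_reset` are hypothetical names; treating cold, warm, and hot reset as the Conventional Resets follows PCIe terminology.

```python
from enum import Enum, auto


class Reset(Enum):
    COLD = auto()
    WARM = auto()
    HOT = auto()
    CXL_RESET = auto()
    FLR = auto()


# PCIe Conventional Reset covers cold, warm, and hot reset.
CONVENTIONAL_RESETS = {Reset.COLD, Reset.WARM, Reset.HOT}


def viral_after_reset(viral: bool, reset: Reset) -> bool:
    """Return the viral condition that remains after the given reset."""
    if reset in CONVENTIONAL_RESETS:
        return False  # any Conventional Reset clears the viral condition
    return viral      # CXL Reset and FLR leave it unchanged
```

So a component that is viral stays viral across an FLR or CXL Reset and only recovers through a Conventional Reset.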

12.4.2 Device Considerations

The device’s reaction to Viral is going to be device specific but the device is expected to take error containment actions consistent with Viral requirements. Chiefly, it must prevent bad data from being committed to permanent storage. If the device is connected to any permanent storage or to an external interface that may be connected to permanent storage, then the device is required to self-isolate in order to be Viral compliant. This means that the device has to take containment actions without depending on help from the Host.

The containment actions taken by the device must not prevent the Host from making forward progress. This is important for diagnostic purposes as well as avoiding error pollution (e.g., withholding data for read transactions to device memory may cause cascading timeouts in the Hosts). Therefore, on Viral detection, in addition to the containment requirements, the device shall:

  • Drop writes to the persistent HDM ranges on the device or connected to the device.
  • Completion response must always be returned.
  • Set MetaField to No-op in all read responses.
  • Fail the Set Shutdown State command (defined in Section 8.2.9.5.3.5) with an Internal Error when attempting to change the state from “dirty” to “clean”.
  • Not transition the Shutdown State to “clean” after a GPF flow.
  • Commit to the persistent HDM ranges any writes that were completed over the CXL interface before receipt of the viral.
  • Keep responding to snoops.
  • Complete pending writes to Host memory.
  • Complete all reads and writes to Device volatile memory

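A few of the listed requirements (drop writes to persistent ranges, always return a completion, set MetaField to No-op on reads, keep servicing volatile memory) can be sketched as a toy device model. The classes `Response` and `ViralAwareDevice` and the region names are hypothetical, and the non-viral MetaField value is left as `None` because it is outside the scope of the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    completed: bool
    data: Optional[bytes] = None
    meta_field: Optional[str] = None


class ViralAwareDevice:
    """Hypothetical device with one persistent and one volatile region."""

    def __init__(self, persistent_size: int, volatile_size: int) -> None:
        self._regions = {
            "persistent": bytearray(persistent_size),
            "volatile": bytearray(volatile_size),
        }
        self.viral = False

    def receive_viral(self) -> None:
        self.viral = True

    def write(self, region: str, offset: int, data: bytes) -> Response:
        if self.viral and region == "persistent":
            # Drop the write, but still return a completion so the Host progresses.
            return Response(completed=True)
        self._regions[region][offset:offset + len(data)] = data
        return Response(completed=True)

    def read(self, region: str, offset: int, length: int) -> Response:
        data = bytes(self._regions[region][offset:offset + length])
        # In the viral condition every read response carries MetaField = No-op.
        meta = "No-op" if self.viral else None
        return Response(completed=True, data=data, meta_field=meta)


device = ViralAwareDevice(persistent_size=128, volatile_size=128)
device.write("persistent", 0, b"\x11" * 8)  # committed before viral
device.receive_viral()
device.write("persistent", 0, b"\x22" * 8)  # dropped, completion still returned
```

The point of returning completions even for dropped writes is that withholding them could cause cascading timeouts in the Host.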

When the device itself runs into a Viral condition and Viral is enabled, it shall:

  • Set the Viral Status bit to indicate that a Viral condition has occurred
  • Containment – Take steps to contain the error within the device (or logical device in an MLD component) and follow the Viral containment steps listed above.
  • Communicate the Viral condition back up CXL.mem and CXL.cache towards the Host.
  • Viral propagates to all devices in the Virtual Hierarchy including the host.

Viral Control and Status bits are defined in the DVSEC.

12.5 CXL Error Injection

The major aim of error injection mechanisms is to give system validation, system firmware/software development, and similar activities the means to create error scenarios and exercise error handling flows. To this end, CXL UP and DP are recommended to implement the following error injection hooks to a specified address (where applicable):

  • One type of CXL.io UC error (optional - similar to PCIe).
    - CXL.io is always present in any CXL connection
  • One type of CXL.mem UC error (if applicable)
  • One type of CXL.cache UC error (if applicable)
  • Link Correctable errors
    - Transient mode and
    - Persistent mode
  • Returning Poison on a read to a specified address (CXL.mem only)

References

  • CXL 2.0 Spec 12.0