DDR5 ECC

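As DRAM technology advances, data integrity becomes critical. DDR5 and LPDDR5 introduce on-die ECC and ECS, with error check and scrub managed through mode registers. This article covers how ECC works, data protection in self-refresh mode, and the effects of row hammer attacks and temperature on DRAM. It also mentions NVDIMM-N as a way to prevent data loss.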

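As DRAM process nodes move from 1x to 1y to 1z and on to 1α and 1β, and as device speeds rise to 8533 MT/s for LPDDR5 and 8800 MT/s for DDR5, data integrity is becoming something that OEMs and other users must take into account: a system only works as designed if the data stored in DRAM is correct.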

This is a complex problem, and handling it takes several complementary approaches.

MR9 (MA[7:0]=09H) - Writeback Suppression and TM
MR9 Register Information

| OP[7] | OP[6:2] | OP[1] | OP[0] |
|---|---|---|---|
| TM | RFU | x4 Writes | ECS Writeback |

| Function | Register Type | Operand | Data | Notes |
|---|---|---|---|---|
| ECS Writeback | R/W | OP[0] | 0B: Do not suppress writeback of Data and ECC Check Bits (Default); 1B: Suppress writeback of Data and ECC Check Bits (Optional) | 1 |
| x4 Writes | R/W | OP[1] | 0B: Do not suppress writeback of Data during RMW (Default); 1B: Suppress writeback of Data during RMW (Optional) | 1 |
| RFU | RFU | OP[6:2] | RFU | |
| TM | W | OP[7] | 0B: Normal (Default); 1B: Test Mode | |

NOTE 1 DDR5 SPD Byte 14 Bits[2:1] indicates if feature is supported and will also indicate whether to use MR9 or MR15 for enabling the modes.

MR14 (MA[7:0]=0EH) - Transparency ECC Configuration
MR14 Register Information

| OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
|---|---|---|---|---|---|---|---|
| ECS Mode | Reset ECS Counter | Row Mode/Code Word Mode | RFU | CID3 | CID2 | CID1 | CID0 |

| Function | Register Type | Operand | Data | Notes |
|---|---|---|---|---|
| ECS Error Register Index/MBIST Rank Select | R/W | OP[3:0] | CID[3:0] | 1,2,3,4 |
| RFU | RFU | OP[4] | RFU | |
| Code Word/Row Count | R/W | OP[5] | 0B: ECS counts Rows with errors; 1B: ECS counts Code words with errors | 1 |
| ECS Reset Counter | W | OP[6] | 0B: Normal (Default); 1B: Reset ECC Counter | 1,4 |
| ECS Mode | R/W | OP[7] | 0B: Manual ECS Mode Disabled (Default); 1B: Manual ECS Mode Enabled | 1 |

NOTE 1 MR14:OP[3:0] must be setup by MRW to indicate which slice in the 3DS-DDR5 stack is referenced by the MRR for MR14-MR20 ECS transparency data, MR22 MBIST transparency data, and MR54-MR57 hPPR resource availability. On 3DS devices that support optional MBIST/mPPR, prior to MBIST initialization via MR23:OP[4] followed by guard keys, MR14:OP[3:0] must be programmed by MRW according to the logical rank that is desired to perform MBIST.
NOTE 2 CID[3:0] encoding is based on the stack height of the device and varies depending on the number of dies in the stack.
NOTE 3 For Monolithic DDR5, CID[3:0] should be set to 0.
NOTE 4 ECS stands for Error Check Scrub op
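
The MR14 bit layout above can be illustrated with a small packing helper. This is only a sketch: the function names `pack_mr14` and `unpack_mr14` are made up for illustration, and the code models nothing beyond the bit positions listed in the table (OP[4] is left at 0 because it is RFU).

```python
def pack_mr14(manual_ecs: bool, reset_counter: bool,
              count_code_words: bool, cid: int = 0) -> int:
    """Build the 8-bit MR14 operand from the fields in the table above.

    Hypothetical helper; field names are ours, bit positions come from the table.
    """
    if not 0 <= cid <= 0xF:
        raise ValueError("CID[3:0] is a 4-bit field")
    return ((int(manual_ecs) << 7)          # OP[7]: ECS Mode (1 = manual)
            | (int(reset_counter) << 6)     # OP[6]: Reset ECS Counter
            | (int(count_code_words) << 5)  # OP[5]: 0 = rows, 1 = code words
            | cid)                          # OP[3:0]: CID, 0 for monolithic DDR5


def unpack_mr14(value: int) -> dict:
    """Split an MR14 operand back into named fields."""
    return {
        "manual_ecs": bool((value >> 7) & 1),
        "reset_counter": bool((value >> 6) & 1),
        "count_code_words": bool((value >> 5) & 1),
        "cid": value & 0xF,
    }


if __name__ == "__main__":
    op = pack_mr14(manual_ecs=True, reset_counter=False, count_code_words=True)
    print(f"MR14 operand: 0x{op:02X}", unpack_mr14(op))
```

Since OP[6] is listed as write-only, a value read back from the device may not reflect that bit; the unpack helper is meant for checking values the host is about to write.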

MR15 (MA[7:0]=0FH) - Transparency ECC Threshold per Gb of Memory Cells and Automatic ECS in Self Refresh
MR15 Register Information

| OP[7] | OP[6] | OP[5:4] | OP[3] | OP[2:0] |
|---|---|---|---|---|
| x4 Writes | ECS Writeback | RFU | Automatic ECS in Self Refresh | ECS Error Threshold Count (ETC) |

| Function | Register Type | Operand | Data | Notes |
|---|---|---|---|---|
| ECS Error Threshold Count (ETC) | R/W | OP[2:0] | 000B: 4; 001B: 16; 010B: 64; 011B: 256 (Default); 100B: 1024; 101B: 4096; 110B: RFU; 111B: RFU | |
| Automatic ECS in Self Refresh | W | OP[3] | 0B: Automatic ECS disabled in Self-Refresh in Manual ECS mode (default); 1B: Automatic ECS enabled in Self-Refresh in Manual ECS mode | |
| RFU | RFU | OP[5:4] | RFU | |
| ECS Writeback | R/W | OP[6] | 0B: Do not suppress writeback of Data and ECC Check Bits (Default); 1B: Suppress writeback of Data and ECC Check Bits (Optional) | 4 |
| x4 Writes | R/W | OP[7] | 0B: Do not suppress writeback of Data during RMW (Default); 1B: Suppress writeback of Data during RMW (Optional) | 4 |

NOTE 1 MR14:OP[3:0] applies to CID[3:0] for 3DS-DDR5 and must be setup to indicate which slice in the 3DS-DDR5 stack is referenced in the MR14 through MR20 transparency data.
NOTE 2 DDR5 performs Automatic ECS operation while in Self-Refresh mode either by enabling MR15:OP[3]=1B (Automatic ECS in Self-Refresh enable) or disabling MR14:OP[7]=0B (Automatic ECS mode enable).
NOTE 3 If the Automatic ECS in Self-Refresh is enabled, transparency mode-registers updated cannot be controlled by the number of Manual ECS operation MPC command since the ECS counter is increased by both manual ECS command and the Automatic ECS Operation in Self-Refresh mode.
NOTE 4 DDR5 SPD Byte 14 Bits[2:1] indicates if feature is supported and will also indicate whether to use MR9 or MR15 for enabling the modes.
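
As a rough illustration of the MR15 fields, the sketch below decodes an operand byte into the threshold count and the three control bits. The name `decode_mr15` and the dictionary keys are hypothetical; the encodings are taken from the MR15 table, and RFU threshold codes are returned as `None` rather than guessed.

```python
# ETC encodings from the MR15 table; 110B and 111B are RFU and deliberately absent.
ETC_COUNTS = {0b000: 4, 0b001: 16, 0b010: 64, 0b011: 256, 0b100: 1024, 0b101: 4096}


def decode_mr15(value: int) -> dict:
    """Decode an MR15 operand (hypothetical helper, field names are ours)."""
    return {
        "etc": ETC_COUNTS.get(value & 0b111),               # OP[2:0], None if RFU
        "auto_ecs_in_self_refresh": bool((value >> 3) & 1),  # OP[3]
        "suppress_ecs_writeback": bool((value >> 6) & 1),    # OP[6]
        "suppress_x4_rmw_writeback": bool((value >> 7) & 1), # OP[7]
    }


if __name__ == "__main__":
    print(decode_mr15(0x03))  # default threshold code 011B
    print(decode_mr15(0x0B))  # same threshold, automatic ECS in self-refresh set
```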

On-Die ECC
DDR5 devices shall implement internal Single Error Correction (SEC) ECC to improve the data integrity within the DRAM. The DRAM
shall use 128 data bits to compute the ECC code of 8 ECC Check Bits.
For a x4 DDR5 device, internal prefetch for on-die ECC is 128 bits even though a x4 is a 64-bit prefetch device. For each read or write
transaction in a x4 device, an additional section of the DRAM array is accessed internally to provide the required additional 64 bits
used in the 128-bit ECC computation. In other words, in a x4 device, each 8-bit ECC Check Bit word is tied to two 64-bit sections of
the DRAM. In the case of a x8 device, no extra prefetch is required, as the prefetch is the same as the external transfer size. For a
x16 device, two 128-bit data words and their corresponding 8 check bits are fetched from different internal banks (same external bank
address). Each 128 Data bits and the corresponding 8 check bits are checked separately and in parallel.
On reads, the DRAM corrects any single-bit errors before returning the data to the memory controller. The DRAM shall not write the
corrected data back to the array during a read cycle.
On writes, the DRAM computes ECC and writes data and ECC bits to the array. If the external data transfer size is smaller than the
128 data bits code word (x4 devices), then DRAM will have to perform an internal 'read-modify-write' operation. The DRAM will
correct any single-bit errors that result from the internal read before merging the incoming write data and then re-compute 8 ECC
Check bits before writing data and ECC bits to the array. In the case of a x8 and x16 DDR5, no internal read is required.
For a x16 device, two 136-bit code words are read from two internal banks (same external bank address); one code word is mapped
to DQ[0:7] and the other code word is mapped to DQ[8:15].
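
The per-width behaviour described above comes down to simple arithmetic, sketched here. The function name `write_plan` is hypothetical, and the burst length of 16 is an assumption used to reproduce the 64-bit and 128-bit transfer sizes quoted in the text.

```python
BURST_LENGTH = 16          # assumption: BL16, giving 64 bits per access on a x4 device
CODE_WORD_DATA_BITS = 128  # data bits covered by one 8-bit ECC check word


def write_plan(dq_width: int) -> tuple:
    """Return (external_bits, code_words_touched, needs_internal_rmw)."""
    if dq_width not in (4, 8, 16):
        raise ValueError("DDR5 devices are x4, x8 or x16")
    external_bits = dq_width * BURST_LENGTH
    # A x4 write still touches one full code word; a x16 write touches two.
    code_words = max(1, external_bits // CODE_WORD_DATA_BITS)
    # Only a transfer smaller than the code word forces a read-modify-write.
    needs_rmw = external_bits < CODE_WORD_DATA_BITS
    return external_bits, code_words, needs_rmw


if __name__ == "__main__":
    for width in (4, 8, 16):
        print(f"x{width}:", write_plan(width))
```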
 

SEC Overview
The ECC blocks shown in Figure 153 are the ECC Check Bit Generator, Syndrome Generator, Syndrome Decode and Correction. The
Check Bit Generator and Syndrome Generator blocks are fully specified by the H matrix.
The Syndrome Decode block executes the following function:
Zero Syndrome => No Error
Non-Zero Syndrome matches one of the columns of the H matrix => Flip Corresponding bit
Non-Zero Syndrome that does not match any of the columns in the H matrix => DUE
DUE: Detected Uncorrected Error


Figure 153 — On Die ECC Block Diagram
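
The three syndrome-decode cases can be tried out with a toy single-error-correcting code over 128 data bits and 8 check bits. The H matrix below is NOT the JEDEC matrix; it is a hypothetical one built only so that all 136 columns are distinct and non-zero, which is the property the decode rule relies on. All names (`DATA_COLS`, `check_bits`, `decode`) are illustrative.

```python
N_DATA, N_CHECK = 128, 8

# Hypothetical H matrix (not the JEDEC one). Each column is an 8-bit integer.
# Data columns: the first 128 values that are not powers of two.
# Check columns: the 8 unit vectors, so check bit i has column 1 << i.
DATA_COLS = [v for v in range(1, 1 << N_CHECK) if v & (v - 1)][:N_DATA]
CHECK_COLS = [1 << i for i in range(N_CHECK)]
COLUMN_INDEX = {col: pos for pos, col in enumerate(DATA_COLS + CHECK_COLS)}


def check_bits(data: int) -> int:
    """Check bit generator: XOR the H columns of every set data bit."""
    bits = 0
    for i, col in enumerate(DATA_COLS):
        if (data >> i) & 1:
            bits ^= col
    return bits


def decode(data: int, check: int) -> tuple:
    """Syndrome generate + decode. Returns (status, data, check)."""
    syndrome = check_bits(data) ^ check
    if syndrome == 0:
        return "no_error", data, check
    pos = COLUMN_INDEX.get(syndrome)
    if pos is None:                      # matches no column of H
        return "DUE", data, check
    if pos < N_DATA:                     # flip the corresponding data bit
        return "corrected", data ^ (1 << pos), check
    return "corrected", data, check ^ (1 << (pos - N_DATA))


if __name__ == "__main__":
    word = 0x0123456789ABCDEF0123456789ABCDEF
    stored_check = check_bits(word)
    print(decode(word ^ (1 << 77), stored_check)[0])             # single-bit error
    print(decode(word ^ 1 ^ (1 << 127), stored_check)[0])        # this double-bit error
```

Because the code is SEC only, a double-bit error is not always flagged: if its syndrome happens to equal some column of H, the decoder flips a third bit instead. The double-bit pattern in the example was chosen so that its syndrome falls outside the columns of this particular toy matrix.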

DDR5 ECC Transparency and Error Scrub
DDR5 ECC Transparency and Error Scrub incorporates an ECC Error Check and Scrub (ECS) mode with an error counting scheme
for transparency. The ECS mode allows the DRAM to internally read, correct single bit errors, and write back corrected data bits to the
array (scrub errors) while providing transparency to error counts. It is recommended that a full error scrub of the DRAM is
performed a minimum of once every 24 hours.
There are two options for ECS mode, set via Mode Register. The Manual ECS mode (MR14:OP[7] = 1B) allows for ECS operations
via the Multi-Purpose Command. The Automatic ECS mode (MR14:OP[7] = 0B, default setting) allows for the ECS to run internal to
the DRAM.
The ECS feature is available on all device configurations.
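
To get a feel for what the 24-hour recommendation implies in manual mode, the sketch below estimates how often an ECS operation would have to be issued on average. It assumes that one ECS operation scrubs exactly one 128-bit code word; that assumption, and the function names, are ours and are not stated in the text above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ecs_operations_for_full_scrub(density_gbit: int, bits_per_op: int = 128) -> int:
    """Number of ECS operations to cover the array once.

    Assumes one operation scrubs one code word of `bits_per_op` data bits.
    """
    return density_gbit * (1 << 30) // bits_per_op


def average_ecs_interval_s(density_gbit: int, period_s: float = SECONDS_PER_DAY) -> float:
    """Average spacing between ECS operations to finish one scrub per period."""
    return period_s / ecs_operations_for_full_scrub(density_gbit)


if __name__ == "__main__":
    for gbit in (8, 16, 32):
        interval_ms = average_ecs_interval_s(gbit) * 1e3
        print(f"{gbit} Gb: {ecs_operations_for_full_scrub(gbit)} ops, "
              f"one every {interval_ms:.3f} ms on average")
```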
ECS mode implements two counters to track ECC code word errors detected during operation: Error Counter (EC) and Errors per
Row Counter (EpRC). The EC defaults to counting rows with errors; however, it may also be configured to count code words with
errors. In row mode (default), the EC tracks the number of rows that have at least one code word error detected subject to a threshold
filter. In the code word mode, the EC tracks the total number of code word errors, also subject to the threshold filter. The second
counter, EpRC, tracks the error count of the row with the largest number of code word errors along with the address of that row. EpRC
error reporting is also subject to a separate threshold filter. A general functional block diagram example of the ECS Mode operation is
shown in Figure 154 while the ECC Error Checking and Scrub mode, Mode Register (MR14), is shown in Table 153.
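
The difference between the two counters, and between row mode and code word mode, can be shown with a small model. This is a sketch under assumptions: the function `ecs_counters` is hypothetical, the thresholds are plain parameters (the EpRC threshold default is a placeholder, not a value from the text), and the model only reports whether each count has reached its threshold rather than modelling how the DRAM filters the reported value.

```python
def ecs_counters(errors_per_row: dict, code_word_mode: bool = False,
                 ec_threshold: int = 256, eprc_threshold: int = 1) -> dict:
    """Model EC and EpRC after one scrub pass.

    errors_per_row maps a row address to the number of code word errors
    found in that row. Threshold defaults are illustrative assumptions.
    """
    failing = {row: n for row, n in errors_per_row.items() if n > 0}
    if code_word_mode:
        ec = sum(failing.values())   # total code word errors
    else:
        ec = len(failing)            # rows with at least one error
    # EpRC: the row with the most code word errors, plus its address.
    worst_row, eprc = max(failing.items(), key=lambda kv: kv[1], default=(None, 0))
    return {
        "ec": ec,
        "ec_reached_threshold": ec >= ec_threshold,
        "eprc": eprc,
        "eprc_row": worst_row,
        "eprc_reached_threshold": eprc >= eprc_threshold,
    }


if __name__ == "__main__":
    scrub_result = {0x0010: 1, 0x0020: 3, 0x0030: 0}
    print(ecs_counters(scrub_result))                       # row mode
    print(ecs_counters(scrub_result, code_word_mode=True))  # code word mode
```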

Table 153 — MR14 ECC Transparency and Error Scrub Mode Register Information

| OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
|---|---|---|---|---|---|---|---|
| ECS Mode | Reset ECS Counter | Row Mode/Code Word Mode | RFU | CID3 | CID2 | CID1 | CID0 |


4.37.1 Mode Register and DRAM Initialization Prior to ECS Mode Operation
The ECC Transparency and Error Scrub counters are set to zero and the internal ECS Address Counters are initialized either by a
RESET or by manually writing MR14 OP[6]=1B. While MR14:OP[6]=1B, ECS counters are reset and no additional ECS operations
shall occur. If manual reset via mode register is utilized, mode register bit MR14 OP[6] shall be written back to a 0 before any
subsequent ECS operations will continue or a subsequent reset can be applied.
ECS mode selections, MR15 OP[3], Automatic ECS in Self-Refresh, MR14 OP[7], Manual/Automatic ECS Mode, and MR14 OP[5],
row/code word mode shall be programmed during DRAM initialization and shall not be changed once the first ECS operation occurs
unless followed by issuing a RESET or ECS Reset Counters, otherwise an unknown operation could result during subsequent ECS
operations.
An ECS Reset Counters operation requires setting MR14:OP[6]=1B to reset MR16 - MR20. Setting MR14:OP[6]=0B is then required
to re-enable Manual or Automatic ECS operations.
Manual ECS mode is enabled by MR14 OP[7] = 1B. A manual ECS operation requires an MPC command with OP[7:0]=0000 1100B.
The DRAM must have all array bits written to prior to executing ECS operations to avoid generating false failures.
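
The reset and manual-scrub steps described in this subsection can be written down as a command list. The sketch below only produces tuples; it does not talk to hardware, the helper names are hypothetical, and it ignores all timing requirements between commands. The MR14 bit positions and the MPC opcode 0000 1100B are taken from the text.

```python
MR14 = 14
MPC_MANUAL_ECS = 0b0000_1100  # MPC opcode for a manual ECS operation


def ecs_reset_sequence(manual_mode: bool, code_word_mode: bool, cid: int = 0) -> list:
    """Mode register writes that reset the ECS counters, then re-enable ECS.

    Hypothetical helper: returns ("MRW", register, operand) tuples only.
    """
    base = (int(manual_mode) << 7) | (int(code_word_mode) << 5) | (cid & 0xF)
    return [
        ("MRW", MR14, base | (1 << 6)),  # OP[6]=1B: counters reset, ECS held off
        ("MRW", MR14, base),             # OP[6]=0B: ECS operations may continue
    ]


def manual_ecs_commands(count: int) -> list:
    """`count` manual ECS operations, one MPC command each."""
    return [("MPC", MPC_MANUAL_ECS)] * count


if __name__ == "__main__":
    for cmd in ecs_reset_sequence(manual_mode=True, code_word_mode=False):
        print(cmd)
    print(manual_ecs_commands(2))
```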
4.37.2 ECS Operation
All banks shall be precharged and in an idle state prior to executing a manual ECS operation.
Executing a manual ECS operation, MPC command with OP[7:0]=0000 1100B, generates the following internally self-timed command
sequence: ACT, RD, WR, PRE. ECS operation timing is shown in Figure 155.
Figure 155 — ECS Operation Timing Diagram
The minimum time for the ECS operation to execute is tECSc (tMPC_Delay + tRCD + WL + tWR + tRP + ntCK). ntCK is required to satisfy
tECSc.
Table 154 — ECS Operation Timing Parameter

| Parameter | Symbol | Min | Max | Unit | NOTE |
|---|---|---|---|---|---|
| ECS Operation time | tECSc | Max(176nCK, 110ns) | - | ns | |

Upon executing a manual ECS operation, DQs will remain in RTT_PARK and DQS in DQS_RTT_PARK. The only commands
allowed other than DES during tECSc for a manual ECS operation are ODT NT commands, which may change the DQ and DQS
termination state.
Any illegal usage of manual ECS mode (e.g. refresh or temperature violations) will result in operation not being guaranteed.
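
The tECSc minimum in Table 154 is a "whichever is larger" rule, which is easy to evaluate for a given clock period. In the sketch below the function name `tecsc_ns` is hypothetical and the clock count is a parameter: the default of 176 follows Table 154, while the annotation in the Figure 155 timing diagram gives 45nCK, so the value should be checked against the device datasheet.

```python
def tecsc_ns(tck_ns: float, min_clocks: int = 176, min_ns: float = 110.0) -> float:
    """Minimum ECS operation time: the larger of min_clocks * tCK and min_ns."""
    return max(min_clocks * tck_ns, min_ns)


if __name__ == "__main__":
    # Example clock periods in ns (illustrative values, not from the text).
    for tck in (0.416, 0.5, 1.0):
        print(f"tCK = {tck} ns -> tECSc >= {tecsc_ns(tck):.1f} ns")
```

With the default of 176 clocks, the 110 ns floor dominates whenever tCK is 0.625 ns or shorter.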

[Figure 155 timing diagram. Signals: CK_t/CK_c, CA[13:0], CMD, CS0. Time points: t0, ta to ta+3, tb to tb+3, tc, tc+1. The CMD row shows VALID followed by DES commands. Intervals marked: tMPC_Delay, tRCD, WL + tWR, tRP + ntCK. Phases: Normal Mode, ECS Mode Entry, ECS Mode, Normal Mode. The diagram is annotated tECSc = max(45nCK, 110ns).]


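Traditionally, one of the main ways of handling data errors has been to rely on ECC (and now on-die ECC as well). ECC needs extra memory storage (an RDIMM carries separate DRAM devices for ECC). The ECC is computed and the ECC data is written to DRAM; at read time this data is read back together with the memory data (64 bits) and checked against it to make sure there are no errors. A typical ECC scheme uses a Hamming code to provide single-bit error correction and double-bit error detection for each burst. In addition, while earlier DRAM generations required the host to set aside system memory for ECC storage, the latest DRAMs such as LPDDR5 and DDR5 introduce on-die ECC (with ECC Error Check and Scrub (ECS)), which can be enabled through mode registers.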

 

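Failing to meet DRAM refresh requirements is a major cause of data loss. Meeting them can be challenging because PVT variation makes refresh requirements change over time. Placing the DRAM in self-refresh mode hands the refresh-tracking task to the DRAM, but it may prevent the host from making other scheduling optimizations, so it should be weighed carefully.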

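Other factors that can affect DRAM data include:

  1. Row hammer, where the same or adjacent rows are activated again and again, causing data in rows that were not addressed to be lost or altered. Recent DRAMs such as LPDDR5/DDR5 support refresh management (including DRFM and ARFM), which lets the host compensate by issuing dedicated RFM commands and helps the DRAM deal with potential data loss caused by row hammer attacks.
  2. Device temperature is another important factor when the application needs the DRAM to run at high temperature. Users need to confirm with the DRAM vendor the temperature range in which the DRAM can operate. Regardless of refresh rate, data integrity cannot be guaranteed above a certain temperature threshold unless the DRAM is built to withstand those conditions.
  3. DRAM power loss causes the DRAM to lose all of its contents. If this is a real concern for the system designer, they should consider an NVDIMM-N device, which has an on-board controller and a power source sufficient to copy the DRAM contents to backup non-volatile memory before power is lost. When power returns, the contents stored in non-volatile memory are written back to DRAM and the system can continue to run as it did before the power-loss event.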

For transmission and generation errors, DRAM supports additional features such as CRC, DFE, pre-emphasis and PPR.
