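As DRAM process nodes advance from 1x to 1y to 1z and on to the 1α and 1β nodes, and as device speeds rise to 8533 for LPDDR5 and 8800 for DDR5, data integrity is becoming something OEMs and other users must take into account: a system works as designed only if the data stored in DRAM is correct.
This is a complex problem, and it takes several approaches to handle it.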
MR9 (MA[7:0]=09H) - Writeback Suppression and TM
MR9 Register Information
OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
TM | RFU | RFU | RFU | RFU | RFU | x4 Writes | ECS Writeback |
Function | Register Type | Operand | Data | Notes |
ECS Writeback | R/W | OP[0] | 0B: Do not suppress writeback of Data and ECC Check Bits (Default) 1B: Suppress writeback of Data and ECC Check Bits (Optional) | 1 |
x4 Writes | R/W | OP[1] | 0B: Do not suppress writeback of Data during RMW (Default) 1B: Suppress writeback of Data during RMW (Optional) | 1 |
RFU | RFU | OP[6:2] | RFU | |
TM | W | OP[7] | 0B: Normal (Default) 1B: Test Mode | |
NOTE 1 DDR5 SPD Byte 14 Bits[2:1] indicate whether the feature is supported and whether to use MR9 or MR15 for enabling the modes.
MR14 (MA[7:0]=0EH) - Transparency ECC Configuration
MR14 Register Information
OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
ECS Mode | Reset ECS Counter | Row Mode/ Code Word Mode | RFU | CID3 | CID2 | CID1 | CID0 |
Function | Register Type | Operand | Data | Notes |
ECS Error Register Index/ MBIST Rank Select | R/W | OP[3:0] | CID[3:0] | 1,2,3,4 |
RFU | RFU | OP[4] | RFU | |
Code Word/Row Count | R/W | OP[5] | 0B: ECS counts Rows with errors 1B: ECS counts Code words with errors | 1 |
ECS Reset Counter | W | OP[6] | 0B: Normal (Default) 1B: Reset ECC Counter | 1,4 |
ECS Mode | R/W | OP[7] | 0B: Manual ECS Mode Disabled (Default) 1B: Manual ECS Mode Enabled | 1 |
NOTE 1 MR14:OP[3:0] must be setup by MRW to indicate which slice in the 3DS-DDR5 stack is referenced by the MRR for MR14-MR20 ECS transparency data, MR22 MBIST transparency data, and MR54-MR57 hPPR resource availability. On 3DS devices that support optional MBIST/mPPR, prior to MBIST initialization via MR23:OP[4] followed by guard keys, MR14:OP[3:0] must be programmed by MRW according to the logical rank that is desired to perform MBIST.
NOTE 2 CID[3:0] encoding is based on the stack height of the device and varies depending on the number of dies in the stack.
NOTE 3 For Monolithic DDR5, CID[3:0] should be set to 0.
NOTE 4 ECS stands for Error Check Scrub op
MR15 (MA[7:0]=0FH) - Transparency ECC Threshold per Gb of Memory Cells and Automatic ECS in Self Refresh
MR15 Register Information
OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
x4 Writes | ECS Writeback | RFU | RFU | Automatic ECS in Self Refresh | ECS Error Threshold Count (ETC) | ECS Error Threshold Count (ETC) | ECS Error Threshold Count (ETC) |
Function | Register Type | Operand | Data | Notes |
ECS Error Threshold Count (ETC) | R/W | OP[2:0] | 000B: 4 001B: 16 010B: 64 011B: 256 (Default) 100B: 1024 101B: 4096 110B: RFU 111B: RFU | |
Automatic ECS in Self Refresh | W | OP[3] | 0B: Automatic ECS disabled in Self-Refresh in Manual ECS mode (default) 1B: Automatic ECS enabled in Self-Refresh in Manual ECS mode | |
RFU | RFU | OP[5:4] | RFU | |
ECS Writeback | R/W | OP[6] | 0B: Do not suppress writeback of Data and ECC Check Bits (Default) 1B: Suppress writeback of Data and ECC Check Bits (Optional) | 4 |
x4 Writes | R/W | OP[7] | 0B: Do not suppress writeback of Data during RMW (Default) 1B: Suppress writeback of Data during RMW (Optional) | 4 |
NOTE 1 MR14:OP[3:0] applies to CID[3:0] for 3DS-DDR5 and must be setup to indicate which slice in the 3DS-DDR5 stack is referenced in the MR14 through MR20 transparency data.
NOTE 2 DDR5 performs the Automatic ECS operation while in Self-Refresh mode either when MR15:OP[3]=1B (Automatic ECS in Self-Refresh enabled) or when MR14:OP[7]=0B (Automatic ECS mode enabled).
NOTE 3 If Automatic ECS in Self-Refresh is enabled, updates to the transparency mode registers cannot be controlled by the number of Manual ECS operation MPC commands, since the ECS counter is incremented by both manual ECS commands and the Automatic ECS operation in Self-Refresh mode.
NOTE 4 DDR5 SPD Byte 14 Bits[2:1] indicates if feature is supported and will also indicate whether to use MR9 or MR15 for enabling
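The MR15 bit layout above can be sketched as a small encode/decode helper. This is only an illustration of the field positions listed in the table; the function names `encode_mr15` and `decode_mr15` are hypothetical and not part of any real driver or library.

```python
# Host-side sketch of the MR15 layout from the table above.
# encode_mr15 / decode_mr15 are hypothetical helper names.

# OP[2:0] encodings for the ECS Error Threshold Count (ETC); 110B and 111B are RFU.
ETC_BY_CODE = {0b000: 4, 0b001: 16, 0b010: 64, 0b011: 256, 0b100: 1024, 0b101: 4096}
CODE_BY_ETC = {etc: code for code, etc in ETC_BY_CODE.items()}


def encode_mr15(etc=256, auto_ecs_in_self_refresh=False,
                suppress_ecs_writeback=False, suppress_x4_writeback=False):
    """Pack the MR15 fields into an 8-bit operand (RFU bits OP[5:4] stay 0)."""
    if etc not in CODE_BY_ETC:
        raise ValueError(f"unsupported ETC value: {etc}")
    value = CODE_BY_ETC[etc]                       # OP[2:0]
    value |= int(auto_ecs_in_self_refresh) << 3    # OP[3]
    value |= int(suppress_ecs_writeback) << 6      # OP[6] (optional feature)
    value |= int(suppress_x4_writeback) << 7       # OP[7] (optional feature)
    return value


def decode_mr15(value):
    """Unpack an 8-bit MR15 operand; 'etc' is None for the RFU encodings."""
    if not 0 <= value <= 0xFF:
        raise ValueError("MR15 operand must fit in 8 bits")
    return {
        "etc": ETC_BY_CODE.get(value & 0b111),
        "auto_ecs_in_self_refresh": bool((value >> 3) & 1),
        "suppress_ecs_writeback": bool((value >> 6) & 1),
        "suppress_x4_writeback": bool((value >> 7) & 1),
    }
```

With the defaults, `encode_mr15()` yields `0b00000011`, which matches the default ETC of 256 in the table. Because OP[3] is listed as write-only, `decode_mr15` is best applied to a value the host has written rather than to a read-back.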
On-Die ECC
DDR5 devices shall implement internal Single Error Correction (SEC) ECC to improve the data integrity within the DRAM. The DRAM
shall use 128 data bits to compute the ECC code of 8 ECC Check Bits.
For a x4 DDR5 device, internal prefetch for on-die ECC is 128 bits even though a x4 is a 64-bit prefetch device. For each read or write
transaction in a x4 device, an additional section of the DRAM array is accessed internally to provide the required additional 64 bits
used in the 128-bit ECC computation. In other words, in a x4 device, each 8-bit ECC Check Bit word is tied to two 64-bit sections of
the DRAM. In the case of a x8 device, no extra prefetch is required, as the prefetch is the same as the external transfer size. For a
x16 device, two 128-bit data words and their corresponding 8 check bits are fetched from different internal banks (same external bank
address). Each 128 Data bits and the corresponding 8 check bits are checked separately and in parallel.
On reads, the DRAM corrects any single-bit errors before returning the data to the memory controller. The DRAM shall not write the
corrected data back to the array during a read cycle.
On writes, the DRAM computes ECC and writes data and ECC bits to the array. If the external data transfer size is smaller than the
128 data bits code word (x4 devices), then DRAM will have to perform an internal 'read-modify-write' operation. The DRAM will
correct any single-bit errors that result from the internal read before merging the incoming write data and then re-compute 8 ECC
Check bits before writing data and ECC bits to the array. In the case of a x8 and x16 DDR5, no internal read is required.
For a x16 device, two 136-bit code words are read from two internal banks (same external bank address); one code word is mapped
to DQ[0:7] and the other code word is mapped to DQ[8:15].
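The per-width behaviour described above comes down to simple arithmetic. The sketch below assumes a burst length of 16, which matches the 64-bit external transfer the text gives for a x4 device; the function name `ondie_ecc_layout` is hypothetical.

```python
# Arithmetic sketch of on-die ECC code word usage per device width.
# Assumes burst length 16 (x4 -> 64 bits, x8 -> 128 bits, x16 -> 256 bits).

CODE_WORD_DATA_BITS = 128   # data bits covered by one ECC code word
CHECK_BITS = 8              # ECC check bits per code word
BURST_LENGTH = 16           # assumption, consistent with the 64-bit x4 transfer


def ondie_ecc_layout(device_width):
    """Return how one external access maps onto 128+8-bit code words."""
    if device_width not in (4, 8, 16):
        raise ValueError("device width must be 4, 8 or 16")
    external_bits = device_width * BURST_LENGTH
    # At least one full code word is always involved, even for x4.
    code_words = max(1, external_bits // CODE_WORD_DATA_BITS)
    internal_data_bits = code_words * CODE_WORD_DATA_BITS
    return {
        "external_bits": external_bits,
        "code_words": code_words,
        # Extra bits fetched internally only to complete the ECC computation.
        "extra_prefetch_bits": internal_data_bits - external_bits,
        # Data plus check bits touched in the array.
        "array_bits": code_words * (CODE_WORD_DATA_BITS + CHECK_BITS),
        # A write narrower than the code word needs an internal read-modify-write.
        "write_needs_rmw": external_bits < CODE_WORD_DATA_BITS,
    }


for width in (4, 8, 16):
    print(f"x{width}:", ondie_ecc_layout(width))
```

Only the x4 case reports a non-zero extra prefetch and an internal read-modify-write on writes, which is why MR9 and MR15 carry a dedicated "x4 Writes" control.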
SEC Overview
The ECC blocks shown in Figure 153 are the ECC Check Bit Generator, Syndrome Generator, Syndrome Decode and Correction. The
Check Bit Generator and Syndrome Generator blocks are fully specified by the H matrix.
The Syndrome Decode block executes the following function:
Zero Syndrome => No Error
Non-Zero Syndrome matches one of the columns of the H matrix => Flip Corresponding bit
Non-Zero Syndrome that does not match any of the columns in the H matrix => DUE
DUE: Detected Uncorrected Error
Figure 153 — On Die ECC Block Diagram
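The three syndrome decode outcomes can be illustrated with a toy single-error-correcting code over 128 data bits and 8 check bits. The H matrix below is an assumption made purely for illustration (the real matrix is not reproduced in this text): data columns are the first 128 8-bit values with at least two bits set, and check-bit columns are the eight single-bit values.

```python
# Toy SEC model: 128 data bits + 8 check bits, with a made-up H matrix.
# The column choice is an illustrative assumption, not the DDR5 H matrix.

DATA_BITS = 128
CHECK_BITS = 8

# One 8-bit H-matrix column per code word bit, all distinct and non-zero.
DATA_COLUMNS = [v for v in range(1, 256) if bin(v).count("1") >= 2][:DATA_BITS]
CHECK_COLUMNS = [1 << i for i in range(CHECK_BITS)]
COLUMN_TO_POSITION = {
    col: pos for pos, col in enumerate(DATA_COLUMNS + CHECK_COLUMNS)
}


def generate_check_bits(data):
    """Check Bit Generator: XOR the H columns of every data bit that is 1."""
    check = 0
    for bit in range(DATA_BITS):
        if (data >> bit) & 1:
            check ^= DATA_COLUMNS[bit]
    return check


def decode(data, check):
    """Syndrome Generator + Syndrome Decode + Correction."""
    syndrome = generate_check_bits(data) ^ check
    if syndrome == 0:
        return "no_error", data                 # zero syndrome
    position = COLUMN_TO_POSITION.get(syndrome)
    if position is None:
        return "due", data                      # matches no column of H
    if position < DATA_BITS:
        return "corrected", data ^ (1 << position)  # flip the bad data bit
    return "corrected", data                    # the bad bit was a check bit


word = 0x0123456789ABCDEFFEDCBA9876543210
stored_check = generate_check_bits(word)
print(decode(word ^ (1 << 77), stored_check)[0])  # single-bit error in the data
```

A SEC-only code cannot reliably detect double-bit errors: the XOR of two columns may land on a third column and cause a miscorrection, or on an unused syndrome and raise a DUE, depending on which bits flipped.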
DDR5 ECC Transparency and Error Scrub
DDR5 ECC Transparency and Error Scrub incorporates an ECC Error Check and Scrub (ECS) mode with an error counting scheme
for transparency. The ECS mode allows the DRAM to internally read, correct single bit errors, and write back corrected data bits to the
array (scrub errors) while providing transparency to error counts. It is recommended that a full error scrub of the DRAM is
performed a minimum of once every 24 hours.
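The 24-hour recommendation translates into an average spacing between ECS operations once the device density is known. The sketch below assumes that each ECS operation scrubs one 128-bit code word and that ECC check bits are not counted in the nominal density; `ecs_interval_seconds` is a hypothetical helper name.

```python
# Average ECS spacing needed to scrub a whole device once per 24 hours.
# Assumption: one ECS operation covers exactly one 128-bit code word.

SECONDS_PER_DAY = 24 * 60 * 60
CODE_WORD_DATA_BITS = 128


def code_words_in_device(density_gbit):
    """Number of 128-bit code words in a device of the given density (Gbit)."""
    return (density_gbit * 2**30) // CODE_WORD_DATA_BITS


def ecs_interval_seconds(density_gbit):
    """Longest average gap between ECS operations for a full daily scrub."""
    return SECONDS_PER_DAY / code_words_in_device(density_gbit)


for density in (8, 16, 32, 64):
    interval_ms = ecs_interval_seconds(density) * 1e3
    print(f"{density:>2} Gb: {code_words_in_device(density):>11} code words, "
          f"one ECS operation every {interval_ms:.3f} ms")
```

For a 16 Gb device this gives 2^27 code words and roughly 0.644 ms between operations; doubling the density halves the allowed spacing.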
There are two options for ECS mode, set via Mode Register. The Manual ECS mode (MR14:OP[7] = 1B) allows for ECS operations
via the Multi-Purpose Command. The Automatic ECS mode (MR14:OP[7] = 0B, default setting) allows for the ECS to run internal to
the DRAM.
The ECS feature is available on all device configurations.
ECS mode implements two counters to track ECC code word errors detected during operation: Error Counter (EC) and Errors per
Row Counter (EpRC). The EC defaults to counting rows with errors; however, it may also be configured to count code words with
errors. In row mode (default), the EC tracks the number of rows that have at least one code word error detected subject to a threshold
filter. In the code word mode, the EC tracks the total number of code word errors, also subject to the threshold filter. The second
counter, EpRC, tracks the error count of the row with the largest number of code word errors along with the address of that row. EpRC
error reporting is also subject to a separate threshold filter. A general functional block diagram example of the ECS Mode operation is
shown in Figure 154 while the ECC Error Checking and Scrub mode, Mode Register (MR14), is shown in Table 153.
Table 153 — MR14 ECC Transparency and Error Scrub Mode Register Information
OP[7] | OP[6] | OP[5] | OP[4] | OP[3] | OP[2] | OP[1] | OP[0] |
ECS Mode | Reset ECS Counter | Row Mode/ Code Word Mode | RFU | CID3 | CID2 | CID1 | CID0 |
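The difference between row mode, code word mode, and the EpRC can be shown with a small software model. This is a deliberately simplified sketch: the threshold filter is modelled as "report nothing until the raw count reaches the threshold", which is an assumption, and the per-Gb scaling of the threshold is ignored. The function name `ecs_counters` is hypothetical.

```python
# Simplified model of the ECS transparency counters.
# Assumptions: the threshold filter hides counts below the threshold, and
# the threshold is applied to the whole device rather than per Gb.


def ecs_counters(row_errors, code_word_mode=False, threshold=256):
    """Summarise one full scrub pass.

    row_errors maps a row address to the number of code words in that row
    for which the scrub detected an error.
    """
    rows_with_errors = sum(1 for count in row_errors.values() if count > 0)
    code_word_errors = sum(row_errors.values())

    # Error Counter (EC): rows with errors by default, or code word errors.
    ec_raw = code_word_errors if code_word_mode else rows_with_errors
    ec_reported = ec_raw if ec_raw >= threshold else 0

    # Errors per Row Counter (EpRC): the worst row and its error count.
    worst_row = max(row_errors, key=row_errors.get, default=None)
    worst_count = row_errors.get(worst_row, 0)
    if worst_count == 0:
        worst_row = None

    return {
        "ec_raw": ec_raw,
        "ec_reported": ec_reported,
        "eprc_row": worst_row,
        "eprc_count": worst_count,
    }


scrub_result = {0x0010: 3, 0x0042: 1, 0x0007: 0}
print(ecs_counters(scrub_result, code_word_mode=False, threshold=4))
print(ecs_counters(scrub_result, code_word_mode=True, threshold=4))
```

In this example row mode sees two failing rows and stays under the threshold, while code word mode sees four failing code words and reports them; the EpRC points at row 0x0010 in both cases.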
4.37.1 Mode Register and DRAM Initialization Prior to ECS Mode Operation
The ECC Transparency and Error Scrub counters are set to zero and the internal ECS Address Counters are initialized either by a
RESET or by manually writing MR14 OP[6]=1B. While MR14:OP[6]=1B, ECS counters are reset and no additional ECS operations
shall occur. If manual reset via mode register is utilized, mode register bit MR14 OP[6] shall be written back to a 0 before any
subsequent ECS operations will continue or a subsequent reset can be applied.
ECS mode selections, MR15 OP[3], Automatic ECS in Self-Refresh, MR14 OP[7], Manual/Automatic ECS Mode, and MR14 OP[5],
row/code word mode shall be programmed during DRAM initialization and shall not be changed once the first ECS operation occurs
unless followed by issuing a RESET or ECS Reset Counters, otherwise an unknown operation could result during subsequent ECS
operations.
An ECS Reset Counters operation requires setting MR14:OP[6]=1B to reset MR16 - MR20. Setting MR14:OP[6]=0B is then required
to re-enable Manual or Automatic ECS operations.
Manual ECS mode is enabled by MR14 OP[7] = 1B. A manual ECS operation requires an MPC command with OP[7:0]=0000 1100B.
The DRAM must have all array bits written to prior to executing ECS operations to avoid generating false failures.
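The reset and enable steps above can be written down as the sequence of mode register writes a host would issue. The sketch only builds the operand values; the tuple format and the names `encode_mr14` and `ecs_reset_sequence` are hypothetical and stand in for whatever command interface a real memory controller exposes.

```python
# Sketch of the MR14 writes for an ECS counter reset, as plain data.
# The ("MRW", register, operand) tuples are a made-up representation.

MPC_MANUAL_ECS = 0b0000_1100   # MPC opcode for one manual ECS operation


def encode_mr14(cid=0, code_word_mode=False, reset_counter=False,
                manual_mode=False):
    """Pack the MR14 fields into an 8-bit operand (RFU bit OP[4] stays 0)."""
    if not 0 <= cid <= 0xF:
        raise ValueError("CID must fit in OP[3:0]")
    value = cid                            # OP[3:0]: CID, 0 for monolithic DDR5
    value |= int(code_word_mode) << 5      # OP[5]: 0 = count rows, 1 = code words
    value |= int(reset_counter) << 6       # OP[6]: 1 = reset ECS counters
    value |= int(manual_mode) << 7         # OP[7]: 1 = Manual ECS mode
    return value


def ecs_reset_sequence(cid=0, code_word_mode=False, manual_mode=False):
    """MR14 writes that reset the counters and then re-enable ECS operations."""
    settings = dict(cid=cid, code_word_mode=code_word_mode,
                    manual_mode=manual_mode)
    return [
        ("MRW", 14, encode_mr14(reset_counter=True, **settings)),   # OP[6]=1B
        ("MRW", 14, encode_mr14(reset_counter=False, **settings)),  # OP[6]=0B
    ]


for command in ecs_reset_sequence(manual_mode=True):
    print(command[0], f"MR{command[1]}", f"0b{command[2]:08b}")
```

After the second write, a host in Manual ECS mode would issue MPC commands with the `MPC_MANUAL_ECS` opcode, one per ECS operation.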
4.37.2 ECS Operation
All banks shall be precharged and in an idle state prior to executing a manual ECS operation.
Executing a manual ECS operation (an MPC command with OP[7:0]=0000 1100B) generates the following internally self-timed command
sequence: ACT, RD, WR, PRE. ECS operation timing is shown in Figure 155.
Figure 155 — ECS Operation Timing Diagram
The minimum time for the ECS operation to execute is tECSc (tMPC_Delay + tRCD + WL + tWR + tRP + ntCK). ntCK is required to satisfy
tECSc.
Table 154 — ECS Operation Timing Parameter
Parameter | Symbol | Min | Max | Unit | NOTE |
ECS Operation time | tECSc | Max(176nCK, 110ns) | - | ns | |
Upon executing a manual ECS operation, the DQs remain in RTT_PARK and DQS in DQS_RTT_PARK. The only commands
allowed other than DES during tECSc for a manual ECS operation are ODT NT commands, which may change the DQ and DQS
termination state.
Any illegal usage of manual ECS mode (e.g., refresh or temperature violations) will result in operation not being guaranteed.
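A "Max(nCK, ns)" style minimum such as tECSc depends on the clock period, so it helps to evaluate it per speed grade. The sketch below takes the Table 154 values as defaults and assumes the usual DDR relationship of two transfers per clock; `tecsc_clocks` and `tecsc_ns` are hypothetical helper names.

```python
# Evaluate a Max(N nCK, T ns) timing minimum for a given data rate.
# Defaults follow Table 154; tCK is assumed to be 2000 / data rate (MT/s) ns.


def tecsc_clocks(data_rate_mtps, min_clocks=176, min_ns=110):
    """Minimum tECSc expressed in whole clock cycles."""
    # ceil(min_ns / tCK) with tCK = 2000 / data_rate, done in integers
    # to avoid floating-point rounding at exact boundaries.
    clocks_for_ns = -(-(min_ns * data_rate_mtps) // 2000)
    return max(min_clocks, clocks_for_ns)


def tecsc_ns(data_rate_mtps, min_clocks=176, min_ns=110):
    """Minimum tECSc expressed in nanoseconds."""
    tck_ns = 2000 / data_rate_mtps
    return max(min_clocks * tck_ns, float(min_ns))


for rate in (3200, 4800, 6400, 8800):
    print(f"DDR5-{rate}: {tecsc_clocks(rate)} nCK, {tecsc_ns(rate):.1f} ns")
```

With these defaults the two terms meet at DDR5-3200 (176 nCK is exactly 110 ns), and the 110 ns term governs at every faster speed grade.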
[Figure 155 timing diagram: CK_t/CK_c, CA[13:0], CMD, and CS0 over t0 through tc+1. After the MPC command the bus carries DES while the device moves from Normal Mode through ECS Mode Entry into ECS Mode; tECSc spans tMPC_Delay, tRCD, WL + tWR, and tRP + ntCK, after which VALID commands resume. The figure annotates tECSc = max(45nCK, 110ns).]
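Traditionally, one of the main ways to handle data errors has been to rely on ECC (and, more recently, On-Die ECC). ECC requires extra memory storage; an RDIMM carries separate DRAM devices for it. The ECC is computed and the ECC data is written to DRAM. On a read, that data is read back together with the memory data (64 bits) and checked against it to make sure there are no errors. A typical ECC scheme uses a Hamming code, providing single-bit error correction and double-bit error detection for each burst. In addition, while earlier DRAM generations required the host to set aside system memory for ECC storage, the latest DRAMs such as LPDDR5 and DDR5 introduce On-Die ECC (with ECC Error Check and Scrub (ECS)), which can be configured through mode registers.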
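Single-bit correction with double-bit detection can be demonstrated with the smallest extended Hamming code, which protects 4 data bits with 4 check bits. This is a toy example chosen for readability; codes used on real ECC modules protect much wider words, and the function names here are hypothetical.

```python
# Toy extended Hamming (8,4) SECDED code: 4 data bits, 3 Hamming parity
# bits, and 1 overall parity bit. Index 0 holds the overall parity bit;
# indices 1..7 are the classic Hamming positions.


def encode(nibble):
    """Encode a 4-bit value into an 8-bit code word (list of bits)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # overall parity over positions 1..7
    return code


def decode(received):
    """Return (status, nibble) for a possibly corrupted code word."""
    code = list(received)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    parity_mismatch = sum(code) % 2 == 1

    if syndrome == 0 and not parity_mismatch:
        status = "no_error"
    elif parity_mismatch:
        # Odd number of flipped bits: assume one and flip it back.
        # A zero syndrome here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with consistent overall parity: two bits flipped.
        status = "double_error_detected"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                       # inject a single-bit error
print(decode(word))                # the error is corrected
word[2] ^= 1                       # inject a second error
print(decode(word)[0])             # the double error is detected, not corrected
```

The overall parity bit is what separates the two cases: a single flip disturbs it, while a double flip leaves it consistent but still produces a non-zero syndrome.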
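Failing to meet DRAM refresh requirements is a major cause of data loss. This can be challenging, because PVT variation makes refresh requirements change over time. Placing the DRAM in self-refresh mode hands the refresh-tracking task to the DRAM, but it may prevent the host from making other scheduling optimizations, so the trade-off should be weighed carefully.
Other factors that can affect DRAM data include:
- Row hammer, in which the same or adjacent rows are activated over and over, causing data in rows that were not addressed to be lost or altered. Recent DRAMs such as LPDDR5/DDR5 support refresh management (including DRFM and ARFM), which lets the host compensate by issuing dedicated RFM commands and helps the DRAM handle potential data loss caused by row hammer attacks.
- Device temperature is another important factor if the application requires the DRAM to run at high temperature. Users need to confirm with the DRAM vendor the temperature range in which the DRAM can operate. Regardless of refresh rate, data integrity cannot be guaranteed above a certain temperature threshold unless the DRAM is manufactured to withstand it.
- DRAM power loss causes the DRAM to lose all of its contents. If this is a real concern for system designers, they should consider NVDIMM-N devices, which carry an on-board controller and a power source sufficient to copy the DRAM contents to backup non-volatile memory before power is lost. When power returns, the contents stored in non-volatile memory are written back to DRAM, and the system can continue running as it did before the power-loss event.
For transmission and generation errors, DRAM supports additional features such as CRC, DFE, Pre-Emphasis, and PPR.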