Reverse Engineering a NAND Flash Device Management Algorithm

http://joshuawise.com/projects/ndfslave

Around June of 2012, I had gotten myself into a very bad habit. Instead of carrying my SD card in my camera, I left it sticking out of the side of my laptop, presumably intending to do something with the photos on it eventually. On my flight home from Boston, the predictable thing happened: as I got up out of my seat, the machine fell out of my lap, and as the machine hit the ground, the SD card hit first, and was destroyed.

I was otherwise ready to write off the data stored on that device, but something inside me just wasn't happy with that outcome. Before I pitched the SD card in the trash, I took a look at what remained – as far as I could tell, although the board was badly damaged, the storage IC itself was fully intact (although with a few bent pins).

The following is a description of how I went about reverse-engineering the on-flash format, and of the conclusions that I came to. My efforts over the course of about a month and a half of solid work – and a “long tail” of another five months or so – resulted in a full recovery of all pictures and videos that were stored on the SD card.

Introduction

It is probably fitting to start with a motivation for why this problem is complex; doing data recovery from a mass-production SD card seems like it should be a trivial operation (especially given the interface that SD cards present), but as will become clear, it is not. From there, I will discuss the different parts of the problem in detail, both in terms of how they physically work, and in terms of what it means from the standpoint of a data recovery engineer.

I begin with a brief history of the field. In the past ten years, solid-state data storage has become increasingly complex. Although flash memory was originally commercialized in 1988, it only began taking off as consumer mass storage recently. In August of 2000, COMPAQ (and later, HP) began producing the iPAQ h3100/h3600 series of handheld computers, which had between 16 and 64 MB of flash memory. This was approximately a standard capacity for the time period; the underlying technology of the flash device was called “NOR flash”, because of how the memory array was structured. NOR flash, in many regards, behaved like classic ROM or SRAM memories: it had a parallel bus of address pins, and it would return data on the bus I/O pins in a fixed amount of time. The only spanner in the works was that writes could only change bits that were previously ones to zeroes; in order to change a zero back to a one, a whole block of bits (generally, around 16 KBytes) had to be erased at once.

This was okay, though, and we learned to work around these issues. NOR flash had a limited erase life – perhaps only some millions of erases per block – so filesystems on the device generally needed to be specially designed to “wear-level” (i.e., scatter their writes around the device) in order to avoid burning an unusable hole in the flash array. Even so, it still appeared a lot like a block device that operating systems knew how to deal with; indeed, since it looked so much like SRAM, it was possible to boot from it on an embedded system, given appropriate connections of the bus pins.

Around 2005, however, NOR flash ran into something of a problem – it stopped scaling. As the flash arrays became larger, the decode logic began to occupy more of the cell space; further, NOR flash is only about 60% as efficient (in terms of bits per surface area) as its successor. To continue a Moore's Law-type expansion of bits per flash IC, flash manufacturers went to a technology called NAND flash. Unfortunately, as much as it sounds like the difference between NOR flash and NAND flash would be entirely internal to the array, it isn't: the external interface, and characteristics of the device, changed radically.

As easy as NOR flash is to work with, NAND flash is a pain. NOR flash has a parallel bus in which reads can be executed on a word-level; NAND flash is operated on with a small 8-bit wide bus in which commands are serialized, and then responses are streamed out after some delay. NOR flash has an amazing cycle life of millions of erases per block; modern NAND flashes may permit only tens of thousands, if that. NOR flash has small erase blocks of some kilobytes; NAND flash can have erase blocks of some megabytes. NOR flash allows arbitrary data to be written; NAND flash imposes constraints on correlating data between adjacent pages. Perhaps most distressingly of all, NOR flash guarantees that the system can read back the data that was written; NAND flash permits corruption of a percentage of bits.

In short, where NOR flash required simply a wear-leveling algorithm, modern NAND flash requires a full device-management algorithm. The basic premises of any NAND device-management algorithm are three pieces: data decorrelation (sometimes referred to as entropy distribution), error correction, and finally, wear-leveling. Oftentimes, the device management algorithms are built into controllers that emulate a simpler (block-like) interface; the Silicon Motion SM2683EN that was in my damaged card is marketed as an “all-in-one” SD-to-NAND controller.

Data extraction

Of course, none of the device-management information is relevant if the data can't be recovered from the flash IC itself. So, I started by building some hardware to extract the data. I had a Digilent Nexys-2 FPGA board lying around, which has a set of 0.1" headers on it; those headers are good to around 20MHz, which means that with some care, I should be able to interface it directly with the NAND flash.

A bigger problem that I had facing me was that the NAND flash was physically damaged. The pins still had pads on them, ripped from the board; the pins were also bent. Additionally, the part was in a TSSOP package, which was too small for me to solder directly to. I first experimented with doing a “dead-bug” soldering style – soldering AWG 36 leads directly to each pin – but this proved ultimately too painful to carry out for the whole IC. Ultimately, I settled on using a Schmartboard; I sliced it in half, and allowed each side to self-align. This meant that I didn't have to worry about straightening both sides at the same time – as long as I got them each individually, I could get a functional breakout from the flash IC. (The curious reader might enjoy some photos of my various attempts to re-assemble the NAND flash.)

The next question was the mechanics of using the FPGA to transfer data back to the host. I used Digilent's on-board EPP-like controller to communicate with the FPGA; ultimately, I created a mostly-dumb state machine interface on the FPGA, which could be instructed to perform a small number of tasks. It knew how to latch a command into the FPGA, it knew how to latch data in and out of the FPGA, and it knew how to wait for the flash to become ready – beyond that, everything was handled in a wrapper library that I wrote on the host side. (Verilog source for this chunk of code, which I called ndfslave, is available; if you ever need some sample RTL to read for how to communicate with the Digilent EPP controller, this might be useful.)

To test that my interface was working correctly and use it, I wrote three tools:

id: The id tool interrogated the flash's identification registers, and decoded all of the registers as best it was able. This was a basic diagnostic proving that both commands worked and data reads worked. (source: flash-det.c)
iobert: IOBERT stands for “I/O Bit Error Rate Test”. Instead of simply reading the identification registers once, it spun in a tight loop repeatedly rereading them, and comparing them on each iteration. When I built the interface, the length of the wires (and lack of termination) concerned me, so I wanted to make sure that there weren't any timing or integrity marginality issues. I ran the BERT test overnight, transferring some 25GB, without error. (source: iobert.c)
flash: This was the main driver; its purpose in life was to dump the flash device. Since the flash device is split into two chip-enables, it dumps each in sequence into their own files. With some optimizations, the flash tool peaked around 1.2 MByte/second – a far cry from the limits of either the NAND flash or the USB interface, but still sufficiently performant to dump both chip selects in about 6.5 hours. (source: flash.c)

One curious challenge was that no datasheet existed for the chip that I had. I used the closest available, but even that did not mention some important characteristics of the chip – for instance, it did not discuss the fact that there is an “address hole” between page 0x80000 and page 0xFFFFF, and that the rest of the data begins at 0x100000. (This comes from the fact that this flash IC is a “multi-level” flash chip, which means that each cell in the array stores three bits, not just one. The address hole exists, then, because three is not an even power of two...). The datasheet also does not mention that the chip has multiple chip-selects.

The end result of this phase was a pair of 12GiB files, flash.cs0 and flash.cs1, which contained the raw data from the NAND flash medium.

Unwhitening

Modern NAND flash devices have a bunch of difficult constraints to work with, but one that might not be obvious is that it matters what data you write to it! It is possible to construct some “worst case” data patterns that can result in data becoming more easily corrupted. As it turns out, the “worst case” patterns are often when the data we write is very similar to other data nearby; to solve this, then, a stage in writing data to NAND flash is to “scramble” it first.

To understand how this can be, I've included an image to the right that shows a schematic of how the building blocks of NAND flash work internally (adapted from this drawing, from Wikipedia). This drawing shows one column of a NAND flash that has four pages; in total, four bits are represented here. To read from one page of flash, a charge is placed on the bitline, and then all of the wordlines (labeled 'WL') except for the one that we wish to sense are turned on. This means that charge can flow through the flash transistors that we're not trying to sense, and the one that we're trying to sense controls whether the charge stays on the bitline or whether it flows to ground. Then, some time later, we can sense whether that flash transistor was programmed or not by telling how much charge was left on the bit line. (This is a very simplified discussion, that may not make much sense without some background in the field. That's okay; this won't be on the final exam. The Wikipedia article on flash memory explains in some more detail, if you're interested.)

The problem with this is that the bypassed transistors (that is to say, the ones that we're skipping over – the ones that we're not sensing) are not perfect. Even though they are supposed to conduct when they are being bypassed, if they have been programmed, they might conduct slightly less well. If a few of the bits on that bitline have been programmed, that's not a problem; but if all of them except for the one that you're trying to read have been programmed, then it might take quite a lot longer for the bitline to become discharged, and the sense amp that reads the charge later could incorrectly read the transistor you're trying to sense as having been programmed.

This has only become a problem recently, as we've moved to more modern technologies. On larger processes, the effects that caused this to happen were apparently less likely; to compound the problem, multi-level flash chips, which use multiple voltage levels to encode multiple bits in a single transistor, require even more sensitivity in the sense amps. So, if other cells on the bitline are adding extra resistance, there no longer needs to be a flip all the way from one end to the other, but a small change in the middle can happen, which could still disrupt the read.

Even if you didn't understand the mechanics of it, what this means from the perspective of programming the device is that if many adjacent pages all have the same value, a page that has a differing value might be very hard to read, and might often flip to the “wrong” value when you try to read it. This affects each bit in the page individually, so if it happens to all of the bits, then even error correction will not be possible. The upshot of this is a new constraint on NAND flash: pages in the same block shouldn't have data that's excessively correlated.

The solution to this problem, then, is to add another step – I'll take the obvious name, “data decorrelation”. (Sometimes, other people call it “data scrambling”, “entropy distribution”, or “data whitening”.) The basic idea is to mix deterministic noise into the data on a per-page basis, and hope that the noise doesn't have the same pattern as the data that the user is trying to store. On flash, the data will look like garbage, so to get real data back out, you must then remove the noise – a process that I call “unwhitening”.

Mechanics

I started off without knowledge of the reason for scrambling the on-flash data, and began by looking for a one-byte XOR key that would find the FAT32 filesystem marker somewhere on the device. I did an exhaustive search of all possible one-byte XOR keys, while scanning the device for the FAT32 marker; one came up with a result, but not at a sensible offset, and no other plaintext that I would expect from the device (for instance, no DCIM or DSC0xxxxJPG). (I originally called this tool xor-me, but that tool eventually morphed into a generic tool to xor two files together.)
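The exhaustive one-byte search is easy to sketch in Python. This is not the original xor-me source: the function name, the synthetic test image, and the choice of the 8-byte string `FAT32   ` as the marker are assumptions made for illustration. Rather than XORing the whole image 256 times, the sketch XORs the short marker with each candidate key and searches for the result, which is equivalent.

```python
def find_single_byte_keys(image: bytes, marker: bytes):
    """Return (key, offset) pairs where marker appears once image is XORed with key."""
    hits = []
    for key in range(256):
        # Scrambling the marker is equivalent to unscrambling the image, and cheaper.
        needle = bytes(b ^ key for b in marker)
        offset = image.find(needle)
        while offset != -1:
            hits.append((key, offset))
            offset = image.find(needle, offset + 1)
    return hits


# Synthetic stand-in for a dump: a marker at offset 82, scrambled with key 0x5A.
plain = b"\x00" * 82 + b"FAT32   " + b"\x00" * 16
scrambled = bytes(b ^ 0x5A for b in plain)
hits = find_single_byte_keys(scrambled, b"FAT32   ")
print(hits)
```

On a real dump a hit only counts if other expected plaintext shows up under the same key, which is exactly the check that failed here.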

Since I wasn't recovering anything from a one-byte key, I wanted to know whether there was anything sensible to be recovered from an otherwise-short key. I figured that if there were real data around, the distribution of bytes would not be uniform, so I wrote a program that I called entropy to do a distribution analysis both on byte values and bits per byte. Irritatingly, I found that the distribution was pretty even, indicating that I was unlikely to find any real data around by simple methods.
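A distribution analysis of this kind might look like the following sketch. It is not the original entropy tool; the function names and the use of Shannon entropy as the summary statistic are my assumptions. Well-whitened data should score close to 8 bits per byte and close to 0.5 for every bit position, while typical filesystem data scores noticeably lower.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte (0 to 8)."""
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def ones_per_bit_position(data: bytes):
    """Fraction of bytes that have each bit position set (index 0 is the LSB)."""
    return [sum((b >> bit) & 1 for b in data) / len(data) for bit in range(8)]


uniform = bytes(range(256)) * 4      # every byte value equally likely
mostly_zero = b"\x00" * 1000 + b"DCIM"
print(byte_entropy(uniform), byte_entropy(mostly_zero))
print(ones_per_bit_position(uniform))
```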

Around this time, I was talking with Jeremy Brock, of A+ Perfect Computers. I was pasting him some samples of some pages on the device, and he recognized one of them as looking like an XOR key that he'd seen before. This was progress! I wrote a tool that looked for that pattern, matching on blocks in which it appears, and doing a probabilistic analysis on the rest of the block to figure out what bytes were likely to appear there (I reasoned that either zeroes or FFs were the most likely byte on the medium). Since the tool seemed to be looking for the known-pattern needle in the haystack of the image, I called it haystack-me.

It produced vaguely satisfactory output – I started to see directory structures – but some blocks didn't have the needle in them at all, and so I was still missing more than half of the medium. I was considering doing a more intelligent search, looking for common patterns (I didn't want to come up with patterns lost in the noise), but I decided to run an experiment first – I wanted to know if the most naïve thing possible could work. I wrote a tool to do a probabilistic analysis on a byte-by-byte basis in each row, without regard to any other byte in the row; from some of the results in haystack-me, I knew that the key repeated every 64 rows, so I picked the most common byte for each column in each row (mod 64). Since the algorithm was somewhat stupid, I called this tool dumb-me.

Both of these tools produced a “confidence” value for each byte in the extracted pattern, based on how probable the selected key byte was in relation to the second most-common byte. dumb-me produced satisfactory results with almost no further tweaking; it ended up being the method I chose to extract the key.
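The per-column majority vote can be sketched as follows. This is an illustration rather than the dumb-me source: the function name and the exact confidence formula (winner's count divided by the runner-up's count) are assumptions, and the sketch further assumes that the most common plaintext byte is 0x00, so that the most common raw byte at a position is the key byte itself.

```python
from collections import Counter


def extract_key(rows, period):
    """Guess a repeating XOR key by majority vote.

    rows: list of equal-length bytes objects, one per flash row.
    Returns (key, confidence). key[r][c] is the most common byte at column c
    among rows whose index is r modulo period; confidence[r][c] is that byte's
    count divided by the runner-up's count (inf if there is no runner-up).
    """
    width = len(rows[0])
    key, confidence = [], []
    for r in range(period):
        group = rows[r::period]
        key_row, conf_row = bytearray(), []
        for c in range(width):
            ranked = Counter(row[c] for row in group).most_common(2)
            key_row.append(ranked[0][0])
            runner_up = ranked[1][1] if len(ranked) > 1 else 0
            conf_row.append(ranked[0][1] / runner_up if runner_up else float("inf"))
        key.append(bytes(key_row))
        confidence.append(conf_row)
    return key, confidence


# Synthetic example: a 2-row key, plaintext that is mostly zero bytes.
true_key = [b"\x12\x34\x56", b"\xab\xcd\xef"]
plain = [b"\x00\x00\x00"] * 6 + [b"abc", b"\x00\x00\x00"]
rows = [bytes(p ^ k for p, k in zip(row, true_key[i % 2])) for i, row in enumerate(plain)]
key, confidence = extract_key(rows, period=2)
print(key == true_key, confidence)
```

Low confidence values flag key bytes worth checking by hand, since a position where real data dominates the zero bytes would vote for the wrong value.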

Once I had extracted the key, I ran the entire image through a new, revised xor-me, to produce a full flash image that had been unwhitened.
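Applying the key is the simple part. A minimal sketch, assuming the key is held as a list of per-row byte strings and that rows are a fixed size (both are my assumptions about the data layout, and the function name is hypothetical):

```python
def unwhiten(image: bytes, key_rows, row_size: int) -> bytes:
    """XOR each row of image with key_rows[row_index % len(key_rows)]."""
    out = bytearray(len(image))
    for start in range(0, len(image), row_size):
        key = key_rows[(start // row_size) % len(key_rows)]
        chunk = image[start:start + row_size]
        out[start:start + len(chunk)] = bytes(a ^ b for a, b in zip(chunk, key))
    return bytes(out)


image = bytes(range(12))
key_rows = [b"\x01\x02\x03\x04", b"\xff\xff\xff\xff"]
once = unwhiten(image, key_rows, row_size=4)
twice = unwhiten(once, key_rows, row_size=4)
print(once.hex(), twice == image)
```

Because XOR undoes itself, the same routine both whitens and unwhitens, so running it twice returns the original image.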

Future work

The probabilistic method for extracting a key, of course, kind of sucks. It's bad because it's not a sure thing (it could always be fooled by sufficiently strange data), but it's also bad because it's not how the flash controller does the job. The key that I extracted, I treated as an opaque blob – that is to say, a chunk of data that has no real relation, and nothing interesting used to generate it. That works okay for my applications, but it can't possibly be the way a tiny flash controller does this.

To understand why, it's important to look at the size of the key that I extracted – 512 KB per chip select. This is a completely unreasonable amount of memory to have in the controller; SRAM is very area-expensive, and only relatively large microcontrollers these days have that amount of memory. Those microcontrollers likely sell for some $7 a shot, primarily driven in cost by the die area for the SRAM, so it's not really possible to drive the price lower for that much SRAM in a different application. $7 would be up to half (if not more) of the bill-of-materials cost for this SD card; clearly, this is untenable.

I suspect that the key pattern is generated, instead, by a pseudo random number generator. The easiest such to implement in silicon is an LFSR – a Linear Feedback Shift Register. (LFSRs happen to be good for quite a lot of things, actually; a basic coding theory textbook should cover many of them, and Wikipedia's article isn't half bad, either.) Good future work for the unwhitening step, then, would be to reverse engineer the generating LFSR, and generalize that step to other flash controllers.
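To show how little state such a generator needs, here is a sketch of a 16-bit Fibonacci LFSR. The width, the taps (16, 14, 13, 11, a textbook maximal-length choice) and the seed are purely illustrative assumptions; nothing here is known to match what the controller does.

```python
def lfsr16_step(state: int) -> int:
    """Advance a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11 by one bit."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)


def lfsr16_keystream(seed: int, count: int) -> bytes:
    """Produce count keystream bytes, taking the low state bit at each step."""
    if not 0 < seed < 0x10000:
        raise ValueError("seed must be a nonzero 16-bit value")
    state, out = seed, bytearray()
    for _ in range(count):
        byte = 0
        for _ in range(8):
            byte = (byte << 1) | (state & 1)
            state = lfsr16_step(state)
        out.append(byte)
    return bytes(out)


def lfsr16_period(seed: int) -> int:
    """Count steps until the state returns to seed."""
    state, steps = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        steps += 1
    return steps


print(lfsr16_keystream(0xACE1, 8).hex(), lfsr16_period(0xACE1))
```

Two bytes of state and a handful of XOR gates replace a stored table, which is why a generated pattern is far more plausible than 512 KB of SRAM.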

ECC recovery

Vendors manage to drive the cost of NAND flash lower by allowing the flash memory to be “imperfect”. At the scale at which these flash devices are produced, it's infeasible to expect a perfect yield – that is to say, every bit of every flash device fully functional – so allowing manufacture-time errors permits more of the NAND flash parts produced to be sold. Additionally, data errors come intrinsically with the scale of the device: as the physical size of each transistor becomes smaller, the number of electrons that can be trapped in the gate also becomes smaller, and so it takes even fewer electrons leaking in or out to cause a bit to be flipped.

All of these add up to the expectation of NAND flash being unreliable storage. Some NAND flash devices can have a bit error rate (BER) as high as 10^-5 mid-way through their lifetime¹⁾, which requires powerful error correcting schemes to bring the expected BER down to a reasonable level.

Basics of ECC

At first, the concept of improving a bit error rate without adding an extra low-error storage medium may seem somewhat far-fetched. Given an unreliable storage mechanism, how can any meaningful data be reliably extracted? The answer comes in the form of error correction codes – ECC, for short. For the sake of clarity, I will introduce some of the basic concepts of ECC here; sadly, modern high-performance codes require a deep understanding of linear algebra (indeed, one that I do not have!), so it is infeasible – and somewhat out of scope – to go into great detail in this article.

The underlying premise of any ECC is the concept of a codeword – a chunk of data that conveys additional error correcting information, on top of the user-visible data (the “input word”) that it is meant to encode. Needless to say, the codeword is always larger than the input word; ECC schemes are sometimes described as being (m,n) codes, where there are m bits in each codeword, conveying n bits of data. There is a bijection between valid codewords and input words.

These codes derive their functioning from the idea of a Hamming distance (named after Richard Hamming, one of the pioneers of modern information theory). The Hamming distance between two strings of bits is the number of bits that would need to be corrupted in order for one to be changed to the other. (By way of example, consider the strings 8'b01100110 and 8'b01010110; the Hamming distance between these two strings is 2, because two bits would need to be corrupted in order to change the first to the second.) In an error correction code, all codewords have at least a certain Hamming distance from any other valid codeword; this minimum distance specifies the properties of the code.
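For bit strings held as integers, the Hamming distance is the number of set bits in their XOR. A short sketch (the function name is mine) that checks the example above:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


print(hamming_distance(0b01100110, 0b01010110))
```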

Let's take some examples. A code with a minimum Hamming distance of 1 is no better than simply the input – there exists a single bit flip that could result in having another valid codeword, with no way of distinguishing that there was even an issue to begin with. A code with a minimum Hamming distance of 2 begins to offer some protection: if there is a single bit flip, the result no longer lands on a valid codeword, but it is not “closer” to either, so it's not possible to correct the error. A code with a minimum Hamming distance of 3 now allows us to correct a single error – if there is a single bit flip, the result no longer lands on a valid codeword, and it is also “closer” to one codeword than to any other, so it is possible to correct the error. However, since the minimum distance is 3, there exists at least one codeword for which two bit flips will result in silently “correcting” the error to the incorrect codeword; this scheme can also be described as a “single bit correction” system. By way of one more example, a minimum Hamming distance of 4 allows both the detection of two bits flipped, and the correction of one; this is a “Single Error Correction, Double Error Detection” (SECDED) scheme. (Such codes are very popular for volatile ECC memory with low bit error rates.)
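The pattern in these examples generalizes: a code with minimum distance d can correct t errors while still detecting s errors (with s at least t) whenever d is at least t + s + 1. The sketch below computes the minimum distance of a small codebook by brute force and the largest such t and s; the function names and the two toy codebooks are my own illustrations.

```python
from itertools import combinations


def min_distance(codewords):
    """Smallest Hamming distance between any two distinct codewords."""
    return min(bin(a ^ b).count("1") for a, b in combinations(codewords, 2))


def capability(d: int):
    """Return (correctable, detectable) error counts for minimum distance d."""
    correctable = (d - 1) // 2
    detectable = d - 1 - correctable
    return correctable, detectable


repetition = [0b000, 0b111]                  # 1 data bit repeated three times
even_parity = [0b000, 0b011, 0b101, 0b110]   # 2 data bits plus a parity bit
for name, code in (("repetition", repetition), ("even parity", even_parity)):
    d = min_distance(code)
    print(name, d, capability(d))
```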

A simple ECC

To make this somewhat more concrete, I'll provide two examples of codes that are simple enough to construct and verify by hand. The first code is extremely simple, and has been used more or less since the dawn of time: for all values of n bits, we can produce a (n+1,n) code that has a minimum Hamming distance of 2 between valid codewords. The code works by taking all of the bits in the input word, and XORing them ²⁾ together; then, take the single output bit from the XOR operation, and append it to the end of the input. This is called a “parity” code; the bit at the end is sometimes referred to as the parity bit.

You can see an example of this at right. In the first example, I've provided a valid codeword, with the parity check bit highlighted in blue. When I introduced a single bit error in the second example – placing the codeword at a Hamming distance of 1 to the original – a simple calculation will show that the parity bit no longer matches the data within, and so the code can detect that the codeword is incorrect. However, the code can detect only a single-bit error: in the third example, I have introduced only one more bit worth of error (for a total of two error bits), and the codeword once again appears to be valid. (This matches with our intuition for a minimum Hamming distance of 2; proof that 2 is in fact the minimum is left to the reader.)
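The same three cases can be reproduced in a few lines. The bit pattern used here is an arbitrary example of my own, not the one from the figure.

```python
def parity_encode(bits):
    """Append one parity bit so that the codeword holds an even number of ones."""
    parity = 0
    for b in bits:
        parity ^= b
    return bits + [parity]


def parity_ok(codeword):
    """True if the XOR of all bits, parity bit included, is zero."""
    acc = 0
    for b in codeword:
        acc ^= b
    return acc == 0


codeword = parity_encode([1, 0, 1, 1, 0, 1, 0])
one_error = codeword[:]
one_error[2] ^= 1
two_errors = one_error[:]
two_errors[5] ^= 1
print(parity_ok(codeword), parity_ok(one_error), parity_ok(two_errors))
```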

Another simple ECC: row-column parity

The second code that I'll present is what I call a “row-column parity code”. In this version, I will give a (25,16) code: codewords are 25 bits long, and they decode to 16 bits of data. The code has a minimum Hamming distance of 4, which means that it can detect all two-bit errors, and correct all one-bit errors.

As you might expect from the name, this code is constructed by setting up a matrix, and operating on the rows and columns. To start, we will write the data word that we wish to encode – for this example, 1011 0100 0110 1111 – in four groups of four, and arrange the groups in a matrix. We will then compute nine parity operations: four for the rows, four for the columns, and one across all of the bits. (This is shown in the diagram at left.)

Although not terribly efficient, this “row-column” scheme provides a straightforward algorithm for decoding errors. At right, I show the same code, but with a single bit flipped; the bit that was flipped is highlighted in red. To decode, we recompute the parity bits on the received data portion, and compare them to the received parity bits. The bits that differ are called the error syndrome – to be specific, the syndrome refers to the word that is produced by XORing the received parity bits with the computed parity bits. To decode, we can go case-by-case on the error syndrome:

If the syndrome is all zeroes – that is to say, the received parity bits perfectly match the computed parity bits – then there is nothing to do; we have received a correct codeword (or something indistinguishable from one).
If the syndrome contains just one bit set – that is to say, only one parity bit differs – then there is a single bit error, but it is in the parity bits. The coded data is intact, and needs no correction.
If the syndrome has exactly three bits set, and one is a row, one is a column, and one is the all-bits parity, then there is a single bit error in the coded word. The row and column whose parity bits mismatch point to the data bit that is in error; flip that bit, and then the error syndrome should be zero once again.
If the syndrome is anything else, then there is an error of (at least) two bits, and we cannot correct it.

(It might not be immediately obvious why the all-bits parity bit is needed; I urge you to consider it carefully to understand what situations would cause this code to be unable to detect or correct an error without the addition of that bit.)
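The four decoding cases translate directly into code. The sketch below uses the data word from the example; the function names are mine, and it assumes the all-bits parity is taken over the sixteen data bits only.

```python
def rc_parity(data):
    """data: 4x4 list of bits. Returns 9 parity bits: 4 row, 4 column, 1 overall."""
    rows = [sum(row) % 2 for row in data]
    cols = [sum(col) % 2 for col in zip(*data)]
    overall = sum(rows) % 2          # parity of all sixteen data bits
    return rows + cols + [overall]


def rc_decode(data, parity):
    """Return (status, data), correcting at most one flipped data bit."""
    syndrome = [a ^ b for a, b in zip(rc_parity(data), parity)]
    bad_rows = [i for i in range(4) if syndrome[i]]
    bad_cols = [j for j in range(4) if syndrome[4 + j]]
    weight = sum(syndrome)
    if weight == 0:
        return "ok", data
    if weight == 1:
        return "parity bit error", data
    if weight == 3 and len(bad_rows) == 1 and len(bad_cols) == 1 and syndrome[8]:
        fixed = [row[:] for row in data]
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
        return "corrected", fixed
    return "uncorrectable", data


word = [[1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 1, 0], [1, 1, 1, 1]]
parity = rc_parity(word)
damaged = [row[:] for row in word]
damaged[2][1] ^= 1                   # single-bit error in the data
status, fixed = rc_decode(damaged, parity)
print(parity, status, fixed == word)
```

Flipping a second data bit in `damaged` before decoding makes the function report the word as uncorrectable instead of silently miscorrecting it.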

Linear codes

Before we put these codes aside, I wish to make one more observation about them. The codes above both share an important property: they are linear codes. Definitionally, a linear code is a code for which any two valid codewords can be added (XORed) together to produce another valid codeword. This has a handful of implications; for instance, if we consider the function ECC(d) to mean “create the parity bits associated with the data d, and return only those”, then we can derive the identity ECC(d1 XOR d2) = ECC(d1) XOR ECC(d2).

(This will become very important shortly, so make sure you understand why this is the case.)
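One way to convince yourself is to test the identity numerically on a small linear code. The sketch below uses a row/column/overall parity function over 16-bit words as a stand-in for ECC(d); the function name and bit ordering are my own choices. Every parity bit is an XOR of data bits, which is exactly why the identity holds.

```python
import random


def ecc(d: int) -> int:
    """Nine parity bits (4 row, 4 column, 1 overall) of a 16-bit word laid out 4x4."""
    bits = [(d >> (15 - i)) & 1 for i in range(16)]   # bits[0] is the MSB
    rows = [bits[4 * r] ^ bits[4 * r + 1] ^ bits[4 * r + 2] ^ bits[4 * r + 3]
            for r in range(4)]
    cols = [bits[c] ^ bits[c + 4] ^ bits[c + 8] ^ bits[c + 12] for c in range(4)]
    overall = rows[0] ^ rows[1] ^ rows[2] ^ rows[3]
    out = 0
    for b in rows + cols + [overall]:
        out = (out << 1) | b
    return out


rng = random.Random(0)
pairs = [(rng.getrandbits(16), rng.getrandbits(16)) for _ in range(1000)]
linear = all(ecc(a ^ b) == ecc(a) ^ ecc(b) for a, b in pairs)
print(linear, bin(ecc(0b1011010001101111)))
```

A consequence worth noting is that ECC(0) is always 0 for a linear code; a scheme that inverted its parity bits before storing them would break the identity by a constant.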

NAND flash implementation

Of course, neither of the codes above are really “industrial-strength”. In order to be able to recover data from high-bit-error-rate devices like modern NAND flash, we need codes that are substantially more capable – and along with that capability comes a corresponding increase in encoding and decoding complexity. Most flash devices these days demand the capabilities of a class of codes called BCH codes (although LDPC is also becoming popular); these codes have sizes on the order of (1094,1024) or larger, and work on those amounts of bytes at a time, rather than bits! Such codes have the capability of correcting many many bad elements in a codeword, not just one or two; the downside, however, is that their foundations are rooted deep in the theory of linear algebra.

One thing to note is that even a BCH code with a codeword size of 1094 bytes is not capable of filling the quantum of storage in NAND flash (a “page”). On the NAND device that I worked with, the page size was as large as 8832 bytes – nominally, 8192 bytes of user data, and 640 bytes of “out-of-band” data. So, to figure out the page format on my SD card, I started by looking at a page that had content that I knew – in particular, a certain page in my data dump, once it was unwhitened, appeared to have bits of a directory structure in it. In this page, I saw the pattern:

Start	End	Contents
0	1023	data
1024	1093	noise
...
6564	7587	noise (zeroes)
7588	7657	noise (common)
7658	8681	noise (zeroes)
8682	8751	noise (common)
8752	8777	more different garbage
8778	8831	filled with `0xFF`

It seemed pretty obvious: the code that was being used was the (1094,1024) code that I described above. Seeing the same data – the zeroes at the end of the page – encoding to the same codeword gave me some hope, too. (If you'd like to follow along, my original notes discussing this are in notes/page-format, through about line 45.) The garbage at the end of the page wasn't immediately clear, but I figured that it was for block mapping, and decided that I'd come back to that later.
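Read as a layout, the table describes eight back-to-back codewords of 1024 data bytes plus 70 ECC bytes, followed by an 80-byte trailer. A sketch of a splitter under that reading (the constant and function names are mine, and the assumption that the rows elided from the table follow the same 1094-byte stride is an inference from the offsets shown):

```python
PAGE_SIZE = 8832
DATA_LEN = 1024
ECC_LEN = 70
CODEWORDS = 8


def split_page(page: bytes):
    """Split a raw page into [(data, ecc), ...] plus the trailing metadata bytes."""
    if len(page) != PAGE_SIZE:
        raise ValueError("expected a %d-byte page" % PAGE_SIZE)
    stride = DATA_LEN + ECC_LEN
    sectors = []
    for i in range(CODEWORDS):
        base = i * stride
        data = page[base:base + DATA_LEN]
        ecc = page[base + DATA_LEN:base + stride]
        sectors.append((data, ecc))
    trailer = page[CODEWORDS * stride:]
    return sectors, trailer


page = bytes(i % 251 for i in range(PAGE_SIZE))   # synthetic page contents
sectors, trailer = split_page(page)
print(len(sectors), len(sectors[0][0]), len(sectors[0][1]), len(trailer))
```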

Decoding

Now that I knew where the codeword boundaries were, I was beginning to actually stand a chance of decoding the ECC! I started off by making some assumptions – I decided that it probably had to be a BCH code. Since my grounding in finite field theory is not quite strong enough to write a BCH encoder and decoder of my own, I grabbed the implementation from the Linux kernel, removed some of the kernelisms, and dropped it in as ecc/bch.c.

The next problem that I was about to face was that I wasn't sure what the whitening pattern for the ECC regions was. Although it was easy enough to guess by statistical analysis what the data regions would be, I didn't even know what the ECC regions were “supposed” to look like, so the same scheme as unwhitening the data region was unlikely to be fruitful. I pondered over this for a few days, and eventually came across a solution that I considered to be very elegant.

I discovered, in fact, that the ECC scheme didn't need any knowledge of whitening at all! As long as most pages were intact – which they were – I could come up with a scheme of correcting ECC errors without even running unwhitening. To come across this, it's helpful to think of a whitened data region as being represented by data_raw_n = data_n XOR d_whitener(n), and a whitened ECC region as being represented by ECC_raw_n = ECC_n XOR e_whitener(n). (We can leave the functions as opaque; and, of course, since XOR is its own inverse, the identities work the other way around as well.) From there, we can make the following transformations:

Start with the basic ECC identity –
ECC_n = ECC(data_n)
Next, unwrap into raw data –
ECC_raw_n XOR e_whitener(n) = ECC(data_raw_n XOR d_whitener(n))
Next, hoist the whitener out of the ECC function, by the identities for linear codes given above –
ECC_raw_n XOR e_whitener(n) = ECC(data_raw_n) XOR ECC(d_whitener(n))
Merge XOR functions that don't depend on the data –
ECC_raw_n = ECC(data_raw_n) XOR (ECC(d_whitener(n)) XOR e_whitener(n))
And then finally make the XOR functions that don't depend on the data into a constant –
ECC_raw_n = ECC(data_raw_n) XOR k_n

This is incredibly powerful! We have now reduced a function that wraps a whitener in complex and unpredictable ways into an XOR with a function of the block number. Since both of the whiteners that k_n depends on repeat, k_n repeats as well, so we can do a similar statistical analysis of ECC_raw_n XOR ECC(data_raw_n) in order to come up with an appropriate k_n series. (My raw notes from the time live in notes/ecc.)
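The statistical step can be sketched with any linear ECC standing in for the BCH code. Below, `toy_ecc` is just the XOR of all data bytes, which is linear but emphatically not the real code; the whitener values, page contents, and function names are all invented for the demonstration. The point is only that the majority value of ECC_raw_n XOR ECC(data_raw_n) recovers k_n without knowing either whitener, even when one page carries a bit error.

```python
from collections import Counter


def toy_ecc(data: bytes) -> int:
    """Stand-in linear ECC: XOR of all data bytes (NOT the real BCH code)."""
    acc = 0
    for b in data:
        acc ^= b
    return acc


def estimate_k(pages, period):
    """pages: list of (data_raw, ecc_raw). Majority-vote k for each n mod period."""
    votes = [Counter() for _ in range(period)]
    for n, (data_raw, ecc_raw) in enumerate(pages):
        votes[n % period][ecc_raw ^ toy_ecc(data_raw)] += 1
    return [v.most_common(1)[0][0] for v in votes]


# Hypothetical whiteners with period 2, unknown to estimate_k.
d_whitener = [b"\x11\x22\x33\x44", b"\xa5\x5a\xc3\x3c"]
e_whitener = [0x77, 0x99]
plain = [b"DCIM", b"\x00\x00\x00\x00", b"JPEG", b"\xff\xff\xff\xff", b"abcd", b"wxyz"]
pages = []
for n, data in enumerate(plain):
    data_raw = bytes(a ^ b for a, b in zip(data, d_whitener[n % 2]))
    pages.append((data_raw, toy_ecc(data) ^ e_whitener[n % 2]))

# Simulate a stored bit error in page 2; the vote should outnumber it.
pages[2] = (bytes([pages[2][0][0] ^ 0x01]) + pages[2][0][1:], pages[2][1])

k = estimate_k(pages, period=2)
expected = [toy_ecc(d_whitener[i]) ^ e_whitener[i] for i in range(2)]
print(k, expected)
```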

A curious property emerged when I did something like this. I found that the k_ns were all identical for every offset into the whitening pattern! This lends credence to the theory that the whitening pattern generator is somehow linked to the linear-feedback shift register internal to the BCH code.

Utilities

The tools for this were relatively straightforward, as these things go, but had a few hiccups. For instance, some pages were completely erased; attempting to correct them would not be terribly meaningful. Some pages had data, but none of it met with any sort of format expectations; I assumed that those were firmware blocks of some sort, and were not subject to either the same whitening scheme, or the same error correction scheme, as the rest of the device.

The largest complication was one that any expert in error correction codes would have foreseen by my explicit failure to mention it so far. One of the parameters that defines a BCH code is one that I haven't mentioned already – a “primitive polynomial” for the code. There can be many possible primitive polynomials for a given code size; in the Linux BCH implementation, the default primitive polynomial for a size-1024 code is 0x402b, but in the SD controller's implementation, the polynomial is 0x4443. If the polynomial differs, the error correcting code simply will not work at all on the stored data, and indeed, it didn't! This was a fairly baffling issue that took quite some staring before I figured out what was going on. Sadly, I didn't find a better mechanism than trial and error to find the appropriate polynomial; I found a list of polynomials of the correct size, and plugged in a few until I found one that worked. I suspect that it may be possible to figure this out deterministically with a modification of the BCH decode routine, but my knowledge of linear codes is not strong enough to implement such a thing myself.
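Trial and error goes faster if the candidate list contains only polynomials that really are primitive. The sketch below tests that property directly: a degree-m polynomial over GF(2) with a nonzero constant term is primitive exactly when x has multiplicative order 2^m - 1 modulo it. The function names are mine, and the degree m = 14 is an assumption read off the two 15-bit constants quoted above. Trying each surviving candidate against a BCH decoder is not shown.

```python
def gf2_mulmod(a: int, b: int, poly: int, m: int) -> int:
    """Multiply two GF(2) polynomials (as bit masks), reduced modulo poly of degree m."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= poly
    return result


def gf2_powmod(base: int, exp: int, poly: int, m: int) -> int:
    """Raise base to exp by square-and-multiply, modulo poly."""
    result = 1
    while exp:
        if exp & 1:
            result = gf2_mulmod(result, base, poly, m)
        base = gf2_mulmod(base, base, poly, m)
        exp >>= 1
    return result


def prime_factors(n: int):
    """Set of prime factors of n, by trial division."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors


def is_primitive(poly: int, m: int) -> bool:
    """True if poly (bit i = coefficient of x^i) is primitive of degree m >= 2."""
    if (poly >> m) != 1 or not (poly & 1):
        return False
    order = (1 << m) - 1
    x = 2
    if gf2_powmod(x, order, poly, m) != 1:
        return False
    return all(gf2_powmod(x, order // q, poly, m) != 1 for q in prime_factors(order))


candidates = [0x402B, 0x4443]        # the two constants mentioned in the text
usable = [hex(p) for p in candidates if is_primitive(p, 14)]
print(usable)
```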

The most interesting guts of the tool to decode this live in ecc/bch-me.c. My notes from above also showed how I extracted the ECC XOR constants.

¹⁾ See, for instance, http://www.snia.org/sites/default/files/SSSI_NAND_Reliability_White_Paper_0.pdf – there are other remarkable things there, such as a 100-fold increase in bit error rate just by executing read cycles!

²⁾ Recall that XOR is defined as: 0 XOR 0 = 0; 0 XOR 1 = 1; 1 XOR 0 = 1; 1 XOR 1 = 0