Solaris x86, Device DMA, and the DDI

I'm going to start a monthly blog entry on a DDI subject of your choice... Now that OpenSolaris is available, I can get into some decent detail referencing kernel code when needed.

So I'm seeking a topic for July... Anyone interested, submit your DDI topic of interest in the comments and I'll pick one for July....

For June, I'm going to talk about Solaris x86, device DMA, and the DDI... Mostly because that's what I have been spending some of my time on lately... I write code, not documents, so don't expect too much. I can't spell worth a damn... My grammar is poor, and I use the wrong words a lot. I'll probably jump around a lot too :-). Hopefully, you'll still get something meaningful out of this ;-)

I'm assuming for this entry you already know a little bit about the DDI DMA interfaces in Solaris. If not, you can look at the following manpages for a little background...

  • ddi_dma_alloc_handle(9F)
  • ddi_dma_free_handle(9F)
  • ddi_dma_addr_bind_handle(9F)
  • ddi_dma_buf_bind_handle(9F)
  • ddi_dma_unbind_handle(9F)
  • ddi_dma_sync(9F)
  • ddi_dma_getwin(9F)
  • ddi_dma_nextcookie(9F)
  • ddi_dma_attr(9S)
  • ddi_dma_cookie(9S)

The implementations of these routines live in sunddi.c; sunddi.o resides in genunix (/kernel/amd64/genunix). You'll see when you look at this code that most of these routines are just simple wrappers which eventually end up in architecture-specific code.

Jumping ahead a little, on x86 we end up in the rootnex driver. The rootnex driver is the x86 root nexus driver. Nexus drivers implement the busops interface in dev_ops. Basically, drivers are hierarchical: nexus drivers can have children, which can be either other nexus drivers or leaf drivers. A leaf driver is the last driver in a branch (i.e., it can't have any children). The root nexus driver is the root of the tree, similar to the root of a filesystem. Anyway, that's a subject for another entry. For now, just trust me that we end up in rootnex on x86 :-)

So a quick mapping of the code: on x86, most of these routines eventually end up in the rootnex driver, while ddi_dma_nextcookie stays in genunix...

Now let me be the first to say the rootnex DMA code is pretty rough... It should be changing soon, but for now, you have been warned... So now that you know where the code is, I'm going to jump back up to a higher level...

If you're still interested, you probably already know the normal sequence: allocate a DMA handle, bind a buffer, get your cookies (the physical addresses to DMA to/from), sync as needed, do your DMA, etc...
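To make that concrete, here's a rough sketch of the sequence. The function and variable names (dma_example, my_dip, my_attr, buf, len) are made up, and error handling is cut down to almost nothing:

    #include <sys/types.h>
    #include <sys/ddi.h>
    #include <sys/sunddi.h>

    /*
     * Minimal sketch of the usual DDI DMA sequence: allocate a handle,
     * bind a kernel virtual address, walk the cookies, sync, then tear
     * it all down.  Names are placeholders.
     */
    static int
    dma_example(dev_info_t *my_dip, ddi_dma_attr_t *my_attr, caddr_t buf,
        size_t len)
    {
        ddi_dma_handle_t handle;
        ddi_dma_cookie_t cookie;
        uint_t ccount, i;

        if (ddi_dma_alloc_handle(my_dip, my_attr, DDI_DMA_SLEEP, NULL,
            &handle) != DDI_SUCCESS)
            return (DDI_FAILURE);

        /* DDI_DMA_WRITE == memory to device; the first cookie comes back here */
        if (ddi_dma_addr_bind_handle(handle, NULL, buf, len,
            DDI_DMA_WRITE | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
            &cookie, &ccount) != DDI_DMA_MAPPED) {
            ddi_dma_free_handle(&handle);
            return (DDI_FAILURE);
        }

        for (i = 0; i < ccount; i++) {
            /* program cookie.dmac_laddress / cookie.dmac_size into the device */
            if (i + 1 < ccount)
                ddi_dma_nextcookie(handle, &cookie);
        }

        /* flush CPU writes (and any copy buffer) out to memory before the DMA */
        (void) ddi_dma_sync(handle, 0, len, DDI_DMA_SYNC_FORDEV);

        /* ... kick off the DMA, wait for it to complete ... */

        (void) ddi_dma_unbind_handle(handle);
        ddi_dma_free_handle(&handle);
        return (DDI_SUCCESS);
    }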

When a DMA handle is allocated, the rootnex driver will do some validation on the ddi_dma_attr and pre-allocate some state for you. Nothing very exciting... The fun stuff happens in the bind code, which will be the topic for the rest of this entry... Instead of walking through the code, I'll walk through the concepts... The code should be changing soon, so I don't want to spend a lot of time on code which may not be the same by the time you read this. But first, some terminology I'll be using, which doesn't always match up with other folks' terminology. Sometimes I like to redefine things too :-).

  • Scatter/Gather List (SGL) - a list of physically contiguous buffers
  • Cookie - a single physically contiguous buffer, i.e. an SGL element
  • SGL Length - the maximum number of cookies/SGL elements the DMA engine supports
  • Copy Buffer - a bounce buffer/intermediate buffer. Used as a temporary buffer to DMA to/from when the DMA engine can't reach the physical address we are supposed to DMA into.
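
Since the ddi_dma_attr fields come up over and over below, here's what a filled-in ddi_dma_attr_t might look like for a made-up device with a 32-bit DMA engine. The values are purely illustrative; your hardware dictates the real ones.

    /* Illustrative ddi_dma_attr_t for a hypothetical 32-bit DMA engine. */
    static ddi_dma_attr_t my_dma_attr = {
        DMA_ATTR_V0,            /* dma_attr_version */
        0x0000000000000000ULL,  /* dma_attr_addr_lo: low DMA address limit */
        0x00000000FFFFFFFFULL,  /* dma_attr_addr_hi: high DMA address limit */
        0x00000000FFFFFFFFULL,  /* dma_attr_count_max: max byte count per cookie */
        0x1000,                 /* dma_attr_align: only used by ddi_dma_mem_alloc(9F) */
        0x7F,                   /* dma_attr_burstsizes */
        0x1,                    /* dma_attr_minxfer */
        0x00000000FFFFFFFFULL,  /* dma_attr_maxxfer: max transfer size */
        0x00000000FFFFFFFFULL,  /* dma_attr_seg: boundary a single cookie can't cross */
        16,                     /* dma_attr_sgllen: max cookies the H/W can handle */
        512,                    /* dma_attr_granular */
        0                       /* dma_attr_flags */
    };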

Jumping to the fun stuff, the first concept in the bind is how the buffer is passed down to bind. It can be a kernel virtual address (KVA) with a size, a linked list of physical pages (without a kernel address), or an array of physical pages (with a kernel virtual address [shadow I/O]). For each page in the buffer, the rootnex driver has to make sure that the DMA engine can reach the physical address. There is a DMA engine low address limit and a high address limit passed in the ddi_dma_attr during ddi_dma_alloc_handle(9F) which the rootnex driver uses to do this.
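
Conceptually, the per-page check boils down to something like this (my sketch, not the actual rootnex code):

    /*
     * A page needs the copy buffer if its physical address falls outside
     * the [addr_lo, addr_hi] range the driver passed in the ddi_dma_attr.
     */
    static boolean_t
    page_needs_copybuf(uint64_t paddr, ddi_dma_attr_t *attr)
    {
        return ((paddr < attr->dma_attr_addr_lo ||
            paddr > attr->dma_attr_addr_hi) ? B_TRUE : B_FALSE);
    }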

For every page which can't be reached, the rootnex driver will use part of a copy buffer. For these pages, the device will DMA into the copy buffer and not the actual buffer. The data will be copied to/from the copy buffer when the driver calls ddi_dma_sync(9F). So the driver writer had better make sure the syncs are in the right place and the direction is correct! Continuing... The copy buffer has a fixed maximum size. Each bind will get its own copy buffer if needed. If the amount of copybuf required in a single bind is greater than the maximum size of a copy buffer, the bind will need to be a partial bind and will require multiple windows. This is a concept I'll talk about further down...
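
To make the "right place, right direction" point concrete, here's roughly where the syncs go (hypothetical helpers; handle and len are whatever your bind used):

    /* Memory-to-device (DDI_DMA_WRITE bind): sync *before* starting the DMA */
    static void
    pre_dma_to_device(ddi_dma_handle_t handle, size_t len)
    {
        (void) ddi_dma_sync(handle, 0, len, DDI_DMA_SYNC_FORDEV);
        /* ... now tell the device to start the transfer ... */
    }

    /* Device-to-memory (DDI_DMA_READ bind): sync *after* the DMA completes */
    static void
    post_dma_from_device(ddi_dma_handle_t handle, size_t len)
    {
        /* ... called once the transfer-done interrupt has fired ... */
        (void) ddi_dma_sync(handle, 0, len, DDI_DMA_SYNC_FORKERNEL);
    }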

What happens when a linked list of physical pages without a KVA comes down, you ask? Good question! Well, currently, the rootnex driver will allocate some KVA space (vmem_alloc) without physical memory to back it up and then map it to the physical page on the fly during sync. Not pretty. This should be changing for the 64-bit kernel in the near future (homework: what is seg kpm?). How come the DMA engine can reach the copy buffer, but can't reach the original DMA buffer, you ask? You guys are good... Well, most DMA buffers originate from userland or from a kernel stack which has no idea what the constraints of the DMA engine are (and it shouldn't, since there may be multiple DMA engines with different constraints). The copy buffer is allocated from the same underlying routines that ddi_dma_mem_alloc(9F) uses, which take the DMA engine's constraints into account; i.e. the copy buffer is allocated specifically for the DMA engine we are using...

The copybuf code path got, and is still getting, a lot of usage in s10 and above once we went to a 64-bit kernel on x86. The number of x86 machines with > 4G of memory has gone up tremendously since you can actually use the memory more efficiently these days. OK, maybe efficiently isn't the right word, but you get my point... ;-) A lot of devices only support 32-bit DMA addresses, so they correctly set their DMA high address to 0xFFFF.FFFF. Any physical address above this will require a copy buffer on x86 (on SPARC, we have an IOMMU, so it doesn't have this problem, but that's a different entry).

Jumping to the side for a sec... don't confuse 64-bit DMA addresses with a 64-bit card. You may have a 32-bit/33MHz PCI card which supports 64-bit addresses via dual address cycles (DAC), you may have a 64-bit/66MHz PCI card which only supports 32-bit DMA addresses, or you could have an x8 PCI Express card which only supports 32-bit DMA addresses. The speed of the card and the number of bytes that can be transferred in a clock have nothing to do with the DMA address width. If a device only supports a 32-bit DMA address, it will not be able to reach memory above 4G and will require a copy buffer.

Jumping back. It gets more interesting from here. Memory organization on SMP Opteron systems is very similar to our SPARC systems. The memory controller is in the CPU chip (which could have multiple cores). So if I have a two-chip Opteron-based system, I have at least 2 memory controllers. Solaris is smart and will allocate memory closest to the core you are running on. Going back to the two-chip Opteron system: if the system has 16G of memory and I am a process running on chip 2, when I allocate memory, its physical address will be above 8G (0-8G is attached to chip 1). So all I/O on chip 2 will need a copy buffer for a DMA engine with a high address limit of 0xFFFF.FFFF. Lesson learned: if you want performance on this type of system, use a device which supports 64-bit DMA addresses. And if your device supports 64-bit DMA addresses, make sure the driver supports them too!
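
In ddi_dma_attr terms, it all comes down to one field. A made-up helper (reusing the illustrative attr style from the sketch above) to show the point:

    /* Illustrative only: dma_attr_addr_hi decides whether the copy buffer is used. */
    static void
    set_dma_addr_limit(ddi_dma_attr_t *attr, boolean_t supports_64bit_dma)
    {
        if (supports_64bit_dma) {
            /* 64-bit capable device *and* driver: no copy buffer needed */
            attr->dma_attr_addr_hi = 0xFFFFFFFFFFFFFFFFULL;
        } else {
            /* 32-bit engine: memory above 4G bounces through the copy buffer */
            attr->dma_attr_addr_hi = 0x00000000FFFFFFFFULL;
        }
    }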

OK, enough about copy buffers. Jumping back to ddi_dma_attr for a moment. dma_attr_align is used during ddi_dma_mem_alloc(9F); don't expect it to do anything for you in the bind. dma_attr_count_max and dma_attr_seg limit the size of a cookie. If I have a 1M buffer which is physically contiguous, normally I would get an SGL length of 1 and the single cookie would be 1M in size. If I set seg or count_max to 256K-1, I would get an SGL length of 4 or 5 (depending on whether the start address was aligned to 256K) where each cookie would be <= 256K in size. Why do we have both seg and count_max? Don't know...
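
If you want to do the arithmetic yourself, a little made-up helper (not a DDI routine) shows where the 4-or-5 comes from when seg is the limiting factor:

    /*
     * How many cookies a physically contiguous buffer breaks into when
     * dma_attr_seg keeps a cookie from crossing a (seg + 1)-byte boundary.
     */
    static uint_t
    contig_cookie_count(uint64_t paddr, size_t len, uint64_t seg)
    {
        uint64_t first = paddr / (seg + 1);
        uint64_t last = (paddr + len - 1) / (seg + 1);

        return ((uint_t)(last - first + 1));
    }

    /*
     * 1M buffer with seg = 0x3FFFF (256K - 1):
     *   start aligned to 256K -> 4 cookies of 256K each
     *   start not aligned     -> 5 cookies (first and last are short)
     */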

OK, we finally arrive at windows... The fun stops here. Basically, a window is supposed to be a piece of a DMA bind that fits within the DMA constraints. I.e., if I have a bind which the DMA engine cannot handle in a single transfer, and the driver/device supports partial mappings, the DDI is supposed to break it into multiple windows where each window can be handled by the DMA engine. Again, jumping back to ddi_dma_attr, there are three things which should require the use of multiple windows during a bind:

  • We need more copybuf space than the maximum copy buffer size allowed
  • The number of cookies required to bind the buffer is greater than the maximum number of cookies the H/W can handle (dma_attr_sgllen)
  • The size of the bind is greater than the maximum transfer size of the DMA engine (dma_attr_maxxfer)

But, as a historical note, that's not the way it was originally implemented in the original x86 port. At the time this was written, the only time you will get multiple windows is when we need more copybuf space than the maximum copy buffer size allowed. This should be fixed shortly, but you will still have to handle how the current implementation works for the driver to operate correctly on s10 and before (I'll explain what that behavior is shortly). Don't worry though: once this is fixed, a driver which handles the old behavior will still work great with the correct behavior.
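
For reference, this is roughly how a driver that supports partial mappings is supposed to walk windows using the documented interfaces (names made up, error handling trimmed):

    /*
     * Ask for DDI_DMA_PARTIAL; if the bind comes back DDI_DMA_PARTIAL_MAP,
     * walk the windows one at a time with ddi_dma_getwin(), doing one
     * transfer per window.
     */
    static int
    partial_bind_example(ddi_dma_handle_t handle, caddr_t buf, size_t len)
    {
        ddi_dma_cookie_t cookie;
        uint_t ccount, nwin, win;
        off_t off;
        size_t winlen;
        int rc;

        rc = ddi_dma_addr_bind_handle(handle, NULL, buf, len,
            DDI_DMA_READ | DDI_DMA_STREAMING | DDI_DMA_PARTIAL,
            DDI_DMA_SLEEP, NULL, &cookie, &ccount);

        if (rc == DDI_DMA_MAPPED) {
            nwin = 1;                       /* whole buffer fit in one window */
        } else if (rc == DDI_DMA_PARTIAL_MAP) {
            (void) ddi_dma_numwin(handle, &nwin);
        } else {
            return (DDI_FAILURE);
        }

        for (win = 0; win < nwin; win++) {
            if (nwin > 1 && ddi_dma_getwin(handle, win, &off, &winlen,
                &cookie, &ccount) != DDI_SUCCESS)
                break;
            /* program this window's ccount cookies, do the DMA, sync, repeat */
        }

        (void) ddi_dma_unbind_handle(handle);
        return (DDI_SUCCESS);
    }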

Once we need multiple windows, the rootnex has to worry about the granularity of the device (dma_attr_granular). A device can only transfer data in whole multiples of the granularity, e.g. if the granularity is set to 512, the size of a window must be a multiple of 512. So when the rootnex gets to the end of a window, it sometimes has to subtract some data from the current window and put it into the next window to ensure the current window size is a multiple of the granularity. This is referred to as trimming in the code. This gets pretty complicated with the way the rootnex DMA code is currently architected, and it was the source of a fair number of bugs which I had to put some not-so-obvious hacks in there to fix...
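
The trim itself is conceptually simple; something like this (my sketch, not the rootnex code):

    /*
     * Round a proposed window size down to a multiple of dma_attr_granular
     * (assumed > 0); whatever gets chopped off moves to the next window.
     */
    static size_t
    trim_to_granularity(size_t winsize, uint32_t granular)
    {
        return (winsize - (winsize % granular));
    }

    /*
     * e.g. granular = 512, proposed window of 65,700 bytes ->
     *      trimmed window of 65,536 bytes; the other 164 bytes move to the
     *      next window.
     */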

And last, but not least, what happens today in a bind when the driver supports a partial bind and one of these two conditions is hit:

  • The number of cookies required to bind the buffer is greater than the maximum number of cookies the H/W can handle (dma_attr_sgllen)
  • The size of the bind is greater than the maximum transfer size of the DMA engine (dma_attr_maxxfer)

Tune in next week...

Sorry, couldn't resist... Well, I don't think there's an official word for it, but I'm going to make up something as I type, because, remember, I like to do that sort of thing ;-). We get a superwindow, where a superwindow is a window larger than the DMA engine can handle. However, a superwindow is properly trimmed at the points where the conditions above are hit. So when the driver is going through the cookies, if the next cookie puts it over the DMA engine's sgllen or maxxfer size, it can consider that cookie the start of the next window. This puts more work back on the driver writer. Of course, if you've already written a Solaris driver for x86 which supports partial mappings, you have probably already figured that out :-/.
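
In driver terms, that cookie walk looks something like this (made-up helper; sgllen and maxxfer are your own ddi_dma_attr values):

    /*
     * Walk the cookies of a "superwindow" and start a new hardware
     * transfer whenever the next cookie would blow past dma_attr_sgllen
     * or dma_attr_maxxfer.
     */
    static void
    walk_superwindow(ddi_dma_handle_t handle, ddi_dma_cookie_t *cookie,
        uint_t ccount, int sgllen, uint64_t maxxfer)
    {
        uint_t i;
        int sgl_used = 0;
        uint64_t xfer_used = 0;

        for (i = 0; i < ccount; i++) {
            if (sgl_used == sgllen ||
                xfer_used + cookie->dmac_size > maxxfer) {
                /* ... submit what we have so far as one transfer ... */
                sgl_used = 0;
                xfer_used = 0;
            }

            /* ... add cookie->dmac_laddress / dmac_size to the current SGL ... */
            sgl_used++;
            xfer_used += cookie->dmac_size;

            if (i + 1 < ccount)
                ddi_dma_nextcookie(handle, cookie);
        }

        /* ... submit the final transfer ... */
    }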

Well, that's enough for this month. I have code to finish up and putback. Don't forget to submit your DDI topic of interest in the comments section for next month...

 