General SCSI Docs

http://www.andante.org/scsidoc/index.html

Introduction

This document is an attempt to fully describe the SCSI subsystem in the Linux kernel. At the time of this writing the document is incomplete: many of its sections are not yet written.

The long-term goal is that this document will be a definitive description of how the kernel code should work. Any discrepancies between the actual kernel code and this document indicate a bug in either the kernel or in the documentation.

Another, self-serving, goal of mine in writing all of this down is to make it easier for other people to understand what the SCSI subsystem is doing, so that anyone who feels the need to change something has better odds of doing something productive.

2. Data Structures

 

2.0 Introduction

 

Before it is possible to explain how the actual SCSI subsystem works, it is essential that you first understand some of the key data structures which are used all throughout the subsystem. The purpose of this section of the document is to explain what those structures are, and then explain in detail what the structure elements are used for.

The other sections of the Linux SCSI documentation will be linked to this section, as appropriate.

In addition to explaining the purpose of each element of the structure, it is also important to mention exactly who should be allowed to use it.

When originally designed, the data structures were intended to be quasi-object-oriented. Given that the Linux kernel is compiled with the C compiler instead of C++, we have to fake it to some degree - one unfortunate side effect of this is that there is no way to declare private members of the class, and in addition it isn't possible to declare private member functions which are not globally accessible.

Complicating this is the fact that some structure elements are created on behalf of different sections of the SCSI subsystem. For example, some elements may be used entirely by the low-level drivers and not touched anywhere else. Other elements might be used by the mid-level, and shouldn't be touched elsewhere. At some point, it might be a good idea to split some of these structures so that the elements are grouped by who is allowed to use them - in the meantime, this documentation will have to suffice.

Alternatively, the elements that exist on behalf of low-level drivers could be abstracted behind something like the hostdata pointer, which one of the structures already has.

Another factor is that some fields in the structures are used by obsolete parts of the SCSI subsystem, and could in principle be removed once those obsolete portions of the subsystem have been removed.

2.1 Host

The Scsi_Host structure is used to describe each instance of a host adapter on the system. In other words, if you have two Adaptec 1542 cards in your machine, there will be two instances of Scsi_Host, one for each.

The Scsi_Host structure is unique amongst all of the other structures in the sense that it is a structure name, not a typedefed data type. There is no good technical reason for this - I just never got around to it.

2.1.1 Structure definition

 

struct Scsi_Host
{

};
SCSI_STATE_TIMEOUT, SCSI_STATE_FINISHED, SCSI_STATE_FAILED, SCSI_STATE_QUEUED, SCSI_STATE_UNUSED, SCSI_STATE_DISCONNECTING, SCSI_STATE_INITIALIZING, SCSI_STATE_BHQUEUE, SCSI_STATE_MLQUEUE

SCSI_OWNER_HIGHLEVEL, SCSI_OWNER_MIDLEVEL, SCSI_OWNER_LOWLEVEL, SCSI_OWNER_ERROR_HANDLER, SCSI_OWNER_BH_HANDLER, SCSI_OWNER_NOBODY

SUCCESS, FAILED, SCSI_STATE_TIMEOUT, SCSI_STATE_QUEUED

WAS_RESET, WAS_TIMEDOUT, WAS_SENSE, IS_RESETTING, IS_ABORTING, ASKED_FOR_SENSE

The status byte is the status that is returned from the device itself.
The host_byte is the status that is returned from the host adapter.
The driver_byte is the status returned by the low-level driver.
The msg_byte is simply the message byte that comes back.

2.1.2 Element descriptions

 

2.1.2.1 Scsi_Host::next

 

All instances of Scsi_Host on the system are arranged in a linked list - the next pointer is used to traverse the list.
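As a rough illustration of that traversal, here is a minimal user-space sketch. The struct demo_host type and the demo_add_host() and demo_find_host() helpers are hypothetical stand-ins invented for this example; they are not the kernel's definitions, and adding at the head of the list is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for Scsi_Host: only the two fields
 * used below. The real structure has many more members. */
struct demo_host {
    struct demo_host *next;   /* next host on the list, NULL at the end */
    unsigned int host_no;     /* host number */
};

/* Push a host onto the front of the list (assumed ordering; the kernel
 * may well append instead). */
static void demo_add_host(struct demo_host **head, struct demo_host *h)
{
    assert(head != NULL && h != NULL);
    h->next = *head;
    *head = h;
}

/* Follow the next pointers until the host number matches.
 * Returns NULL if no host with that number is on the list. */
static struct demo_host *demo_find_host(struct demo_host *head,
                                        unsigned int host_no)
{
    struct demo_host *h;

    for (h = head; h != NULL; h = h->next)
        if (h->host_no == host_no)
            return h;
    return NULL;
}
```

With two hosts added, demo_find_host(list, 1) returns the host whose host_no is 1, and a lookup for an unknown number returns NULL.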

2.1.2.2 Scsi_Host::host_queue

The host_queue field is a linked list of all of the Scsi_Device objects (i.e. physical devices) associated with this instance of Scsi_Host. This field probably should not be used by low-level drivers - this field can be useful in debug dump situations, however.

2.1.2.3 Scsi_Host::pending_commands

The pending_commands field is a linked list of Scsi_Cmnd objects which were rejected by the low-level driver (host or device busy). Commands generally won't remain in this list very long. This field should not be used by low-level drivers.

2.1.2.4 Scsi_Host::ehandler

The ehandler field is a pointer to the kernel thread acting as the error handler for this host. This field should only be used by the SCSI mid-layer, and will be NULL in the event that the associated low-level driver still uses the old error handling code.

2.1.2.5 Scsi_Host::eh_wait

The eh_wait field is used internally by the error handler thread, and is a semaphore that is used to indicate that there is work (i.e. some command has failed).

2.1.2.6 Scsi_Host::eh_notify

The eh_notify field is used internally by the error handler thread, and is a semaphore that is used at the time of thread startup or shutdown for proper synchronization.

2.1.2.7 Scsi_Host::eh_action

The eh_action field is used internally by the error handler thread, and is a semaphore that is used during error recovery. The typical use is so that the error handler thread can wait until an interrupt indicates that some requested action (command abort, for example) is completed.

2.1.2.8 Scsi_Host::eh_active

The eh_active field is a flag that indicates the error handler thread is running. Should not be used by low-level drivers.

2.1.2.9 Scsi_Host::host_wait

The host_wait field is used when we need to block access to a device or host for some reason. There are typically two reasons why we would want to do this: one is that error recovery is in progress, and the other is the broken SCSI_BLOCK "feature". This field should never be used by low-level drivers.

2.1.2.10 Scsi_Host::hostt

The hostt field is a pointer to the instance of Scsi_Host_Template for this Scsi_Host. This field may be read by low-level drivers.

2.1.2.11 Scsi_Host::host_active

This is used by the broken SCSI_BLOCK feature to indicate the host in the block list that is currently active. This field should not be used by low-level drivers.

2.1.2.12 Scsi_Host::host_busy

This field indicates the number of Scsi_Cmnd structures that have been passed into the SCSI middle layer for processing. When a command gets passed back up to the upper layer, the count is decremented. This field should not be used by low-level drivers.

2.1.2.13 Scsi_Host::host_failed

This field is the number of busy commands which have failed. By failed we mean that they require the error handler thread to take some corrective action. This field should not be used by low-level drivers.

2.1.2.14 Scsi_Host::extra_bytes

This field indicates the number of bytes allocated in the hostdata field. This is used internally by the SCSI mid-layer only when allocating and deallocating Scsi_Host structures. The value is set from the second argument passed to scsi_register().

2.1.2.15 Scsi_Host::host_no

This is the host number for this host.

2.1.2.16 Scsi_Host::last_reset

This is the time of the last reset for this host. This field is only used by the old error handling code, and will go away eventually.

2.1.2.17 Scsi_Host::max_id

This is the maximum SCSI ID for devices attached to this bus. This will typically be either 8 or 16, where the default value is 8. This is mainly used when we scan the bus so that we can bound the search.

2.1.2.18 Scsi_Host::max_lun

This is the maximum SCSI LUN for devices attached to this bus. This will typically be 8, but hosts can adjust this as required. This is mainly used when we scan the bus so that we can bound the search.

2.1.2.19 Scsi_Host::max_channel

This is the maximum channel for devices attached to this bus. Hosts that only support one channel will leave this field set to the default value of 0. This is mainly used when we scan the bus so that we can bound the search.

2.1.2.20 Scsi_Host::block

 

This field is used by the broken SCSI_BLOCK feature - typically this will be a circularly linked list of hosts that desire the "blocked" feature.

The element points to a circularly linked list of Scsi_Host objects.

2.1.2.21 Scsi_Host::wish_block

This field is set by low-level drivers that perform ISA DMA. By setting this field to TRUE, the host is requesting that it be added to the block linked list. Part of the broken SCSI_BLOCK feature.

2.1.2.22 Scsi_Host::base

This field is set by low-level drivers, and is the base address for memory mapped I/O. This field is here as a convenience to low-level drivers that require this - it isn't used by the middle or upper layers.

2.1.2.23 Scsi_Host::io_port

This field is set by low-level drivers, and is the base io_port used by this host. This field is here as a convenience to low-level drivers that require this - it isn't used by the middle or upper layers, with one exception. At the time of module unload, and if the driver hasn't specified a release entrypoint, the middle layer will attempt to call release_region() if the io_port field is non-zero.

2.1.2.24 Scsi_Host::n_io_port

This field is set by low-level drivers, and is the number of I/O ports used by this host. This field is here as a convenience to low-level drivers that require this - it isn't used by the middle or upper layers, with the one exception noted above.

2.1.2.25 Scsi_Host::dma_channel

This field is set by low-level drivers, and is the dma channel number used by this host. This field is here as a convenience to low-level drivers that require this - it isn't used by the middle or upper layers, with one exception. At the time of module unload, and if the driver hasn't specified a release entrypoint, the middle layer will attempt to call free_dma() if the dma_channel field is non-zero.

2.1.2.26 Scsi_Host::irq

This field is set by low-level drivers, and is the IRQ number used by this host. This field is here as a convenience to low-level drivers that require this - it isn't used by the middle or upper layers, with one exception. At the time of module unload, and if the driver hasn't specified a release entrypoint, the middle layer will attempt to call free_irq() if the irq field is non-zero.
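The fallback cleanup described for io_port, dma_channel and irq can be sketched as below. This is a user-space model only: struct demo_host and struct demo_released are hypothetical, and instead of calling release_region(), free_dma() and free_irq() (which exist only inside the kernel) the function merely records which of them would have been called.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for Scsi_Host: resource fields only. */
struct demo_host {
    unsigned int io_port;       /* base I/O port, 0 if none */
    unsigned char n_io_port;    /* number of I/O ports */
    unsigned char dma_channel;  /* DMA channel, 0 if none (per the text) */
    unsigned int irq;           /* IRQ number, 0 if none */
};

/* Records which resources the fallback path would release. */
struct demo_released {
    int region;   /* non-zero: release_region() would be called */
    int dma;      /* non-zero: free_dma() would be called */
    int irq;      /* non-zero: free_irq() would be called */
};

/* Model of what the middle layer is described as doing at module unload
 * when the driver supplies no release entrypoint. */
static struct demo_released demo_default_release(const struct demo_host *h)
{
    struct demo_released r = { 0, 0, 0 };

    assert(h != NULL);
    if (h->io_port != 0)
        r.region = 1;
    if (h->dma_channel != 0)
        r.dma = 1;
    if (h->irq != 0)
        r.irq = 1;
    return r;
}
```

A driver that holds any resource not recorded in these fields still needs its own release entrypoint, since this path can only free what the structure describes.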

2.1.2.27 Scsi_Host::unique_id

The theory is that low-level drivers were supposed to set this field to be a unique ID of some sort - this could be used as required by user level programs that need to somehow identify the host. An example might be the case where users want to generate SVr4 style device names. Unfortunately very few drivers ever bother to set this to anything meaningful, and the user community has instead adopted the host_no field as the identifier for this instance of the host.

2.1.2.28 Scsi_Host::this_id

This is the SCSI ID of the host adapter itself. This value is initialized from the host template, but low-level drivers are free to modify it as required. This field is only used during the bus scan so that we don't try and send commands out on the bus for the host adapter itself.
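To show how max_channel, max_id, max_lun and this_id might bound a bus scan together, here is a small sketch. struct demo_host, demo_probe_fn and demo_scan_bus() are hypothetical names for this example. Treating max_channel as inclusive and max_id and max_lun as exclusive limits is an assumption based on the default values quoted above, not something taken from the kernel source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for Scsi_Host: scan bounds only. */
struct demo_host {
    unsigned int max_channel;  /* highest channel number, 0 for one channel */
    unsigned int max_id;       /* number of SCSI IDs, typically 8 or 16 */
    unsigned int max_lun;      /* number of LUNs, typically 8 */
    int this_id;               /* SCSI ID of the adapter itself */
};

/* Callback invoked for every address that should be probed. */
typedef void (*demo_probe_fn)(unsigned int channel, unsigned int id,
                              unsigned int lun, void *arg);

/* Visit every channel/id/lun within the host's bounds, skipping the
 * adapter's own ID. Returns the number of addresses visited. */
static unsigned int demo_scan_bus(const struct demo_host *h,
                                  demo_probe_fn probe, void *arg)
{
    unsigned int channel, id, lun, probed = 0;

    assert(h != NULL);
    for (channel = 0; channel <= h->max_channel; channel++) {
        for (id = 0; id < h->max_id; id++) {
            if (h->this_id >= 0 && id == (unsigned int)h->this_id)
                continue;   /* never address the adapter itself */
            for (lun = 0; lun < h->max_lun; lun++) {
                if (probe != NULL)
                    probe(channel, id, lun, arg);
                probed++;
            }
        }
    }
    return probed;
}
```

For a single-channel host with max_id and max_lun both 8 and this_id 7, the loop visits 7 IDs times 8 LUNs.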

2.1.2.29 Scsi_Host::can_queue

This is the maximum number of commands that this host can have active on the bus at any given time. This is initialized from the host template, and can be modified as required by the low-level driver.

2.1.2.30 Scsi_Host::cmd_per_lun

This is the maximum number of outstanding commands per lun that are allowed at one time. This field is initialized from the host template, and can be modified as required by the low-level driver. Hosts that don't support tagged queueing will always set this to 1.

2.1.2.31 Scsi_Host::sg_tablesize

For hosts that support scatter-gather I/O (nearly all do), this is the maximum size of the scatter-gather table that the host can deal with. Hosts that don't support scatter-gather will typically set this to 0. This field is initialized from the host template, but can be modified as required by the low-level driver.

2.1.2.32 Scsi_Host::in_recovery

In the event that the host uses the new error recovery code, this flag will be set when there are errors that require the error handler thread to take action of some sort.

2.1.2.33 Scsi_Host::unchecked_isa_dma

This flag is set if the host uses ISA DMA. Typically this is used to decide whether bounce buffers should be allocated for the I/O request. This field is initialized from the host template, and can be modified as required by the low-level driver.
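A minimal sketch of the bounce-buffer decision follows. The 16 MB limit reflects the 24 address bits available to ISA DMA; the text above does not state the limit, so treat it as background knowledge. struct demo_host, DEMO_ISA_DMA_LIMIT and demo_needs_bounce() are hypothetical names for this example.

```c
#include <assert.h>
#include <stddef.h>

/* ISA DMA can only address the first 16 MB of physical memory. */
#define DEMO_ISA_DMA_LIMIT 0x1000000UL

/* Hypothetical, cut-down stand-in for Scsi_Host. */
struct demo_host {
    unsigned unchecked_isa_dma : 1;   /* set if the host uses ISA DMA */
};

/* Returns 1 if a buffer at phys_addr of len bytes cannot be reached by
 * ISA DMA and so would need a bounce buffer, 0 otherwise. */
static int demo_needs_bounce(const struct demo_host *h,
                             unsigned long phys_addr, unsigned long len)
{
    assert(h != NULL);
    if (!h->unchecked_isa_dma)
        return 0;   /* host can reach any address */
    /* Written this way so that phys_addr + len cannot overflow. */
    return len > DEMO_ISA_DMA_LIMIT ||
           phys_addr > DEMO_ISA_DMA_LIMIT - len;
}
```

A buffer that ends exactly at the 16 MB boundary is still reachable, so it does not need bouncing.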

2.1.2.34 Scsi_Host::use_clustering

This flag indicates that we should attempt to cluster blocks in an I/O request into larger blocks. Typically this is a win in the event that the host has a limited scatter-gather table size. This is initialized from the host template, but can be modified as required by the low-level driver.

2.1.2.35 Scsi_Host::loaded_as_module

This flag is set if the host is present as a result of being loaded as a module. The flag is clear if the low-level driver was compiled into the kernel.

2.1.2.36 Scsi_Host::host_blocked

This flag is set if the host adapter rejected a command (host busy, device busy). The basic purpose is to make sure we don't try and send other commands to the host until the rejected commands have completed. See the pending_commands field for more information.

2.1.2.37 Scsi_Host::select_queue_depths

This is a function pointer into the low-level driver which is used to adjust the data structures for the low-level driver as required to prepare for tagged queueing. FIXME - this needs to be moved to the host template.

2.1.2.38 Scsi_Host::hostdata

This field is used so that hosts can define host-specific data. The extra_bytes field says how large this is. Typically low-level drivers will cast this to some internal structure and use it as appropriate.
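The relationship between extra_bytes and hostdata can be sketched as follows. This is a user-space model only: struct demo_host, struct demo_priv, demo_register() and demo_priv_of() are hypothetical, and demo_register() is only loosely modelled on what scsi_register() is described as doing with its second argument.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for Scsi_Host. */
struct demo_host {
    unsigned int extra_bytes;   /* size of the hostdata area in bytes */
    unsigned long hostdata[];   /* driver-private area, sized at allocation */
};

/* Driver-private state; the layout is entirely up to the driver. */
struct demo_priv {
    int irq_count;
    int board_rev;
};

/* Allocate a host with extra_bytes of zeroed private storage behind it.
 * Returns NULL if the allocation fails. */
static struct demo_host *demo_register(unsigned int extra_bytes)
{
    struct demo_host *h = calloc(1, sizeof(*h) + extra_bytes);

    if (h != NULL)
        h->extra_bytes = extra_bytes;
    return h;
}

/* Cast the hostdata area to the driver's own structure. */
static struct demo_priv *demo_priv_of(struct demo_host *h)
{
    assert(h != NULL && h->extra_bytes >= sizeof(struct demo_priv));
    return (struct demo_priv *)h->hostdata;
}
```

A driver would call demo_register(sizeof(struct demo_priv)) once per board and then reach its private state through demo_priv_of().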

2.2 Host template

 

The Scsi_Host_Template can be viewed like the base class from which Scsi_Host is derived. Given that the kernel is compiled with a C compiler, it also exists as a simple structure, and the hostt field of Scsi_Host points to the appropriate template.

The Scsi_Host_Template structure differs in one other way from the Scsi_Host structure in that there will always be at most one instance of the host template for a given driver. In other words, using the previous example, if you had two Adaptec 1542 cards in your machine, there would be two instances of the Scsi_Host structure, but there would be only one instance of the Scsi_Host_Template structure.
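The one-template, many-hosts relationship might look like this in miniature. struct demo_template, struct demo_host and demo_host_init() are hypothetical names; the sketch only shows defaults being copied from the shared template into each host, where a driver may then override them per board.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Scsi_Host_Template: per-driver defaults. */
struct demo_template {
    const char *name;
    int can_queue;
    int this_id;
    int sg_tablesize;
};

/* Hypothetical stand-in for Scsi_Host: one per adapter found. */
struct demo_host {
    struct demo_template *hostt;   /* shared template for this driver */
    int can_queue;
    int this_id;
    int sg_tablesize;
};

/* Initialize a host from its template. The copied values are only
 * defaults; the driver may change them on the host afterwards. */
static void demo_host_init(struct demo_host *h, struct demo_template *t)
{
    assert(h != NULL && t != NULL);
    h->hostt = t;
    h->can_queue = t->can_queue;
    h->this_id = t->this_id;
    h->sg_tablesize = t->sg_tablesize;
}
```

Two hosts initialized from the same template share one hostt pointer, and changing can_queue on one host leaves both the other host and the template untouched.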

2.2.1 Structure definition

 

typedef struct SHT
{

} Scsi_Host_Template;

2.2.2 Element descriptions

 

2.2.2.1 Scsi_Host_Template::next

This element is used for chaining the Scsi_Host_Template structures together into a linked list.

2.2.2.2 Scsi_Host_Template::module

This element is used when the associated low-level driver is loaded as a module - this will point to the variable "__this_module" in this case. Otherwise it is a NULL pointer. The main reason we need this is so that the __MOD_INC_USE_COUNT and __MOD_DEC_USE_COUNT macros can be used to help prevent the module being unloaded while file descriptors are open for a SCSI device.

2.2.2.3 Scsi_Host_Template::proc_dir

This entry is used to describe the directory entry underneath /proc/scsi that will be used for this host adapter. By holding a pointer to the structure here, it is possible for the SCSI core to sweep through all of the loaded drivers and build the actual /proc/scsi directory.

2.2.2.4 Scsi_Host_Template::proc_info

This entry is an entrypoint into the low-level driver which is used to generate the file contents for files underneath /proc/scsi/driver/hostno, where driver is the driver name (established through the proc_dir entry above), and hostno is the host number. Drivers are not required to supply anything here; however, if the default NULL value is used, there isn't any useful information available from the host entry in /proc/scsi.

2.2.2.5 Scsi_Host_Template::name

This is the name of the driver. This is for informational purposes only and is displayed at boot time. The name can also be obtained by querying a device with the IOCTL_PROBE_HOST ioctl. Also note that if the info field is non-NULL, the display name will instead be generated using that function.

In retrospect, it is silly to have two fields that provide the same information. The info method is more general, and thus at some point in the future this interface should go away.

2.2.2.6 Scsi_Host_Template::detect

This is the entrypoint that is called either at boot time or at the time of module load to probe for the presence of a board which the driver itself can drive.

2.2.2.7 Scsi_Host_Template::release

This is the entrypoint which is called at the time of module unload. The purpose is to allow the driver to release whatever resources it may have allocated.

It is possible, but not recommended, to omit this entrypoint - the Scsi_Host structure contains information about any DMA channels, I/O ports or IRQs that are used by the host in question, and if the release entrypoint is not specified, any resources specified in Scsi_Host will be released.

It is possible that at some point in the future, it will be mandatory for drivers to supply this entrypoint for them to be used in a modular fashion.

2.2.2.8 Scsi_Host_Template::info

This field is an entrypoint into the driver which represents an alternate method of determining the name of the scsi driver. The name field above also provides this capability. In the event that both are specified, the info method takes precedence.
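The precedence rule between name and info can be sketched like this. The struct layouts and the demo_display_name() and demo_info() functions are hypothetical and heavily simplified; only the "info wins when non-NULL" logic comes from the text.

```c
#include <assert.h>
#include <stddef.h>

struct demo_host;

/* Hypothetical stand-in for Scsi_Host_Template: naming fields only. */
struct demo_template {
    const char *name;                          /* static driver name */
    const char *(*info)(struct demo_host *);   /* optional, may be NULL */
};

/* Hypothetical stand-in for Scsi_Host. */
struct demo_host {
    struct demo_template *hostt;
};

/* Pick the display name: the info method takes precedence when it is
 * supplied, otherwise fall back to the static name field. */
static const char *demo_display_name(struct demo_host *h)
{
    assert(h != NULL && h->hostt != NULL);
    if (h->hostt->info != NULL)
        return h->hostt->info(h);
    return h->hostt->name;
}

/* Example info method a driver might supply. */
static const char *demo_info(struct demo_host *h)
{
    (void)h;   /* a real driver would describe this particular board */
    return "Demo adapter, rev 2";
}
```

Because info receives the host instance, it can report per-board details that a single static name cannot.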

2.2.2.9 Scsi_Host_Template::ioctl

This entrypoint is used for direct ioctl calls in the event that the ioctl code does not apply to the device.

2.2.2.10 Scsi_Host_Template::command

This entrypoint is used for sending commands to the low-level driver. This flavor of the interface is deprecated, and should rarely be used. The only case where this is the appropriate interface to use would be in the case of a driver for a card that has no interrupt capability, and where the driver must instead poll the card at regular intervals. There is only one such driver that I know of - a peculiar parallel port scsi adapter. If it weren't for this one oddball case, this entrypoint would probably be removed entirely. In all other cases, drivers should use the queuecommand interface instead.

2.2.2.11 Scsi_Host_Template::queuecommand

This is the normal entrypoint by which a low-level driver receives a command that should be queued to the specified device. The command is not guaranteed to be complete when the function returns - in fact, it would tend to be a rarity for the command to be complete when this interface returns.
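The asynchronous nature of this interface can be modelled in user space as below. Everything here is a hypothetical sketch: struct demo_cmnd, struct demo_adapter, the completion callback and demo_interrupt() (which stands in for the driver's interrupt handler) are invented for illustration and do not reproduce the kernel's types or exact signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Scsi_Cmnd. */
struct demo_cmnd {
    int result;                              /* filled in at completion */
    int completed;                           /* set by demo_done() */
    void (*scsi_done)(struct demo_cmnd *);   /* completion callback */
    struct demo_cmnd *next;                  /* link in the adapter queue */
};

/* Hypothetical adapter state: a FIFO of commands sent to the hardware. */
struct demo_adapter {
    struct demo_cmnd *head, *tail;
};

/* Queue the command and return at once; completion is reported later
 * through the done callback. Returns 0 when the command was accepted. */
static int demo_queuecommand(struct demo_adapter *a, struct demo_cmnd *cmd,
                             void (*done)(struct demo_cmnd *))
{
    assert(a != NULL && cmd != NULL && done != NULL);
    cmd->scsi_done = done;
    cmd->next = NULL;
    if (a->tail != NULL)
        a->tail->next = cmd;
    else
        a->head = cmd;
    a->tail = cmd;
    return 0;
}

/* Stand-in for the interrupt handler: finish the oldest queued command.
 * Returns 1 if a command was completed, 0 if the queue was empty. */
static int demo_interrupt(struct demo_adapter *a, int result)
{
    struct demo_cmnd *cmd;

    assert(a != NULL);
    cmd = a->head;
    if (cmd == NULL)
        return 0;
    a->head = cmd->next;
    if (a->head == NULL)
        a->tail = NULL;
    cmd->result = result;
    cmd->scsi_done(cmd);
    return 1;
}

/* Example completion callback. */
static void demo_done(struct demo_cmnd *cmd)
{
    cmd->completed = 1;
}
```

After demo_queuecommand() returns, the command is still incomplete; it only becomes complete once demo_interrupt() runs and invokes the callback.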

2.2.2.12 Scsi_Host_Template::eh_strategy_handler

This is the entrypoint that is used by the error handler thread to actually perform error recovery. If this is NULL, a default function is used instead - the default should be sufficient for most situations, but the capability is present for drivers to override it if they wish. This entrypoint only needs to be specified in the event that the new error handling is in use.

2.2.2.13 Scsi_Host_Template::eh_abort_handler

This is the entrypoint that is called by the default error handler function during error recovery. This entry point should attempt to abort the command, if possible. This entrypoint only has meaning if the new error handling is in use.

2.2.2.14 Scsi_Host_Template::eh_device_reset_handler

This is the entrypoint that is called by the default error handler function during error recovery. This entry point should attempt to perform a device reset, if possible. This entrypoint only has meaning if the new error handling is in use.

2.2.2.15 Scsi_Host_Template::eh_bus_reset_handler

This is the entrypoint that is called by the default error handler function during error recovery. This entry point should attempt to perform a bus reset, if possible. This entrypoint only has meaning if the new error handling is in use.

2.2.2.16 Scsi_Host_Template::eh_host_reset_handler

This is the entrypoint that is called by the default error handler function during error recovery. This entry point should attempt to reset the host adapter, if possible. This entrypoint only has meaning if the new error handling is in use.
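One plausible shape for a default recovery strategy is to try the four handlers from least to most drastic. The text does not spell out the escalation order, so that order is an assumption here, as are the DEMO_SUCCESS and DEMO_FAILED values and every type and function name in the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the values are made up for this sketch. */
enum { DEMO_FAILED = 0, DEMO_SUCCESS = 1 };

/* Hypothetical stand-in for Scsi_Cmnd. */
struct demo_cmnd {
    int id;
};

typedef int (*demo_eh_fn)(struct demo_cmnd *);

/* Hypothetical stand-in for Scsi_Host_Template: recovery hooks only. */
struct demo_template {
    demo_eh_fn eh_abort_handler;
    demo_eh_fn eh_device_reset_handler;
    demo_eh_fn eh_bus_reset_handler;
    demo_eh_fn eh_host_reset_handler;
};

/* Try each handler in turn, skipping any the driver left NULL.
 * Returns the step that succeeded (1 = abort ... 4 = host reset),
 * or 0 if nothing worked. */
static int demo_recover(const struct demo_template *t, struct demo_cmnd *cmd)
{
    demo_eh_fn steps[4];
    int i;

    assert(t != NULL && cmd != NULL);
    steps[0] = t->eh_abort_handler;
    steps[1] = t->eh_device_reset_handler;
    steps[2] = t->eh_bus_reset_handler;
    steps[3] = t->eh_host_reset_handler;
    for (i = 0; i < 4; i++)
        if (steps[i] != NULL && steps[i](cmd) == DEMO_SUCCESS)
            return i + 1;
    return 0;
}

/* Example handlers used to exercise the strategy. */
static int demo_always_fail(struct demo_cmnd *cmd)
{
    (void)cmd;
    return DEMO_FAILED;
}

static int demo_always_ok(struct demo_cmnd *cmd)
{
    (void)cmd;
    return DEMO_SUCCESS;
}
```

A driver that needs a different policy would supply its own eh_strategy_handler rather than rely on a default like this one.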

2.2.2.17 Scsi_Host_Template::abort

This is the entrypoint that is used by the old error recovery code in order to attempt a SCSI abort. The old error recovery code is obsolete and will be removed from the kernel once no drivers use it, and once this happens, this field will be removed from the structure.

2.2.2.18 Scsi_Host_Template::reset

This is the entrypoint that is used by the old error recovery code in order to attempt a SCSI reset. The old error recovery code is obsolete and will be removed from the kernel once no drivers use it, and once this happens, this field will be removed from the structure.

2.2.2.19 Scsi_Host_Template::slave_attach

This entrypoint is not used, and the field may be removed at some point in the future.

2.2.2.20 Scsi_Host_Template::bios_param

This entrypoint can be used to obtain the C/H/S for a disk drive. Typically this information is of interest to bootloaders (such as lilo), and perhaps also to programs which wish to modify the partition table. The C/H/S information that is returned by this function *MUST* be identical to what the BIOS on the board would return - this is the only way that a bootloader will be able to load the correct files.

2.2.2.21 Scsi_Host_Template::can_queue

This field specifies the maximum number of active SCSI commands which the driver is capable of handling at the same time. This represents a default value which can be overridden in the Scsi_Host structure.

2.2.2.22 Scsi_Host_Template::this_id

This field specifies the SCSI ID of the host adapter itself. This represents a default value which can be overridden in the Scsi_Host structure.

2.2.2.23 Scsi_Host_Template::sg_tablesize

This field specifies the maximum size of a scatter-gather list. Some host adapters have hardware limits to the size of the table - in this case the limit must be specified here.

In the event that there is no upper bound in the driver for the size of the scatter-gather table, then a value of SG_ALL can be specified.

Host adapters that don't support scatter-gather must specify 0. Please note that I/O performance for drivers that do not support scatter-gather is absolutely horrible.

2.2.2.24 Scsi_Host_Template::cmd_per_lun

This field specifies the maximum number of outstanding commands that can be specified per logical unit. Before the days of tagged queueing it didn't make a whole lot of sense to set this above 1. For host adapters that specify a select_queue_depths() function, the value specified here might not be used at all.

The place where this setting comes into play is where the Scsi_Cmnd structures are allocated. The cmd_per_lun field is used to ensure that a sufficient number of structures are allocated so that it is possible to queue this many commands.

2.2.2.25 Scsi_Host_Template::present

This field indicates how many of this type of card were found to be present on the machine.

2.2.2.26 Scsi_Host_Template::unchecked_isa_dma

Boolean value set to TRUE in the event that the card uses DMA over the ISA bus to perform I/O. When true, there are cases where we need to allocate bounce buffers prior to scheduling an I/O request.

2.2.2.27 Scsi_Host_Template::use_clustering

Boolean value set to TRUE in the event that it is a win to try and merge requests for adjacent blocks of memory into larger requests. This is usually a big win for host adapters with a limited maximum scatter-gather table size.
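The effect of clustering on the scatter-gather table can be illustrated with a small sketch that merges physically adjacent segments in place. struct demo_seg and demo_cluster() are hypothetical; the kernel's actual merging code works on its own request structures and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scatter-gather entry: one contiguous piece of memory. */
struct demo_seg {
    unsigned long addr;   /* start address of the piece */
    unsigned long len;    /* length in bytes */
};

/* Merge entries that directly follow each other in memory, in place.
 * Returns the number of entries left in sg after merging. */
static size_t demo_cluster(struct demo_seg *sg, size_t n)
{
    size_t in, out = 0;

    assert(sg != NULL || n == 0);
    if (n == 0)
        return 0;
    for (in = 1; in < n; in++) {
        if (sg[out].addr + sg[out].len == sg[in].addr)
            sg[out].len += sg[in].len;   /* adjacent: grow current entry */
        else
            sg[++out] = sg[in];          /* gap: start a new entry */
    }
    return out + 1;
}
```

Four 1 KB pieces forming two contiguous runs collapse into two 2 KB entries, which matters when the adapter can only accept a short table.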

2.2.2.28 Scsi_Host_Template::use_new_eh_code

Boolean value set to TRUE if this host adapter is prepared to use the new error handling code.

2.2.2.29 Scsi_Host_Template::emulated

Boolean value set to TRUE for emulated SCSI host adapters. ATAPI is the only driver which should set this field.

2.3 Device

The Scsi_Device structure is used to describe a single SCSI device. There will be a single instance of the structure allocated for each physical device that is detected.

Any device (such as a CD changer) that responds to multiple LUNs will be treated as separate instances of independent devices.

These structures are allocated at the time the bus is scanned.

2.3.1 Structure definition

 

typedef struct scsi_device
{

} Scsi_Device;

2.3.2 Element descriptions

2.3.2.1 Scsi_Device::next

Pointer used to maintain forward links in the doubly linked list of all devices attached to a given host adapter.

2.3.2.2 Scsi_Device::prev

Pointer used to maintain reverse links in the doubly linked list of all devices attached to a given host adapter.

2.3.2.3 Scsi_Device::device_wait

This field is used in the event that we need to wait for a free Scsi_Cmnd structure for this device. This should be changed to a semaphore.

2.3.2.4 Scsi_Device::host

Pointer to the instance of Scsi_Host to which this device is attached.

2.3.2.5 Scsi_Device::device_busy

Usage counter representing the number of commands that have been received by the middle layer and for which the upper layer is waiting for a response.

2.3.2.6 Scsi_Device::scsi_request_fn

This entrypoint is a hack, pure and simple. The basic problem is that an ioctl may have blocked normal disk I/O, and once the ioctl is done, we need to jumpstart the normal disk queue to ensure that the request is queued immediately. This will undoubtedly go away in the Linux 2.3 timeframe.

2.3.2.7 Scsi_Device::device_queue

This field points to the head of a linked list of Scsi_Cmnd objects that can be used for this device. All command structures (either in-use or idle) are always in the list.

2.3.2.8 Scsi_Device::id

The SCSI ID for this device.

2.3.2.9 Scsi_Device::lun

The SCSI logical unit number for this device.

2.3.2.10 Scsi_Device::channel

The SCSI channel number for this device. This will always be 0 for devices attached to host adapters that only have a single channel.

2.3.2.11 Scsi_Device::manufacturer

This field is unused.

2.3.2.12 Scsi_Device::attached

The number of upper level drivers that have attached to this device. Normally at most 2 (generics, plus one other).

2.3.2.13 Scsi_Device::access_count

Number of open file descriptors for this device. This is mainly used so that we know when to lock and unlock the door for removable media.

2.3.2.14 Scsi_Device::hostdata

This is a void * that can be used by the low-level driver to store additional state related to the device. I cannot tell at the moment whether any low-level drivers actually use this field or not.

2.3.2.15 Scsi_Device::type

The type of the device (i.e. TYPE_DISK, TYPE_SCANNER, etc). See table 7-17 in the SCSI standards for the values that this can take.

2.3.2.16 Scsi_Device::scsi_level

Set based upon INQUIRY data to be SCSI_1, SCSI_1_CCS, SCSI_2, or SCSI_UNKNOWN.

2.3.2.17 Scsi_Device::vendor

The manufacturer ID, as returned from the INQUIRY command.

2.3.2.18 Scsi_Device::model

The model number, as returned from the INQUIRY command.

2.3.2.19 Scsi_Device::rev

The firmware revision level, as returned from the INQUIRY command.

2.3.2.20 Scsi_Device::current_tag

Not 100% sure on this one. I believe the purpose of this is to track how many tagged commands have been issued without an ORDERED_QUEUE_TAG (basically to flush the queue of any pending commands).

2.3.2.21 Scsi_Device::sync_min_period

Not 100% sure on this one. I believe that this is used to keep track of synchronous I/O characteristics of this device.

2.3.2.22 Scsi_Device::sync_max_offset

Not 100% sure on this one. I believe that this is used to keep track of synchronous I/O characteristics of this device.

2.3.2.23 Scsi_Device::queue_depth

The maximum tagged queue depth for this device. The default value is the maximum number of commands per lun for the host itself, and can be overridden by the select_queue_depths() function associated with the host.

2.3.2.24 Scsi_Device::online

Boolean value indicating the device is online. Devices can be taken offline during error recovery in the event that the device is completely toasted.

2.3.2.25 Scsi_Device::writeable

Boolean value that indicates whether the device is writable. Initialized from the INQUIRY data.

2.3.2.26 Scsi_Device::removable

Boolean value that indicates whether the device is removable. Initialized from the INQUIRY data.

2.3.2.27 Scsi_Device::random

Boolean value that indicates whether the device is random-access. Initialized from the INQUIRY data.

2.3.2.28 Scsi_Device::has_cmdblocks

Boolean flag that indicates whether we have allocated Scsi_Cmnd structures for this device.

2.3.2.29 Scsi_Device::changed

Boolean flag that indicates that a media change has taken place. This may cause buffers to be flushed, to remove bogus cached data.

2.3.2.30 Scsi_Device::busy

Boolean value to indicate that the device is busy with something or another (usually rereading partition table). This really needs to be changed to a spinlock instead.

2.3.2.31 Scsi_Device::lockable

Boolean that indicates that it is possible to lock the door of a device that has removable media.

2.3.2.32 Scsi_Device::borken

Boolean flag that indicates we shouldn't attempt synchronous transfers.

2.3.2.33 Scsi_Device::tagged_supported

Boolean that indicates that tagged queuing is supported for this device.

2.3.2.34 Scsi_Device::tagged_queue

Boolean that indicates that tagged queuing is enabled for this device. Can be turned on and off with ioctls.

2.3.2.35 Scsi_Device::disconnect

This device is capable of disconnect. Any device that doesn't answer this one yes should be treated carefully.

2.3.2.36 Scsi_Device::soft_reset

Boolean value which indicates that this device uses the soft reset option.

2.3.2.37 Scsi_Device::sync

Cannot tell for sure. This might be used to indicate that we can do sync transfers, or that we have already checked to see if the device supports sync transfers. Only a small handful of drivers use this.

2.3.2.38 Scsi_Device::wide

Cannot tell for sure. This might be used to indicate that we can do wide transfers, or that we have already checked to see if the device supports wide transfers. Only a small handful of drivers use this.

2.3.2.39 Scsi_Device::single_lun

Another hack. If this boolean value is set, we only allow a single outstanding command to this SCSI Id, and block commands to other luns until the one that is outstanding is complete. Mainly of use for things like changers, where you would thrash the thing by trying to access devices on different luns at the same time.
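The gating that single_lun implies might be sketched as follows. struct demo_device and demo_may_issue() are hypothetical, the device array stands in for the per-host device list, and using device_busy as the count of outstanding commands is an assumption made for this example. Host and channel are ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for Scsi_Device. */
struct demo_device {
    unsigned int id;            /* SCSI ID */
    unsigned int lun;           /* logical unit number */
    unsigned int device_busy;   /* commands currently outstanding */
    unsigned single_lun : 1;    /* only one command per SCSI ID at a time */
};

/* Returns 1 if a new command may be sent to target, 0 if it must wait.
 * For single_lun devices, any outstanding command on any LUN of the
 * same SCSI ID blocks the new one. */
static int demo_may_issue(const struct demo_device *devs, size_t n,
                          const struct demo_device *target)
{
    size_t i;

    assert(devs != NULL && target != NULL);
    if (!target->single_lun)
        return 1;
    for (i = 0; i < n; i++)
        if (devs[i].id == target->id && devs[i].device_busy > 0)
            return 0;
    return 1;
}
```

For a two-LUN changer at one SCSI ID, a command in flight on LUN 0 holds back a command for LUN 1 until it completes.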

2.3.2.40 Scsi_Device::was_reset

Indicates that the device was reset. For removable devices, it means we need to relock the door, as it will no longer be locked.

2.3.2.41 Scsi_Device::expecting_cc_ua

Indicates that we are expecting a CHECK_CONDITION/UNIT_ATTENTION as a result of a reset.

2.3.2.42 Scsi_Device::device_blocked

Boolean flag that is part of the SCSI_BLOCK hack. As far as I can tell, this is never set to TRUE, and as a result it can be removed.

2.4 Device template

The Scsi_Device_Template is used to describe a single type of SCSI device. In the current implementation there will be an absolute maximum of 4 instances of this structure on any given system. One for disk, one for CDROM, one for tapes, and one for the generics pass-through. If other driver types are added (e.g. scanner), a new device template would be created.

There is no direct mapping between a Scsi_Device_Template and a Scsi_Device. The reason for this is that a single physical device can be attached to more than one device template (usually the generics pass-through, plus one more).

Many of the elements of this structure are required abstractions to make modules possible. The fundamental issue is that adding or removing a low-level driver will add/remove new devices to the kernel which in turn must be driven by the upper level drivers. We needed an abstraction of the upper level drivers to make it possible for module load/unload of low-level drivers to do the proper attach/detach.

2.4.1 Structure definition

struct Scsi_Device_Template
{

2.4.2 Element descriptions

 

2.4.2.1 Scsi_Device_Template::next

Used for linking together all Scsi_Device_Template instances into a linked list.

2.4.2.2 Scsi_Device_Template::name

ASCII Description of the type of device. For example, "disk", "tape".

2.4.2.3 Scsi_Device_Template::tag

Canonical ASCII name of the device. For example, "sd", or "st".

2.4.2.4 Scsi_Device_Template::module

This element is used when the associated upper-level driver is loaded as a module - this will point to the variable "__this_module" in this case. Otherwise it is a NULL pointer. The main reason we need this is so that the __MOD_INC_USE_COUNT and __MOD_DEC_USE_COUNT macros can be used to help prevent the module being unloaded while file descriptors are open for a SCSI device.

2.4.2.5 Scsi_Device_Template::scsi_type

The SCSI type for devices that this driver will handle. For example, TYPE_DISK, TYPE_TAPE. Note - as far as I can tell this isn't used for anything. It would probably be wise to remove this field at some later date.

2.4.2.6 Scsi_Device_Template::major

The major number which this driver will be responsible for.

2.4.2.7 Scsi_Device_Template::nr_dev

The number of physical devices that are attached to the driver.

2.4.2.8 Scsi_Device_Template::dev_noticed

During the bus scan phase, all detected devices are passed to the detect interface, which decides whether the driver is capable of handling the device in question. The dev_noticed field is a counter that tracks the total number of devices which the driver is in fact capable of driving. This does not physically attach the device to the driver - it merely bumps the counter.

2.4.2.9 Scsi_Device_Template::dev_max

This field indicates the absolute maximum number of devices that this driver is capable of driving. As a practical matter, this indicates the allocation size of some of the internal data structures.

2.4.2.10 Scsi_Device_Template::blk

Boolean flag which indicates whether this driver is for a block device or for a character device.

2.4.2.11 Scsi_Device_Template::detect

This interface is used at the time of bus scan. A Scsi_Device structure is passed in, and this function should decide whether the driver is capable of driving the device or not. Generally, this consists of checking the device type. If the driver is capable of driving the device, then the dev_noticed counter should be incremented, and the function should return 1. Otherwise the function should return 0.

This function will only be called at either boot time, or when SCSI modules are being loaded.
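
As a rough illustration of the detect contract described above, here is a minimal sketch for a disk-type driver. The names toy_device, toy_template, TOY_TYPE_DISK and toy_disk_detect are hypothetical stand-ins and are not the kernel's own definitions.

```c
#include <assert.h>

/* Hypothetical stand-ins for the device type codes. */
enum { TOY_TYPE_DISK = 0, TOY_TYPE_TAPE = 1 };

/* Hypothetical stand-in for Scsi_Device. */
struct toy_device {
    int type;           /* device type found during the bus scan */
};

/* Hypothetical stand-in for Scsi_Device_Template. */
struct toy_template {
    int dev_noticed;    /* number of devices this driver could drive */
};

/*
 * Return 1 and bump dev_noticed if this (disk) driver can handle the
 * device, otherwise return 0. Nothing is attached at this stage.
 */
static int toy_disk_detect(struct toy_template *tpnt,
                           const struct toy_device *sdp)
{
    if (sdp->type != TOY_TYPE_DISK)
        return 0;
    tpnt->dev_noticed++;
    return 1;
}
```

Calling toy_disk_detect once per scanned device leaves dev_noticed holding the count that the init entrypoint later relies on.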

2.4.2.12 Scsi_Device_Template::init

This function is called once the bus scan is complete, and we have knowledge of the total number of devices that the driver will need to drive in the dev_noticed field. Typically this function will allocate the internal structures, and register the block or character device major number, if required. Note that if none of the physical devices can be driven by the driver, this interface can simply return 0 and not do anything else. Otherwise it should return 1.

Most of the internal data structures are tables that are indexed into by the minor number to obtain the size of the device, blocksize, a pointer to the Scsi_Device structure, and other device specific bits of information.

This function will only be called at either boot time, or when SCSI modules are being loaded.
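
The allocation step can be sketched as follows, assuming a per-driver table with one slot per noticed device. The toy_ names are hypothetical, the return values of 0 and 1 follow the text above, and the -1 returned on allocation failure is my own assumption rather than anything the kernel defines.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-device slot kept by an upper-level driver. */
struct toy_slot {
    void *device;             /* would point at the Scsi_Device */
    unsigned long capacity;   /* device size */
    unsigned int blocksize;
};

/* Hypothetical stand-in for Scsi_Device_Template plus its table. */
struct toy_template {
    int dev_noticed;          /* filled in during the bus scan */
    int dev_max;              /* allocation size of the table */
    struct toy_slot *table;   /* indexed by a minor-derived slot number */
};

/*
 * Return 0 if there is nothing to drive, 1 once the table exists,
 * and -1 (an assumption of this sketch) if the allocation fails.
 */
static int toy_init(struct toy_template *tpnt)
{
    if (tpnt->dev_noticed == 0)
        return 0;
    tpnt->table = calloc((size_t)tpnt->dev_noticed, sizeof(*tpnt->table));
    if (tpnt->table == NULL)
        return -1;
    tpnt->dev_max = tpnt->dev_noticed;
    return 1;
}
```

The slots start out zeroed; in this model it would be the attach and finish steps that fill in the device pointer and the capacity.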

2.4.2.13 Scsi_Device_Template::finish

This interface is called once all of the devices are fully attached. This has the job of initializing the internal data structures that were allocated by the init entrypoint. It also has the job of checking media capacity, spinning up disks, preparing datastructures related to partition tables, etc.

2.4.2.14 Scsi_Device_Template::attach

After the bus scan, and after the init entry point is called, a second scan over all devices is made. The general purpose is to insert a pointer to the Scsi_Device structure into the internal data structures used by the upper level driver.

2.4.2.15 Scsi_Device_Template::detach

This interface is used in the event that a low-level driver module is being unloaded, and we need to disconnect the physical devices from the upper level drivers.

2.5 Command

 

The Scsi_Cmnd structure is probably the most commonly used structure throughout the SCSI subsystem. The structure is used to represent a single command which is in some stage of being queued or processed by some driver or perhaps even the physical device. All of the context associated with the actual running command is stored in this structure such that when the command actually finishes, all we need is the pointer to this structure to initiate post-processing. The closest analogy is sort of like a thread, I guess, but the analogy doesn't really work all that well.

There are a number of fields in the structure that are present purely as a convenience for low-level drivers. This should be recoded at some point, and use an abstract hostdata element at the end to contain the host-specific data.

These structures are all allocated at the time of bus scan (or shortly thereafter). For each device, we look at the queue depth for the device (or the maximum number of commands per lun) and allocate up to this number.

At runtime allocation and deallocation of these structures doesn't involve any communication with the memory manager. We simply look through the list and find a command block that is not in use and return it. When we free a block, we merely set the flag saying the command block is not in use.
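
A minimal sketch of that allocation scheme is shown below. It uses a fixed pool per device and an in_use flag; the toy_ names and the queue depth of 4 are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_QUEUE_DEPTH 4   /* hypothetical queue depth for the device */

/* Hypothetical stand-in for Scsi_Cmnd. */
struct toy_cmnd {
    int in_use;
    struct toy_cmnd *next;
};

/* Hypothetical stand-in for Scsi_Device with its command blocks. */
struct toy_device {
    struct toy_cmnd pool[TOY_QUEUE_DEPTH];
    struct toy_cmnd *list;   /* head of the linked list of blocks */
};

/* Done once, around bus scan time: link all the blocks together. */
static void toy_build_pool(struct toy_device *sdp)
{
    int i;

    sdp->list = NULL;
    for (i = TOY_QUEUE_DEPTH - 1; i >= 0; i--) {
        sdp->pool[i].in_use = 0;
        sdp->pool[i].next = sdp->list;
        sdp->list = &sdp->pool[i];
    }
}

/* Walk the list and hand out the first block that is not in use. */
static struct toy_cmnd *toy_alloc_cmnd(struct toy_device *sdp)
{
    struct toy_cmnd *c;

    for (c = sdp->list; c != NULL; c = c->next) {
        if (!c->in_use) {
            c->in_use = 1;
            return c;
        }
    }
    return NULL;   /* every block is busy */
}

/* Freeing is nothing more than clearing the flag. */
static void toy_free_cmnd(struct toy_cmnd *c)
{
    c->in_use = 0;
}
```

Neither toy_alloc_cmnd nor toy_free_cmnd touches the memory manager, which is the point of the scheme.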

2.5.1 Structure definition

typedef struct scsi_cmnd
{

2.5.2 Element descriptions

2.5.2.1 Scsi_Cmnd::host

Pointer to the Scsi_Host associated with the device.

2.5.2.2 Scsi_Cmnd::state

Indicates the current status of this command block. It can take one of these values:

2.5.2.3 Scsi_Cmnd::owner

Indicates what part of the SCSI subsystem currently owns the command block. It can take one of the following values:

2.5.2.4 Scsi_Cmnd::device

Pointer to the Scsi_Device object associated with this command block.

2.5.2.5 Scsi_Cmnd::next

Used to link Scsi_Cmnd structures together into a linked list.

2.5.2.6 Scsi_Cmnd::reset_chain

This field is used only by the BusLogic driver. Ask Leonard what it is used for. This should be abstracted into a hostdata field.

2.5.2.7 Scsi_Cmnd::eh_state

This field is used by the error handler to record the internal state. It can currently take one of three values:

This field should not be used by low-level drivers, although I notice that some low-level drivers are attempting to use this. If a low-level driver needs to know why a command has failed, the state field should probably be used instead. It depends a bit upon what information the driver really needs.

2.5.2.8 Scsi_Cmnd::done

The done field is a pointer to the upper-level completion function for the command. Low-level drivers should not attempt to use this field.

2.5.2.9 Scsi_Cmnd::serial_number

A sequence number for the command. Each new command that comes through will have a unique sequence number. Note - the pid field serves an identical function, and one of the two (probably the pid) should be eliminated and merged with the serial number.

2.5.2.10 Scsi_Cmnd::serial_number_at_timeout

When a command times out, the serial number is saved in this field. This can be of use if the command might have later completed before error recovery had started, however in the new error handling code this should not be a problem.

2.5.2.11 Scsi_Cmnd::retries

The number of times we have retried the command. Only used during error recovery.

2.5.2.12 Scsi_Cmnd::allowed

The maximum number of retries allowed during error recovery.

2.5.2.13 Scsi_Cmnd::timeout_per_command

The amount of time (in jiffies) that we will wait before we decide that a command has timed out.

2.5.2.14 Scsi_Cmnd::timeout_total

This field is unused and should be removed from the structure.

2.5.2.15 Scsi_Cmnd::timeout

This field is unused and should be removed from the structure.

2.5.2.16 Scsi_Cmnd::internal_timeout

This field is used by the old error handling code to keep track of the current state of the command. Once the old error handling code is gone, this field can be removed.

2.5.2.17 Scsi_Cmnd::bh_next

This field is used for the linked list of commands that are in the queue for the bottom half handler.

2.5.2.18 Scsi_Cmnd::target

The SCSI ID of the device. This is somewhat redundant, in that we can also get the same information from device->id.

2.5.2.19 Scsi_Cmnd::lun

The SCSI logical unit number of the device. This is somewhat redundant, in that we can also get the same information from device->lun.

2.5.2.20 Scsi_Cmnd::channel

The channel number for the device. This is somewhat redundant, in that we can also get the same information from device->channel.

2.5.2.21 Scsi_Cmnd::cmd_len

The length (number of bytes) of the SCSI command. Typically either 6, 10, or 12.

2.5.2.22 Scsi_Cmnd::old_cmd_len

Saved copy of the command length. Used during error recovery, as we sometimes need to issue other commands (REQUEST_SENSE, or TEST_UNIT_READY) before we finally retry the command itself.

2.5.2.23 Scsi_Cmnd::cmnd

The SCSI command that needs to be sent to the device. This is normally initialized by the upper layer, however error recovery sometimes modifies the field. As far as low-level drivers are concerned, this is the command that the driver needs to try and run.

2.5.2.24 Scsi_Cmnd::request_bufflen

This is the saved buffer length associated with the request. This is a copy of the bufflen field, and is needed because we may need to restore the original settings during error recovery.

2.5.2.25 Scsi_Cmnd::eh_timeout

Timer structure used by error recovery. Should not be manipulated by low-level drivers.

2.5.2.26 Scsi_Cmnd::request_buffer

This is the saved buffer pointer associated with the request. This will either point to the actual buffer, or in the event that scatter-gather is in use it will point to the scatter-gather table. This is a copy of the buffer field, and is needed because we may need to restore the original settings during error recovery.

2.5.2.27 Scsi_Cmnd::data_cmnd

This is a saved copy of the cmnd array, and is needed because we may need to restore the original settings during error recovery.

2.5.2.28 Scsi_Cmnd::old_use_sg

This is a saved copy of use_sg, and is needed because we may need to restore the original settings during error recovery.

2.5.2.29 Scsi_Cmnd::use_sg

This field indicates whether the command should use scatter-gather or not. If 0, it means that the buffer field indicates the actual I/O buffer, if non-zero, the buffer field indicates the address of the scatter-gather table, and the use_sg field will indicate the number of entries in the scatter-gather table.
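
To make the two interpretations of the buffer field concrete, here is a small sketch that computes the total transfer size either way. The toy_sg_entry layout and the toy_ names are hypothetical; only the use_sg, buffer and bufflen semantics come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scatter-gather table entry. */
struct toy_sg_entry {
    void *address;
    unsigned int length;
};

/* Hypothetical subset of Scsi_Cmnd. */
struct toy_cmnd {
    unsigned short use_sg;   /* 0, or the number of table entries */
    void *buffer;            /* I/O buffer, or the table if use_sg != 0 */
    unsigned int bufflen;    /* byte count when use_sg == 0 */
};

/* Total number of bytes the command describes. */
static unsigned int toy_total_bytes(const struct toy_cmnd *cmd)
{
    const struct toy_sg_entry *sg;
    unsigned int total = 0;
    unsigned short i;

    if (cmd->use_sg == 0)
        return cmd->bufflen;            /* buffer is the data itself */

    sg = (const struct toy_sg_entry *)cmd->buffer;
    for (i = 0; i < cmd->use_sg; i++)   /* buffer is the table */
        total += sg[i].length;
    return total;
}
```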

2.5.2.30 Scsi_Cmnd::sglist_len

Amount of memory allocated for the scatter-gather list. Needed so that we know how much memory to free, when it comes time to release it.

2.5.2.31 Scsi_Cmnd::abort_reason

Used by the old error handling code - indicates the reason that we are trying to abort a command. Either DID_TIME_OUT, or DID_RESET.

2.5.2.32 Scsi_Cmnd::bufflen

Indicates the number of data bytes to be transferred by the I/O request.

2.5.2.33 Scsi_Cmnd::buffer

Indicates the buffer to be used for the data transfer. In the event that scatter-gather is in use, this field indicates the address of the scatter-gather table.

2.5.2.34 Scsi_Cmnd::underflow

Indicates the minimum number of bytes for data transfer - useful for underflow detection. Very few drivers actually use this. It is considered an error if fewer bytes than this are transferred. Set by the upper-level.

2.5.2.35 Scsi_Cmnd::transfersize

Amount of data that we are guaranteed to transfer with each SCSI transfer (between disconnect/reconnects). Usually the sector size. Generally only really dumb drivers need to worry about this.

2.5.2.36 Scsi_Cmnd::request

A copy of the request structure that this command is processing. This is useful during final post-processing when we need to mark buffers uptodate.

2.5.2.37 Scsi_Cmnd::sense_buffer

In the event of errors, this buffer contains the sense data.

2.5.2.38 Scsi_Cmnd::flags

This is only used by the old error handling code, and indicates additional information about the command. Allowable values are:

2.5.2.39 Scsi_Cmnd::host_wait

For commands in the Scsi_Host pending command queue, this bit indicates if the command is in the queue because the host or driver did not have sufficient resources to queue the command right away.

2.5.2.40 Scsi_Cmnd::device_wait

For commands in the Scsi_Host pending command queue, this bit indicates if the command is in the queue because the device did not have sufficient resources to queue the command right away. Usually because of a QUEUE_FULL message.

2.5.2.41 Scsi_Cmnd::this_count

In theory this field should be the number of sectors being transferred. In practice it isn't used, and thus this field should be removed to simplify the code. It was supposed to be the number of 512 byte sectors. For larger sector sizes, it is just a multiplier.

For devices with smaller sector sizes (256 bytes), it currently appears as if it will be impossible to request odd numbers of sectors.

2.5.2.42 Scsi_Cmnd::scsi_done

This field holds the completion routine in the SCSI middle layer. In all cases, the function scsi_done() in the middle layer is this function, and thus it seems a little silly to obscure things in this way. This field shouldn't be removed until the rewrite of the upper layer is complete for Linux 2.3.

2.5.2.43 Scsi_Cmnd::SCp

This field is a pointer to a Scsi_Pointer structure - this is used as a scratchpad for hosts that need to walk the scatter-gather list themselves.

2.5.2.44 Scsi_Cmnd::host_scribble

This field is used as a pointer to a scratchpad area that low-level drivers can use as they wish.

2.5.2.45 Scsi_Cmnd::result

Low-level drivers will fill in this field with the status of the command prior to calling the mid-level completion routine. The status consists of the status_byte, the msg_byte, the host_byte, and the driver_byte.
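
The sketch below shows one way the four bytes could be packed into and pulled out of a single result word. The byte positions (status lowest, then message, host, and driver highest) are an assumption of this sketch, and the toy_ helpers are hypothetical; consult the kernel headers for the authoritative macros.

```c
#include <assert.h>

/* Assumed packing, lowest byte first: status, message, host, driver. */
static unsigned int toy_make_result(unsigned int driver, unsigned int host,
                                    unsigned int msg, unsigned int status)
{
    return ((driver & 0xffu) << 24) | ((host & 0xffu) << 16) |
           ((msg & 0xffu) << 8) | (status & 0xffu);
}

/* Accessors that undo the packing above. */
static unsigned int toy_status_byte(unsigned int result)
{
    return result & 0xffu;
}

static unsigned int toy_msg_byte(unsigned int result)
{
    return (result >> 8) & 0xffu;
}

static unsigned int toy_host_byte(unsigned int result)
{
    return (result >> 16) & 0xffu;
}

static unsigned int toy_driver_byte(unsigned int result)
{
    return (result >> 24) & 0xffu;
}
```

A low-level driver would build the word just before calling the completion routine, and the mid-level would take it apart again to decide what to do next.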

2.5.2.46 Scsi_Cmnd::tag

Not 100% sure on this one. I believe the purpose of this is to track how many tagged commands have been issued without an ORDERED_QUEUE_TAG (basically to flush the queue of any pending commands).

2.5.2.47 Scsi_Cmnd::pid

A sort of sequence number for the command. Each new command that comes through will have a unique pid. Note - the serial_number field serves an identical function, and one of the two (probably the pid) should be eliminated and merged with the serial number.

2.6 Pointer

2.6.1 Structure definition

 

typedef struct scsi_pointer
{

2.6.2 Element descriptions

2.6.2.1 Scsi_Pointer::ptr

2.6.2.2 Scsi_Pointer::this_residual

2.6.2.3 Scsi_Pointer::buffer

2.6.2.4 Scsi_Pointer::buffers_residual

2.6.2.5 Scsi_Pointer::Status

2.6.2.6 Scsi_Pointer::Message

2.6.2.7 Scsi_Pointer::have_data_in

2.6.2.8 Scsi_Pointer::sent_command

2.6.2.9 Scsi_Pointer::phase

 

3. Upper layer

 

3.0 Introduction

The upper layer of the SCSI subsystem has the job of taking requests that come from outside of the SCSI subsystem, and turning them into actual SCSI requests. The requests are in turn passed down to the middle layer. Once the command processing is complete, the upper layer receives the status from the middle layer, and in turn the upper layer will notify the external layer of the status.

Requests originate from 3 different sources. For block devices, requests originate from the ll_rw_blk layer. For character devices, the requests effectively originate directly from the filesystem code as users attempt to operate on the devices. Finally, the third source of requests is via ioctl - to a large degree this is similar to how character device requests are originated, however ioctls can also be issued to block devices.

There is near universal agreement that the current state of the command queuing is inadequate, and it needs to be completely redone. In addition, changes will be needed in the ll_rw_blk layer. The design is not yet complete for the rewrite, but some of the ideas will be discussed in the Future directions section. No matter what happens, the current state of the upper layer is essentially valid for all of the 2.0-2.2 series kernels.

While the specific details vary a little bit, the basic tasks of the upper layer are:

Translate incoming requests into SCSI commands (e.g. READ_10).
Create scatter-gather lists for the request.
If the low-level host requires bounce buffers, these must be allocated.
Track usage counts as file descriptors are opened and closed.
Maintain externally visible arrays for device size and block size.
Finally, there is some amount of common glue that is required to make it possible for an upper level driver to be a module.

Properties of the ll_rw_blk layer (discussed in section 3.2.1):

It merges requests for adjacent disk blocks into larger blocks.
It also uses an elevator sorting algorithm to try and arrange in the order of sector number. There is an assumption here that seek times scale with the number of tracks that a disk must move the heads, so the idea is that we try and minimize seek overhead by arranging requests in the order of sector number.
There is one big assumption in the ll_rw_blk layer - it basically assumes that when merging two smaller requests into one larger request that it should never let a request grow to be more than about 200 sectors.
The ll_rw_blk layer assumes that the request at the head of the list is in the process of being handled by the driver itself, so that it will never attempt to merge a new I/O request into the request at the head of the list.
The ll_rw_blk layer assumes that any request past the head of the list is inactive (in that only one request can be queued at one time).
The ll_rw_blk layer attempts to queue a command after each request is inserted into the queue, despite the fact that it is known that more requests (which are probably adjacent) still have not been merged into the request lists.

Hacks made to work around these properties (section 3.2.1):

To begin with, a "plug" hack was added. This was an attempt to solve the last point above, by preventing any requests from being passed down to the upper SCSI layer until all of the incoming blocks have been merged into existing requests, or added to requests of their own.
Secondly, a hack was added to the ll_rw_blk layer whereby new requests can be merged into the request at the head of the list for some major numbers (e.g. SCSI), and not others (e.g. floppy).

Operations that an ioctl might be requesting (section 3.4):

A device-specific operation of some sort.
A device-independent type of operation (that wouldn't depend upon the device being a disk, tape, etc).
A special operation that requests information about the host itself.

Ideas for the rewrite (section 3.9):

To begin with, the ll_rw_blk layer needs to be redone. The final design will have some of the following properties:
There will be multiple queues per major number. An API function (one per major) will be used which can quickly look up a pointer to the appropriate queue.
Each queue will have its own function for queue insertion, which will decide if a new request can be merged into any existing requests. The goal is to prevent the need for splitting requests later on - each request in the queue should be small enough that it can be queued in one shot.
Each queue will have its own request_fn function, instead of there being one per major number.
On the SCSI end of things, each Scsi_Host object will have its own queue, queue insertion routine, and request_fn.
Default values for the request_fn will be supplied - the default will be chosen by examining the settings in the Scsi_Host and Scsi_Host_Template objects (such as the scatter-gather size, whether the host does ISA DMA, etc). By supplying a handful of request_fns that can be used, I believe it will be possible to continue to use all of the existing drivers with little or no changes. Note that driver authors that wish to will be able to write custom versions of the request_fn for the queue.
Character device and ioctl requests will still be generated in a way similar to the way they are done now - the requests will still be passed to scsi_do_cmd or something like it. The difference is that scsi_do_cmd will turn it into a request that can be inserted into the regular queue, and the request can be started via request_fn.
Much of the guts of the block device queuing and the middle layer will be turned around so that the request_fn can act as the driver, and the middle layer will act as passenger.

3.1 Command queuing

Queuing a command is a relatively simple operation. We get a pointer to a Scsi_Cmnd object, fill in a command block with the SCSI command that we want to perform, and then call scsi_do_cmd to queue the request. Once this happens, we merely need to wait for an interrupt to indicate that the request is done. The devil is in the details.
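
As an example of the "fill in a command block" step, the sketch below builds a 10-byte READ command in a stand-in for the cmnd array. The toy_ names are hypothetical, the LUN and control bits are simply left at zero, and the hand-off to scsi_do_cmd is not shown because its exact argument list is not given in this document.

```c
#include <assert.h>
#include <string.h>

#define TOY_READ_10 0x28   /* operation code for the 10-byte READ */

/* Hypothetical subset of Scsi_Cmnd. */
struct toy_cmnd {
    unsigned char cmnd[12];   /* the SCSI command bytes */
    unsigned char cmd_len;    /* 6, 10 or 12 */
};

/* Fill in a 10-byte READ for nblocks blocks starting at block. */
static void toy_build_read10(struct toy_cmnd *cmd, unsigned long block,
                             unsigned int nblocks)
{
    memset(cmd->cmnd, 0, sizeof(cmd->cmnd));
    cmd->cmnd[0] = TOY_READ_10;
    /* Logical block address, most significant byte first. */
    cmd->cmnd[2] = (unsigned char)((block >> 24) & 0xff);
    cmd->cmnd[3] = (unsigned char)((block >> 16) & 0xff);
    cmd->cmnd[4] = (unsigned char)((block >> 8) & 0xff);
    cmd->cmnd[5] = (unsigned char)(block & 0xff);
    /* Transfer length in blocks, most significant byte first. */
    cmd->cmnd[7] = (unsigned char)((nblocks >> 8) & 0xff);
    cmd->cmnd[8] = (unsigned char)(nblocks & 0xff);
    cmd->cmd_len = 10;
}
```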

3.2 Block devices

3.2.1 ll_rw_blk

To understand how block device requests are handled, it is first required that you understand the ll_rw_blk layer. To begin with, the ll_rw_blk layer has the job of accumulating I/O requests that originate from filesystems themselves. In order to optimize I/O performance, the ll_rw_blk layer has a number of properties which are of interest to us:

Many of these properties don't suit us well. To work around these problems, a couple of poor design decisions (i.e. hacks) were made.
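
To give a feel for the merging and sorting behaviour attributed to the ll_rw_blk layer, here is a deliberately simplified model. It is not the kernel's code: the toy_ names are hypothetical, only back-merges are attempted, the 200-sector cap is taken from the rough figure quoted in this document, and a plain sorted insertion stands in for the real elevator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_SECTORS 200   /* rough merge limit quoted in the text */

/* Hypothetical stand-in for a block request. */
struct toy_request {
    unsigned long sector;
    unsigned long nr_sectors;
    struct toy_request *next;
};

/*
 * Add blocks to the queue. Returns the request that now covers them,
 * or NULL if memory could not be allocated.
 */
static struct toy_request *toy_add_request(struct toy_request **queue,
                                           unsigned long sector,
                                           unsigned long nr_sectors)
{
    struct toy_request *req, *prev, *new_req;

    /* Try to merge, skipping the head, which is assumed to be active. */
    if (*queue != NULL) {
        for (req = (*queue)->next; req != NULL; req = req->next) {
            if (req->sector + req->nr_sectors == sector &&
                req->nr_sectors + nr_sectors <= TOY_MAX_SECTORS) {
                req->nr_sectors += nr_sectors;
                return req;
            }
        }
    }

    new_req = malloc(sizeof(*new_req));
    if (new_req == NULL)
        return NULL;
    new_req->sector = sector;
    new_req->nr_sectors = nr_sectors;

    if (*queue == NULL) {
        new_req->next = NULL;
        *queue = new_req;
        return new_req;
    }

    /* Keep everything after the head in ascending sector order. */
    prev = *queue;
    while (prev->next != NULL && prev->next->sector < sector)
        prev = prev->next;
    new_req->next = prev->next;
    prev->next = new_req;
    return new_req;
}
```

Even in this toy form you can see why the head of the list is awkward: a block adjacent to the head request cannot be merged and ends up as a separate request.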

3.2.2 do_sd_request/do_sr_request

The SCSI upper layer receives a call from the ll_rw_blk layer - there isn't a specific request attached to this, but it is a general suggestion that there might be something in the queue which can be started. Thus the first thing the SCSI upper layer must do is examine the request queue and see if the request at the head can be queued immediately. If it is queueable, then we allocate a Scsi_Cmnd structure which will be used for the request, and copy the request into the request field. At the same time, we completely release the original request block so that the ll_rw_blk layer doesn't need to know about which requests in the list might already be active.

There is one important caveat here, however. At the time we set up the command block, we look to see how many scatter-gather segments would be required to queue the entire command. If this number exceeds the maximum scatter-gather tablesize, then instead of completely removing the request from the queue, we instead split it in two so that we bite off just as much as we can chew at the moment.
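
The "bite off as much as we can chew" decision can be sketched like this. The model assumes that buffers which sit next to each other in memory share one scatter-gather segment; that rule, along with the toy_ names, is an assumption of the sketch and not a description of the actual driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one buffer belonging to a request. */
struct toy_buffer {
    char *data;
    unsigned int size;
};

/*
 * Return how many leading buffers of the request fit into at most
 * max_segments scatter-gather segments. Buffers that are contiguous
 * in memory are assumed to share a segment.
 */
static unsigned int toy_buffers_that_fit(const struct toy_buffer *bufs,
                                         unsigned int nbufs,
                                         unsigned int max_segments)
{
    unsigned int i, segments = 0;

    for (i = 0; i < nbufs; i++) {
        int contiguous = i > 0 &&
            bufs[i - 1].data + bufs[i - 1].size == bufs[i].data;

        if (!contiguous) {
            if (segments == max_segments)
                break;          /* table is full: split the request here */
            segments++;
        }
    }
    return i;
}
```

If the return value is smaller than nbufs, the first part is queued now and the remainder stays on the queue as a shortened request.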

Once we have the SCpnt, it gets passed to requeue_sd_request.

No matter what, the do_sd_request function will keep looping back to search for more commands that can be queued. The point of this operation is to keep as many devices as possible active at the same time.

The decision of whether a command is queueable or not largely depends upon the scsi_allocate_device() or scsi_request_queueable() functions.

3.2.3 requeue_sd_request/requeue_sr_request

This is the function that actually calculates the physical sectors which the request belongs to. Attempts to read past the end of the disk or attempts to access offline devices should be rejected at this point.

The same goes for attempts to access a device for which the media has been changed - we won't allow access until the partition tables have been re-read.
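
The checks described so far can be sketched as a single mapping function. The toy_ structures are hypothetical stand-ins for the partition table entry and per-disk state, and the -1 return value is just this sketch's way of saying "reject the request".

```c
#include <assert.h>

/* Hypothetical partition table entry, in sectors. */
struct toy_partition {
    unsigned long start_sect;
    unsigned long nr_sects;
};

/* Hypothetical per-disk state. */
struct toy_disk {
    int online;
    int media_changed;   /* set until the partition table is re-read */
};

/*
 * Translate a partition-relative request into an absolute sector.
 * Returns 0 on success, or -1 if the request must be rejected.
 */
static int toy_map_request(const struct toy_disk *disk,
                           const struct toy_partition *part,
                           unsigned long sector, unsigned long nr_sectors,
                           unsigned long *abs_sector)
{
    if (!disk->online || disk->media_changed)
        return -1;
    /* Written this way so that sector + nr_sectors cannot overflow. */
    if (sector > part->nr_sects || nr_sectors > part->nr_sects - sector)
        return -1;
    *abs_sector = part->start_sect + sector;
    return 0;
}
```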

The next major task of this function is to make sure that the buffers are suitable for I/O. This will involve generating the scatter-gather table, if required, and allocating bounce buffers for host adapters that do DMA over an ISA bus.

Finally, the actual SCSI command is generated, and the mid-layer scsi_do_cmd function is called, which will pass the command down to the actual host adapter. The upper layer will not handle this command again until the interrupt handler calls back up to the rw_intr function for post-processing.

3.2.4 rw_intr

The post processing has a number of jobs. In the event that there were no errors, then it is just a matter of deallocating any bounce buffers, and deallocating memory for the scatter-gather table. Once this has taken place, the actual buffer status for the buffers that belong to the request must be updated.

Once this is done, then the queuing side is called again to see if a new command can now be started.

In the event that an error was detected, we attempt to find out how much of the request actually succeeded. Any part that has succeeded is handled in the appropriate way by marking buffers uptodate.

The buffers representing blocks after this point are marked not uptodate and then are unlocked.
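
The partial-completion handling can be modelled as a walk over the buffers of the request. The toy_buffer_head structure and toy_end_request are hypothetical, and the rule that a buffer only counts as good if all of its sectors transferred is an assumption of this sketch.

```c
#include <assert.h>

/* Hypothetical stand-in for a buffer attached to the request. */
struct toy_buffer_head {
    unsigned int nr_sectors;
    int uptodate;
    int locked;
};

/*
 * good_sectors is how much of the request succeeded, counted from the
 * start. Buffers fully covered by it are marked uptodate; the rest
 * are marked not uptodate. Every buffer is unlocked.
 */
static void toy_end_request(struct toy_buffer_head *bufs,
                            unsigned int nbufs, unsigned int good_sectors)
{
    unsigned int i;

    for (i = 0; i < nbufs; i++) {
        if (good_sectors >= bufs[i].nr_sectors) {
            bufs[i].uptodate = 1;
            good_sectors -= bufs[i].nr_sectors;
        } else {
            bufs[i].uptodate = 0;
            good_sectors = 0;   /* nothing after the failure is good */
        }
        bufs[i].locked = 0;
    }
}
```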

3.3 Character devices

Character devices tend to be quite a bit simpler than the block device drivers. The major difference is that there are different types of entrypoints. For a tape drive, we have a separate entrypoint for read and for write, for example. In addition, the requests always originate from a user process rather than from some intermediate buffer cache or filesystem.

The general approach is fairly simple. We start by requesting a pointer to a Scsi_Cmnd object with the scsi_allocate_device function, call scsi_do_cmd and then block waiting for an interrupt. The interrupt handler has the job of waking up the process that queued the request.

The interrupt handler could, if it wanted to, examine the sense data to see if the command completed normally. I should point out that this task could just as easily be done by the process that queued the request, and by doing it this way we could reduce the interrupt latency somewhat.

One problem with the current generation of SCSI character device drivers is that they tend to allocate large buffers so as to be guaranteed a DMA-safe location from which data can be transferred. While this does work, it increases the kernel memory footprint. None of the character device drivers attempt to use scatter-gather to reduce the need for the large buffer - in fact, it is entirely possible that for some requests the I/O can be done directly into the pages of user memory which correspond to the buffer.

At the time of this writing, an effort is being made to fit the generics driver with the ability to use scatter-gather instead of the "big" buffer that it currently uses.

3.4 Ioctl

There isn't a whole lot about an ioctl that is different from the way that requests are handled for character devices. The difference is that an ioctl on a file descriptor might be requesting any one of the following:

This is implemented by having each level (in the order specified above) examine the ioctl command code, process the command if it is something that it recognizes, and pass it down to the next level if the command is something that is not recognized. Other than this, ioctl tends to be done just about the same as a character device request.
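
A sketch of that pass-it-down arrangement, with the levels in the order device-specific, device-independent, then host. The command codes, the return values and the TOY_ENOTTY value are all made up for illustration.

```c
#include <assert.h>

#define TOY_ENOTTY (-25)   /* hypothetical "nobody recognised it" value */

/* Hypothetical ioctl command codes, one per level. */
enum {
    TOY_CMD_EJECT = 1,       /* device-specific */
    TOY_CMD_GET_IDLUN = 2,   /* device-independent */
    TOY_CMD_HOST_INFO = 3    /* about the host itself */
};

/* Last level: information about the host. */
static int toy_host_ioctl(int cmd)
{
    if (cmd == TOY_CMD_HOST_INFO)
        return 300;
    return TOY_ENOTTY;
}

/* Middle level: handle what we know, pass the rest down. */
static int toy_generic_ioctl(int cmd)
{
    if (cmd == TOY_CMD_GET_IDLUN)
        return 200;
    return toy_host_ioctl(cmd);
}

/* First level: the upper-level (here, disk) driver. */
static int toy_disk_ioctl(int cmd)
{
    if (cmd == TOY_CMD_EJECT)
        return 100;
    return toy_generic_ioctl(cmd);
}
```

The caller only ever invokes toy_disk_ioctl; which level ends up handling the command is invisible from above.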

3.5 Disk

There is one feature that makes a disk drive different from a cdrom or any other type of device. Disks tend to have partition tables. This means that at bootup time we must hook into the data structures that genhd uses to inform it that a new device exists which may have a partition table.

In addition, when a media change is detected, we must disallow access to the device until the fop_revalidate entrypoint is called - this will typically happen as result of the disk being remounted. Note that we do attempt to lock the door for drives with removable media, so cases where the media is changed while the disk is still mounted are rare. More often than not the situation is that the disk is unmounted, the media is changed, and then mounted again. During the mount procedure, the media change is detected, the buffers flushed, and the new partition table is read. The revalidation code lives in revalidate_scsidisk.

There is also a module related complication - when a device is removed from the system, the genhd datastructures must be cleaned up, and if the disk driver itself is unloaded, the sd_gendisk structure itself must be removed from the gendisk linked list. This is all handled by the cleanup_module entrypoint.

3.6 CDROM

A CDROM has an entirely different set of problems associated with it. They tend to be slow, which isn't really our concern here. If the user is using a changer type of device, they can be *really* slow as the media needs to be flipped in and out of the drive itself.

Most of the differences relate to the fact that a CDROM has a lot of additional IOCTL functions which can be used for purposes such as playing audio. In addition, CDROMs have a table of contents (newer ones do, anyways), which is how multi-session discs are supported. Some drives have the ability to read/write audio data directly. Finally, some drives are also CDROM writers, which have the ability to burn a disc as well as reading it.

In fact many of these oddball capabilities don't directly have anything to do with the CDROM drive itself. For example, burning a disc with a cd writer tends to involve using the generics interface rather than the cdrom interface.

The table of contents does enter into the picture for the CDROM driver, however. The reason for this is that the iso9660 filesystem must have the ability to find the volume descriptor for the last session on the disc, as this is the one which contains the pointer to the root directory that we need to use.

The support for reading the table of contents internally uses the ioctl interface, but the filesystem gains access to this through an entry in the cdrom_device_ops table.

Some of the ioctl functions that the CDROM driver supports are vendor specific. This is due to the fact that earlier drafts of the standard didn't provide this functionality, but newer drafts do. Thus there isn't likely to be a code explosion as more and more new drives come on the market.

3.7 Tape

In theory, the tape driver should be simple. Convert a user request into a SCSI command, issue the command, wait for the result and then return. In theory. In practice, it takes a few thousand lines of code to accomplish it all.

Much of the reason for this is that there are tons of ioctls related to setting things like blocksize, density, and compression. Also, there are ioctls which skip the tape forward and backward, count filemarks, and so forth.

Beyond this, I don't know that much about how the tape driver works. If more information is needed here, perhaps Kai Makisara can fill it in.

3.8 Generics

The generics driver is yet another case of something which was badly done in the original version, and then patched like crazy to make it somewhat workable. At the moment, there are plans to rewrite portions of the thing in the Linux-2.3 timeframe. Until that time, the thing is going to be incrementally patched to fix some of the remaining problems.

Other than this, the generics driver is another classic character mode driver. The difference is that the user specifies the SCSI command that is to be executed, and the user gets the error status back again.

3.9 Future directions

As you can see from some of the discussion above, a lot about command queuing (especially for block devices) leaves something to be desired. What we have now seems to be fairly stable, but it is very hard to maintain, and I believe that there are performance bottlenecks introduced by some of the design decisions in the current implementation.

Here are some of the rough ideas for what we would like for the new version - at the moment, it seems likely that these changes will be available in the early Linux-2.3 timeframe. A lot of this is just preliminary ideas, so don't take any of this as set in stone.

4. SCSI Middle layer

 

4.0 Introduction

 

The middle layer is sort of a nebulous name for an object that has lots of different functions. I tend to view the middle layer as containing everything that goes into the SCSI core module that is loaded when the entire SCSI subsystem is loaded as a module. As such, it can be described as having the following subcomponents:

The module-related glue itself. Only used when SCSI is used as a module, and even then only when modules are being added and removed.
Proc filesystem support.
Bus scan support.
Other misc initializations, both boot time and module related.
Error handling.
Queueing of commands.
Bottom half handler.
Utility functions.

Each one of these will be discussed in turn - some of these are important enough that they will be discussed in their own section.

4.1 Boot time Initialization

At boot time, the SCSI core must be initialized before any other parts of the SCSI subsystem - in fact, the SCSI core itself initializes the other components at the appropriate times.

The main entrypoint is scsi_dev_init, which is called from outside of the SCSI subsystem when the rest of the kernel is prepared for SCSI to come online.

Initially we start with some simple stuff: registering /proc/scsi and registering the bottom half handler.

4.1.1 scsi_init

Once this is done, then the function scsi_init is called - this function has the job of running through all of the low-level drivers that were compiled into the system, and calling the detect function to see whether the cards are actually present. During this stage the Scsi_Host objects are allocated and initialized. The error handler threads are launched at this point, if the drivers use the new error handling code, and finally the information is displayed to the console about which cards have been located.

This function also registers the top level devices that are compiled into the system, mainly by sticking them in the scsi_devicelist linked list.

Once this is complete, we move on to the bus scan (discussed below).

After the bus scan, the init entrypoint is called for all upper level drivers - this will allocate internal data structures (mainly arrays indexable by minor number). Finally, for all detected devices the attach entrypoint is called to insert the pointer to the Scsi_Device into the internal tables of the upper level driver. We also call scsi_build_commandblocks which will allocate all of the Scsi_Cmnd objects that will be used.

At this point we have a complete list of all devices attached to the system. Then we call resize_dma_pool to allocate memory for the DMA pool - this will be used for bounce buffers during actual I/O transactions.

The last step of initialization is to call the finish method for all of the registered upper level drivers. This will go through and do spin-up, read-capacity, and other initializations.
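The ordering of the upper level entrypoints described above (init for every driver, then attach for every detected device, then finish for every driver) can be sketched roughly as follows. This is a user-space illustration only, not the kernel code: struct upper_driver, run_upper_level_init, the phase log, and the demo driver are all hypothetical names, and the sketch assumes that attach returns 0 when the driver accepts the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an upper level driver's entrypoints. */
struct upper_driver {
    const char *name;
    int  (*init)(void);              /* allocate internal tables */
    int  (*attach)(int device_index); /* assumed: 0 means "attached" */
    void (*finish)(void);            /* spin-up, read-capacity, ... */
};

/* Small log so the order of the calls can be inspected afterwards. */
enum phase { PH_INIT = 1, PH_ATTACH, PH_FINISH };
#define MAX_LOG 32
enum phase phase_log[MAX_LOG];
size_t phase_log_len;

static void log_phase(enum phase p)
{
    if (phase_log_len < MAX_LOG)
        phase_log[phase_log_len++] = p;
}

/* A demo driver that accepts every device and records each call. */
static int  demo_init(void)       { log_phase(PH_INIT); return 0; }
static int  demo_attach(int dev)  { (void)dev; log_phase(PH_ATTACH); return 0; }
static void demo_finish(void)     { log_phase(PH_FINISH); }

struct upper_driver demo_driver = { "demo", demo_init, demo_attach, demo_finish };

/*
 * Run the three phases in the order described in the text.
 * Returns the number of successful attach calls.
 */
int run_upper_level_init(struct upper_driver *drivers, size_t ndrivers,
                         int ndevices)
{
    size_t d;
    int dev, attached = 0;

    assert(drivers != NULL || ndrivers == 0);

    for (d = 0; d < ndrivers; d++)       /* phase 1: init for all drivers */
        drivers[d].init();

    for (dev = 0; dev < ndevices; dev++) /* phase 2: attach each device */
        for (d = 0; d < ndrivers; d++)
            if (drivers[d].attach(dev) == 0)
                attached++;

    for (d = 0; d < ndrivers; d++)       /* phase 3: finish for all drivers */
        drivers[d].finish();

    return attached;
}
```

The point of the sketch is only the ordering: no driver sees a device before its tables exist, and nothing is spun up until every device has been attached.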

4.2 Module load Initialization

This is a more restricted version of the boot time initialization. In the case where we are loading the SCSI core as a module, we know that there are not yet any low-level or upper-level drivers present in the system, so initialization merely consists of setting up enough information such that those modules can be loaded later.

4.3 Bus scan

The bus scan is where commands are sent out to each host adapter on the system to inquire as to what devices, if any, are present on each bus. The general theory is simple - loop over all of the channels, all of the SCSI IDs on each channel, and see what's out there.
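As a rough picture of that loop, here is a minimal user-space sketch. It is not the code from scan_scsis: the names scan_bus, probe_fn, and demo_probe are hypothetical, and two details are assumptions added for the illustration, namely that the host adapter's own ID is skipped and that the remaining luns are skipped when nothing answers at lun 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical probe callback: nonzero means "something answered". */
typedef int (*probe_fn)(int channel, int id, int lun);

/* Demo probe: pretend devices exist at IDs 0 and 3 on channel 0, lun 0. */
int demo_probe(int channel, int id, int lun)
{
    return channel == 0 && lun == 0 && (id == 0 || id == 3);
}

/*
 * Walk every channel, ID, and lun and count what responds.
 * host_id is the adapter's own ID, which is never probed (assumption).
 */
int scan_bus(int num_channels, int num_ids, int num_luns, int host_id,
             probe_fn probe)
{
    int channel, id, lun, found = 0;

    assert(probe != NULL);

    for (channel = 0; channel < num_channels; channel++) {
        for (id = 0; id < num_ids; id++) {
            if (id == host_id)
                continue;            /* don't probe ourselves */
            for (lun = 0; lun < num_luns; lun++) {
                if (probe(channel, id, lun))
                    found++;
                else if (lun == 0)
                    break;           /* nothing at lun 0: skip the rest */
            }
        }
    }
    return found;
}
```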

This is done by first sending a TEST_UNIT_READY. Any response which indicates the presence of a real device is taken as an affirmative. Note that simply testing for success/failure isn't good enough as there might be disks out there which aren't spun up and hence not ready. We really need to look at the sense data and see what the actual response is.
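To show why success/failure alone is not enough, here is a small sketch of the kind of check involved. It assumes fixed-format sense data, where the sense key lives in the low four bits of byte 2. The function name device_seems_present is hypothetical, and the choice of which sense keys count as "a real device is there" (NOT READY and UNIT ATTENTION) is an assumption for the sketch, not a statement of what the kernel accepts.

```c
#include <assert.h>
#include <stddef.h>

#define SENSE_KEY_NOT_READY      0x02  /* e.g. a disk that isn't spun up */
#define SENSE_KEY_UNIT_ATTENTION 0x06  /* e.g. device was just reset */

/*
 * Decide whether a TEST UNIT READY result suggests a real device.
 * command_succeeded: nonzero if the command completed with good status.
 * sense/sense_len:   sense data returned on failure (may be NULL/0).
 * Returns 1 if a device appears to be present, 0 otherwise.
 */
int device_seems_present(int command_succeeded,
                         const unsigned char *sense, size_t sense_len)
{
    assert(sense != NULL || sense_len == 0);

    if (command_succeeded)
        return 1;
    if (sense == NULL || sense_len < 3)
        return 0;                      /* no usable sense data */
    if ((sense[0] & 0x7e) != 0x70)
        return 0;                      /* not fixed-format sense (0x70/0x71) */

    switch (sense[2] & 0x0f) {         /* sense key */
    case SENSE_KEY_NOT_READY:
    case SENSE_KEY_UNIT_ATTENTION:
        return 1;                      /* failed, but a device answered */
    default:
        return 0;
    }
}
```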

Next, an inquiry is performed. This SCSI command retrieves a lot of information about the device - device type, manufacturer, plus a whole lot more.

When we get a response back from the inquiry, then the Scsi_Device object is initialized, largely based upon data returned from the INQUIRY command. This is the point where the printed message appears on the console describing the device.
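As an illustration of what "largely based upon data returned from the INQUIRY command" involves, here is a sketch that pulls a few fields out of the standard 36-byte inquiry response. The struct inquiry_info and the function parse_inquiry are hypothetical names for this example and are not the Scsi_Device structure; only the byte offsets come from the standard inquiry data layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for a few fields of the inquiry response. */
struct inquiry_info {
    unsigned char device_type; /* 0 = disk, 1 = tape, 5 = CD-ROM, ... */
    int  removable;            /* nonzero if the medium is removable */
    char vendor[9];            /* 8 characters plus terminator */
    char model[17];            /* 16 characters plus terminator */
    char rev[5];               /* 4 characters plus terminator */
};

/*
 * Parse standard inquiry data. Returns 0 on success, -1 if fewer than
 * 36 bytes were returned. Text fields are copied as-is, including any
 * trailing space padding.
 */
int parse_inquiry(const unsigned char *data, size_t len,
                  struct inquiry_info *out)
{
    assert(data != NULL && out != NULL);
    if (len < 36)
        return -1;

    out->device_type = data[0] & 0x1f;       /* byte 0, low 5 bits */
    out->removable   = (data[1] & 0x80) != 0; /* byte 1, top bit */

    memcpy(out->vendor, data + 8, 8);         /* bytes 8..15 */
    out->vendor[8] = '\0';
    memcpy(out->model, data + 16, 16);        /* bytes 16..31 */
    out->model[16] = '\0';
    memcpy(out->rev, data + 32, 4);           /* bytes 32..35 */
    out->rev[4] = '\0';
    return 0;
}
```

The vendor, model, and revision strings are what end up in the console message, and the device type is what the upper level drivers look at when deciding whether they want the device.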

Once we have found a device, we check to see what high level drivers might be willing to drive it. This is accomplished by calling the detect entrypoint for all of the registered upper level drivers.
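A very rough sketch of that step is shown here. The structures and names (scanned_device, detecting_driver, count_willing_drivers, and the three demo detect functions) are hypothetical, and the idea that the generics driver is willing to take any device is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* SCSI peripheral device type codes as reported by INQUIRY. */
#define DEV_TYPE_DISK 0x00
#define DEV_TYPE_TAPE 0x01
#define DEV_TYPE_ROM  0x05

/* Hypothetical, stripped-down view of a freshly scanned device. */
struct scanned_device {
    unsigned char type;
};

/* Hypothetical upper level driver with only a detect entrypoint. */
struct detecting_driver {
    const char *name;
    int (*detect)(const struct scanned_device *dev); /* nonzero = willing */
};

static int disk_detect(const struct scanned_device *dev)
{
    return dev->type == DEV_TYPE_DISK;
}

static int cdrom_detect(const struct scanned_device *dev)
{
    return dev->type == DEV_TYPE_ROM;
}

static int generic_detect(const struct scanned_device *dev)
{
    (void)dev;
    return 1;   /* assumption: the generics driver takes everything */
}

struct detecting_driver demo_drivers[] = {
    { "disk",    disk_detect },
    { "cdrom",   cdrom_detect },
    { "generic", generic_detect },
};

/* Ask every registered driver whether it wants the device. */
int count_willing_drivers(const struct detecting_driver *drivers,
                          size_t ndrivers, const struct scanned_device *dev)
{
    size_t i;
    int willing = 0;

    assert(dev != NULL);
    for (i = 0; i < ndrivers; i++)
        if (drivers[i].detect(dev))
            willing++;
    return willing;
}
```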

One of the ugly hacks in the kernel is the blacklist - this is essentially a hardcoded list in the kernel of all known devices that misbehave in one way or another. Many people express shock that such an ugly hack is used; however, firmware bugs are common enough that there is no other acceptable option. One of the more annoying firmware bugs is where a device will respond to queries on all luns, thereby making it look like there are 8 devices when in fact there is only one.
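The general shape of such a list can be sketched as a table keyed on the vendor and model strings from the inquiry data. Everything in this sketch is invented for illustration: the flag QUIRK_SINGLE_LUN, the struct quirk_entry, the function lookup_quirks, and in particular the table entries, which do not name real devices.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: device answers on every lun, so only scan lun 0. */
#define QUIRK_SINGLE_LUN 0x01u

/* Hypothetical table entry keyed on inquiry vendor/model strings. */
struct quirk_entry {
    const char *vendor;
    const char *model;
    unsigned    flags;
};

/* Made-up entries; a real list would name actual misbehaving devices. */
static const struct quirk_entry quirk_table[] = {
    { "EXAMPLE", "BROKEN-CD", QUIRK_SINGLE_LUN },
    { "ACMECO",  "ODD-DISK",  QUIRK_SINGLE_LUN },
};

/*
 * Return the quirk flags for a device, or 0 if it is not listed.
 * Table strings are matched as prefixes, so space padding or a
 * model suffix in the inquiry data does not defeat the match.
 */
unsigned lookup_quirks(const char *vendor, const char *model)
{
    size_t i;

    assert(vendor != NULL && model != NULL);
    for (i = 0; i < sizeof quirk_table / sizeof quirk_table[0]; i++) {
        const struct quirk_entry *e = &quirk_table[i];
        if (strncmp(vendor, e->vendor, strlen(e->vendor)) == 0 &&
            strncmp(model, e->model, strlen(e->model)) == 0)
            return e->flags;
    }
    return 0;
}
```

With something like this in place, the bus scan can consult the flags after the inquiry and stop at lun 0 for a device that is known to answer on every lun.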

The implementation of the bus scan code is in scan_scsis and scan_scsis_single.

4.4 Command queueing

4.5 Bottom half handler

4.6 Utility functions

4.6.1 scsi_allocate_device

4.6.2 scsi_request_queueable

4.6.3 scsi_release_command

4.6.4 scsi_do_cmd

4.6.5 internal_cmnd

4.6.6 scsi_done

4.6.7 scsi_bottom_half_handler

4.6.8 scsi_finish_command

5. SCSI Low level drivers

 

5.0 Introduction

 

5.1 Initialization

 

5.2 Queueing

 

5.3 Completion

 

5.4 Old error handling entrypoints

 

5.5 New error handling entrypoints

 

5.6 Proc filesystem support

 

5.7 Template issues

6. Error handling

 

6.0 Introduction

 

6.1 Lower layer

 

6.2 Upper layer

 

6.3 Old middle layer

 

6.4 New middle layer
