PHYSICAL STRUCTURES

"You have a point there," said Trurl,"We will have to figure some way around that......Themain thing, however, is the algorithm."
"Any child knows that! What's abeast without an algorithm?"
  • Stanislaw Lem,The Cyberiad

Access Methods

Accessing is the process of obtaining data stored on storage devices.

Access methods are generalized I/O routines that conceal most device-dependent details from the file management or database management system. Some existing access routines also include functions generally attributed to a DBMS, such as keyword compression, sequencing, sorting, and secondary index generation.



Buffering

The purpose of buffering is to allow the operating system to transfer one buffer of data while the application program is processing data in another buffer. An example is illustrated below.


Blocking

Blocking refers to the practice of placing several logical records within one physical record. The purposes of blocking are: (a) to reduce access time by reading/writing more data in one I/O access, and (b) to increase the amount of data stored on I/O devices. The first reason is quite evident, because I/O devices such as disks and tapes usually have long latency times and relatively short data transmission times. The second reason also becomes evident when we consider the inter-record gaps that are wasted on I/O devices. For tapes and disks, it is common practice to block many logical records into a single physical record, in order to reduce this inter-record gap.
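The space saving can be sketched with a quick calculation. The record and gap sizes below are illustrative assumptions, not figures from the text:

```python
# Sketch: medium utilization as a function of blocking factor.
RECORD_LEN = 100   # bytes per logical record (assumed)
GAP_LEN = 600      # bytes lost to each inter-record gap (assumed)

def utilization(blocking_factor):
    """Fraction of the medium holding data when `blocking_factor`
    logical records are packed into one physical record."""
    block = blocking_factor * RECORD_LEN
    return block / (block + GAP_LEN)

# One record per block wastes most of the medium on gaps;
# twenty records per block recovers most of the space.
low = utilization(1)    # about 0.14
high = utilization(20)  # about 0.77
```

Larger blocks also mean fewer I/O accesses per record, which is the access-time benefit noted in (a) above.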


Sequential Access Method

The logical records are accessed in a sequential order, either according to their natural sequence of entry (or entry-sequence), or according to a fixed sequence based upon the sorted value of a key (or key-sequence).

Direct Access Method

If the records are stored in a direct-access storage device, then with the correct formatting and blocking, the records can be accessed directly by their physical position (or address) or by a key. A symbolic key may or may not be used directly to locate a logical record in the storage device. If there exists a key-to-address transformation that can be implemented by a computer program to convert a symbolic key to the physical address of the record, then we can use symbolic keys for direct access of records.


Indexed Access Method

The indexed access method can be used for either sequential or direct processing and accessing of records in a file. The records are logically arranged with respect to a certain collating sequence, according to a key field in each record. A directory (sometimes called an index) is maintained for either sequential or direct access.

The concept of key-to-address transformation is analogous to using a mathematical function to specify a mapping from keys to addresses. The concept of an index is analogous to using a table to specify the mapping.

An example is illustrated. Using this table (or index, or directory), we can find the address of any record, given its key value. For example, the record with key '00120' has the address '1'.


The index above is searched sequentially. When the number of records becomes large, such sequential searching becomes uneconomical. A search tree (also called a tree structure index, multilevel index, or multilevel directory) can be used, as shown below. A search tree, or a multilevel directory, is a generalization of the binary search tree.

To find the address of the record with key '04800', we start with the first-level index and compare the key '04800' with the keys stored in this index. We find the first key that is greater than or equal to the search key (the key for the record to be searched) in the collating sequence, and follow the pointer to the second-level index. In this example, since '06127' is larger than '04800', the second-level index B is entered. We again compare '04800' against the stored keys, and find a match. Since there are no more levels of indices, the address for the record with key '04800' can be found, which is '5'.
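The two-level walk just described can be sketched in code. The key values '00120' and '04800' come from the example; the table layout and the remaining keys are assumptions:

```python
# Sketch of a two-level index search (index contents are assumed).
first_level = [("03000", "A"), ("06127", "B"), ("09999", "C")]
second_level = {
    "A": [("00120", 1), ("01900", 2), ("03000", 3)],
    "B": [("04100", 4), ("04800", 5), ("06127", 6)],
    "C": [("07200", 7), ("09999", 8)],
}

def lookup(key):
    # Find the first first-level key >= the search key, follow its pointer,
    # then match the key exactly in the second-level index.
    for k, table in first_level:
        if k >= key:
            for k2, addr in second_level[table]:
                if k2 == key:
                    return addr
            return None
    return None

lookup("04800")   # follows index B and yields address 5, as in the text
```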

In the above example, the directory contains an entry for every record. Such an index is called a dense index. If a dense index is used, the indexed access method is sometimes called the Indexed Direct Access Method (IDAM).

For the Indexed Sequential Access Method (ISAM), the index is usually nondense. As an example, if the second-level tables are removed, the index becomes nondense. When a nondense index is used, the addresses in the directory are the addresses of the first records in blocks of records, as shown below. In order to locate the correct record, a sequential scan of the records in a block has to be made.
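A nondense lookup can be sketched as follows: the directory holds one entry per block (the key of the block's first record), the right block is found in the directory, and the block is then scanned sequentially. The keys, block contents, and addresses below are assumptions:

```python
# Sketch of an ISAM-style lookup with a nondense index (assumed data).
blocks = {
    10: [("00120", "rec1"), ("00380", "rec2"), ("01900", "rec3")],
    11: [("03000", "rec4"), ("04100", "rec5"), ("04800", "rec6")],
}
# One directory entry per block: (key of first record in block, block address).
index = [("00120", 10), ("03000", 11)]

def isam_lookup(key):
    target = None
    for first_key, addr in index:
        if first_key <= key:
            target = addr   # last block whose first key is <= the search key
        else:
            break
    if target is None:
        return None
    for k, rec in blocks[target]:   # sequential scan within the block
        if k == key:
            return rec
    return None
```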


Key-To-Address Transformation

Key-to-address transformation algorithms enable the retrieval of records in one access, by directly computing the record address from the given search key. An example is illustrated below.

Hashing Technique

In hashing schemes, the key is converted into a near-random number, and this number is used to determine the record address, or the address of a group of records, called a bucket. The number of logical records stored in this area is referred to as the bucket capacity. The hashing scheme is illustrated below.

When the file is initially loaded, the location in which the records are stored is determined as follows:

(1) The key is converted into a near-random number n, which lies between 1 and M, where M is the number of storage buckets.
(2) The logical bucket address n is converted to the physical bucket address, and the physical record constituting that bucket is read.
(3) If the bucket is full, the record is stored in an overflow area.

When records are read from the file, the method of finding them is as follows:

(1) The key is converted into a near-random number n, using the same hashing algorithm.
(2) The logical bucket address n is converted to the physical bucket address, and the bucket is searched to find the desired logical record, by sequential-scanning, block-search, or binary-search techniques.
(3) If the record is not in the bucket, the overflow area is searched.
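The loading and retrieval steps above can be sketched together. The toy key-to-address transform, bucket count, and bucket capacity are all illustrative assumptions:

```python
# Sketch of the bucket-hashing scheme with a separate overflow area.
M = 4           # number of buckets (assumed)
CAPACITY = 2    # bucket capacity in logical records (assumed)
buckets = [[] for _ in range(M)]
overflow = []   # separate overflow area

def bucket_of(key):
    # Toy key-to-address transform: a near-random number in 0..M-1.
    return sum(map(ord, key)) % M

def store(key, record):
    home = buckets[bucket_of(key)]
    if len(home) < CAPACITY:
        home.append((key, record))
    else:
        overflow.append((key, record))  # bucket full: spill to overflow area

def fetch(key):
    # Search the home bucket first, then the overflow area.
    for k, rec in buckets[bucket_of(key)] + overflow:
        if k == key:
            return rec
    return None
```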
The factors affecting the performance of a hashing algorithm are the following:

(1) The bucket size, or how many logical records can be stored per bucket. Smaller buckets cause a higher proportion of overflows, which will necessitate additional bucket reads and possibly additional seeks. If a direct-access storage device with a long access time is used, it is desirable to minimize the number of overflows by choosing a bucket size of 10 or more. If the entire file is stored in main memory, then a bucket size of 1 can be chosen, so that the bucket-searching operation can be more efficient. In practice, bucket capacity is often tailored to the hardware characteristics. For example, one track of a disk unit, or half a track, may be made one bucket.

(2) The packing density, or the ratio of the number of records in a file to the total number of records that can be stored in the buckets. Because of the random nature of the hashing algorithm, it is not possible to achieve 100% packing density. A higher packing density means more overflows and more efficient storage. A lower packing density means fewer overflows and less efficient storage. The trade-off between storage and access time can be investigated by adjusting the two parameters: bucket size and packing density.

(3) The hashing algorithm used to perform the key-to-address transformation. The ideal transform is one that distributes the key set uniformly across the address space. The best way to choose a transform is to take the key set for the file in question and simulate the behavior of many possible transforms. For each transform, all the keys in the file are converted, and the number of records in each bucket is counted. The percentage of overflows for given bucket sizes can then be evaluated, and the best hashing algorithm chosen.
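The simulation can be sketched as follows. The key set, bucket parameters, and candidate transforms are all hypothetical; the point is only the counting procedure:

```python
# Sketch of transform selection by simulation: hash every key in a
# hypothetical key set and count the records that would overflow.
M, CAPACITY = 8, 4
keys = [f"{n:05d}" for n in range(0, 595, 7)]   # hypothetical key set

def overflow_count(transform):
    counts = [0] * M
    for key in keys:
        counts[transform(key) % M] += 1
    return sum(max(0, c - CAPACITY) for c in counts)

# Candidate key-to-address transforms (illustrative):
first_digits = lambda k: int(k[:2])   # leading digits: hits few buckets here
last_digits = lambda k: int(k[-2:])   # trailing digits: better spread here
```

Running `overflow_count` over each candidate and picking the transform with the fewest overflows is the selection procedure the text describes.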

(4) The method of handling overflows, or the way of storing overflowed records in overflow areas. The commonly used techniques are illustrated below. Overflowed records can be organized into overflow chains in a separate overflow area (a). The overflow area can also be organized with overflow buckets, so that the number of additional disk seeks can be reduced (b). In the latter case, when a record is deleted, a deletion flag is set. Another overflow may fill the slot at a later time. However, if there are many deletions, the overflow area should be reorganized periodically, to remove the deleted records.

The overflow buckets can also be distributed among the buckets in the prime area, as shown below (a). This method has the advantage that the overflow bucket is physically close to the overflowed bucket. No additional disk seek is necessary, and no chains have to be maintained. In case an overflow bucket overflows, the next consecutive overflow bucket is used.

Instead of reserving overflow buckets, the easiest way to handle overflows is the consecutive spill method, where overflowed records are stored in the next consecutive bucket (b). If that bucket is full, the record is stored in the next bucket, and so forth. The advantage of the consecutive spill method is that a lengthy seek is usually not needed for overflows. However, when neighboring buckets become crowded, many buckets may have to be examined before a free slot is found. The method is efficient if the bucket size is large, and inefficient otherwise.
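A minimal sketch of the consecutive spill method, with assumed bucket count and capacity (and no deletions, so an empty slot safely ends a search):

```python
# Sketch of consecutive spill: an overflowed record goes into the next
# consecutive bucket, wrapping around at the end of the file.
M, CAPACITY = 5, 2
buckets = [[] for _ in range(M)]

def spill_store(home, key, record):
    """Store (key, record) starting at bucket `home`, spilling forward."""
    for i in range(M):
        b = buckets[(home + i) % M]
        if len(b) < CAPACITY:
            b.append((key, record))
            return (home + i) % M   # bucket actually used
    raise RuntimeError("file is full")

def spill_fetch(home, key):
    """Scan forward from the home bucket until found or a non-full bucket."""
    for i in range(M):
        b = buckets[(home + i) % M]
        for k, rec in b:
            if k == key:
                return rec
        if len(b) < CAPACITY:
            return None   # a non-full bucket ends the spill chain
    return None
```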

The method of handling overflows should be selected in consideration of disk access speeds, the percentage of overflows, the frequencies of insertion and deletion of records, bucket size, file size, etc.


Types of Indices

An index is dense if it includes all key values of a file, and nondense otherwise. The indexed sequential access method uses a nondense index, and the indexed direct access method uses a dense index.

An index can be built on a primary key (or primary composite key). It can also be built on a secondary key (or secondary composite key). The keys can be compressed by various techniques. The primary index can be either dense or nondense, depending on whether ISAM or IDAM is used. The pointers in the lowest-level index can be absolute pointers, logical pointers, or symbolic pointers. The pointers can point to an individual record or a block of records, again depending upon the access method used.

There is another kind of index, called a secondary index. A secondary index is built upon a non-key attribute, instead of a key or a composite key. A secondary index is usually employed as an auxiliary index for information retrieval from a file. Since there is another primary index for record access based upon a primary key, the secondary index is an auxiliary index. However, we must note that there is no one-to-one address-key correspondence.

The figure below illustrates some of the useful index types. Whether any index should be constructed depends upon the access patterns exhibited by the queries, and the applications intended.

The file is shown in (a). The primary key is A0, and the attributes are A1, A2, and A3. A nondense primary index is shown in (b). (c) illustrates a secondary index on A1, where the pointers are logical addresses. The same secondary index can also be represented by (d), where symbolic pointers are used (so that the primary index must be used to find the records); or by (e), where the logical addresses are coded into a bit-string, and the entire matrix is called a bit matrix. (f) illustrates a nondense index on A2. This attribute has values between 0 and 30. To compress information, we have retained only six quantized levels for the attribute A2, and the logical block addresses are represented by a bit matrix. Finally, a secondary index on A3 is shown in (g), where only the logical addresses of records with A3 = "F" are retained. A composite secondary index can also be formed, as illustrated in (h), where the composite attribute {A2, A3} is used.

Construction of Primary and Secondary Indices

As an example, the file shown below has six records, with PNO as the primary key and SEX as the descriptor. The generalized loader creates an ISAM file and an ISAM index, plus a file containing (SEX, PNO) pairs as shown. After sorting on SEX, a secondary index (or descriptor index) is created.


Inverted File Structure

In a regular file structure, the records are ordered according to the primary key. Regular file structures are not well suited to answering attribute queries. For example, the RESIDENT file contains six records as shown below.

The inverted file structure is shown below.


Multilist File Structure

An example of a multilist file structure is illustrated below. It should be noted that it is not necessary to have lists for every attribute, and for some attributes, quantized levels are used to reduce the number of lists and the size of the indices. In the directory for a multilist file, the list length for every (attribute, value) pair is also stored, which can be used to decide which list to search first in query processing.


B-Trees

An example of the construction of a B-tree is illustrated below.

Whenever one index table overflows, we create two half-full index tables to replace the overflowing table. With every split, one more entry is added to the next level up. When the top-level table splits, a new master level is created, beginning with just two entries.

For example, a new record with key "21" is to be inserted. Since the original index table T21 overflows, it is split into two new tables T21 and T22, as shown in (b). Similarly, in (b), after the insertion of the record with key "41", the original index table T23 overflows. The restructuring of the tree results in the creation of a new master table, as illustrated in (c).
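The split step can be sketched on a single sorted index table. The table order and the entry layout are assumptions, not details from the figure:

```python
# Sketch of a B-tree split: an overflowing index table is replaced by
# two half-full tables, and one entry is pushed up to the next level.
ORDER = 4   # maximum entries per index table (assumed)

def insert_entry(table, key, addr):
    """Insert into a sorted table; on overflow return (left, sep, right)."""
    table.append((key, addr))
    table.sort()
    if len(table) <= ORDER:
        return None                  # no split needed
    mid = len(table) // 2
    left, right = table[:mid], table[mid:]
    return left, right[0][0], right  # separator key goes up a level
```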

The IBM VSAM file structure, which employs a variation of Knuth's B+ tree, is illustrated below.


Extendible Hashing

This approach is similar to the B-tree technique of dynamically reorganizing the index tree. An example of extendible hashing is illustrated below. Suppose the relation R(K1, K2) has two domains K1 and K2, with K1 = {A, B, C, D, E, F, G, H, I, J} and K2 = {1, 2, 3, 4, 5, 6}. The relation R is:

K1 K2
A 1
B 1
C 1
D 2
E 1
F 3
G 2
H 4
I 4
J 6


The hash function f1: K1 -> I is:

K1 I
A 0000
B 0001
C 0010
D 0011
E 0100
F 0101
G 0110
H 0111
I 1000
J 1001

The hash function f1 is a one-to-one function mapping keys to binary hash codes. However, to construct the hash file, not the entire hash code string is used. Suppose there are five records with keys 'A', 'B', 'E', 'G', and 'I'; we may use only two bits of the hash code. The situation is illustrated in (a), where d is the length of the truncated hash code, in this case 2. The hash file contains pointers to blocks, each capable of holding two logical records. Attached to each block, there is another number indicating how many bits of the hash code are needed to distinguish the records stored in that block. In (a), the first two blocks contain 2 records each, and the third block contains 1 record.

Suppose record (F, 3) is to be inserted. Since the truncated hash code of key 'F' is '01', it should be inserted into the block containing records (E, 1) and (G, 2), causing a split because the block can accommodate only two records. After the split, one more block is added, and the records in the overflowed block are distributed to the two blocks using the new hash code, which has been extended by 1 bit, as illustrated in (b). Since one more bit is used, the depth d is increased to 3, and the hash table is doubled in size. To avoid doubling the hash table, it is possible to use a tree structure to encode the hash table. The resultant dynamic hashing technique is even closer to the B-tree technique.
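The insertion of (F, 3) can be traced in code. The hash codes and block contents follow the example above; the representation of the hash table as a dict of bit-string prefixes is an assumption, and a single split is assumed to resolve each overflow:

```python
# Sketch of extendible hashing: a directory of 2**d bit-string prefixes
# points to blocks of capacity 2; splitting a block whose depth equals d
# doubles the directory.
CAPACITY = 2

class Block:
    def __init__(self, depth, records=None):
        self.depth = depth            # bits needed for this block's records
        self.records = records or []

def eh_insert(directory, d, code, record):
    blk = directory[code[:d]]
    if len(blk.records) < CAPACITY:
        blk.records.append((code, record))
        return directory, d
    if blk.depth == d:                # no spare bit: double the directory
        d += 1
        directory = {k + x: v for k, v in directory.items() for x in "01"}
    # Redistribute the overflowed block's records on one more bit.
    old = blk.depth
    b0, b1 = Block(old + 1), Block(old + 1)
    for c, r in blk.records + [(code, record)]:
        (b0 if c[old] == "0" else b1).records.append((c, r))
    prefix = code[:old]               # shared by all records in the block
    for k in directory:
        if k.startswith(prefix):
            directory[k] = b0 if k[old] == "0" else b1
    return directory, d

# State (a) from the example: hash codes truncated to d = 2 bits.
blkAB = Block(2, [("0000", "A"), ("0001", "B")])
blkEG = Block(2, [("0100", "E"), ("0110", "G")])
blkI = Block(1, [("1000", "I")])
directory = {"00": blkAB, "01": blkEG, "10": blkI, "11": blkI}

# Inserting (F, 3), code "0101", overflows the E/G block: d grows to 3.
directory, d = eh_insert(directory, 2, "0101", "F")
```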


Key and Data Compression/Encoding Techniques

A straightforward technique for the compression of character strings is to truncate strings into fixed-length strings.

The construction of an index with truncated keys is shown below.

Rear-end truncation is another commonly used technique for key compression. The keys are first sorted. The truncated key is obtained by using the least number of characters, starting from the left, that are required to completely distinguish the key from the one immediately preceding it. An example is illustrated in column 2. In order to be able to use truncated keys in an index, a bidirectional rear-end truncation technique should be used. In column 3, a truncated key is obtained by using the least number of characters, starting from the left, that are required to completely distinguish the key from the one immediately succeeding it. In column 4, the truncated key is obtained from the longer one in columns 2 and 3. Bidirectional truncated keys have the property that no truncated key is the initial segment of another truncated key, unless the truncated key is identical to the original untruncated key, in which case it is marked by a flag (an asterisk in columns 3 and 4). In making comparisons, such keys are treated as having a blank character as the rightmost character. (Thus, BELL* is treated as a five-character string, where the symbol * indicates a blank.) The bidirectional truncated keys can be used in a tree index without causing confusion.
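The column-2 computation (distinguish each sorted key from its predecessor) can be sketched as follows; the key list is an assumed sample, not the one from the figure:

```python
# Sketch of rear-end truncation: keep the fewest leading characters that
# distinguish each sorted key from the one immediately preceding it.
def truncate_key(key, neighbor):
    """Shortest prefix of `key` that differs from `neighbor`'s prefix."""
    for i in range(1, len(key) + 1):
        if key[:i] != neighbor[:i]:
            return key[:i]
    return key   # key is a prefix of its neighbor: keep it whole (flagged)

keys = ["BATES", "BELL", "BELLMAN", "BUSH"]   # assumed, already sorted
truncated = [truncate_key(k, p) for p, k in zip([""] + keys, keys)]
# truncated == ["B", "BE", "BELLM", "BU"]
```

Column 3 is computed the same way against each key's successor, and column 4 takes the longer of the two results.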

Further compression can be obtained when the index is constructed as a parsing tree in one or more levels. With the parsing tree, the highest level of the index contains a portion of the key. The next level of the index contains the following character or characters, but does not repeat those that were in the higher index. An example is illustrated below. It can be seen that this technique combines key compression and tree structure methods.

Sometimes a bit matrix can be used to store a variable-length list of values of the same descriptor. For example, an employee may serve both as an accountant and as a secretary. If the job category is encoded in code #1, then the above information can be stored as {001, 101}. If the job category is encoded in code #2, then it is possible to combine the codes for ACCOUNTANT and SECRETARY into 10001, which is the logical disjunction, or logical OR, of the codes 00001 and 10000, as shown below.
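The code #2 combination is a one-line bitwise OR; the two code values come straight from the text:

```python
# Sketch of code #2 above: one bit per job category, so multiple
# categories combine by logical OR.
ACCOUNTANT = 0b00001
SECRETARY = 0b10000
combined = ACCOUNTANT | SECRETARY
f"{combined:05b}"   # '10001', as in the text
```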



Huffman code is an example of a variable-length code that results in the smallest average code-word length. The procedure for generating a binary Huffman code is illustrated below. First, the code-words are sorted into a descending sequence according to their probabilities of occurrence. Then, the two smallest probabilities are added together, leading to a higher node in the coding tree. This node has a probability equal to the sum of its two descendant nodes. Again, the two smallest probabilities are combined. The process repeats until the highest node, with probability 1, is reached, and a coding tree has been constructed. For each branch in this coding tree, a 0 or a 1 is assigned, as shown below. Finally, each original code-word is encoded into the binary sequence shown in the coding tree. The average code-word length in the above example is 2 bits. It can be shown that the Huffman code is theoretically optimal in the sense of having the minimum average code-word length.
