Using External Storage to Process Big Data (1)

Problem:

   As discussed, big data is data that cannot fit in main memory (often called RAM, for Random Access Memory) all at once. How would you handle this situation?

Solution:

  We can use a divide-and-conquer algorithm: divide the big problem into small problems, solve each small problem with the same method, and finally merge the results. In this case a different kind of storage is necessary. Disk files generally have a much larger capacity than main memory, but external storage is also much slower than main memory. This speed difference means that different techniques must be used to handle it efficiently.

  Suppose our big data, consisting of many records, is in a file. Data is stored on the disk in chunks called blocks (also known as pages or allocation units), and the disk drive always reads or writes a minimum of one block of data at a time. The block size specifies the number of bytes per block; here a block can be as large as your main memory can afford. We can divide the file into blocks and then read only the block we want into main memory. The problem is how to find that block quickly.
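
  As a rough illustration of reading a file block by block, here is a minimal Python sketch. The names read_block and iter_blocks and the 4096-byte BLOCK_SIZE are assumptions made for this example; real block sizes depend on the disk and file system.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes; real systems vary


def read_block(path, block_number, block_size=BLOCK_SIZE):
    """Return the bytes of a single block without reading the rest of the file."""
    with open(path, "rb") as f:
        f.seek(block_number * block_size)  # jump straight to the block's offset
        return f.read(block_size)


def iter_blocks(path, block_size=BLOCK_SIZE):
    """Yield the file one block at a time, so only one block is in memory."""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # empty bytes means end of file
                break
            yield block
```

  With this, read_block(path, 3) loads only the fourth block, while iter_blocks(path) walks the whole file without ever holding more than one block in main memory.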


Problem: How can you find the block quickly?

Solution:

  Keep in mind that the time to access a block is much larger than the time for any internal processing of data in main memory, so the overriding consideration in devising an external storage strategy is minimizing the number of block accesses. The usual techniques for handling this problem are hashing, indexing, and B-trees.

1 Hashing and External Storage

  The central feature in external hashing is a hash table containing block numbers, which refer to blocks in external storage. The hash table is sometimes called an index (in the sense of a book's index). It can be stored in main memory or, if it is too large, stored externally on disk, with only part of it being read into main memory at a time.

1) First, all records with keys that hash to the same value are located in the same block.

2) Second, to find a record with a particular key, the search algorithm hashes the key, uses the hash value as an index into the hash table, gets the block number at that index, and reads the block.


  To implement this scheme, we must choose the hash function and the size of the hash table with some care so that a limited number of keys hash to the same value.
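
  The scheme above can be sketched roughly as follows. This is only an illustration under several assumptions: integer keys, a modulo hash function, tiny blocks of 256 bytes stored back to back in one file, and records kept as JSON inside each block. The class name HashedBlockFile and all constants are hypothetical, and a full block simply raises an error here.

```python
import json

TABLE_SIZE = 8    # assumed number of slots in the hash table
BLOCK_SIZE = 256  # assumed block size in bytes, kept tiny for the demo


def hash_key(key):
    """Assumed hash function: integer key modulo the table size."""
    return key % TABLE_SIZE


class HashedBlockFile:
    def __init__(self, path):
        self.path = path
        # hash_table[h] holds the number of the block storing all keys that hash to h
        self.hash_table = list(range(TABLE_SIZE))
        self.block_reads = 0  # counts block accesses, the cost we want to minimize
        empty = json.dumps({}).encode().ljust(BLOCK_SIZE)
        with open(path, "wb") as f:
            f.write(empty * TABLE_SIZE)

    def _read_block(self, block_number):
        self.block_reads += 1
        with open(self.path, "rb") as f:
            f.seek(block_number * BLOCK_SIZE)
            return json.loads(f.read(BLOCK_SIZE))

    def _write_block(self, block_number, records):
        data = json.dumps(records).encode()
        if len(data) > BLOCK_SIZE:
            raise OverflowError("block is full; overflow handling is not shown here")
        with open(self.path, "r+b") as f:
            f.seek(block_number * BLOCK_SIZE)
            f.write(data.ljust(BLOCK_SIZE))  # pad with spaces to a whole block

    def put(self, key, value):
        block_number = self.hash_table[hash_key(key)]
        records = self._read_block(block_number)
        records[str(key)] = value  # JSON object keys are strings
        self._write_block(block_number, records)

    def get(self, key):
        # hash the key, look up the block number, read exactly one block
        block_number = self.hash_table[hash_key(key)]
        return self._read_block(block_number).get(str(key))
```

  In this sketch a lookup costs a single block read no matter how many records the file holds, which is the property the hash table is meant to provide.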

For example: 

  We can put all the blocks in one directory and use the hash values as the block file names, so you can find a block file by its name. For instance, if your search key's hash value is 2, you can find the file 2.txt and read it into main memory, because all the keys with the same hash value are in the same block.
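
  A minimal Python sketch of this file-per-hash-value layout might look like the following. The modulo hash over NUM_BLOCKS, the tab-separated record format, and the function names are all assumptions made for illustration.

```python
import os

NUM_BLOCKS = 10  # assumed number of distinct hash values (block files)


def block_path(directory, key):
    """Map an integer key to its block file, e.g. key 12 -> <directory>/2.txt."""
    return os.path.join(directory, f"{key % NUM_BLOCKS}.txt")


def insert(directory, key, value):
    """Append the record to the block file chosen by the key's hash value."""
    with open(block_path(directory, key), "a", encoding="utf-8") as f:
        f.write(f"{key}\t{value}\n")


def lookup(directory, key):
    """Read only the one block file that can contain the key."""
    path = block_path(directory, key)
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        for line in f:
            k, v = line.rstrip("\n").split("\t", 1)
            if int(k) == key:
                return v
    return None
```

  Here keys 2 and 12 both land in 2.txt, so looking up either one opens that single file and nothing else.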


  You may be confused by 11.txt in the figure above. 11.txt is the overflow block file of 1.txt, used once 1.txt is full. This is the separate chaining method for handling full blocks; of course, you can use other methods to find the overflow blocks. In separate chaining, special overflow blocks are made available; when a primary block is found to be full, the new record is inserted in the overflow block.
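
  Separate chaining with overflow files could be sketched as below. The naming rule (repeat the primary name, so 1.txt overflows into 11.txt, then 111.txt) is an assumption inferred from the 1.txt and 11.txt example and only stays unambiguous while primary names are single digits. MAX_RECORDS and the function names are likewise hypothetical.

```python
import os

NUM_BLOCKS = 10   # assumed; primary blocks are 0.txt .. 9.txt
MAX_RECORDS = 2   # assumed block capacity in records, tiny for the demo


def chain_paths(directory, key):
    """Yield the primary block file, then its overflow files, in chain order."""
    name = str(key % NUM_BLOCKS)
    depth = 1
    while True:
        yield os.path.join(directory, name * depth + ".txt")  # 1, 11, 111, ...
        depth += 1


def count_records(path):
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def insert(directory, key, value):
    """Put the record in the first block of the chain that still has room."""
    for path in chain_paths(directory, key):
        if count_records(path) < MAX_RECORDS:
            with open(path, "a", encoding="utf-8") as f:
                f.write(f"{key}\t{value}\n")
            return path


def lookup(directory, key):
    """Search the primary block, then each overflow block, until the chain ends."""
    for path in chain_paths(directory, key):
        if not os.path.exists(path):
            return None  # reached the end of the chain without finding the key
        with open(path, encoding="utf-8") as f:
            for line in f:
                k, v = line.rstrip("\n").split("\t", 1)
                if int(k) == key:
                    return v
```

  Note that each overflow block in the chain costs one more block access, so long chains undo the benefit of hashing; this is why the hash function and table size need to be chosen with care.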
