Sequential Access vs. Random Access

HDFS cannot modify a file in place, but it can append to one. HBase works around this by marking old versions of the data as obsolete and appending the new data to its HFiles. Hive can only perform updates on its own ORC storage format, and under the hood those updates are also appends; for plain formats such as txt or lzo there is no way to update at all.
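As a rough illustration of the append-only model, here is a minimal Java sketch that appends to an existing HDFS file through the FileSystem API; the path /tmp/log.txt is just a placeholder, and it assumes append is enabled on the cluster:

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/log.txt");        // placeholder path

        // HDFS offers no random-write API; the only way to add data is append().
        try (FSDataOutputStream out = fs.append(file)) {
            out.write("one more line\n".getBytes(StandardCharsets.UTF_8));
        }
        // To change bytes that are already written, the whole file has to be
        // rewritten and swapped in for the old one.
    }
}
```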

  • Sequentially accessed data is laid out contiguously. The disk head sweeps across the platter in order and rarely has to seek, which makes the transfer fast, because seek time is the dominant cost of disk reads and writes. Truly sequential access is rare in everyday applications; a continuous backup of a large file is sequential, and dd is a classic example of sequential reads and writes.
  • Random access means the head moves constantly because the data is not laid out contiguously on the disk, which is a consequence of how the data was written over time. Random access is much slower than sequential access, again because the head wastes most of its time seeking and repositioning. Most applications' disk I/O is random (see the code sketch after this list).
    • In practice, taking Linux as an example, the OS reads ahead 8 blocks when writing data, i.e. when you first start writing a file the OS tries hard to keep its data contiguous on disk, but it cannot sustain this overall. Suppose the disk is brand new and you write a 300 KB file: it is contiguous. Afterwards other files are written, each also contiguous. After a while many files have been written and, of course, files keep getting modified. If you then modify a file, you will find that the blocks next to it have already been taken by other files, so the head has to write the changed blocks somewhere else on the disk. Over time most files on the disk are no longer contiguous but scattered across the platter. When a program reads such a file, the head keeps seeking, gathering data from different positions on the disk with no apparent pattern; where the head moves is determined by the inode. At that point the program's access to the disk is random.
  • The sequential access pattern is when you read your data in sequence (often from start to finish).
    • Consider a book example. When reading a novel, you use sequential order: you start with page 1, then move to page 2 and so on.
  • The other common pattern is called Random Access. This is when you jump from one place to another, and possibly even backwards when reading data.
    • For a book example, consider a dictionary. You don't read it like you read a novel. Instead, you search for your word somewhere in the middle. And when you're done looking up that word, you may go look for another word located hundreds of pages away from where the book is currently open. That searching for the place where you should start reading is called a "seek".
    • When you access sequentially, you only need to seek once and then read until you're done with that data. When doing random access, you need to seek every time you want to switch to a different place in your file. This can be quite a performance hit on hard drives, because seeking is really expensive on magnetic drives.
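To make the seek cost concrete, here is a minimal Java sketch of the two patterns on a local file; the file name data.bin and the request count are placeholders, and the file is assumed to be larger than one read chunk. The sequential loop seeks once and streams to the end, while the random loop pays a seek before every read:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Random;

public class AccessPatternDemo {
    static final int CHUNK = 4096;               // bytes read per request

    // Sequential: one implicit seek at the start, then read front to back.
    static void sequentialRead(String path) throws IOException {
        byte[] buf = new byte[CHUNK];
        try (FileInputStream in = new FileInputStream(path)) {
            while (in.read(buf) != -1) {
                // process buf ...
            }
        }
    }

    // Random: every iteration jumps to a different offset, so every read
    // is preceded by a seek (expensive on magnetic drives).
    static void randomRead(String path, int requests) throws IOException {
        byte[] buf = new byte[CHUNK];
        Random rnd = new Random(42);
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long max = raf.length() - CHUNK;      // assumes file > CHUNK bytes
            for (int i = 0; i < requests; i++) {
                raf.seek(Math.floorMod(rnd.nextLong(), max));   // jump somewhere else
                raf.readFully(buf);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        sequentialRead("data.bin");               // placeholder file
        randomRead("data.bin", 1000);
    }
}
```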

HDFS

  • Hadoop uses blocks to store a file or parts of a file. A Hadoop block is a file on the underlying filesystem. Since the underlying filesystem stores files as blocks, one Hadoop block may consist of many blocks in the underlying file system. Blocks are large: they default to 64 megabytes (128 megabytes since Hadoop 2), and most systems run with block sizes of 128 megabytes or larger.
    Hadoop is designed for streaming or sequential data access rather than random access. Sequential data access means fewer seeks, since Hadoop only seeks to the beginning of each block and begins reading sequentially from there.
  • HDFS is append only. To modify any portion of a file that is already written, one must rewrite the entire file and replace the old file.
  • Reading a very large file takes only about fileSize / blockSize seeks; within each block the access is sequential (the sketch after this list prints these numbers for a real file).
  • HDFS is not POSIX compliant.
  • Hadoop works best with very large files. The larger the file, the less time Hadoop spends seeking for the next data location on disk and the more time it runs at the limit of your disks' bandwidth.
  • HDFS's block abstraction has the following advantages:
    • Blocks are fixed in size, which makes it easy to calculate how many fit on a disk.
    • Because a file is made up of blocks that can be spread over multiple nodes, it can be larger than any single disk in the cluster.
    • HDFS blocks also don't waste space: if a file is not an even multiple of the block size, the block containing the remainder does not occupy the space of an entire block. (Question: how is the appendToFile operation optimized?)
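To make the block arithmetic above concrete, here is a small Java sketch against the Hadoop FileSystem API; the path /data/big.log is only a placeholder, and it assumes a client configured to talk to the cluster. It prints the file size, the block size, the fileSize / blockSize estimate of the number of blocks (and therefore seeks in a full scan), and the datanodes holding each block:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/big.log");     // placeholder path

        FileStatus status = fs.getFileStatus(file);
        long fileSize  = status.getLen();
        long blockSize = status.getBlockSize();

        // Roughly one seek per block: fileSize / blockSize, rounded up.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.printf("size=%d bytes, blockSize=%d, ~%d blocks/seeks%n",
                          fileSize, blockSize, blocks);

        // Each block can live on different datanodes, which is why a single
        // file can be larger than any one disk in the cluster.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, fileSize)) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                              loc.getOffset(), loc.getLength(),
                              String.join(",", loc.getHosts()));
        }
    }
}
```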
  • 2
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值