Data Format: SequenceFiles

First we should understand which problems SequenceFile tries to solve, and then how it helps to solve them.

In HDFS

  • SequenceFile is one of the solutions to the small-files problem in Hadoop.
  • A small file is one that is significantly smaller than the HDFS block size (128MB).
  • Each file, directory, and block in HDFS is represented as an object in the NameNode's memory and occupies about 150 bytes.
  • 10 million files, each using one block, would use about 3 gigabytes of NameNode memory.
  • At the same rate a billion files would need roughly 300 gigabytes, which is not feasible.
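The memory figures above follow from simple arithmetic. Here is a minimal sketch, assuming the 150-bytes-per-object rule of thumb and ignoring directories; the function name `namenode_bytes` is made up for illustration and is not part of any Hadoop API.

```python
# Rule of thumb from the text: every file and every block is one
# NameNode object of roughly 150 bytes (directories are ignored here).
BYTES_PER_OBJECT = 150


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate the NameNode memory used by num_files files."""
    # One object for the file itself plus one object per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for count in (10_000_000, 1_000_000_000):
        gigabytes = namenode_bytes(count) / 1e9
        print(f"{count:>13,} files -> about {gigabytes:,.0f} GB")
```

This is only an estimate; real NameNode usage depends on the Hadoop version, path lengths, and JVM overhead.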

In MapReduce

  • Map tasks usually process one block of input at a time (using the default FileInputFormat).
  • The more files there are, the more map tasks are needed, and the job can become much slower.
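To see how the file count drives the number of map tasks, here is a rough sketch. It assumes one map task per block and at least one per file, which is a simplification of what FileInputFormat really does (split-size settings and empty files are ignored). `estimate_map_tasks` is a hypothetical helper, not a Hadoop function.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS block size used in the text (128MB)


def estimate_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Approximate the map task count: one per block, at least one per file."""
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


if __name__ == "__main__":
    small_files = [100 * 1024] * 10_000   # 10,000 files of 100KB each
    one_big_file = [sum(small_files)]     # the same bytes in a single file
    print("small files :", estimate_map_tasks(small_files))
    print("one big file:", estimate_map_tasks(one_big_file))
```

The same amount of data yields thousands of tiny tasks in one case and a handful in the other, and every task carries its own scheduling and startup cost.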

Small file scenarios

  • The files are pieces of a larger logical file.
  • The files are inherently small, for example, images.

These two cases require different solutions.

  • For the first case, write a program to concatenate the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
  • For the second case, some kind of container is needed to group the files.

Solutions in Hadoop

HAR files

  • HAR (Hadoop Archives) files were introduced to alleviate the pressure that large numbers of files put on the NameNode's memory.
  • HARs are probably best used purely for archival purposes.

SequenceFile

  • The idea of SequenceFile is to pack many small files into a single larger file.
  • For example, given 10,000 files of 100KB each, we can write a program that puts them into a single SequenceFile, using the filename as the key and the file content as the value.

    [Figure: SequenceFile file layout]

  • Some benefits:

    1. Less memory is needed on the NameNode. Continuing the example of 10,000 files of 100KB each:
      • Before using SequenceFile, the 10,000 files occupy about 4.5MB of RAM in the NameNode.
      • After using SequenceFile, a single 1GB SequenceFile spans 8 HDFS blocks, and its objects occupy about 3.6KB of RAM in the NameNode.
      • Both figures correspond to counting 150 bytes for each of 3 replicas per block: 10,000 x 3 x 150 bytes and 8 x 3 x 150 bytes.
    2. SequenceFile is splittable, so it is suitable for MapReduce.
    3. SequenceFile supports compression.
  • Supported compression types (the file structure depends on the compression type):

    1. Uncompressed
    2. Record-Compressed: compresses each record as it is added to the file.
      [Figure: record-compressed SequenceFile layout]

    3. Block-Compressed
      [Figure: block-compressed SequenceFile layout]

      • Waits until the buffered data reaches the block size, then compresses it.
      • Block compression provides a better compression ratio than record compression.
      • Block compression is generally the preferred option when using SequenceFile.
      • "Block" here is unrelated to the HDFS or filesystem block.
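
The difference between record and block compression can be illustrated with a toy container. The sketch below is not the Hadoop SequenceFile format or API: the length-prefixed layout, the function names, and the 64KB block threshold are all assumptions made for illustration, and zlib stands in for whichever codec a real job would use.

```python
import struct
import zlib


def encode_record(key: bytes, value: bytes) -> bytes:
    """Length-prefixed key/value record (toy layout, not Hadoop's)."""
    return struct.pack(">II", len(key), len(value)) + key + value


def pack_record_compressed(files: dict) -> bytes:
    """Record compression: compress each value as it is appended."""
    out = bytearray()
    for name, content in files.items():
        out += encode_record(name.encode(), zlib.compress(content))
    return bytes(out)


def _flush(buf: bytearray) -> bytes:
    """Compress one buffered block and prefix it with its compressed size."""
    data = zlib.compress(bytes(buf))
    return struct.pack(">I", len(data)) + data


def pack_block_compressed(files: dict, block_bytes: int = 64 * 1024) -> bytes:
    """Block compression: buffer records, compress once the buffer is full."""
    out, buf = bytearray(), bytearray()
    for name, content in files.items():
        buf += encode_record(name.encode(), content)
        if len(buf) >= block_bytes:
            out += _flush(buf)
            buf.clear()
    if buf:  # compress whatever is left at the end
        out += _flush(buf)
    return bytes(out)


def unpack_block_compressed(blob: bytes) -> dict:
    """Read back every (filename, content) pair from a block-compressed blob."""
    files, pos = {}, 0
    while pos < len(blob):
        (size,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        block = zlib.decompress(blob[pos:pos + size])
        pos += size
        off = 0
        while off < len(block):
            klen, vlen = struct.unpack_from(">II", block, off)
            off += 8
            key = block[off:off + klen].decode()
            off += klen
            files[key] = block[off:off + vlen]
            off += vlen
    return files


if __name__ == "__main__":
    # Hypothetical small files: filename is the key, content is the value.
    small_files = {
        f"log-{i}.txt": (f"INFO request {i} served ok\n" * 20).encode()
        for i in range(200)
    }
    print("record-compressed bytes:", len(pack_record_compressed(small_files)))
    print("block-compressed bytes :", len(pack_block_compressed(small_files)))
```

Block compression tends to win on small, similar records because the compressor sees many records at once and can exploit redundancy across them, while record compression starts from scratch for every value.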