Hive File Formats


李元乐

Article link: https://blog.csdn.net/hugolyl/article/details/48049405


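A database exists to store data; that much is obvious. But how is the data actually stored? Every database has its own storage format. With commercial databases, outsiders generally cannot see how the data is laid out internally. MySQL, as we know, has several storage engines, such as ISAM, MyISAM, HEAP, InnoDB, and Berkeley DB (BDB). Hive supports multiple file formats, including text, SequenceFile, RCFile, Avro, ORC, and Parquet, and it also lets you define custom file formats. Let's go through what each of these formats is.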

Text File
SequenceFile
RCFile
Avro Files
ORC Files
Parquet
Custom INPUTFORMAT and OUTPUTFORMAT

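If no file format is specified in a CREATE TABLE or ALTER TABLE statement, the configuration property hive.default.fileformat determines which format is used. The default is TextFile. The configuration properties related to file formats are listed below.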

<property>
    <name>hive.default.fileformat</name>
    <value>TextFile</value>
    <description>
      Expects one of [textfile, sequencefile, rcfile, orc].
      Default file format for CREATE TABLE statement. Users can explicitly override it by CREATE TABLE ... STORED AS [FORMAT]
    </description>
</property>
<property>
    <name>hive.default.fileformat.managed</name>
    <value>none</value>
    <description>
      Expects one of [none, textfile, sequencefile, rcfile, orc].
      Default file format for CREATE TABLE statement applied to managed tables only. External tables will be
      created with format specified by hive.default.fileformat. Leaving this null will result in using hive.default.fileformat
      for all tables.
    </description>
</property>
<property>
    <name>hive.query.result.fileformat</name>
    <value>TextFile</value>
    <description>
      Expects one of [textfile, sequencefile, rcfile].
      Default file format for storing result of the query.
    </description>
</property>
<property>
    <name>hive.fileformat.check</name>
    <value>true</value>
    <description>Whether to check file format or not when loading data files</description>
</property>
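
To make the interplay between the first two properties concrete, here is a minimal Python sketch of the resolution order as their descriptions read. It is not Hive's actual code: `resolve_file_format` and `DEFAULTS` are hypothetical names introduced only for illustration.

```python
from typing import Mapping, Optional

# Default values copied from the property list above.
DEFAULTS = {
    "hive.default.fileformat": "TextFile",
    "hive.default.fileformat.managed": "none",
}


def resolve_file_format(conf: Mapping[str, str], is_managed: bool,
                        stored_as: Optional[str] = None) -> str:
    """Hypothetical helper: pick a table's format the way the descriptions read."""
    if stored_as:
        # An explicit STORED AS clause overrides every default.
        return stored_as.lower()
    merged = {**DEFAULTS, **conf}
    managed_default = merged["hive.default.fileformat.managed"].lower()
    if is_managed and managed_default != "none":
        # Managed tables use their own default when one is configured.
        return managed_default
    # External tables, and managed tables with "none", fall back here.
    return merged["hive.default.fileformat"].lower()


overrides = {"hive.default.fileformat.managed": "orc"}
print(resolve_file_format({}, is_managed=True))
print(resolve_file_format(overrides, is_managed=True))
print(resolve_file_format(overrides, is_managed=False))
```

With the override in place, only managed tables pick up ORC; external tables still follow hive.default.fileformat.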

1.Text File

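This is Hive's default format. Data is stored uncompressed, so both disk usage and parsing cost are high.
Text files can be combined with codecs such as Gzip, Bzip2, or Snappy (Hive detects the compression automatically and decompresses the data when a query runs). However, with a non-splittable codec such as Gzip or Snappy, Hive cannot split a compressed file, so that file cannot be processed in parallel. Bzip2 is the exception among these, since Hadoop can split it.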
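
The splitting problem can be illustrated outside Hive with a small Python sketch. It uses only the standard `gzip` and `zlib` modules; `read_text_split` and `try_gunzip_from` are hypothetical helpers, and the byte-range logic is a simplification of how a line-oriented reader might work, not Hadoop's implementation.

```python
import gzip
import zlib
from typing import List, Optional

rows = [f"{i}\tuser_{i}\n".encode() for i in range(1000)]
plain = b"".join(rows)
packed = gzip.compress(plain)


def read_text_split(data: bytes, start: int, end: int) -> List[bytes]:
    """Return the lines whose first byte lies in [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Skip the partial line that belongs to the previous split.
        prev_newline = data.find(b"\n", start - 1)
        if prev_newline == -1:
            return []
        pos = prev_newline + 1
    lines = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


def try_gunzip_from(data: bytes, offset: int) -> Optional[bytes]:
    """Try to decode a gzip stream starting at an arbitrary byte offset."""
    try:
        return zlib.decompressobj(wbits=31).decompress(data[offset:])
    except zlib.error:
        return None


mid = len(plain) // 2
# Plain text: two workers can each take one half and recover every row.
halves = read_text_split(plain, 0, mid) + read_text_split(plain, mid, len(plain))
# Gzip: a worker starting in the middle of the stream cannot decode anything.
from_middle = try_gunzip_from(packed, len(packed) // 2)
```

The plain-text halves add up to the full row list, while the gzip stream only decodes from its first byte, which is why a single reader has to process the whole file.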

2.SequenceFile

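SequenceFile is a binary file format provided by the Hadoop API that serializes data to a file as <key, value> pairs. Serialization and deserialization use Hadoop's standard Writable interface. It is compatible with MapFile in the Hadoop API (a MapFile is built on a sorted SequenceFile plus an index). Hive's SequenceFile derives from the Hadoop API's SequenceFile, but the key is left empty and the actual row is stored in the value, which avoids the sort that MR would otherwise perform during the map phase.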
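
The "empty key, row in the value" idea can be sketched with a toy length-prefixed record stream in Python. This is not the real SequenceFile binary layout (no header, sync markers, or Writable classes); `write_records` and `read_records` are hypothetical helpers that only show where the row ends up.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator, Tuple


def write_records(out: BinaryIO, pairs: Iterable[Tuple[bytes, bytes]]) -> None:
    """Write each pair as: key length, value length, key bytes, value bytes."""
    for key, value in pairs:
        out.write(struct.pack(">II", len(key), len(value)))
        out.write(key)
        out.write(value)


def read_records(src: BinaryIO) -> Iterator[Tuple[bytes, bytes]]:
    """Read pairs back until the stream is exhausted."""
    while True:
        header = src.read(8)
        if len(header) < 8:
            return
        key_len, value_len = struct.unpack(">II", header)
        yield src.read(key_len), src.read(value_len)


table_rows = [b"1\talice", b"2\tbob"]
buf = io.BytesIO()
# Hive-style usage: the key stays empty and the whole row goes in the value.
write_records(buf, [(b"", row) for row in table_rows])
buf.seek(0)
decoded = list(read_records(buf))
```

Every decoded pair has an empty key, so there is nothing meaningful to sort on.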

3.RCFile

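RCFile is a column-oriented data format introduced by Hive. It follows the design principle of "partition horizontally first, then vertically": rows are first divided into row groups, and within each row group the data is stored column by column. When a query does not need certain columns, RCFile skips them at the I/O level. Note, however, that in the map phase a remote read still copies the entire data block. Once the block is in the local directory, RCFile does not jump straight to the columns it needs; it scans the header of each row group to locate them, because the header at the HDFS block level does not record at which row group each column starts and ends. As a result, when all columns are read, RCFile actually performs worse than SequenceFile.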
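
The row-group layout can be sketched in a few lines of Python. This is a toy in-memory model under simplifying assumptions, not the RCFile on-disk format: `to_row_groups`, `read_columns`, and the dictionary shape are hypothetical and exist only to show rows being cut into groups and then stored column by column.

```python
from typing import Any, Dict, List, Sequence, Tuple


def to_row_groups(rows: Sequence[Tuple[Any, ...]],
                  group_size: int) -> List[Dict[str, Any]]:
    """Cut rows into groups, then store each group column by column."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        columns = [list(col) for col in zip(*chunk)]
        groups.append({
            "header": {"num_rows": len(chunk), "num_cols": len(columns)},
            "columns": columns,
        })
    return groups


def read_columns(groups: List[Dict[str, Any]],
                 wanted: Sequence[int]) -> List[Tuple[Any, ...]]:
    """Visit every row group, but touch only the requested columns."""
    out: List[Tuple[Any, ...]] = []
    for group in groups:
        picked = [group["columns"][c] for c in wanted]
        out.extend(zip(*picked))
    return out


rows = [(1, "a", 1.0), (2, "b", 2.0), (3, "c", 3.0)]
groups = to_row_groups(rows, group_size=2)
projected = read_columns(groups, wanted=[0, 2])
```

Note that `read_columns` still walks every row group to find the columns, which mirrors the per-row-group header scan described above.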

4.Avro Files

5.ORC Files

6.Parquet

7.Custom INPUTFORMAT and OUTPUTFORMAT
