Trevni: A Column File Format

Version 0.1

DRAFT

This document is the authoritative specification of a file format. Its
intent is to permit compatible, independent implementations that read
and/or write files in this format.

Introduction

Data sets are often described as a <table> composed of <rows> and
<columns>. Each record in the dataset is considered a row, with each
field of the record occupying a different column. Writing records to a
file one-by-one as they are created results in a <row-major> format,
like Hadoop's SequenceFile or Avro data files.

In many cases higher query performance may be achieved if the data is
instead organized in a <column-major> format, where multiple values of
a given column are stored adjacently. This document defines such a
column-major file format for datasets.

To permit scalable, distributed query evaluation, datasets are
partitioned into row groups, containing distinct collections of rows.
Each row group is organized in column-major order, while row groups
form a row-major partitioning of the entire dataset.

Rationale

* Goals

The format is meant to satisfy the following goals:

[[1]] Maximize the size of row groups. Disk drives are used most
efficiently when sequentially accessing data. Consider a drive that
takes 10ms to seek and transfers at 100MB/second. If a 10-column
dataset whose values are all the same size is split into 10MB row
groups, then accessing a single column will require a sequence of
seek+1MB reads, for a cost of 20ms/MB processed. If the same dataset is
split into 100MB row groups then this drops to 11ms/MB processed. This
effect is exaggerated for datasets with larger numbers of columns and
with columns whose values are smaller than average. So we'd prefer row
groups that are 100MB or greater.

[[1]] Permit random access within a row group. Some queries will first
examine one column, and, only when certain relatively rare criteria are
met, examine other columns. Rather than iterating through selected
columns of the row group in parallel, one might iterate through one
column and randomly access another. This is called support for WHERE
clauses, after the SQL operator of that name.

[[1]] Minimize the number of files per dataset. HDFS is a primary
intended deployment platform for these files. The HDFS Namenode
requires memory for each file in the filesystem, thus for a format to
be HDFS-friendly it should strive to require the minimum number of
distinct files.

[[1]] Support co-location of columns within row groups. Row groups are
the unit of parallel operation on a column dataset. For efficient file
i/o, the entirety of a row group should ideally reside on the host that
is evaluating the query in order to avoid network latencies and
bottlenecks.

[[1]] Data integrity. The format should permit applications to detect
data corruption. Many file systems may prevent corruption, but files
may be moved between filesystems and be subject to corruption at points
in that process. It is best if the data in a file can be validated
independently.

[[1]] Extensibility. The format should permit applications to store
additional annotations about a dataset in the files, such as type
information, origin, etc. Some environments may have metadata stores
for such information, but not all do, and files might be moved among
systems with different metadata systems. The ability to keep such
information within the file simplifies the coordination of such
information.

[[1]] Minimal overhead. The column format should not make datasets
appreciably larger.
Storage is a primary cost and a choice to use this format should not
require additional storage.

[[1]] Primary format. The column format should be usable as a primary
format for datasets, not as an auxiliary, accelerated format.
Applications that process a dataset in row-major order should be able
to easily consume column files and applications that produce datasets
in row-major order should be able to easily generate column files.

* Design

To meet these goals we propose the following design.

[[1]] Each row group is a separate file. All values of a column in a
file are written contiguously. This maximizes the row group size,
optimizing performance when querying few and small columns.

[[1]] Each file occupies a single HDFS block. A larger than normal
block size may be specified, e.g., ~1GB instead of the typical ~100MB.
This guarantees co-location and eliminates network use when query
processing can be co-located with the file. This also moderates the
memory impact on the HDFS Namenode since no small files are written.

[[1]] Each column in a file is written as a sequence of ~64kB
compressed blocks. The sequence is prefixed by a table describing all
of the blocks in the column to permit random access within the column.

[[1]] Application-specific metadata may be added at the file, column,
and block levels.

[[1]] Checksums are included with each block, providing data integrity.

* Discussion

The use of a single block per file achieves the same effect as the
custom block placement policy described in the {{CIF}} paper, while
still permitting HDFS rebalancing and not increasing the number of
files in the namespace.

Format Specification

This section formally describes the proposed column file format.

* Data Model

We assume a simple data model, where a record is a set of named fields,
and the value of each field is a sequence of untyped bytes. A type
system may be layered on top of this, as specified in the Type Mappings
section below.

* Primitive Values

We define the following primitive value types:

* Signed 64-bit <<long>> values are written using a variable-length
zig-zag coding, where the high-order bit in each byte determines
whether subsequent bytes are present. For example:

*---------------*-----------*
 decimal value  | hex bytes
*---------------*-----------*
 0              | 00
*---------------*-----------*
 -1             | 01
*---------------*-----------*
 1              | 02
*---------------*-----------*
 ...            | ...
*---------------*-----------*
 -64            | 7f
*---------------*-----------*
 64             | 80 01
*---------------*-----------*
 ...            | ...
*---------------*-----------*

* <<bytes>> are encoded as a <long> followed by that many bytes of data.

* a <<string>> is encoded as a <long> followed by that many bytes of
UTF-8 encoded character data.

For example, the three-character string "foo" would be encoded as the
<long> value 3 (encoded as hex 06) followed by the UTF-8 encoding of
'f', 'o', and 'o' (the hex bytes 66 6f 6f):

---
06 66 6f 6f
---
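These encodings can be illustrated in code. The following Java sketch
is non-normative: the class and method names are not part of this
specification, and a conforming implementation need not be structured
this way. Running it prints 06 66 6f 6f for the string "foo", matching
the example above.

---
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Non-normative sketch of the <long>, <bytes> and <string> encodings.
public class PrimitiveEncoding {

  // Write a signed 64-bit value using zig-zag, variable-length coding:
  // the low seven bits of each byte carry data and the high-order bit
  // indicates that more bytes follow.
  static void writeLong(OutputStream out, long n) throws IOException {
    long z = (n << 1) ^ (n >> 63);          // zig-zag: -1 -> 1, 1 -> 2, -64 -> 127, ...
    while ((z & ~0x7fL) != 0) {
      out.write((int) ((z & 0x7f) | 0x80)); // seven data bits, continuation bit set
      z >>>= 7;
    }
    out.write((int) z);                     // final byte, high-order bit clear
  }

  // <bytes>: a <long> length followed by that many bytes of data.
  static void writeBytes(OutputStream out, byte[] data) throws IOException {
    writeLong(out, data.length);
    out.write(data);
  }

  // <string>: a <long> length followed by UTF-8 encoded character data.
  static void writeString(OutputStream out, String s) throws IOException {
    writeBytes(out, s.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    writeString(buf, "foo");                // expect: 06 66 6f 6f
    for (byte b : buf.toByteArray())
      System.out.printf("%02x ", b & 0xff);
    System.out.println();
  }
}
---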
* Type Names

The following type names are used to describe column values:

* <<null>>, requires zero bytes. Sometimes used in array columns.

* <<int>>, like <long>, but restricted to 32-bit signed values

* <<long>> 64-bit signed values, represented as above

* <<fixed32>> 32-bit values stored as four bytes, little-endian.

* <<fixed64>> 64-bit values stored as eight bytes, little-endian.

* <<float>> 32-bit IEEE floating point value, little-endian

* <<double>> 64-bit IEEE floating point value, little-endian

* <<string>> as above

* <<bytes>> as above, may be used to encapsulate more complex objects

[]

Type names are represented as <strings> (UTF-8 encoded,
length-prefixed).

* Metadata

<<Metadata>> consists of:

* A <long> indicating the number of metadata key/value pairs.

* For each pair, a <string> key and <bytes> value.

[]

All metadata properties that start with "trevni." are reserved.

** File Metadata

The following file metadata properties are defined:

* <<trevni.codec>> the name of the default compression codec used to
compress blocks, as a <string>. Implementations are required to support
the "null" codec. Optional. If absent, it is assumed to be "null".
Codecs are described in more detail below.

* <<trevni.checksum>> the name of the checksum algorithm used in this
file, as a <string>. Implementations are required to support the
"crc-32" checksum. Optional. If absent, it is assumed to be "null".
Checksums are described in more detail below.

[]

** Column Metadata

The following column metadata properties are defined:

* <<trevni.codec>> the name of the compression codec used to compress
the blocks of this column, as a <string>. Implementations are required
to support the "null" codec. Optional. If absent, it is assumed to be
"null". Codecs are described in more detail below.

* <<trevni.name>> the name of the column, as a <string>. Required.

* <<trevni.type>> the type of data in the column. One of the type names
above. Required.

* <<trevni.values>> if present, indicates that the initial value of
each block in this column will be stored in the block's descriptor. Not
permitted for array columns or columns that specify a parent.

* <<trevni.array>> if present, indicates that each row in this column
contains a sequence of values of the named type rather than just a
single value. An integer length precedes each sequence of values,
indicating the count of values in the sequence.

* <<trevni.parent>> if present, the name of an <array> column whose
lengths are also used by this column. Thus values of this column are
sequences, but no lengths are stored in this column.

[]

For example, consider the following row, as JSON, where all values are
primitive types, but one has multiple values:

---
{"id"=566, "date"=23423234234,
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!"}
---

The columns for this might be specified as:

---
name=id       type=int
name=date     type=long
name=from     type=string
name=to       type=string  array=true
name=content  type=string
---

If a row contains an array of records, e.g. "received" in the
following:

---
{"id"=566, "date"=23423234234,
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!",
 "received"=[{"date"=234234234234, "host"="192.168.0.0.1"},
             {"date"=234234545645, "host"="192.168.0.0.2"}]}
---

Then one can define a parent column followed by a column for each field
in the record, adding the following columns:

---
name=received  type=null    array=true
name=date      type=long    parent=received
name=host      type=string  parent=received
---
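As a non-normative illustration, the following Java sketch shows the
value sequences that the example row above contributes to each of these
columns. The array columns "to" and "received" record a length before
each row's sequence of values, while "date" and "host" store only
values, since their lengths are taken from their parent column
"received". The class and variable names are illustrative only.

---
import java.util.ArrayList;
import java.util.List;

// Non-normative sketch: the value sequences each column receives for the
// single example row. The lists stand in for the value streams that would
// be serialized into each column's blocks.
public class ShredExample {
  public static void main(String[] args) {
    // "to" is an array column: a length precedes this row's values.
    List<Object> to = new ArrayList<>();
    to.add(2);
    to.add("bar@baz.com");
    to.add("bang@foo.com");

    // "received" is an array column of type null: a length, then one
    // zero-byte null value per record in the array.
    List<Object> received = new ArrayList<>();
    received.add(2);
    received.add(null);
    received.add(null);

    // "date" and "host" declare parent=received, so they store only values;
    // the sequence length (2) is reused from the "received" column.
    List<Object> date = new ArrayList<>();
    date.add(234234234234L);
    date.add(234234545645L);

    List<Object> host = new ArrayList<>();
    host.add("192.168.0.0.1");
    host.add("192.168.0.0.2");

    System.out.println("to:       " + to);
    System.out.println("received: " + received);
    System.out.println("date:     " + date);
    System.out.println("host:     " + host);
  }
}
---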
If an array value itself contains an array, e.g. the "sigs" below:

---
{"id"=566, "date"=23423234234,
 "from"="foo@bar.com",
 "to"=["bar@baz.com", "bang@foo.com"],
 "content"="Hi!",
 "received"=[{"date"=234234234234, "host"="192.168.0.0.1",
              "sigs"=[{"algo"="weak", "value"="0af345de"}]},
             {"date"=234234545645, "host"="192.168.0.0.2",
              "sigs"=[]}]}
---

Then a parent column may be defined that itself has a parent column:

---
name=sigs   type=null    array=true  parent=received
name=algo   type=string  parent=sigs
name=value  type=string  parent=sigs
---

** Block Metadata

No block metadata properties are currently defined.

* File Format

A <<file>> consists of:

* A <file header>, followed by

* one or more <columns>.

[]

A <<file header>> consists of:

* Four bytes, ASCII 'T', 'r', 'v', followed by the byte value 1.

* a <fixed64> indicating the number of rows in the file

* a <fixed32> indicating the number of columns in the file

* file <metadata>.

* for each column, its <column metadata>

* for each column, its starting position in the file as a <fixed64>.

[]

A <<column>> consists of:

* A <fixed32> indicating the number of blocks in this column.

* For each block, a <block descriptor>

* One or more <blocks>.

[]

A <<block descriptor>> consists of:

* A <fixed32> indicating the number of rows in the block

* A <fixed32> indicating the size in bytes of the block before the
codec is applied (excluding checksum).

* A <fixed32> indicating the size in bytes of the block after the codec
is applied (excluding checksum).

* If this column's metadata declares it to include values, the first
value in the block, serialized according to this column's type.

[]

A <<block>> consists of:

* The serialized column values. If a column is an array column then
value sequences are preceded by their length, as an <int>. If a codec
is specified, the values and lengths are compressed by that codec.

* The checksum, as determined by the file metadata.

[]
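The following Java sketch is a non-normative illustration of how a
reader might consume the fixed-width fields at the start of the <file
header> and use a column's <block descriptors> for random access to a
row. It omits metadata parsing, checksum verification and
decompression, and the class and method names are illustrative only.

---
import java.io.DataInputStream;
import java.io.IOException;

// Non-normative reader sketch for the file header and block descriptors.
public class HeaderSketch {

  // <fixed32> and <fixed64> values are little-endian.
  static int readFixed32(DataInputStream in) throws IOException {
    int b0 = in.readUnsignedByte(), b1 = in.readUnsignedByte();
    int b2 = in.readUnsignedByte(), b3 = in.readUnsignedByte();
    return b0 | (b1 << 8) | (b2 << 16) | (b3 << 24);
  }

  static long readFixed64(DataInputStream in) throws IOException {
    long lo = readFixed32(in) & 0xffffffffL;
    long hi = readFixed32(in) & 0xffffffffL;
    return lo | (hi << 32);
  }

  // Reads the fixed-width fields at the start of a file header. The file
  // metadata, column metadata and column start positions follow them.
  static void readHeaderStart(DataInputStream in) throws IOException {
    byte[] magic = new byte[4];
    in.readFully(magic);
    if (magic[0] != 'T' || magic[1] != 'r' || magic[2] != 'v' || magic[3] != 1)
      throw new IOException("not a recognized column file");
    long rowCount = readFixed64(in);
    int columnCount = readFixed32(in);
    System.out.println(rowCount + " rows, " + columnCount + " columns");
  }

  // Given the per-block row counts from a column's block descriptors (in
  // block order), return the index of the block containing `row`, counted
  // from the first row of the column. The byte offset of that block can
  // then be computed from the preceding blocks' compressed sizes and
  // checksum lengths, which are also known before any block data is read.
  static int findBlock(int[] blockRowCounts, long row) {
    long first = 0;
    for (int i = 0; i < blockRowCounts.length; i++) {
      if (row < first + blockRowCounts[i]) return i;
      first += blockRowCounts[i];
    }
    throw new IllegalArgumentException("row out of range: " + row);
  }

  public static void main(String[] args) {
    // Example: three blocks of 1000 rows each; row 2100 lives in block 2.
    int[] rowsPerBlock = {1000, 1000, 1000};
    System.out.println("row 2100 is in block " + findBlock(rowsPerBlock, 2100));
  }
}
---

Because all of a column's block descriptors precede its blocks, a
reader can locate and fetch only the blocks it needs without scanning
the rest of the column.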
* Codecs

[null] The "null" codec simply passes data through uncompressed.

[deflate] The "deflate" codec writes the data block using the deflate
algorithm as specified in RFC 1951.

[snappy] The "snappy" codec uses Google's Snappy compression library.

* Checksum algorithms

[null] The "null" checksum contains zero bytes.

[crc-32] Each "crc-32" checksum contains the four bytes of an ISO 3309
CRC-32 checksum of the uncompressed block data, as a <fixed32>.
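As a non-normative illustration, the following Java sketch produces the
body of a single block: it applies the "deflate" codec, which is
assumed here to mean a raw RFC 1951 stream with no zlib wrapper, and
computes a "crc-32" checksum of the uncompressed data to be appended as
a little-endian <fixed32>. The class and method names are illustrative
only.

---
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.Deflater;

// Non-normative sketch of compressing a block and computing its checksum.
public class BlockBodySketch {

  // "deflate" codec: raw RFC 1951 stream (nowrap = true, no zlib header).
  static byte[] deflate(byte[] uncompressed) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    deflater.setInput(uncompressed);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!deflater.finished())
      out.write(buf, 0, deflater.deflate(buf));
    deflater.end();
    return out.toByteArray();
  }

  // "crc-32" checksum of the uncompressed block data, as four
  // little-endian bytes (a <fixed32>).
  static byte[] crc32(byte[] uncompressed) {
    CRC32 crc = new CRC32();
    crc.update(uncompressed);
    long v = crc.getValue();
    return new byte[] {
      (byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)
    };
  }

  public static void main(String[] args) {
    byte[] values = "example serialized column values".getBytes();
    byte[] compressed = deflate(values);
    byte[] checksum = crc32(values);
    System.out.println(values.length + " bytes before the codec, "
        + compressed.length + " after, plus " + checksum.length
        + " checksum bytes");
  }
}
---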
* Type Mappings

We define a standard mapping for how types defined in various
serialization systems are represented in a column file. Records from
these systems are <shredded> into columns. When records are nested, a
depth-first recursive walk can assign a separate column for each
primitive value.

** Avro

** Protocol Buffers

** Thrift

Implementation Notes

Some possible techniques for writing column files include:

[[1]] Use a standard ~100MB block, buffer in memory up to the block
size, then flush the file directly to HDFS. A single reduce task might
create multiple output files. The namenode requires memory proportional
to the number of names and blocks*replication. This would increase the
number of names but not blocks, so this should still be much better
than a file per column.

[[1]] Spill each column to a separate local, temporary file, then, when
the file is closed, append these files, writing a single file to HDFS
whose block size is set to be that of the entire file. This would be a
bit slower than (1) and may have trouble when the local disk is full,
but it would better use HDFS namespace and further reduce seeks when
processing columns whose values are small.

[[1]] Use a separate mapreduce job to convert row-major files to
column-major. The map would output a (row#, column#, value) tuple,
partitioned by row# but sorted by column# then row#. The reducer could
directly write the column file. But the column file format would need
to be changed to write counts, descriptors, etc. at the end of files
rather than at the front.

[]

(1) is the simplest to implement and most implementations should start
with it.

* References

{CIF} {{{http://arxiv.org/pdf/1105.4252.pdf}<Column-Oriented Storage
Techniques for MapReduce>}}, Floratou, Patel, Shekita, & Tata, VLDB
2011.

{DREMEL} {{{http://research.google.com/pubs/archive/36632.pdf}<Dremel:
Interactive Analysis of Web-Scale Datasets>}}, Melnik, Gubarev, Long,
Romer, Shivakumar, & Tolton, VLDB 2010.