Introduction

1.1. Purpose

   The purpose of this specification is to define a lossless compressed
   data format that:

      * is independent of CPU type, operating system, file system, and
        character set; hence, it can be used for interchange.

      * can be produced or consumed, even for an arbitrarily long,
        sequentially presented input data stream, using only an a
        priori bounded amount of intermediate storage; hence, it can be
        used in data communications or similar structures, such as
        Unix filters.

      * compresses data with a compression ratio comparable to the best
        currently available general-purpose compression methods, in
        particular, considerably better than the gzip program.

      * decompresses much faster than current LZMA implementations.

   The data format defined by this specification does not attempt to:

      * allow random access to compressed data.

      * compress specialized data (e.g., raster graphics) as densely as
        the best currently available specialized algorithms.

   This document is the authoritative specification of the brotli
   compressed data format. It defines the set of valid brotli
   compressed data streams and a decoder algorithm that produces the
   uncompressed data stream from a valid brotli compressed data stream.

1.2. Intended Audience

   This specification is intended for use by software implementers to
   compress data into and/or decompress data from the brotli format.

   The text of the specification assumes a basic background in
   programming at the level of bits and other primitive data
   representations. Familiarity with the technique of Huffman coding is
   helpful but not required.

Alakuijala & Szabadka         Informational                     [Page 3]
RFC 7932                         Brotli                        July 2016

   This specification uses (heavily) the notations and terminology
   introduced in the DEFLATE format specification [RFC1951]. For the
   sake of completeness, we always include the whole text of the
   relevant parts of RFC 1951; therefore, familiarity with the DEFLATE
   format is helpful but not required.

   The compressed data format defined in this specification is an
   integral part of the WOFF File Format 2.0 [WOFF2]; therefore, this
   specification is also intended for implementers of WOFF 2.0
   compressors and decompressors.

1.3. Scope

   This document specifies a method for representing a sequence of
   bytes as a (usually shorter) sequence of bits and a method for
   packing the latter bit sequence into bytes.

1.4. Compliance

   Unless otherwise indicated below, a compliant decompressor must be
   able to accept and decompress any data set that conforms to all the
   specifications presented here. A compliant compressor must produce
   data sets that conform to all the specifications presented here.

1.5. Definitions of Terms and Conventions Used

   Byte: 8 bits stored or transmitted as a unit (same as an octet). For
   this specification, a byte is exactly 8 bits, even on machines that
   store a character on a number of bits different from eight. See
   below for the numbering of bits within a byte.

   String: a sequence of arbitrary bytes.

   Bytes stored within a computer do not have a "bit order", since they
   are always treated as a unit. However, a byte considered as an
   integer between 0 and 255 does have a most and least significant bit
   (lsb), and since we write numbers with the most significant digit on
   the left, we also write bytes with the most significant bit (msb) on
   the left. In the diagrams below, we number the bits of a byte so
   that bit 0 is the least significant bit, i.e., the bits are numbered:

      +--------+
      |76543210|
      +--------+
   Within a computer, a number may occupy multiple bytes. All
   multi-byte numbers in the format described here are stored with the
   least significant byte first (at the lower memory address). For
   example, the decimal number 520 is stored as:

          0        1
      +--------+--------+
      |00001000|00000010|
      +--------+--------+
       ^        ^
       |        |
       |        + more significant byte = 2 * 256
       + less significant byte = 8

1.5.1. Packing into Bytes

   This document does not address the issue of the order in which bits
   of a byte are transmitted on a bit-sequential medium, since the
   final data format described here is byte rather than bit oriented.
   However, we describe the compressed block format below as a sequence
   of data elements of various bit lengths, not a sequence of bytes.
   Therefore, we must specify how to pack these data elements into
   bytes to form the final compressed byte sequence:

      * Data elements are packed into bytes in order of increasing bit
        number within the byte, i.e., starting with the least
        significant bit of the byte.

      * Data elements other than prefix codes are packed starting with
        the least significant bit of the data element. These are
        referred to here as "integer values" and are considered
        unsigned.

      * Prefix codes are packed starting with the most significant bit
        of the code.

   In other words, if one were to print out the compressed data as a
   sequence of bytes, starting with the first byte at the *right*
   margin and proceeding to the *left*, with the most significant bit
   of each byte on the left as usual, one would be able to parse the
   result from right to left, with fixed-width elements in the correct
   msb-to-lsb order and prefix codes in bit-reversed order (i.e., with
   the first bit of the code in the relative lsb position).

   As an example, consider packing the following data elements into a
   sequence of 3 bytes: 3-bit integer value 6, 4-bit integer value 2,
   prefix code 110, prefix code 10, 12-bit integer value 3628.
        byte 2   byte 1   byte 0
      +--------+--------+--------+
      |11100010|11000101|10010110|
      +--------+--------+--------+
       ^            ^ ^   ^   ^
       |            | |   |   |
       |            | |   |   +------ integer value 6
       |            | |   +---------- integer value 2
       |            | +-------------- prefix code 110
       |            +---------------- prefix code 10
       +----------------------------- integer value 3628

2. Compressed Representation Overview

   A compressed data set consists of a header and a series of
   meta-blocks. Each meta-block decompresses to a sequence of 0 to
   16,777,216 (16 MiB) uncompressed bytes. The final uncompressed data
   is the concatenation of the uncompressed sequences from each
   meta-block.

   The header contains the size of the sliding window that was used
   during compression. The decompressor must retain at least that
   amount of uncompressed data prior to the current position in the
   stream, in order to be able to decompress what follows. The sliding
   window size is a power of two, minus 16, where the power is in the
   range of 10 to 24. The possible sliding window sizes range from
   1 KiB - 16 B to 16 MiB - 16 B.

   Each meta-block is compressed using a combination of the LZ77
   algorithm (Lempel-Ziv 1977, [LZ77]) and Huffman coding. The result
   of Huffman coding is referred to here as a "prefix code". The prefix
   codes for each meta-block are independent of those for previous or
   subsequent meta-blocks; the LZ77 algorithm may use a reference to a
   duplicated string occurring in a previous meta-block, up to the
   sliding window size of uncompressed bytes before. In addition, in
   the brotli format, a string reference may instead refer to a static
   dictionary entry.

   Each meta-block consists of two parts: a meta-block header that
   describes the representation of the compressed data part and a
   compressed data part. The compressed data consists of a series of
   commands.
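
   As a non-normative illustration of the sliding window sizes stated
   above (a power of two, minus 16, with the power in the range of 10
   to 24), the following Python sketch tabulates the possible values.
   The names window_size and wbits are chosen here for illustration
   only and are not defined by the text above.

```python
def window_size(wbits: int) -> int:
    """Sliding window size in bytes for a given power of two.

    "wbits" is a hypothetical parameter name for the power.
    """
    if not 10 <= wbits <= 24:
        raise ValueError("power must be in the range 10..24")
    return (1 << wbits) - 16


# All permitted sizes, keyed by the power of two.
sizes = {wbits: window_size(wbits) for wbits in range(10, 25)}

smallest = sizes[10]  # 1 KiB - 16 B
largest = sizes[24]   # 16 MiB - 16 B
```

   A decompressor that retains at least "largest" bytes of history can
   therefore handle any window size the header may announce.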
   Each command consists of two parts: a sequence of literal bytes (of
   strings that have not been detected as duplicated within the sliding
   window) and a pointer to a duplicated string, which is represented
   as a pair <length, backward distance>. There can be zero literal
   bytes in the command. The minimum length of the string to be
   duplicated is two, but the last command in the meta-block is
   permitted to have only literals and no pointer to a string to
   duplicate.

   Each command in the compressed data is represented using three
   categories of prefix codes:

      1) One set of prefix codes is for the literal sequence lengths
         (also referred to as literal insertion lengths) and backward
         copy lengths. That is, a single code word represents two
         lengths: one of the literal sequence and one of the backward
         copy.

      2) One set of prefix codes is for literals.

      3) One set of prefix codes is for distances.

   The prefix code descriptions for each meta-block appear in a compact
   form just before the compressed data in the meta-block header. The
   insert-and-copy length and distance prefix codes may be followed by
   extra bits that are added to the base values determined by the
   codes. The number of extra bits is determined by the code.

   One meta-block command then appears as a sequence of prefix codes:

      Insert-and-copy length, literal, literal, ..., literal, distance

   where the insert-and-copy length defines an insertion length and a
   copy length. The insertion length determines the number of literals
   that immediately follow. The distance defines how far back to go for
   the copy and the copy length determines the number of bytes to copy.
   The resulting uncompressed data is the sequence of bytes:

      literal, literal, ..., literal, copy, copy, ..., copy

   where the number of literal bytes and copy bytes are determined by
   the insert-and-copy length code. (The number of bytes copied for a
   static dictionary entry can vary from the copy length.)

   The last command in the meta-block may end with the last literal if
   the total uncompressed length of the meta-block has been satisfied.
   In that case, there is no distance in the last command, and the copy
   length is ignored.
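
   The effect of a series of commands on the uncompressed output can be
   sketched in Python as follows. This is a non-normative illustration
   under simplifying assumptions: commands are given as already-decoded
   tuples rather than prefix codes, static dictionary references and
   the sliding window limit are not modeled, and the names Command and
   apply_commands are hypothetical. Bytes are copied one at a time so
   that a copy may overlap the bytes it is producing, as is usual for
   LZ77-style formats.

```python
from typing import Iterable, Optional, Tuple

# (literals, copy_length, distance). A distance of None marks a final
# command that ends with its last literal (hypothetical representation).
Command = Tuple[bytes, int, Optional[int]]


def apply_commands(commands: Iterable[Command], history: bytes = b"") -> bytes:
    """Expand decoded commands into uncompressed bytes.

    "history" stands in for uncompressed data that precedes the
    commands, which backward distances may reach into.
    """
    out = bytearray(history)
    start = len(out)
    for literals, copy_length, distance in commands:
        out += literals                      # insert the literal bytes
        if distance is None:
            continue                         # no copy; copy length is ignored
        if not 1 <= distance <= len(out):
            # Static dictionary references are not modeled in this sketch.
            raise ValueError("distance reaches before the available data")
        for _ in range(copy_length):
            out.append(out[-distance])       # byte-wise, so overlap works
    return bytes(out[start:])


example = apply_commands([
    (b"abc", 3, 3),      # "abc", then copy 3 bytes from 3 back
    (b"!", 5, 1),        # "!", then an overlapping copy repeating it
    (b"end", 2, None),   # last command: literals only
])
```

   Here "example" is b"abcabc!!!!!!end": the second command shows a
   distance smaller than the copy length, which repeats the most
   recent byte.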
   There can be more than one prefix code for each category, where the
   prefix code to use for the next element of that category is
   determined by the context of the compressed stream that precedes
   that element. Part of that context is three current block types,
   one for
   each category. A block type is in the range of 0..255. For each
   category there is a count of how many elements of that category
   remain to be decoded using the current block type. Once that count
   is expended, a new block type and block count is read from the
   stream immediately preceding the next element of that category,
   which will use the new block type.

   The insert-and-copy block type directly determines which prefix code
   to use for the next insert-and-copy length. For the literal and
   distance elements, the respective block type is used in combination
   with other context information to determine which prefix code to use
   for the next element.

   Consider the following example:

      (IaC0, L0, L1, L2, D0)(IaC1, D1)(IaC2, L3, L4, D2)(IaC3, L5, D3)

   The meta-block here has four commands, contained in parentheses for
   clarity, where each of the three categories of symbols within these
   commands can be interpreted using different block types. Here we
   separate out each category as its own sequence to show an example of
   block types assigned to those elements. Each square-bracketed group
   is a block that uses the same block type:

      [IaC0, IaC1][IaC2, IaC3]  <-- insert-and-copy: block types 0 and 1

      [L0, L1][L2, L3, L4][L5]  <-- literals: block types 0, 1, and 0

      [D0][D1, D2, D3]          <-- distances: block types 0 and 1

   The subsequent blocks within each block category must have different
   block types, but we see that block types can be reused later in the
   meta-block. The block types are numbered from 0 to the maximum block
   type number of 255, and the first block of each block category is
   type 0. The block structure of a meta-block is represented by the
   sequence of block-switch commands for each block category, where a
   block-switch command is a pair <block type, block count>.
   The block-switch commands are represented in the compressed data
   before the start of each new block using a prefix code for block
   types and a separate prefix code for block counts for each block
   category. For the above example, the physical layout of the
   meta-block is then:

      IaC0 L0 L1 LBlockSwitch(1, 3) L2 D0 IaC1 DBlockSwitch(1, 3) D1
      IaCBlockSwitch(1, 2) IaC2 L3 L4 D2 IaC3 LBlockSwitch(0, 1) L5 D3

   where xBlockSwitch(t, n) switches to block type t for a count of n
   elements. In this example, note that DBlockSwitch(1, 3) immediately
   precedes the next required distance, D1. It does not follow the
   last
   distance of the previous block, D0. Whenever an element of a
   category is needed, and the block count for that category has
   reached zero, then a new block type and count are read from the
   stream just before reading that next element.

   The block-switch commands for the first blocks of each category are
   not part of the meta-block compressed data. Instead, the first block
   type is defined to be 0, and the first block count for each category
   is encoded in the meta-block header. The prefix codes for the block
   types and counts, a total of six prefix codes over the three
   categories, are defined in a compact form in the meta-block header.

   Each category of value (insert-and-copy lengths, literals, and
   distances) can be encoded with any prefix code from a collection of
   prefix codes belonging to the same category appearing in the
   meta-block header. The particular prefix code used can depend on two
   factors: the block type of the block the value appears in and the
   context of the value. In the case of the literals, the context is
   the previous two bytes in the uncompressed data; and in the case of
   distances, the context is the copy length from the same command. For
   insert-and-copy lengths, no context is used and the prefix code
   depends only on the block type. In the case of literals and
   distances, the context is mapped to a context ID in the range 0..63
   for literals and 0..3 for distances. The matrix of the prefix code
   indexes for each block type and context ID, called the context map,
   is encoded in a compact form in the meta-block header.

   For example, the prefix code to use to decode L2 depends on the
   block type (1), and the literal context ID determined by the two
   uncompressed bytes that were decoded from L0 and L1. Similarly, the
   prefix code to use to decode D0 depends on the block type (0) and
   the distance context ID determined by the copy length decoded from
   IaC0. The prefix code to use to decode IaC3 depends only on the
   block type (1).
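
   The rule that a block-switch command appears only when the next
   element of its category is needed can be sketched in Python as
   follows. This is a non-normative illustration: it works on symbol
   names rather than bits, and the function name physical_layout and
   the list/dict representation of commands and blocks are hypothetical
   choices made for this sketch. The first block of each category is
   taken as given, since its type is 0 and its count comes from the
   meta-block header.

```python
from typing import Dict, List, Tuple


def physical_layout(
    commands: List[Tuple[str, List[str], str]],
    blocks: Dict[str, List[Tuple[int, int]]],
) -> str:
    """Interleave elements with block-switch commands.

    commands: (insert-and-copy symbol, literal symbols, distance symbol).
    blocks:   per category, a list of <block type, block count> pairs;
              the first pair is the implicit initial block.
    """
    pending = {cat: list(seq[1:]) for cat, seq in blocks.items()}
    remaining = {cat: seq[0][1] for cat, seq in blocks.items()}
    out: List[str] = []

    def emit(cat: str, symbol: str) -> None:
        # A switch is placed just before the element that needs it.
        if remaining[cat] == 0:
            block_type, count = pending[cat].pop(0)
            out.append(f"{cat}BlockSwitch({block_type}, {count})")
            remaining[cat] = count
        remaining[cat] -= 1
        out.append(symbol)

    for iac, literals, distance in commands:
        emit("IaC", iac)
        for literal in literals:
            emit("L", literal)
        emit("D", distance)
    return " ".join(out)


layout = physical_layout(
    commands=[
        ("IaC0", ["L0", "L1", "L2"], "D0"),
        ("IaC1", [], "D1"),
        ("IaC2", ["L3", "L4"], "D2"),
        ("IaC3", ["L5"], "D3"),
    ],
    blocks={
        "IaC": [(0, 2), (1, 2)],
        "L": [(0, 2), (1, 3), (0, 1)],
        "D": [(0, 1), (1, 3)],
    },
)
```

   For the four-command example, "layout" reproduces the physical
   layout given in the text, with DBlockSwitch(1, 3) placed directly
   before D1 rather than after D0.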
   In addition to the parts listed above (prefix code for
   insert-and-copy lengths, literals, distances, block types, block
   counts, and the context map), the meta-block header contains the
   number of uncompressed bytes coded in the meta-block and two
   additional parameters used in the representation of match distances:
   the number of postfix bits and the number of direct distance codes.

   A compressed meta-block may be marked in the header as the last
   meta-block, which terminates the compressed stream.

   A meta-block may, instead, simply store the uncompressed data
   directly as bytes on byte boundaries with no coding or matching
   strings. In this case, the meta-block header information only
   contains the number of uncompressed bytes and the indication that
   the meta-block is uncompressed. An uncompressed meta-block cannot be
   the last meta-block.

   A meta-block may also be empty, which generates no uncompressed data
   at all. An empty meta-block may contain metadata information as
   bytes starting on byte boundaries, which are not part of either the
   sliding window or the uncompressed data. Thus, these metadata bytes
   cannot be used to create matching strings in subsequent meta-blocks
   and are not used as context bytes for literals.

3. Compressed Representation of Prefix Codes

3.1. Introduction to Prefix Coding

   Prefix coding represents symbols from an a priori known alphabet by
   bit sequences (codes), one code for each symbol, in a manner such
   that different symbols may be represented by bit sequences of
   different lengths, but a parser can always parse an encoded string
   unambiguously symbol-by-symbol.

   We define a prefix code in terms of a binary tree in which the two
   edges descending from each non-leaf node are labeled 0 and 1, and in
   which the leaf nodes correspond one-for-one with (are labeled with)
   the symbols of the alphabet. The code for a symbol is the sequence
   of 0's and 1's on the edges leading from the root to the leaf
   labeled with that symbol. For example:

                    /\              Symbol    Code
                   0  1