Protocol buffer encoding principles: understanding how pb is implemented

Latest recommended article published on 2022-08-27 23:48:39

dddengyunjie

Article link: https://blog.csdn.net/dyj5841619/article/details/94717419

This article is a translation of the documentation on pb encoding. The original is at https://developers.google.com/protocol-buffers/docs/encoding

Let's start with a simple example:

message Test1 {
  optional int32 a = 1;
}

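In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded stream, you would see three bytes: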

08 96 01

The message looks very short, so what do these bytes actually mean? Read on.

Base 128 Varints

To understand pb encoding, you first need to understand varints. A varint is a way of serializing an integer using one or more bytes, where smaller numbers take fewer bytes. Every byte in a varint except the last has its most significant bit (msb) set, which indicates that more bytes follow. The lower 7 bits of each byte store the two's complement representation of the number in groups of 7 bits, least significant group first (much like little-endian byte order).

For example, here is the number 1. It is a single byte, so the msb is not set (no further bytes follow):

0000 0001

The number 300 is a bit more complicated:

1010 1100 0000 0010

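How do you work out that this is 300? First drop the msb (the highest bit) from each byte, since it only tells us whether we have reached the last byte of the number (the first byte has its msb set, so more bytes follow).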

1010 1100 0000 0010
→ 010 1100  000 0010

Next, reverse the two groups of 7 bits, because varints store the least significant group first. Then concatenate them to get the final value.

000 0010  010 1100
→  000 0010 ++ 010 1100
→  100101100
→  256 + 32 + 8 + 4 = 300
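
The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the official implementation; the function names encode_varint and decode_varint are hypothetical, and the sketch assumes non-negative integers only.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (illustrative helper)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:
            out.append(group | 0x80)  # more groups follow, so set the msb
        else:
            out.append(group)         # last byte, msb stays clear
            return bytes(out)


def decode_varint(data, pos=0):
    """Decode one varint starting at pos and return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # least significant group comes first
        if not byte & 0x80:               # msb clear means this was the last byte
            return result, pos
        shift += 7


print(encode_varint(300).hex())             # ac02
print(decode_varint(bytes([0xAC, 0x02])))   # (300, 2)
```

The decoder returns the position of the next unread byte so that a caller can keep reading further values from the same buffer.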

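Message Structure

As you know, a pb message is a series of key-value pairs. The binary version of a message uses only the field number as the key; the name and declared type of each field can only be determined on the decoding end by referring to the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is decoded, the parser needs to be able to skip fields that it does not recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" of each pair in a wire-format message is actually two values: the field number from the .proto file, plus a wire type that provides just enough information to find the length of the following value. In most language implementations this key is referred to as a tag.

The available wire types are as follows: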

Type	Meaning	Used For
0	Varint	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	64-bit	fixed64, sfixed64, double
2	Length-delimited	string, bytes, embedded messages, packed repeated fields
3	Start group	groups (deprecated)
4	End group	groups (deprecated)
5	32-bit	fixed32, sfixed32, float

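Each key in the streamed message is a varint with the value (field_number << 3) | wire_type. In other words, the last three bits of the number store the wire type.

Now let's look at the simple example again. You now know that the first number in the stream is always a varint key. Here it is the first byte, 08, or (dropping the msb):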

000 1000

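You take the last three bits to get the wire type (0) and then right-shift by three to get the field number (1). So you now know that the field number is 1 and that the following value is a varint. Using the varint-decoding knowledge from the previous section, you can see that the next two bytes store the value 150.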

96 01 = 1001 0110  0000 0001
       → 000 0001  ++  001 0110 (drop the msb and reverse the groups of 7 bits)
       → 10010110
       → 128 + 16 + 4 + 2 = 150

That is how 08 96 01 decodes to the value 150 in field number 1.
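
As a rough sketch of the walkthrough above, the following Python splits the key into field number and wire type and then reads the varint value. The helper names decode_varint and split_key are hypothetical and not part of any protobuf library.

```python
def decode_varint(data, pos=0):
    """Decode one varint starting at pos and return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def split_key(key):
    """Split a decoded key into (field_number, wire_type)."""
    return key >> 3, key & 0x07   # the low three bits hold the wire type


data = bytes.fromhex("089601")
key, pos = decode_varint(data, 0)
field_number, wire_type = split_key(key)
value, pos = decode_varint(data, pos)
print(field_number, wire_type, value)   # 1 0 150
```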

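More Value Types

Signed Integers

As you saw in the previous section, all pb types with wire type 0 are encoded as varints. However, there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long, because the value is effectively treated as a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.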
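
To see where the ten bytes come from, here is a small sketch that reinterprets a negative value as a 64-bit unsigned integer before varint-encoding it. The names encode_varint and encode_int32 are hypothetical helpers written for this illustration.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_int32(n):
    """Encode an int32 value the way the text describes: negative numbers
    are sign-extended to 64 bits and treated as a large unsigned integer."""
    return encode_varint(n & 0xFFFFFFFFFFFFFFFF)


encoded = encode_int32(-1)
print(len(encoded), encoded.hex())   # 10 ffffffffffffffffff01
```

Sixty-four bits split into groups of seven need ten groups, which is why every negative int32 or int64 costs ten bytes on the wire.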

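ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) also have a small varint-encoded value. It does this by "zig-zagging" back and forth between the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as the following table shows: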

Signed Original	Encoded As
0	0
-1	1
1	2
-2	3
2147483647	4294967294
-2147483648	4294967295

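In other words, each value n is encoded as

(n << 1) ^ (n >> 31) for sint32

(n << 1) ^ (n >> 63) for the 64-bit version

Note that the second shift, the (n >> 31) part, is an arithmetic shift. In other words, the result of the shift is either a number whose bits are all 0 (if n is non-negative) or a number whose bits are all 1 (if n is negative).

When a sint32 or sint64 is parsed, its value is decoded back to the original, signed version.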
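
The formula can be tried out directly in Python, whose >> on integers is already an arithmetic shift. This is a sketch for the 32-bit case only; zigzag_encode32 and zigzag_decode32 are hypothetical names, and the mask is needed because Python integers are unbounded.

```python
def zigzag_encode32(n):
    """Map a signed 32-bit integer to an unsigned one using ZigZag."""
    # n >> 31 is 0 for non-negative n and -1 (all one bits) for negative n.
    return ((n << 1) ^ (n >> 31)) & 0xFFFFFFFF


def zigzag_decode32(z):
    """Invert zigzag_encode32."""
    # The lowest bit carries the sign; -(z & 1) is 0 or -1 (all one bits).
    return (z >> 1) ^ -(z & 1)


for n in (0, -1, 1, -2, 2147483647, -2147483648):
    print(n, zigzag_encode32(n))
```

The loop reproduces the rows of the table above, and decoding each encoded value gives back the original signed number.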

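Non-varint Numbers

Non-varint numeric types are simple: double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit lump of data; similarly, float and fixed32 have wire type 5, which tells it to expect 32 bits. In both cases the values are stored in little-endian byte order.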
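
A minimal sketch of the fixed-width encodings, using Python's standard struct module for the little-endian payload. The function names are hypothetical, and to keep the example short the sketch assumes a field number below 16 so that the key fits in a single byte.

```python
import struct

WIRE_64BIT = 1
WIRE_32BIT = 5


def one_byte_key(field_number, wire_type):
    """Build the key byte; this sketch only supports keys that fit in one byte."""
    key = (field_number << 3) | wire_type
    if key >= 0x80:
        raise ValueError("sketch limitation: field number must be below 16")
    return bytes([key])


def encode_double_field(field_number, value):
    # "<d" packs an IEEE 754 double in little-endian order (8 bytes).
    return one_byte_key(field_number, WIRE_64BIT) + struct.pack("<d", value)


def encode_fixed32_field(field_number, value):
    # "<I" packs an unsigned 32-bit integer in little-endian order (4 bytes).
    return one_byte_key(field_number, WIRE_32BIT) + struct.pack("<I", value)


print(encode_double_field(1, 1.0).hex())   # 09000000000000f03f
print(encode_fixed32_field(1, 1).hex())    # 0d01000000
```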

Strings

A wire type of 2 (length-delimited) means that the value is a varint-encoded length followed by the specified number of bytes of data.

message Test2 {
  optional string b = 2;
}

Setting the value of b to "testing" gives you:

12 07 74 65 73 74 69 6e 67

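The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key is 0x12, which gives field number = 2 and wire type = 2. The length varint in the value is 7, and the seven bytes that follow it are our string.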
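
The same bytes can be reproduced with a short sketch: key, then varint length, then the UTF-8 payload. The helper names encode_varint and encode_string_field are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_string_field(field_number, text):
    """Encode a string field: key, varint byte length, then the UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2          # wire type 2 = length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(2, "testing").hex(" "))   # 12 07 74 65 73 74 69 6e 67
```

Note that the length counts bytes, not characters, so a string with non-ASCII characters has a length larger than its character count.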

Embedded Messages

Here is a message definition with an embedded message of our example type, Test1:

message Test3 {
  optional Test1 c = 3;
}

And here is the encoded version, again with Test1's a field set to 150:

1a 03 08 96 01

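As you can see, the last three bytes are exactly the same as in our first example (08 96 01), and they are preceded by the length 3. The key 1a gives field number 3 and wire type 2, so embedded messages are treated in exactly the same way as strings.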
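
To illustrate the nesting, the sketch below first encodes the inner Test1 bytes and then wraps them as a length-delimited field. The helper names are hypothetical, and only non-negative varint values are handled.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_varint_field(field_number, value):
    """Encode a wire type 0 field: key followed by the varint value."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_bytes_field(field_number, payload):
    """Encode a wire type 2 field: key, varint length, then the raw payload."""
    return encode_varint((field_number << 3) | 2) + encode_varint(len(payload)) + payload


test1 = encode_varint_field(1, 150)    # Test1 { a = 150 }
test3 = encode_bytes_field(3, test1)   # Test3 { c = <Test1 bytes> }
print(test1.hex(" "))                  # 08 96 01
print(test3.hex(" "))                  # 1a 03 08 96 01
```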

The remaining sections are quoted from the original documentation:

Optional And Repeated Elements

If a proto2 message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same field number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements with respect to each other is preserved when parsing, though the ordering with respect to other fields is lost. In proto3, repeated fields of scalar numeric types use packed encoding by default, which you can read about below.

For any non-repeated fields in proto3, or optional fields in proto2, the encoded message may or may not have a key-value pair with that field number.

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method – that is, all singular scalar fields in the latter instance replace those in the former, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as if you had parsed the two messages separately and merged the resulting objects. That is, this:

MyMessage message;
message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2;
message.ParseFromString(str1);
message2.ParseFromString(str2);
message.MergeFrom(message2);

This property is occasionally useful, as it allows you to merge two messages even if you do not know their types.
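
The "last value wins" rule for scalar fields can be demonstrated with a toy parser. This is only a sketch: parse_varint_fields is a hypothetical helper that understands singular varint fields and nothing else, so it does not model the merging of embedded messages or the concatenation of repeated fields described above.

```python
def decode_varint(data, pos=0):
    """Decode one varint starting at pos and return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_varint_fields(data):
    """Parse a message made only of varint fields into {field_number: value}."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        if key & 0x07 != 0:
            raise ValueError("this sketch only understands wire type 0")
        value, pos = decode_varint(data, pos)
        fields[key >> 3] = value   # a later occurrence overwrites an earlier one
    return fields


str1 = bytes.fromhex("089601")   # field 1 = 150
str2 = bytes.fromhex("0803")     # field 1 = 3

combined = parse_varint_fields(str1 + str2)
merged = parse_varint_fields(str1)
merged.update(parse_varint_fields(str2))
print(combined, merged)          # {1: 3} {1: 3}
```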

Packed Repeated Fields

Version 2.1.0 introduced packed repeated fields, which in proto2 are declared like repeated fields but with the special [packed=true] option. In proto3, repeated fields of scalar numeric types are packed by default. These function like repeated fields, but are encoded differently. A packed repeated field containing zero elements does not appear in the encoded message. Otherwise, all of the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded the same way it would be normally, except without a key preceding it.

For example, imagine you have the message type:

message Test4 {
  repeated int32 d = 4 [packed=true];
}

Now let's say you construct a Test4, providing the values 3, 270, and 86942 for the repeated field d. Then, the encoded form would be:

22        // key (field number 4, wire type 2)
06        // payload size (6 bytes)
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

Note that although there's usually no reason to encode more than one key-value pair for a packed repeated field, parsers must be prepared to accept multiple key-value pairs. In this case, the payloads should be concatenated. Each pair must contain a whole number of elements.

Protocol buffer parsers must be able to parse repeated fields that were compiled as packed as if they were not packed, and vice versa. This permits adding [packed=true] to existing fields in a forward- and backward-compatible way.
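
The Test4 example can be reproduced with a short sketch of packed encoding. The names encode_varint and encode_packed_varints are hypothetical, and the sketch assumes non-negative element values.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


def encode_packed_varints(field_number, values):
    """Pack all elements into one length-delimited key-value pair."""
    if not values:
        return b""   # an empty packed field does not appear in the message
    payload = b"".join(encode_varint(v) for v in values)   # elements without keys
    key = (field_number << 3) | 2
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_packed_varints(4, [3, 270, 86942]).hex(" "))
# 22 06 03 8e 02 9e a7 05
```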

Field Order

Field numbers may be used in any order in a .proto file. The order chosen has no effect on how the messages are serialized.

When a message is serialized, there is no guaranteed order for how its known or unknown fields should be written. Serialization order is an implementation detail and the details of any particular implementation may change in the future. Therefore, protocol buffer parsers must be able to parse fields in any order.
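
One way to meet this requirement is to have the parser walk the stream one key-value pair at a time and let the wire type decide how many bytes to consume, so the order of fields never matters and unknown fields can be skipped. The sketch below is an assumption-laden illustration: iter_fields is a hypothetical helper, it returns raw bytes for fixed-width and length-delimited values, and it does not handle the deprecated group wire types.

```python
def decode_varint(data, pos=0):
    """Decode one varint starting at pos and return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(data):
    """Yield (field_number, wire_type, value) for each pair, in stream order."""
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                      # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 1:                    # 64-bit, raw little-endian bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                    # length-delimited
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                    # 32-bit, raw little-endian bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError("wire type not handled in this sketch")
        yield field_number, wire_type, value


# Field 2 ("testing") written before field 1 (150): the order does not matter.
stream = bytes.fromhex("120774657374696e67" + "089601")
print(list(iter_fields(stream)))   # [(2, 2, b'testing'), (1, 0, 150)]
```

A caller that only knows about field 1 can simply ignore the other tuples, which is the skipping behaviour that the wire type makes possible.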

Implications

Do not assume the byte output of a serialized message is stable. This is especially true for messages with transitive bytes fields representing other serialized protocol buffer messages.
By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
- Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.
The following checks may fail for a protocol buffer message instance foo.
- foo.SerializeAsString() == foo.SerializeAsString()
- Hash(foo.SerializeAsString()) == Hash(foo.SerializeAsString())
- CRC(foo.SerializeAsString()) == CRC(foo.SerializeAsString())
- FingerPrint(foo.SerializeAsString()) == FingerPrint(foo.SerializeAsString())
Here are a few example scenarios where logically equivalent protocol buffer messages foo and bar may serialize to different byte outputs.
- bar is serialized by an old server that treats some fields as unknown.
- bar is serialized by a server that is implemented in a different programming language and serializes fields in a different order.
- bar has a field that serializes in a non-deterministic manner.
- bar has a field that stores a serialized byte output of a protocol buffer message which is serialized differently.
- bar is serialized by a new server that serializes fields in a different order due to an implementation change.
- Both foo and bar are concatenations of individual messages but in a different order.