Protocol Buffer Encoding Explained



This document describes the binary wire format for protocol buffer messages. You don't need to understand this to use protocol buffers in your applications, but it can be very useful to know how different protocol buffer formats affect the size of your encoded messages.


A Simple Message

Let's say you have the following very simple message definition:

message Test1 {
  required int32 a = 1;
}

In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you'd see three bytes:

08 96 01

Base 128 Varints

To understand your simple protocol buffer encoding, you first need to understand varints. Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.
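The rule above can be sketched as a small encoder. This is a minimal illustration, not the protobuf library's own code: the function name encode_varint is made up for this example, and it assumes non-negative integers only (negative int32 values are sign-extended by real implementations, which this sketch does not attempt).

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative sketch)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F          # take the lowest 7 bits
        value >>= 7
        if value:                    # more groups follow: set the msb
            out.append(low7 | 0x80)
        else:                        # last byte: msb stays clear
            out.append(low7)
            return bytes(out)


print(encode_varint(1).hex(" "))    # 01
print(encode_varint(150).hex(" "))  # 96 01
```

The second output matches the last two bytes of the Test1 example, which hold the value 150.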

So, for example, here is the number 1 – it's a single byte, so the msb is not set:
0000 0001

And here is 300 – this is a bit more complicated:

1010 1100 0000 0010

How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint):
1010 1100 0000 0010
→ 010 1100  000 0010

You reverse the two groups of 7 bits because, as you remember, varints store numbers with the least significant group first. Then you concatenate them to get your final value:

000 0010  010 1100  (binary 1 0010 1100 is decimal 300)
→  000 0010 ++ 010 1100
→  100101100
→  256 + 32 + 8 + 4 = 300
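The same steps (drop the msb, treat the groups as least significant first, combine) can be written as a short decoder. This is a sketch under the assumption of well-formed input; decode_varint is a hypothetical helper name, and it returns the position after the varint so a caller could keep reading.

```python
from typing import Tuple


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        # Drop the msb and place this 7-bit group above the earlier ones.
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):        # msb clear: this was the last byte
            return result, pos


value, next_pos = decode_varint(bytes([0b10101100, 0b00000010]))
print(value, next_pos)  # 300 2
```

Shifting each group left by a growing multiple of 7 has the same effect as reversing the groups and concatenating them.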

Message Structure

As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value.

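The available wire types are as follows:

Type  Meaning           Used For
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float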

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
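As a quick illustration of that formula, the key can be packed and unpacked with a shift and a mask. The helper names make_key and split_key are invented for this sketch, and it works on the key as an integer, that is, after varint decoding.

```python
from typing import Tuple


def make_key(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into a key value."""
    if not 0 <= wire_type <= 7:
        raise ValueError("wire type must fit in 3 bits")
    return (field_number << 3) | wire_type


def split_key(key: int) -> Tuple[int, int]:
    """Return (field_number, wire_type) for a decoded key value."""
    return key >> 3, key & 0x07      # the low 3 bits are the wire type


print(hex(make_key(1, 0)))  # 0x8
print(split_key(0x08))      # (1, 0)
```

Because the key is itself a varint, field numbers 1 through 15 give a single-byte key, while 16 and above need at least two bytes.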

Now let's look at our simple example again. You now know that the first number in the stream is always a varint key, and here it's 08, or (dropping the msb):

000 1000

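In other words, the field number is shifted left by 3 bits and then ORed with the wire type, so the last 3 bits of the key hold the wire type. In the example from the start of this document, the first byte, 08, is the key, and the next two bytes, 96 01, are the value 150, which is itself encoded as a varint. A plain byte could hold 150, but a varint byte carries only 7 bits of payload, so any value of 128 or more takes at least two bytes. The key is built like this:

0000 0001  (field number 1)
→ 0000 1000  (shifted left by 3 bits)
→ 0000 1000  (ORed with wire type 0, varint)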
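Putting the key and varint rules together, a reader that has no .proto file can still walk the three bytes of the Test1 example. The sketch below is an assumption-laden illustration rather than the real parser: iter_fields and decode_varint are hypothetical names, input is assumed well-formed, and the fixed sizes for wire types 1 (8 bytes) and 5 (4 bytes) come from the protobuf wire format. Group wire types are deliberately left unhandled.

```python
from typing import Iterator, Tuple


def decode_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not (byte & 0x80):
            return result, pos


def iter_fields(data: bytes) -> Iterator[Tuple[int, int, object]]:
    """Yield (field_number, wire_type, value) for each field in the buffer."""
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                       # varint
            value, pos = decode_varint(data, pos)
        elif wire_type == 1:                     # 64-bit: always 8 bytes
            value, pos = data[pos:pos + 8], pos + 8
        elif wire_type == 2:                     # length-delimited
            length, pos = decode_varint(data, pos)
            value, pos = data[pos:pos + length], pos + length
        elif wire_type == 5:                     # 32-bit: always 4 bytes
            value, pos = data[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        yield field_number, wire_type, value


print(list(iter_fields(bytes.fromhex("089601"))))  # [(1, 0, 150)]
```

The wire type alone tells the loop how far to advance, which is what lets an old program step over fields it does not recognize.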



A wire type of 2 (length-delimited) means that the value is a varint-encoded length followed by the specified number of bytes of data.

message Test2 {
  required string b = 2;
}

Setting the value of b to "testing" gives you:

12 07 74 65 73 74 69 6e 67

The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key here is 0x12 → tag = 2, type = 2. The length varint in the value is 7 and, lo and behold, we find seven bytes following it – our string.
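To check the bytes above, here is a minimal sketch that builds a length-delimited string field by hand. The names encode_varint and encode_string_field are made up for this example; real code would use the generated protobuf classes instead.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field as key + length + UTF-8 payload."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | 2    # wire type 2: length-delimited
    return encode_varint(key) + encode_varint(len(payload)) + payload


encoded = encode_string_field(2, "testing")
print(encoded.hex(" "))  # 12 07 74 65 73 74 69 6e 67

# Reading it back: the key and the length each fit in one byte here.
key, length = encoded[0], encoded[1]
print(key >> 3, key & 0x07, encoded[2:2 + length].decode("utf-8"))  # 2 2 testing
```

Note that the length counts bytes of the UTF-8 payload, not characters, so non-ASCII strings have a length larger than their character count.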


