Protocol Buffer Encoding Explained

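My work calls for an application transport format that is as compact as possible. I had assumed there would be plenty of write-ups about this online....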


Official documentation:

https://developers.google.com/protocol-buffers/docs/encoding


This document describes the binary wire format for protocol buffer messages. You don't need to understand this to use protocol buffers in your applications, but it can be very useful to know how different protocol buffer formats affect the size of your encoded messages.


A Simple Message

Let's say you have the following very simple message definition:

message Test1 {
  required int32 a = 1;
}
In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you'd see three bytes:
08 96 01


Base 128 Varints

To understand your simple protocol buffer encoding, you first need to understand varints (variable-length integers). Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set; this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first. (I won't go into more detail here; an earlier post on this blog covered the algorithm.)

So, for example, here is the number 1. It's a single byte, so the msb is not set:
0000 0001

And here is 300, which is a bit more complicated:

1010 1100 0000 0010
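
The encoding rule described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the protobuf library's implementation: the function name `encode_varint` is made up here, and negative numbers are deliberately rejected rather than handled.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint (illustrative sketch)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        group = n & 0x7F          # take the lowest 7 bits
        n >>= 7                   # move on to the next group
        if n:
            out.append(group | 0x80)  # more groups follow, so set the msb
        else:
            out.append(group)         # last byte, msb stays clear
            return bytes(out)


print(encode_varint(1).hex())    # 01
print(encode_varint(300).hex())  # ac02, i.e. 1010 1100 0000 0010
```

Because the least significant group is emitted first, the loop never needs to know in advance how many bytes the number will take.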


How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint):
1010 1100 0000 0010
→ 010 1100  000 0010


You reverse the two groups of 7 bits because, as you remember, varints store numbers with the least significant group first. The groups swap places as whole units; the bits inside each group keep their order. Then you concatenate them to get your final value:

000 0010  010 1100
→  000 0010 ++ 010 1100
→  100101100
→  256 + 32 + 8 + 4 = 300
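
The same decoding steps can be written as a short Python sketch. The name `decode_varint` and its `(value, next_pos)` return shape are assumptions made for illustration; a truncated input simply raises `IndexError` here.

```python
from typing import Tuple


def decode_varint(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # drop the msb, place the 7-bit group
        shift += 7                       # later groups are more significant
        if not byte & 0x80:              # msb clear means this was the last byte
            return value, pos


print(decode_varint(bytes([0xAC, 0x02])))  # (300, 2)
```

Shifting each group by a growing multiple of 7 does the "reverse and concatenate" step implicitly.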


Message Structure

As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key; the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values: the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value.

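The available wire types are as follows:

Type  Meaning           Used For
0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
1     64-bit            fixed64, sfixed64, double
2     Length-delimited  string, bytes, embedded messages, packed repeated fields
3     Start group       groups (deprecated)
4     End group         groups (deprecated)
5     32-bit            fixed32, sfixed32, float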

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.
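
As a small illustration of that formula, here is a Python sketch that packs and unpacks a key. The helper names `make_key` and `split_key` are hypothetical, chosen only for this example.

```python
from typing import Tuple


def make_key(field_number: int, wire_type: int) -> int:
    """Combine a field number and a wire type into the key value."""
    return (field_number << 3) | wire_type


def split_key(key: int) -> Tuple[int, int]:
    """Split a key value back into (field_number, wire_type)."""
    return key >> 3, key & 0x07  # the low three bits hold the wire type


print(hex(make_key(1, 0)))  # 0x8
print(split_key(0x08))      # (1, 0)
print(split_key(0x12))      # (2, 2)
```

Since the key is itself a varint, field numbers 1 through 15 fit in a single key byte; larger field numbers need at least two.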

Now let's look at our simple example again. You now know that the first number in the stream is always a varint key, and here it's 08, or (dropping the msb):

000 1000

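I won't translate the rest word for word. The key is the field number shifted left by 3 bits and then ORed with the wire type value, so the lowest 3 bits hold the wire type. In the example from the start of this article, the first byte 08 is the key and the remaining two bytes, 96 01, are the value 150. Note that 96 01 is itself varint-encoded: 150 would fit in a single plain byte, but each varint byte carries only 7 bits of payload, so it takes two bytes here (96 01 → 001 0110, 000 0001 → 000 0001 ++ 001 0110 → 10010110 → 128 + 16 + 4 + 2 = 150). The key is built like this:

0000 0001   field number 1
0000 1000   shifted left by 3 bits
0000 1000   ORed with the wire type (0, varint)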
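
Putting the key and the varint value together reproduces the three bytes of the Test1 example. The following Python sketch assumes only non-negative values, and its helper names (`encode_varint`, `encode_varint_field`) are invented for illustration rather than taken from any protobuf library.

```python
WIRE_VARINT = 0  # wire type used by int32 and the other varint-encoded types


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)  # 7 payload bits plus the continuation bit
        n >>= 7
    out.append(n)                      # final byte, msb clear
    return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field as key bytes followed by value bytes."""
    key = (field_number << 3) | WIRE_VARINT
    return encode_varint(key) + encode_varint(value)


# Test1 with a = 150, where a is field number 1
print(encode_varint_field(1, 150).hex())  # 089601
```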


The official documentation goes on to cover many more types; I only write up the ones I use often, or that are used the most.

Strings

A wire type of 2 (length-delimited) means that the value is a varint-encoded length followed by the specified number of bytes of data.
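
A length-delimited field can be sketched in Python as key, then length, then payload. The helper names below are hypothetical, and field number 2 with the string "testing" are just sample inputs.

```python
WIRE_LENGTH_DELIMITED = 2


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field: key, varint byte length, then UTF-8 bytes."""
    payload = text.encode("utf-8")
    key = (field_number << 3) | WIRE_LENGTH_DELIMITED
    # The length counts bytes, not characters
    return encode_varint(key) + encode_varint(len(payload)) + payload


print(encode_string_field(2, "testing").hex())  # 120774657374696e67
```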

message Test2 {
  required string b = 2;
}
Setting the value of b to "testing" gives you:


12 07 74 65 73 74 69 6e 67
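The last seven bytes (74 65 73 74 69 6e 67) are the UTF-8 encoding of "testing". The key here is 0x12, which gives field number = 2, wire type = 2. The length varint in the value is 7 and, lo and behold, we find seven bytes following it: our string.

In short, 12 is the key, 07 is the length of the value, and each remaining byte is one character.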
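
To show how a reader can walk through a message, and step over fields without knowing their declared type, here is a rough Python sketch. It is an assumption-laden illustration rather than a real parser: `iter_fields` is a made-up name, groups (wire types 3 and 4) are not supported, and the input is a hypothetical message formed by placing the two example fields from this article back to back.

```python
from typing import Iterator, Tuple, Union


def decode_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one varint at buf[pos]; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return value, pos


def iter_fields(buf: bytes) -> Iterator[Tuple[int, int, Union[int, bytes]]]:
    """Yield (field_number, wire_type, raw_value) for each field in buf."""
    pos = 0
    while pos < len(buf):
        key, pos = decode_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        value: Union[int, bytes]
        if wire_type == 0:    # varint
            value, pos = decode_varint(buf, pos)
        elif wire_type == 1:  # 64-bit: always 8 bytes
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited: varint length, then payload
            length, pos = decode_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # 32-bit: always 4 bytes
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield field_number, wire_type, value


data = bytes.fromhex("089601" "120774657374696e67")
for field_number, wire_type, value in iter_fields(data):
    print(field_number, wire_type, value)
# 1 0 150
# 2 2 b'testing'
```

The wire type alone tells the loop how far to advance, which is what lets an old program step over a field it has never heard of.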
