ProtoBuffer文档

Developer Guide

Welcome to the developer documentation for protocol buffers – a language-neutral, platform-neutral, extensible way of serializing structured data for use in communications protocols, data storage, and more.

This documentation is aimed at Java, C++, or Python developers who want to use protocol buffers in their applications. This overview introduces protocol buffers and tells you what you need to do to get started – you can then go on to follow the tutorials or delve deeper into protocol buffer encoding. API reference documentation is also provided for all three languages, as well as language and style guides for writing .proto files.

What are protocol buffers?

Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages. You can even update your data structure without breaking deployed programs that are compiled against the "old" format.

How do they work?

You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:

message Person {

  required string name = 1;

  required int32 id = 2;

  optional string email = 3;

 

  enum PhoneType {

    MOBILE = 0;

    HOME = 1;

    WORK = 2;

  }

 

  message PhoneNumber {

    required string number = 1;

    optional PhoneType type = 2 [default = HOME];

  }

 

  repeated PhoneNumber phone = 4;

}

As you can see, the message format is simple – each message type has one or more uniquely numbered fields, and each field has a name and a value type, where value types can be numbers (integer or floating-point), booleans, strings, raw bytes, or even (as in the example above) other protocol buffer message types, allowing you to structure your data hierarchically. You can specify optional fields, required fields, and repeated fields. You can find more information about writing .proto files in the Protocol Buffer Language Guide.

Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes – so, for instance, if your chosen language is C++, running the compiler on the above example will generate a class called Person. You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:

Person person;
person.set_name(
"John Doe");
person.set_id(
1234);
person.set_email(
"jdoe@example.com");
fstream output(
"myfile", ios::out | ios::binary);
person.
SerializeToOstream(&output);
 

Then, later on, you could read your message back in:

fstream input("myfile", ios::in | ios::binary);
Person person;
person.
ParseFromIstream(&input);
cout <<
"Name: " << person.name() << endl;
cout <<
"E-mail: " << person.email() << endl;
 

You can add new fields to your message formats without breaking backwards-compatibility; old binaries simply ignore the new field when parsing. So if you have a communications protocol that uses protocol buffers as its data format, you can extend your protocol without having to worry about breaking existing code.

You'll find a complete reference for using generated protocol buffer code in the API Reference section, and you can find out more about how protocol buffer messages are encoded in Protocol Buffer Encoding.

Why not just use XML?

Protocol buffers have many advantages over XML for serializing structured data. Protocol buffers:

  • are simpler
  • are 3 to 10 times smaller
  • are 20 to 100 times faster
  • are less ambiguous
  • generate data access classes that are easier to use programmatically

For example, let's say you want to model a person with a name and an email. In XML, you need to do:

  <person>

    <name>John Doe</name>

    <email>jdoe@example.com</email>

  </person>

while the corresponding protocol buffer message (in protocol buffer text format) is:

# Textual representation of a protocol buffer.

# This is *not* the binary format used on the wire.

person {

  name: "John Doe"

  email: "jdoe@example.com"

}

When this message is encoded to the protocol buffer binary format (the text format above is just a convenient human-readable representation for debugging and editing), it would probably be 28 bytes long and take around 100-200 nanoseconds to parse. The XML version is at least 69 bytes if you remove whitespace, and would take around 5,000-10,000 nanoseconds to parse.

Also, manipulating a protocol buffer is much easier:

  cout << "Name: " << person.name() << endl;
  cout <<
"E-mail: " << person.email() << endl;

Whereas with XML you would have to do something like:

  cout << "Name: "
       << person.getElementsByTagName("name")->item(0)->innerText()
       << endl;
  cout <<
"E-mail: "
       << person.getElementsByTagName("email")->item(0)->innerText()
       << endl;

 

However, protocol buffers are not always a better solution than XML – for instance, protocol buffers would not be a good way to model a text-based document with markup (e.g. HTML), since you cannot easily interleave structure with text. In addition, XML is human-readable and human-editable; protocol buffers, at least in their native format, are not. XML is also – to some extent – self-describing. A protocol buffer is only meaningful if you have the message definition (the .proto file).

Sounds like the solution for me! How do I get started?

Download the package – this contains the complete source code for the Java, Python, and C++ protocol buffer compilers, as well as the classes you need for I/O and testing. To build and install your compiler, follow the instructions in the README.

Once you're all set, try following the tutorial for your chosen language – this will step you through creating a simple application that uses protocol buffers.

Introducing proto3

Our most recent version 3 release introduces a new language version - Protocol Buffers language version 3 (aka proto3), as well as some new features in our existing language version (aka proto2). Proto3 simplifies the protocol buffer language, both for ease of use and to make it available in a wider range of programming languages: our current release lets you generate protocol buffer code in Java, C++, Python, Java Lite, Ruby, JavaScript, Objective-C, and C#. In addition you can generate proto3 code for Go using the latest Go protoc plugin, available from the golang/protobuf Github repository. More languages are in the pipeline.

Note that the two language version APIs are not completely compatible. To avoid inconvenience to existing users, we will continue to support the previous language version in new protocol buffers releases.

You can see the major differences from the current default version in the release notes and learn about proto3 syntax in the Proto3 Language Guide. Full documentation for proto3 is coming soon!

(If the names proto2 and proto3 seem a little confusing, it's because when we originally open-sourced protocol buffers it was actually Google's second version of the language – also known as proto2. This is also why our open source version number started from v2.0.0).

A bit of history

Protocol buffers were initially developed at Google to deal with an index server request/response protocol. Prior to protocol buffers, there was a format for requests and responses that used hand marshalling/unmarshalling of requests and responses, and that supported a number of versions of the protocol. This resulted in some very ugly code, like:

 if (version == 3) {
   ...
 }
else if (version > 4) {
   
if (version == 5) {
     ...
   }
   ...
 }

Explicitly formatted protocols also complicated the rollout of new protocol versions, because developers had to make sure that all servers between the originator of the request and the actual server handling the request understood the new protocol before they could flip a switch to start using the new protocol.

Protocol buffers were designed to solve many of these problems:

  • New fields could be easily introduced, and intermediate servers that didn't need to inspect the data could simply parse it and pass through the data without needing to know about all the fields.
  • Formats were more self-describing, and could be dealt with from a variety of languages (C++, Java, etc.)

However, users still needed to hand-write their own parsing code.

As the system evolved, it acquired a number of other features and uses:

  • Automatically-generated serialization and deserialization code avoided the need for hand parsing.
  • In addition to being used for short-lived RPC (Remote Procedure Call) requests, people started to use protocol buffers as a handy self-describing format for storing data persistently (for example, in Bigtable).
  • Server RPC interfaces started to be declared as part of protocol files, with the protocol compiler generating stub classes that users could override with actual implementations of the server's interface.

Protocol buffers are now Google's lingua franca for data – at time of writing, there are 306,747 different message types defined in the Google code tree across 348,952 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.

//https://developers.google.com/protocol-buffers/docs/proto3

Language Guide (proto3)

This guide describes how to use the protocol buffer language to structure your protocol buffer data, including .protofile syntax and how to generate data access classes from your .proto files. It covers the proto3 version of the protocol buffers language: for information on the proto2 syntax, see the Proto2 Language Guide.

This is a reference guide – for a step by step example that uses many of the features described in this document, see the tutorial for your chosen language (currently proto2 only; more proto3 documentation is coming soon).

Defining A Message Type

First let's look at a very simple example. Let's say you want to define a search request message format, where each search request has a query string, the particular page of results you are interested in, and a number of results per page. Here's the .proto file you use to define the message type.

syntax = "proto3";



message SearchRequest {

  string query = 1;

  int32 page_number = 2;

  int32 result_per_page = 3;

}
  • The first line of the file specifies that you're using proto3 syntax: if you don't do this the protocol buffer compiler will assume you are using proto2. This must be the first non-empty, non-comment line of the file.
  • The SearchRequest message definition specifies three fields (name/value pairs), one for each piece of data that you want to include in this type of message. Each field has a name and a type.

Specifying Field Types

In the above example, all the fields are scalar types: two integers (page_number and result_per_page) and a string (query). However, you can also specify composite types for your fields, including enumerations and other message types.

Assigning Field Numbers

As you can see, each field in the message definition has a unique number. These field numbers are used to identify your fields in the message binary format, and should not be changed once your message type is in use. Note that field numbers in the range 1 through 15 take one byte to encode, including the field number and the field's type (you can find out more about this in Protocol Buffer Encoding). Field numbers in the range 16 through 2047 take two bytes. So you should reserve the numbers 1 through 15 for very frequently occurring message elements. Remember to leave some room for frequently occurring elements that might be added in the future.

The smallest field number you can specify is 1, and the largest is 229 - 1, or 536,870,911. You also cannot use the numbers 19000 through 19999 (FieldDescriptor::kFirstReservedNumber through FieldDescriptor::kLastReservedNumber), as they are reserved for the Protocol Buffers implementation - the protocol buffer compiler will complain if you use one of these reserved numbers in your .proto. Similarly, you cannot use any previously reserved field numbers.

Specifying Field Rules

Message fields can be one of the following:

  • singular: a well-formed message can have zero or one of this field (but not more than one). And this is the default field rule for proto3 syntax.
  • repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

In proto3, repeated fields of scalar numeric types use packed encoding by default.

You can find out more about packed encoding in Protocol Buffer Encoding.

Adding More Message Types

Multiple message types can be defined in a single .proto file. This is useful if you are defining multiple related messages – so, for example, if you wanted to define the reply message format that corresponds to your SearchResponse message type, you could add it to the same .proto:

message SearchRequest {

  string query = 1;

  int32 page_number = 2;

  int32 result_per_page = 3;

}



message SearchResponse {

 ...

}

Adding Comments

To add comments to your .proto files, use C/C++-style // and /* ... */ syntax.

/* SearchRequest represents a search query, with pagination options to

 * indicate which results to include in the response. */



message SearchRequest {

  string query = 1;

  int32 page_number = 2;  // Which page number do we want?

  int32 result_per_page = 3;  // Number of results to return per page.

}

Reserved Fields

If you update a message type by entirely removing a field, or commenting it out, future users can reuse the field number when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on. One way to make sure this doesn't happen is to specify that the field numbers (and/or names, which can also cause issues for JSON serialization) of your deleted fields are reserved. The protocol buffer compiler will complain if any future users try to use these field identifiers.

message Foo {

  reserved 2, 15, 9 to 11;

  reserved "foo", "bar";

}

Note that you can't mix field names and field numbers in the same reserved statement.

What's Generated From Your .proto?

When you run the protocol buffer compiler on a .proto, the compiler generates the code in your chosen language you'll need to work with the message types you've described in the file, including getting and setting field values, serializing your messages to an output stream, and parsing your messages from an input stream.

  • For C++, the compiler generates a .h and .cc file from each .proto, with a class for each message type described in your file.
  • For Java, the compiler generates a .java file with a class for each message type, as well as a special Builderclasses for creating message class instances.
  • Python is a little different – the Python compiler generates a module with a static descriptor of each message type in your .proto, which is then used with a metaclass to create the necessary Python data access class at runtime.
  • For Go, the compiler generates a .pb.go file with a type for each message type in your file.
  • For Ruby, the compiler generates a .rb file with a Ruby module containing your message types.
  • For Objective-C, the compiler generates a pbobjc.h and pbobjc.m file from each .proto, with a class for each message type described in your file.
  • For C#, the compiler generates a .cs file from each .proto, with a class for each message type described in your file.
  • For Dart, the compiler generates a .pb.dart file with a class for each message type in your file.

You can find out more about using the APIs for each language by following the tutorial for your chosen language (proto3 versions coming soon). For even more API details, see the relevant API reference (proto3 versions also coming soon).

Scalar Value Types

A scalar message field can have one of the following types – the table shows the type specified in the .proto file, and the corresponding type in the automatically generated class:

.proto Type

Notes

C++ Type

Java Type

Python Type[2]

Go Type

Ruby Type

C# Type

PHP Type

Dart Type

double

 

double

double

float

float64

Float

double

float

double

float

 

float

float

float

float32

Float

float

float

double

int32

Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead.

int32

int

int

int32

Fixnum or Bignum (as required)

int

integer

int

int64

Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead.

int64

long

int/long[3]

int64

Bignum

long

integer/string[5]

Int64

uint32

Uses variable-length encoding.

uint32

int[1]

int/long[3]

uint32

Fixnum or Bignum (as required)

uint

integer

int

uint64

Uses variable-length encoding.

uint64

long[1]

int/long[3]

uint64

Bignum

ulong

integer/string[5]

Int64

sint32

Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s.

int32

int

int

int32

Fixnum or Bignum (as required)

int

integer

int

sint64

Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s.

int64

long

int/long[3]

int64

Bignum

long

integer/string[5]

Int64

fixed32

Always four bytes. More efficient than uint32 if values are often greater than 228.

uint32

int[1]

int/long[3]

uint32

Fixnum or Bignum (as required)

uint

integer

int

fixed64

Always eight bytes. More efficient than uint64 if values are often greater than 256.

uint64

long[1]

int/long[3]

uint64

Bignum

ulong

integer/string[5]

Int64

sfixed32

Always four bytes.

int32

int

int

int32

Fixnum or Bignum (as required)

int

integer

int

sfixed64

Always eight bytes.

int64

long

int/long[3]

int64

Bignum

long

integer/string[5]

Int64

bool

 

bool

boolean

bool

bool

TrueClass/FalseClass

bool

boolean

bool

string

A string must always contain UTF-8 encoded or 7-bit ASCII text, and cannot be longer than 232.

string

String

str/unicode[4]

string

String (UTF-8)

string

string

String

bytes

May contain any arbitrary sequence of bytes no longer than 232.

string

ByteString

str

[]byte

String (ASCII-8BIT)

ByteString

string

List

You can find out more about how these types are encoded when you serialize your message in Protocol Buffer Encoding.

[1] In Java, unsigned 32-bit and 64-bit integers are represented using their signed counterparts, with the top bit simply being stored in the sign bit.

[2] In all cases, setting values to a field will perform type checking to make sure it is valid.

[3] 64-bit or unsigned 32-bit integers are always represented as long when decoded, but can be an int if an int is given when setting the field. In all cases, the value must fit in the type represented when set. See [2].

[4] Python strings are represented as unicode on decode but can be str if an ASCII string is given (this is subject to change).

[5] Integer is used on 64-bit machines and string is used on 32-bit machines.

Default Values

When a message is parsed, if the encoded message does not contain a particular singular element, the corresponding field in the parsed object is set to the default value for that field. These defaults are type-specific:

  • For strings, the default value is the empty string.
  • For bytes, the default value is empty bytes.
  • For bools, the default value is false.
  • For numeric types, the default value is zero.
  • For enums, the default value is the first defined enum value, which must be 0.
  • For message fields, the field is not set. Its exact value is language-dependent. See the generated code guide for details.

The default value for repeated fields is empty (generally an empty list in the appropriate language).

Note that for scalar message fields, once a message is parsed there's no way of telling whether a field was explicitly set to the default value (for example whether a boolean was set to false) or just not set at all: you should bear this in mind when defining your message types. For example, don't have a boolean that switches on some behaviour when set to false if you don't want that behaviour to also happen by default. Also note that if a scalar message field is set to its default, the value will not be serialized on the wire.

See the generated code guide for your chosen language for more details about how defaults work in generated code.

Enumerations

When you're defining a message type, you might want one of its fields to only have one of a pre-defined list of values. For example, let's say you want to add a corpus field for each SearchRequest, where the corpus can be UNIVERSALWEBIMAGESLOCALNEWSPRODUCTS or VIDEO. You can do this very simply by adding an enum to your message definition with a constant for each possible value.

In the following example we've added an enum called Corpus with all the possible values, and a field of type Corpus:

message SearchRequest {

  string query = 1;

  int32 page_number = 2;

  int32 result_per_page = 3;

  enum Corpus {

    UNIVERSAL = 0;

    WEB = 1;

    IMAGES = 2;

    LOCAL = 3;

    NEWS = 4;

    PRODUCTS = 5;

    VIDEO = 6;

  }

  Corpus corpus = 4;

}



As you can see, the Corpus enum's first constant maps to zero: every enum definition must contain a constant that maps to zero as its first element. This is because:

  • There must be a zero value, so that we can use 0 as a numeric default value.
  • The zero value needs to be the first element, for compatibility with the proto2 semantics where the first enum value is always the default.

You can define aliases by assigning the same value to different enum constants. To do this you need to set the allow_alias option to true, otherwise the protocol compiler will generate an error message when aliases are found.

message MyMessage1 {

  enum EnumAllowingAlias {

    option allow_alias = true;

    UNKNOWN = 0;

    STARTED = 1;

    RUNNING = 1;

  }

}

message MyMessage2 {

  enum EnumNotAllowingAlias {

    UNKNOWN = 0;

    STARTED = 1;

    // RUNNING = 1;  // Uncommenting this line will cause a compile error inside Google and a warning message outside.

  }

}



Enumerator constants must be in the range of a 32-bit integer. Since enum values use varint encoding on the wire, negative values are inefficient and thus not recommended. You can define enums within a message definition, as in the above example, or outside – these enums can be reused in any message definition in your .proto file. You can also use an enum type declared in one message as the type of a field in a different message, using the syntax _MessageType_._EnumType_.

When you run the protocol buffer compiler on a .proto that uses an enum, the generated code will have a corresponding enum for Java or C++, a special EnumDescriptor class for Python that's used to create a set of symbolic constants with integer values in the runtime-generated class.

**Caution:** the generated code may be subject to language-specific limitations on the number of enumerators (low thousands for one language). Please review the limitations for the languages you plan to use.

During deserialization, unrecognized enum values will be preserved in the message, though how this is represented when the message is deserialized is language-dependent. In languages that support open enum types with values outside the range of specified symbols, such as C++ and Go, the unknown enum value is simply stored as its underlying integer representation. In languages with closed enum types such as Java, a case in the enum is used to represent an unrecognized value, and the underlying integer can be accessed with special accessors. In either case, if the message is serialized the unrecognized value will still be serialized with the message.

For more information about how to work with message enums in your applications, see the generated code guide for your chosen language.

Reserved Values

If you update an enum type by entirely removing an enum entry, or commenting it out, future users can reuse the numeric value when making their own updates to the type. This can cause severe issues if they later load old versions of the same .proto, including data corruption, privacy bugs, and so on. One way to make sure this doesn't happen is to specify that the numeric values (and/or names, which can also cause issues for JSON serialization) of your deleted entries are reserved. The protocol buffer compiler will complain if any future users try to use these identifiers. You can specify that your reserved numeric value range goes up to the maximum possible value using the max keyword.

enum Foo {

  reserved 2, 15, 9 to 11, 40 to max;

  reserved "FOO", "BAR";

}

Note that you can't mix field names and numeric values in the same reserved statement.

Using Other Message Types

You can use other message types as field types. For example, let's say you wanted to include Result messages in each SearchResponse message – to do this, you can define a Result message type in the same .proto and then specify a field of type Result in SearchResponse:

message SearchResponse {

  repeated Result results = 1;

}



message Result {

  string url = 1;

  string title = 2;

  repeated string snippets = 3;

}

Importing Definitions

In the above example, the Result message type is defined in the same file as SearchResponse – what if the message type you want to use as a field type is already defined in another .proto file?

You can use definitions from other .proto files by importing them. To import another .proto's definitions, you add an import statement to the top of your file:

import "myproject/other_protos.proto";

By default you can only use definitions from directly imported .proto files. However, sometimes you may need to move a .proto file to a new location. Instead of moving the .proto file directly and updating all the call sites in a single change, now you can put a dummy .proto file in the old location to forward all the imports to the new location using the import public notion. import public dependencies can be transitively relied upon by anyone importing the proto containing the import public statement. For example:

// new.proto

// All definitions are moved here
// old.proto

// This is the proto that all clients are importing.

import public "new.proto";

import "other.proto";
// client.proto

import "old.proto";

// You use definitions from old.proto and new.proto, but not other.proto

The protocol compiler searches for imported files in a set of directories specified on the protocol compiler command line using the -I/--proto_path flag. If no flag was given, it looks in the directory in which the compiler was invoked. In general you should set the --proto_path flag to the root of your project and use fully qualified names for all imports.

Using proto2 Message Types

It's possible to import proto2 message types and use them in your proto3 messages, and vice versa. However, proto2 enums cannot be used directly in proto3 syntax (it's okay if an imported proto2 message uses them).

Nested Types

You can define and use message types inside other message types, as in the following example – here the Resultmessage is defined inside the SearchResponse message:

message SearchResponse {

  message Result {

    string url = 1;

    string title = 2;

    repeated string snippets = 3;

  }

  repeated Result results = 1;

}

If you want to reuse this message type outside its parent message type, you refer to it as _Parent_._Type_:

message SomeOtherMessage {

  SearchResponse.Result result = 1;

}

You can nest messages as deeply as you like:

message Outer {                  // Level 0

  message MiddleAA {  // Level 1

    message Inner {   // Level 2

      int64 ival = 1;

      bool  booly = 2;

    }

  }

  message MiddleBB {  // Level 1

    message Inner {   // Level 2

      int32 ival = 1;

      bool  booly = 2;

    }

  }

}



Updating A Message Type

If an existing message type no longer meets all your needs – for example, you'd like the message format to have an extra field – but you'd still like to use code created with the old format, don't worry! It's very simple to update message types without breaking any of your existing code. Just remember the following rules:

  • Don't change the field numbers for any existing fields.
  • If you add new fields, any messages serialized by code using your "old" message format can still be parsed by your new generated code. You should keep in mind the default values for these elements so that new code can properly interact with messages generated by old code. Similarly, messages created by your new code can be parsed by your old code: old binaries simply ignore the new field when parsing. See the Unknown Fields section for details.
  • Fields can be removed, as long as the field number is not used again in your updated message type. You may want to rename the field instead, perhaps adding the prefix "OBSOLETE_", or make the field number reserved, so that future users of your .proto can't accidentally reuse the number.
  • int32uint32int64uint64, and bool are all compatible – this means you can change a field from one of these types to another without breaking forwards- or backwards-compatibility. If a number is parsed from the wire which doesn't fit in the corresponding type, you will get the same effect as if you had cast the number to that type in C++ (e.g. if a 64-bit number is read as an int32, it will be truncated to 32 bits).
  • sint32 and sint64 are compatible with each other but are not compatible with the other integer types.
  • string and bytes are compatible as long as the bytes are valid UTF-8.
  • Embedded messages are compatible with bytes if the bytes contain an encoded version of the message.
  • fixed32 is compatible with sfixed32, and fixed64 with sfixed64.
  • For stringbytes, and message fields, optional is compatible with repeated. Given serialized data of a repeated field as input, clients that expect this field to be optional will take the last input value if it's a primitive type field or merge all input elements if it's a message type field. Note that this is not generally safe for numeric types, including bools and enums. Repeated fields of numeric types can be serialized in the packed format, which will not be parsed correctly when an optional field is expected.
  • enum is compatible with int32uint32int64, and uint64 in terms of wire format (note that values will be truncated if they don't fit). However be aware that client code may treat them differently when the message is deserialized: for example, unrecognized proto3 enum types will be preserved in the message, but how this is represented when the message is deserialized is language-dependent. Int fields always just preserve their value.
  • Changing a single value into a member of a new oneof is safe and binary compatible. Moving multiple fields into a new oneof may be safe if you are sure that no code sets more than one at a time. Moving any fields into an existing oneof is not safe.

Unknown Fields

Unknown fields are well-formed protocol buffer serialized data representing fields that the parser does not recognize. For example, when an old binary parses data sent by a new binary with new fields, those new fields become unknown fields in the old binary.

Originally, proto3 messages always discarded unknown fields during parsing, but in version 3.5 we reintroduced the preservation of unknown fields to match the proto2 behavior. In versions 3.5 and later, unknown fields are retained during parsing and included in the serialized output.

Any

The Any message type lets you use messages as embedded types without having their .proto definition. An Anycontains an arbitrary serialized message as bytes, along with a URL that acts as a globally unique identifier for and resolves to that message's type. To use the Any type, you need to import google/protobuf/any.proto.

import "google/protobuf/any.proto";



message ErrorStatus {

  string message = 1;

  repeated google.protobuf.Any details = 2;

}

The default type URL for a given message type is type.googleapis.com/_packagename_._messagename_.

Different language implementations will support runtime library helpers to pack and unpack Any values in a typesafe manner – for example, in Java, the Any type will have special pack() and unpack() accessors, while in C++ there are PackFrom() and UnpackTo() methods:

// Storing an arbitrary message type in Any.

NetworkErrorDetails details = ...;

ErrorStatus status;

status.add_details()->PackFrom(details);



// Reading an arbitrary message from Any.

ErrorStatus status = ...;

for (const Any& detail : status.details()) {

  if (detail.Is()) {

    NetworkErrorDetails network_error;

    detail.UnpackTo(&network_error);

    ... processing network_error ...

  }

}



Currently the runtime libraries for working with Any types are under development.

If you are already familiar with proto2 syntax, the Any type replaces extensions.

Oneof

If you have a message with many fields and where at most one field will be set at the same time, you can enforce this behavior and save memory by using the oneof feature.

Oneof fields are like regular fields except all the fields in a oneof share memory, and at most one field can be set at the same time. Setting any member of the oneof automatically clears all the other members. You can check which value in a oneof is set (if any) using a special case() or WhichOneof() method, depending on your chosen language.

Using Oneof

To define a oneof in your .proto you use the oneof keyword followed by your oneof name, in this case test_oneof:

message SampleMessage {

  oneof test_oneof {

    string name = 4;

    SubMessage sub_message = 9;

  }

}

You then add your oneof fields to the oneof definition. You can add fields of any type, except map fields and repeatedfields.

In your generated code, oneof fields have the same getters and setters as regular fields. You also get a special method for checking which value (if any) in the oneof is set. You can find out more about the oneof API for your chosen language in the relevant API reference.

Oneof Features

  • Setting a oneof field will automatically clear all other members of the oneof. So if you set several oneof fields, only the last field you set will still have a value.
SampleMessage message;

message.set_name("name");

CHECK(message.has_name());

message.mutable_sub_message();   // Will clear name field.

CHECK(!message.has_name());
  • If the parser encounters multiple members of the same oneof on the wire, only the last member seen is used in the parsed message.
  • A oneof cannot be repeated.
  • Reflection APIs work for oneof fields.
  • If you set a oneof field to the default value (such as setting an int32 oneof field to 0), the "case" of that oneof field will be set, and the value will be serialized on the wire.
  • If you're using C++, make sure your code doesn't cause memory crashes. The following sample code will crash because sub_message was already deleted by calling the set_name() method.
SampleMessage message;

SubMessage* sub_message = message.mutable_sub_message();

message.set_name("name");      // Will delete sub_message

sub_message->set_...            // Crashes here
  • Again in C++, if you Swap() two messages with oneofs, each message will end up with the other’s oneof case: in the example below, msg1 will have a sub_message and msg2 will have a name.
SampleMessage msg1;

msg1.set_name("name");

SampleMessage msg2;

msg2.mutable_sub_message();

msg1.swap(&msg2);

CHECK(msg1.has_sub_message());

CHECK(msg2.has_name());

Backwards-compatibility issues

Be careful when adding or removing oneof fields. If checking the value of a oneof returns None/NOT_SET, it could mean that the oneof has not been set or it has been set to a field in a different version of the oneof. There is no way to tell the difference, since there's no way to know if an unknown field on the wire is a member of the oneof.

Tag Reuse Issues

  • Move fields into or out of a oneof: You may lose some of your information (some fields will be cleared) after the message is serialized and parsed. However, you can safely move a single field into a new oneof and may be able to move multiple fields if it is known that only one is ever set.
  • Delete a oneof field and add it back: This may clear your currently set oneof field after the message is serialized and parsed.
  • Split or merge oneof: This has similar issues to moving regular fields.

Maps

If you want to create an associative map as part of your data definition, protocol buffers provides a handy shortcut syntax:

map map_field = N;



...where the key_type can be any integral or string type (so, any scalar type except for floating point types and bytes). Note that enum is not a valid key_type. The value_type can be any type except another map.

So, for example, if you wanted to create a map of projects where each Project message is associated with a string key, you could define it like this:

map projects = 3;



  • Map fields cannot be repeated.
  • Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
  • When generating text format for a .proto, maps are sorted by key. Numeric keys are sorted numerically.
  • When parsing from the wire or when merging, if there are duplicate map keys the last key seen is used. When parsing a map from text format, parsing may fail if there are duplicate keys.
  • If you provide a key but no value for a map field, the behavior when the field is serialized is language-dependent. In C++, Java, and Python the default value for the type is serialized, while in other languages nothing is serialized.

The generated map API is currently available for all proto3 supported languages. You can find out more about the map API for your chosen language in the relevant API reference.

Backwards compatibility

The map syntax is equivalent to the following on the wire, so protocol buffers implementations that do not support maps can still handle your data:

message MapFieldEntry {

  key_type key = 1;

  value_type value = 2;

}



repeated MapFieldEntry map_field = N;



Any protocol buffers implementation that supports maps must both produce and accept data that can be accepted by the above definition.

Packages

You can add an optional package specifier to a .proto file to prevent name clashes between protocol message types.

package foo.bar;

message Open { ... }

You can then use the package specifier when defining fields of your message type:

message Foo {

  ...

  foo.bar.Open open = 1;

  ...

}

The way a package specifier affects the generated code depends on your chosen language:

  • In C++ the generated classes are wrapped inside a C++ namespace. For example, Open would be in the namespace foo::bar.
  • In Java, the package is used as the Java package, unless you explicitly provide an option java_package in your .proto file.
  • In Python, the package directive is ignored, since Python modules are organized according to their location in the file system.
  • In Go, the package is used as the Go package name, unless you explicitly provide an option go_package in your .proto file.
  • In Ruby, the generated classes are wrapped inside nested Ruby namespaces, converted to the required Ruby capitalization style (first letter capitalized; if the first character is not a letter, PB_ is prepended). For example, Open would be in the namespace Foo::Bar.
  • In C# the package is used as the namespace after converting to PascalCase, unless you explicitly provide an option csharp_namespace in your .proto file. For example, Open would be in the namespace Foo.Bar.

Packages and Name Resolution

Type name resolution in the protocol buffer language works like C++: first the innermost scope is searched, then the next-innermost, and so on, with each package considered to be "inner" to its parent package. A leading '.' (for example, .foo.bar.Baz) means to start from the outermost scope instead.

The protocol buffer compiler resolves all type names by parsing the imported .proto files. The code generator for each language knows how to refer to each type in that language, even if it has different scoping rules.

Defining Services

If you want to use your message types with an RPC (Remote Procedure Call) system, you can define an RPC service interface in a .proto file and the protocol buffer compiler will generate service interface code and stubs in your chosen language. So, for example, if you want to define an RPC service with a method that takes your SearchRequest and returns a SearchResponse, you can define it in your .proto file as follows:

service SearchService {

  rpc Search (SearchRequest) returns (SearchResponse);

}

The most straightforward RPC system to use with protocol buffers is gRPC: a language- and platform-neutral open source RPC system developed at Google. gRPC works particularly well with protocol buffers and lets you generate the relevant RPC code directly from your .proto files using a special protocol buffer compiler plugin.

If you don't want to use gRPC, it's also possible to use protocol buffers with your own RPC implementation. You can find out more about this in the Proto2 Language Guide.

There are also a number of ongoing third-party projects to develop RPC implementations for Protocol Buffers. For a list of links to projects we know about, see the third-party add-ons wiki page.

JSON Mapping

Proto3 supports a canonical encoding in JSON, making it easier to share data between systems. The encoding is described on a type-by-type basis in the table below.

If a value is missing in the JSON-encoded data or if its value is null, it will be interpreted as the appropriate default value when parsed into a protocol buffer. If a field has the default value in the protocol buffer, it will be omitted in the JSON-encoded data by default to save space. An implementation may provide options to emit fields with default values in the JSON-encoded output.

proto3

JSON

JSON example

Notes

message

object

`{"fooBar": v, "g": null, …}`

Generates JSON objects. Message field names are mapped to lowerCamelCase and become JSON object keys. If the `json_name` field option is specified, the specified value will be used as the key instead. Parsers accept both the lowerCamelCase name (or the one specified by the `json_name` option) and the original proto field name. `null` is an accepted value for all field types and treated as the default value of the corresponding field type.

enum

string

`"FOO_BAR"`

The name of the enum value as specified in proto is used. Parsers accept both enum names and integer values.

map

object

`{"k": v, …}`

All keys are converted to strings.

repeated V

array

`[v, …]`

`null` is accepted as the empty list [].

bool

true, false

`true, false`

 

string

string

`"Hello World!"`

 

bytes

base64 string

`"YWJjMTIzIT8kKiYoKSctPUB+"`

JSON value will be the data encoded as a string using standard base64 encoding with paddings. Either standard or URL-safe base64 encoding with/without paddings are accepted.

int32, fixed32, uint32

number

`1, -10, 0`

JSON value will be a decimal number. Either numbers or strings are accepted.

int64, fixed64, uint64

string

`"1", "-10"`

JSON value will be a decimal string. Either numbers or strings are accepted.

float, double

number

`1.1, -10.0, 0, "NaN", "Infinity"`

JSON value will be a number or one of the special string values "NaN", "Infinity", and "-Infinity". Either numbers or strings are accepted. Exponent notation is also accepted.

Any

`object`

`{"@type": "url", "f": v, … }`

If the Any contains a value that has a special JSON mapping, it will be converted as follows: `{"@type": xxx, "value": yyy}`. Otherwise, the value will be converted into a JSON object, and the `"@type"` field will be inserted to indicate the actual data type.

Timestamp

string

`"1972-01-01T10:00:20.021Z"`

Uses RFC 3339, where generated output will always be Z-normalized and uses 0, 3, 6 or 9 fractional digits. Offsets other than "Z" are also accepted.

Duration

string

`"1.000340012s", "1s"`

Generated output always contains 0, 3, 6, or 9 fractional digits, depending on required precision, followed by the suffix "s". Accepted are any fractional digits (also none) as long as they fit into nano-seconds precision and the suffix "s" is required.

Struct

`object`

`{ … }`

Any JSON object. See `struct.proto`.

Wrapper types

various types

`2, "2", "foo", true, "true", null, 0, …`

Wrappers use the same representation in JSON as the wrapped primitive type, except that `null` is allowed and preserved during data conversion and transfer.

FieldMask

string

`"f.fooBar,h"`

See `field_mask.proto`.

ListValue

array

`[foo, bar, …]`

 

Value

value

 

Any JSON value

NullValue

null

 

JSON null

Empty

object

{}

An empty JSON object

JSON options

A proto3 JSON implementation may provide the following options:

  • Emit fields with default values: Fields with default values are omitted by default in proto3 JSON output. An implementation may provide an option to override this behavior and output fields with their default values.
  • Ignore unknown fields: Proto3 JSON parser should reject unknown fields by default but may provide an option to ignore unknown fields in parsing.
  • Use proto field name instead of lowerCamelCase name: By default proto3 JSON printer should convert the field name to lowerCamelCase and use that as the JSON name. An implementation may provide an option to use proto field name as the JSON name instead. Proto3 JSON parsers are required to accept both the converted lowerCamelCase name and the proto field name.
  • Emit enum values as integers instead of strings: The name of an enum value is used by default in JSON output. An option may be provided to use the numeric value of the enum value instead.

Options

Individual declarations in a .proto file can be annotated with a number of options. Options do not change the overall meaning of a declaration, but may affect the way it is handled in a particular context. The complete list of available options is defined in google/protobuf/descriptor.proto.

Some options are file-level options, meaning they should be written at the top-level scope, not inside any message, enum, or service definition. Some options are message-level options, meaning they should be written inside message definitions. Some options are field-level options, meaning they should be written inside field definitions. Options can also be written on enum types, enum values, oneof fields, service types, and service methods; however, no useful options currently exist for any of these.

Here are a few of the most commonly used options:

  • java_package (file option): The package you want to use for your generated Java classes. If no explicit java_package option is given in the .proto file, then by default the proto package (specified using the "package" keyword in the .proto file) will be used. However, proto packages generally do not make good Java packages since proto packages are not expected to start with reverse domain names. If not generating Java code, this option has no effect.
option java_package = "com.example.foo";
  • java_multiple_files (file option): Causes top-level messages, enums, and services to be defined at the package level, rather than inside an outer class named after the .proto file.
option java_multiple_files = true;
  • java_outer_classname (file option): The class name for the outermost Java class (and hence the file name) you want to generate. If no explicit java_outer_classname is specified in the .proto file, the class name will be constructed by converting the .proto file name to camel-case (so foo_bar.proto becomes FooBar.java). If not generating Java code, this option has no effect.
option java_outer_classname = "Ponycopter";
  • optimize_for (file option): Can be set to SPEEDCODE_SIZE, or LITE_RUNTIME. This affects the C++ and Java code generators (and possibly third-party generators) in the following ways:
    • SPEED (default): The protocol buffer compiler will generate code for serializing, parsing, and performing other common operations on your message types. This code is highly optimized.
    • CODE_SIZE: The protocol buffer compiler will generate minimal classes and will rely on shared, reflection-based code to implement serialialization, parsing, and various other operations. The generated code will thus be much smaller than with SPEED, but operations will be slower. Classes will still implement exactly the same public API as they do in SPEED mode. This mode is most useful in apps that contain a very large number .proto files and do not need all of them to be blindingly fast.
    • LITE_RUNTIME: The protocol buffer compiler will generate classes that depend only on the "lite" runtime library (libprotobuf-lite instead of libprotobuf). The lite runtime is much smaller than the full library (around an order of magnitude smaller) but omits certain features like descriptors and reflection. This is particularly useful for apps running on constrained platforms like mobile phones. The compiler will still generate fast implementations of all methods as it does in SPEED mode. Generated classes will only implement the MessageLite interface in each language, which provides only a subset of the methods of the full Message interface.
option optimize_for = CODE_SIZE;
  • cc_enable_arenas (file option): Enables arena allocation for C++ generated code.
  • objc_class_prefix (file option): Sets the Objective-C class prefix which is prepended to all Objective-C generated classes and enums from this .proto. There is no default. You should use prefixes that are between 3-5 uppercase characters as recommended by Apple. Note that all 2 letter prefixes are reserved by Apple.
  • deprecated (field option): If set to true, indicates that the field is deprecated and should not be used by new code. In most languages this has no actual effect. In Java, this becomes a @Deprecated annotation. In the future, other language-specific code generators may generate deprecation annotations on the field's accessors, which will in turn cause a warning to be emitted when compiling code which attempts to use the field. If the field is not used by anyone and you want to prevent new users from using it, consider replacing the field declaration with a reservedstatement.
int32 old_field = 6 [deprecated = true];

Custom Options

Protocol Buffers also allows you to define and use your own options. This is an advanced feature which most people don't need. If you do think you need to create your own options, see the Proto2 Language Guide for details. Note that creating custom options uses extensions, which are permitted only for custom options in proto3.

Generating Your Classes

To generate the Java, Python, C++, Go, Ruby, Objective-C, or C# code you need to work with the message types defined in a .proto file, you need to run the protocol buffer compiler protoc on the .proto. If you haven't installed the compiler, download the package and follow the instructions in the README. For Go, you also need to install a special code generator plugin for the compiler: you can find this and installation instructions in the golang/protobuf repository on GitHub.

The Protocol Compiler is invoked as follows:

protoc --proto_path=_IMPORT_PATH_ --cpp_out=_DST_DIR_ --java_out=_DST_DIR_ --python_out=_DST_DIR_ --go_out=_DST_DIR_ --ruby_out=_DST_DIR_ --objc_out=_DST_DIR_ --csharp_out=_DST_DIR_ _path/to/file_.proto
  • IMPORT_PATH specifies a directory in which to look for .proto files when resolving import directives. If omitted, the current directory is used. Multiple import directories can be specified by passing the --proto_path option multiple times; they will be searched in order. -I=_IMPORT_PATH_ can be used as a short form of --proto_path.
  • You can provide one or more output directives:
    • --cpp_out generates C++ code in DST_DIR. See the C++ generated code reference for more.
    • --java_out generates Java code in DST_DIR. See the Java generated code reference for more.
    • --python_out generates Python code in DST_DIR. See the Python generated code reference for more.
    • --go_out generates Go code in DST_DIR. See the Go generated code reference for more.
    • --ruby_out generates Ruby code in DST_DIR. Ruby generated code reference is coming soon!
    • --objc_out generates Objective-C code in DST_DIR. See the Objective-C generated code reference for more.
    • --csharp_out generates C# code in DST_DIR. See the C# generated code reference for more.
    • --php_out generates PHP code in DST_DIR. See the PHP generated code reference for more.As an extra convenience, if the DST_DIR ends in .zip or .jar, the compiler will write the output to a single ZIP-format archive file with the given name. .jar outputs will also be given a manifest file as required by the Java JAR specification. Note that if the output archive already exists, it will be overwritten; the compiler is not smart enough to add files to an existing archive.
  • You must provide one or more .proto files as input. Multiple .proto files can be specified at once. Although the files are named relative to the current directory, each file must reside in one of the IMPORT_PATHs so that the compiler can determine its canonical name.

Style Guide

This document provides a style guide for .proto files. By following these conventions, you'll make your protocol buffer message definitions and their corresponding classes consistent and easy to read.

Note that protocol buffer style has evolved over time, so it is likely that you will see .proto files written in different conventions or styles. Please respect the existing style when you modify these files. Consistency is key. However, it is best to adopt the current best style when you are creating a new .proto file.

Standard file formatting

  • Keep the line length to 80 characters.
  • Use an indent of 2 spaces.

File structure

Files should be named lower_snake_case.proto

All files should be ordered in the following manner:

  1. License header (if applicable)
  2. File overview
  3. Syntax
  4. Package
  5. Imports (sorted)
  6. File options
  7. Everything else

Packages

Package name should be in lowercase, and should correspond to the directory hierarchy. e.g., if a file is in my/package/, then the package name should be my.package.

Message and field names

Use CamelCase (with an initial capital) for message names – for example, SongServerRequest. Use underscore_separated_names for field names (including oneof field and extension names) – for example, song_name.

message SongServerRequest {

  required string song_name = 1;

}

Using this naming convention for field names gives you accessors like the following:

C++:
 
const string& song_name() { ... }
 
void set_song_name(const string& x) { ... }

Java:
 
public String getSongName() { ... }
 
public Builder setSongName(String v) { ... }
 

If your field name contains a number, the number should appear after the letter instead of after the underscore. e.g., use song_name1 instead of song_name_1

Repeated fields

Use pluralized names for repeated fields.

  repeated string keys = 1;
  ...
  repeated
MyMessage accounts = 17;
 

Enums

Use CamelCase (with an initial capital) for enum type names and CAPITALS_WITH_UNDERSCORES for value names:

enum Foo {
  FOO_UNSPECIFIED =
0;
  FOO_FIRST_VALUE =
1;
  FOO_SECOND_VALUE =
2;
}

Each enum value should end with a semicolon, not a comma. Prefer prefixing enum values instead of surrounding them in an enclosing message. The zero value enum should have the suffix UNSPECIFIED.

Services

If your .proto defines an RPC service, you should use CamelCase (with an initial capital) for both the service name and any RPC method names:

service FooService {
  rpc
GetSomething(FooRequest) returns (FooResponse);
}

Things to avoid

  • Required fields (only for proto2)
  • Groups (only for proto2)

Encoding

This document describes the binary wire format for protocol buffer messages. You don't need to understand this to use protocol buffers in your applications, but it can be very useful to know how different protocol buffer formats affect the size of your encoded messages.

A Simple Message

Let's say you have the following very simple message definition:

message Test1 {

  optional int32 a = 1;

}

In an application, you create a Test1 message and set a to 150. You then serialize the message to an output stream. If you were able to examine the encoded message, you'd see three bytes:

08 96 01

So far, so small and numeric – but what does it mean? Read on...

Base 128 Varints

To understand your simple protocol buffer encoding, you first need to understand varints. Varints are a method of serializing integers using one or more bytes. Smaller numbers take a smaller number of bytes.

Each byte in a varint, except the last byte, has the most significant bit (msb) set – this indicates that there are further bytes to come. The lower 7 bits of each byte are used to store the two's complement representation of the number in groups of 7 bits, least significant group first.

So, for example, here is the number 1 – it's a single byte, so the msb is not set:

0000 0001

And here is 300 – this is a bit more complicated:

1010 1100 0000 0010

How do you figure out that this is 300? First you drop the msb from each byte, as this is just there to tell us whether we've reached the end of the number (as you can see, it's set in the first byte as there is more than one byte in the varint):

 1010 1100 0000 0010
010 1100  000 0010

You reverse the two groups of 7 bits because, as you remember, varints store numbers with the least significant group first. Then you concatenate them to get your final value:

000 0010  010 1100

→  000 0010 ++ 010 1100

→  100101100

→  256 + 32 + 8 + 4 = 300

Message Structure

As you know, a protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field's number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type's definition (i.e. the .proto file).

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them. To this end, the "key" for each pair in a wire-format message is actually two values – the field number from your .proto file, plus a wire type that provides just enough information to find the length of the following value. In most language implementations this key is referred to as a tag.

The available wire types are as follows:

Type

Meaning

Used For

0

Varint

int32, int64, uint32, uint64, sint32, sint64, bool, enum

1

64-bit

fixed64, sfixed64, double

2

Length-delimited

string, bytes, embedded messages, packed repeated fields

3

Start group

groups (deprecated)

4

End group

groups (deprecated)

5

32-bit

fixed32, sfixed32, float

Each key in the streamed message is a varint with the value (field_number << 3) | wire_type – in other words, the last three bits of the number store the wire type.

Now let's look at our simple example again. You now know that the first number in the stream is always a varint key, and here it's 08, or (dropping the msb):

000 1000

You take the last three bits to get the wire type (0) and then right-shift by three to get the field number (1). So you now know that the field number is 1 and the following value is a varint. Using your varint-decoding knowledge from the previous section, you can see that the next two bytes store the value 150.

96 01 = 1001 0110  0000 0001

       → 000 0001  ++  001 0110 (drop the msb and reverse the groups of 7 bits)

       → 10010110

       → 128 + 16 + 4 + 2 = 150

More Value Types

Signed Integers

As you saw in the previous section, all the protocol buffer types associated with wire type 0 are encoded as varints. However, there is an important difference between the signed int types (sint32 and sint64) and the "standard" int types (int32 and int64) when it comes to encoding negative numbers. If you use int32 or int64 as the type for a negative number, the resulting varint is always ten bytes long – it is, effectively, treated like a very large unsigned integer. If you use one of the signed types, the resulting varint uses ZigZag encoding, which is much more efficient.

ZigZag encoding maps signed integers to unsigned integers so that numbers with a small absolute value (for instance, -1) have a small varint encoded value too. It does this in a way that "zig-zags" back and forth through the positive and negative integers, so that -1 is encoded as 1, 1 is encoded as 2, -2 is encoded as 3, and so on, as you can see in the following table:

Signed Original

Encoded As

0

0

-1

1

1

2

-2

3

2147483647

4294967294

-2147483648

4294967295

In other words, each value n is encoded using

(n << 1) ^ (n >> 31)

for sint32s, or

(n << 1) ^ (n >> 63)

for the 64-bit version.

Note that the second shift – the (n >> 31) part – is an arithmetic shift. So, in other words, the result of the shift is either a number that is all zero bits (if n is positive) or all one bits (if n is negative).

When the sint32 or sint64 is parsed, its value is decoded back to the original, signed version.

Non-varint Numbers

Non-varint numeric types are simple – double and fixed64 have wire type 1, which tells the parser to expect a fixed 64-bit lump of data; similarly float and fixed32 have wire type 5, which tells it to expect 32 bits. In both cases the values are stored in little-endian byte order.

Strings

A wire type of 2 (length-delimited) means that the value is a varint encoded length followed by the specified number of bytes of data.

message Test2 {

  optional string b = 2;

}

Setting the value of b to "testing" gives you:

12 07 74 65 73 74 69 6e 67

The red bytes are the UTF8 of "testing". The key here is 0x12 →

0001 0010

00010 010

→ field_number = 2, wire_type = 2. The length varint in the value is 7 and lo and behold, we find seven bytes following it – our string.

Embedded Messages

Here's a message definition with an embedded message of our example type, Test1:

message Test3 {

  optional Test1 c = 3;

}

And here's the encoded version, again with the Test1's a field set to 150:

 1a 03 08 96 01

As you can see, the last three bytes are exactly the same as our first example (08 96 01), and they're preceded by the number 3 – embedded messages are treated in exactly the same way as strings (wire type = 2).

Optional And Repeated Elements

If a proto2 message definition has repeated elements (without the [packed=true] option), the encoded message has zero or more key-value pairs with the same field number. These repeated values do not have to appear consecutively; they may be interleaved with other fields. The order of the elements with respect to each other is preserved when parsing, though the ordering with respect to other fields is lost. In proto3, repeated fields use packed encoding, which you can read about below.

For any non-repeated fields in proto3, or optional fields in proto2, the encoded message may or may not have a key-value pair with that field number.

Normally, an encoded message would never have more than one instance of a non-repeated field. However, parsers are expected to handle the case in which they do. For numeric types and strings, if the same field appears multiple times, the parser accepts the last value it sees. For embedded message fields, the parser merges multiple instances of the same field, as if with the Message::MergeFrom method – that is, all singular scalar fields in the latter instance replace those in the former, singular embedded messages are merged, and repeated fields are concatenated. The effect of these rules is that parsing the concatenation of two encoded messages produces exactly the same result as if you had parsed the two messages separately and merged the resulting objects. That is, this:

MyMessage message;

message.ParseFromString(str1 + str2);

is equivalent to this:

MyMessage message, message2;

message.ParseFromString(str1);

message2.ParseFromString(str2);

message.MergeFrom(message2);

This property is occasionally useful, as it allows you to merge two messages even if you do not know their types.

Packed Repeated Fields

Version 2.1.0 introduced packed repeated fields, which in proto2 are declared like repeated fields but with the special [packed=true] option. In proto3, repeated fields of scalar numeric types are packed by default. These function like repeated fields, but are encoded differently. A packed repeated field containing zero elements does not appear in the encoded message. Otherwise, all of the elements of the field are packed into a single key-value pair with wire type 2 (length-delimited). Each element is encoded the same way it would be normally, except without a key preceding it.

For example, imagine you have the message type:

message Test4 {

  repeated int32 d = 4 [packed=true];

}

Now let's say you construct a Test4, providing the values 3, 270, and 86942 for the repeated field d. Then, the encoded form would be:

22        // key (field number 4, wire type 2)

06        // payload size (6 bytes)

03        // first element (varint 3)

8E 02     // second element (varint 270)

9E A7 05  // third element (varint 86942)

Only repeated fields of primitive numeric types (types which use the varint, 32-bit, or 64-bit wire types) can be declared "packed".

Note that although there's usually no reason to encode more than one key-value pair for a packed repeated field, encoders must be prepared to accept multiple key-value pairs. In this case, the payloads should be concatenated. Each pair must contain a whole number of elements.

Protocol buffer parsers must be able to parse repeated fields that were compiled as packed as if they were not packed, and vice versa. This permits adding [packed=true] to existing fields in a forward- and backward-compatible way.

Field Order

Field numbers may be used in any order in a .proto file. The order chosen has no effect on how the messages are serialized.

When a message is serialized, there is no guaranteed order for how its known or unknown fields should be written. Serialization order is an implementation detail and the details of any particular implementation may change in the future. Therefore, protocol buffer parsers must be able to parse fields in any order.

Implications

  • Do not assume the byte output of a serialized message is stable. This is especially true for messages with transitive bytes fields representing other serialized protocol buffer messages.
  • By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
    • Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.
  • The following checks may fail for a protocol buffer message instance foo.
    • foo.SerializeAsString() == foo.SerializeAsString()
    • Hash(foo.SerializeAsString()) == Hash(foo.SerializeAsString())
    • CRC(foo.SerializeAsString()) == CRC(foo.SerializeAsString())
    • FingerPrint(foo.SerializeAsString()) == FingerPrint(foo.SerializeAsString())
  • Here're a few example scenarios where logically equivalent protocol buffer messages foo and bar may serialize to different byte outputs.
    • bar is serialized by an old server that treats some fields as unknown.
    • bar is serialized by a server that is implemented in a different programming language and serializes fields in different order.
    • bar has a field that serializes in non-deterministic manner.
    • bar has a field that stores a serialized byte output of a protocol buffer message which is serialized differently.
    • bar is serialized by a new server that serializes fields in different order due to an implementation change.
    • Both foo and bar are concatenation of individual messages but with different order.

Techniques

This page describes some commonly-used design patterns for dealing with Protocol Buffers. You can also send design and usage questions to the Protocol Buffers discussion group.

Streaming Multiple Messages

If you want to write multiple messages to a single file or stream, it is up to you to keep track of where one message ends and the next begins. The Protocol Buffer wire format is not self-delimiting, so protocol buffer parsers cannot determine where a message ends on their own. The easiest way to solve this problem is to write the size of each message before you write the message itself. When you read the messages back in, you read the size, then read the bytes into a separate buffer, then parse from that buffer. (If you want to avoid copying bytes to a separate buffer, check out the CodedInputStream class (in both C++ and Java) which can be told to limit reads to a certain number of bytes.)

Large Data Sets

Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.

That said, Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are really just a collection of small pieces, where each small piece may be a structured piece of data. Even though Protocol Buffers cannot handle the entire set at once, using Protocol Buffers to encode each piece greatly simplifies your problem: now all you need is to handle a set of byte strings rather than a set of structures.

Protocol Buffers do not include any built-in support for large data sets because different situations call for different solutions. Sometimes a simple list of records will do while other times you may want something more like a database. Each solution should be developed as a separate library, so that only those who need it need to pay the costs.

Self-describing Messages

Protocol Buffers do not contain descriptions of their own types. Thus, given only a raw message without the corresponding .proto file defining its type, it is difficult to extract any useful data.

However, note that the contents of a .proto file can itself be represented using protocol buffers. The file src/google/protobuf/descriptor.proto in the source code package defines the message types involved. protoccan output a FileDescriptorSet – which represents a set of .proto files – using the --descriptor_set_out option. With this, you could define a self-describing protocol message like so:

syntax = "proto3";



import "google/protobuf/any.proto";

import "google/protobuf/descriptor.proto";



message SelfDescribingMessage {

  // Set of FileDescriptorProtos which describe the type and its dependencies.

  google.protobuf.FileDescriptorSet descriptor_set = 1;



  // The message and its type, encoded as an Any message.

  google.protobuf.Any message = 2;

}



By using classes like DynamicMessage (available in C++ and Java), you can then write tools which can manipulate SelfDescribingMessages.

All that said, the reason that this functionality is not included in the Protocol Buffer library is because we have never had a use for it inside Google.

This technique requires support for dynamic messages using descriptors. Please check that your platforms support this feature before using self-describing messages.

Third-Party Add-ons

Many open source projects seek to add useful functionality on top of Protocol Buffers. For a list of links to projects we know about, see the third-party add-ons wiki page

协议缓冲区的第三方加载项

此页面列出了与协议缓冲区相关的代码,这些代码是由第三方开发和维护的。您可能会发现此代码很有用,但是请注意,这些项目不隶属于Google或由Google认可(除非明确标记);试试看,后果自负。还要注意,这里的许多项目都处于开发的早期阶段,还没有投入生产。

如果您有应在此处列出的项目,请向我们发送请求请求以更新此页面。

编程语言

我们知道这些项目是关于为其他编程语言实现协议缓冲区的:

RPC实施

GRPChttp://www.grpc.io/)是Google针对协议缓冲区的RPC实现。也有其他第三方RPC实现。其中一些实际上使用协议缓冲区服务定义(使用文件中的service关键字定义.proto),而其他一些仅使用协议缓冲区消息对象。

无效:

其他实用程序

作为协议缓冲区开发人员,您可能还会发现其他有用的东西。

Protocol Buffer Basics: C++

This tutorial provides a basic C++ programmer's introduction to working with protocol buffers. By walking through creating a simple example application, it shows you how to

  • Define message formats in a .proto file.
  • Use the protocol buffer compiler.
  • Use the C++ protocol buffer API to write and read messages.

This isn't a comprehensive guide to using protocol buffers in C++. For more detailed reference information, see the Protocol Buffer Language Guide, the C++ API Reference, the C++ Generated Code Guide, and the Encoding Reference.

Why Use Protocol Buffers?

The example we're going to use is a very simple "address book" application that can read and write people's contact details to and from a file. Each person in the address book has a name, an ID, an email address, and a contact phone number.

How do you serialize and retrieve structured data like this? There are a few ways to solve this problem:

  • The raw in-memory data structures can be sent/saved in binary form. Over time, this is a fragile approach, as the receiving/reading code must be compiled with exactly the same memory layout, endianness, etc. Also, as files accumulate data in the raw format and copies of software that are wired for that format are spread around, it's very hard to extend the format.
  • You can invent an ad-hoc way to encode the data items into a single string – such as encoding 4 ints as "12:3:-23:67". This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.
  • Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.

Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.

Where to Find the Example Code

The example code is included in the source code package, under the "examples" directory. Download it here.

Defining Your Protocol Format

To create your address book application, you'll need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message. Here is the .proto file that defines your messages, addressbook.proto.

syntax = "proto2";
 
package tutorial;
 
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
 
  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }
 
  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }
 
  repeated PhoneNumber phones = 4;
}
 
message AddressBook {
  repeated Person people = 1;
}

As you can see, the syntax is similar to C++ or Java. Let's go through each part of the file and see what it does.

The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects. In C++, your generated classes will be placed in a namespace matching the package name.

Next, you have your message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including boolint32floatdouble, and string. You can also add further structure to your messages by using other message types as field types – in the above example the Personmessage contains PhoneNumber messages, while the AddressBook message contains Person messages. You can even define message types nested inside other messages – as you can see, the PhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values – here you want to specify that a phone number can be one of MOBILEHOME, or WORK.

The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.

Each field must be annotated with one of the following modifiers:

  • required: a value for the field must be provided, otherwise the message will be considered "uninitialized". If libprotobuf is compiled in debug mode, serializing an uninitialized message will cause an assertion failure. In optimized builds, the check is skipped and the message will be written anyway. However, parsing an uninitialized message will always fail (by returning false from the parse method). Other than this, a required field behaves exactly like an optional field.
  • optional: the field may or may not be set. If an optional field value isn't set, a default value is used. For simple types, you can specify your own default value, as we've done for the phone number type in the example. Otherwise, a system default is used: zero for numeric types, the empty string for strings, false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor to get the value of an optional (or required) field which has not been explicitly set always returns that field's default value.
  • repeated: the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.

Required Is Forever You should be very careful about marking fields as required. If at some point you wish to stop writing or sending a required field, it will be problematic to change the field to an optional field – old readers will consider messages without this field to be incomplete and may reject or drop them unintentionally. You should consider writing application-specific custom validation routines for your buffers instead. Some engineers at Google have come to the conclusion that using required does more harm than good; they prefer to use only optional and repeated. However, this view is not universal.

You'll find a complete guide to writing .proto files – including all the possible field types – in the Protocol Buffer Language Guide. Don't go looking for facilities similar to class inheritance, though – protocol buffers don't do that.

Compiling Your Protocol Buffers

Now that you have a .proto, the next thing you need to do is generate the classes you'll need to read and write AddressBook (and hence Person and PhoneNumber) messages. To do this, you need to run the protocol buffer compiler protoc on your .proto:

  1. If you haven't installed the compiler, download the package and follow the instructions in the README.
  2. Now run the compiler, specifying the source directory (where your application's source code lives – the current directory is used if you don't provide a value), the destination directory (where you want the generated code to go; often the same as $SRC_DIR), and the path to your .proto. In this case, you...:
protoc -I=$SRC_DIR --cpp_out=$DST_DIR $SRC_DIR/addressbook.proto

Because you want C++ classes, you use the --cpp_out option – similar options are provided for other supported languages.

This generates the following files in your specified destination directory:

  • addressbook.pb.h, the header which declares your generated classes.
  • addressbook.pb.cc, which contains the implementation of your classes.

The Protocol Buffer API

Let's look at some of the generated code and see what classes and functions the compiler has created for you. If you look in addressbook.pb.h, you can see that you have a class for each message you specified in addressbook.proto. Looking closer at the Person class, you can see that the compiler has generated accessors for each field. For example, for the nameidemail, and phones fields, you have these methods:

  // name

  inline bool has_name() const;

  inline void clear_name();

  inline const ::std::string& name() const;

  inline void set_name(const ::std::string& value);

  inline void set_name(const char* value);

  inline ::std::string* mutable_name();



  // id

  inline bool has_id() const;

  inline void clear_id();

  inline int32_t id() const;

  inline void set_id(int32_t value);



  // email

  inline bool has_email() const;

  inline void clear_email();

  inline const ::std::string& email() const;

  inline void set_email(const ::std::string& value);

  inline void set_email(const char* value);

  inline ::std::string* mutable_email();



  // phones

  inline int phones_size() const;

  inline void clear_phones();

  inline const ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >& phones() const;

  inline ::google::protobuf::RepeatedPtrField< ::tutorial::Person_PhoneNumber >* mutable_phones();

  inline const ::tutorial::Person_PhoneNumber& phones(int index) const;

  inline ::tutorial::Person_PhoneNumber* mutable_phones(int index);

  inline ::tutorial::Person_PhoneNumber* add_phones();

As you can see, the getters have exactly the name as the field in lowercase, and the setter methods begin with set_. There are also has_ methods for each singular (required or optional) field which return true if that field has been set. Finally, each field has a clear_ method that un-sets the field back to its empty state.

While the numeric id field just has the basic accessor set described above, the name and email fields have a couple of extra methods because they're strings – a mutable_ getter that lets you get a direct pointer to the string, and an extra setter. Note that you can call mutable_email() even if email is not already set; it will be initialized to an empty string automatically. If you had a singular message field in this example, it would also have a mutable_ method but not a set_ method.

Repeated fields also have some special methods – if you look at the methods for the repeated phones field, you'll see that you can

  • check the repeated field's _size (in other words, how many phone numbers are associated with this Person).
  • get a specified phone number using its index.
  • update an existing phone number at the specified index.
  • add another phone number to the message which you can then edit (repeated scalar types have an add_ that just lets you pass in the new value).

For more information on exactly what members the protocol compiler generates for any particular field definition, see the C++ generated code reference.

Enums and Nested Classes

The generated code includes a PhoneType enum that corresponds to your .proto enum. You can refer to this type as Person::PhoneType and its values as Person::MOBILEPerson::HOME, and Person::WORK (the implementation details are a little more complicated, but you don't need to understand them to use the enum).

The compiler has also generated a nested class for you called Person::PhoneNumber. If you look at the code, you can see that the "real" class is actually called Person_PhoneNumber, but a typedef defined inside Person allows you to treat it as if it were a nested class. The only case where this makes a difference is if you want to forward-declare the class in another file – you cannot forward-declare nested types in C++, but you can forward-declare Person_PhoneNumber.

Standard Message Methods

Each message class also contains a number of other methods that let you check or manipulate the entire message, including:

  • bool IsInitialized() const;: checks if all the required fields have been set.
  • string DebugString() const;: returns a human-readable representation of the message, particularly useful for debugging.
  • void CopyFrom(const Person& from);: overwrites the message with the given message's values.
  • void Clear();: clears all the elements back to the empty state.

These and the I/O methods described in the following section implement the Message interface shared by all C++ protocol buffer classes. For more info, see the complete API documentation for Message.

Parsing and Serialization

Finally, each protocol buffer class has methods for writing and reading messages of your chosen type using the protocol buffer binary format. These include:

  • bool SerializeToString(string* output) const;: serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
  • bool ParseFromString(const string& data);: parses a message from the given string.
  • bool SerializeToOstream(ostream* output) const;: writes the message to the given C++ ostream.
  • bool ParseFromIstream(istream* input);: parses a message from the given C++ istream.

These are just a couple of the options provided for parsing and serialization. Again, see the Message API reference for a complete list.

Protocol Buffers and O-O Design Protocol buffer classes are basically dumb data holders (like structs in C); they don't make good first class citizens in an object model. If you want to add richer behaviour to a generated class, the best way to do this is to wrap the generated protocol buffer class in an application-specific class. Wrapping protocol buffers is also a good idea if you don't have control over the design of the .proto file (if, say, you're reusing one from another project). In that case, you can use the wrapper class to craft an interface better suited to the unique environment of your application: hiding some data and methods, exposing convenience functions, etc. You should never add behaviour to the generated classes by inheriting from them. This will break internal mechanisms and is not good object-oriented practice anyway.

Writing A Message

Now let's try using your protocol buffer classes. The first thing you want your address book application to be able to do is write personal details to your address book file. To do this, you need to create and populate instances of your protocol buffer classes and then write them to an output stream.

Here is a program which reads an AddressBook from a file, adds one new Person to it based on user input, and writes the new AddressBook back out to the file again. The parts which directly call or reference code generated by the protocol compiler are highlighted.

#include <iostream>

#include <fstream>

#include <string>

#include "addressbook.pb.h"

using namespace std;



// This function fills in a Person message based on user input.

void PromptForAddress(tutorial::Person* person) {

  cout << "Enter person ID number: ";

  int id;

  cin >> id;

  person->set_id(id);

  cin.ignore(256, '\n');



  cout << "Enter name: ";

  getline(cin, *person->mutable_name());



  cout << "Enter email address (blank for none): ";

  string email;

  getline(cin, email);

  if (!email.empty()) {

    person->set_email(email);

  }



  while (true) {

    cout << "Enter a phone number (or leave blank to finish): ";

    string number;

    getline(cin, number);

    if (number.empty()) {

      break;

    }



    tutorial::Person::PhoneNumber* phone_number = person->add_phones();

    phone_number->set_number(number);



    cout << "Is this a mobile, home, or work phone? ";

    string type;

    getline(cin, type);

    if (type == "mobile") {

      phone_number->set_type(tutorial::Person::MOBILE);

    } else if (type == "home") {

      phone_number->set_type(tutorial::Person::HOME);

    } else if (type == "work") {

      phone_number->set_type(tutorial::Person::WORK);

    } else {

      cout << "Unknown phone type.  Using default." << endl;

    }

  }

}



// Main function:  Reads the entire address book from a file,

//   adds one person based on user input, then writes it back out to the same

//   file.

int main(int argc, char* argv[]) {

  // Verify that the version of the library that we linked against is

  // compatible with the version of the headers we compiled against.

  GOOGLE_PROTOBUF_VERIFY_VERSION;



  if (argc != 2) {

    cerr << "Usage:  " << argv[0] << " ADDRESS_BOOK_FILE" << endl;

    return -1;

  }



  tutorial::AddressBook address_book;



  {

    // Read the existing address book.

    fstream input(argv[1], ios::in | ios::binary);

    if (!input) {

      cout << argv[1] << ": File not found.  Creating a new file." << endl;

    } else if (!address_book.ParseFromIstream(&input)) {

      cerr << "Failed to parse address book." << endl;

      return -1;

    }

  }



  // Add an address.

  PromptForAddress(address_book.add_people());



  {

    // Write the new address book back to disk.

    fstream output(argv[1], ios::out | ios::trunc | ios::binary);

    if (!address_book.SerializeToOstream(&output)) {

      cerr << "Failed to write address book." << endl;

      return -1;

    }

  }



  // Optional:  Delete all global objects allocated by libprotobuf.

  google::protobuf::ShutdownProtobufLibrary();



  return 0;

}



Notice the GOOGLE_PROTOBUF_VERIFY_VERSION macro. It is good practice – though not strictly necessary – to execute this macro before using the C++ Protocol Buffer library. It verifies that you have not accidentally linked against a version of the library which is incompatible with the version of the headers you compiled with. If a version mismatch is detected, the program will abort. Note that every .pb.cc file automatically invokes this macro on startup.

Also notice the call to ShutdownProtobufLibrary() at the end of the program. All this does is delete any global objects that were allocated by the Protocol Buffer library. This is unnecessary for most programs, since the process is just going to exit anyway and the OS will take care of reclaiming all of its memory. However, if you use a memory leak checker that requires that every last object be freed, or if you are writing a library which may be loaded and unloaded multiple times by a single process, then you may want to force Protocol Buffers to clean up everything.

Reading A Message

Of course, an address book wouldn't be much use if you couldn't get any information out of it! This example reads the file created by the above example and prints all the information in it.

#include <iostream>

#include <fstream>

#include <string>

#include "addressbook.pb.h"

using namespace std;



// Iterates though all people in the AddressBook and prints info about them.

void ListPeople(const tutorial::AddressBook& address_book) {

  for (int i = 0; i < address_book.people_size(); i++) {

    const tutorial::Person& person = address_book.people(i);



    cout << "Person ID: " << person.id() << endl;

    cout << "  Name: " << person.name() << endl;

    if (person.has_email()) {

      cout << "  E-mail address: " << person.email() << endl;

    }



    for (int j = 0; j < person.phones_size(); j++) {

      const tutorial::Person::PhoneNumber& phone_number = person.phones(j);



      switch (phone_number.type()) {

        case tutorial::Person::MOBILE:

          cout << "  Mobile phone #: ";

          break;

        case tutorial::Person::HOME:

          cout << "  Home phone #: ";

          break;

        case tutorial::Person::WORK:

          cout << "  Work phone #: ";

          break;

      }

      cout << phone_number.number() << endl;

    }

  }

}



// Main function:  Reads the entire address book from a file and prints all

//   the information inside.

int main(int argc, char* argv[]) {

  // Verify that the version of the library that we linked against is

  // compatible with the version of the headers we compiled against.

  GOOGLE_PROTOBUF_VERIFY_VERSION;



  if (argc != 2) {

    cerr << "Usage:  " << argv[0] << " ADDRESS_BOOK_FILE" << endl;

    return -1;

  }



  tutorial::AddressBook address_book;



  {

    // Read the existing address book.

    fstream input(argv[1], ios::in | ios::binary);

    if (!address_book.ParseFromIstream(&input)) {

      cerr << "Failed to parse address book." << endl;

      return -1;

    }

  }



  ListPeople(address_book);



  // Optional:  Delete all global objects allocated by libprotobuf.

  google::protobuf::ShutdownProtobufLibrary();



  return 0;

}



Extending a Protocol Buffer

Sooner or later after you release the code that uses your protocol buffer, you will undoubtedly want to "improve" the protocol buffer's definition. If you want your new buffers to be backwards-compatible, and your old buffers to be forward-compatible – and you almost certainly do want this – then there are some rules you need to follow. In the new version of the protocol buffer:

  • you must not change the tag numbers of any existing fields.
  • you must not add or delete any required fields.
  • you may delete optional or repeated fields.
  • you may add new optional or repeated fields but you must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).

(There are some exceptions to these rules, but they are rarely used.)

If you follow these rules, old code will happily read new messages and simply ignore any new fields. To the old code, optional fields that were deleted will simply have their default value, and deleted repeated fields will be empty. New code will also transparently read old messages. However, keep in mind that new optional fields will not be present in old messages, so you will need to either check explicitly whether they're set with has_, or provide a reasonable default value in your .proto file with [default = value] after the tag number. If the default value is not specified for an optional element, a type-specific default value is used instead: for strings, the default value is the empty string. For booleans, the default value is false. For numeric types, the default value is zero. Note also that if you added a new repeated field, your new code will not be able to tell whether it was left empty (by new code) or never set at all (by old code) since there is no has_ flag for it.

Optimization Tips

The C++ Protocol Buffers library is extremely heavily optimized. However, proper usage can improve performance even more. Here are some tips for squeezing every last drop of speed out of the library:

  • Reuse message objects when possible. Messages try to keep around any memory they allocate for reuse, even when they are cleared. Thus, if you are handling many messages with the same type and similar structure in succession, it is a good idea to reuse the same message object each time to take load off the memory allocator. However, objects can become bloated over time, especially if your messages vary in "shape" or if you occasionally construct a message that is much larger than usual. You should monitor the sizes of your message objects by calling the SpaceUsed method and delete them once they get too big.
  • Your system's memory allocator may not be well-optimized for allocating lots of small objects from multiple threads. Try using Google's tcmalloc instead.

Advanced Usage

Protocol buffers have uses that go beyond simple accessors and serialization. Be sure to explore the C++ API referenceto see what else you can do with them.

One key feature provided by protocol message classes is reflection. You can iterate over the fields of a message and manipulate their values without writing your code against any specific message type. One very useful way to use reflection is for converting protocol messages to and from other encodings, such as XML or JSON. A more advanced use of reflection might be to find differences between two messages of the same type, or to develop a sort of "regular expressions for protocol messages" in which you can write expressions that match certain message contents. If you use your imagination, it's possible to apply Protocol Buffers to a much wider range of problems than you might initially expect!

Reflection is provided by the Message::Reflection interface.

Protocol Buffer Basics: C#

This tutorial provides a basic C# programmer's introduction to working with protocol buffers, using the proto3 version of the protocol buffers language. By walking through creating a simple example application, it shows you how to

  • Define message formats in a .proto file.
  • Use the protocol buffer compiler.
  • Use the C# protocol buffer API to write and read messages.

This isn't a comprehensive guide to using protocol buffers in C#. For more detailed reference information, see the Protocol Buffer Language Guide, the C# API Reference, the C# Generated Code Guide, and the Encoding Reference.

Why use protocol buffers?

The example we're going to use is a very simple "address book" application that can read and write people's contact details to and from a file. Each person in the address book has a name, an ID, an email address, and a contact phone number.

How do you serialize and retrieve structured data like this? There are a few ways to solve this problem:

  • Use .NET binary serialization with System.Runtime.Serialization.Formatters.Binary.BinaryFormatter and associated classes. This ends up being very fragile in the face of changes, expensive in terms of data size in some cases. It also doesn't work very well if you need to share data with applications written for other platforms.
  • You can invent an ad-hoc way to encode the data items into a single string – such as encoding 4 ints as "12:3:-23:67". This is a simple and flexible approach, although it does require writing one-off encoding and parsing code, and the parsing imposes a small run-time cost. This works best for encoding very simple data.
  • Serialize the data to XML. This approach can be very attractive since XML is (sort of) human readable and there are binding libraries for lots of languages. This can be a good choice if you want to share data with other applications/projects. However, XML is notoriously space intensive, and encoding/decoding it can impose a huge performance penalty on applications. Also, navigating an XML DOM tree is considerably more complicated than navigating simple fields in a class normally would be.

Protocol buffers are the flexible, efficient, automated solution to solve exactly this problem. With protocol buffers, you write a .proto description of the data structure you wish to store. From that, the protocol buffer compiler creates a class that implements automatic encoding and parsing of the protocol buffer data with an efficient binary format. The generated class provides getters and setters for the fields that make up a protocol buffer and takes care of the details of reading and writing the protocol buffer as a unit. Importantly, the protocol buffer format supports the idea of extending the format over time in such a way that the code can still read data encoded with the old format.

Where to find the example code

Our example is a command-line application for managing an address book data file, encoded using protocol buffers. The command AddressBook (see: Program.cs) can add a new entry to the data file or parse the data file and print the data to the console.

You can find the complete example in the examples directory and csharp/src/AddressBook directory of the GitHub repository.

Defining your protocol format

To create your address book application, you'll need to start with a .proto file. The definitions in a .proto file are simple: you add a message for each data structure you want to serialize, then specify a name and a type for each field in the message. In our example, the .proto file that defines the messages is addressbook.proto.

The .proto file starts with a package declaration, which helps to prevent naming conflicts between different projects.

syntax = "proto3";
package tutorial;

import "google/protobuf/timestamp.proto";

In C#, your generated classes will be placed in a namespace matching the package name if csharp_namespace is not specified. In our example, the csharp_namespace option has been specified to override the default, so the generated code uses a namespace of Google.Protobuf.Examples.AddressBook instead of Tutorial.

option csharp_namespace = "Google.Protobuf.Examples.AddressBook";

Next, you have your message definitions. A message is just an aggregate containing a set of typed fields. Many standard simple data types are available as field types, including boolint32floatdouble, and string. You can also add further structure to your messages by using other message types as field types.

message Person {
 
string name = 1;
  int32 id =
2;  // Unique ID number for this person.
  string email = 3;

 
enum PhoneType {
    MOBILE =
0;
    HOME =
1;
    WORK =
2;
  }

  message
PhoneNumber {
   
string number = 1;
   
PhoneType type = 2;
  }

  repeated
PhoneNumber phones = 4;

  google.protobuf.
Timestamp last_updated = 5;
}


// Our address book file is just one of these.
message AddressBook {
  repeated
Person people = 1;
}

In the above example, the Person message contains PhoneNumber messages, while the AddressBook message contains Person messages. You can even define message types nested inside other messages – as you can see, thePhoneNumber type is defined inside Person. You can also define enum types if you want one of your fields to have one of a predefined list of values – here you want to specify that a phone number can be one of MOBILEHOME, or WORK.

The " = 1", " = 2" markers on each element identify the unique "tag" that field uses in the binary encoding. Tag numbers 1-15 require one less byte to encode than higher numbers, so as an optimization you can decide to use those tags for the commonly used or repeated elements, leaving tags 16 and higher for less-commonly used optional elements. Each element in a repeated field requires re-encoding the tag number, so repeated fields are particularly good candidates for this optimization.

If a field value isn't set, a default value is used: zero for numeric types, the empty string for strings, false for bools. For embedded messages, the default value is always the "default instance" or "prototype" of the message, which has none of its fields set. Calling the accessor to get the value of a field which has not been explicitly set always returns that field's default value.

If a field is repeated, the field may be repeated any number of times (including zero). The order of the repeated values will be preserved in the protocol buffer. Think of repeated fields as dynamically sized arrays.

You'll find a complete guide to writing .proto files – including all the possible field types – in the Protocol Buffer Language Guide. Don't go looking for facilities similar to class inheritance, though – protocol buffers don't do that.

Compiling your protocol buffers

Now that you have a .proto, the next thing you need to do is generate the classes you'll need to read and write AddressBook (and hence Person and PhoneNumber) messages. To do this, you need to run the protocol buffer compiler protoc on your .proto:

  1. If you haven't installed the compiler, download the package and follow the instructions in the README.
  2. Now run the compiler, specifying the source directory (where your application's source code lives – the current directory is used if you don't provide a value), the destination directory (where you want the generated code to go; often the same as $SRC_DIR), and the path to your .proto. In this case, you would invoke:

protoc -I=$SRC_DIR --csharp_out=$DST_DIR $SRC_DIR/addressbook.proto

Because you want C# code, you use the --csharp_out option – similar options are provided for other supported languages.

This generates Addressbook.cs in your specified destination directory. To compile this code, you'll need a project with a reference to the Google.Protobuf assembly.

The addressbook classes

Generating Addressbook.cs gives you five useful types:

  • A static Addressbook class that contains metadata about the protocol buffer messages.
  • An AddressBook class with a read-only People property.
  • Person class with properties for NameIdEmail and Phones.
  • PhoneNumber class, nested in a static Person.Types class.
  • PhoneType enum, also nested in Person.Types.

You can read more about the details of exactly what's generated in the C# Generated Code guide, but for the most part you can treat these as perfectly ordinary C# types. One point to highlight is that any properties corresponding to repeated fields are read-only. You can add items to the collection or remove items from it, but you can't replace it with an entirely separate collection. The collection type for repeated fields is always RepeatedField<T>. This type is like List<T> but with a few extra convenience methods, such as an Add overload accepting a collection of items, for use in colleciton initializers.

Here's an example of how you might create an instance of Person:

Person john = new Person
{
   
Id = 1234,
   
Name = "John Doe",
   
Email = "jdoe@example.com",
   
Phones = { new Person.Types.PhoneNumber { Number = "555-4321", Type = Person.Types.PhoneType.Home } }
};

 

Note that with C# 6, you can use using static to remove the Person.Types ugliness:

// Add this to the other using directives
using static Google.Protobuf.Examples.AddressBook.Person.Types;
...

// The earlier Phones assignment can now be simplified to:
Phones = { new PhoneNumber { Number = "555-4321", Type = PhoneType.HOME } }
 

Parsing and serialization

The whole purpose of using protocol buffers is to serialize your data so that it can be parsed elsewhere. Every generated class has a WriteTo(CodedOutputStream) method, where CodedOutputStream is a class in the protocol buffer runtime library. However, usually you'll use one of the extension methods to write to a regular System.IO.Stream or convert the message to a byte array or ByteString. These extension messages are in the Google.Protobuf.MessageExtensions class, so when you want to serialize you'll usually want a using directive for the Google.Protobuf namespace. For example:

using Google.Protobuf;
...

Person john = ...; // Code as before
using (var output = File.Create("john.dat"))
{
    john.
WriteTo(output);
}

 

Parsing is also simple. Each generated class has a static Parser property which returns a MessageParser<T> for that type. That in turn has methods to parse streams, byte arrays and ByteStrings. So to parse the file we've just created, we can use:

Person john;
using (
var input = File.OpenRead("john.dat"))
{
    john =
Person.Parser.ParseFrom(input);
}

 

A full example program to maintain an addressbook (adding new entries and listing existing ones) using these messages is available in the Github repository.

Extending a Protocol Buffer

Sooner or later after you release the code that uses your protocol buffer, you will undoubtedly want to "improve" the protocol buffer's definition. If you want your new buffers to be backwards-compatible, and your old buffers to be forward-compatible – and you almost certainly do want this – then there are some rules you need to follow. In the new version of the protocol buffer:

  • you must not change the tag numbers of any existing fields.
  • you may delete fields.
  • you may add new fields but you must use fresh tag numbers (i.e. tag numbers that were never used in this protocol buffer, not even by deleted fields).

(There are some exceptions to these rules, but they are rarely used.)

If you follow these rules, old code will happily read new messages and simply ignore any new fields. To the old code, singular fields that were deleted will simply have their default value, and deleted repeated fields will be empty. New code will also transparently read old messages.

However, keep in mind that new fields will not be present in old messages, so you will need to do something reasonable with the default value. A type-specific default value is used: for strings, the default value is the empty string. For booleans, the default value is false. For numeric types, the default value is zero.

Reflection

Message descriptors (the information in the .proto file) and instances of messages can be examined programmatically using the reflection API. This can be useful when writing generic code such as a different text format or a smart diff tool. Each generated class has a static Descriptor property, and the descriptor for any instance can be retrieved using the IMessage.Descriptor property. As a quick example of how these can be used, here is a short method to print the top-level fields of any message.

public void PrintMessage(IMessage message)
{
   
var descriptor = message.Descriptor;
   
foreach (var field in descriptor.Fields.InDeclarationOrder())
    {
       
Console.WriteLine(
           
"Field {0} ({1}): {2}",
            field.
FieldNumber,
            field.
Name,
            field.
Accessor.GetValue(message);
    }
}

 

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
Protobuf是Google开发的一种二进制数据交换格式,其主要用于序列化结构化数据,使其能够在网络中进行传输。Protobuf定义了一种结构化数据格式,通过这种格式可以定义数据的类型和字段,然后通过编译器生成对应的代码,用于在不同的编程语言中进行数据的序列化和反序列化操作。 WebSocket是一种在单个TCP连接上进行全双工通信的协议,它提供了一种实时的、持久的、双向的通信通道,可以在客户端和服务器之间进行双向通信。与HTTP不同,WebSocket不需要通过不断的请求和响应来实现数据的交换,而是通过使用WebSocket协议,可以直接在连接建立后进行数据的传输和接收。 在前端与服务端进行通信时,可以使用Protobuf与WebSocket结合使用。通过在服务端和前端分别定义相同的数据结构,并使用Protobuf进行序列化和反序列化操作,可以将数据以二进制形式传输,并节省带宽和提高传输效率。在WebSocket连接建立后,可以直接通过发送和接收二进制数据来进行双向通信,而不需要通过HTTP请求和响应的方式。 通过使用Protobuf与WebSocket结合,可以在前端与服务端之间建立实时的、持久的、双向的通信通道,实现高效的数据交换和迅速的响应。同时,使用Protobuf进行数据序列化和反序列化,可以保证数据的一致性和准确性,提高数据交换的可靠性。因此,Protobuf与WebSocket结合可以在Web开发中发挥重要的作用,提升前端与服务端之间的通信效率和性能。

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值