Protobuf Source Code Analysis

高铭杰

Last modified: 2024-11-21 17:31:18

Categories: pgsql, general. Tags: protobuf, protobuf-c, varint, unpack, pack

First published: 2024-11-21 17:29:55

Article link: https://blog.csdn.net/jackgo73/article/details/143950391

Related
"Analysis of Varints Variable-Length Integer Encoding"

This article analyzes the key code of protobuf-c.

Quick reference

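Message format: TAG + PAYLOAD
TAG
- Contains the field id, i.e. the number assigned in the proto file, such as name = 1 and id1 = 2.
- Also carries the wire type, stored in the low 3 bits below the field id. For example, name has wire type 2 (length-delimited, used for string) and id1 has wire type 0 (varint).
PAYLOAD
- The payload is handled differently depending on the wire type.
  - A varint is read one byte at a time, testing each byte with & 0x80 to see whether another byte follows.
  - A string starts with a varint holding the length; that many bytes of data are then read.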
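
The two payload rules above can be sketched in C as follows. This is a minimal illustration, not protobuf-c's own code: read_varint and read_len_payload are hypothetical helper names invented here, and error handling is reduced to returning 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one varint from buf[0..len).
 * Returns the number of bytes consumed, or 0 if the varint is
 * truncated or longer than 10 bytes. */
static size_t read_varint(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    size_t i;

    for (i = 0; i < len && i < 10; i++) {
        /* The low 7 bits carry data, least significant group first. */
        value |= (uint64_t)(buf[i] & 0x7f) << (7 * i);
        if ((buf[i] & 0x80) == 0) {   /* MSB clear: this is the last byte */
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Hypothetical helper: read a length-prefixed (wire type 2) payload.
 * On success *data points at the payload bytes inside buf, *data_len
 * is their count, and the return value is prefix bytes + data bytes.
 * Returns 0 if the prefix is malformed or the data runs past the buffer. */
static size_t read_len_payload(const uint8_t *buf, size_t len,
                               const uint8_t **data, size_t *data_len)
{
    uint64_t n;
    size_t used = read_varint(buf, len, &n);

    if (used == 0 || n > len - used)
        return 0;
    *data = buf + used;
    *data_len = (size_t)n;
    return used + (size_t)n;
}
```

For the bytes 96 01, read_varint consumes 2 bytes and yields 150. For the bytes 03 61 62 63, read_len_payload consumes 4 bytes and reports a 3-byte payload starting at 'a'.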

For example, given this proto definition:

message Person {
  string name = 1;
  int32 id1 = 2;
  int32 id2 = 3;
  string email = 4;
}

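After serialization:

The field number and the wire type together form the TAG. Bits 0-2 hold the wire type, bits 3-6 hold the field number, and bit 7 is the MSB, which works exactly as in the varint algorithm: it signals whether a further, higher-order byte follows. Bit 7 is 1 when the field number is too large to fit in the 4 available bits, i.e. when it is greater than 15.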
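
To make the bit layout concrete, here is a small sketch that builds the tag for each Person field and writes it as a varint. The names make_tag and write_varint32 are hypothetical helpers for illustration and are not part of protobuf-c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine field number and wire type into a tag.
 * The wire type sits in bits 0-2, the field number above it. */
static uint32_t make_tag(uint32_t field_number, uint32_t wire_type)
{
    return (field_number << 3) | (wire_type & 0x7);
}

/* Hypothetical helper: write value as a varint into out, which must
 * have room for at least 5 bytes. Returns the number of bytes written. */
static size_t write_varint32(uint32_t value, uint8_t *out)
{
    size_t n = 0;

    while (value >= 0x80) {
        out[n++] = (uint8_t)(value | 0x80);  /* low 7 bits + continuation bit */
        value >>= 7;
    }
    out[n++] = (uint8_t)value;               /* final byte, MSB clear */
    return n;
}
```

With these helpers, name (field 1, wire type 2) gives tag byte 0x0A, id1 (field 2, wire type 0) gives 0x10, id2 (field 3, wire type 0) gives 0x18, and email (field 4, wire type 2) gives 0x22. A field numbered 16 with wire type 2 no longer fits in one byte: its tag value is 130, which is written as the two bytes 0x82 0x01.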

[Figure: layout of the serialized message]
Wire format reference: https://protobuf.dev/programming-guides/encoding

Message Structure
A protocol buffer message is a series of key-value pairs. The binary version of a message just uses the field’s number as the key – the name and declared type for each field can only be determined on the decoding end by referencing the message type’s definition (i.e. the .proto file). Protoscope does not have access to this information, so it can only provide the field numbers.

When a message is encoded, each key-value pair is turned into a record consisting of the field number, a wire type and a payload. The wire type tells the parser how big the payload after it is. This allows old parsers to skip over new fields they don’t understand. This type of scheme is sometimes called Tag-Length-Value, or TLV.

There are six wire types: VARINT, I64, LEN, SGROUP, EGROUP, and I32.

ID	Name	Used For
0	VARINT	int32, int64, uint32, uint64, sint32, sint64, bool, enum
1	I64	fixed64, sfixed64, double
2	LEN	string, bytes, embedded messages, packed repeated fields
3	SGROUP	group start (deprecated)
4	EGROUP	group end (deprecated)
5	I32	fixed32, sfixed32, float
The “tag” of a record is encoded as a varint formed from the field number and the wire type via the formula (field_number << 3) | wire_type. In other words, after decoding the varint representing a field, the low 3 bits tell us the wire type, and the rest of the integer tells us the field number.

Now let’s look at our simple example again. You now know that the first number in the stream is always a varint key, and here it’s `08`, or (dropping the MSB):

000 1000
You take the last three bits to get the wire type (0) and then right-shift by three to get the field number (1). Protoscope represents a tag as an integer followed by a colon and the wire type, so we can write the above bytes as 1:VARINT.

Because the wire type is 0, or VARINT, we know that we need to decode a varint to get the payload. As we saw above, the bytes `9601` varint-decode to 150, giving us our record. We can write it in Protoscope as 1:VARINT 150.

Protoscope can infer the type for a tag if there is whitespace after the :. It does so by looking ahead at the next token and guessing what you meant (the rules are documented in detail in Protoscope’s language.txt). For example, in 1: 150, there is a varint immediately after the untyped tag, so Protoscope infers its type to be VARINT. If you wrote 2: {}, it would see the { and guess LEN; if you wrote 3: 5i32 it would guess I32, and so on.
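
The decoding steps quoted above (split the key into wire type and field number, then varint-decode the payload) can be sketched in C. The type VarintRecord and the functions decode_varint and decode_varint_record are hypothetical names chosen for this illustration; the sketch only handles records whose wire type is VARINT.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record type for this sketch. */
typedef struct {
    uint32_t field_number;
    uint8_t  wire_type;
    uint64_t value;          /* decoded varint payload */
} VarintRecord;

/* Decode one varint; returns bytes consumed, or 0 on malformed input. */
static size_t decode_varint(const uint8_t *p, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    size_t i;

    for (i = 0; i < len && i < 10; i++) {
        value |= (uint64_t)(p[i] & 0x7f) << (7 * i);
        if ((p[i] & 0x80) == 0) {
            *out = value;
            return i + 1;
        }
    }
    return 0;
}

/* Decode a key followed by a varint payload.
 * Returns total bytes consumed, or 0 if the input is malformed or
 * the wire type is not VARINT (0). */
static size_t decode_varint_record(const uint8_t *p, size_t len,
                                   VarintRecord *rec)
{
    uint64_t key;
    size_t used = decode_varint(p, len, &key);
    size_t n;

    if (used == 0)
        return 0;
    rec->wire_type = (uint8_t)(key & 0x7);     /* low 3 bits */
    rec->field_number = (uint32_t)(key >> 3);  /* remaining bits */
    if (rec->wire_type != 0)
        return 0;
    n = decode_varint(p + used, len - used, &rec->value);
    if (n == 0)
        return 0;
    return used + n;
}
```

Given the three bytes 08 96 01, decode_varint_record consumes all 3 bytes and fills in field_number 1, wire_type 0 and value 150, matching the Protoscope notation 1:VARINT 150.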

Deserialization

proto definition

message Person {
  string name = 1;
  int32 id1 = 2;
  int32 id2 = 3;
  string email = 4;
}

The struct generated in test-proto3.pb-c.h by running protoc-c --c_out=. test-proto3.proto:

struct  Foo__Person
{
  ProtobufCMessage base;
  char *name;
  int32_t id1;
  int32_t id2;
  char *email;
  size_t n_phone;
  Foo__Person__PhoneNumber **phone;
};

Data after serialization with foo__person__pack:
[Figure: serialized data]

Value of person before deserialization:

(gdb) p person
$2 = 
{
  base = {descriptor = 0x485da0 <foo.person.descriptor>, n_unknown_fields = 0, unknown_fields = 0x0}, 
  name = 0x6b5980 'a' <repeats 64 times>, 
  id1 = 0, 
  id2 = 2100000000,
  email = 0x6b59d0 'b' <repeats 200 times>..., 
  n_phone = 0, 
  phone = 0x0
}

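Summary of the overall deserialization flow:

Read the header: obtain three pieces of information, namely the field number (tag), the wire type (wire_type), and the number of bytes the header occupies (used).
Locate the struct field index: compute the index of the target field from tag.
Get the payload length: compute it from the wire type; each wire type uses a different method. For a varint, for example, scanning the top bit of each byte is enough to tell how many bytes it uses.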

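protobuf_c_message_unpack

   while (rem > 0)
     /* 1. read the header: field number, wire type, bytes used */
     used = parse_tag_and_wiretype(rem, at, &tag, &wire_type);

     /* 2. map the field number to an index into desc->fields */
     int field_index = int_range_lookup(desc->n_field_ranges, desc->field_ranges, tag);

     /* skip past the header */
     at += used;
     rem -= used;
     ...
     /* 3. work out the payload length for this wire type */
     switch (wire_type)
       ...
       case PROTOBUF_C_WIRE_TYPE_VARINT:
         /* count bytes until one has its top bit clear */
         unsigned max_len = rem < 10 ? rem : 10;
         unsigned i;
         for (i = 0; i < max_len; i++)
           if ((at[i] & 0x80) == 0)
             break;
         tmp.len = i + 1;
         break;
         ...
       case PROTOBUF_C_WIRE_TYPE_LENGTH_PREFIXED:
         ...
         /* tmp.len = length-prefix bytes (pref_len) + data bytes */
         tmp.len = scan_length_prefixed_data(rem, at, &pref_len);
         break;
       ...
     /* skip past the payload */
     at += tmp.len;
     rem -= tmp.len;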

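Round 1 parses the name field, a string: tag=1 identifies field number 1, and wire_type=2 covers string, bytes, embedded messages, and packed repeated fields. tmp.len=65 means the payload occupies 65 bytes, of which 1 byte is the payload header, a varint that records the string length (64).
Round 2 parses the id2 field, an int32: tag=3 identifies field number 3, and wire_type=0 means varint. id1 does not appear in the stream because its value is 0, the proto3 default, which is not serialized. The code then counts how many bytes the varint uses by checking the top bit of each byte: four bytes have the top bit set, so the varint uses tmp.len=5 bytes.
Round 3 parses the email field: the first pref_len=2 bytes of the payload record the length as a varint, and the payload is 1026 bytes in total.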

used	tag	wire_type	tmp.len	pref_len	rem (before tag → after tag → after payload)
1	1(name)	2	65	1	1099 → 1098 → 1033
1	3(id2)	0	5		1033 → 1032 → 1027
1	4(email)	2	1026	2	1027 → 1026 → 0
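
The table can be reproduced with a standalone scanning loop that follows the same three steps. This is a simplified sketch under stated assumptions, not the protobuf-c implementation: ScanRow, get_varint, scan_records and build_sample are hypothetical names, only wire types 0 and 2 are handled, and the sample buffer is encoded by hand on the assumption that it matches what foo__person__pack emits for this message (1099 bytes).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    size_t   used;       /* bytes taken by the tag */
    uint32_t tag;        /* field number */
    uint8_t  wire_type;
    size_t   len;        /* payload bytes, including any length prefix */
    size_t   pref_len;   /* length-prefix bytes (wire type 2 only) */
    size_t   rem_after;  /* bytes left once this record is consumed */
} ScanRow;

/* Decode one varint; returns bytes consumed, or 0 on malformed input. */
static size_t get_varint(const uint8_t *p, size_t len, uint64_t *out)
{
    uint64_t v = 0;
    size_t i;

    for (i = 0; i < len && i < 10; i++) {
        v |= (uint64_t)(p[i] & 0x7f) << (7 * i);
        if ((p[i] & 0x80) == 0) {
            *out = v;
            return i + 1;
        }
    }
    return 0;
}

/* Scan buf and record one row per field. Returns the number of rows,
 * or -1 on malformed input, unsupported wire type, or too many records. */
static int scan_records(const uint8_t *buf, size_t len,
                        ScanRow *rows, int max_rows)
{
    const uint8_t *at = buf;
    size_t rem = len;
    int n = 0;

    while (rem > 0 && n < max_rows) {
        ScanRow r = {0};
        uint64_t key, v;

        r.used = get_varint(at, rem, &key);        /* step 1: header */
        if (r.used == 0)
            return -1;
        r.tag = (uint32_t)(key >> 3);
        r.wire_type = (uint8_t)(key & 7);
        at += r.used;
        rem -= r.used;

        switch (r.wire_type) {                     /* step 3: payload length */
        case 0:                                    /* VARINT */
            r.len = get_varint(at, rem, &v);
            if (r.len == 0)
                return -1;
            break;
        case 2:                                    /* LEN */
            r.pref_len = get_varint(at, rem, &v);
            if (r.pref_len == 0 || v > rem - r.pref_len)
                return -1;
            r.len = r.pref_len + (size_t)v;
            break;
        default:                                   /* others omitted in this sketch */
            return -1;
        }
        at += r.len;
        rem -= r.len;
        r.rem_after = rem;
        rows[n++] = r;
    }
    return rem == 0 ? n : -1;
}

/* Hand-encode the sample: name = 64 x 'a', id2 = 2100000000,
 * email = 1024 x 'b'. buf must hold 1099 bytes. Returns bytes written. */
static size_t build_sample(uint8_t *buf)
{
    uint8_t *p = buf;
    uint32_t id2 = 2100000000u;

    *p++ = 0x0A; *p++ = 64;                 /* name: tag (1<<3)|2, length 64 */
    memset(p, 'a', 64); p += 64;
    *p++ = 0x18;                            /* id2: tag (3<<3)|0 */
    while (id2 >= 0x80) { *p++ = (uint8_t)(id2 | 0x80); id2 >>= 7; }
    *p++ = (uint8_t)id2;
    *p++ = 0x22; *p++ = 0x80; *p++ = 0x08;  /* email: tag (4<<3)|2, length 1024 */
    memset(p, 'b', 1024); p += 1024;
    return (size_t)(p - buf);
}
```

Calling build_sample on a 1099-byte buffer and passing the result to scan_records yields three rows whose used, tag, wire_type, len and pref_len values match the table, with rem_after equal to 1033, 1027 and 0.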

Example

test-proto3.proto

Run protoc-c --c_out=. test-proto3.proto to generate the .c and .h files.

syntax = "proto3";

package foo;

message Person {
  string name = 1;
  int32 id1 = 2;
  int32 id2 = 3;
  string email = 4;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    message Comment {
      string comment = 1;
    }

    string number = 1;
    PhoneType type = 2;
    Comment comment = 3;
  }

  repeated PhoneNumber phone = 5;
}

message LookupResult
{
  Person person = 1;
}

message Name {
  string name = 1;
};

service DirLookup {
  rpc ByName (Name) returns (LookupResult);
}

main.c

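#include "test-proto3.pb-c.h"
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


int main(int argc, char **argv)
{
	size_t pack_len;
	size_t packed;
	size_t i;
	uint8_t *pack_buf = NULL;
	Foo__Person person = FOO__PERSON__INIT;
	Foo__Person *person_unpack;

	(void) argc;
	(void) argv;

	printf("compile date: %s %s\n", __DATE__, __TIME__);

	/* Strings must be NUL-terminated: the packer measures them with strlen(). */
	person.name = malloc(64 + 1);
	person.email = malloc(1024 + 1);
	if (person.name == NULL || person.email == NULL)
		return 1;
	memset(person.name, 'a', 64);
	person.name[64] = '\0';
	person.id2 = 2100000000;
	memset(person.email, 'b', 1024);
	person.email[1024] = '\0';

	pack_len = foo__person__get_packed_size(&person);
	pack_buf = malloc(pack_len);
	if (pack_buf == NULL)
		return 1;

	packed = foo__person__pack(&person, pack_buf);
	printf("get size=%zu %zu\n", pack_len, packed);

	/* The packed buffer is binary, not a C string: dump the first bytes as hex. */
	for (i = 0; i < packed && i < 16; i++)
		printf("%02x ", pack_buf[i]);
	printf("\n");

	person_unpack = foo__person__unpack(NULL, pack_len, pack_buf);
	if (person_unpack == NULL) {
		fprintf(stderr, "unpack failed\n");
		return 1;
	}

	printf("unpack name:%s\n", person_unpack->name);

	/* Release the unpacked message and our own buffers. */
	foo__person__free_unpacked(person_unpack, NULL);
	free(pack_buf);
	free(person.name);
	free(person.email);

	return 0;
}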

Makefile: builds with debug information and links the protobuf-c static library in

# Makefile for building the protobuf-c example

# Compiler and Linker
CC := gcc

# The Target Binary Program
TARGET := test_proto

# Directories
LIBDIR := ./lib
INCDIR := ./include

# The Directories, Source, Includes, Objects, Binary and Resources
SRCDIR := .
SRCS := $(SRCDIR)/main.c $(SRCDIR)/test-proto3.pb-c.c

# Flags, Libraries and Includes
CFLAGS := -I$(INCDIR) -g -O0 -ggdb
# LDFLAGS := -L$(LIBDIR) -lprotobuf-c
LDFLAGS := -L$(LIBDIR) -lprotobuf-c -static

# Default make
all: $(TARGET)

# Link the target with all objects files
$(TARGET): $(SRCS)
	$(CC) $(SRCS) -o $@ $(CFLAGS) $(LDFLAGS)

# Clean the build directory
clean:
	rm -f $(TARGET)

# Non-file targets
.PHONY: all clean

Full source of protobuf_c_message_unpack

ProtobufCMessage *
protobuf_c_message_unpack(const ProtobufCMessageDescriptor *desc,
			  ProtobufCAllocator *allocator,
			  size_t len, const uint8_t *data)
{
	ProtobufCMessage *rv;
	size_t rem = len;
	const uint8_t *at = data;
	const ProtobufCFieldDescriptor *last_field = desc->fields + 0;
	ScannedMember first_member_slab[1UL <<
					FIRST_SCANNED_MEMBER_SLAB_SIZE_LOG2];

	/*
	 * scanned_member_slabs[i] is an array of arrays of ScannedMember.
	 * The first slab (scanned_member_slabs[0] is just a pointer to
	 * first_member_slab), above. All subsequent slabs will be allocated
	 * using the allocator.
	 */
	ScannedMember *scanned_member_slabs[MAX_SCANNED_MEMBER_SLAB + 1];
	unsigned which_slab = 0; /* the slab we are currently populating */
	unsigned in_slab_index = 0; /* number of members in the slab */
	size_t n_unknown = 0;
	unsigned f;
	unsigned j;
	unsigned i_slab;
	unsigned last_field_index = 0;
	unsigned required_fields_bitmap_len;
	unsigned char required_fields_bitmap_stack[16];
	unsigned char *required_fields_bitmap = required_fields_bitmap_stack;
	protobuf_c_boolean required_fields_bitmap_alloced = FALSE;

	ASSERT_IS_MESSAGE_DESCRIPTOR(desc);

	if (allocator == NULL)
		allocator = &protobuf_c__allocator;

	rv = do_alloc(allocator, desc->sizeof_message);
	if (!rv)
		return (NULL);
	scanned_member_slabs[0] = first_member_slab;

	required_fields_bitmap_len = (desc->n_fields + 7) / 8;
	if (required_fields_bitmap_len > sizeof(required_fields_bitmap_stack)) {
		required_fields_bitmap = do_alloc(allocator, required_fields_bitmap_len);
		if (!required_fields_bitmap) {
			do_free(allocator, rv);
			return (NULL);
		}
		required_fields_bitmap_alloced = TRUE;
	}
	memset(required_fields_bitmap, 0, required_fields_bitmap_len);

	/*
	 * Generated code always defines "message_init". However, we provide a
	 * fallback for (1) users of old protobuf-c generated-code that do not
	 * provide the function, and (2) descriptors constructed from some other
	 * source (most likely, direct construction from the .proto file).
	 */
	if (desc->message_init != NULL)
		protobuf_c_message_init(desc, rv);
	else
		message_init_generic(desc, rv);

	while (rem > 0) {
		uint32_t tag;
		uint8_t wire_type;
		size_t used = parse_tag_and_wiretype(rem, at, &tag, &wire_type);
		const ProtobufCFieldDescriptor *field;
		ScannedMember tmp;

		if (used == 0) {
			PROTOBUF_C_UNPACK_ERROR("error parsing tag/wiretype at offset %u",
						(unsigned) (at - data));
			goto error_cleanup_during_scan;
		}
		/*
		 * \todo Consider optimizing for field[1].id == tag, if field[1]
		 * exists!
		 */
		if (last_field == NULL || last_field->id != tag) {
			/* lookup field */
			int field_index =
			    int_range_lookup(desc->n_field_ranges,
					     desc->field_ranges,
					     tag);
			if (field_index < 0) {
				field = NULL;
				n_unknown++;
			} else {
				field = desc->fields + field_index;
				last_field = field;
				last_field_index = field_index;
			}
		} else {
			field = last_field;
		}

		if (field != NULL && field->label == PROTOBUF_C_LABEL_REQUIRED)
			REQUIRED_FIELD_BITMAP_SET(last_field_index);

		at += used;
		rem -= used;
		tmp.tag = tag;
		tmp.wire_type = wire_type;
		tmp.field = field;
		tmp.data = at;
		tmp.length_prefix_len = 0;

		switch (wire_type) {
		case PROTOBUF_C_WIRE_TYPE_VARINT: {
			unsigned max_len = rem < 10 ? rem : 10;
			unsigned i;

			for (i = 0; i < max_len; i++)
				if ((at[i] & 0x80) == 0)
					break;
			if (i == max_len) {
				PROTOBUF_C_UNPACK_ERROR("unterminated varint at offset %u",
							(unsigned) (at - data));
				goto error_cleanup_during_scan;
			}
			tmp.len = i + 1;
			break;
		}
		case PROTOBUF_C_WIRE_TYPE_64BIT:
			if (rem < 8) {
				PROTOBUF_C_UNPACK_ERROR("too short after 64bit wiretype at offset %u",
							(unsigned) (at - data));
				goto error_cleanup_during_scan;
			}
			tmp.len = 8;
			break;
		case PROTOBUF_C_WIRE_TYPE_LENGTH_PREFIXED: {
			size_t pref_len;

			tmp.len = scan_length_prefixed_data(rem, at, &pref_len);
			if (tmp.len == 0) {
				/* NOTE: scan_length_prefixed_data calls UNPACK_ERROR */
				goto error_cleanup_during_scan;
			}
			tmp.length_prefix_len = pref_len;
			break;
		}
		case PROTOBUF_C_WIRE_TYPE_32BIT:
			if (rem < 4) {
				PROTOBUF_C_UNPACK_ERROR("too short after 32bit wiretype at offset %u",
					      (unsigned) (at - data));
				goto error_cleanup_during_scan;
			}
			tmp.len = 4;
			break;
		default:
			PROTOBUF_C_UNPACK_ERROR("unsupported tag %u at offset %u",
						wire_type, (unsigned) (at - data));
			goto error_cleanup_during_scan;
		}

		if (in_slab_index == (1UL <<
			(which_slab + FIRST_SCANNED_MEMBER_SLAB_SIZE_LOG2)))
		{
			size_t size;

			in_slab_index = 0;
			if (which_slab == MAX_SCANNED_MEMBER_SLAB) {
				PROTOBUF_C_UNPACK_ERROR("too many fields");
				goto error_cleanup_during_scan;
			}
			which_slab++;
			size = sizeof(ScannedMember)
				<< (which_slab + FIRST_SCANNED_MEMBER_SLAB_SIZE_LOG2);
			scanned_member_slabs[which_slab] = do_alloc(allocator, size);
			if (scanned_member_slabs[which_slab] == NULL)
				goto error_cleanup_during_scan;
		}
		scanned_member_slabs[which_slab][in_slab_index++] = tmp;

		if (field != NULL && field->label == PROTOBUF_C_LABEL_REPEATED) {
			size_t *n = STRUCT_MEMBER_PTR(size_t, rv,
						      field->quantifier_offset);
			if (wire_type == PROTOBUF_C_WIRE_TYPE_LENGTH_PREFIXED &&
			    (0 != (field->flags & PROTOBUF_C_FIELD_FLAG_PACKED) ||
			     is_packable_type(field->type)))
			{
				size_t count;
				if (!count_packed_elements(field->type,
							   tmp.len -
							   tmp.length_prefix_len,
							   tmp.data +
							   tmp.length_prefix_len,
							   &count))
				{
					PROTOBUF_C_UNPACK_ERROR("counting packed elements");
					goto error_cleanup_during_scan;
				}
				*n += count;
			} else {
				*n += 1;
			}
		}

		at += tmp.len;
		rem -= tmp.len;
	}

	/* allocate space for repeated fields, also check that all required fields have been set */
	for (f = 0; f < desc->n_fields; f++) {
		const ProtobufCFieldDescriptor *field = desc->fields + f;
		if (field == NULL) {
			continue;
		}
		if (field->label == PROTOBUF_C_LABEL_REPEATED) {
			size_t siz =
			    sizeof_elt_in_repeated_array(field->type);
			size_t *n_ptr =
			    STRUCT_MEMBER_PTR(size_t, rv,
					      field->quantifier_offset);
			if (*n_ptr != 0) {
				unsigned n = *n_ptr;
				void *a;
				*n_ptr = 0;
				assert(rv->descriptor != NULL);
#define CLEAR_REMAINING_N_PTRS()                                              \
              for(f++;f < desc->n_fields; f++)                                \
                {                                                             \
                  field = desc->fields + f;                                   \
                  if (field->label == PROTOBUF_C_LABEL_REPEATED)              \
                    STRUCT_MEMBER (size_t, rv, field->quantifier_offset) = 0; \
                }
				a = do_alloc(allocator, siz * n);
				if (!a) {
					CLEAR_REMAINING_N_PTRS();
					goto error_cleanup;
				}
				STRUCT_MEMBER(void *, rv, field->offset) = a;
			}
		} else if (field->label == PROTOBUF_C_LABEL_REQUIRED) {
			if (field->default_value == NULL &&
			    !REQUIRED_FIELD_BITMAP_IS_SET(f))
			{
				CLEAR_REMAINING_N_PTRS();
				PROTOBUF_C_UNPACK_ERROR("message '%s': missing required field '%s'",
							desc->name, field->name);
				goto error_cleanup;
			}
		}
	}
#undef CLEAR_REMAINING_N_PTRS

	/* allocate space for unknown fields */
	if (n_unknown) {
		rv->unknown_fields = do_alloc(allocator,
					      n_unknown * sizeof(ProtobufCMessageUnknownField));
		if (rv->unknown_fields == NULL)
			goto error_cleanup;
	}

	/* do real parsing */
	for (i_slab = 0; i_slab <= which_slab; i_slab++) {
		unsigned max = (i_slab == which_slab) ?
			in_slab_index : (1UL << (i_slab + 4));
		ScannedMember *slab = scanned_member_slabs[i_slab];

		for (j = 0; j < max; j++) {
			if (!parse_member(slab + j, rv, allocator)) {
				PROTOBUF_C_UNPACK_ERROR("error parsing member %s of %s",
							slab->field ? slab->field->name : "*unknown-field*",
					desc->name);
				goto error_cleanup;
			}
		}
	}

	/* cleanup */
	for (j = 1; j <= which_slab; j++)
		do_free(allocator, scanned_member_slabs[j]);
	if (required_fields_bitmap_alloced)
		do_free(allocator, required_fields_bitmap);
	return rv;

error_cleanup:
	protobuf_c_message_free_unpacked(rv, allocator);
	for (j = 1; j <= which_slab; j++)
		do_free(allocator, scanned_member_slabs[j]);
	if (required_fields_bitmap_alloced)
		do_free(allocator, required_fields_bitmap);
	return NULL;

error_cleanup_during_scan:
	do_free(allocator, rv);
	for (j = 1; j <= which_slab; j++)
		do_free(allocator, scanned_member_slabs[j]);
	if (required_fields_bitmap_alloced)
		do_free(allocator, required_fields_bitmap);
	return NULL;
}