Protobuf: What Is Worth Learning from the Source Code

Posted on 2018-05-27

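Using macros to improve code readability (the aesthetics of code)

Example 1. The CHARACTER_CLASS macro

Definition

In tokenizer.cc, the tokenizer needs to decide which class a given character belongs to. Each character class is defined with the CHARACTER_CLASS macro, which generates a small class exposing a static InClass() predicate.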

#define CHARACTER_CLASS(NAME, EXPRESSION)      \
  class NAME {                                 \
   public:                                     \
    static inline bool InClass(char c) {       \
      return EXPRESSION;                       \
    }                                          \
  }

CHARACTER_CLASS(Whitespace, c == ' ' || c == '\n' || c == '\t' ||
                            c == '\r' || c == '\v' || c == '\f');

CHARACTER_CLASS(Unprintable, c < ' ' && c > '\0');

CHARACTER_CLASS(Digit, '0' <= c && c <= '9');
CHARACTER_CLASS(OctalDigit, '0' <= c && c <= '7');
CHARACTER_CLASS(HexDigit, ('0' <= c && c <= '9') ||
                          ('a' <= c && c <= 'f') ||
                          ('A' <= c && c <= 'F'));

CHARACTER_CLASS(Letter, ('a' <= c && c <= 'z') ||
                        ('A' <= c && c <= 'Z') ||
                        (c == '_'));

CHARACTER_CLASS(Alphanumeric, ('a' <= c && c <= 'z') ||
                              ('A' <= c && c <= 'Z') ||
                              ('0' <= c && c <= '9') ||
                              (c == '_'));

CHARACTER_CLASS(Escape, c == 'a' || c == 'b' || c == 'f' || c == 'n' ||
                        c == 'r' || c == 't' || c == 'v' || c == '\\' ||
                        c == '?' || c == '\'' || c == '\"');

#undef CHARACTER_CLASS

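Usage

Because every character class exposes the same static InClass(), a class can be passed as a template argument and called directly: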

template<typename CharacterClass>
inline bool Tokenizer::TryConsumeOne() {
  if (CharacterClass::InClass(current_char_)) {
    NextChar();
    return true;
  } else {
    return false;
  }
}
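To see the pattern in isolation, here is a minimal self-contained sketch. It is not the protobuf code: the free function ConsumeWhile is a hypothetical helper introduced for illustration, and only the shape of the macro follows the definition above.

```cpp
#include <cstddef>
#include <string>

// Same shape as the macro above: each invocation creates a tiny class
// whose only member is a static predicate.
#define CHARACTER_CLASS(NAME, EXPRESSION)                      \
  class NAME {                                                 \
   public:                                                     \
    static inline bool InClass(char c) { return EXPRESSION; }  \
  }

CHARACTER_CLASS(Digit, '0' <= c && c <= '9');
CHARACTER_CLASS(Letter, ('a' <= c && c <= 'z') ||
                        ('A' <= c && c <= 'Z') ||
                        (c == '_'));

#undef CHARACTER_CLASS

// Hypothetical helper (not in protobuf): consumes the longest run of
// characters starting at `pos` that belong to CharacterClass and returns
// how many characters were consumed.
template <typename CharacterClass>
std::size_t ConsumeWhile(const std::string& text, std::size_t& pos) {
  std::size_t start = pos;
  while (pos < text.size() && CharacterClass::InClass(text[pos])) {
    ++pos;
  }
  return pos - start;
}
```

The character class is chosen at compile time, so the predicate can be inlined into the loop and no virtual call or function pointer is involved.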

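Example 2. The PROTOBUF_DEFINE_ACCESSOR macro

A FieldDescriptor can describe a field of many different types, so the class exposes one default-value accessor per type. To keep these repetitive functions readable, descriptor.cc generates them with the following macro:

Definition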

// These macros makes this repetitive code more readable.
#define PROTOBUF_DEFINE_ACCESSOR(CLASS, FIELD, TYPE) \
  inline TYPE CLASS::FIELD() const { return FIELD##_; }

Usage

PROTOBUF_DEFINE_ACCESSOR(FieldDescriptor, default_value_int32, int32)
PROTOBUF_DEFINE_ACCESSOR(FieldDescriptor, has_default_value, bool)
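The same token-pasting trick can be tried on a small class. The sketch below is an illustration only: Person and DEFINE_ACCESSOR are hypothetical names, not protobuf types or macros.

```cpp
#include <string>

// Hypothetical class used only for illustration; not a protobuf type.
class Person {
 public:
  Person(const std::string& name, int age) : name_(name), age_(age) {}
  const std::string& name() const;
  int age() const;

 private:
  std::string name_;
  int age_;
};

// FIELD##_ pastes the accessor name and a trailing underscore together to
// form the member name, so one line per field replaces one hand-written
// function body.
#define DEFINE_ACCESSOR(CLASS, FIELD, TYPE) \
  inline TYPE CLASS::FIELD() const { return FIELD##_; }

DEFINE_ACCESSOR(Person, name, const std::string&)
DEFINE_ACCESSOR(Person, age, int)

#undef DEFINE_ACCESSOR
```

The macro only works because the accessor name() and the member name_ follow a fixed naming convention.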

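Example 3. The BUILD_ARRAY macro

Definition

BUILD_ARRAY is defined as follows. INPUT is the proto; OUTPUT is the descriptor being built from it; NAME is the member to build; METHOD is the function called to build each element of that member; PARENT is the enclosing descriptor when the element is nested.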

// A common pattern:  We want to convert a repeated field in the descriptor
// to an array of values, calling some method to build each value.
#define BUILD_ARRAY(INPUT, OUTPUT, NAME, METHOD, PARENT)             \
  OUTPUT->NAME##_count_ = INPUT.NAME##_size();                       \
  AllocateArray(INPUT.NAME##_size(), &OUTPUT->NAME##s_);             \
  for (int i = 0; i < INPUT.NAME##_size(); i++) {                    \
    METHOD(INPUT.NAME(i), PARENT, OUTPUT->NAME##s_ + i);             \
  }

Usage

DescriptorBuilder::BuildFile() uses the macro to build the descriptor from a FileDescriptorProto& proto:

// Convert children.
BUILD_ARRAY(proto, result, message_type, BuildMessage  , NULL);
BUILD_ARRAY(proto, result, enum_type   , BuildEnum     , NULL);
BUILD_ARRAY(proto, result, service     , BuildService  , NULL);
BUILD_ARRAY(proto, result, extension   , BuildExtension, NULL);

Notes

Each Descriptor class stores its children as a count plus a contiguous array, for example:

class LIBPROTOBUF_EXPORT FileDescriptor {

  // Other code omitted.

  int message_type_count_;
  Descriptor* message_types_;

  // Other code omitted.

};

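Lazy resource allocation and processing

Example 1. Layered data design in DescriptorPool

DescriptorPool manages its data in several layers (ignoring the underlay layer, which is used only inside protobuf and is not recommended for general use):

Top layer: DescriptorPool::Tables tables_, which maps name -> descriptor;
Bottom layer: DescriptorDatabase* fallback_database_, which maps name -> FileDescriptorProto (the proto, not a built FileDescriptor).

On lookup, if the name is not found in tables_, the pool eventually looks up the corresponding proto in fallback_database_, uses a temporary DescriptorBuilder to call the Build*() family of methods, adds the resulting descriptors to tables_, and then looks the name up in tables_ again.

The purpose of this layered design is:

To populate a DescriptorPool on demand from some "large" database. When the database is very large, calling DescriptorPool::BuildFile() on every proto file in it is inefficient.
To improve efficiency, the DescriptorPool wraps the DescriptorDatabase and builds only the descriptors that are actually needed.
For each proto file compiled into the binary, the descriptors it contains are not all built at process startup. Building is deferred until a descriptor is really needed:
(1) the user calls a method such as descriptor(), GetDescriptor() or GetReflection() that must return a descriptor;
(2) the user looks up a descriptor in DescriptorPool::generated_pool().

In other words, descriptor construction is deferred: a descriptor is built only when it is first used. This suits programs that depend on many proto files but use only a few of them.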
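The two-layer lookup can be sketched in a few lines. This is a simplified model under stated assumptions, not the DescriptorPool implementation: LazyPool, RawProto and BuiltDescriptor are hypothetical stand-ins, and the fallback database is reduced to a callback.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-ins for FileDescriptorProto and Descriptor (assumed).
struct RawProto { std::string name; };
struct BuiltDescriptor { std::string name; };

class LazyPool {
 public:
  // Plays the role of DescriptorDatabase: returns true and fills `out`
  // if it knows the name.
  using Fallback = std::function<bool(const std::string&, RawProto*)>;

  explicit LazyPool(Fallback fallback) : fallback_(std::move(fallback)) {}

  // Returns nullptr if neither layer knows `name`.
  const BuiltDescriptor* Find(const std::string& name) {
    auto it = tables_.find(name);
    if (it != tables_.end()) return it->second.get();  // Top-layer hit.

    RawProto proto;
    if (!fallback_(name, &proto)) return nullptr;  // Unknown everywhere.

    // Build on demand, cache the result, and answer from the cache.
    ++build_count_;
    auto built = std::make_unique<BuiltDescriptor>();
    built->name = proto.name;
    const BuiltDescriptor* result = built.get();
    tables_[name] = std::move(built);
    return result;
  }

  int build_count() const { return build_count_; }

 private:
  std::map<std::string, std::unique_ptr<BuiltDescriptor>> tables_;
  Fallback fallback_;
  int build_count_ = 0;
};
```

A second Find() for the same name is answered from tables_ without touching the fallback, so the build cost is paid once per descriptor that is actually used.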

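Example 2. Lazy loading of mappings in GeneratedMessageFactory

The Descriptor* -> Message* mappings managed by GeneratedMessageFactory are not all registered up front. Only when a message has to be looked up from a descriptor (a call to GeneratedMessageFactory::GetPrototype()) does the factory:

find the registration function by file name;
call the registration function, which registers the Descriptor* -> Message* mappings;
look up the Message* in hash_map<const Descriptor*, const Message*> type_map_ and return it.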

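Resource management and memory reuse

The RepeatedPtrFieldBase class

The parent class of RepeatedPtrField (not a class template itself, but it provides several member function templates). It stores and manages void* values, which are the actual addresses of the message objects, in a contiguous array.

All messages in the array share one descriptor and differ only in the data they hold. To save the cost of allocating and freeing message objects, a message can be cleared instead of deleted. Clearing sets primitive fields to 0 and calls clear() on the other fields; like std::string::clear(), this discards the data without releasing the memory.
The cleared message keeps its slot in the array. The next time a message has to be added, the cleared messages are used first (implemented in RepeatedPtrFieldBase::AddFromCleared(); see GeneratedMessageReflection::AddMessage() for how it is called).

Several cursors are used to keep track of the cleared message pointers:

current_size_: the number of messages currently in use, which is also the index of the next slot to hand out;
allocated_size_: the number of messages that have been allocated; current_size_ <= allocated_size_, and the messages in [current_size_, allocated_size_) are the cleared ones;
total_size_: the length of elements_[]; the void* slots in [allocated_size_, total_size_) are invalid and do not point to any message.

The corresponding memory layout is shown in the figure below:

[Figure: memory layout of elements_[] with the current_size_, allocated_size_ and total_size_ cursors]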
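The cursor scheme can be modelled as follows. This is a sketch under simplifying assumptions, not the RepeatedPtrFieldBase code: ReusableStringArray is a hypothetical class, it stores std::string only, and a std::vector stands in for the raw elements_ array, so elements_.size() plays the role of allocated_size_ and elements_.capacity() that of total_size_.

```cpp
#include <cstddef>
#include <string>
#include <vector>

class ReusableStringArray {
 public:
  ReusableStringArray() = default;
  ReusableStringArray(const ReusableStringArray&) = delete;
  ReusableStringArray& operator=(const ReusableStringArray&) = delete;

  ~ReusableStringArray() {
    for (std::string* s : elements_) delete s;
  }

  // Returns a string ready for use: a cleared one if available,
  // otherwise a newly allocated one.
  std::string* Add() {
    if (current_size_ < elements_.size()) {
      return elements_[current_size_++];  // Reuse a cleared object.
    }
    elements_.push_back(new std::string);
    ++allocations_;
    return elements_[current_size_++];
  }

  // Clears the live objects but keeps them allocated for later reuse.
  void Clear() {
    for (std::size_t i = 0; i < current_size_; ++i) elements_[i]->clear();
    current_size_ = 0;
  }

  std::size_t size() const { return current_size_; }
  std::size_t allocated() const { return elements_.size(); }
  int allocations() const { return allocations_; }

 private:
  std::vector<std::string*> elements_;
  std::size_t current_size_ = 0;
  int allocations_ = 0;
};
```

After Clear(), size() drops to 0 while allocated() is unchanged, and the next Add() hands back the same object that was in slot 0 before.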

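Wrapping several types behind one uniform interface

For types whose data and behavior are simple, protobuf wraps the variants with lightweight tools (struct, enum, union, switch-case) instead of inheritance.

A Symbol can be one of several kinds. enum Type records the concrete kind, and the union lets all kinds share the same storage: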

struct Symbol {
  enum Type {
    NULL_SYMBOL, MESSAGE, FIELD, ENUM, ENUM_VALUE, SERVICE, METHOD, PACKAGE
  };
  Type type;
  union {
    const Descriptor* descriptor;
    const FieldDescriptor* field_descriptor;
    const EnumDescriptor* enum_descriptor;
    const EnumValueDescriptor* enum_value_descriptor;
    const ServiceDescriptor* service_descriptor;
    const MethodDescriptor* method_descriptor;
    const FileDescriptor* package_file_descriptor;
  };

  inline Symbol() : type(NULL_SYMBOL) { descriptor = NULL; }
  // ... (part omitted)

The CONSTRUCTOR macro keeps the per-kind Symbol constructors readable:

#define CONSTRUCTOR(TYPE, TYPE_CONSTANT, FIELD)  \
  inline explicit Symbol(const TYPE* value) {    \
    type = TYPE_CONSTANT;                        \
    this->FIELD = value;                         \
  }

Using the CONSTRUCTOR macro:

  CONSTRUCTOR(Descriptor         , MESSAGE   , descriptor             )
  CONSTRUCTOR(FieldDescriptor    , FIELD     , field_descriptor       )
  CONSTRUCTOR(EnumDescriptor     , ENUM      , enum_descriptor        )
  CONSTRUCTOR(EnumValueDescriptor, ENUM_VALUE, enum_value_descriptor  )
  CONSTRUCTOR(ServiceDescriptor  , SERVICE   , service_descriptor     )
  CONSTRUCTOR(MethodDescriptor   , METHOD    , method_descriptor      )
  CONSTRUCTOR(FileDescriptor     , PACKAGE   , package_file_descriptor)
#undef CONSTRUCTOR

At the point of use, the code dispatches on type:

  const FileDescriptor* GetFile() const {
    switch (type) {
      case NULL_SYMBOL: return NULL;
      case MESSAGE    : return descriptor           ->file();
      case FIELD      : return field_descriptor     ->file();
      case ENUM       : return enum_descriptor      ->file();
      case ENUM_VALUE : return enum_value_descriptor->type()->file();
      case SERVICE    : return service_descriptor   ->file();
      case METHOD     : return method_descriptor    ->service()->file();
      case PACKAGE    : return package_file_descriptor;
    }
    return NULL;
  }
};
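The same enum + union + switch pattern in miniature, as a hedged illustration: Shape, Square and Rect are hypothetical types invented for this sketch and have nothing to do with protobuf.

```cpp
// Hypothetical payload types standing in for the descriptor classes.
struct Square { int side; };
struct Rect { int width; int height; };

struct Shape {
  enum Type { NULL_SHAPE, SQUARE, RECT };
  Type type;
  union {  // All variants share the storage of one pointer.
    const Square* square;
    const Rect* rect;
  };

  Shape() : type(NULL_SHAPE), square(nullptr) {}
  explicit Shape(const Square* value) : type(SQUARE), square(value) {}
  explicit Shape(const Rect* value) : type(RECT), rect(value) {}

  // Dispatch on the tag instead of on a virtual function.
  int Area() const {
    switch (type) {
      case NULL_SHAPE: return 0;
      case SQUARE:     return square->side * square->side;
      case RECT:       return rect->width * rect->height;
    }
    return 0;
  }
};
```

A Shape is a tag plus one pointer, can be copied freely, and needs no heap allocation or vtable, which is the trade-off this style makes against an inheritance hierarchy.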

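Giving each template argument class its own type under a common name

GenericTypeHandler and StringTypeHandler are used as template type arguments (type handlers). When the subclass RepeatedPtrField calls a member function template of its parent RepeatedPtrFieldBase, it passes the handler as the template argument, and the parent must return a different type depending on the handler: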

template <typename TypeHandler>
inline const typename TypeHandler::Type&
RepeatedPtrFieldBase::Get(int index) const {
  GOOGLE_DCHECK_LT(index, size());
  return *cast<TypeHandler>(elements_[index]);
}

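For this reason each handler class uses a typedef to expose its element type under the same name, Type. For templates, the key point is that every argument class provides the same names.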

GenericTypeHandler

template <typename GenericType>
class GenericTypeHandler {
 public:
  typedef GenericType Type;
  
  static GenericType* New() { return new GenericType; }
  static void Delete(GenericType* value) { delete value; }
  static void Clear(GenericType* value) { value->Clear(); }
  static void Merge(const GenericType& from, GenericType* to) {
    to->MergeFrom(from);
  }
  static int SpaceUsed(const GenericType& value) { return value.SpaceUsed(); }
};

StringTypeHandler

// HACK:  If a class is declared as DLL-exported in MSVC, it insists on
//   generating copies of all its methods -- even inline ones -- to include
//   in the DLL.  But SpaceUsed() calls StringSpaceUsedExcludingSelf() which
//   isn't in the lite library, therefore the lite library cannot link if
//   StringTypeHandler is exported.  So, we factor out StringTypeHandlerBase,
//   export that, then make StringTypeHandler be a subclass which is NOT
//   exported.
// TODO(kenton):  There has to be a better way.

class LIBPROTOBUF_EXPORT StringTypeHandlerBase {
 public:
  typedef string Type;
  static string* New();
  static void Delete(string* value);
  static void Clear(string* value) { value->clear(); }
  static void Merge(const string& from, string* to) { *to = from; }
};

class LIBPROTOBUF_EXPORT StringTypeHandler : public StringTypeHandlerBase {
 public:
  static int SpaceUsed(const string& value)  {
    return sizeof(value) + StringSpaceUsedExcludingSelf(value);
  }
};

Class relationship diagram:

[Figure: class relationship diagram of the type handlers]
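A reduced version of the handler idea, written as a sketch rather than as the protobuf classes: PtrArrayBase, GenericHandler, StringHandler and Counter are hypothetical names, and the container does not own its elements.

```cpp
#include <string>
#include <vector>

// Hypothetical message-like type with a Clear() method.
struct Counter {
  int value = 0;
  void Clear() { value = 0; }
};

template <typename GenericType>
struct GenericHandler {
  typedef GenericType Type;  // Common name required by the base class.
  static void Clear(GenericType* value) { value->Clear(); }
};

struct StringHandler {
  typedef std::string Type;  // Same name, different type.
  static void Clear(std::string* value) { value->clear(); }
};

// Type-agnostic storage, loosely modelled on RepeatedPtrFieldBase: it
// holds void* and relies on the handler for everything type-specific.
class PtrArrayBase {
 public:
  void AddRaw(void* element) { elements_.push_back(element); }

  template <typename TypeHandler>
  typename TypeHandler::Type& Get(int index) {
    return *static_cast<typename TypeHandler::Type*>(elements_[index]);
  }

  template <typename TypeHandler>
  void ClearAll() {
    for (void* e : elements_) {
      TypeHandler::Clear(static_cast<typename TypeHandler::Type*>(e));
    }
  }

 private:
  std::vector<void*> elements_;
};
```

PtrArrayBase compiles once and never mentions Counter or std::string; the handler supplies both the element type (through Type) and the operations on it.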

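A stripped-down release to save resources

Setting the following option in the proto file makes protoc generate MessageLite subclasses, which do not support reflection or descriptors, instead of Message subclasses.

option optimize_for = LITE_RUNTIME;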

Protobuf: The Plugin Mechanism

Posted on 2018-05-27

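The plugin mechanism

protobuf is a serialization framework with a plugin mechanism. Besides the CodeGenerators for the languages that ship with protobuf, users can write their own plugins, either to support another language (for example protoc-gen-lua) or to add functionality (for example the company-internal mcpack2pb plugin).

Because protoc plugins have to work across platforms and languages, they run as a separate child process. Parent and child communicate through pipes (the child inherits the parent's file descriptors), and the format of the data they exchange is defined in compiler/plugin.proto.

The parent process (protoc) reads the proto files, converts them into a CodeGeneratorRequest, starts the child process, and later writes the content returned by the child to disk;
The child process (the custom plugin) receives the request, processes it as it sees fit, and returns the result to the parent process (protoc) as a CodeGeneratorResponse.

[Figure: interaction between protoc and the plugin process]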

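How parent and child share file descriptors:

Two pipes are created, each used in one direction only, relying on the fact that a child process inherits its parent's file descriptors.

On the child side, stdin_pipe[0] is redirected to STDIN_FILENO and stdout_pipe[1] to STDOUT_FILENO, so the plugin never needs to know the pipe fds and simply reads standard input and writes standard output;
On the parent side, data is written to stdin_pipe[1] and the reply is then read from stdout_pipe[0].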
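The redirection described above can be reproduced with plain POSIX calls. The sketch below is an illustration for POSIX systems, not the Subprocess class: RoundTrip and ChildMain are hypothetical names, the "plugin" merely upper-cases its input, and the parent writes everything before reading, which is only safe for small payloads.

```cpp
#include <cctype>
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for a plugin body: reads stdin, upper-cases it,
// writes stdout. It only ever sees fds 0 and 1, never the pipe fds.
static void ChildMain() {
  char buf[256];
  ssize_t n;
  while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
    for (ssize_t i = 0; i < n; ++i) {
      buf[i] = static_cast<char>(
          std::toupper(static_cast<unsigned char>(buf[i])));
    }
    if (write(STDOUT_FILENO, buf, static_cast<size_t>(n)) != n) _exit(1);
  }
  _exit(0);
}

// Sends `input` to a forked child and returns what the child wrote back.
std::string RoundTrip(const std::string& input) {
  int stdin_pipe[2];   // parent writes [1] -> child reads [0]
  int stdout_pipe[2];  // child writes [1] -> parent reads [0]
  if (pipe(stdin_pipe) != 0 || pipe(stdout_pipe) != 0) return "";

  pid_t pid = fork();
  if (pid < 0) return "";
  if (pid == 0) {
    // Child: make the pipe ends look like ordinary stdin/stdout.
    dup2(stdin_pipe[0], STDIN_FILENO);
    dup2(stdout_pipe[1], STDOUT_FILENO);
    close(stdin_pipe[0]);
    close(stdin_pipe[1]);
    close(stdout_pipe[0]);
    close(stdout_pipe[1]);
    ChildMain();  // Never returns.
  }

  // Parent: close the ends that belong to the child.
  close(stdin_pipe[0]);
  close(stdout_pipe[1]);

  ssize_t written = write(stdin_pipe[1], input.data(), input.size());
  (void)written;
  close(stdin_pipe[1]);  // Signals EOF to the child.

  std::string output;
  char buf[256];
  ssize_t n;
  while ((n = read(stdout_pipe[0], buf, sizeof(buf))) > 0) {
    output.append(buf, static_cast<size_t>(n));
  }
  close(stdout_pipe[0]);
  waitpid(pid, nullptr, 0);
  return output;
}
```

Closing stdin_pipe[1] in the parent is what lets the child's read() return 0; forgetting it is the classic way to make both processes wait on each other forever.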

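The parent/child communication interface:

The format is defined in compiler/plugin.proto and is itself written in the proto language.

CodeGeneratorRequest supplies the input at file granularity, as FileDescriptorProto messages: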

message CodeGeneratorRequest {
  repeated string file_to_generate = 1;
  optional string parameter = 2;
  repeated FileDescriptorProto proto_file = 15;
}

In CodeGeneratorResponse, the content of each generated file is returned directly as a string; the protoc main process is responsible for writing it to disk:

message CodeGeneratorResponse {
  optional string error = 1;

  // Represents a single generated file.
  message File {
    optional string name = 1;
    optional string insertion_point = 2;
    optional string content = 15;
  }
  repeated File file = 15;
}

Implementing a plugin:

Define MyCodeGenerator, a subclass of CodeGenerator, and call google::protobuf::compiler::PluginMain() directly from the main function of the plugin process:

int main(int argc, char* argv[]) {
  MyCodeGenerator generator;
  return google::protobuf::compiler::PluginMain(argc, argv, &generator);
}

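What google::protobuf::compiler::PluginMain() does:

reads the data sent by the protoc main process from STDIN_FILENO and parses it into a CodeGeneratorRequest request;
builds FileDescriptors from the FileDescriptorProtos;
calls MyCodeGenerator::Generate();
serializes the CodeGeneratorResponse response and writes it to STDOUT_FILENO for the protoc main process.

The Subprocess class

sets up the pipes between parent and child and starts the child process;
converts the data exchanged between parent and child to and from its serialized form.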

Protobuf: Compiler Classes and the Code Generation Flow

Posted on 2018-05-26

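Code generation flow:

The core flow is shown in the figure below:

[Figure: core code generation flow]

Core data structures

The CommandLineInterface class

generators_: map<string, GeneratorInfo>, which maps a flag such as "--cpp_out" to its generator (for example CppGenerator); the names of the requested generators are taken from the protoc arguments;
plugins_: map<string, string>. A plugin provides a CodeGenerator that is not built into protobuf, and it runs as a separate process. plugins_ maps plugin name -> path of the plugin executable on disk;
plugin_prefix_: set to "protoc-".

The SourceTree class

An interface class that represents the directory tree of .proto files.

The DiskSourceTree class

A subclass of SourceTree that loads files from disk. It maps physical disk paths to virtual paths in the SourceTree; a disk path can also be mapped to "", the root of the SourceTree. If several mappings match the same file, they are searched in the order in which they were added.

The Importer class

Returns the FileDescriptor for a .proto file given its name. Internally the work is done by a DescriptorPool.

The io::Tokenizer class

The lexer. One Tokenizer object processes one ZeroCopyInputStream and turns the raw text stream into a token sequence that the parser can consume. A caller only has to call Tokenizer::Next() and Tokenizer::current() in a loop to obtain the tokens in order, as if reading a tokenized stream.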

A token is defined as follows:

struct Token {
  TokenType type;
  string text;       // The exact text of the token as it appeared in
                     // the input.  e.g. tokens of TYPE_STRING will still
                     // be escaped and in quotes.
	
  // "line" and "column" specify the position of the first character of
  // the token within the input stream.  They are zero-based.
  int line;
  int column;
  int end_column;
};

Token types:

enum TokenType {
  TYPE_START,       // Next() has not yet been called.
  TYPE_END,         // End of input reached.  "text" is empty.
	
  TYPE_IDENTIFIER,  // A sequence of letters, digits, and underscores, not
                    // starting with a digit.  It is an error for a number
                    // to be followed by an identifier with no space in
                    // between.
  TYPE_INTEGER,     // A sequence of digits representing an integer.  Normally
                    // the digits are decimal, but a prefix of "0x" indicates
                    // a hex number and a leading zero indicates octal, just
                    // like with C numeric literals.  A leading negative sign
                    // is NOT included in the token; it's up to the parser to
                    // interpret the unary minus operator on its own.
  TYPE_FLOAT,       // A floating point literal, with a fractional part and/or
                    // an exponent.  Always in decimal.  Again, never
                    // negative.
  TYPE_STRING,      // A quoted sequence of escaped characters.  Either single
                    // or double quotes can be used, but they must match.
                    // A string literal cannot cross a line break.
  TYPE_SYMBOL,      // Any other printable character, like '!' or '+'.
                    // Symbols are always a single character, so "!+$%" is
                    // four tokens.
};

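Tokenizing runs in O(n). The process is:

buffer_ holds the raw data fetched from the ZeroCopyInputStream; current_ is the current token and previous_ is the previous one;
characters are divided into 8 classes (defined with the CHARACTER_CLASS macro): Whitespace/Unprintable/Digit/OctalDigit/HexDigit/Letter/Alphanumeric/Escape;
buffer_pos_ points at the character being processed and advances one character at a time. The type and boundaries of the current_ token are decided from the character class (sometimes together with previous_.type). The core logic is in Tokenizer::Next():
(1) first recognize and skip whitespace;
(2) then recognize and skip comments;
(3) then check for and handle unprintable characters;
(4) finally handle the remaining characters and produce a valid token.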
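The four steps can be sketched as a single left-to-right pass. This is a deliberately reduced model, not io::Tokenizer: Tokenize is a hypothetical function, only three token types and "//" comments are handled, and unprintable characters are silently skipped here, whereas a real tokenizer would report them.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum TokenType { TYPE_IDENTIFIER, TYPE_INTEGER, TYPE_SYMBOL };

struct Token {
  TokenType type;
  std::string text;
};

inline bool IsWhitespace(char c) {
  return c == ' ' || c == '\n' || c == '\t' ||
         c == '\r' || c == '\v' || c == '\f';
}
inline bool IsDigit(char c) { return '0' <= c && c <= '9'; }
inline bool IsLetter(char c) {
  return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z') || c == '_';
}

// One pass over the input; `pos` only moves forward, hence O(n).
std::vector<Token> Tokenize(const std::string& input) {
  std::vector<Token> tokens;
  std::size_t pos = 0;
  while (pos < input.size()) {
    char c = input[pos];
    if (IsWhitespace(c)) { ++pos; continue; }                    // (1)
    if (c == '/' && pos + 1 < input.size() && input[pos + 1] == '/') {
      while (pos < input.size() && input[pos] != '\n') ++pos;    // (2)
      continue;
    }
    if (static_cast<unsigned char>(c) < ' ') { ++pos; continue; }  // (3)

    // (4) Everything else starts a token; its class decides the type.
    std::size_t start = pos;
    TokenType type;
    if (IsLetter(c)) {
      type = TYPE_IDENTIFIER;
      while (pos < input.size() &&
             (IsLetter(input[pos]) || IsDigit(input[pos]))) ++pos;
    } else if (IsDigit(c)) {
      type = TYPE_INTEGER;
      while (pos < input.size() && IsDigit(input[pos])) ++pos;
    } else {
      type = TYPE_SYMBOL;  // Symbols are always a single character.
      ++pos;
    }
    tokens.push_back(Token{type, input.substr(start, pos - start)});
  }
  return tokens;
}
```

Because the first character of a token fixes its type, the loop never needs to back up, which is where the linear running time comes from.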

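The Parser class

The parser. It turns a Tokenizer (the token stream of a proto file) into a FileDescriptorProto.
It is a recursive descent parser: https://en.wikipedia.org/wiki/Recursive_descent_parser

Core data members:

io::Tokenizer* input_;                // The token stream to parse.
SourceCodeInfo* source_code_info_;    // Records the location information (path and span) of all tokens in the proto file. Intended for development tools; it does not affect the rest of the resulting FileDescriptorProto.

Processing:

Parser::Parse() loops over the input tokenizer and calls Parser::ParseTopLevelStatement() for each statement. Note how root_location is passed down during parsing, so that each level inherits the location information of the level above it. The whole process follows the nesting structure of the proto file and is recursive.

[Figure: parsing flow of Parser::Parse()]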

bool Parser::Parse(io::Tokenizer* input, FileDescriptorProto* file) {

  // ... (omitted)
  LocationRecorder root_location(this);

  // ... (omitted)

    // Repeatedly parse statements until we reach the end of the file.
    while (!AtEnd()) {
      if (!ParseTopLevelStatement(file, root_location)) {
          // ... (omitted)

          input_->Next();
        }
      }
    }
  // ... (omitted)
}

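Location information is propagated as follows: the path of the parent is extended with FileDescriptorProto::kMessageTypeFieldNumber and with the number of messages currently in the file, which is the index of the new message in the parent's repeated array. Each level extends the path of the level above it: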

bool Parser::ParseTopLevelStatement(FileDescriptorProto* file,
                                    const LocationRecorder& root_location) {

  // ... (omitted)

	else if (LookingAt("message")) {
	    LocationRecorder location(root_location,
	      FileDescriptorProto::kMessageTypeFieldNumber, file->message_type_size());
	    return ParseMessageDefinition(file->add_message_type(), location);
	}
	
  // ... (omitted)
}

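The core of the work happens in Parser::ParseTopLevelStatement():

Each call to Parser::ParseTopLevelStatement() handles one complete top-level block (a whole message, enum, service, extend, and so on). Each block is processed level by level, following the grammar of the .proto file, and the fields of the FileDescriptorProto and its children are assigned at the lowest level (the leaves). For example, the label, type, name and number of a message field are assigned at the 'field' level.

[Figure: processing hierarchy of Parser::ParseTopLevelStatement()]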

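The Parser::LocationRecorder class

A private nested class of Parser that records one location in SourceCodeInfo.location. It follows RAII: the constructor records the start position and the destructor records the end position.

Core data members: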

Parser* parser_;
SourceCodeInfo::Location* location_;

Q: Looking at the call hierarchy:

SourceTreeDescriptorDatabase::FindFileByName(const string& filename, FileDescriptorProto* output) ->
Parser::Parse(io::Tokenizer* input, FileDescriptorProto* file) ->
Parser::ParseTopLevelStatement(FileDescriptorProto* file, const LocationRecorder& root_location)

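The first parameter of Parser::ParseTopLevelStatement(FileDescriptorProto* file, const LocationRecorder& root_location), file, is not input data; it is the output to be filled in. On entry to the function, file has not been populated yet. So why can the function call something like file->message_type_size() to read data from file?

A:

file is written to throughout parsing. Whenever a new child structure is about to be processed, one of the FileDescriptorProto::add*() methods is called to create it, so reading from file always yields its current state. In the example below, file->message_type_size() gives the index that the message being processed will have in the repeated message array, starting at 0. After file->add_message_type() has been called, file->message_type_size() returns a value that is larger by 1. For example: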

	bool Parser::ParseTopLevelStatement(FileDescriptorProto* file,
	                                    const LocationRecorder& root_location) {
	
	// ... (omitted)
	
	  } else if (LookingAt("message")) {
	    LocationRecorder location(root_location,
	      FileDescriptorProto::kMessageTypeFieldNumber, file->message_type_size());
	    return ParseMessageDefinition(file->add_message_type(), location);
	
	// ... (omitted)
}
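The "read the size, then append" idiom from the answer above is easy to check in isolation. FileModel and AppendMessage are hypothetical stand-ins written for this sketch; only the method names mirror the generated FileDescriptorProto accessors.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for FileDescriptorProto's repeated message_type.
struct FileModel {
  std::vector<std::string> message_types;

  int message_type_size() const {
    return static_cast<int>(message_types.size());
  }
  std::string* add_message_type() {
    message_types.emplace_back();
    return &message_types.back();
  }
};

// Returns the index the new message received in the repeated array.
int AppendMessage(FileModel* file, const std::string& name) {
  int index = file->message_type_size();  // Read the index first ...
  *file->add_message_type() = name;       // ... then grow the array.
  return index;
}
```

The order matters: reading the size after the append would record an index that is off by one.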

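The SourceCodeInfo class

Encapsulates information about the .proto source file from which the FileDescriptorProto was generated. It is defined in descriptor.proto as a message, so the generated class is a Message subclass.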

message SourceCodeInfo {
	repeated Location location = 1;
	message Location {
	    repeated int32 path = 1 [packed=true];
	    repeated int32 span = 2 [packed=true];
	}
}

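span: the position of a location in the proto file: [start_line, start_column, end_line, end_column].
path: the path of a location within the structure of the proto file, starting from FileDescriptorProto. It consists of the field number at each level, followed by an index whenever that field is repeated.

The SourceLocationTable class

Maps pair<descriptor, ErrorLocation> -> pair<line, column>. Core data structure: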

typedef map<
  pair<const Message*, DescriptorPool::ErrorCollector::ErrorLocation>,
  pair<int, int> > LocationMap;
LocationMap location_map_;

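CodeGenerator-related classes

The GeneratorContext class

An interface class that represents the location to which a CodeGenerator writes its files, together with other context in which the CodeGenerator runs.

The GeneratorContextImpl class

A subclass of GeneratorContext that holds the generated files in memory and then writes them to disk. Each GeneratorContext corresponds to one output location, so two generators that write to the same location must share one GeneratorContext.

The CodeGenerator class

An interface class that generates code from .proto definition files.

The OutputDirective struct: describes an output location and the generator that writes to it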

// output_directives_ lists all the files we are supposed to output and what
// generator to use for each.
struct OutputDirective {                                                                                                
  string name;                // E.g. "--foo_out"
  CodeGenerator* generator;   // NULL for plugins
  string parameter;
  string output_location;
};  
vector<OutputDirective> output_directives_;

Protobuf: The Reflection Class

Posted on 2018-05-20

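The Reflection class

An interface class that provides methods for dynamically reading and modifying the fields of a message. Call Message::GetReflection() to obtain the reflection object for a message.
These methods are kept out of the Message class for efficiency: the vast majority of message implementations share one Reflection implementation (GeneratedMessageReflection), and all objects of a given message class share the same reflection object.

Notes:

each field type FieldDescriptor::TYPE_* has its own Get*()/Set*()/Add*() methods;
repeated fields must be accessed with the GetRepeated*()/SetRepeated*() methods, which must not be mixed with the methods for non-repeated fields;
a message object may only be manipulated through its own reflection object (message.GetReflection()).

Why is there a separate Get*()/Set*() for every FieldDescriptor::TYPE_*?
Solving this with an abstract value type would add a layer of indirection, which would increase the memory used by messages and the risk of memory leaks, so this flat interface design is used instead.

The GeneratedMessageReflection class

A subclass of Reflection (the only one in the current version) that serves one fixed descriptor, which is determined when the GeneratedMessageReflection object is constructed. It is the central class of the reflection mechanism.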

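Internal implementation:

To operate on any piece of data, only two things need to be known:

its memory address;
its type.

GeneratedMessageReflection is designed the same way. It manages all fields of a message as base_addr + offset[i], where offset[i] is the offset of field i within the message object, and the descriptor holds the type of every field.

To operate on a given (message, field) pair:

get the index of the field within the message from the descriptor;
look up offset[index] to compute the memory address;
take the type from the descriptor;
reinterpret_cast the address to that type to obtain the data.

The key arguments passed when constructing a GeneratedMessageReflection object are:

descriptor: pointer to the descriptor of the message being managed;
offsets: the offsets of all members of the message class within a message object;
has_bits_offset: the offset of the bitmap that records whether each field is present. The bitmap is a member of the generated message class, and the offset is that of its element 0, _has_bits_[0]. It is ultimately used to decide whether an optional field is set;
unknown_fields_offset: similar to has_bits_offset; the offset of the member that stores unknown fields.

Fields have different types, so the void* must be converted to the matching type:

primitive and string fields are represented by the primitive type or by string*, respectively;
a singular message field is stored as a pointer to Message;
an enum field is stored as an int, which is the input to EnumDescriptor::FindValueByNumber();
repeated fields (see the chapter on repeated fields for details):
string and message types use RepeatedPtrField;
the other (primitive) types use RepeatedField.

Example:

Every .pb.cc file contains one GeneratedMessageReflection object per message. For example, for message CodeGeneratorRequest in protobuf/compiler/plugin.proto, protobuf/compiler/plugin.pb.cc contains: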

namespace {

const ::google::protobuf::Descriptor* CodeGeneratorRequest_descriptor_ = NULL;
const ::google::protobuf::internal::GeneratedMessageReflection*
  CodeGeneratorRequest_reflection_ = NULL;

// ... (omitted)
                                                                                                                          
}  // namespace

// ... (omitted)

void protobuf_AssignDesc_google_2fprotobuf_2fcompiler_2fplugin_2eproto() {

  // ... (omitted)

  // CodeGeneratorRequest has these 3 fields.
  static const int CodeGeneratorRequest_offsets_[3] = {
    GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, file_to_generate_),
    GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, parameter_),
    GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, proto_file_),
  };

  CodeGeneratorRequest_reflection_ =
    new ::google::protobuf::internal::GeneratedMessageReflection(
      CodeGeneratorRequest_descriptor_,
      CodeGeneratorRequest::default_instance_,
      CodeGeneratorRequest_offsets_,
      GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, _has_bits_[0]),
      GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, _unknown_fields_),
      -1,  
      ::google::protobuf::DescriptorPool::generated_pool(),
      ::google::protobuf::MessageFactory::generated_factory(),
      sizeof(CodeGeneratorRequest));

  // ... (omitted)

}

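The GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET macro

The GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET macro computes the offset of a field within the memory of its containing type.

Q: In the .pb.h definitions the fields are private members of the message class. Why can they be accessed through "->" here?

A: The function protobuf_AssignDesc_google_2fprotobuf_2fcompiler_2fplugin_2eproto() is declared a friend of each message class (the declaration sits in the private section).

The comment in the code is very informative. protobuf documents its key points in great detail, which is worth learning from.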

// Returns the offset of the given field within the given aggregate type.
// This is equivalent to the ANSI C offsetof() macro.  However, according
// to the C++ standard, offsetof() only works on POD types, and GCC
// enforces this requirement with a warning.  In practice, this rule is
// unnecessarily strict; there is probably no compiler or platform on
// which the offsets of the direct fields of a class are non-constant.
// Fields inherited from superclasses *can* have non-constant offsets,
// but that's not what this macro will be used for.
//
// Note that we calculate relative to the pointer value 16 here since if we
// just use zero, GCC complains about dereferencing a NULL pointer.  We
// choose 16 rather than some other number just in case the compiler would
// be confused by an unaligned pointer.
#define GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(TYPE, FIELD)    \
  static_cast<int>(                                           \
    reinterpret_cast<const char*>(                            \
      &reinterpret_cast<const TYPE*>(16)->FIELD) -            \
    reinterpret_cast<const char*>(16))

A worked example

Primitive types are used here to show how GeneratedMessageReflection manages fields of different types.

On the read side:

// Template implementations of basic accessors.  Inline because each
// template instance is only called from one location.  These are
// used for all types except messages.
template <typename Type>
inline const Type& GeneratedMessageReflection::GetField(
    const Message& message, const FieldDescriptor* field) const {
  return GetRaw<Type>(message, field);
}

Starting from the address of the message, GetRaw() adds the offset of the field within the message object to obtain the field's address, then reinterpret_casts it to Type (a primitive type here):

// These simple template accessors obtain pointers (or references) to
// the given field.
template <typename Type>
inline const Type& GeneratedMessageReflection::GetRaw(                                                                    
    const Message& message, const FieldDescriptor* field) const {
  const void* ptr = reinterpret_cast<const uint8*>(&message) +
                    offsets_[field->index()];
  return *reinterpret_cast<const Type*>(ptr);
}

offsets_[] is the array of per-field offsets within the message that was passed to the GeneratedMessageReflection constructor (CodeGeneratorRequest_offsets_[3] in the example above). field->index() is the position of the field in its parent's array of fields, implemented as follows:

inline int FieldDescriptor::index() const {
  // ... (omitted)
  return this - containing_type_->fields_;
  // ... (omitted)
}

On the write side:

template <typename Type>
inline void GeneratedMessageReflection::SetField(
    Message* message, const FieldDescriptor* field, const Type& value) const {
  *MutableRaw<Type>(message, field) = value;
  SetBit(message, field);
}

template <typename Type>
inline Type* GeneratedMessageReflection::MutableRaw(
    Message* message, const FieldDescriptor* field) const {
  void* ptr = reinterpret_cast<uint8*>(message) + offsets_[field->index()];
  return reinterpret_cast<Type*>(ptr);
}

has_bits_offset_ is the offset of the has-bits bitmap. Each field has one bit, and testing that bit is a fast way to tell whether the field is set:

inline void GeneratedMessageReflection::SetBit(
    Message* message, const FieldDescriptor* field) const {
  MutableHasBits(message)[field->index() / 32] |= (1 << (field->index() % 32));
}

inline uint32* GeneratedMessageReflection::MutableHasBits(
    Message* message) const {
  void* ptr = reinterpret_cast<uint8*>(message) + has_bits_offset_;
  return reinterpret_cast<uint32*>(ptr);
}
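The read path, the write path and the has-bits update fit together as in the sketch below. It is a simplified model, not GeneratedMessageReflection: Student, MiniReflection and the offset table are hypothetical, fields are addressed by a plain index instead of a FieldDescriptor, and offsetof() is used because the struct is standard-layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical flat message: a has-bits word followed by two int fields.
struct Student {
  std::uint32_t has_bits[1];
  std::int32_t id;
  std::int32_t age;
};

// Per-field offsets, indexed by field index (0 = id, 1 = age).
static const int kStudentOffsets[2] = {
    static_cast<int>(offsetof(Student, id)),
    static_cast<int>(offsetof(Student, age)),
};
static const int kHasBitsOffset =
    static_cast<int>(offsetof(Student, has_bits));

class MiniReflection {
 public:
  MiniReflection(const int* offsets, int has_bits_offset)
      : offsets_(offsets), has_bits_offset_(has_bits_offset) {}

  // Address = message base + per-field offset, then cast to the type.
  template <typename Type>
  const Type& GetField(const void* message, int index) const {
    const std::uint8_t* base = static_cast<const std::uint8_t*>(message);
    return *reinterpret_cast<const Type*>(base + offsets_[index]);
  }

  // Writes the value and records the field as present.
  template <typename Type>
  void SetField(void* message, int index, const Type& value) const {
    std::uint8_t* base = static_cast<std::uint8_t*>(message);
    *reinterpret_cast<Type*>(base + offsets_[index]) = value;
    std::uint32_t* bits =
        reinterpret_cast<std::uint32_t*>(base + has_bits_offset_);
    bits[index / 32] |= (1u << (index % 32));
  }

  bool HasField(const void* message, int index) const {
    const std::uint8_t* base = static_cast<const std::uint8_t*>(message);
    const std::uint32_t* bits =
        reinterpret_cast<const std::uint32_t*>(base + has_bits_offset_);
    return (bits[index / 32] & (1u << (index % 32))) != 0;
  }

 private:
  const int* offsets_;
  int has_bits_offset_;
};
```

MiniReflection never names Student: everything it knows about the layout arrives through the two offsets, which is why one reflection class can serve every generated message type.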

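For the implementation of RepeatedPtrField and RepeatedField, see the repeated_field.* files; details are in the chapter on repeated fields.

Memory layout

A concrete example shows how offset[] works:

from the definition of message Student to the corresponding class Student,
and on to the offsets and the memory layout of a Student object, as shown in the figure below:

[Figure: message Student, the generated class Student, its offsets and the object memory layout]

Memory layout of Message objects

Different field types are stored in different ways:

primitive types: the value is stored directly in the object;
string type: a string* is stored;
repeated<message> type: a RepeatedPtrField<message> object is stored. It uses two levels of memory management: the first level is an array<void*>, and each void* is the address of a real message object;
repeated<primitive> type: a RepeatedField<primitive> object is stored, which manages an array of the primitive values themselves.

The Student class has members of several of these types. The figure below shows how each one is located in memory:

[Figure: locating each kind of member in a Student object]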

Protobuf: Repeated-Related Classes

Posted on 2018-05-19

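Repeated types

Repeated fields fall into 2 groups:

string and message types use RepeatedPtrField;
enum and primitive types use RepeatedField.

The relationships between the core classes are shown below:

[Figure: class diagram of the repeated field classes]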

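The RepeatedField class

A class template that provides a repeated container for primitive types. Internally it is an array in contiguous memory (holding the primitive values themselves), and it exposes iterators for access.

The array starts with room for 4 elements (static const int kInitialSize = 4;). When it runs out of space, a new block of max(total_size_ * 2, new_size) elements is allocated and the existing data is copied over. If a repeated field will hold many elements, calling Reserve() up front saves these repeated allocations and copies.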
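A little arithmetic shows what Reserve() saves. The sketch below only models the growth rule quoted above; GrownCapacity and CountReallocations are hypothetical helpers, not RepeatedField members.

```cpp
#include <algorithm>

const int kInitialSize = 4;

// Capacity after a request for at least `new_size` slots, following the
// max(total_size * 2, new_size) rule.
int GrownCapacity(int total_size, int new_size) {
  if (new_size <= total_size) return total_size;  // Already large enough.
  return std::max(total_size * 2, new_size);
}

// Counts the reallocations needed to append `n` elements one at a time,
// starting from `initial_capacity`.
int CountReallocations(int n, int initial_capacity) {
  int capacity = initial_capacity;
  int reallocations = 0;
  for (int size = 1; size <= n; ++size) {
    if (size > capacity) {
      capacity = GrownCapacity(capacity, size);
      ++reallocations;
    }
  }
  return reallocations;
}
```

Appending 1000 elements from the initial capacity of 4 grows the array through 8, 16, ..., 1024, that is 8 reallocations with a copy each time; starting from a capacity of 1000, which is what a Reserve(1000) call would establish, needs none.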

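The RepeatedPtrFieldBase class

The parent class of RepeatedPtrField (not a class template itself, but it provides several member function templates). It stores and manages void* values, which are the actual addresses of the message objects, in a contiguous array.

RepeatedPtrFieldBase does not know which concrete message type it manages. It serves every element type through the TypeHandler template parameter of its member function templates, for example: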

template <typename TypeHandler>
void RepeatedPtrFieldBase::Clear() {
  for (int i = 0; i < current_size_; i++) {
    TypeHandler::Clear(cast<TypeHandler>(elements_[i]));
  }
  current_size_ = 0;
}

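All messages in the array share one descriptor and differ only in the data they hold. To save the cost of allocating and deleting message objects, a message can be cleared instead of deleted. Clearing sets primitive fields to 0 and calls clear() on the other fields (for a string field, std::string::clear()), which discards the data without releasing the memory. The cleared message keeps its slot in the array, and the next time a message has to be added the cleared messages are used first (implemented in RepeatedPtrFieldBase::AddFromCleared(); see GeneratedMessageReflection::AddMessage() for how it is called).
Several cursors are used to keep track of the cleared message pointers:

current_size_: the number of messages currently in use, which is also the index of the next slot to hand out;
allocated_size_: the number of messages that have been allocated; current_size_ <= allocated_size_, and the messages in [current_size_, allocated_size_) are the cleared ones;
total_size_: the length of elements_[]; the void* slots in [allocated_size_, total_size_) are invalid and do not point to any message.

The corresponding memory layout is as follows:
[Figure: memory layout of elements_[] with the current_size_, allocated_size_ and total_size_ cursors]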

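The RepeatedPtrField class

A class template and a subclass of RepeatedPtrFieldBase that provides a repeated container for string and message types.

Q: RepeatedPtrField is the only subclass of RepeatedPtrFieldBase. Is the parent/child split really necessary?
A:

Background (the division of labor between parent and child):

RepeatedPtrFieldBase (not a template, but it provides member function templates) implements the basic operations on an array<void*> and knows nothing about the type of the stored elements. Every type-dependent operation is delegated to the TypeHandler template parameter;
RepeatedPtrField (a class template) knows the element type, which is supplied by the template parameter Element, and its public interface is expressed in terms of Element. Operations on an Element are delegated to TypeHandler, which is passed to the parent as the template argument of RepeatedPtrFieldBase's member function templates.

This division of labor is visible in many RepeatedPtrField functions, for example: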

template <typename Element>
inline void RepeatedPtrField<Element>::RemoveLast() {
	 RepeatedPtrFieldBase::RemoveLast<TypeHandler>();
}

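The underlying reason:

In some situations the template parameter Element of the subclass RepeatedPtrField is not known, so the concrete subclass cannot be named and the code can only refer to the parent, RepeatedPtrFieldBase.

One example is GeneratedMessageReflection::AddMessage(). The message object actually holds a RepeatedPtrField object (see the student.proto example), but reflection can only point a RepeatedPtrFieldBase pointer at it, and then:

(1) call RepeatedPtrFieldBase::AddFromCleared() to try to obtain a message object that has been cleared but not freed; if there is none, continue;
(2) obtain a prototype (a Message pointer to an object of the real Message subclass):
(2.1) if the RepeatedPtrFieldBase array already has elements, use the first one;
(2.2) otherwise call factory->GetPrototype() to get one;
(3) call Message::New() on the prototype to create a new object of the Message subclass that matches the field descriptor.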

Message* GeneratedMessageReflection::AddMessage(
    Message* message, const FieldDescriptor* field,
    MessageFactory* factory) const {
   
    // Non-essential code omitted.

    // We can't use AddField<Message>() because RepeatedPtrFieldBase doesn't
    // know how to allocate one.
    RepeatedPtrFieldBase* repeated =
      MutableRaw<RepeatedPtrFieldBase>(message, field);
    Message* result = repeated->AddFromCleared<GenericTypeHandler<Message> >();
    if (result == NULL) {
      // We must allocate a new object.
      const Message* prototype;
      if (repeated->size() == 0) {
        prototype = factory->GetPrototype(field->message_type());
      } else {
        prototype = &repeated->Get<GenericTypeHandler<Message> >(0);
      }
      result = prototype->New();
      repeated->AddAllocated<GenericTypeHandler<Message> >(result);
    }
    return result;
  
}

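Q: Where is TypeHandler defined? RepeatedPtrField offers no interface for choosing a type handler per element type.

A:

It is defined in repeated_field.h. Depending on the template argument of RepeatedPtrField<> (a generic Element or string), TypeHandler inherits from a different parent class. The subclass has no data or behavior of its own, so inheritance is used here simply to select the handler: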

template <typename Element>
class RepeatedPtrField<Element>::TypeHandler
    : public internal::GenericTypeHandler<Element> {};

template <>
class RepeatedPtrField<string>::TypeHandler
    : public internal::StringTypeHandler {};

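The type handler is passed by the subclass RepeatedPtrField straight to the parent RepeatedPtrFieldBase, as the template argument of the parent's member function templates. GeneratedMessageReflection, which calls the parent directly, shows the same mechanism: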

const Message& GeneratedMessageReflection::GetRepeatedMessage(
    const Message& message, const FieldDescriptor* field, int index) const {

     // ... (omitted)

      return GetRaw<RepeatedPtrFieldBase>(message, field)
        .Get<GenericTypeHandler<Message> >(index);                                                                                                                            
}

In the member function template RepeatedPtrFieldBase::Get(), TypeHandler is GenericTypeHandler here:

template <typename TypeHandler>
inline const typename TypeHandler::Type&
RepeatedPtrFieldBase::Get(int index) const {
  GOOGLE_DCHECK_LT(index, size());
  return *cast<TypeHandler>(elements_[index]);
}

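Note: TypeHandler is a protected member of RepeatedPtrField. Users are not supposed to use RepeatedPtrField as a parent class; as the comment explains, the member is protected only for the sake of one internal subclass: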

protected:
// Note:  RepeatedPtrField SHOULD NOT be subclassed by users.  We only
//   subclass it in one place as a hack for compatibility with proto1.  The
//   subclass needs to know about TypeHandler in order to call protected
//   methods on RepeatedPtrFieldBase.
  class TypeHandler;

The GenericTypeHandler class

The type handler for message types.

The StringTypeHandler class

A subclass of StringTypeHandlerBase that adds the SpaceUsed() method to its parent.

class LIBPROTOBUF_EXPORT StringTypeHandler : public StringTypeHandlerBase {
 public:
  static int SpaceUsed(const string& value)  {
    return sizeof(value) + StringSpaceUsedExcludingSelf(value);
  }
};

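Understanding this code requires knowing how a string object is laid out. start is the address of the string object itself and end is the address just past it (&str + 1 advances by sizeof(string)). If str.data() falls inside that range, the characters are stored inside the string object and no extra memory is counted; otherwise the separately allocated buffer of capacity() bytes is counted.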

int StringSpaceUsedExcludingSelf(const string& str) {
  const void* start = &str;
  const void* end = &str + 1;

  if (start <= str.data() && str.data() <= end) {
    // The string's data is stored inside the string object itself.
    return 0;
  } else {
    return str.capacity();
  }
}
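The same check can be tried out with std::string. The sketch below is an adaptation rather than the library's code: the function names are made up for the example, the sizes are std::size_t instead of int, and std::less_equal is used because comparing unrelated pointers with the built-in <= is not guaranteed to give a meaningful order.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Returns the bytes a string holds outside of its own object.
std::size_t SpaceUsedExcludingSelfSketch(const std::string& str) {
  const void* start = &str;      // first byte of the string object
  const void* end = &str + 1;    // one past the object: start + sizeof(str)
  const void* data = str.data();
  std::less_equal<const void*> le;  // total order, even for unrelated pointers
  if (le(start, data) && le(data, end)) {
    return 0;  // characters live inside the object itself
  }
  return str.capacity();  // characters live in a separate buffer
}

// Total footprint: the object plus whatever it holds elsewhere.
std::size_t SpaceUsedSketch(const std::string& str) {
  std::size_t extra = SpaceUsedExcludingSelfSketch(str);
  assert(extra == 0 || extra >= str.size());
  return sizeof(str) + extra;
}
```

Whether a short string reports 0 depends on the standard library in use (it does so only when the implementation stores small strings inline), so only long strings give a portable result: a 1000-character string can never fit inside the object.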

Q: Why split this into a parent class and a subclass? Wouldn't StringTypeHandler alone be enough?

Answer (from the source comment):

// HACK:  If a class is declared as DLL-exported in MSVC, it insists on
//   generating copies of all its methods -- even inline ones -- to include
//   in the DLL.  But SpaceUsed() calls StringSpaceUsedExcludingSelf() which
//   isn't in the lite library, therefore the lite library cannot link if
//   StringTypeHandler is exported.  So, we factor out StringTypeHandlerBase,
//   export that, then make StringTypeHandler be a subclass which is NOT
//   exported.
// TODO(kenton):  There has to be a better way.

Protobuf-Unknown Fields

Posted on 2018-05-13

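The problem to solve:

In a distributed system, the proto files shared between modules inevitably go out of sync while an upgrade is being rolled out.
Unknown fields address this: when modules are chained over several hops (especially when a routing module merely forwards data) and their proto versions differ, unknown fields keep the data from being lost in transit.

For example:

I ran into exactly this problem while refactoring hy-new-router. A message sent from the driver to asp failed to parse once it reached hy-ui.
The three modules asp, hy-router and hy-ui communicate through interfaces defined in a legacy in-house IDL. Ranked by the fields each version contains, v1 > v3 > v2 (v2 carries the least data, even though the fields that differ all appear to be declared optional for compatibility). hy-router parsed the message as v2 and then passed the result on to the downstream hy-ui.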

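The approach:

While parsing, if a field_id is not present in the proto definition this module was built with, the field is saved in the unknown fields. During later serialization the unknown fields are written back out, so they continue downstream.

Implementation details:

Class UnknownField

Records the key (field_id and data type) and the value of one unknown field. A union holds the value for the different types. For non-primitive types, DeepCopy is taken into account.

key: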

enum Type {
  TYPE_VARINT,
  TYPE_FIXED32,
  TYPE_FIXED64,
  TYPE_LENGTH_DELIMITED,
  TYPE_GROUP
};
	
unsigned int number_ : 29;
unsigned int type_   : 3;

value:

union {
  uint64 varint_;
  uint32 fixed32_;
  uint64 fixed64_;
  string* length_delimited_;
  UnknownFieldSet* group_;
};
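Put together, the key and the value form a tagged union: the 3-bit type says which union member is valid, and 29 bits are enough for the number because protobuf field numbers do not exceed 2^29 - 1. The sketch below is a reduced, hypothetical version (no group type, public members, made-up DeepCopy()/Delete() bodies) that shows why the pointer-valued member needs a deep copy.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

struct UnknownFieldSketch {
  enum Type { TYPE_VARINT, TYPE_FIXED32, TYPE_FIXED64, TYPE_LENGTH_DELIMITED };

  // Key: field number and type packed into one 32-bit word.
  unsigned int number : 29;
  unsigned int type : 3;

  // Value: only the member selected by `type` is meaningful.
  union {
    uint64_t varint;
    uint32_t fixed32;
    uint64_t fixed64;
    std::string* length_delimited;  // heap-allocated, owned by this field
  };

  // A plain copy would share the string pointer, so the pointed-to string
  // is duplicated for length-delimited values.
  UnknownFieldSketch DeepCopy() const {
    UnknownFieldSketch copy = *this;
    if (type == TYPE_LENGTH_DELIMITED) {
      assert(length_delimited != nullptr);
      copy.length_delimited = new std::string(*length_delimited);
    }
    return copy;
  }

  // Releases the owned string, if this field holds one.
  void Delete() {
    if (type == TYPE_LENGTH_DELIMITED) {
      delete length_delimited;
      length_delimited = nullptr;
    }
  }
};
```

After DeepCopy(), deleting the original leaves the copy intact, which is the property a container of such fields relies on when it is copied.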

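Class UnknownFieldSet

Stores the UnknownField entries in a vector<UnknownField>.

Implementation in message handling

Every concrete message subclass contains an unknown-fields member and an accessor for it: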


class LIBPROTOC_EXPORT CodeGeneratorRequest : public ::google::protobuf::Message {
 public:
  inline const ::google::protobuf::UnknownFieldSet& unknown_fields() const {
    return _unknown_fields_;
  }

  …… // omitted

 private:
  ::google::protobuf::UnknownFieldSet _unknown_fields_;

  …… // omitted
};

The message's reflection object can also reach the unknown fields. As with the other reflection features, it computes the member's address as base + offset and then applies reinterpret_cast.

const UnknownFieldSet& GeneratedMessageReflection::GetUnknownFields(
    const Message& message) const {
  const void* ptr = reinterpret_cast<const uint8*>(&message) +
                    unknown_fields_offset_;
  return *reinterpret_cast<const UnknownFieldSet*>(ptr);
}
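The base + offset idea can be shown without any protobuf types. In the sketch below all names are hypothetical, the message is a plain struct so that the standard offsetof macro is valid for it, and static_cast through void* takes the place of reinterpret_cast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins: a tiny "unknown field set" and a message holding it.
struct UnknownSetSketch {
  int field_count;
};
struct DemoMessageSketch {
  int32_t id;
  double score;
  UnknownSetSketch unknown_fields;
};

// The reflection object never names the member; it only remembers where the
// member sits relative to the start of the message.
class OffsetReflectionSketch {
 public:
  explicit OffsetReflectionSketch(std::size_t unknown_fields_offset)
      : unknown_fields_offset_(unknown_fields_offset) {}

  const UnknownSetSketch& GetUnknownFields(const void* message) const {
    assert(message != nullptr);
    const void* ptr =
        static_cast<const uint8_t*>(message) + unknown_fields_offset_;
    return *static_cast<const UnknownSetSketch*>(ptr);
  }

 private:
  std::size_t unknown_fields_offset_;
};
```

A caller would construct the reflection object once with offsetof(DemoMessageSketch, unknown_fields) and could then read the member of any DemoMessageSketch through a pointer to the message alone.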

Here unknown_fields_offset_ is passed in when the GeneratedMessageReflection is constructed. Every pb.cc generated by protoc contains such a call, for example compiler/plugin.pb.cc:


CodeGeneratorRequest_reflection_ =
  new ::google::protobuf::internal::GeneratedMessageReflection(
    CodeGeneratorRequest_descriptor_,
    CodeGeneratorRequest::default_instance_,
    CodeGeneratorRequest_offsets_,
    GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, _has_bits_[0]),
    GOOGLE_PROTOBUF_GENERATED_MESSAGE_FIELD_OFFSET(CodeGeneratorRequest, _unknown_fields_),
    -1,
    ::google::protobuf::DescriptorPool::generated_pool(),
    ::google::protobuf::MessageFactory::generated_factory(),
    sizeof(CodeGeneratorRequest));

The call chain during parsing is as follows.

MessageLite::ParseFromString() ->
InlineParseFromArray() ->
InlineMergeFromCodedStream() ->
Message::MergePartialFromCodedStream() ->
WireFormat::ParseAndMergePartial() ->
WireFormat::ParseAndMergeField() ->
WireFormat::SkipField()

The core logic starts in WireFormat::ParseAndMergePartial():

A while loop reads tags one at a time from io::CodedInputStream* input;
The field_number is extracted from the tag and used to look up the FieldDescriptor* in the Descriptor*;
If the field_number is not found, WireFormat::ParseAndMergeField() obtains the Reflection* message_reflection and, through GeneratedMessageReflection::GetUnknownFields(), the unknown fields;
WireFormat::SkipField() then calls a different UnknownFieldSet method depending on the field_type;
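To make the effect of these steps concrete, here is a self-contained sketch of a parser that knows only field number 1 and keeps everything else as unknown. It is a simplification under stated assumptions: only varint fields (wire type 0) are handled, the type and function names are hypothetical, and none of the WireFormat classes are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical "old version" message: it only knows field number 1.
struct OldMessageSketch {
  uint64_t known_value = 0;
  std::vector<std::pair<uint32_t, uint64_t> > unknown;  // (number, varint)
};

// Decodes one base-128 varint starting at *pos.
bool ReadVarintSketch(const std::string& in, std::size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift < 64 && *pos < in.size(); shift += 7) {
    uint8_t byte = static_cast<uint8_t>(in[(*pos)++]);
    result |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;  // truncated or over-long varint
}

void WriteVarintSketch(uint64_t value, std::string* out) {
  assert(out != nullptr);
  while (value >= 0x80) {
    out->push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<char>(value));
}

// Reads tag/value pairs; fields the message does not know are kept, not dropped.
bool ParseSketch(const std::string& in, OldMessageSketch* msg) {
  std::size_t pos = 0;
  while (pos < in.size()) {
    uint64_t tag = 0, value = 0;
    if (!ReadVarintSketch(in, &pos, &tag)) return false;
    if ((tag & 7) != 0) return false;  // this sketch handles wire type 0 only
    if (!ReadVarintSketch(in, &pos, &value)) return false;
    uint32_t field_number = static_cast<uint32_t>(tag >> 3);
    if (field_number == 1) {
      msg->known_value = value;
    } else {
      msg->unknown.push_back(std::make_pair(field_number, value));
    }
  }
  return true;
}

// Writes the known field first, then replays the unknown fields unchanged.
std::string SerializeSketch(const OldMessageSketch& msg) {
  std::string out;
  WriteVarintSketch(1u << 3, &out);
  WriteVarintSketch(msg.known_value, &out);
  for (std::size_t i = 0; i < msg.unknown.size(); ++i) {
    WriteVarintSketch(static_cast<uint64_t>(msg.unknown[i].first) << 3, &out);
    WriteVarintSketch(msg.unknown[i].second, &out);
  }
  return out;
}
```

Feeding it the bytes 08 96 01 10 07 (field 1 = 150, field 2 = 7) stores field 2 in unknown, and serializing the message again reproduces the same five bytes, so a downstream module with a newer proto still sees field 2.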

Protobuf-Descriptor Related Classes

Posted on 2018-05-12

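Class Descriptor

Describes the meta information of a message type (not of an individual message object). The constructor is private, so instances must be created through DescriptorPool (a friend class).

const members:

const FileDescriptor* file_: describes the .proto file in which the message is defined
const Descriptor* containing_type_: if the message is nested inside another message in the proto definition, this is the descriptor of the enclosing message; otherwise it is NULL
const MessageOptions* options_: defined in descriptor.proto. Judging from the comments it is used for extension compatibility with MessageSet from the old proto1; the extension-related parts can be ignored for now.

Non-const members:

int field_count_: the number of fields this message contains
FieldDescriptor* fields_: all fields, stored as a contiguous array
int nested_type_count_: the number of nested types
Descriptor* nested_types_: the messages nested inside this message
int enum_type_count_: the number of enums defined inside this message
EnumDescriptor* enum_types_: start address of the contiguous memory holding the enum types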

Class FileDescriptor

Describes an entire .proto file, including:

The .proto files it depends on:

int dependency_count_;

const FileDescriptor** dependencies_;

The messages defined in this .proto file:

int message_type_count_;

Descriptor* message_types_;

The tables of all symbols (descriptors of every kind) defined in this .proto file:

const FileDescriptorTables* tables_;

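Class FieldDescriptor

Describes a single field. The constructor is private, so it too must be constructed by DescriptorPool (a friend class). An instance is obtained through a function of the descriptor of the containing message, such as Descriptor::FindFieldByName().

Enum types:

enum Type: the field type;
enum CppType: the field type in C++; the mapping from Type to CppType is fixed;
enum Label: the field's cardinality (optional/required/repeated);

const private data: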

const Descriptor* containing_type_;
const Descriptor* extension_scope_;
const Descriptor* message_type_;
const EnumDescriptor* enum_type_;
const FieldDescriptor* experimental_map_key_;
const FieldOptions* options_;

Three lookup tables (static const):

static const CppType kTypeToCppTypeMap[MAX_TYPE + 1];
static const char * const kTypeToName[MAX_TYPE + 1];
static const char * const kLabelToName[MAX_LABEL + 1];

In descriptor.cc, the functions that expose this data are implemented with the following macro to keep the code readable:

PROTOBUF_DEFINE_ACCESSOR(FieldDescriptor, default_value_int32, int32)
PROTOBUF_DEFINE_ACCESSOR(FieldDescriptor, has_default_value, bool)

PROTOBUF_DEFINE_ACCESSOR is defined as follows:

// These macros makes this repetitive code more readable.
#define PROTOBUF_DEFINE_ACCESSOR(CLASS, FIELD, TYPE) \
  inline TYPE CLASS::FIELD() const { return FIELD##_; }

FieldDescriptor itself contains the following union member, which holds the default value for each TYPE:

private:
  bool has_default_value_;
  union {
    int32  default_value_int32_;
    int64  default_value_int64_;
    uint32 default_value_uint32_;
    uint64 default_value_uint64_;
    float  default_value_float_;
    double default_value_double_;
    bool   default_value_bool_;

    const EnumValueDescriptor* default_value_enum_;
    const string* default_value_string_;
  };
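The combination of the union and the accessor macro can be reproduced in a compilable miniature. FieldSketch, its setter and the macro name SKETCH_DEFINE_ACCESSOR are hypothetical; only the token-pasting trick (FIELD##_ turns the accessor name into the member name) mirrors the source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of FieldDescriptor with only two default-value kinds.
class FieldSketch {
 public:
  FieldSketch() : has_default_value_(false), default_value_int32_(0) {}

  void set_default_value_int32(int32_t value) {
    assert(!has_default_value_);  // the sketch allows the default to be set once
    has_default_value_ = true;
    default_value_int32_ = value;
  }

  // Declarations only; the macro below generates the definitions.
  bool has_default_value() const;
  int32_t default_value_int32() const;
  double default_value_double() const;

 private:
  bool has_default_value_;
  union {
    int32_t default_value_int32_;
    double default_value_double_;
  };
};

// FIELD##_ pastes a trailing underscore onto the accessor name, which yields
// the name of the private member that the accessor returns.
#define SKETCH_DEFINE_ACCESSOR(CLASS, FIELD, TYPE) \
  inline TYPE CLASS::FIELD() const { return FIELD##_; }

SKETCH_DEFINE_ACCESSOR(FieldSketch, has_default_value, bool)
SKETCH_DEFINE_ACCESSOR(FieldSketch, default_value_int32, int32_t)
SKETCH_DEFINE_ACCESSOR(FieldSketch, default_value_double, double)

#undef SKETCH_DEFINE_ACCESSOR
```

Each macro line replaces a three-line function definition, and the caller is still responsible for reading only the union member that matches the field's type.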

Class EnumDescriptor

Describes an enum type defined in a .proto file.

Struct Symbol

A wrapper around the seven kinds of descriptor in protobuf.
It also uses a union to accommodate the different descriptor types:


Type type;

union {
  const Descriptor* descriptor;
  const FieldDescriptor* field_descriptor;
  const EnumDescriptor* enum_descriptor;
  const EnumValueDescriptor* enum_value_descriptor;
  const ServiceDescriptor* service_descriptor;
  const MethodDescriptor* method_descriptor;
  const FileDescriptor* package_file_descriptor;
};

For readability, the constructors are generated with a macro:

#define CONSTRUCTOR(TYPE, TYPE_CONSTANT, FIELD)  \
  inline explicit Symbol(const TYPE* value) {    \
    type = TYPE_CONSTANT;                        \
    this->FIELD = value;                         \
  }

  CONSTRUCTOR(Descriptor         , MESSAGE   , descriptor             )
  CONSTRUCTOR(FieldDescriptor    , FIELD     , field_descriptor       )
  CONSTRUCTOR(EnumDescriptor     , ENUM      , enum_descriptor        )
  CONSTRUCTOR(EnumValueDescriptor, ENUM_VALUE, enum_value_descriptor  )
  CONSTRUCTOR(ServiceDescriptor  , SERVICE   , service_descriptor     )
  CONSTRUCTOR(MethodDescriptor   , METHOD    , method_descriptor      )
  CONSTRUCTOR(FileDescriptor     , PACKAGE   , package_file_descriptor)
#undef CONSTRUCTOR
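A reduced version with two descriptor kinds shows how the macro ties each constructor to its tag and its union member. The descriptor structs, the macro name and the name() helper are hypothetical additions for this example.

```cpp
#include <cassert>
#include <string>

// Hypothetical descriptor stand-ins.
struct MessageDescSketch { std::string name; };
struct EnumDescSketch { std::string name; };

struct SymbolSketch {
  enum Type { NULL_SYMBOL, MESSAGE, ENUM };
  Type type;
  union {
    const MessageDescSketch* descriptor;
    const EnumDescSketch* enum_descriptor;
  };

  SymbolSketch() : type(NULL_SYMBOL), descriptor(nullptr) {}

  // One constructor per descriptor kind: each sets the tag and the matching
  // union member, so the two can never disagree.
#define SKETCH_CONSTRUCTOR(TYPE, TYPE_CONSTANT, FIELD) \
  explicit SymbolSketch(const TYPE* value) {           \
    type = TYPE_CONSTANT;                              \
    this->FIELD = value;                               \
  }

  SKETCH_CONSTRUCTOR(MessageDescSketch, MESSAGE, descriptor)
  SKETCH_CONSTRUCTOR(EnumDescSketch, ENUM, enum_descriptor)
#undef SKETCH_CONSTRUCTOR

  // Reads only the union member that the tag says is active.
  std::string name() const {
    switch (type) {
      case MESSAGE:
        assert(descriptor != nullptr);
        return descriptor->name;
      case ENUM:
        assert(enum_descriptor != nullptr);
        return enum_descriptor->name;
      default:
        return std::string();
    }
  }
};
```

Overload resolution picks the constructor from the pointer type, so a caller writes SymbolSketch(&some_descriptor) and never sets the tag by hand.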

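Class DescriptorPool::Tables

A collection of lookup tables that wraps a series of hash maps.
Note that this class is declared among the private members of DescriptorPool in descriptor.h, so it is a data structure internal to DescriptorPool.

The hash maps it wraps: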

typedef pair<const void*, const char*> PointerStringPair;

// Pairs the address of a message's descriptor with an int to identify one field of that descriptor.
typedef pair<const Descriptor*, int> DescriptorIntPair;
typedef pair<const EnumDescriptor*, int> EnumIntPair;


typedef hash_map<const char*, Symbol,
                 hash<const char*>, streq>
  SymbolsByNameMap;
  
typedef hash_map<PointerStringPair, Symbol,                     
                 PointerStringPairHash, PointerStringPairEqual>
  SymbolsByParentMap;   
  
typedef hash_map<const char*, const FileDescriptor*,
                 hash<const char*>, streq>
  FilesByNameMap;
  
typedef hash_map<PointerStringPair, const FieldDescriptor*,
                 PointerStringPairHash, PointerStringPairEqual>
  FieldsByNameMap;
  
typedef hash_map<DescriptorIntPair, const FieldDescriptor*,
                 PointerIntegerPairHash<DescriptorIntPair> >
  FieldsByNumberMap;
  
typedef hash_map<EnumIntPair, const EnumValueDescriptor*,
                 PointerIntegerPairHash<EnumIntPair> >
  EnumValuesByNumberMap;
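The (parent pointer, number) key can be tried with std::unordered_map, which is used here in place of the older hash_map. The sketch assumes hypothetical descriptor structs and helper functions, and its hash merely mixes the two parts with an arbitrary multiplier; it is not the library's hash function.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for Descriptor and FieldDescriptor.
struct ParentDescSketch {};
struct FieldDescSketch { int number; };

typedef std::pair<const ParentDescSketch*, int> ParentIntPair;

// std::pair has no std::hash specialization, so the map needs its own hasher.
struct ParentIntPairHash {
  std::size_t operator()(const ParentIntPair& p) const {
    std::size_t pointer_hash = std::hash<const void*>()(p.first);
    std::size_t number_hash = std::hash<int>()(p.second);
    return pointer_hash * 65599u + number_hash;  // arbitrary mixing constant
  }
};

typedef std::unordered_map<ParentIntPair, const FieldDescSketch*,
                           ParentIntPairHash>
    FieldsByNumberSketch;

void AddFieldSketch(FieldsByNumberSketch* table, const ParentDescSketch* parent,
                    const FieldDescSketch* field) {
  assert(table != nullptr && parent != nullptr && field != nullptr);
  (*table)[std::make_pair(parent, field->number)] = field;
}

// The same field number under different parents resolves to different fields.
const FieldDescSketch* FindFieldByNumberSketch(const FieldsByNumberSketch& table,
                                               const ParentDescSketch* parent,
                                               int number) {
  FieldsByNumberSketch::const_iterator it =
      table.find(std::make_pair(parent, number));
  return it == table.end() ? nullptr : it->second;
}
```

Keying on the parent's address rather than its name keeps the lookup to one hash of a pointer and an int, with no string comparison involved.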

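The meaning of parent

The definition and use of BUILD_ARRAY show what parent means. There are three cases:

A message is the parent of the members it contains (field/nested_message/enum/extension): the message's Descriptor* is the parent of each member. This can be seen from the definition of the macro `BUILD_ARRAY` in the function `DescriptorBuilder::BuildMessage()`.
An enum is the parent of the enum_values it contains (see DescriptorBuilder::BuildEnum()).
A service is the parent of the methods it contains (see DescriptorBuilder::BuildService()).

Data members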

vector<string*> strings_;    // All strings in the pool.
vector<Message*> messages_;  // All messages in the pool.
vector<FileDescriptorTables*> file_tables_;  // All file tables in the pool.
vector<void*> allocations_;  // All other memory allocated in the pool.

SymbolsByNameMap      symbols_by_name_;
FilesByNameMap        files_by_name_;
ExtensionsGroupedByDescriptorMap extensions_;

Data members related to rollback

int strings_before_checkpoint_;
int messages_before_checkpoint_;
int file_tables_before_checkpoint_;
int allocations_before_checkpoint_;
vector<const char*      > symbols_after_checkpoint_;
vector<const char*      > files_after_checkpoint_;
vector<DescriptorIntPair> extensions_after_checkpoint_;

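Other data members

vector<string> pending_files_;  // file names kept as a stack, used to detect circular file dependencies

Checkpoint/Rollback

The concept is the same as in database transactions. Once the data is known to be good, a checkpoint is created as a snapshot of the current state. If something goes wrong later, a rollback restores the data to the last checkpoint, so the basic service can keep running and the system stays available.

Checkpoints are created at only two points, both in DescriptorBuilder::BuildFile():

Before the contents of DescriptorPool::Tables* tables_ start being modified;
After all operations have succeeded;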

DescriptorPool::Tables::Checkpoint():

void DescriptorPool::Tables::Checkpoint() {
  // Record the current size of the four vectors.
  strings_before_checkpoint_ = strings_.size();
  messages_before_checkpoint_ = messages_.size();
  file_tables_before_checkpoint_ = file_tables_.size();
  allocations_before_checkpoint_ = allocations_.size();

  // Clear the three *_after_checkpoint_ vectors.
  symbols_after_checkpoint_.clear();
  files_after_checkpoint_.clear();
  extensions_after_checkpoint_.clear();
}

DescriptorPool::Tables::Rollback():

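Remove the entries recorded in the *_after_checkpoint_ vectors from the hash maps that are looked up by name;
Clear the *_after_checkpoint_ vectors;
Using the sizes recorded by Checkpoint(), delete the elements at the tail of each vector and resize() it, releasing the memory those elements no longer need;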
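The checkpoint and rollback steps can be sketched with one vector and one name map. TablesSketch and its members are hypothetical and far smaller than the real Tables class; the point is only the bookkeeping: sizes are recorded for append-only vectors, while keys are recorded for the map.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

class TablesSketch {
 public:
  TablesSketch() : strings_before_checkpoint_(0) {}

  bool AddSymbol(const std::string& name, int value) {
    if (!symbols_by_name_.insert(std::make_pair(name, value)).second) {
      return false;  // name already taken
    }
    strings_.push_back(name);
    symbols_after_checkpoint_.push_back(name);  // remember for Rollback()
    return true;
  }

  // Snapshot: remember the vector size and forget the pending-key list.
  void Checkpoint() {
    strings_before_checkpoint_ = strings_.size();
    symbols_after_checkpoint_.clear();
  }

  // Undo everything added since the last Checkpoint().
  void Rollback() {
    for (std::size_t i = 0; i < symbols_after_checkpoint_.size(); ++i) {
      symbols_by_name_.erase(symbols_after_checkpoint_[i]);
    }
    symbols_after_checkpoint_.clear();
    assert(strings_before_checkpoint_ <= strings_.size());
    strings_.resize(strings_before_checkpoint_);  // drop the tail
  }

  bool Has(const std::string& name) const {
    return symbols_by_name_.count(name) != 0;
  }
  std::size_t string_count() const { return strings_.size(); }

 private:
  std::vector<std::string> strings_;
  std::map<std::string, int> symbols_by_name_;
  std::size_t strings_before_checkpoint_;
  std::vector<std::string> symbols_after_checkpoint_;
};
```

A vector only needs its old size because new entries always land at the tail; a map has no such order, which is why the keys added after the checkpoint have to be listed explicitly.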

How does data get registered into the tables of DescriptorPool::Tables?

The entry point is DescriptorPool::Tables::AddSymbol(), which is called from DescriptorBuilder::AddSymbol() and DescriptorBuilder::AddPackage().

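Class DescriptorPool

Constructs and manages all descriptors of every type, and helps manage the relationships between cross-linked descriptors and the data dependencies among them. A descriptor can be looked up in the DescriptorPool by name.

It serves as a singleton. The global data consists of: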

EncodedDescriptorDatabase* generated_database_ = NULL;
DescriptorPool* generated_pool_ = NULL;
GOOGLE_PROTOBUF_DECLARE_ONCE(generated_pool_init_);

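google::protobuf::GoogleOnceInit (essentially pthread_once) ensures that initialization runs only once.

DescriptorPool provides three constructors, but InitGeneratedPool() only uses the one that takes a DescriptorDatabase*; the other two are not used there. In this case const DescriptorPool* underlay_ is in fact NULL.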

void InitGeneratedPool() {
  generated_database_ = new EncodedDescriptorDatabase;
  generated_pool_ = new DescriptorPool(generated_database_);

  internal::OnShutdown(&DeleteGeneratedPool);
}
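The init-once behavior can be sketched as follows. Note the substitution: instead of GoogleOnceInit or pthread_once, this sketch relies on a function-local static, which C++11 guarantees is initialized exactly once even with concurrent callers. All type and function names are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for EncodedDescriptorDatabase.
struct GeneratedDbSketch {
  int file_count;
  GeneratedDbSketch() : file_count(0) {}
};

// Hypothetical pool with only the constructor that takes a database.
class PoolSketch {
 public:
  explicit PoolSketch(GeneratedDbSketch* fallback_database)
      : fallback_database_(fallback_database), underlay_(nullptr) {
    assert(fallback_database != nullptr);
  }
  const GeneratedDbSketch* fallback_database() const {
    return fallback_database_;
  }
  const PoolSketch* underlay() const { return underlay_; }

 private:
  GeneratedDbSketch* fallback_database_;
  const PoolSketch* underlay_;  // stays null when built from a database
};

int g_generated_pool_init_calls = 0;  // counts how often initialization ran

struct GeneratedStateSketch {
  GeneratedDbSketch database;  // declared first, so it is constructed first
  PoolSketch pool;
  GeneratedStateSketch() : pool(&database) { ++g_generated_pool_init_calls; }
};

// The first call constructs the state; later calls return the same object.
GeneratedStateSketch& GeneratedState() {
  static GeneratedStateSketch state;
  return state;
}

const PoolSketch* generated_pool_sketch() { return &GeneratedState().pool; }
```

Every call to generated_pool_sketch() returns the same pool, and the counter shows that construction happened once, on first use rather than at program start.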

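`DescriptorDatabase* fallback_database_`

Purpose:

Used to build the DescriptorPool on demand from some "large" database. Because the database is so large, calling DescriptorPool::BuildFile() for every proto file in it would be inefficient. To improve efficiency, the DescriptorPool wraps the DescriptorDatabase and only builds the descriptors that are really needed.

For each proto file the program is compiled against, the descriptors it contains are not all built at process startup. Construction is deferred until a descriptor is actually needed:

(1) The user calls a method such as descriptor(), GetDescriptor() or GetReflection() that must return a descriptor;
(2) The user looks up a descriptor in DescriptorPool::generated_pool();

This is also why the underlying data of DescriptorPool needs to be layered!

Notes:

Once fallback_database_ is in use, the pool can no longer be built with the BuildFile*() methods; only the Find*By*() methods can be used
Find*By*() takes a lock, so even requests that never touch fallback_database_ become slower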

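The role of `const DescriptorPool* underlay_`

What underlay is for (taken from the comments):

It is for internal use only and has many subtle gotchas; DescriptorDatabases are recommended as the way to solve the problem instead.

Use case:

A .proto file needs to be handled at runtime with DynamicMessage, and its dependencies are known to be compiled in statically.

On the one hand, parsing and loading those dependencies a second time should be avoided;
On the other hand, the runtime .proto cannot be added to the existing DescriptorPool returned by generated_pool(). So instead of adding the file's contents to that global pool, a new DescriptorPool is created with the generated_pool() pool as its underlay.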

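The `DescriptorPool::FindBy()` family of functions

Purpose:

These functions look up a descriptor in the DescriptorPool by name. The lookup first takes a lock (MutexLockMaybe) and, judging from the code, searches three layers of data:

Search DescriptorPool::Tables tables_; if not found, go on to layer 2;
Search DescriptorPool* underlay_; if not found, go on to layer 3;
Search DescriptorDatabase* fallback_database_ for the corresponding proto, call the DescriptorBuilder::Build*() interfaces of a temporarily constructed builder to add the generated descriptors to tables_, then search tables_ again;

Q: Why is a temporary DescriptorBuilder constructed here?
The answer: the lock is there for layer 3, DescriptorDatabase* fallback_database_, because it may be read and written at the same time.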
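The three-layer lookup can be sketched with plain maps. This is an illustration under assumptions: descriptors are reduced to int ids, the "database" is a std::map, the build step is a simple copy into tables_, and LayeredPoolSketch with all its members is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

typedef std::map<std::string, int> SymbolTableSketch;  // name -> descriptor id

class LayeredPoolSketch {
 public:
  LayeredPoolSketch(const LayeredPoolSketch* underlay,
                    const SymbolTableSketch* fallback_database)
      : underlay_(underlay), fallback_database_(fallback_database), builds_(0) {}

  void Add(const std::string& name, int id) { tables_[name] = id; }

  // Returns a pointer to the id, or nullptr when no layer knows the name.
  const int* Find(const std::string& name) const {
    assert(!name.empty());
    std::lock_guard<std::mutex> lock(mutex_);
    // Layer 1: what this pool has already built.
    SymbolTableSketch::const_iterator it = tables_.find(name);
    if (it != tables_.end()) return &it->second;
    // Layer 2: the pool underneath, if any.
    if (underlay_ != nullptr) {
      const int* result = underlay_->Find(name);
      if (result != nullptr) return result;
    }
    // Layer 3: build lazily from the database, then serve from tables_.
    if (fallback_database_ != nullptr) {
      SymbolTableSketch::const_iterator db_it = fallback_database_->find(name);
      if (db_it != fallback_database_->end()) {
        ++builds_;
        return &(tables_[name] = db_it->second);
      }
    }
    return nullptr;
  }

  int builds() const { return builds_; }

 private:
  const LayeredPoolSketch* underlay_;
  const SymbolTableSketch* fallback_database_;
  mutable std::mutex mutex_;          // Find() is const but may fill tables_
  mutable SymbolTableSketch tables_;
  mutable int builds_;                // how many lazy builds happened
};
```

A pool created for runtime protos would pass the generated pool as its underlay; a name that exists only in the database is built on the first lookup and served from tables_ afterwards, which is why a const lookup still needs the lock.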

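Class DescriptorBuilder

Wraps DescriptorPool and exposes descriptor construction. Its main external interface is DescriptorBuilder::BuildFile(), which builds a FileDescriptor from a FileDescriptorProto.

The DescriptorProto family of classes

The DescriptorProto family is defined in descriptor.proto and describes the prototypes (protos) of the types that protobuf generates.

There are seven protos in total:

FileDescriptorProto	describes a file
DescriptorProto	describes a message
FieldDescriptorProto	describes a field
EnumDescriptorProto	describes an enum
EnumValueDescriptorProto	describes an enum value
ServiceDescriptorProto	describes a service
MethodDescriptorProto	describes a service method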

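Class FileDescriptorTables

The tables contained in a single proto file. They are frozen when the file is loaded, so they need no mutex protection, which makes operations that depend on a single file (for example Descriptor::FindFieldByName()) lock-free.
FileDescriptorTables and DescriptorPool::Tables used to be defined in one class.
It turns out Google has this kind of comment too: // For historical reasons,xxxxxx.

It contains the following data structures: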

SymbolsByParentMap    symbols_by_parent_;
FieldsByNameMap       fields_by_lowercase_name_;
FieldsByNameMap       fields_by_camelcase_name_;
FieldsByNumberMap     fields_by_number_;       // Not including extensions.
EnumValuesByNumberMap enum_values_by_number_;

Class DescriptorDatabase

An interface class used to build the DescriptorPool on demand from some "large" database. Because the database is so large, calling DescriptorPool::BuildFile() for every proto file in it would be inefficient.
To improve efficiency, the DescriptorPool wraps the DescriptorDatabase and only builds the descriptors that are really needed.

It has four subclasses and provides an interface to look up a file_descriptor_proto by name (note: file_descriptor_proto, not file_descriptor).

Class SimpleDescriptorDatabase

Indexes file_name -> file_descriptor_proto*, owns the file_descriptor_proto* it indexes, and provides add()/find() interfaces. SimpleDescriptorDatabase is not used inside protobuf itself.

Internals:

The index is managed by SimpleDescriptorDatabase::DescriptorIndex<const FileDescriptorProto*> index_;
The "deep-copied" entries are managed by vector<const FileDescriptorProto*> files_to_delete_;

Class SimpleDescriptorDatabase::DescriptorIndex

Internals:

A map<string, Value> maps name -> Value;
A map<string, Value> maps symbol_name -> Value for the symbols a file contains; a symbol can be a message/enum/service in the file;
A map<string, Value> maps extension_name -> Value for the extensions a file contains;

Usage:

Within protobuf it is used in only two places:

DescriptorIndex<const FileDescriptorProto*> index_ in SimpleDescriptorDatabase
SimpleDescriptorDatabase::DescriptorIndex<pair<const void*, int> > index_ in EncodedDescriptorDatabase

Class EncodedDescriptorDatabase

Description:

Indexes file_name -> pair<const void*, int>. In pair<const void*, int>, the const void* is the address of the encoded_file_descriptor string and the int is its length. The managed encoded_file_descriptor entries fall into two ownership categories:

Owned by the database, added through EncodedDescriptorDatabase::AddCopy();
Not owned by the database, added through EncodedDescriptorDatabase::Add();
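The two ownership modes can be sketched as below. One deliberate simplification: the real Add() derives the file name from the encoded bytes, whereas this hypothetical EncodedDbSketch takes the name as an explicit parameter so that no proto parsing is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <map>
#include <string>
#include <utility>
#include <vector>

class EncodedDbSketch {
 public:
  EncodedDbSketch() {}
  EncodedDbSketch(const EncodedDbSketch&) = delete;
  EncodedDbSketch& operator=(const EncodedDbSketch&) = delete;

  ~EncodedDbSketch() {
    for (std::size_t i = 0; i < files_to_delete_.size(); ++i) {
      delete[] static_cast<char*>(files_to_delete_[i]);
    }
  }

  // No ownership: the caller must keep `data` alive (e.g. a static array).
  bool Add(const std::string& file_name, const void* data, int size) {
    assert(data != nullptr && size >= 0);
    return index_.insert(std::make_pair(file_name, std::make_pair(data, size)))
        .second;
  }

  // Ownership: the bytes are copied and released in the destructor.
  bool AddCopy(const std::string& file_name, const void* data, int size) {
    assert(data != nullptr && size >= 0);
    char* copy = new char[size];
    std::memcpy(copy, data, static_cast<std::size_t>(size));
    files_to_delete_.push_back(copy);
    return Add(file_name, copy, size);
  }

  // Returns (address, length), or (nullptr, 0) for an unknown file.
  std::pair<const void*, int> Find(const std::string& file_name) const {
    std::map<std::string, std::pair<const void*, int> >::const_iterator it =
        index_.find(file_name);
    if (it == index_.end()) {
      return std::pair<const void*, int>(nullptr, 0);
    }
    return it->second;
  }

 private:
  std::map<std::string, std::pair<const void*, int> > index_;
  std::vector<void*> files_to_delete_;  // only the copies made by AddCopy()
};
```

Generated code can use the non-owning path because its encoded descriptor is a string literal that lives for the whole process; data with a shorter lifetime has to go through the copying path.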

Concrete usage:

Every pb.cc generated from a proto contains a function that adds the encoded string of that proto file to the EncodedDescriptorDatabase. For example, in protobuf/compiler/plugin.pb.cc:

void protobuf_AddDesc_google_2fprotobuf_2fcompiler_2fplugin_2eproto() {

  …… // rest of the function omitted

  ::google::protobuf::DescriptorPool::InternalAddGeneratedFile(
    "\n%google/protobuf/compiler/plugin.proto\022"
    "\030google.protobuf.compiler\032 google/protob"
    "uf/descriptor.proto\"}\n\024CodeGeneratorRequ"
    "est\022\030\n\020file_to_generate\030\001 \003(\t\022\021\n\tparamet"
    "er\030\002 \001(\t\0228\n\nproto_file\030\017 \003(\0132$.google.pr"
    "otobuf.FileDescriptorProto\"\252\001\n\025CodeGener"
    "atorResponse\022\r\n\005error\030\001 \001(\t\022B\n\004file\030\017 \003("
    "\01324.google.protobuf.compiler.CodeGenerat"
    "orResponse.File\032>\n\004File\022\014\n\004name\030\001 \001(\t\022\027\n"
    "\017insertion_point\030\002 \001(\t\022\017\n\007content\030\017 \001(\t", 399);

  …… // rest of the function omitted

}

::google::protobuf::DescriptorPool::InternalAddGeneratedFile() is defined as follows:

void DescriptorPool::InternalAddGeneratedFile(
    const void* encoded_file_descriptor, int size) {

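  // Every .pb.cc file produced by protobuf contains a call to InternalAddGeneratedFile(). It is
  // called at process startup and registers the raw bytes of the FileDescriptorProto of the .proto file.
  // Q: Where is the upstream entry point that calls this at startup? Is every proto file that is
  // compiled in (included) registered?
  // For each proto file the program is compiled against, the descriptors it contains are not all built
  // at process startup. Construction is deferred until a descriptor is actually needed:
  // (1) The user calls a method such as descriptor(), GetDescriptor() or GetReflection() that must return a descriptor;
  // (2) The user looks up a descriptor in DescriptorPool::generated_pool();
  //
  // When either kind of request occurs, the DescriptorPool first fetches and parses the FileDescriptorProto,
  // then builds the corresponding FileDescriptor (and the descriptors it contains) from it.
  //
  // Because the FileDescriptorProto type is itself generated by protobuf from protobuf/descriptor.proto,
  // parsing it must avoid any descriptor-based operation, to prevent deadlock and infinite loops.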

  InitGeneratedPoolOnce();
  GOOGLE_CHECK(generated_database_->Add(encoded_file_descriptor, size));
}

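Internals:

vector<void*> files_to_delete_: records the addresses of the encoded_file_descriptor strings that the database owns.

Class DescriptorPoolDatabase

A wrapper around a single DescriptorPool. A lookup first calls the wrapped DescriptorPool to find the file_descriptor by name, then calls FileDescriptor::CopyTo() to obtain the file_descriptor_proto.

Class MergedDescriptorDatabase

A subclass of DescriptorDatabase that wraps several descriptor_databases. Its structure is simple: the databases are kept in a vector<DescriptorDatabase*> and queried one by one.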

[Figure: relationship diagram of the DescriptorDatabase-related classes]

Protobuf-Message Related Classes

Posted on 2018-05-06

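Class MessageLite

The interface class of all messages. As the name suggests it is a "lite" message; ordinary messages are subclasses of it as well.

MessageLite suits "lightweight" messages (it only provides encoding and serialization, without reflection or descriptors). Where a lightweight message is known to be sufficient, adding the option (option optimize_for = LITE_RUNTIME;) to the .proto file makes the protocol compiler generate classes based on MessageLite, which saves runtime resources.

Class Message

An interface class that adds descriptors and reflection on top of MessageLite.

Class MessageFactory

An interface class for creating Message objects; underneath it wraps GeneratedMessageFactory.

Class GeneratedMessageFactory

A subclass of MessageFactory, implemented as a singleton.

The singleton is implemented with the global variable GeneratedMessageFactory* generated_message_factory_ together with GOOGLE_PROTOBUF_DECLARE_ONCE (essentially pthread_once).

Core internal data structures:

hash_map<const char*, RegistrationFunc*>: the member file_map_, mapping file names to registration functions. The mapping is built during static initialization, so no lock is needed;
hash_map<const Descriptor*, const Message*>: the member type_map_, mapping a Descriptor* to its Message* (actually the Message prototype; calling its New() creates a concrete Message object). This mapping is accessed from multiple threads and is protected by a reader/writer lock;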

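Key external interfaces:

const Message* GeneratedMessageFactory::GetPrototype(const Descriptor* type)

Purpose:

Find the prototype of the message that corresponds to a descriptor.

Processing flow:

Take the read lock and look in hash_map<const Descriptor*, const Message*>; return if found, otherwise continue;
Check whether the descriptor's proto file is managed by the global DescriptorPool;
Use the descriptor's file name to look up the registration function registration_func in hash_map<const char*, RegistrationFunc*>; return if it is missing, otherwise continue;
Take the write lock and check whether another thread has preempted us and already written to hash_map<const Descriptor*, const Message*>. If not, call registration_func to complete the registration. Then look up the Message prototype in hash_map<const Descriptor*, const Message*>.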
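The flow above is a check, register-on-miss, check-again pattern. The sketch below reproduces it with hypothetical types and, as a stated simplification, uses one std::mutex for both phases where the text describes a reader/writer lock; the check against the global DescriptorPool is left out.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-ins for Descriptor and Message.
struct TypeDescSketch { std::string file; std::string name; };
struct MsgSketch { std::string type_name; };

class FactorySketch {
 public:
  typedef void RegistrationFunc(FactorySketch* factory);

  // Filled during startup, before any lookup, so it is read without a lock.
  void RegisterFile(const std::string& file, RegistrationFunc* func) {
    file_map_[file] = func;
  }

  // Only called from inside a RegistrationFunc, i.e. with the lock held.
  void RegisterType(const TypeDescSketch* type, const MsgSketch* prototype) {
    type_map_[type] = prototype;
  }

  const MsgSketch* GetPrototype(const TypeDescSketch* type) {
    assert(type != nullptr);
    {
      // Phase 1: fast path, the type may already be registered.
      std::lock_guard<std::mutex> lock(mutex_);
      TypeMap::const_iterator it = type_map_.find(type);
      if (it != type_map_.end()) return it->second;
    }
    // Phase 2: find the registration function of the type's file.
    FileMap::const_iterator file_it = file_map_.find(type->file);
    if (file_it == file_map_.end()) return nullptr;
    // Phase 3: re-check under the lock, another thread may have registered
    // the file in the meantime; register only if it is still missing.
    std::lock_guard<std::mutex> lock(mutex_);
    TypeMap::const_iterator it = type_map_.find(type);
    if (it == type_map_.end()) {
      file_it->second(this);
      it = type_map_.find(type);
    }
    return it == type_map_.end() ? nullptr : it->second;
  }

 private:
  typedef std::map<std::string, RegistrationFunc*> FileMap;
  typedef std::map<const TypeDescSketch*, const MsgSketch*> TypeMap;
  FileMap file_map_;
  TypeMap type_map_;
  std::mutex mutex_;
};

// Demo data standing in for what a generated .pb.cc would provide.
const TypeDescSketch kRequestDescSketch = {"plugin.proto", "CodeGeneratorRequest"};
const MsgSketch kRequestPrototypeSketch = {"CodeGeneratorRequest"};
int g_registration_calls = 0;

void RegisterPluginTypesSketch(FactorySketch* factory) {
  ++g_registration_calls;
  factory->RegisterType(&kRequestDescSketch, &kRequestPrototypeSketch);
}
```

After RegisterFile("plugin.proto", &RegisterPluginTypesSketch), the first GetPrototype() call runs the registration function and later calls are served from the type map, so the cost of registering a file's types is paid only when one of them is first requested.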

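void RegisterFile(const char* file, RegistrationFunc* registration_func)

Purpose:

Register the relationship between a file name and its registration function in hash_map<const char*, RegistrationFunc*>

void RegisterType(const Descriptor* descriptor, const Message* prototype)

Purpose:

Register the relationship between a descriptor and its message in hash_map<const Descriptor*, const Message*>

How the registrations are generated:

Each .pb.cc makes this call, for example in protobuf/compiler/plugin.pb.cc: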

void protobuf_AddDesc_google_2fprotobuf_2fcompiler_2fplugin_2eproto() {

  …… // omitted

  ::google::protobuf::MessageFactory::InternalRegisterGeneratedFile(
    "google/protobuf/compiler/plugin.proto", &protobuf_RegisterTypes);

  …… // omitted

}

The registration function is also defined in protobuf/compiler/plugin.pb.cc:

void protobuf_RegisterTypes(const ::std::string&) {
  protobuf_AssignDescriptorsOnce();
  ::google::protobuf::MessageFactory::InternalRegisterGeneratedMessage(
    CodeGeneratorRequest_descriptor_, &CodeGeneratorRequest::default_instance());
  ::google::protobuf::MessageFactory::InternalRegisterGeneratedMessage(
    CodeGeneratorResponse_descriptor_, &CodeGeneratorResponse::default_instance());
  ::google::protobuf::MessageFactory::InternalRegisterGeneratedMessage(
    CodeGeneratorResponse_File_descriptor_, &CodeGeneratorResponse_File::default_instance());
}

Each message in plugin.proto has a corresponding descriptor and default message object, for example:

CodeGeneratorRequest_descriptor_ and CodeGeneratorRequest::default_instance()

void MessageFactory::InternalRegisterGeneratedMessage(
    const Descriptor* descriptor, const Message* prototype) {
  GeneratedMessageFactory::singleton()->RegisterType(descriptor, prototype);
}

This ultimately calls GeneratedMessageFactory::RegisterType():

void GeneratedMessageFactory::RegisterType(const Descriptor* descriptor,
                                           const Message* prototype) {
  …  // omitted
  if (!InsertIfNotPresent(&type_map_, descriptor, prototype)) {
  …  // omitted
}

Class DynamicMessageFactory

A subclass of MessageFactory, used for messages that are not known at compile time.

hello-world

Posted on 2018-05-03

Hello World!

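Inspired by 海玉 and 有成, I am starting a blog of my own today.

The source code I have read so far has mostly been in-house base libraries (documented on the in-house wiki). From now on I will shift the focus somewhat toward open-source projects.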
