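Protobuf Compiler: Related Classes & Code Generation Flow

Code generation flow:

The core flow is shown in the figure below:

[Figure: core code generation flow]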

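Core data structures

Class CommandLineInterface

generators_: map<string, GeneratorInfo>. Maps a flag such as "--cpp_out" to its generator (CppGenerator); the names of the generators to run are taken from the protoc arguments.
plugins_: map<string, string>. A plugin provides a CodeGenerator that protobuf itself does not ship, and it runs as a separate process. plugins_ records: plugin name -> path of the plugin executable on disk.
plugin_prefix_: set to "protoc-".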
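
The lookup described above can be sketched roughly as follows. This is a simplified model, not the actual protoc code: `FakeGenerator`, `Registry` and `Resolve` are hypothetical names, and the rule that "--foo_out" falls back to a plugin named plugin_prefix_ + "gen-" + "foo" is an assumption based on the usual protoc-gen-NAME plugin naming.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a built-in CodeGenerator.
struct FakeGenerator {
  std::string language;
};

// Simplified model of the three members described above.
struct Registry {
  std::map<std::string, FakeGenerator> generators;  // "--cpp_out" -> generator
  std::map<std::string, std::string> plugins;       // plugin name -> disk path
  std::string plugin_prefix = "protoc-";

  // Returns a short description of what would handle `flag`.
  std::string Resolve(const std::string& flag) const {
    auto it = generators.find(flag);
    if (it != generators.end()) return "builtin:" + it->second.language;

    // Assumed rule: "--foo_out" -> plugin_prefix + "gen-" + "foo".
    if (flag.size() <= 6 || flag.compare(0, 2, "--") != 0 ||
        flag.compare(flag.size() - 4, 4, "_out") != 0) {
      return "unknown";
    }
    const std::string plugin_name =
        plugin_prefix + "gen-" + flag.substr(2, flag.size() - 6);

    // An explicitly registered path wins; otherwise only the name is known.
    auto p = plugins.find(plugin_name);
    if (p != plugins.end()) return "plugin:" + p->second;
    return "plugin:" + plugin_name;
  }
};
```

With "--cpp_out" registered as a built-in generator, `Resolve("--cpp_out")` reports the built-in generator, while an unregistered flag such as "--bar_out" resolves to the plugin name "protoc-gen-bar".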

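Class SourceTree

Interface class that represents the directory tree of .proto files.

Class DiskSourceTree

A subclass of SourceTree that loads files from disk. It maintains mappings from physical disk paths/files to nodes in the SourceTree. A disk path can also be mapped to "", which is the root node of the SourceTree. If several mappings resolve to the same file, they are searched in the order in which they were added.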
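
To make the ordered search concrete, here is a minimal sketch of such a mapping table. It is not the DiskSourceTree implementation: `MappingTable` and `Candidates` are hypothetical names, paths are assumed to use "/" as separator, and no file system access is performed.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified model: (virtual path prefix -> disk path) pairs, kept in the
// order in which they were added.
class MappingTable {
 public:
  void MapPath(const std::string& virtual_path, const std::string& disk_path) {
    mappings_.emplace_back(virtual_path, disk_path);
  }

  // Returns every disk path that could hold `virtual_file`, in mapping order.
  // A real implementation would open the first candidate that exists.
  std::vector<std::string> Candidates(const std::string& virtual_file) const {
    std::vector<std::string> out;
    for (const auto& m : mappings_) {
      const std::string& prefix = m.first;
      if (prefix.empty()) {
        // "" is the root of the tree, so it matches every file.
        out.push_back(m.second + "/" + virtual_file);
      } else if (virtual_file.compare(0, prefix.size() + 1, prefix + "/") == 0) {
        out.push_back(m.second + "/" + virtual_file.substr(prefix.size() + 1));
      }
    }
    return out;
  }

 private:
  std::vector<std::pair<std::string, std::string>> mappings_;
};
```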

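Class Importer

Given the name of a .proto file, returns the corresponding FileDescriptor. The work is actually delegated to DescriptorPool.

Class io::Tokenizer

Lexer. One Tokenizer object processes one ZeroCopyInputStream and turns the raw text stream into a token sequence that the parser can consume. A caller only needs to call Tokenizer::Next() and Tokenizer::current() in a loop to get the tokens in order, as if reading from a tokenized stream.

A token is defined as follows: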

struct Token {
  TokenType type;
  string text;       // The exact text of the token as it appeared in
                     // the input.  e.g. tokens of TYPE_STRING will still
                     // be escaped and in quotes.
	
  // "line" and "column" specify the position of the first character of
  // the token within the input stream.  They are zero-based.
  int line;
  int column;
  int end_column;
};

Token types are defined as follows:

enum TokenType {
  TYPE_START,       // Next() has not yet been called.
  TYPE_END,         // End of input reached.  "text" is empty.
	
  TYPE_IDENTIFIER,  // A sequence of letters, digits, and underscores, not
                    // starting with a digit.  It is an error for a number
                    // to be followed by an identifier with no space in
                    // between.
  TYPE_INTEGER,     // A sequence of digits representing an integer.  Normally
                    // the digits are decimal, but a prefix of "0x" indicates
                    // a hex number and a leading zero indicates octal, just
                    // like with C numeric literals.  A leading negative sign
                    // is NOT included in the token; it's up to the parser to
                    // interpret the unary minus operator on its own.
  TYPE_FLOAT,       // A floating point literal, with a fractional part and/or
                    // an exponent.  Always in decimal.  Again, never
                    // negative.
  TYPE_STRING,      // A quoted sequence of escaped characters.  Either single
                    // or double quotes can be used, but they must match.
                    // A string literal cannot cross a line break.
  TYPE_SYMBOL,      // Any other printable character, like '!' or '+'.
                    // Symbols are always a single character, so "!+$%" is
                    // four tokens.
};

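Tokenizing takes O(n) time in the length of the input. The process:

buffer_ holds the raw data read from the ZeroCopyInputStream; current_ is the current token and previous_ is the previous token.
Characters are divided into 8 classes (defined with the CHARACTER_CLASS macro): Whitespace/Unprintable/Digit/OctalDigit/HexDigit/Letter/Alphanumeric/Escape.
buffer_pos_ points at the character currently being processed and advances one character at a time. The type and boundary of the current_ token are decided from the character class (sometimes together with previous_.type). The core logic is in Tokenizer::Next():
```
(1) First detect and handle whitespace characters;
(2) then detect and handle comments;
(3) detect and handle unprintable characters;
(4) detect and handle all remaining characters and produce a valid token.
```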
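
The four steps can be illustrated with a deliberately small single-pass scanner. This is a sketch under simplifying assumptions, not Tokenizer::Next() itself: `MiniToken`, `MiniTokenType` and `Tokenize` are hypothetical names, only "//" line comments are recognized, unprintable characters are silently skipped instead of being reported, and strings, floats, hex and octal literals are left out.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

enum MiniTokenType { kIdentifier, kInteger, kSymbol };

struct MiniToken {
  MiniTokenType type;
  std::string text;
};

// Scans `input` once from left to right, so the cost is linear in its length.
std::vector<MiniToken> Tokenize(const std::string& input) {
  std::vector<MiniToken> tokens;
  std::size_t pos = 0;
  auto at = [&](std::size_t i) { return static_cast<unsigned char>(input[i]); };

  while (pos < input.size()) {
    const unsigned char c = at(pos);
    if (std::isspace(c)) {
      ++pos;  // (1) whitespace
    } else if (c == '/' && pos + 1 < input.size() && input[pos + 1] == '/') {
      while (pos < input.size() && input[pos] != '\n') ++pos;  // (2) comment
    } else if (!std::isprint(c)) {
      ++pos;  // (3) unprintable character: skipped in this sketch
    } else {
      // (4) everything else starts a real token
      const std::size_t start = pos;
      MiniTokenType type;
      if (std::isalpha(c) || c == '_') {
        type = kIdentifier;
        while (pos < input.size() && (std::isalnum(at(pos)) || input[pos] == '_')) ++pos;
      } else if (std::isdigit(c)) {
        type = kInteger;
        while (pos < input.size() && std::isdigit(at(pos))) ++pos;
      } else {
        type = kSymbol;  // symbols are always a single character
        ++pos;
      }
      tokens.push_back({type, input.substr(start, pos - start)});
    }
  }
  return tokens;
}
```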

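Class Parser

Syntax analyzer. Converts a Tokenizer object (the token stream of a .proto file) into a FileDescriptorProto.
It is a recursive descent parser: https://en.wikipedia.org/wiki/Recursive_descent_parser

Core data members:

io::Tokenizer* input_;                // the token stream to parse
SourceCodeInfo* source_code_info_;    // records the location (path and span) of every element in the proto file; intended for development tools and does not affect the rest of the resulting FileDescriptorProto

Processing:

Parser::Parse() loops over the input tokenizer and calls Parser::ParseTopLevelStatement() for each statement. Note how root_location is passed along during the whole process, so that each level inherits the location information of the level above it. Processing follows the nesting structure of the .proto file and is recursive.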


	bool Parser::Parse(io::Tokenizer* input, FileDescriptorProto* file) {
	
		... // omitted
		LocationRecorder root_location(this);
		
		... // omitted
		
		    // Repeatedly parse statements until we reach the end of the file.
		    while (!AtEnd()) {
		      if (!ParseTopLevelStatement(file, root_location)) {
		           ... // omitted
		
		             input_->Next();
		        }
		      }
		    }
		... // omitted
	}

Location information is propagated as follows: the child path is built by extending the parent's path with FileDescriptorProto::kMessageTypeFieldNumber plus the number of messages currently in the file, which is the offset of the current message in the parent's repeated array:

bool Parser::ParseTopLevelStatement(FileDescriptorProto* file,
                                    const LocationRecorder& root_location) {

... // omitted

	else if (LookingAt("message")) {
	    LocationRecorder location(root_location,
	      FileDescriptorProto::kMessageTypeFieldNumber, file->message_type_size());
	    return ParseMessageDefinition(file->add_message_type(), location);
	}
	
... // omitted
}
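
As a small illustration of how such a path grows level by level, consider the sketch below. `ExtendPath` is a hypothetical helper, not part of protobuf; the two field numbers are the ones declared in descriptor.proto (message_type = 4 in FileDescriptorProto, field = 2 in DescriptorProto).

```cpp
#include <vector>

// Field numbers as declared in descriptor.proto.
const int kFileMessageTypeFieldNumber = 4;  // FileDescriptorProto.message_type
const int kMessageFieldFieldNumber = 2;     // DescriptorProto.field

// Returns the parent's path extended by one repeated field: first the field
// number, then the index of the element inside that repeated field.
std::vector<int> ExtendPath(std::vector<int> parent, int field_number, int index) {
  parent.push_back(field_number);
  parent.push_back(index);
  return parent;
}
```

Starting from the empty root path, the second message of a file gets the path [4, 1], and the first field of that message gets [4, 1, 2, 0].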

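The core logic is in Parser::ParseTopLevelStatement():

Each call to Parser::ParseTopLevelStatement() handles one complete top-level block (a whole message/enum/service/extend/etc.). Each block is processed level by level, following the grammar of the .proto file, and the FileDescriptorProto and its members are filled in at the lowest level (the leaves). For example, label/type/name/number are assigned at the 'field' level of a message.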


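Class Parser::LocationRecorder

A private class of Parser that records one location in SourceCodeInfo.location. It is implemented with RAII: the constructor records the start position and the destructor records the end position.

Core data members: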

Parser* parser_;
SourceCodeInfo::Location* location_;
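
The RAII idea can be sketched independently of protobuf. In the following simplified model, `Cursor`, `Span` and `ScopedSpanRecorder` are hypothetical names: the cursor stands in for the parser's current position, and a plain vector stands in for SourceCodeInfo.location.

```cpp
#include <cstddef>
#include <vector>

struct Cursor { int line = 0; int column = 0; };  // current parse position
struct Span { int start_line; int start_column; int end_line; int end_column; };

// Reserves an entry and stores the start position on construction, then fills
// in the end position on destruction, whatever path leaves the scope.
class ScopedSpanRecorder {
 public:
  ScopedSpanRecorder(const Cursor* cursor, std::vector<Span>* spans)
      : cursor_(cursor), spans_(spans), index_(spans->size()) {
    spans_->push_back(Span{cursor_->line, cursor_->column, 0, 0});
  }
  ~ScopedSpanRecorder() {
    // Access by index, because nested recorders may have grown the vector.
    Span& s = (*spans_)[index_];
    s.end_line = cursor_->line;
    s.end_column = cursor_->column;
  }
  ScopedSpanRecorder(const ScopedSpanRecorder&) = delete;
  ScopedSpanRecorder& operator=(const ScopedSpanRecorder&) = delete;

 private:
  const Cursor* cursor_;
  std::vector<Span>* spans_;
  std::size_t index_;
};
```

Because the end position is written in the destructor, a parse function can return early from any branch and the span is still closed correctly.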

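Q: Looking at the call hierarchy:

SourceTreeDescriptorDatabase::FindFileByName(const string& filename, FileDescriptorProto* output) ->
Parser::Parse(io::Tokenizer* input, FileDescriptorProto* file) ->
Parser::ParseTopLevelStatement(FileDescriptorProto* file, const LocationRecorder& root_location)

The first parameter of Parser::ParseTopLevelStatement(FileDescriptorProto* file, const LocationRecorder& root_location), file, is not input data; it is the output that has to be filled in. When the parser first enters this function, file has no content yet. So why can the function read from file with calls such as file->message_type_size()?

The answer:

file is written to throughout the whole process. Whenever a new sub-structure is about to be processed, a FileDescriptorProto::add*() method is called to create it, so reading from file returns the state of file at that moment. In the example below, file->message_type_size() gives the offset of the message being processed within array<message>, starting at 0. After file->add_message_type() is called, reading file->message_type_size() again returns a value that is larger by 1. For example: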

	bool Parser::ParseTopLevelStatement(FileDescriptorProto* file,
	                                    const LocationRecorder& root_location) {
	
	... // omitted
	
	  } else if (LookingAt("message")) {
	    LocationRecorder location(root_location,
	      FileDescriptorProto::kMessageTypeFieldNumber, file->message_type_size());
	    return ParseMessageDefinition(file->add_message_type(), location);
	
	... // omitted
	}
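
The same read-then-add pattern can be reproduced with a tiny stand-in for the generated class. `FakeFile`, `FakeMessage` and `AddMessage` are hypothetical; only the method names message_type_size() and add_message_type() mirror the generated FileDescriptorProto accessors.

```cpp
#include <string>
#include <vector>

struct FakeMessage {
  std::string name;
};

// Minimal stand-in for the repeated message_type field of a file descriptor.
struct FakeFile {
  std::vector<FakeMessage> message_type;

  int message_type_size() const { return static_cast<int>(message_type.size()); }

  FakeMessage* add_message_type() {
    message_type.emplace_back();
    return &message_type.back();
  }
};

// Reads the size first (the index the new element is about to get), then adds
// the element. Returns that index.
int AddMessage(FakeFile* file, const std::string& name) {
  const int index = file->message_type_size();
  file->add_message_type()->name = name;
  return index;
}
```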

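Class SourceCodeInfo

Encapsulates information about the .proto source file from which the corresponding FileDescriptorProto is generated. It is defined in descriptor.proto, and the generated class is a subclass of Message.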

message SourceCodeInfo {
	repeated Location location = 1;
	message Location {
	    repeated int32 path = 1 [packed=true];
	    repeated int32 span = 2 [packed=true];
	}
}

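span: the position of a location in the proto file: [start_line, start_column, end_line, end_column]. When the location starts and ends on the same line, end_line is omitted and the span has three elements.
path: the path of a location within the hierarchy of the proto file (starting from FileDescriptorProto). It contains the field number at each level and, if the field is repeated in its parent, the corresponding index.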
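
According to the comments in descriptor.proto, a span has either four elements or, when the location starts and ends on the same line, three elements with the end line omitted. A reader of span therefore has to handle both shapes. A minimal sketch (`FullSpan` and `ExpandSpan` are hypothetical names):

```cpp
#include <vector>

struct FullSpan {
  int start_line;
  int start_column;
  int end_line;
  int end_column;
};

// Expands a 3- or 4-element span into its full form.
// Returns false if the span has any other number of elements.
bool ExpandSpan(const std::vector<int>& span, FullSpan* out) {
  if (span.size() == 4) {
    *out = FullSpan{span[0], span[1], span[2], span[3]};
    return true;
  }
  if (span.size() == 3) {
    // End line omitted: the location ends on the line where it starts.
    *out = FullSpan{span[0], span[1], span[0], span[2]};
    return true;
  }
  return false;
}
```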

Class SourceLocationTable

Maintains the mapping pair<descriptor, ErrorLocation> -> pair<line, column>. Core data structure:

typedef map<
  pair<const Message*, DescriptorPool::ErrorCollector::ErrorLocation>,
  pair<int, int> > LocationMap;
LocationMap location_map_;
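
How such a table might be filled and queried is sketched below. This is a simplified stand-in rather than the protobuf class: `FakeDescriptor`, `FakeErrorLocation` and `LocationTable` are hypothetical, and the choice to report line -1 and column 0 for a missing entry is an assumption of this sketch.

```cpp
#include <map>
#include <utility>

struct FakeDescriptor {};                       // stands in for a Message
enum FakeErrorLocation { NAME, NUMBER, TYPE };  // which part of it is meant

class LocationTable {
 public:
  void Add(const FakeDescriptor* descriptor, FakeErrorLocation location,
           int line, int column) {
    map_[std::make_pair(descriptor, location)] = std::make_pair(line, column);
  }

  // Looks up where `location` of `descriptor` appeared in the source file.
  bool Find(const FakeDescriptor* descriptor, FakeErrorLocation location,
            int* line, int* column) const {
    auto it = map_.find(std::make_pair(descriptor, location));
    if (it == map_.end()) {
      *line = -1;  // assumed "unknown" marker
      *column = 0;
      return false;
    }
    *line = it->second.first;
    *column = it->second.second;
    return true;
  }

 private:
  std::map<std::pair<const FakeDescriptor*, FakeErrorLocation>,
           std::pair<int, int>> map_;
};
```

The key includes the ErrorLocation so that an error about a field's name and an error about its number can point at different columns of the same declaration.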

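CodeGenerator-related classes

Class GeneratorContext

Interface class that represents the directory to which a CodeGenerator writes its files, plus other context information about the environment in which the CodeGenerator runs.

Class GeneratorContextImpl

A subclass of GeneratorContext that holds the generated files in memory and then writes them to disk. Each GeneratorContext corresponds to one output location, so if two generators write to the same location they must share the same GeneratorContext.

Class CodeGenerator

Interface class that generates code from .proto definition files.

OutputDirective struct: describes an output location and the generator that writes to it.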
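
One way to guarantee that sharing is to key the contexts by output location. The sketch below only illustrates the idea; `FakeContext`, `ContextPool` and `GetOrCreate` are hypothetical names and not the protoc implementation.

```cpp
#include <map>
#include <memory>
#include <string>

// Stand-in for a context that buffers generated files in memory.
struct FakeContext {
  std::map<std::string, std::string> files;  // file name -> contents
};

// Hands out exactly one context per output location.
class ContextPool {
 public:
  FakeContext* GetOrCreate(const std::string& output_location) {
    std::unique_ptr<FakeContext>& slot = contexts_[output_location];
    if (!slot) slot.reset(new FakeContext);
    return slot.get();
  }

 private:
  std::map<std::string, std::unique_ptr<FakeContext>> contexts_;
};
```

Two generators that both target the same directory receive the same pointer, so files written by one are visible to the other before anything is flushed to disk.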

// output_directives_ lists all the files we are supposed to output and what
// generator to use for each.
struct OutputDirective {                                                                                                
  string name;                // E.g. "--foo_out"
  CodeGenerator* generator;   // NULL for plugins
  string parameter;
  string output_location;
};  
vector<OutputDirective> output_directives_;
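
To show where the name, parameter and output_location fields could come from, here is a sketch that splits one command-line argument of the form "--NAME_out=PARAMETER:LOCATION" or "--NAME_out=LOCATION". `Directive` and `ParseDirective` are hypothetical names; splitting at the first ':' is an assumption of this sketch, and special cases such as Windows drive letters are ignored.

```cpp
#include <cstddef>
#include <string>

// Mirrors the string fields of OutputDirective; the generator pointer is left out.
struct Directive {
  std::string name;             // e.g. "--cpp_out"
  std::string parameter;        // optional, text before the first ':'
  std::string output_location;  // where the generated files go
};

// Splits "--NAME_out=PARAMETER:LOCATION" or "--NAME_out=LOCATION".
// Returns false if the argument carries no '=' at all.
bool ParseDirective(const std::string& arg, Directive* out) {
  const std::size_t eq = arg.find('=');
  if (eq == std::string::npos) return false;

  out->name = arg.substr(0, eq);
  const std::string value = arg.substr(eq + 1);

  const std::size_t colon = value.find(':');
  if (colon == std::string::npos) {
    out->parameter.clear();
    out->output_location = value;
  } else {
    out->parameter = value.substr(0, colon);
    out->output_location = value.substr(colon + 1);
  }
  return true;
}
```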

Related class diagram