A Brief Look at the Ogre Source Code: Scripts and Their Parsing (Part 2)

How Ogre parses its script files, and how that mechanism is implemented



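This article examines how Ogre parses its script files, a process with three stages: lexical analysis, syntactic parsing, and compilation. It concentrates on how Ogre's own lexer, ScriptLexer, and its parser, ScriptParser, work, and on how these components together parse a script file. Excerpts from the Ogre source code illustrate each step.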

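Two approaches to parsing text files are especially common. In the first, the file has a fixed format but no complicated structure, like Ogre's .cfg files. Such a file can usually be read line by line and each line parsed directly, so the whole process is fairly simple. In the second, the file has a more complex structure and its contents must follow a defined grammar. Source files of programming languages and of scripting languages (JavaScript, Python, Lua and so on) belong to this category. Parsing them usually takes several fairly involved stages, and the final result is a sequence of instructions. The way Ogre script files are organized and parsed sits between these two in both method and complexity.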
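To make the first approach concrete, here is a minimal sketch of line-by-line parsing for "key=value" text. It is not Ogre's ConfigFile class: parseSimpleConfig() and trim() are hypothetical helpers, and the rules assumed here (a '#' at the start of a line marks a comment, lines without '=' are skipped) are simplifications chosen for the example.

```cpp
#include <map>
#include <sstream>
#include <string>

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trim(const std::string &s)
{
    const char *ws = " \t\r";
    const std::string::size_type first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return "";
    const std::string::size_type last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Hypothetical helper (not part of Ogre): parse "key=value" lines,
// skipping blank lines and lines that start with '#'.
std::map<std::string, std::string> parseSimpleConfig(const std::string &text)
{
    std::map<std::string, std::string> settings;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
    {
        line = trim(line);
        if (line.empty() || line[0] == '#')
            continue;
        const std::string::size_type eq = line.find('=');
        if (eq == std::string::npos)
            continue; // not a key=value line; ignored in this sketch
        settings[trim(line.substr(0, eq))] = trim(line.substr(eq + 1));
    }
    return settings;
}
```

Each line is handled independently of every other line, which is why no lexer or parser state is needed for this kind of file.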

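Take .material, .program, .particle and .compositor files as examples. Each script holds input data to be processed and all of them follow the same format rules, which resembles the first kind of file. At the same time, that data is usually built by nesting fairly simple units inside one another so that the units form a tree, which is clearly more complex than the first kind. The biggest difference between these Ogre scripts and scripting-language source files (or programming-language source files in general) is that Ogre scripts have no control structures and no data type definitions. In practice, Ogre's script files are closer to structured text formats such as HTML or XML. Studying how Ogre parses its scripts is therefore useful for understanding how files of this kind are parsed, and for defining your own external text format and writing a parser for it.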
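The tree shape described above can be pictured with a small data structure. The sketch below is purely illustrative: BlockNode, depth() and makeSampleMaterial() are hypothetical names, not Ogre classes, and the material/technique/pass nesting is built by hand rather than parsed. A struct that holds a std::vector of its own type is guaranteed to work from C++17 onward.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node type, for illustration only (not an Ogre class).
struct BlockNode
{
    std::string name;                // e.g. "material", "technique", "pass"
    std::vector<std::string> values; // words that follow the name
    std::vector<BlockNode> children; // nested blocks between '{' and '}'
};

// Number of nesting levels in the tree rooted at n (a leaf has depth 1).
std::size_t depth(const BlockNode &n)
{
    std::size_t deepest = 0;
    for (const BlockNode &child : n.children)
    {
        const std::size_t d = depth(child);
        if (d > deepest)
            deepest = d;
    }
    return deepest + 1;
}

// Build material -> technique -> pass by hand.
BlockNode makeSampleMaterial()
{
    BlockNode pass{"pass", {}, {}};
    BlockNode technique{"technique", {}, {pass}};
    return BlockNode{"material", {"Examples/Rock"}, {technique}};
}
```

There is nothing here resembling a loop, a branch or a type declaration in the script itself; the data is only nested, which is the point the paragraph makes.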

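All Ogre scripts follow a common format, or in other words a common set of rules for organizing data. Ogre parses a script in three stages: lexical analysis, parsing (syntactic analysis), and compilation. This looks much like compiling a programming-language source file, but the source code shows that the lexical analysis and parsing stages are far simpler than their counterparts in a real compiler, although they serve a similar purpose. The third stage, the so-called compile (performed in ScriptCompiler::compile(), line 6 of the code below), does something entirely different from the final stage of compiling a scripting-language source file. Here is ScriptCompiler::compile():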

1     bool ScriptCompiler::compile(const String &str, const String &source, const String &group)
2     {
3         ScriptLexer lexer;
4         ScriptParser parser;
5         ConcreteNodeListPtr nodes = parser.parse(lexer.tokenize(str, source));
6         return compile(nodes, group);
7     }

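ScriptLexer is Ogre's own lexer and ScriptParser is its parser. The first parameter of compile(), str, holds the full text of the script file as read from outside. The script is first processed by the lexer's tokenize() function, whose source code is: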

  1     ScriptTokenListPtr ScriptLexer::tokenize(const String &str, const String &source)
  2     {
  3         // State enums
  4         enum{ READY = 0, COMMENT, MULTICOMMENT, WORD, QUOTE, VAR, POSSIBLECOMMENT };
  5 
  6         // Set up some constant characters of interest
  7 #if OGRE_WCHAR_T_STRINGS
  8         const wchar_t varopener = L'$', quote = L'\"', slash = L'/', backslash = L'\\', openbrace = L'{', closebrace = L'}', colon = L':', star = L'*', cr = L'\r', lf = L'\n';
  9         wchar_t c = 0, lastc = 0;
 10 #else
 11         const wchar_t varopener = '$', quote = '\"', slash = '/', backslash = '\\', openbrace = '{', closebrace = '}', colon = ':', star = '*', cr = '\r', lf = '\n';
 12         char c = 0, lastc = 0;
 13 #endif
 14 
 15         String lexeme;
 16         uint32 line = 1, state = READY, lastQuote = 0;
 17         ScriptTokenListPtr tokens(OGRE_NEW_T(ScriptTokenList, MEMCATEGORY_GENERAL)(), SPFM_DELETE_T);
 18 
 19         // Iterate over the input
 20         String::const_iterator i = str.begin(), end = str.end();
 21         while(i != end)
 22         {
 23             lastc = c;
 24             c = *i;
 25 
 26             if(c == quote)
 27                 lastQuote = line;
 28 
 29             switch(state)
 30             {
 31             case READY:
 32                 if(c == slash && lastc == slash)
 33                 {
 34                     // Comment start, clear out the lexeme
 35                     lexeme = "";
 36                     state = COMMENT;
 37                 }
 38                 else if(c == star && lastc == slash)
 39                 {
 40                     lexeme = "";
 41                     state = MULTICOMMENT;
 42                 }
 43                 else if(c == quote)
 44                 {
 45                     // Clear out the lexeme ready to be filled with quotes!
 46                     lexeme = c;
 47                     state = QUOTE;
 48                 }
 49                 else if(c == varopener)
 50                 {
 51                     // Set up to read in a variable
 52                     lexeme = c;
 53                     state = VAR;
 54                 }
 55                 else if(isNewline(c))
 56                 {
 57                     lexeme = c;
 58                     setToken(lexeme, line, source, tokens.get());
 59                 }
 60                 else if(!isWhitespace(c))
 61                 {
 62                     lexeme = c;
 63                     if(c == slash)
 64                         state = POSSIBLECOMMENT;
 65                     else
 66                         state = WORD;
 67                 }
 68                 break;
 69             case COMMENT:
 70                 // This newline happens to be ignored automatically
 71                 if(isNewline(c))
 72                     state = READY;
 73                 break;
 74             case MULTICOMMENT:
 75                 if(c == slash && lastc == star)
 76                     state = READY;
 77                 break;
 78             case POSSIBLECOMMENT:
 79                 if(c == slash && lastc == slash)
 80                 {
 81                     lexeme = "";
 82                     state = COMMENT;
 83                     break;    
 84                 }
 85                 else if(c == star && lastc == slash)
 86                 {
 87                     lexeme = "";
 88                     state = MULTICOMMENT;
 89                     break;
 90                 }
 91                 else
 92                 {
 93                     state = WORD;
 94                 }
 95             case WORD:
 96                 if(isNewline(c))
 97                 {
 98                     setToken(lexeme, line, source, tokens.get());
 99                     lexeme = c;
100                     setToken(lexeme, line, source, tokens.get());
101                     state = READY;
102                 }
103                 else if(isWhitespace(c))
104                 {
105                     setToken(lexeme, line, source, tokens.get());
106                     state = READY;
107                 }
108                 else if(c == openbrace || c == closebrace || c == colon)
109                 {
110                     setToken(lexeme, line, source, tokens.get());
111                     lexeme = c;
112                     setToken(lexeme, line, source, tokens.get());
113                     state = READY;
114                 }
115                 else
116                 {
117                     lexeme += c;
118                 }
119                 break;
120             case QUOTE:
121                 if(c != backslash)
122                 {
123                     // Allow embedded quotes with escaping
124                     if(c == quote && lastc == backslash)
125                     {
126                         lexeme += c;
127                     }
128                     else if(c == quote)
129                     {
130                         lexeme += c;
131                         setToken(lexeme, line, source, tokens.get());
132                         state = READY;
133                     }
134                     else
135                     {
136                         // Backtrack here and allow a backslash normally within the quote
137                         if(lastc == backslash)
138                             lexeme = lexeme + "\\" + c;
139                         else
140                             lexeme += c;
141                     }
142                 }
143                 break;
144             case VAR:
145                 if(isNewline(c))
146                 {
147                     setToken(lexeme, line, source, tokens.get());
148                     lexeme = c;
149                     setToken(lexeme, line, source, tokens.get());
150                     state = READY;
151                 }
152                 else if(isWhitespace(c))
153                 {
154                     setToken(lexeme, line, source, tokens.get());
155                     state = READY;
156                 }
157                 else if(c == openbrace || c == closebrace || c == colon)
158                 {
159                     setToken(lexeme, line, source, tokens.get());
160                     lexeme = c;
161                     setToken(lexeme, line, source, tokens.get());
162                     state = READY;
163                 }
164                 else
165                 {
166                     lexeme += c;
167                 }
168                 break;
169             }
170 
171             // Separate check for newlines just to track line numbers
172             if(c == cr || (c == lf && lastc != cr))
173                 line++;
174             
175             i++;
176         }
177 
178         // Check for valid exit states
179         if(state == WORD || state == VAR)
180         {
181             if(!lexeme.empty())
182                 setToken(lexeme, line, source, tokens.get());
183         }
184         else
185         {
186             if(state == QUOTE)
187             {
188                 OGRE_EXCEPT(Exception::ERR_INVALID_STATE, 
189                     Ogre::String("no matching \" found for \" at line ") + 
190                         Ogre::StringConverter::toString(lastQuote),
191                     "ScriptLexer::tokenize");
192             }
193         }
194 
195         return tokens;
196     }

Since the return value is a ScriptTokenListPtr, look at the related definitions first:

 1     /** This struct represents a token, which is an ID'd lexeme from the
 2         parsing input stream.
 3     */
 4     struct ScriptToken
 5     {
 6         /// This is the lexeme for this token
 7         String lexeme, file;
 8         /// This is the id associated with the lexeme, which comes from a lexeme-token id mapping
 9         uint32 type;
10         /// This holds the line number of the input stream where the token was found.
11         uint32 line;
12     };
13     typedef SharedPtr<ScriptToken> ScriptTokenPtr;
14     typedef vector<ScriptTokenPtr>::type ScriptTokenList;
15     typedef SharedPtr<ScriptTokenList> ScriptTokenListPtr;

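ScriptLexer::tokenize() first creates a vector for holding ScriptToken objects (a ScriptTokenList) and keeps a pointer to it, tokens (line 17). It then stores the result of lexical analysis in tokens one token at a time and finally returns tokens. The function works by combining character-by-character analysis with a state machine.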
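The container layout can be reproduced with the standard library alone. The sketch below assumes std::shared_ptr as a stand-in for Ogre's SharedPtr; Token, makeTokenList() and appendToken() are hypothetical names introduced for the example and do not appear in Ogre.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Stand-in for ScriptToken, with the same four fields.
struct Token
{
    std::string lexeme, file;
    std::uint32_t type;
    std::uint32_t line;
};
typedef std::shared_ptr<Token> TokenPtr;
typedef std::vector<TokenPtr> TokenList;
typedef std::shared_ptr<TokenList> TokenListPtr;

// Create the (initially empty) list that a lexer would fill and return.
TokenListPtr makeTokenList()
{
    return std::make_shared<TokenList>();
}

// Append one token; each token is individually heap-allocated and shared.
void appendToken(TokenList &tokens, const std::string &lexeme,
                 std::uint32_t type, std::uint32_t line, const std::string &file)
{
    TokenPtr token = std::make_shared<Token>();
    token->lexeme = lexeme;
    token->type = type;
    token->line = line;
    token->file = file;
    tokens.push_back(token);
}
```

Because both the list and its elements are reference-counted, the list can be handed from the lexer to the parser without copying any token.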

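The variables "i" and "end" mark the beginning and the end of the script text being parsed (line 20); as parsing proceeds, "i" advances one character at a time (line 175). The whole process is expressed through the states of the state machine: the ready state (READY, lines 31-68), the comment states (COMMENT and MULTICOMMENT, lines 69-77), the word state (WORD, lines 95-119), the state for text enclosed in double quotes (QUOTE, lines 120-143), the variable state (VAR, lines 144-168), and the state for input that may turn out to be a comment (POSSIBLECOMMENT, lines 78-94).

The main purpose of lexical analysis is to pick out each lexeme in the script file (a word, an opening brace and a closing brace each count as one lexeme), create a token object for it, and store the lexeme's information in that token.

At the start of every loop iteration two variables are updated: "c" and "lastc" (lines 23 and 24). c is the character about to be processed and lastc is the character before it. Both are needed because lexemes in an Ogre script are delimited mainly by whitespace (newlines, braces and colons also end a word), and several decisions depend on a pair of adjacent characters: "//" and "/*" open a comment, "*/" closes one, a backslash before a quote escapes it, and a "\r\n" pair counts as a single line break. Together with the current state, c and lastc let the lexer tell whether the current character starts a new lexeme, continues the lexeme in progress, or ends it.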
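The "character by character plus state machine" idea can be shown in a much smaller form. The following is a simplified sketch, not Ogre's ScriptLexer: miniTokenize() is a hypothetical function with only three states, it handles words, '{', '}', ':' and "//" comments, and it treats newlines as ordinary separators instead of emitting newline tokens.

```cpp
#include <string>
#include <vector>

// Reduced lexer: READY waits for a lexeme, WORD accumulates one,
// COMMENT discards characters up to the end of the line.
std::vector<std::string> miniTokenize(const std::string &text)
{
    enum State { READY, WORD, COMMENT };
    State state = READY;
    std::string lexeme;
    std::vector<std::string> tokens;
    char lastc = 0;

    for (char c : text)
    {
        const bool space = (c == ' ' || c == '\t' || c == '\r' || c == '\n');
        const bool punct = (c == '{' || c == '}' || c == ':');

        switch (state)
        {
        case READY:
            if (punct)
                tokens.push_back(std::string(1, c));
            else if (!space)
            {
                lexeme.assign(1, c); // first character of a new lexeme
                state = WORD;
            }
            break;
        case WORD:
            if (c == '/' && lastc == '/')
            {
                // "//" seen: drop the first '/', emit what came before it.
                lexeme.erase(lexeme.size() - 1);
                if (!lexeme.empty())
                    tokens.push_back(lexeme);
                state = COMMENT;
            }
            else if (space || punct)
            {
                tokens.push_back(lexeme); // the lexeme in progress ends here
                if (punct)
                    tokens.push_back(std::string(1, c));
                state = READY;
            }
            else
                lexeme += c; // still inside the lexeme
            break;
        case COMMENT:
            if (c == '\n')
                state = READY;
            break;
        }
        lastc = c;
    }
    if (state == WORD)
        tokens.push_back(lexeme); // input ended in the middle of a word
    return tokens;
}
```

As in the real function, the pair (c, lastc) is what detects the two-character comment opener, and a final check after the loop flushes a lexeme that was still being read when the input ended.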

Creating a token object and storing the lexeme's information in it is the job of ScriptLexer::setToken(). Its code is:

 1     void ScriptLexer::setToken(const Ogre::String &lexeme, Ogre::uint32 line, const String &source, Ogre::ScriptTokenList *tokens)
 2     {
 3 #if OGRE_WCHAR_T_STRINGS
 4         const wchar_t openBracket = L'{', closeBracket = L'}', colon = L':', 
 5             quote = L'\"', var = L'$';
 6 #else
 7         const char openBracket = '{', closeBracket = '}', colon = ':', 
 8             quote = '\"', var = '$';
 9 #endif
10 
11         ScriptTokenPtr token(OGRE_NEW_T(ScriptToken, MEMCATEGORY_GENERAL)(), SPFM_DELETE_T);
12         token->lexeme = lexeme;
13         token->line = line;
14         token->file = source;
15         bool ignore = false;
16 
17         // Check the user token map first
18         if(lexeme.size() == 1 && isNewline(lexeme[0]))
19         {
20             token->type = TID_NEWLINE;
21             if(!tokens->empty() && tokens->back()->type == TID_NEWLINE)
22                 ignore = true;
23         }
24         else if(lexeme.size() == 1 && lexeme[0] == openBracket)
25             token->type = TID_LBRACKET;
26         else if(lexeme.size() == 1 && lexeme[0] == closeBracket)
27             token->type = TID_RBRACKET;
28         else if(lexeme.size() == 1 && lexeme[0] == colon)
29             token->type = TID_COLON;
30         else if(lexeme[0] == var)
31             token->type = TID_VARIABLE;
32         else
33         {
34             // This is either a non-zero length phrase or quoted phrase
35             if(lexeme.size() >= 2 && lexeme[0] == quote && lexeme[lexeme.size() - 1] == quote)
36             {
37                 token->type = TID_QUOTE;
38             }
39             else
40             {
41                 token->type = TID_WORD;
42             }
43         }
44 
45         if(!ignore)
46             tokens->push_back(token);
47     }

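As the code shows, the function creates a ScriptToken object, token (line 11). The value assigned to its lexeme member (line 12) is the lexeme produced by ScriptLexer::tokenize(); it is a string, declared at line 15 of tokenize() and given its actual value inside the while loop that follows. The value assigned to the line member (line 13) is the number of the script line on which the lexeme appears. The value assigned to the file member (line 14) is the name of the script file that contains the lexeme. The type member records what kind of lexeme this is; the possible values are defined as: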

    enum{
        TID_LBRACKET = 0, // {
        TID_RBRACKET, // }
        TID_COLON, // :
        TID_VARIABLE, // $...
        TID_WORD, // *
        TID_QUOTE, // "*"
        TID_NEWLINE, // \n
        TID_UNKNOWN,
        TID_END
    };

This is an enumeration defined in the header file "OgreScriptLexer.h". The comment beside each enumerator states its meaning.
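The classification rules in setToken() can be restated as a standalone function, which makes the order of the checks easier to see. This is a sketch under assumptions: classifyLexeme() is a hypothetical name, the enumerator names are copied from the enumeration above, a newline is assumed to be either '\r' or '\n', and the empty-string guard is an addition that the original function does not have.

```cpp
#include <string>

// Same enumerator names as in OgreScriptLexer.h, redeclared so the sketch
// compiles on its own.
enum TokenType
{
    TID_LBRACKET = 0, TID_RBRACKET, TID_COLON, TID_VARIABLE,
    TID_WORD, TID_QUOTE, TID_NEWLINE, TID_UNKNOWN, TID_END
};

TokenType classifyLexeme(const std::string &lexeme)
{
    if (lexeme.empty())
        return TID_UNKNOWN; // assumption: guard added for this sketch only

    // Single-character lexemes are checked first.
    if (lexeme.size() == 1)
    {
        switch (lexeme[0])
        {
        case '\r':
        case '\n': return TID_NEWLINE;
        case '{':  return TID_LBRACKET;
        case '}':  return TID_RBRACKET;
        case ':':  return TID_COLON;
        default:   break;
        }
    }

    // Anything starting with '$' is a variable, whatever its length.
    if (lexeme[0] == '$')
        return TID_VARIABLE;

    // A quoted phrase needs both an opening and a closing quote.
    if (lexeme.size() >= 2 && lexeme.front() == '"' && lexeme.back() == '"')
        return TID_QUOTE;

    return TID_WORD;
}
```

The sketch leaves out one detail of setToken(): a newline token is dropped when the previous token is also a newline, so runs of blank lines collapse into a single TID_NEWLINE.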