GCC-3.4.6 Source Code Study Notes (77)

5.6. Preparing the parser

We are now about to parse the source file. The parser's input must be C++ tokens, and the component that produces them is called the lexer. Notably, GCC does not run a separate preprocessing pass: a function like cpp_get_token hands back tokens that have already been preprocessed, which makes that function an important part of the lexer as well. As the first step of parsing the source file, the parser and its accompanying lexer are set up.

 

15112 void

15113 c_parse_file (void)                                                                                  in parser.c

15114 {

15115   bool error_occurred;

15116

15117   the_parser = cp_parser_new ();

15118   push_deferring_access_checks (flag_access_control

15119                             ? dk_no_deferred : dk_no_check);

15120   error_occurred = cp_parser_translation_unit (the_parser);

15121   the_parser = NULL;

15122 }

 

cp_parser, the data structure representing the C++ parser, is defined below. Note that objects of this type are managed by GCC's garbage collector, and a new parser object is created for every translation unit.

 

1170 typedef struct cp_parser GTY(())                                                                 in parser.c

1171 {

1172   /* The lexer from which we are obtaining tokens.  */

1173   cp_lexer *lexer;

1174

1175   /* The scope in which names should be looked up. If NULL_TREE, then

1176     we look up names in the scope that is currently open in the

1177     source program. If non-NULL, this is either a TYPE or

1178     NAMESPACE_DECL for the scope in which we should look. 

1179

1180     This value is not cleared automatically after a name is looked

1181     up, so we must be careful to clear it before starting a new look

1182     up sequence. (If it is not cleared, then `X::Y' followed by `Z'

1183     will look up `Z' in the scope of `X', rather than the current

1184     scope.) Unfortunately, it is difficult to tell when name lookup

1185     is complete, because we sometimes peek at a token, look it up,

1186     and then decide not to consume it.  */

1187   tree scope;

1188

1189   /* OBJECT_SCOPE and QUALIFYING_SCOPE give the scopes in which the

1190     last lookup took place. OBJECT_SCOPE is used if an expression

1191     like "x->y" or "x.y" was used; it gives the type of "*x" or "x",

1192     respectively. QUALIFYING_SCOPE is used for an expression of the

1193     form "X::Y"; it refers to X.  */

1194   tree object_scope;

1195   tree qualifying_scope;

1196

1197   /* A stack of parsing contexts. All but the bottom entry on the

1198     stack will be tentative contexts.

1199

1200     We parse tentatively in order to determine which construct is in

1201     use in some situations. For example, in order to determine

1202     whether a statement is an expression-statement or a

1203     declaration-statement we parse it tentatively as a

1204     declaration-statement. If that fails, we then reparse the same

1205     token stream as an expression-statement.  */

1206   cp_parser_context *context;

1207

1208   /* True if we are parsing GNU C++. If this flag is not set, then

1209     GNU extensions are not recognized.  */

1210   bool allow_gnu_extensions_p;

1211

1212   /* TRUE if the `>' token should be interpreted as the greater-than

1213     operator. FALSE if it is the end of a template-id or

1214     template-parameter-list.  */

1215   bool greater_than_is_operator_p;

1216

1217   /* TRUE if default arguments are allowed within a parameter list

1218     that starts at this point. FALSE if only a gnu extension makes

1219     them permissible.  */

1220   bool default_arg_ok_p;

1221  

1222   /* TRUE if we are parsing an integral constant-expression. See

1223     [expr.const] for a precise definition.  */

1224   bool integral_constant_expression_p;

1225

1226   /* TRUE if we are parsing an integral constant-expression -- but a

1227     non-constant expression should be permitted as well. This flag

1228     is used when parsing an array bound so that GNU variable-length

1229     arrays are tolerated.  */

1230   bool allow_non_integral_constant_expression_p;

1231

1232   /* TRUE if ALLOW_NON_CONSTANT_EXPRESSION_P is TRUE and something has

1233     been seen that makes the expression non-constant.  */

1234   bool non_integral_constant_expression_p;

1235

1236   /* TRUE if we are parsing the argument to "__offsetof__".  */

1237   bool in_offsetof_p;

1238

1239  /* TRUE if local variable names and `this' are forbidden in the

1240     current context.  */

1241   bool local_variables_forbidden_p;

1242

1243   /* TRUE if the declaration we are parsing is part of a

1244     linkage-specification of the form `extern string-literal

1245     declaration'.  */

1246   bool in_unbraced_linkage_specification_p;

1247

1248   /* TRUE if we are presently parsing a declarator, after the

1249     direct-declarator.  */

1250   bool in_declarator_p;

1251

1252   /* TRUE if we are presently parsing a template-argument-list.  */

1253   bool in_template_argument_list_p;

1254

1255  /* TRUE if we are presently parsing the body of an

1256     iteration-statement.  */

1257   bool in_iteration_statement_p;

1258

1259   /* TRUE if we are presently parsing the body of a switch

1260     statement.  */

1261   bool in_switch_statement_p;

1262

1263   /* TRUE if we are parsing a type-id in an expression context. In

1264     such a situation, both "type (expr)" and "type (type)" are valid

1265     alternatives.  */

1266   bool in_type_id_in_expr_p;

1267

1268   /* If non-NULL, then we are parsing a construct where new type

1269     definitions are not permitted. The string stored here will be

1270     issued as an error message if a type is defined.  */

1271   const char *type_definition_forbidden_message;

1272

1273   /* A list of lists. The outer list is a stack, used for member

1274     functions of local classes. At each level there are two sub-list,

1275     one on TREE_VALUE and one on TREE_PURPOSE. Each of those

1276     sub-lists has a FUNCTION_DECL or TEMPLATE_DECL on their

1277     TREE_VALUE's. The functions are chained in reverse declaration

1278     order.

1279

1280     The TREE_PURPOSE sublist contains those functions with default

1281     arguments that need post processing, and the TREE_VALUE sublist

1282     contains those functions with definitions that need post

1283     processing.

1284

1285     These lists can only be processed once the outermost class being

1286     defined is complete.  */

1287   tree unparsed_functions_queues;

1288

1289   /* The number of classes whose definitions are currently in

1290     progress.  */

1291   unsigned num_classes_being_defined;

1292

1293   /* The number of template parameter lists that apply directly to the

1294     current declaration.  */

1295   unsigned num_template_parameter_lists;

1296 } cp_parser;
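The CONTEXT stack's role can be seen in a minimal, self-contained sketch of tentative parsing: try the ambiguous construct as a declaration-statement first, and if that fails, roll the token position back and reparse it as an expression-statement. Everything below (the toy token array, try_declaration, try_expression) is a simplified invention for illustration; in parser.c the real machinery is cp_parser_parse_tentatively and its companion functions.

#include <stdio.h>
#include <string.h>

/* A toy token stream: "parsing" only advances an index, so rolling back
   a tentative parse is just restoring the saved index.  */
static const char *tokens[] = { "int", "x", "=", "1", ";", NULL };
static int pos;

static int
try_declaration (void)
{
  /* Hypothetical rule: a declaration starts with a type keyword.  */
  if (tokens[pos] && strcmp (tokens[pos], "int") == 0)
    {
      while (tokens[pos] && strcmp (tokens[pos], ";") != 0)
        pos++;
      if (tokens[pos])
        pos++;                 /* consume the ';' */
      return 1;
    }
  return 0;
}

static int
try_expression (void)
{
  while (tokens[pos] && strcmp (tokens[pos], ";") != 0)
    pos++;
  if (tokens[pos])
    pos++;
  return 1;
}

int
main (void)
{
  int saved = pos;             /* like entering a tentative context */
  if (try_declaration ())
    printf ("parsed as declaration-statement\n");   /* commit */
  else
    {
      pos = saved;             /* roll back and reparse the same tokens */
      try_expression ();
      printf ("parsed as expression-statement\n");
    }
  return 0;
}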

 

The function cp_parser_new, which creates the cp_parser instance, is defined as follows:

 

2230 static cp_parser *

2231 cp_parser_new (void)                                                                                 in parser.c

2232 {

2233   cp_parser *parser;

2234   cp_lexer *lexer;

2235

2236   /* cp_lexer_new_main is called before calling ggc_alloc because  

2237     cp_lexer_new_main might load a PCH file.  */

2238   lexer = cp_lexer_new_main ();

 

The lexer created by cp_lexer_new_main is defined below; it, too, is a GC-managed type. Note that in this definition every pointer member carries a GTY annotation except the next field at line 212 (the members marked skip merely point into the buffer, which the collector already tracks through its length marker). The next field hints that, unlike the parser, which is unique for the whole translation unit, additional lexers can be created temporarily and chained together.

 

166  typedef struct cp_lexer GTY (())                                                                  in parser.c

167  {

168    /* The memory allocated for the buffer.  Never NULL.  */

169    cp_token * GTY ((length ("(%h.buffer_end - %h.buffer)"))) buffer;

170    /* A pointer just past the end of the memory allocated for the buffer.  */

171    cp_token * GTY ((skip (""))) buffer_end;

172    /* The first valid token in the buffer, or NULL if none.  */

173    cp_token * GTY ((skip (""))) first_token;

174    /* The next available token. If NEXT_TOKEN is NULL, then there are

175      no more available tokens.  */

176    cp_token * GTY ((skip (""))) next_token;

177    /* A pointer just past the last available token. If FIRST_TOKEN is

178      NULL, however, there are no available tokens, and then this

179      location is simply the place in which the next token read will be

180     placed. If LAST_TOKEN == FIRST_TOKEN, then the buffer is full.

181      When the LAST_TOKEN == BUFFER, then the last token is at the

182      highest memory address in the BUFFER.  */

183    cp_token * GTY ((skip (""))) last_token;

184 

185    /* A stack indicating positions at which cp_lexer_save_tokens was

186      called. The top entry is the most recent position at which we

187      began saving tokens. The entries are differences in token

188      position between FIRST_TOKEN and the first saved token.

189 

190      If the stack is non-empty, we are saving tokens. When a token is

191      consumed, the NEXT_TOKEN pointer will move, but the FIRST_TOKEN

192      pointer will not. The token stream will be preserved so that it

193      can be reexamined later.

194 

195      If the stack is empty, then we are not saving tokens. Whenever a

196      token is consumed, the FIRST_TOKEN pointer will be moved, and the

197      consumed token will be gone forever.  */

198    varray_type saved_tokens;

199 

200    /* The STRING_CST tokens encountered while processing the current

201      string literal.  */

202    varray_type string_tokens;

203 

204    /* True if we should obtain more tokens from the preprocessor; false

205      if we are processing a saved token cache.  */

206    bool main_lexer_p;

207 

208    /* True if we should output debugging information.  */

209    bool debugging_p;

210 

211     /* The next lexer in a linked list of lexers.  */

212    struct cp_lexer *next;

213  } cp_lexer;
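To make the buffer invariants described in the comments above concrete, here is a minimal stand-alone sketch of a circular token buffer that follows the same conventions: NEXT_TOKEN is the read position, LAST_TOKEN is one past the newest token, and both wrap around at BUFFER_END. The toy types and helpers are inventions for illustration only; the real lexer also has to grow the buffer when saved tokens keep old entries alive.

#include <stdio.h>

#define BUF_SIZE 4             /* tiny on purpose, to show wrap-around */

/* Simplified stand-ins for cp_token and cp_lexer.  */
typedef struct { int value; } toy_token;

typedef struct {
  toy_token buffer[BUF_SIZE];
  toy_token *buffer_end;       /* one past the end of BUFFER */
  toy_token *next_token;       /* next token to hand to the parser */
  toy_token *last_token;       /* one past the newest token read */
} toy_lexer;

/* Advance a pointer, wrapping at the end of the buffer.  */
static toy_token *
wrap (toy_lexer *lx, toy_token *p)
{
  ++p;
  return p == lx->buffer_end ? lx->buffer : p;
}

static void
put_token (toy_lexer *lx, int v)
{
  lx->last_token->value = v;
  lx->last_token = wrap (lx, lx->last_token);
}

static int
get_token (toy_lexer *lx)
{
  int v = lx->next_token->value;
  lx->next_token = wrap (lx, lx->next_token);
  return v;
}

int
main (void)
{
  toy_lexer lx;
  int i;

  lx.buffer_end = lx.buffer + BUF_SIZE;
  lx.next_token = lx.last_token = lx.buffer;

  for (i = 1; i <= 6; i++)     /* more tokens than BUF_SIZE: wraps around */
    {
      put_token (&lx, i);
      printf ("consumed %d\n", get_token (&lx));
    }
  return 0;
}

Because this sketch consumes each token right after producing it, the buffer never fills up; GCC handles the full-buffer case (LAST_TOKEN == FIRST_TOKEN, see the comment at line 180) separately.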

 

In earlier chapters we saw that tokens are represented by the type cpp_token, but that type was designed for the preprocessor. After preprocessing, constructs such as macros, assertions, and #include directives no longer exist, so cpp_token is no longer the right fit; it is replaced by cp_token, shown below, which represents a post-preprocessing token.

 

69    typedef struct cp_token GTY (())                                                                 in parser.c

70    {

71      /* The kind of token.  */

72      ENUM_BITFIELD (cpp_ttype) type : 8;

73      /* If this token is a keyword, this value indicates which keyword.

74        Otherwise, this value is RID_MAX.  */

75      ENUM_BITFIELD (rid) keyword : 8;

76     /* Token flags.  */

77      unsigned char flags;

78      /* The value associated with this token, if any.  */

79      tree value;

80      /* The location at which this token was found.  */

81      location_t location;

82    } cp_token;

 

A comparison shows that the two definitions are quite similar.

5.6.1. Creating the main lexer

Every translation unit has one main lexer accompanying its parser; it is created by the following function.

 

301  static cp_lexer *

302  cp_lexer_new_main (void)                                                                          in parser.c

303  {

304    cp_lexer *lexer;

305    cp_token first_token;

306 

307    /* It's possible that lexing the first token will load a PCH file,

308      which is a GC collection point. So we have to grab the first

309      token before allocating any memory.  */

310    cp_lexer_get_preprocessor_token (NULL, &first_token);

311     c_common_no_more_pch ();

312 

313   /* Allocate the memory.  */

314    lexer = ggc_alloc_cleared (sizeof (cp_lexer));

315 

316    /* Create the circular buffer.  */

317    lexer->buffer = ggc_calloc (CP_TOKEN_BUFFER_SIZE, sizeof (cp_token));

318    lexer->buffer_end = lexer->buffer + CP_TOKEN_BUFFER_SIZE;

319 

320    /* There is one token in the buffer.  */

321    lexer->last_token = lexer->buffer + 1;

322    lexer->first_token = lexer->buffer;

323    lexer->next_token = lexer->buffer;

324    memcpy (lexer->buffer, &first_token, sizeof (cp_token));

325 

326    /* This lexer obtains more tokens by calling c_lex.  */

327    lexer->main_lexer_p = true;

328 

329    /* Create the SAVED_TOKENS stack.  */

330    VARRAY_INT_INIT(lexer->saved_tokens, CP_SAVED_TOKENS_SIZE, "saved_tokens");

331   

332    /* Create the STRINGS array.  */

333    VARRAY_TREE_INIT (lexer->string_tokens, 32, "strings");

334 

335    /* Assume we are not debugging.  */

336    lexer->debugging_p = false;

337 

338    return lexer;

339  }

 

Note that up to this point we have read in the main input file and any headers pulled in via -include (if present), but we have not yet started tokenizing the source. The call to cp_lexer_get_preprocessor_token at line 310 below therefore triggers lexing of the source file's very first token. Under GCC's current implementation and rules, each source file may use at most one precompiled header, and it must be the first file included. Consequently, if the current source file uses a precompiled header, this call is what reads it in (recall that when the #include directive is seen, run_directive invokes the handler do_include, which calls _cpp_stack_include, which in turn calls c_common_read_pch to read the PCH file). Inside ggc_pch_read, which c_common_read_pch calls, a garbage collection can be triggered if the host operating system uses paged memory management.

 

580  static void

581  cp_lexer_get_preprocessor_token (cp_lexer *lexer ATTRIBUTE_UNUSED ,   in parser.c

582                             cp_token *token)

583  {

584    bool done;

585 

586    /* If this not the main lexer, return a terminating CPP_EOF token.  */

587    if (lexer != NULL && !lexer->main_lexer_p)

588    {

589      token->type = CPP_EOF;

590      token->location.line = 0;

591      token->location.file = NULL;

592      token->value = NULL_TREE;

593      token->keyword = RID_MAX;

594 

595      return;

596    }

597 

598    done = false;

599    /* Keep going until we get a token we like.  */

600    while (!done)

601    {

602      /* Get a new token from the preprocessor.  */

603      token->type = c_lex_with_flags (&token->value, &token->flags);

604      /* Issue messages about tokens we cannot process.  */

605      switch (token->type)

606      {

607        case CPP_ATSIGN:

608        case CPP_HASH:

609        case CPP_PASTE:

610          error ("invalid token");

611           break;

612 

613        default:

614          /* This is a good token, so we exit the loop.  */

615          done = true;

616          break;

617      }

618    }

619    /* Now we've got our token.  */

620    token->location = input_location;

621 

622    /* Check to see if this token is a keyword.  */

623    if (token->type == CPP_NAME

624       && C_IS_RESERVED_WORD (token->value))

625    {

626      /* Mark this token as a keyword.  */

627      token->type = CPP_KEYWORD;

628      /* Record which keyword.  */

629      token->keyword = C_RID_CODE (token->value);

630      /* Update the value. Some keywords are mapped to particular

631        entities, rather than simply having the value of the

632        corresponding IDENTIFIER_NODE. For example, `__const' is

633        mapped to `const'.  */

634      token->value = ridpointers[token->keyword];

635    }

636    else

637     token->keyword = RID_MAX;

638  }

 

cp_lexer_get_preprocessor_token is the lexer's low-level routine, responsible for handing preprocessed tokens back to the lexer. Clearly "#", "##", and "@" (line 607; "@" is used by Objective-C) are not valid post-preprocessing tokens. In addition, although preprocessed tokens should all be identifiers or constants of various kinds, C++ reserves certain identifiers as keywords, and those are recognized here (see the earlier section on initializing the C++ keywords).
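The keyword check in the listing above (lines 623-634) works because the tables consulted by C_IS_RESERVED_WORD, C_RID_CODE, and ridpointers were filled in when the C++ keywords were initialized. The sketch below reproduces the idea with a hypothetical two-entry table; classify and ridpointers_toy are inventions for illustration, not GCC's API, which instead tags each reserved identifier's hash-table node with its rid code up front.

#include <stdio.h>
#include <string.h>

/* Hypothetical rid codes; RID_MAX marks "not a keyword".  */
enum toy_rid { RID_CONST, RID_INT, RID_MAX };

/* Canonical spelling per rid code, playing the role of ridpointers[].  */
static const char *ridpointers_toy[] = { "const", "int" };

/* Tiny reserved-word table standing in for C_IS_RESERVED_WORD/C_RID_CODE.  */
static enum toy_rid
classify (const char *name)
{
  if (strcmp (name, "const") == 0 || strcmp (name, "__const") == 0)
    return RID_CONST;          /* `__const' maps to `const' */
  if (strcmp (name, "int") == 0)
    return RID_INT;
  return RID_MAX;
}

int
main (void)
{
  const char *ident = "__const";
  enum toy_rid keyword = classify (ident);

  if (keyword != RID_MAX)
    printf ("%s is the keyword %s\n", ident, ridpointers_toy[keyword]);
  else
    printf ("%s is an ordinary identifier\n", ident);
  return 0;
}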

5.6.1.1. Obtaining preprocessed tokens
5.6.1.1.1. Identifiers

A preprocessed token is represented by cp_token. For numeric tokens, the classification word that c_lex_with_flags obtains from cpp_classify_number (see the listing further below) is assembled from the following values.

 

619  #define CPP_N_CATEGORY   0x000F                                                        in cpplib.h

620  #define CPP_N_INVALID    0x0000

621  #define CPP_N_INTEGER    0x0001

622  #define CPP_N_FLOATING   0x0002

623 

624  #define CPP_N_WIDTH      0x00F0

625  #define CPP_N_SMALL      0x0010    /* int, float.  */

626  #define CPP_N_MEDIUM     0x0020    /* long, double.  */

627  #define CPP_N_LARGE      0x0040    /* long long, long double.  */

628 

629  #define CPP_N_RADIX      0x0F00

630  #define CPP_N_DECIMAL    0x0100

631  #define CPP_N_HEX        0x0200

632  #define CPP_N_OCTAL      0x0400

633 

634  #define CPP_N_UNSIGNED   0x1000    /* Properties.  */

635  #define CPP_N_IMAGINARY  0x2000

 

These flags are organized into groups (category, width, radix, and the standalone property bits), and one value from each applicable group is OR-ed into the classification. For example, a token such as 0x50 is classified as CPP_N_INTEGER | CPP_N_SMALL | CPP_N_HEX (with CPP_N_UNSIGNED added when the literal carries a u suffix). The type, value, and flags of a preprocessed token are all obtained through c_lex_with_flags, shown after the short illustration below.
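To make the decoding concrete, the following stand-alone snippet masks out each group from such a classification word. The macro values are copied from the cpplib.h excerpt above; everything else, including the choice of 0x50U as the literal being classified, is purely illustrative and not GCC code.

#include <stdio.h>

/* Values copied from the cpplib.h excerpt above.  */
#define CPP_N_CATEGORY  0x000F
#define CPP_N_INTEGER   0x0001
#define CPP_N_FLOATING  0x0002
#define CPP_N_WIDTH     0x00F0
#define CPP_N_SMALL     0x0010
#define CPP_N_MEDIUM    0x0020
#define CPP_N_LARGE     0x0040
#define CPP_N_RADIX     0x0F00
#define CPP_N_DECIMAL   0x0100
#define CPP_N_HEX       0x0200
#define CPP_N_OCTAL     0x0400
#define CPP_N_UNSIGNED  0x1000
#define CPP_N_IMAGINARY 0x2000

int
main (void)
{
  /* What cpp_classify_number would plausibly return for "0x50U":
     a small, hexadecimal, unsigned integer.  */
  unsigned int flags = CPP_N_INTEGER | CPP_N_SMALL | CPP_N_HEX | CPP_N_UNSIGNED;

  if ((flags & CPP_N_CATEGORY) == CPP_N_INTEGER)
    printf ("integer, ");
  if ((flags & CPP_N_RADIX) == CPP_N_HEX)
    printf ("hexadecimal, ");
  if ((flags & CPP_N_WIDTH) == CPP_N_SMALL)
    printf ("plain int width, ");
  printf ("%ssigned\n", (flags & CPP_N_UNSIGNED) ? "un" : "");
  return 0;
}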

 

315  int

316  c_lex_with_flags (tree *value, unsigned char *cpp_flags)                               in c-lex.c

317  {

318    const cpp_token *tok;

319    location_t atloc;

320    static bool no_more_pch;

321 

322  retry:

323    tok = get_nonpadding_token ();

 

The core of get_nonpadding_token is cpp_get_token. As we saw in earlier chapters, that function is where preprocessing actually takes place: macro definitions are digested directly into cpp_macro objects, macro invocations are expanded in place (with argument substitution where needed), the other directives are dispatched to their respective handlers, and the preprocessing operators are evaluated.

 

302  static inline const cpp_token *

303  get_nonpadding_token (void)                                                                      in c-lex.c

304  {

305    const cpp_token *tok;

306    timevar_push (TV_CPP);

307    do

308      tok = cpp_get_token (parse_in);

309    while (tok->type == CPP_PADDING);

310    timevar_pop (TV_CPP);

311  

312    return tok;

313  } 

 

Note that get_nonpadding_token still returns a cpp_token, not a cp_token.

 

c_lex_with_flags (continued)

 

325  retry_after_at:

326    switch (tok->type)

327    {

328      case CPP_NAME:

329        *value = HT_IDENT_TO_GCC_IDENT (HT_NODE (tok->val.node));

330        break;

331 

332      case CPP_NUMBER:

333      {

334        unsigned int flags = cpp_classify_number (parse_in, tok);

335 

336        switch (flags & CPP_N_CATEGORY)

337        {

338          case CPP_N_INVALID:

339            /* cpplib has issued an error.  */

340            *value = error_mark_node;

341            break;

342 

343          case CPP_N_INTEGER:

344            *value = interpret_integer (tok, flags);

345            break;

346 

347          case CPP_N_FLOATING:

348            *value = interpret_float (tok, flags);

349            break;

350 

351          default:

352            abort ();

353        }

354      }

355      break;

 

At line 328, a token of type CPP_NAME is an identifier; HT_IDENT_TO_GCC_IDENT converts the hash-table node associated with the token into a tree node.
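HT_IDENT_TO_GCC_IDENT can do this conversion with plain pointer arithmetic because the preprocessor's hash node is embedded inside GCC's larger identifier node, right after the tree header. The sketch below demonstrates only that "recover the container from an embedded member" pattern; the struct names and layout are hypothetical, not the actual GCC definitions.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-ins: a small node embedded inside a larger object.  */
struct hash_node { unsigned hash; const char *spelling; };

struct identifier
{
  int tree_code;               /* plays the role of the tree header */
  struct hash_node node;       /* plays the role of the cpp hash node */
};

/* Recover the enclosing identifier from a pointer to its embedded node --
   the same idea that HT_IDENT_TO_GCC_IDENT relies on.  */
#define NODE_TO_IDENTIFIER(N) \
  ((struct identifier *) ((char *) (N) - offsetof (struct identifier, node)))

int
main (void)
{
  struct identifier id = { 42, { 0xdeadbeefu, "foo" } };
  struct hash_node *n = &id.node;      /* what the preprocessor hands back */

  struct identifier *back = NODE_TO_IDENTIFIER (n);
  printf ("recovered tree_code = %d, spelling = %s\n",
          back->tree_code, back->node.spelling);
  return 0;
}

Since the offset is a compile-time constant, the conversion costs a single subtraction, which matters for an operation performed once per identifier token.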

 
