input module of lexical analyzer in lcc

[input sub-module of lexical analyzer]
.  requirements
   > speed: as fast as possible
   > impose no arbitrary limit on line length, i.e., lines can be arbitrarily long
   > extract the tokens defined by the C language

. technical implementation tips
  > read input characters in large chunks into a buffer, to reduce I/O accesses and save time
  > the buffer contains '\n' (newline) characters, so it is convenient to scan tokens line by
     line within the buffer; note that a single source line may span two buffer refills
  > most tokens cannot span line boundaries (identifiers and string literals are the
     exceptions) [1], so the question is how to deal with the tokens that can
  > refill the buffer when the remaining characters of a partially scanned token might span
     the buffer boundary, i.e., refill when fewer characters remain than the maximum token
     length; in lcc this length is 32 (MAXTOKEN) -- see the refill() sketch after this list
  > the input module will be used by the lexical analyzer
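
  A minimal C sketch of the refill step described above (simplified, not lcc's actual
  fillbuf()): it assumes the buffer layout shown in the lcc interface section below, that at
  most MAXLINE un-scanned characters remain when it is called, and that infile is an
  already-opened source file.

      #include <stdio.h>
      #include <string.h>

      #define MAXLINE 512
      #define BUFSIZE 4096

      static unsigned char buffer[MAXLINE+1 + BUFSIZE+1];
      unsigned char *cp    = &buffer[MAXLINE+1];   /* current input character       */
      unsigned char *limit = &buffer[MAXLINE+1];   /* one past the last character   */
      static FILE *infile;                         /* source file, opened elsewhere */

      void refill(void) {
          size_t tail;
          if (cp > limit)                    /* cp may have stepped past the sentinel */
              cp = limit;
          tail = limit - cp;                 /* un-scanned characters to preserve     */
          memmove(&buffer[MAXLINE+1] - tail, cp, tail);   /* move the tail just before the fill area */
          cp = &buffer[MAXLINE+1] - tail;
          limit = &buffer[MAXLINE+1] + fread(&buffer[MAXLINE+1], 1, BUFSIZE, infile);
          *limit = '\n';                     /* sentinel: every scanning loop stops here */
      }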

  // according to the implementation tips, write a gettok() function
  // input: source file
  // output: return the token type
  //         {to simplify the problem, tokens are simple punctuation (PUNC), BLANK,
  //          ID, string literal, EOF, ERR}
  // guarantee: the buffer has been filled with BUFFER_SIZE characters
  // auxiliary variables:
  //         buffer[BUFFER_SIZE] : buffer storing characters read from the file
  //         cp                  : current input char, usually points to the start of a
  //                               token or pseudo-token
  //         rcp                 : helper pointer used while scanning a token or pseudo-token
  //         limit               : sentinel of the buffer, i.e., the buffer end
  //         map[256]            : maps a char to its category: PUNC | LETTER | BLANK |
  //                               DIGIT
  ALGORITHM gettok()
        WHILE  true
        DO
              rcp <- cp

              // skip over blanks
              WHILE map[*rcp] & BLANK
              DO
                   rcp <- rcp + 1
              DONE
              cp <- rcp   // points to a non-BLANK character

              CASE *rcp++
                   '\n':
                       cp <- rcp       // rcp has consumed the '\n'; advance cp past it
                       nextline()      // bump the line number, refilling the buffer if the '\n' was the sentinel
                       IF cp = limit   // nothing left to read: end of file
                           return EOF
                       CONTINUE
                   ',' : ';' : '&' : '|' :
                       cp <- rcp;
                       return PUNC;
                   '/':
                       IF *rcp = '*'   // enter comment pseudo-token
                            prev <- 0  // previous character, used to detect the closing "*/"
                            FOR( rcp++; *rcp != '/' || prev != '*'; ) {
                                  IF map[*rcp] & NEWLINE
                                      // a '\n' at limit is only the buffer sentinel, not a real
                                      // source character, so record it as prev only when rcp < limit
                                      IF rcp < limit
                                          prev <- *rcp
                                      cp <- rcp + 1   // advance cp past the '\n'; nextline() refills
                                                      // the buffer when cp >= limit
                                      nextline()
                                      rcp <- cp

                                      IF cp == limit  // refill read nothing: EOF inside the comment
                                          BREAK
                                  ELSE
                                      prev <- *rcp++
                            }

                            IF cp >= limit   // error: unexpected EOF while looking for "*/"
                                  return ERR
                            cp <- rcp + 1    // cp now points past the closing '/', i.e., to the next token
                            BREAK            // skip over the comment and scan the next token
                       ELSE
                            cp++
                            return PUNC
                   'a'-'z': 'A'-'Z': '_':
                       // note: check whether a refill is needed before consuming the remaining
                       //       characters of the ID; the check itself does not consume *rcp
                       IF limit - rcp < MAXTOKEN
                           cp <- rcp - 1   // refill() preserves the characters from cp to limit
                           refill()
                           rcp <- cp + 1   // the first ID character has already been scanned
                       token <- rcp - 1    // mark the beginning of the ID token
                       WHILE map[*rcp] & (DIGIT|LETTER)
                           rcp <- rcp + 1
                       token <- stringn(token, rcp - token)   // save the ID's characters
                                                              // (lcc interns them with stringn())

                       cp <- rcp            // cp now points to the start of the next token
                       return ID
                   default:
                       // *(rcp-1) is the character just consumed by CASE *rcp++
                       IF map[*(rcp-1)] & NEWLINE   // other newline-class chars, e.g. '\v', '\r', '\f'
                           cp <- rcp
                           nextline()
                           IF cp >= limit
                                return EOF
                           CONTINUE
                       return ERR
        DONE
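
  The pseudocode above relies on the map[256] category table. One way it could be built (the
  bit values and the initmap() helper are illustrative choices, not lcc's actual definitions):

      enum { BLANK = 1, NEWLINE = 2, LETTER = 4, DIGIT = 8, PUNC = 16 };

      unsigned char map[256];                 /* indexed by an unsigned character value */

      void initmap(void) {
          int c;
          map[' '] = map['\t'] = BLANK;       /* skipped before every token             */
          map['\n'] = map['\v'] = map['\r'] = map['\f'] = NEWLINE;   /* '\n' is also the sentinel value */
          for (c = 'a'; c <= 'z'; c++) map[c] = LETTER;
          for (c = 'A'; c <= 'Z'; c++) map[c] = LETTER;
          map['_'] = LETTER;
          for (c = '0'; c <= '9'; c++) map[c] = DIGIT;
          map[','] = map[';'] = map['&'] = map['|'] = map['/'] = PUNC;
      }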

. lcc implementation of input module
  (1) interface of input module
       extern unsigned char *cp;     // current input char, usually points to the start of a (pseudo-)token
       extern unsigned char *limit;  // sentinel of the input buffer; *limit is always '\n'

       #define MAXLINE    512
       #define BUFSIZE    4096
       static unsigned char buffer[MAXLINE+1 + BUFSIZE+1];
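
       The MAXLINE+1 prefix is reserved so that a refill can move the un-scanned tail of the
       previous fill in front of the newly read block, and the final +1 byte holds the '\n'
       sentinel written at *limit.  A possible start-up sketch (the function name and FILE*
       parameter are illustrative; lcc's own initialization code differs):

           void input_init_sketch(FILE *f) {
               infile = f;                        /* file used by refill() (see the sketch above) */
               cp = limit = &buffer[MAXLINE+1];   /* empty buffer: cp == limit                    */
               *limit = '\n';                     /* sentinel, so scanning is always safe         */
               lineno = 0;                        /* lineno is declared in the lexer interface    */
               refill();                          /* read the first chunk                         */
           }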

  (2) get the next line, called when *cp++ == '\n'
       // input: a source file, cp, buffer
       // if the consumed '\n' was a real newline inside the buffer, increase the line number;
       // otherwise it was the sentinel, so the next line spans the buffer boundary and a refill is needed
       ALGORITHM nextline()
              IF cp >= limit
                  // the consumed '\n' was the buffer sentinel; refill() stores new characters
                  // starting at &buffer[MAXLINE+1]; if nothing is read, the file has reached EOF
                  refill(buffer)
                  IF cp = limit  // EOF
                      RETURN
              ELSE  // the next line is already inside buffer[]
                  increase line number
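
       A C sketch of the algorithm above, assuming the cp/limit/refill() pieces sketched
       earlier (lcc's real nextline() additionally skips leading blanks and passes "#" lines
       to resynch()):

           void nextline(void) {
               if (cp >= limit) {       /* the consumed '\n' was the buffer sentinel   */
                   refill();            /* new characters start at &buffer[MAXLINE+1]  */
                   if (cp == limit)     /* nothing was read: end of file               */
                       return;
               } else {
                   lineno++;            /* a real newline: just count the line         */
               }
           }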

[lexical analyzer module]

. token types in C after preprocessing
  1) identifiers including keywords
  2) numbers
  3) character constants
      e.g., 'a', '\t'
  4) string literals
     > "hello world"
     > L"hello world"   // wide string, e.g. for representing Chinese or Japanese text
     > "hello" "world"  // ok, adjacent string literals are concatenated
     > ""               // ok, empty string
     > "hello
       world"           // error: a raw newline may not appear inside a string literal
     > "hello \         // there is a '\n' right after the backslash '\'
           tworld"      // ok, no error is reported: the backslash-newline is removed by line
                        // splicing before the lexer runs, so the literal becomes one logical line

  5) punctuation and compound punctuation
       e.g., "*", "+", "&&", "|"
  6) comments -- PSEUDO token
  7) blanks -- PSEUDO token
  8) line directives or "#pragma ..." -- PSEUDO tokens, but they change the coordinate (file/line)
      info recorded for the symbols that follow (example below)
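
  An example for item 8: the directive itself produces no token, but it changes the coordinates
  recorded for the symbols that follow (resynch() in input.c handles this):

      #line 100 "original.c"    /* or, in preprocessor output, the form: # 100 "original.c" */
      int x;                    /* diagnostics here are reported at original.c:100, not at
                                   the physical position in the preprocessed file           */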

. interface
  extern char *file;        // the file in which the current token appears
  extern char *firstfile;   // the file named by the first line directive (its exact use is unclear to me)
  extern int lineno;        // the line on which the current token appears
  extern char *token;       // the characters of the current token (e.g. an identifier, number, or string literal)
  extern Symbol tsym;       // symbol-table entry for the current token: identifier, string literal,
                            // number, or built-in type

  int gettok(void);
  int getchr(void);
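
  A minimal sketch of a client of this interface: dump the coordinates of every token until end
  of input. EOI is assumed here to stand for lcc's end-of-input token code (the real token codes
  come from lcc's token.h; the value 0 below is only a placeholder):

      #include <stdio.h>

      extern char *file;
      extern int lineno;
      int gettok(void);

      enum { EOI = 0 };                /* placeholder for lcc's end-of-input token code */

      void dump_tokens(void) {
          int t;
          while ((t = gettok()) != EOI)
              printf("%s:%d: token code %d\n", file ? file : "<stdin>", lineno, t);
      }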

. source code analysis:
  lex analyzer
        |--- input.c
                |-- static void resynch(void);   // process line directives & "#pragma"
                |-- void nextline(void);         // get the next line, refilling the buffer if necessary
                |-- void fillbuf(void);          // refill the buffer (called refill() above), merging the
                                                 // new chunk with the un-scanned part of the previous fill
        |--- lex.c
                |-- int gettok(void)             // get the token that cp points to, consuming its characters
                |-- int getchr(void)             // get the next non-blank character without consuming it

   note:
   1)  void nextline();
        this function is called when *cp++ == '\n'
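
   2)  int getchr();
        a sketch of how it could be written on top of the same machinery (map[]/BLANK, cp,
        limit, nextline() as sketched above); it peeks at the next non-blank character without
        consuming it:

            #include <stdio.h>                 /* for EOF */

            extern unsigned char *cp, *limit;  /* from the input module sketches above */
            extern unsigned char map[];
            extern void nextline(void);
            enum { BLANK = 1 };                /* must match the map[] categories used by gettok() */

            int getchr(void) {
                for (;;) {
                    while (map[*cp] & BLANK)   /* skip blanks in place, consuming no token       */
                        cp++;
                    if (*cp != '\n')
                        return *cp;            /* a significant character: report, don't consume */
                    cp++;                      /* step over the newline (or the buffer sentinel) */
                    nextline();                /* count the line, or refill when at the sentinel */
                    if (cp == limit)
                        return EOF;            /* refill read nothing: end of input              */
                }
            }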


[1] refer to p103
     it makes sense that string literals can span line boundaries, for example:
     const char *str = "i love this \
                        world!";
     Q: give an instance of an identifier spanning line boundaries?
     A: an identifier may be longer than 32 characters, so scanning it may trigger a buffer refill
        when fewer than MAXTOKEN characters remain in the buffer.  (see p105 for the explanation)
