input module of lexical analyzer in lcc

[input sub-module of lexical analyzer]
.  requirements
   > speed: as fast as possible
   > there must be no limit on line length, i.e., line length can be arbitrary
   > extract the tokens defined by the C language

. technical implementation tips
  > read input characters in large chunks into a buffer, to reduce I/O accesses and save time
  > there are '\n' (newline) characters in the buffer; it is convenient to scan tokens a line at a time,
     i.e., a line of the file may span different buffer refills
  > tokens usually cannot span line boundaries (the exceptions are identifiers and string literals) [1],
     i.e., we must decide how to deal with tokens that do span line boundaries
  > refill the buffer when the remaining characters that compose a token may span the buffer boundary,
     i.e., refill the buffer when fewer characters remain than the maximum length of a token;
     in lcc, this maximum is 32 (MAXTOKEN)
  > the input module will be used by the lexical analyzer
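  As a sketch of the first two tips: a character-classification table plus the '\n'
  sentinel at the end of the buffer let the hot blank-skipping loop run with a single
  test per character and no explicit bounds check. The category values and helper
  names below are illustrative, not lcc's own:

  ```c
  /* Character categories, as in the notes; the concrete values are illustrative. */
  enum { BLANK = 1, NEWLINE = 2, LETTER = 4, DIGIT = 8 };

  static unsigned char map[256];

  /* Build the classification table once at startup. */
  static void init_map(void) {
      int c;
      map[' '] = map['\t'] = BLANK;
      map['\n'] = NEWLINE;
      for (c = 'a'; c <= 'z'; c++) map[c] = LETTER;
      for (c = 'A'; c <= 'Z'; c++) map[c] = LETTER;
      map['_'] = LETTER;
      for (c = '0'; c <= '9'; c++) map[c] = DIGIT;
  }

  /* Skip blanks with no per-character bounds check: the buffer always ends
     with a '\n' sentinel, and '\n' is not BLANK, so the loop is guaranteed
     to stop at or before the sentinel. */
  static unsigned char *skip_blanks(unsigned char *rcp) {
      while (map[*rcp] & BLANK)
          rcp++;
      return rcp;
  }
  ```

  The same table-lookup idiom reappears below for identifiers (`DIGIT|LETTER`).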

  // according to the implementation tips, write a gettok() function
  // input: source file
  // output: the token type
  //            {to simplify the problem, tokens are simple punctuation (PUNC), BLANK,
  //             ID, string literal, EOF, ERR}
  // guarantee: buffer has been filled with BUFFER_SIZE characters
  // auxiliary variables:
  //            buffer[BUFFER_SIZE] : buffer storing the characters read from the file
  //            cp                  : current input char, usually points to the start of a
  //                                  token or pseudo-token
  //            rcp                 : helper pointer used while scanning a token or pseudo-token
  //            limit               : sentinel of the buffer, i.e., the buffer end
  //            map[256]            : maps a char to its category, PUNC | LETTER | BLANK | DIGIT
  ALGORITHM gettok()
        WHILE true
        DO
              rcp <- cp

              // skip over blanks
              WHILE map[*rcp] & BLANK
              DO
                   rcp <- rcp + 1
              DONE
              cp <- rcp   // points to a non-BLANK character

              CASE *rcp++
                   '\n':
                       cp <- rcp        // cp now points past the newline
                       nextline()       // refills the buffer if the newline was the sentinel
                       IF cp = limit    // reached end of file
                           return EOF
                       CONTINUE
                   ',' : ';' : '&' : '|' :
                       cp <- rcp
                       return PUNC
                   '/':
                       IF *rcp = '*'   // enter comment pseudo-token
                            prev <- 0  // last character consumed inside the comment
                            FOR( rcp++; *rcp != '/' OR prev != '*'; ) {
                                  IF map[*rcp] & NEWLINE
                                      IF rcp < limit   // a real newline, not the sentinel; no refill needed
                                          prev <- *rcp
                                      // "+1": step past the newline before calling nextline();
                                      // nextline() itself checks whether cp >= limit and refills
                                      cp <- rcp + 1
                                      nextline()
                                      rcp <- cp

                                      IF cp == limit  // no more characters: EOF inside the comment
                                          BREAK
                                  ELSE
                                      prev <- *rcp++
                            }

                            IF cp >= limit   // error: unexpected EOF while scanning for "*/"
                                  return ERR
                            cp <- rcp + 1   // ! cp points past the closing '/', i.e., to the next token
                            BREAK           // skip over the comment and move on to the next token
                       ELSE
                            cp++
                            return PUNC
                   'a'-'z' : 'A'-'Z' : '_' :
                       // note: !! check whether refill() is needed BEFORE consuming the remaining
                       //          characters of the ID; the first ID character is not re-consumed
                       IF limit - rcp < MAXTOKEN
                           cp <- rcp - 1   // cp is used when refill() is called
                           refill()
                           rcp <- cp + 1   // !! the first ID character has already been scanned
                       token <- rcp - 1    // !!! mark the beginning of the ID token
                       WHILE map[*rcp] & (DIGIT|LETTER)
                       DO
                            rcp <- rcp + 1
                       DONE
                       token <- stringn(token, rcp - token)   // save a copy of the spelling

                       cp <- rcp           // reset cp to the beginning of the next token
                       return ID
                   default:
                       IF *rcp = '\n'
                           nextline()
                           IF cp >= limit
                                return EOF
                       return ERR
        DONE
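  The pseudocode above can be turned into a small runnable version. The sketch below
  keeps the whole input in one buffer (so refill() is unnecessary) and recognizes only
  the simplified token set; all names (`cmap`, `init_lex`, `tokbuf`, ...) are
  illustrative, not lcc's:

  ```c
  #include <string.h>

  enum { TK_EOF, TK_PUNC, TK_ID, TK_ERR };
  enum { C_BLANK = 1, C_LETTER = 2, C_DIGIT = 4 };

  static unsigned char cmap[256];
  static unsigned char *cp;    /* current input character */
  static unsigned char *lim;   /* sentinel position: *lim == '\n', end of input */
  static char tokbuf[64];      /* spelling of the most recent ID token */

  /* Prepare the tables and pointers; caller guarantees buf[n] == '\n'. */
  static void init_lex(unsigned char *buf, size_t n) {
      int c;
      memset(cmap, 0, sizeof cmap);
      cmap[' '] = cmap['\t'] = C_BLANK;
      for (c = 'a'; c <= 'z'; c++) cmap[c] = C_LETTER;
      for (c = 'A'; c <= 'Z'; c++) cmap[c] = C_LETTER;
      cmap['_'] = C_LETTER;
      for (c = '0'; c <= '9'; c++) cmap[c] = C_DIGIT;
      cp  = buf;
      lim = buf + n;
  }

  static int gettok(void) {
      for (;;) {
          unsigned char *rcp = cp;
          while (cmap[*rcp] & C_BLANK)   /* the '\n' sentinel stops this loop */
              rcp++;
          cp = rcp;
          switch (*rcp++) {
          case '\n':
              if (cp >= lim)             /* the '\n' was the sentinel: end of input */
                  return TK_EOF;
              cp = rcp;                  /* a real newline: step past it and rescan */
              continue;
          case ',': case ';': case '&': case '|':
              cp = rcp;
              return TK_PUNC;
          default:
              if (cmap[rcp[-1]] & C_LETTER) {
                  unsigned char *start = rcp - 1;   /* first ID character */
                  while (cmap[*rcp] & (C_LETTER | C_DIGIT))
                      rcp++;
                  memcpy(tokbuf, start, rcp - start);
                  tokbuf[rcp - start] = '\0';
                  cp = rcp;
                  return TK_ID;
              }
              cp = rcp;
              return TK_ERR;
          }
      }
  }
  ```

  Running it on `"foo , bar\n"` yields TK_ID ("foo"), TK_PUNC, TK_ID ("bar"), then TK_EOF.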

. lcc implementation of the input module
  (1) interface of the input module
       extern unsigned char *cp;     // current input char, usually points to the start of a (pseudo-)token
       extern unsigned char *limit;  // sentinel of the input buffer, holding the value '\n'

       #define MAXLINE    512
       #define BUFSIZE    4096
       static unsigned char buffer[MAXLINE+1 + BUFSIZE+1];

  (2) get the next line when *cp++ == '\n'
       // input: a source file, cp, buffer
       // If the next line falls inside the buffer, increase the line number;
       // else the line spans buffer refills, so refill first
       ALGORITHM nextline()
             IF cp >= limit
                 // read characters are stored in consecutive units starting at &buffer[MAXLINE+1];
                 // if no characters are read by refill(), we have already reached EOF
                 refill(buffer)
                 IF cp = limit  // EOF
                     RETURN
             ELSE  // the next line falls inside the consecutive units of buffer[]
                 increase line number
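  A sketch of the refill step that nextline() relies on: the unscanned tail of the
  buffer (at most MAXLINE characters) is slid into the bytes just before
  &buffer[MAXLINE+1], the next BUFSIZE bytes are read after it, and a '\n' sentinel
  is stored at *limit. This is a simplified reconstruction, not lcc's actual code;
  `infile` is an assumed FILE* opened by the caller:

  ```c
  #include <stdio.h>
  #include <string.h>

  #define MAXLINE 512
  #define BUFSIZE 4096

  static unsigned char buffer[MAXLINE+1 + BUFSIZE+1];
  static unsigned char *cp    = &buffer[MAXLINE+1];  /* current input character */
  static unsigned char *limit = &buffer[MAXLINE+1];  /* one past the last valid character */
  static FILE *infile;                               /* assumed: opened by the caller */

  /* Refill the buffer.  [cp, limit) holds the not-yet-scanned characters of the
     previous fill (assumed <= MAXLINE, i.e., shorter than the longest token);
     they are moved to just before the fresh data so a partially scanned token
     stays contiguous in memory. */
  static void refill(void) {
      size_t tail = limit - cp;            /* unscanned characters to preserve */
      size_t n;
      unsigned char *newcp = &buffer[MAXLINE+1] - tail;
      memmove(newcp, cp, tail);            /* slide the tail into the prefix area */
      cp = newcp;
      n = fread(&buffer[MAXLINE+1], 1, BUFSIZE, infile);
      limit = &buffer[MAXLINE+1] + n;      /* n == 0 at EOF, so cp may equal limit */
      *limit = '\n';                       /* sentinel: scanners need no bounds check */
  }
  ```

  Checking `cp == limit` right after a refill is exactly how the pseudocode above
  detects end of file.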

[lexical analyzer module]

. token types in C after preprocessing
  1) identifiers, including keywords
  2) numbers
  3) character constants
      e.g., 'a', '\t'
  4) string literals
     > "hello world"
     > L"hello world"  (wide-character string, e.g., for Chinese or Japanese text)
     > "hello" "world" // ok, string concatenation
     > ""              // ok, empty string
     > "hello
       world"          // error: a plain newline may not appear inside a string literal
     > "hello \        // there is a '\n' immediately after the backslash '\'
           tworld"     // ok: the preprocessor deletes the backslash-newline pair (line
                       // splicing), so the literal continues on the next physical line

  5) punctuation and compound punctuation
       e.g., "*", "+", "&&", "|"
  6) comments -- PSEUDO token
  7) blanks -- PSEUDO token
  8) line directives and "# pragma ..." -- PSEUDO tokens, but they change the coordinate info of symbols
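  Two of the string-literal cases above (adjacent-literal concatenation and
  backslash-newline continuation) can be checked directly; this is a plain
  standard-C illustration, independent of lcc:

  ```c
  /* Adjacent string literals are concatenated (translation phase 6). */
  static const char *joined = "hello" " " "world";

  /* A backslash immediately before the newline splices the two physical lines
     (translation phase 2); the backslash and the newline are both deleted. */
  static const char *spliced = "hello \
world";
  ```

  Both pointers end up referring to the same spelling, "hello world".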

. interface
  extern char *file;       // the file the current token falls in
  extern char *firstfile;  // ?? (not sure about the use); records the first file named in a line directive?
  extern int lineno;       // the line on which the current token is located
  extern char *token;      // !! stores the characters of a string literal or number
  extern Symbol tsym;      // symbol for a string literal / identifier / number / built-in type

  int gettok(void);
  int getchr(void);

. source code analysis:
  lex analyzer
        |--- input.c
                |-- static void resynch(void);   // process line directives & "# pragma"
                |-- void nextline();             // get the next line, refilling the buffer if necessary
                |-- refill()                     // refill the buffer, merging it with the unscanned part of the previous fill
        |--- lex.c
                |-- int gettok()                 // get the token that "cp" points to, consuming its characters
                |-- * int getchr()               // get the next non-blank character without consuming it

   note:
   1)  void nextline();
        this function is called when *cp++ == '\n'


[1] refer to p103
     it makes sense that string literals can span line boundaries, for example:
     const char *str = "i love this \
                                world!";
     Q: give an instance where an identifier spans line boundaries?
     A: an identifier may be longer than 32 characters, so it may trigger a buffer refill when
        fewer than MAXTOKEN characters remain in the buffer.  (see p105 for an explanation)
