Hello,
I started to write a lexer in Python -- my first attempt to do something
useful with Python (rather than trying out snippets from tutorials). It
is not complete yet, but I would like some feedback -- I'm a Python
newbie and it seems that, with Python, there is always a simpler and
better way to do it than you think.
### Begin ###
import re

class Lexer(object):

    def __init__( self, source, tokens ):
        # Normalise line endings to "\n", then tokenise the whole source.
        self.source = re.sub( r"\r\n|\r|\n", "\n", source )
        self.tokens = tokens
        self.offset = 0
        self.result = []
        self.line = 1
        self._compile()
        self._tokenize()

    def _compile( self ):
        # Replace each regex string by its compiled pattern.
        for name, regex in self.tokens.iteritems():
            self.tokens[name] = re.compile( regex, re.M )

    def _tokenize( self ):
        while self.offset < len( self.source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( self.source, self.offset )
                if not match: continue
                self.offset += len( match.group(0) )
                self.result.append( ( name, match, self.line ) )
                self.line += match.group(0).count( "\n" )
                break
            else:
                raise Exception(
                    'Syntax error in source at offset %s' %
                    str( self.offset ) )

    def __str__( self ):
        return "\n".join(
            [ "[L:%s]\t[O:%s]\t[%s]\t'%s'" %
              ( str( line ), str( match.pos ), name, match.group(0) )
              for name, match, line in self.result ] )

# Test Example

source = r"""
Name: "Thomas", # just a comment
Age: 37
"""

tokens = {
    'T_IDENTIFIER' : r'[A-Za-z_][A-Za-z0-9_]*',
    'T_NUMBER'     : r'[+-]?\d+',
    'T_STRING'     : r'"(?:\\.|[^\\"])*"',
    'T_OPERATOR'   : r'[=:,;]',
    'T_NEWLINE'    : r'\n',
    'T_LWSP'       : r'[ \t]+',
    'T_COMMENT'    : r'(?:\#|//).*$' }

print Lexer( source, tokens )
### End ###
Greetings,
Thomas
--
Just because many people are wrong doesn't mean they are right!
(Coluche)
Thomas Mlynarczyk wrote:
> Hello,
> I started to write a lexer in Python -- my first attempt to do
> something useful with Python (rather than trying out snippets from
> tutorials). It is not complete yet, but I would like some feedback --
> I'm a Python newbie and it seems that, with Python, there is always a
> simpler and better way to do it than you think.
Hi,
Adding to John's comments, I wouldn't have source as a member of the
Lexer object but as an argument of the tokenise() method (which I would
make public). The tokenise method would return what you currently call
self.result. So it would be used like this.

>>> mylexer = Lexer(tokens)
>>> mylexer.tokenise(source)
# Later:
>>> mylexer.tokenise(another_source)
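
Roughly the shape I have in mind -- untested, and the names are of
course up to you; it just keeps your matching loop as it is and moves
the source handling into tokenise():

import re

class Lexer(object):

    def __init__( self, tokens ):
        # Compile the token regexes once, when the lexer is created.
        self.tokens = dict( ( name, re.compile( regex, re.M ) )
                            for name, regex in tokens.iteritems() )

    def tokenise( self, source ):
        # Return a list of (name, match, line) tuples for this source.
        source = re.sub( r"\r\n|\r|\n", "\n", source )
        offset, line, result = 0, 1, []
        while offset < len( source ):
            for name, regex in self.tokens.iteritems():
                match = regex.match( source, offset )
                if not match: continue
                offset += len( match.group(0) )
                result.append( ( name, match, line ) )
                line += match.group(0).count( "\n" )
                break
            else:
                raise Exception( 'Syntax error in source at offset %s' % offset )
        return result

That way the lexer holds only the compiled token table and can be
reused for any number of sources.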
--
Arnaud
Arnaud Delobelle wrote:
> Adding to John's comments, I wouldn't have source as a member of the
> Lexer object but as an argument of the tokenise() method (which I would
> make public). The tokenise method would return what you currently call
> self.result. So it would be used like this.
>
> >>> mylexer = Lexer(tokens)
> >>> mylexer.tokenise(source)
> >>> mylexer.tokenise(another_source)
At a later stage, I intend to have the source tokenised not all at once,
but token by token, "just in time" when the parser (yet to be written)
accesses the next token:
token = mylexer.next( 'FOO_TOKEN' )
if not token: raise Exception( 'FOO token expected.' )
# continue doing something useful with token
Where next() would return the next token (and advance an internal
pointer) *if* it is a FOO_TOKEN, otherwise it would return False. This
way, the total number of regex matchings would be reduced: Only that
which is expected is "tried out".
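
Roughly what I have in mind, as an untested sketch -- the interface is
still provisional, and I am not sure yet whether next() should return
the match object or a (name, match, line) tuple; here it returns the
match:

import re

class Lexer(object):

    def __init__( self, tokens, source ):
        # Compile the token regexes once; keep an internal pointer into source.
        self.tokens = dict( ( name, re.compile( regex, re.M ) )
                            for name, regex in tokens.iteritems() )
        self.source = source
        self.offset = 0
        self.line = 1

    def next( self, name ):
        # Try only the regex of the expected token type.  On success,
        # advance the internal pointer and return the match object;
        # otherwise return False and leave the pointer where it was.
        match = self.tokens[name].match( self.source, self.offset )
        if not match:
            return False
        self.offset += len( match.group(0) )
        self.line += match.group(0).count( "\n" )
        return match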
But otherwise, upon reflection, I think you are right and it would
indeed be more appropriate to do as you suggest.
Thanks for your feedback.
Greetings,
Thomas
--
Just because many people are wrong doesn't mean they are right!
(Coluche)