How to Build a Tokenizer / Lexer Generator in C#

This article shows how to build Rolex, a lexer generator for C# that produces lexers with no external dependencies and supports features like multi-character end conditions ("block ends") and hidden tokens. Rolex's distinguishing feature is that it can generate its complete dependency code in any supported language, so no external library is required. The article also demonstrates Rolex's attributed specification format for defining lexical rules.

Introduction

This is a follow-up to How to Build a Regex Engine. It will use what we've developed there, and expand on it to create a full-fledged lexer generator.

First, what the heck is a lexer? Briefly, lexers are useful to parsers. Parsers use them to break an input text stream into lexemes tagged with symbols, so they can identify the "type" of a particular chunk of text. If you don't know what one is yet, see the previous article linked above, because it explains lexing/tokenization. You're really best off starting there in any case. Plus you'll get to go over some neat code in it. As I said, we're building on what we did there. I've included the source with it here.

Tokenizers/lexers are almost always used with parsers, but they don't have to be. They can be used any time you need to break up text into symbolic pieces.

Background

We're going to build Rolex, the "gold plated" lexer. It has some unique features, hence the "gold plating".

For starters, it can create lexers that have no external dependencies, which is rare or maybe unheard of in the limited .NET field of lexer generators. It can generate its entire dependency code as source, and do this in any language that the CodeDOM will reasonably support, so it requires no external libraries. However, you can reference Rolex.exe in your projects like you would any assembly, and your tokenizer code can use that as an external library, if desired.

Next, it has a feature called "block ends", which makes it easy to match things with multi-character ending conditions, like C block comments, markup comments, and CDATA sections.

It can also hide tokens, like comments and whitespace.

It exposes all of this through an attributed specification format:

Digits='[0-9]+'
Word='[A-Za-z]+'
Whitespace='\s'

That's the lexer we used in the previous article, expressed as a Rolex specification file (demo.rl).

For laughs and illustrative purposes, we're going to add a C-style block comment to it. Instead of a complicated regular expression, we can simply edit the file to appear as follows:

Digits='[0-9]+'
Word='[A-Za-z]+'
Whitespace='\s'
Comment<blockEnd="*/",hidden>="/*"

Notice the presence of both single and double quotes in the above. Single quotes denote a regular expression, while double quotes denote a literal. Block ends must be literal, but the start expression for it doesn't have to be. Remember to escape your regular expressions and strings as necessary.

Also see how Comment is declared with angle brackets. The things inside those are attributes, and they can specify modifiers for any lexer rule. We've specified two on Comment: blockEnd and hidden. These specify the termination string and hide the rule's tokens from the tokenizer output, respectively.

We can also explicitly specify numeric ids for the symbols with the id attribute, which takes a numeric literal (no quotes). Don't go too crazy with huge numbers though: each id is used as an array index, so an id of 5000 creates an array with 5000 elements in it.

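For example, assuming the attribute syntax shown above, pinning our symbols to explicit ids might look like this hypothetical spec:

Digits<id=1>='[0-9]+'
Word<id=2>='[A-Za-z]+'
Whitespace<id=3>='\s'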

Those are the only three attributes we currently support, but they should be enough for most purposes. If not, hopefully drilling down here will help you extend it.

We make extensive use of CodeDomUtility, so it might be a good idea to give that Code Project article a quick read - it's short.

Coding this Mess

The bulk of our command line processing and core application is in Program.cs:

string inputfile = null;
string outputfile = null;
string name = null;
var codelanguage = "cs";
string codenamespace = null;
bool noshared = false;
bool lib = false;

These are the application parameters we read from the command line. Most of them are self-explanatory, except name, noshared, and lib. name specifies the name of the generated class, while noshared and lib control whether the dependency code gets generated - specifying either prevents it. The former is useful if you're generating several tokenizers in the same project: you'd generate the first one without noshared, and then subsequent ones with noshared, so that there's only one copy of the base classes and other structures. The latter is useful if we want to use Rolex.exe as an external referenced assembly instead of putting the shared code directly in the project. It makes the generated code use the external library.

We switch in a loop in order to parse each parameter after the first one, using an extra easy way to read command line argument values - we just use the argument that follows the switch in the args[] array. So if we see "/namespace", the next argument is what we fill the codenamespace variable with.

for(var i = 1;i<args.Length;++i)
{
    switch(args[i])
    {
        case "/output":
            if (args.Length - 1 == i) // check if we're at the end
                throw new ArgumentException(string.Format
                ("The parameter \"{0}\" is missing an argument", args[i].Substring(1)));
            ++i; // advance 
            outputfile = args[i];
            break;
        case "/name":
            if (args.Length - 1 == i) // check if we're at the end
                throw new ArgumentException(string.Format
                ("The parameter \"{0}\" is missing an argument", args[i].Substring(1)));
            ++i; // advance 
            name = args[i];
            break;
        case "/language":
            if (args.Length - 1 == i) // check if we're at the end
                throw new ArgumentException(string.Format
                ("The parameter \"{0}\" is missing an argument", args[i].Substring(1)));
            ++i; // advance 
            codelanguage = args[i];
            break;
        case "/namespace":
            if (args.Length - 1 == i) // check if we're at the end
                throw new ArgumentException(string.Format
                ("The parameter \"{0}\" is missing an argument", args[i].Substring(1)));
            ++i; // advance 
            codenamespace = args[i];
            break;
        case "/noshared":
            noshared = true;
            break;
        case "/lib":
            lib = true;
            break;
        default:
            throw new ArgumentException(string.Format("Unknown switch {0}", args[i]));
    }
}

Note that we're throwing exceptions here, and (not shown above) we've wrapped a try/catch block in a DEBUG conditional. This is so that in debug builds we get the exception, instead of it being suppressed and sending us to the usage screen with an error message - which is what happens in release builds. We report all errors by throwing. This keeps things cleaner.

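That wrapper isn't shown above, but it's roughly this shape - a minimal sketch, where _Run() and _PrintUsage() are hypothetical stand-ins for whatever Program.cs actually names those pieces:

static void Main(string[] args)
{
#if DEBUG
    // debug builds: let any exception propagate so the debugger catches it
    _Run(args);
#else
    // release builds: suppress the exception, report it, and show the usage screen
    try { _Run(args); }
    catch (Exception ex)
    {
        Console.Error.WriteLine(ex.Message);
        _PrintUsage();
    }
#endif
}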

We probably could have used a const for the error messages above, but I just copied and pasted for this demo. In the real world, we'd probably want to hold the messages in a resource string table anyway.

The next lines of code that are of any interest are:

var rules = _ParseRules(input);
_FillRuleIds(rules);

These parse the rules, and fill the rule ids, but we'll get to that in a moment. First, what is a rule? Each rule is one line in that spec file we looked at before, giving us four rules. The structure in code is as follows:

class _LexRule
{
    public int Id;
    public string Symbol;
    public KeyValuePair<string, object>[] Attributes;
    public RegexExpression Expression;
    public int Line;
    public int Column;
    public long Position;
}

You can see each rule has quite a few properties. The last three fields aren't used as part of the lex; they just tell us where in the document the rule was, for error reporting. Symbol is the identifier on the left hand side of the equals sign, Attributes are the named values between <>, and Expression is the regular expression DOM/AST that represents the right hand side of the rule. See the prior article for how to use that DOM. We're really only going to use one method off of it though.

The parsing routine uses the latest, as-yet-unposted version of ParseContext exposed from the Regex library. It's functionally the same as the one in that article, but with a minor bugfix or two (mostly naming) and better factoring. It makes a cheesy but expedient move: attribute values are treated as JSON and parsed accordingly, so JSON string escapes and all JSON literals and such are valid here. Anyway, the overall routine just uses recursive descent parsing with the parse context to read the input file into a list of lex rules. Once you're familiar with how ParseContext works, the only really weird part of the routine is where we parse the regular expressions:

if ('\'' == pc.Current)
{
    pc.Advance();
    pc.ClearCapture();
    pc.TryReadUntil('\'', '\\', false);
    pc.Expecting('\'');
    var pc2 = ParseContext.Create(pc.GetCapture());
    pc2.EnsureStarted();
    pc2.SetLocation(pc.Line, pc.Column, pc.Position);
    var rx = RegexExpression.Parse(pc2);
    pc.Advance();
    rule.Expression = rx;
}

We've created a new ParseContext on the capture of our main ParseContext. What this does is make sure our new parse context only "sees" the text between the single quotes. We've effectively scanned and captured between quotes, treated that as its own string, and then created a parse context over it like we normally would. We update the second parse context's position information so it will report error locations properly. We then pass that parse context to the RegexExpression.Parse() method. That's how we tell that class when to stop parsing, since it doesn't know anything about our file format. It just sees regex.

Moving on to _FillRuleIds(): We have to fill in the ids for each of our rules. Some might already be filled in from the input file through id attributes. We have to preserve those and fill in the remaining ids sequentially "around them", such that if a rule was given an id of 5, we have to assign ids to the other rules without using 5 again. We also have to number them upward. What we do is make the last id we saw the new starting point, then increment once for each rule, skipping any ids that are already declared and checking for duplicates, which we cannot have. It's kind of complicated to describe, but intuitive to use and easy to code:

static void _FillRuleIds(IList<_LexRule> rules)
{
    var ids = new HashSet<int>();
    for (int ic = rules.Count, i = 0; i < ic; ++i)
    {
        var rule = rules[i];
        if (int.MinValue!=rule.Id && !ids.Add(rule.Id))
            throw new InvalidOperationException(string.Format(
                "The input file has a rule with a duplicate id at line {0}, column {1}, position {2}",
                rule.Line, rule.Column, rule.Position));
    }
    var lastId = 0;
    for (int ic = rules.Count, i = 0; i < ic; ++i)
    {
        var rule = rules[i];
        if(int.MinValue==rule.Id)
        {
            rule.Id = lastId;
            ids.Add(lastId);
            while(ids.Contains(lastId))
                ++lastId;
        } else
        {
            lastId = rule.Id;
            while (ids.Contains(lastId))
                ++lastId;
        }
    }
}

We have one more important helper method to cover:

static CharFA<string> _BuildLexer(IList<_LexRule> rules)
{
    var exprs = new CharFA<string>[rules.Count];
    for(var i = 0;i<exprs.Length;++i)
    {
        var rule = rules[i];
        exprs[i]=rule.Expression.ToFA(rule.Symbol);
    }
    return CharFA<string>.ToLexer(exprs);
}

Note that this uses our CharFA<TAccept> class from Regex. What it does is take the regular expression for each rule and then tell Regex to turn the lot into a lexer. Again, see the previous article for more about how that works. It's important-ish, but it's too lengthy a topic to cover here. The other _BuildXXXX() methods also take information from the rules and build data from it that we use later.

Let's move on to the meat of Program.cs now that we've processed our command line and read our rules from the input file:

var ccu = new CodeCompileUnit();
var cns = new CodeNamespace();
if (!string.IsNullOrEmpty(codenamespace))
    cns.Name = codenamespace;
ccu.Namespaces.Add(cns);
var fa = _BuildLexer(rules);
var symbolTable = _BuildSymbolTable(rules);
var blockEnds = _BuildBlockEnds(rules);
var nodeFlags = _BuildNodeFlags(rules);
var dfaTable = fa.ToDfaStateTable(symbolTable);
if (!noshared && !lib)
{
    cns.Types.Add(CodeGenerator.GenerateTokenStruct());
    cns.Types.Add(CodeGenerator.GenerateTableTokenizerBase());
    cns.Types.Add(CodeGenerator.GenerateTableTokenizerEnumerator());
    cns.Types.Add(CodeGenerator.GenerateDfaEntry());
    cns.Types.Add(CodeGenerator.GenerateDfaTransitionEntry());
}
cns.Types.Add(CodeGenerator.GenerateTableTokenizer
             (name,dfaTable,symbolTable,blockEnds,nodeFlags));
if (lib)
    cns.Imports.Add(new CodeNamespaceImport("Rolex"));
var prov = CodeDomProvider.CreateProvider(codelanguage);
var opts = new CodeGeneratorOptions();
opts.BlankLinesBetweenMembers = false;
opts.VerbatimOrder = true;
prov.GenerateCodeFromCompileUnit(ccu, output, opts);

Mostly, we're just building all of our data here: telling Regex to make us a DFA state table, optionally generating the shared source base classes, and then passing everything in that mess we just made to GenerateTableTokenizer() in CodeGenerator.cs, which we'll cover shortly. We take the stuff that was built and add it to our CodeCompileUnit, from which we generate our output. That covers all the important aspects of Program.

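One more thing worth seeing before we move on: the Token that the generated tokenizer yields. GenerateTokenStruct() emits it; judging from how it's used later in this article, it's essentially this (a sketch, not the generated code verbatim):

public struct Token
{
    public int SymbolId;  // the id of the rule that matched, or -1 (error)
    public string Value;  // the matched text
    public int Line;      // one based line where the match started
    public int Column;    // one based column where the match started
    public long Position; // zero based position where the match started
}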

CodeGenerator.cs is on the larger side, but it would be much larger still without CodeDomUtility, aliased as CD in this file. Reading code that calls into CodeDomUtility can be a bit difficult at first, but as your eyes adjust, so to speak, it gets a good deal easier. Most of these routines spit out static code, which is simply a generated version of the reference and library code in Tokenizer.cs. The reason we generate it even though it's not dynamic is that this way it can be produced in any .NET language for which there is a compliant CodeDOM provider.

The main public routine that generates dynamic code is GenerateTableTokenizer(). What it does is serialize the passed-in arrays to static fields on a new class that inherits from TableTokenizer, and simply pass the constructor arguments along to the base class's constructor. The base class is either present in the generated source, or included as part of the referenced assembly, Rolex.exe, if you're using it that way. Each of the static generation methods, meanwhile, has an equivalent implementation in C# in Tokenizer.cs, which we'll explore right after a quick sketch of the generated output.

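Here's roughly what that generated class comes out like for our demo.rl spec - a hedged sketch with the serialized table data elided, since the real initializers run for pages, and with field names that are assumptions rather than the literal generated names:

using System.Collections.Generic;

public class demo : TableTokenizer
{
    // generated from the DFA, block end, and node flag data in the spec
    static DfaEntry[] _DfaTable = new DfaEntry[] { /* ...serialized state table... */ };
    static string[] _BlockEnds = new string[] { null, null, null, "*/" };
    static int[] _NodeFlags = new int[] { 0, 0, 0, 1 }; // only Comment is hidden
    public demo(IEnumerable<char> input) :
        base(_DfaTable, _BlockEnds, _NodeFlags, input) { }
}

With that shape in mind, here's the reference TableTokenizer from Tokenizer.cs: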

public class TableTokenizer : IEnumerable<Token>
{
    // our state table
    DfaEntry[] _dfaTable;
    // our block ends (specified like comment<blockEnd="*/">="/*" in a rolex spec file)
    string[] _blockEnds;
    // our node flags. Currently only used for the hidden attribute
    int[] _nodeFlags;
    // the input cursor. We can get this from a string, a char array, or some other source.
    IEnumerable<char> _input;
    /// <summary>
    /// Retrieves an enumerator that can be used to iterate over the tokens
    /// </summary>
    /// <returns>An enumerator that can be used to iterate over the tokens</returns>
    public IEnumerator<Token> GetEnumerator()
    {
        // just create our table tokenizer's enumerator, passing all of the relevant stuff
        // it's the real workhorse.
        return new TableTokenizerEnumerator
               (_dfaTable, _blockEnds, _nodeFlags, _input.GetEnumerator());
    }
    // legacy collection support (required)
    System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()
        => GetEnumerator();
    /// <summary>
    /// Constructs a new instance
    /// </summary>
    /// <param name="dfaTable">The DFA state table to use</param>
    /// <param name="blockEnds">The block ends table</param>
    /// <param name="nodeFlags">The node flags table</param>
    /// <param name="input">The input character sequence</param>
    public TableTokenizer(DfaEntry[] dfaTable,
           string[] blockEnds,int[] nodeFlags,IEnumerable<char> input)
    {
        if (null == dfaTable)
            throw new ArgumentNullException(nameof(dfaTable));
        if (null == blockEnds)
            throw new ArgumentNullException(nameof(blockEnds));
        if (null == nodeFlags)
            throw new ArgumentNullException(nameof(nodeFlags));
        if (null == input)
            throw new ArgumentNullException(nameof(input));
        _dfaTable = dfaTable;
        _blockEnds = blockEnds;
        _nodeFlags = nodeFlags;
        _input = input;
    }
}

What the heck? Strip away the comments and there's almost nothing there but some member fields and a constructor. Note however, that the GetEnumerator() method returns a TableTokenizerEnumerator that it passes all of its member data to. If you've ever written a collection class, then you already know that this method enables you to write foreach(var token in myTokenizer) to lex the input, which we'll get to, but first, let's look at this monster, our token enumerator.

class TableTokenizerEnumerator : IEnumerator<Token>
{
    // our error symbol. Always -1
    public const int ErrorSymbol= -1;
    // our end of stream symbol - returned by _Lex() and used internally but not reported
    const int _EosSymbol = -2;
    // our disposed state indicator
    const int _Disposed = -4;
    // the state indicates the cursor is before the beginning (initial state)
    const int _BeforeBegin = -3;
    // the state indicates the cursor is after the end
    const int _AfterEnd = -2;
    // the state indicates that the inner input enumeration has finished 
    // (we still have one more token to report)
    const int _InnerFinished = -1;
    // indicates we're currently enumerating. 
    // We spend most of our time and effort in this state
    const int _Enumerating = 0;
    // indicates the tab width, used for updating the Column property when we encounter a tab
    const int _TabWidth = 4;
    // the DFA state table to use.
    DfaEntry[] _dfaTable;
    // the blockEnds to use
    string[] _blockEnds;
    // the nodeFlags to use
    int[] _nodeFlags;
    // the input cursor
    IEnumerator<char> _input;
    // our state 
    int _state;
    // the current token
    Token _current;
    // a buffer used primarily by _Lex() to capture matched input
    StringBuilder _buffer;
    // the one based line
    int _line;
    // the one based column
    int _column;
    // the zero based position
    long _position;
...

What a beast! In here, we have several constants, the DFA table, block ends, and node flags, the input cursor, a state indicator, a current Token, a string buffer, and some text location information. Yes, Ramona, we need it all.

Like almost any enumerator class, the heart of ours is MoveNext(), which simply reads the next token:

public bool MoveNext()
{
    // if we're not enumerating
    if(_Enumerating>_state)
    {
        if (_Disposed == _state)
            _ThrowDisposed();
        if (_AfterEnd == _state)
            return false;
        // we're okay if we got here
    }
    _current = default(Token);
    _current.Line = _line;
    _current.Column = _column;
    _current.Position = _position;
    _buffer.Clear();
    // lex the next input
    _current.SymbolId = _Lex();
    // now look for hiddens and block ends
    var done = false;
    while (!done)
    {
        done = true;
        // if we're on a valid symbol
        if (ErrorSymbol < _current.SymbolId)
        {
            // get the block end for our symbol
            var be = _blockEnds[_current.SymbolId];
            // if it's valid
            if (!string.IsNullOrEmpty(be))
            {
                // read until we find it or end of input
                if (!_TryReadUntilBlockEnd(be))
                    _current.SymbolId = ErrorSymbol;
            } 
            // node is hidden?
            if (ErrorSymbol<_current.SymbolId && 0 != (_nodeFlags[_current.SymbolId] & 1))
            { 
                // update the cursor position and lex the next input, skipping this one
                done = false;
                _current.Line = _line;
                _current.Column = _column;
                _current.Position = _position;
                _buffer.Clear();
                _current.SymbolId = _Lex();
            }
        }    
    }
    // get what we captured
    _current.Value = _buffer.ToString();
    // update our state if we hit the end
    if (_EosSymbol == _current.SymbolId)
        _state = _AfterEnd;
    // return true if there's more to report
    return _AfterEnd!=_state;
}

Almost all of the complication here involves dealing with block ends and hidden tokens toward the end of the routine. Indeed, there's not much aside from that and some bookkeeping - oh, and the calls to _Lex(), the heart and soul of our tokenizer/lexer. That routine simply scans the input using the DFA table and reports what it found each time it is called, advancing the cursor. It reports the text and line information in _buffer, _line, _column, and _position, and the symbol id of the match as its return value - on error returning -1 (ErrorSymbol).

// lex the next token
public int _Lex()
{
    // our accepting symbol id
    int acceptSymbolId;
    // the DFA state we're currently in (start at zero)
    var dfaState = 0;
    // corner case for beginning
    if (_BeforeBegin == _state)
    {
        if (!_MoveNextInput()) // empty input.
        {
            // if we're on an accepting state, return that
            // otherwise, error
            acceptSymbolId = _dfaTable[dfaState].AcceptSymbolId;
            if (-1 != acceptSymbolId)
                return acceptSymbolId;
            else
                return ErrorSymbol;
        }
        // we're enumerating now
        _state = _Enumerating;
    }
    else if (_InnerFinished == _state || _AfterEnd == _state)
    {
        // if we're at the end just return the end symbol
        return _EosSymbol;
    }
    // Here's where we run most of the match. we run one iteration of the DFA state machine.
    // We match until we can't match anymore (greedy matching) 
    // and then report the symbol of the last 
    // match we found, or an error ("#ERROR") if we couldn't find one.
    var done = false;
    while (!done)
    {
        var nextDfaState = -1;
        // go through all the transitions
        for (var i = 0; i < _dfaTable[dfaState].Transitions.Length; i++)
        {
            var entry = _dfaTable[dfaState].Transitions[i];
            var found = false;
            // go through all the ranges to see if we matched anything.
            for (var j = 0; j < entry.PackedRanges.Length; j++)
            {
                var ch = _input.Current;
                // grab our range from the packed ranges into first and last
                var first = entry.PackedRanges[j];
                ++j;
                var last = entry.PackedRanges[j];
                // do a quick search through our ranges
                if ( ch <= last)
                {
                    if (first <= ch) 
                        found = true;
                    j = int.MaxValue - 1; // break
                }
            }
            if (found)
            {
                // set the transition destination
                nextDfaState = entry.Destination;
                i = int.MaxValue - 1; // break
            }
        }

        if (-1 != nextDfaState) // found a valid transition
        {
            // capture our character
            _buffer.Append(_input.Current);
            // and iterate to our next state
            dfaState = nextDfaState;
            if (!_MoveNextInput())
            {
                // end of stream, if we're on an accepting state,
                // return that, just like we do on empty string
                // if we're not, then we error, just like before
                acceptSymbolId = _dfaTable[dfaState].AcceptSymbolId;
                if (-1 != acceptSymbolId) // do we accept?
                    return acceptSymbolId;
                else
                    return ErrorSymbol;
            }
        }
        else
            done = true; // no valid transition, we can exit the loop
    }
    // once again, if the state we're on is accepting, return that
    // otherwise, error, almost as before with one minor exception
    acceptSymbolId = _dfaTable[dfaState].AcceptSymbolId;
    if(-1!=acceptSymbolId)
    {
        return acceptSymbolId;
    }
    else
    {
        // handle the error condition
        // we have to capture the input 
        // here and then advance or the 
        // machine will never halt
        _buffer.Append(_input.Current);
        _MoveNextInput();
        return ErrorSymbol;
    }
}

It looks like it's doing a lot, but it really isn't, nor should it be, as this code is speed critical. This is an inner loop of an inner loop of an inner loop when used in a parser. It has to be fast.

How it works is based on those graphs we saw in the previous article. Those are baked into our state table, _dfaTable[]. Each entry in the table contains Transitions and an AcceptSymbolId. Each transition contains a set of packed character ranges (stored as adjacent pairs of characters in a character array) and a destination state index. In the inner for loops, we're traversing the packed ranges of our transitions to see if the character falls within any one of them. If we find one, we set the dfaState variable to the next state and continue. We do this until we can't match any more transitions or we run out of input. Once that happens, if we're on an accepting state (_dfaTable[dfaState].AcceptSymbolId is not -1), we report success and return that symbol. If we're not, we report ErrorSymbol (-1). It's actually straightforward to traverse those states. The trick is generating the data for those tables in the first place, but Regex did all the work for us, creating them in ToLexer() as we saw earlier, from the RegexExpression Expression fields that _ParseRules() built on those _LexRules. Talk about the house that Jack built! Still, here we are finally.

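To make the packed range layout concrete, here's roughly what a transition for our Digits rule's [0-9] might hold - a hypothetical, hand-initialized entry, since the real tables are machine generated:

// matches '0' through '9': ranges are packed as adjacent [first, last] pairs
var entry = new DfaTransitionEntry();
entry.PackedRanges = new char[] { '0', '9' }; // one pair; [0-9A-F] would pack as '0','9','A','F'
entry.Destination = 1; // the index of the DFA state to move to on a match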

The rest of the code in that reference implementation is support code, like _MoveNextInput(), which advances _input's cursor position by one and tracks the text position, line, and column, or _TryReadUntilBlockEnd(), which does exactly what it says - it tries to read until it matches the specified block end (see earlier).

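_MoveNextInput() isn't reproduced here, but based on the fields we've seen, it's roughly the following - a sketch under stated assumptions; the real one also handles corner cases like the before-begin state and '\r':

// advance the input cursor by one and track line/column/position - a sketch
bool _MoveNextInput()
{
    if (_input.MoveNext())
    {
        ++_position;
        switch (_input.Current)
        {
            case '\n': _column = 1; ++_line; break; // a newline resets the column
            case '\t': _column += _TabWidth; break; // tabs advance by the tab width
            default: ++_column; break;
        }
        return true;
    }
    _state = _InnerFinished; // out of input, but we may still have one token to report
    return false;
}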

Let's finally delve into more of CodeGenerator.cs.

static CodeMemberMethod _GenerateLexMethod()
{
    var state = CD.FieldRef(CD.This, "_state");
    var input = CD.FieldRef(CD.This, "_input");
    var inputCurrent = CD.PropRef(input, "Current");
    var dfaTable = CD.FieldRef(CD.This, "_dfaTable");
    var dfaState = CD.VarRef("dfaState");
    var acc = CD.VarRef("acc");
    var done = CD.VarRef("done");
    var currentDfa = CD.ArrIndexer(dfaTable, dfaState);
    var invMoveNextInput = CD.Invoke(CD.This, "_MoveNextInput");
    var result = CD.Method(typeof(int), "_Lex");
    result.Statements.AddRange(new CodeStatement[] {
        CD.Var(typeof(int),"acc"),
        CD.Var(typeof(int),"dfaState",CD.Zero),
        CD.IfElse(CD.Eq(CD.Literal(_BeforeBegin),state),new CodeStatement[] {
            CD.If(CD.Not(invMoveNextInput),
                CD.Let(acc,CD.FieldRef(currentDfa,"AcceptSymbolId")),
                CD.IfElse(CD.NotEq(CD.NegOne,acc),new CodeStatement[] {
                    CD.Return(acc)
                },
                    CD.Return(CD.Literal(_ErrorSymbol))
                )
            ),
            CD.Let(state,CD.Literal(_Enumerating))
        },
            CD.If(CD.Or(CD.Eq(CD.Literal(_InnerFinished),state),
                  CD.Eq(CD.Literal(_AfterEnd),state)),
                CD.Return(CD.Literal(_EosSymbol))
            )
        ),
        CD.Var(typeof(bool),"done",CD.False),
        CD.While(CD.Not(done),
            CD.Var(typeof(int),"next",CD.NegOne),
            CD.For(CD.Var(typeof(int),"i",CD.Zero),CD.Lt(CD.VarRef("i"),
            CD.PropRef(CD.FieldRef(currentDfa,"Transitions"),"Length")),
            CD.Let(CD.VarRef("i"),CD.Add(CD.VarRef("i"),CD.One)),
                CD.Var("DfaTransitionEntry","entry",
                CD.ArrIndexer(CD.FieldRef(currentDfa,"Transitions"),CD.VarRef("i"))),
                CD.Var(typeof(bool),"found",CD.False),
                CD.For(CD.Var(typeof(int),"j",CD.Zero),CD.Lt(CD.VarRef("j"),
                CD.PropRef(CD.FieldRef(CD.VarRef("entry"),"PackedRanges"),"Length")),
                CD.Let(CD.VarRef("j"),CD.Add(CD.VarRef("j"),CD.One)),
                    CD.Var(typeof(char),"ch",inputCurrent),
                    CD.Var(typeof(char),"first",CD.ArrIndexer(CD.FieldRef(CD.VarRef("entry"),
                    "PackedRanges"),CD.VarRef("j"))),
                    CD.Let(CD.VarRef("j"),CD.Add(CD.VarRef("j"),CD.One)),
                    CD.Var(typeof(char),"last",CD.ArrIndexer(CD.FieldRef(CD.VarRef("entry"),
                    "PackedRanges"),CD.VarRef("j"))),
                    CD.If(CD.Lte(CD.VarRef("ch"),CD.VarRef("last")),
                        CD.If(CD.Lte(CD.VarRef("first"),CD.VarRef("ch")),
                            CD.Let(CD.VarRef("found"),CD.True)
                        ),
                        CD.Let(CD.VarRef("j"),CD.Literal(int.MaxValue-1))
                    )
                ),
                CD.If(CD.Eq(CD.VarRef("found"),CD.True),
                    CD.Let(CD.VarRef("next"),
                    CD.FieldRef(CD.VarRef("entry"),"Destination")),
                    CD.Let(CD.VarRef("i"),CD.Literal(int.MaxValue-1))
                )
            ),
            CD.IfElse(CD.NotEq(CD.VarRef("next"),CD.NegOne),new CodeStatement[] {
                CD.Call(CD.FieldRef(CD.This,"_buffer"),"Append",inputCurrent),
                CD.Let(dfaState,CD.VarRef("next")),
                CD.If(CD.Not(invMoveNextInput),
                    CD.Let(acc,CD.FieldRef(currentDfa,"AcceptSymbolId")),
                    CD.IfElse(CD.NotEq(acc,CD.NegOne), new CodeStatement[] {
                        CD.Return(acc)
                    },
                        CD.Return(CD.Literal(_ErrorSymbol))
                    )
                )
            },
                CD.Let(done,CD.True)
            )
        ),
        CD.Let(acc,CD.FieldRef(currentDfa,"AcceptSymbolId")),
        CD.IfElse(CD.NotEq(acc,CD.NegOne), new CodeStatement[] {
            CD.Return(acc)
        },
            CD.Call(CD.FieldRef(CD.This,"_buffer"),"Append",inputCurrent),
            CD.Call(CD.This,"_MoveNextInput"),
            CD.Return(CD.Literal(_ErrorSymbol))
        )
    });
    return result;
}

Wow, what a cryptic nightmare. Wait. Put on your x-ray specs, and just compare it to the method it generates (which I provided in the code segment just prior). Look at how it matches the structure of that code. You'll see things like:

CD.Call(CD.FieldRef(CD.This,"_buffer"),"Append",inputCurrent),
CD.Let(dfaState,CD.VarRef("next")),

This translates literally (at least in C#) to:

_buffer.Append(_input.Current);
dfaState = next;

We cheated a little and predeclared next and inputCurrent, but the structure of the code is there. The entire routine is laid out like this. It helps to be familiar with the CodeDOM. The entire class is documented, so you should get tool tips on what each method does. If you compare it to the reference code, you'll see the overall structure is the same. It's kind of verbose, but nowhere near as bad as the raw CodeDOM. You'll see I'm in the habit of declaring things inline, even using inline arrays when I have to. Laying it out like this allowed me to directly mirror the structure of the reference code, which made creating this easier and makes maintaining it easier, since the reference source is now an effective "master document". I hope that makes sense. Normally, using the CodeDOM, your code generation routine looks nothing like the code it's supposed to generate. This aims to rectify some of that. Now the generating routine looks a lot, structurally, like the code it generates. All the ifs and other declarations are inline where they would be in the generated source.

See? What seemed like black magic is just a little spell to make maintenance less of a burden.

It's kind of funny, but we've gotten this far and we still haven't covered using the tool.

The rolex.exe usage screen describes the following options (an example invocation follows the list):
  • inputfile - required, indicates the input lexer specification as described earlier.

  • outputfile - optional, the output file to generate - defaults to stdout. Using stdout isn't recommended, because the console can "cook" special characters, like Unicode characters, scrambling them and ruining the DFA tables. It doesn't always happen, but when it does it's no fun, and it can be hard to track down.

  • name - optional, this is the name of the class to generate. It can be taken from the filename if not specified.

  • codelanguage - optional, the language of the code to generate. Most systems support VB and CS (C#), but some systems may have other languages. YMMV, as this should work with many, if not most, but it's hard to know in advance which languages it will work with.

  • codenamespace - optional, if specified, indicates the .NET namespace under which the code will be generated

  • noshared - optional, if specified, indicates that no shared code will be included with the generated code. This can be useful when one has more than one tokenizer per project. The first tokenizer source file can be generated without this switch, and then subsequent tokenizer source files should be generated with this switch, so there's only one copy of the shared code. If lib is specified, shared code will never be generated.

  • lib - optional, if specified, indicates the use of rolex.exe as an assembly reference in the project. This will not add the assembly reference, but it will set up the code so that it can use it. Using rolex.exe probably requires Regex.dll as well in the end distribution, but I haven't verified that for certain. I don't really recommend this, as it's just extra baggage compared to an easy-to-generate bit of shared code that one can simply include in their project. That's why this option isn't on by default. You can actually make your own shared library, simply by generating a rolex spec into the "Rolex" namespace, deleting the non-shared code, and compiling that. Doing so doesn't require the above assemblies and will still allow you to use the lib option with the new library you just built.

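Putting that together, generating the tokenizer for our demo spec might look like this - the input file is the first, positional argument, per the parsing loop we saw in Program.cs, and the namespace value here is just an example:

rolex.exe demo.rl /output demo.cs /name demo /namespace RolexDemo /language cs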

Well, now that we've covered how to generate from the command line, let's explore using the code - we've generated this from the lexer spec above but have removed the "hidden" attribute so that comments show up:

// for display in the console - your code doesn't need this
// as you can get to the constants from demo.Digits/demo.Word, etc.
var syms = new string[] { "Digits", "Word", "Whitespace", "Comment" };
// a test string with a deliberate error at the end 
// (just for completeness)
var test = "baz123/***foo/***/ /**/bar1foo/*/";

// create our tokenizer over the test string
var tokenizer = new demo(test);
// enumerate the tokens and dump them to the console
foreach (var token in tokenizer)
    Console.WriteLine(
        "{0}: \"{1}\" at line {3}, column {4}, position {2}", 
        token.SymbolId != -1 ? syms[token.SymbolId] : "#ERROR", 
        token.Value, 
        token.Position, 
        token.Line, 
        token.Column);

This results in the following output:

Word: "baz" at line 1, column 1, position 0
Digits: "123" at line 1, column 4, position 3
Comment: "/***foo/***/" at line 1, column 7, position 6
Whitespace: " " at line 1, column 19, position 18
Comment: "/**/" at line 1, column 20, position 19
Word: "bar" at line 1, column 24, position 23
Digits: "1" at line 1, column 27, position 26
Word: "foo" at line 1, column 28, position 27
#ERROR: "/*/" at line 1, column 31, position 30

That's easy. However, in a real parser, you probably won't use foreach. Instead, you'll drive IEnumerator<Token> manually, just as our tokenizer drives IEnumerator<char> manually. Still, it's only a couple of methods and a property that matter, so that's not a big deal either.

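For instance, a parser-style loop over the same tokenizer might look like the following - a sketch that uses only the standard IEnumerator<Token> members:

// drive the enumerator by hand, the way a parser would
using (IEnumerator<Token> cursor = tokenizer.GetEnumerator())
{
    while (cursor.MoveNext())
    {
        Token token = cursor.Current;
        // a real parser would switch on token.SymbolId here
        // and decide how far to advance based on what it sees
        Console.WriteLine("{0}: {1}", token.SymbolId, token.Value);
    }
}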

We'll cover that in a future article on how to build a parser (again).

I hope this demystified tools like lex/flex and bison, if nothing else - they do essentially the same thing. Also, maybe you'll go on to make a better lexer on your own, now that you hopefully know how this one works.

History

  • 27th November, 2019 - Initial submission

Translated from: https://www.codeproject.com/Articles/5252200/How-to-Build-a-Tokenizer-Lexer-Generator-in-Csharp
