C＃中的正则表达式引擎

最新推荐文章于 2024-08-21 19:30:21 发布

寒冰屋

最新推荐文章于 2024-08-21 19:30:21 发布

阅读量618

点赞数

分类专栏： CSharp.NET 架构及框架文章标签： NFA DFA 正则表达式

CSharp.NET 同时被 2 个专栏收录

1245 篇文章 61 订阅

订阅专栏

架构及框架

635 篇文章 16 订阅

订阅专栏

下载示例，实现源和二进制文件 - 109.8 KB

介绍

为什么还要在.NET包含丰富的正则表达式库时生成正则表达式引擎？

首先，正则表达式引擎有两大类——回溯和非回溯。大多数现代正则表达式引擎都会进行回溯。回溯引擎比非回溯引擎具有更强的表现力，但是以性能为代价。

为什么还要制作非回溯引擎？

好吧，对于初学者来说，他们一遍又一遍地在短线上执行回溯引擎。其次，它们更适合直接在流式源上操作，而回溯解析器通常需要将文本预加载到内存中。

更一般地说，.NET提供的正则表达式引擎的样式适用于匹配长字符串中的模式。它不适合stream中的词法分析令牌。词法分析对于解析器至关重要，因此用于解析终端的正则表达式通常使用适合词法分析的非回溯引擎来实现。

背景

词法分析是将输入的字符流分解为“标记”的过程——string具有与它们相关联的“符号”的字符。符号表示它的string类型。例如，字符串“124.35”可能会被报告为符号“NUMBER”，而string “foo”可能会被报告为符号“IDENTIFIER”。

解析器通常使用下面的词法分析器，然后将传入的符号组合成语法树。因为在核心解析代码中调用词法分析器，所以词法分析操作必须合理有效。.NET正则表达式在这里并不适用，虽然它可以工作，但实际上会增加代码复杂性，同时降低性能。

此项目中包含一个名为“FA.cs” 的文件，其中包含使用有限自动机的正则表达式引擎的核心代码，该引擎解析为简陋且无处不在的状态机。

有限状态机由状态图组成。每个状态可以使用“输入转换”或“epsilon转换”引用另一个状态。输入转换在移动到目标之前消耗指定的输入。epsilon转换移动到目标状态而不消耗或检查任何输入。具有一个或多个epsilon转换的机器被认为是“非确定性的”（NFA），并且在任何给定时间可以处于多于一种状态。非确定性状态机更复杂，效率更低，但幸运的是，任何非确定性状态机都有一个确定性状态机（DFA），以及使用称为幂集构造的技术将NFA转换为DFA的算法——数学，这超出了探索的范围。

状态的闭包是状态本身以及从该状态到达的所有状态的唯一集合，在epsilon或输入转换上。闭包表示根状态指示的状态机。这些图可以是递归的。

匹配“foo” 的状态机图表如下：

q 0的闭包是{q 0，q 1，q 2，q 3 }，因为这些状态中的每一个直接或间接地连接到q 0。

每次越过黑色箭头时，输入必须与箭头上方的字符匹配才能沿着该路径前进。一旦遇到双圈（q 3），机器“接受”输入为有效/匹配并返回状态名称下的令牌，在这种情况下，“foo”在q 3下，因此一旦输入将返回匹配。

现在让我们匹配表单的标识符 [A-Z_a-z][0-9A-Z_a-z]*

如您所见，循环非常简单。

现在让我们得到有用的。词法分析器（aka tokenizers）区分了许多不同输入模式中的一种。让我们构建一个表示标识符，整数或空格的表达式。

您会注意到有两种方法可以匹配int（q 2，q 4），但它们每个都会返回相同的符号“ int”。

所呈现的每台机器都是确定性的（每个都是DFA）。DFA机器一次只能处于一种状态。还有非确定性机器或NFA，它们可以同时处于多个状态！

你会注意到这台机器上有灰色虚线。这些是epsilon上的转换或epsilon转换。在这种情况下，Epsilon只是意味着“没有输入”。每次机器遇到虚线箭头时，它自动处于该箭头指向的状态以及它自身。可以说它没有输入就转换了。

从q 0开始也会让你进入q 1，q 4，q 5，q 7，q 8和q 11！这组状态，即在epsilon转换时从该状态可到达的状态集称为epsilon闭包。

这些NFA机器更容易由其他几台机器组成/构建，但由于它们更复杂，因此难以运行。

如上所述，每台NFA机器都有一台DFA机器，还有一台将NFA转换为DFA的算法。

这段代码允许您轻松创建这些机器，制作如上所述的漂亮图形（借助Graphviz），生成超快速代码以运行给定状态机，在运行时运行状态机而不首先生成代码，将正则表达式转换为机器并再次返回，以及检查和组合机器。

使用代码

以下内容包含在示例项目中，并进行了大量注释，以向您展示使用此库的基础知识：

//
// See the included project for a full sample
//
var lexer = FA.Lexer(
	// our id from the article
	FA.Parse(@"[A-Z_a-z][0-9A-Z_a-z]*", "id"),
	// our int from the article
	FA.Parse(@"0|([\-]?[1-9][0-9]*)", "int"),
	// our space from the article
	FA.Parse(@" ", "space")
	);

// transform the lexer into its DFA equivalent
var dfaLexer = lexer.ToDfa();

// make sure there are no duplicate states left after the transformation.
// this can be an expensive op and isn't strictly necessary, so it's a 
// separate step. It *should* be done before code generation.
// If code generation is done on an NFA, it is automatically transformed to
// a DFA and duplicates are removed during that process
dfaLexer.TrimDuplicates();

Console.WriteLine("Rendering graphs");
try
{
	// try to render the graphs (requires GraphViz - specifically dot.exe)
	// to change the format, simply use a different extension like PNG or SVG
	lexer.RenderToFile(@"..\..\..\lexer_nfa.jpg");
	dfaLexer.RenderToFile(@"..\..\..\lexer_dfa.jpg");
}
catch
{
	Console.WriteLine("GraphViz is not installed.");
}

// generate the code (can generate from either an NFA or DFA, DFA is slightly faster)

// generate code direct to C# 
// we can use the CodeDOM here if CODEDOM compile constant is defined
// but to keep things simple we won't here.
Console.WriteLine("Generating C# Code");
using (var tw = new StreamWriter(@"..\..\..\Generated.cs"))
{
	tw.WriteLine("namespace {0}", typeof(Program).Namespace);
	tw.WriteLine("{");
	tw.WriteLine("partial class Program {");
	// generate TryLexValue
	dfaLexer.WriteCSharpTryLexMethodTo(tw, "Value");
	tw.WriteLine();
	// generate LexValue
	dfaLexer.WriteCSharpLexMethodTo(tw, "Value");
	tw.WriteLine();
	tw.WriteLine("}");
	tw.WriteLine("}");
}

// our test string
var test = "foo 456 bar 123 baz";

Console.WriteLine("Runtime Lexing:");

// Create a ParseContext over the string
// See https://www.codeproject.com/Articles/1280230/Easier-Hand-Rolled-Parsers
// ParseContext handles the input Cursor, the Position, Line and Column counting 
// as well as error handing. It operates over an enumeration of characters, a string
// or a TextReader. The main members are Advance(), Current, Expecting(), Position
// Line, Column, Capture and CaptureCurrent()

// it's easy enough to add code to generate lexers that do not require this class
// but we would lose error handling and position tracking, so it's not implemented.
// to use other interfaces, wrap them with ParseContext.Create()
var pc = ParseContext.Create(test);

pc.EnsureStarted(); // make sure we're on the first character.

try
{
	// ParseContext.Current works like a TextReader.Peek() does, returning an int
	// without advancing the cursor. Unline Peek() it is not problematic.
	while (-1 != pc.Current) // loop until end of input. 
	{
		// runtime lexing (can be NFA or DFA - DFA is intrinsically more efficient)
		// if we pass it a StringBuilder, it will reuse it - perf is better, 
		// but we don't bother here.
		var tok = dfaLexer.Lex(pc); // lex the next token
									// symbol is in tok.Key, value is in tok.Value
		Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
	}
}
catch(ExpectingException eex)
{
	// We got a lexing error, so report it. The message might be long, but you can always
	// build your own using ExpectingException members.
	Console.WriteLine("Error: {0}", eex.Message);
}
Console.WriteLine();

Console.WriteLine("Compiled Lexing:");

pc = ParseContext.Create(test); // restart the cursor
pc.EnsureStarted();
try
{				
	// reusing a stringbuilder allows the lex routine to run a little faster
	// don't use it for anything else if you're passing it to lex.
	var sb = new StringBuilder(); // don't have to use this, can pass null
	
	while (-1 != pc.Current) // loop until end of input. 
	{
		// compiled lexing (always DFA - most efficient option)
		// Generated.cs must be present and valid
		var tok = LexValue(pc,sb); // lex the next token
		Console.WriteLine("Token: {0}, Value: {1}", tok.Key, tok.Value);
	}
}
catch (ExpectingException eex)
{
	Console.WriteLine("Error: {0}", eex.Message);
}

我们现在来看看生成的代码：

q1:
    if((pc.Current>='0'&& pc.Current<='9')||
        (pc.Current>='A'&& pc.Current<='Z')||
        (pc.Current=='_')||
        (pc.Current>='a'&& pc.Current<='z')) {
        sb.Append((char)pc.Current);
        pc.Advance();
        goto q1;
    }
    return new System.Collections.Generic.KeyValuePair<System.String,string>("id",sb.ToString());

如您所见，这是状态q 1，它对应于上图中我们的“id”标记的最后一部分。

您可以看到从此状态编码到if的输出转换。

你可以看到箭头指向自己使用goto。

您可以看到，如果if 条件不满足，则以成功返回令牌的结果结束。

如果您查找它们，所有这些属性都会反映在上图中。