How to Build a Regex Engine in C#
Update:
See my Unicode enabled offering here
Prerequisites
The project solution is in Visual Studio 2017 format. It's compiled as a standard library and a Core executable.
Furthermore, in order to get rendering support, you'll need to install GraphViz which this code uses to render state graphs into images. All you should have to do is run the installer for GraphViz, and it should be in your path. You do not need GraphViz for using most of the regex functionality. You only need it to make pretty pictures, like the images in this article.
Since we're using .NET Standard and .NET Core, we also reference the Code DOM NuGet package. This is necessary for the code generation facilities.
Introduction
This is an ambitious article. The goal is to walk you through the building of a fully featured regular expression engine and code generator. The code contains a complete and ready to use regular expression engine, with plenty of comments and factoring to help you through the source code.
First of all, you might be wondering why we would develop one in the first place. Aside from the joy of learning how regular expressions work under the hood, there's also a gap in the .NET framework's regular expression classes which this project fills nicely. This will be explained in the next section.
I've previously written a regular expression engine for C# which was published here, but I did not explain the mechanics of the code. I just went over a few of the basic principles. Here, I aim to drill down into a newer, heavily partitioned library that should demystify the beast enough that you can develop your own or extend it.
I didn't skimp on optimizations, despite the added complication in the source. I wanted you to have something you could potentially use "out of the box." That said, providing you with the end library is a secondary goal. The primary goal is to teach you how to build one yourself.
Conceptualizing this Mess
This article assumes passing familiarity with common regular expression notation. You don't have to be an expert, but knowing what ?, *, +, |, [], and () do is very helpful. If you only know a subset, that's okay too, as long as you understand the basic idea behind regular expressions and what they're for.
What you might not know, is the theory behind their operation, which we'll cover here.
At its core, a regular expression is a finite state machine based on a directed graph. That is to say, it is a network of nodes, connected to other nodes along directional paths. This is easier to visualize than to describe:
This regular expression graph matches "foo" or "bar". The regular expression for this would be foo|bar.
Here's how it works: We always start at q0. From there, we have a series of black arrows pointing away. Each one is labeled with a character. Well, we examine the first character of the string to match. If it is "f" we proceed to q1. If it is "b" we proceed to q4. Any time we move along a black arrow, we increment the input string's cursor position by one. On q1 or q4, we will be examining the next input character, and so on. This continues until we can no longer match any characters. If at that point, we are on a state with a double circle (an "accepting state"), we have matched. If we are on some single circle state at that point, then we have not matched.
We can create these graphs based on parsing a regular expression, and then use code to "run" these graphs, such that they can match input strings.
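To make this concrete before we get to the real classes, here is a minimal sketch of what "running" a graph looks like. This is not the library's API - it assumes a bare-bones DfaState class with a char-keyed transition dictionary - but it shows the walk described above: follow arrows while you can, then check whether you stopped on a double-circled state.

using System.Collections.Generic;

// hypothetical minimal DFA state - not the library's CharFA<TAccept> class
class DfaState
{
    public bool IsAccepting;
    public Dictionary<char, DfaState> Transitions = new Dictionary<char, DfaState>();

    // walk the graph over the input, exactly as described above
    public static bool Run(DfaState start, string input)
    {
        var state = start;
        foreach (var ch in input)
        {
            DfaState next;
            if (!state.Transitions.TryGetValue(ch, out next))
                break; // no arrow for this character - stop walking
            state = next; // follow the arrow, consuming one character
        }
        // we matched if we stopped on a double-circled (accepting) state
        return state.IsAccepting;
    }
}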
Speaking of matching, there are two significantly different ways to use a regular expression engine. The first, and perhaps the most common is to scan through a long string of text looking for certain patterns. This is what you do in the Search and Replace dialog in Visual Studio if you enable regular expression matching for example. This is well covered ground for Microsoft's regular expression engine, and it is generally best practice to use it in these cases. For long strings, it can outperform our library since it implements things like Boyer-Moore searches whereas ours does not. Generally, it's a good idea to rely on Microsoft's engine for this style of matching.
The other way to use a regular expression engine is called tokenizing. This basically moves through an input stream, and it attempts to match a string under the cursor to a particular pattern, and then it reports the pattern that was matched, before advancing. Basically, it uses a compound regular expression (combined with ors) and matches text to one of the particular expressions - and tells you which one it was that matched. This is very useful for lexers/tokenizers which are used by parsers. Microsoft's regular expression engine is not suited to this task, and attempts to use it for this come with a number of disadvantages including additional code complexity and a considerable performance hit. This gap in functionality is enough to justify a custom regular expression engine if one is crafting parsers or tokenizing for some other reason.
Unlike Microsoft's regular expression engine, this engine does not backtrack. That is, it will never examine a character more than once. This leads to better performance, but non-backtracking expressions aren't quite as expressive/powerful as backtracking ones. Things like atomic zero width assertions are not possible using this engine, but the trade-off is more than worth it, especially when used for tokenizing.
Returning to the graphs, as alluded to, each one of these represents a finite state machine. The state machine has a root always represented by q0. If we were to take any subset of the graph, like, say starting at q4 and ignoring anything not pointed to directly or indirectly by q4, then that is also a state machine. Basically, a finite state machine is a set of interconnected state machines!
To find your entire state machine from the node you're presently on, you take the closure of a state. The closure is the set of all states reachable directly or indirectly from some state, including itself. The closure for q0 is always the entire graph. In this case, the closure for q4 is { q4, q5, q3 }. Note we never move backward along an arrow. The graph is directed. Taking the closure is a core operation of most any finite state machine code.
Let's move on to a slightly more complicated example: (foo|bar)+. Basically, we're just adding a loop to the machine so it can match "foobarfoo" or "foofoobar" or "barbarfoofoobar", etc.
This graph looks totally different! If you follow it however, it will make sense. Also note that we have outgoing arrows from the double circled accepting state. The implication is such that we do not stop once we reach the accepting state until we can't continue - either because we can't find the next character match or because we ran out of input. We only care that it's accepting after we've matched as much as we can. This is called greedy matching. You can simulate lazy matching with more complicated expressions but there is no built in facility for that in non-backtracking regular expressions.
Building these graphs is a bit of a trick. The above graphs are known as DFAs, but it's much easier to programmatically build graphs using NFAs. To implement NFAs, we add an additional type of line to the graph - dashed lines, which we call epsilon transitions. Unlike input transitions, epsilon transitions do not consume any input or advance the input cursor. As soon as you are in a state, you are also in every state connected to it by a dashed line. Yes, we can be in more than one state at once now.
Two things that immediately stand out are the addition of the gray dashed lines and how much more intuitive the graph is. Remember that I said you automatically move into the states connected by epsilon transitions which means as soon as you are in q0, you are also in q1, q2, and q8. It can be said that the epsilon closure of q0 is { q0, q1, q2, q8 }. This is because of the dashed arrows leading out from each of the relevant states.
The above graph is much more expensive for a computer to navigate than the one that precedes it. The reason is that those epsilon transitions introduce the ability to be in multiple states at once. It is said to be non-deterministic.
The former graph is deterministic. It can only be in one state at a time, and that makes it much simpler for a computer to navigate it quickly.
An interesting property of these non-deterministic graphs is that each one has an equivalent deterministic graph. That is, there is an equivalent graph with no dashed lines/epsilon transitions in it. In other words, for any NFA, there is a DFA that matches exactly the same input. We'll be exploiting this property later.
Now that we know we can create these NFAs (with epsilon transitions) and later transform them to DFAs (without epsilon transitions), we'll be using NFA graphs for all of our state machine construction, only turning them into DFAs once we're finished.
We're going to use a variant of Thompson's construction algorithm to build our graphs. This works by defining a few elemental subgraphs and then composing those into larger graphs:
First, we have a literal - regex of ABC:
Next, we have a character set - regex of [ABC]:
Next, we have concatenation - regex of ABC(DEF): (this is typically implicit)
Now, we have or - regex of ABC|DEF:
Next, we have optional - regex of (ABC)?:
Finally, we have repeat - regex of (ABC)+:
Every other pattern you can do can be composed of these. We put these together a bit like legos to build graphs. The only issue is when we stick them together, we must make the former accepting states non accepting and then create a new accepting state for the final machine. Repeat this process, nested to compose graphs.
Note the grayed states with only a single outgoing dashed line. These are neutral states. They effectively do not do anything to change what the machine accepts, but they are introduced during the Thompson construction. This is fine, as it is a byproduct that will be eliminated once we transform the graph to a DFA.
Final states meanwhile, are simply states that have no outgoing transition. Generally, these states are also accepting.
These finite state machines are the heart of our regular expression engine.
We'll be supporting tokenizing as a primary goal so I will attempt to explain it. Above, we have used "Accept" as our symbol - it appears below the state label (qX) in the double circled states. We can use other symbols, and we can have many different symbols per graph. Usually, such machines are called lexers. They are technically a hack of the math of a DFA, but they're a well worn hack used by every FSM based tokenizer under the sun. All it means is you can't represent it as a textual regex expression - it can only be created in code. Here's one that matches a series of digits, a series of letters (a "word"), or a string of whitespace, and it will report which of the 3 it is:
There is no true textual representation of a regular expression for this. It's a combination of other regular expressions. Each one has its own accept symbol associated with it. Furthermore, instead of running this machine once over a long string of text like we would with a traditional find-first match, we have to run this machine each time we "move" through the input. Basically, each time the machine is run, it advances the cursor over the input, so we just run the machine over and over again to move. Each iteration reports a symbol or an error.
So here, running it over the input string "foo123 bar" will report:
Word: foo
Digits: 123
Whitespace:
Word: bar
This is why parsers typically rely on them. It basically breaks an input stream into a series of lexemes or tokens. Code that runs a machine like this is commonly referred to either as a tokenizer or a lexer. It's a specialty of our regular expression engine because that's the application where it's most useful under .NET.
Powerset construction, once again, is how we turn an NFA into a DFA. We like this, because DFAs are much more efficient, as we'll see later. What we need to do, basically, is figure out all of the possible combinations of states we can be in. Each set of possible states will resolve roughly to one DFA state. The exception is when two states transition to different destination states on the same input. At that point, we end up exploding the possibilities, because we need to clone the path N-1 times, where N is the number of conflicts. This is not easy to explain nor particularly easy to code. Hopefully, a figure will help:
The graph indicates which of the source states the DFA state is composed of. You'll see that each state encompasses several of the other states. Sometimes however, a couple of states can yield a lot more states under the right - or rather wrong circumstances. I don't think I can explain the algorithm to you very clearly, but the code should help.
One thing you may have noticed is that the transition arrows sometimes have ranges above them. What it's actually doing is combining many arrows into one, and then assigning a range to that.
This is usually cosmetic, although the code is range aware and sometimes uses it for optimization, primarily during code generation or DFA state table traversal.
Speaking of DFA state tables, one nice thing about DFAs is it is relatively easy to reduce them to a state table, which allows for more efficient traversal. It's also easy to generate code from DFAs, but none of this is easy to do with NFAs.
A DFA state table looks like the following:
State | Accept     | Inputs   | Destination
q0    | n/a        | [0-9]    | q1
q0    | n/a        | [A-Za-z] | q2
q0    | n/a        | \s       | q3
q1    | Digits     | [0-9]    | q1
q2    | Word       | [A-Za-z] | q2
q3    | Whitespace | \s       | q3
In our own classes, we use nested arrays to make this slightly more efficient, but the concept is the same.
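Purely as an illustration of that idea (the library's actual CharDfaEntry layout, shown later, is organized differently), a nested-array style state table and the walk over it might look roughly like this:

// hypothetical table layout - an approximation, not the library's CharDfaEntry
struct TableState
{
    public int AcceptSymbolId;   // -1 means "not accepting"
    public char[] RangeStarts;   // transition ranges...
    public char[] RangeEnds;
    public int[] Destinations;   // ...and the state index each range leads to
}

static class TableRunner
{
    // returns the next state index, or -1 if there's no transition on this input
    public static int Move(TableState[] table, int state, char ch)
    {
        var entry = table[state];
        for (var i = 0; i < entry.Destinations.Length; i++)
            if (ch >= entry.RangeStarts[i] && ch <= entry.RangeEnds[i])
                return entry.Destinations[i];
        return -1;
    }
}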
Here's q2, as generated code:
q2:
    if ((((context.Current >= 'A')
        && (context.Current <= 'Z'))
        || ((context.Current >= 'a')
        && (context.Current <= 'z'))))
    {
        // capture the current character under the cursor
        context.CaptureCurrent();
        // advance the input position by one
        context.Advance();
        // goto the next state (ourself)
        goto q2;
    }
    return 1; // "Word" symbol id
Either way, using a table, or using generated code, you can accomplish the same thing.
Anyway, hopefully this has made the concepts clear enough that we can start coding to them.
Coding this Mess
I've partitioned the CharFA<TAccept> class into several partial classes to make it easier to navigate.
Let's start with the basic data structure of a single state in the state machine in CharFA.cs.
/// <summary>
/// Represents a single state in a character based finite state machine.
/// </summary>
/// <typeparam name="TAccept">The type of the accepting symbols</typeparam>
public partial class CharFA<TAccept>
{
    // we use a specialized dictionary class both for performance and
    // to preserve the order of the input transitions
    /// <summary>
    /// Indicates the input transitions.
    /// These are the states that will be transitioned to on the specified input key.
    /// </summary>
    public IDictionary<char, CharFA<TAccept>> InputTransitions { get; }
        = new _InputTransitionDictionary();
    /// <summary>
    /// Indicates the epsilon transitions.
    /// These are the states that are transitioned to without consuming input.
    /// </summary>
    public IList<CharFA<TAccept>> EpsilonTransitions { get; }
        = new List<CharFA<TAccept>>();
    /// <summary>
    /// Indicates whether or not this is an accepting state.
    /// When an accepting state is landed on, this indicates a potential match.
    /// </summary>
    public bool IsAccepting { get; set; } = false;
    /// <summary>
    /// The symbol to associate with this accepting state.
    /// Upon accepting a match, the specified symbol is returned which can identify it.
    /// </summary>
    public TAccept AcceptSymbol { get; set; } = default(TAccept);
    /// <summary>
    /// Indicates a user-defined value to associate with this state
    /// </summary>
    public object Tag { get; set; } = null;
    /// <summary>
    /// Constructs a new instance with the specified accepting value and accept symbol.
    /// </summary>
    /// <param name="isAccepting">Indicates whether or not the state is accepting</param>
    /// <param name="acceptSymbol">Indicates the associated symbol to be used
    /// when accepting.</param>
    public CharFA(bool isAccepting, TAccept acceptSymbol = default(TAccept))
    {
        IsAccepting = isAccepting;
        AcceptSymbol = acceptSymbol;
    }
    /// <summary>
    /// Constructs a new non-accepting state
    /// </summary>
    public CharFA()
    {
    }
}
There's not a whole lot here yet once you strip away the comments. It's mainly just some data fields and a couple of constructors:
First, we have InputTransitions. This is a dictionary of type IDictionary<char,CharFA<TAccept>> which represents the black arrows in the graph. The key is the character above the line. The value is a reference to the destination state. In practice, this is an optimized container that stores data internally by state, not by input character. However, this isn't necessary in theory. It's just that it yields better performance in practice under typical use cases.
Next, we have EpsilonTransitions. This is a list of type IList<CharFA<TAccept>> which holds references to destination states and represents the dashed arrows in the graph. For DFAs, this list will always be empty. As soon as one isn't, it's an NFA, not a DFA, by definition.
IsAccepting indicates whether the state is accepting. Double circles in the graphs are accepting states.
AcceptSymbol indicates the accept symbol to associate with this state. This is used for tokenization. Each symbol is basically an id for the pattern it represents. Above, they were "Word", "Digits" and "Whitespace".
Finally, we have Tag. This property is for an arbitrary user defined value to associate with the state. This is typically used for debugging but it doesn't have to be. It has no effect on the machine.
Other than that, we simply have a convenience constructor and the default constructor.
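Just to show how little machinery is involved, here's the "foo" branch of the earlier foo|bar graph wired up by hand with nothing but these properties. Normally you'd let the Thompson construction methods described later do this for you.

// q0 --f--> q1 --o--> q2 --o--> q3, with q3 accepting
var q0 = new CharFA<string>();
var q1 = new CharFA<string>();
var q2 = new CharFA<string>();
var q3 = new CharFA<string>(true, "Accept");
q0.InputTransitions['f'] = q1;
q1.InputTransitions['o'] = q2;
q2.InputTransitions['o'] = q3;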
Now, we need a way to do the basics of even finding the boundaries of the state machine. We need our closure functions! We can see that and other computation methods in CharFA.Computation.cs.
From now on, I will only highlight interesting portions of code as necessary, otherwise the article will be very long.
public IList<CharFA<TAccept>> FillClosure(IList<CharFA<TAccept>> result = null)
{
    if (null == result)
        result = new List<CharFA<TAccept>>();
    else if (result.Contains(this))
        return result;
    result.Add(this);
    // use a cast to get the optimized internal input mapping by FA state
    foreach (var trns in InputTransitions as IDictionary<CharFA<TAccept>, ICollection<char>>)
        trns.Key.FillClosure(result);
    foreach (var fa in EpsilonTransitions)
        fa.FillClosure(result);
    return result;
}
Basically, what we're doing here is recursively calling FillClosure() on any states we haven't seen before, adding them as we go. The check to see if we already have a state before recursing is important, otherwise a loop in the machine would cause a stack overflow in this function!
FillEpsilonClosure() works essentially the same way, but only traverses epsilon transitions.
Moving on, we have FillMove():
public static IList<CharFA<TAccept>> FillMove(IEnumerable<CharFA<TAccept>> states,
    char input, IList<CharFA<TAccept>> result = null)
{
    if (null == result) result = new List<CharFA<TAccept>>();
    foreach (var fa in FillEpsilonClosure(states))
    {
        // examine each of the states reachable from this state on no input
        CharFA<TAccept> ofa;
        // see if this state has this input in its transitions
        if (fa.InputTransitions.TryGetValue(input, out ofa))
            foreach (var efa in ofa.FillEpsilonClosure())
                if (!result.Contains(efa)) // if it does, add it if it's not already there
                    result.Add(efa);
    }
    return result;
}
Remember when I said you can be in more than one state at once? Well, this function takes the set of states we're currently in and moves as indicated by the specified input character, yielding the set of states that it resulted in, after moving along all the necessary dashed gray lines and the solid black line, if one is applicable. Basically, it works simply by querying each input transition dictionary in turn for the input character after performing an epsilon closure. It then performs the epsilon closure of those result states in order to complete the operation. This is the heart of running a regular expression, as this represents a single iteration of the regex character matching. This works on NFAs and DFAs, but there's a more efficient way to move along DFAs since they can only be in one state. Consider MoveDfa():
public CharFA<TAccept> MoveDfa(char input)
{
    CharFA<TAccept> fa;
    if (InputTransitions.TryGetValue(input, out fa))
        return fa;
    return null;
}
There are no loops or collections involved here. There is simply a dictionary lookup. This is much more efficient, but later, we'll make it more efficient still. Note that you can try this on an NFA but it won't work right. Unfortunately, while it's possible to check if a machine is a DFA or not, there's no way to do it efficiently enough for this method. Be sure to only use it with DFAs.
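As a quick illustration (this helper isn't part of the library), checking whether a DFA accepts an entire string is just a loop over MoveDfa():

// assumes dfa is the initial state of a DFA - see ToDfa() later in the article
static bool IsMatch(CharFA<string> dfa, string input)
{
    var state = dfa;
    foreach (var ch in input)
    {
        state = state.MoveDfa(ch);
        if (null == state)
            return false; // no transition on this character
    }
    return state.IsAccepting;
}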
FillInputTransitionRangesGroupedByState() is a mouthful, but it's descriptive. Each state is returned with a collection of inputs that lead to it. These inputs are returned as character ranges. This is primarily used for display and for code generation. The ranges are used to display the ranges above the input transitions, and the code generator uses them for creating conditional statements.
I'd like to direct your attention to our optimized input transition dictionary in CharFA.InputTransitionDictionary.cs:
// a specialized input transition container dictionary.
// this isn't required for a working regex engine but
// can make some common operations significantly faster.
partial class CharFA<TAccept>
{
    /// <summary>
    /// This is a specialized transition container
    /// that can return its transitions in 3 different ways:
    /// 1. a dictionary where each transition state is keyed
    ///    by an individual input character (default)
    /// 2. a dictionary where each collection of inputs is keyed
    ///    by the transition state (used mostly by optimizations)
    /// 3. an indexable list of pairs where the key is the transition state
    ///    and the value is the collection of inputs
    /// use casts to get at the appropriate interface for your operation.
    /// </summary>
    private class _InputTransitionDictionary :
        IDictionary<char, CharFA<TAccept>>, // #1
        IDictionary<CharFA<TAccept>, ICollection<char>>, // #2
        IList<KeyValuePair<CharFA<TAccept>, ICollection<char>>> // #3
    {
        IDictionary<CharFA<TAccept>, ICollection<char>> _inner =
            new ListDictionary<CharFA<TAccept>, ICollection<char>>();
        ...
Internally, this class stores transitions as IDictionary<CharFA<TAccept>,ICollection<char>>, which is basically the mapping in reverse, where the destination state is the key, and the inputs that point to it are the values. This provides a number of advantages for several operations throughout the code. You can get to this mapping by casting to the aforementioned interface. The other way to get at this data is similar to the previous, except instead of a dictionary it returns an ordered list of key value pairs. Cast to IList<KeyValuePair<CharFA<TAccept>, ICollection<char>>> in order to enable this. By default however, the class maps to the familiar IDictionary<char, CharFA<TAccept>> that we're used to. Note that this dictionary is actually slower than a standard dictionary for IDictionary<char, CharFA<TAccept>> lookups, but that doesn't matter because typically we won't be using FillMove() or even MoveDfa() to run our regex matching. We'll be using a DFA state table or compiled code, which will not have this overhead. The tradeoff for this overhead is faster operations in terms of building and manipulating the finite state machine.
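For example, given some state (here state is whatever CharFA<string> instance you're inspecting), you can view its transitions grouped by destination simply by casting, which is exactly what FillClosure() did above:

// view the transitions keyed by destination state rather than by input character
var byState = state.InputTransitions as IDictionary<CharFA<string>, ICollection<char>>;
foreach (var trns in byState)
    Console.WriteLine("{0} input character(s) lead to the same destination state",
        trns.Value.Count);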
Moving on, let's take a look at a monster of an equality comparer class in CharFA.SetComparer.cs:
partial class CharFA<TAccept>
{
    // this class provides a series of comparers for various CharFA operations
    // these are primarily used during duplicate checking and in the powerset
    // construction
    private sealed class _SetComparer :
        IEqualityComparer<IList<CharFA<TAccept>>>,
        IEqualityComparer<ICollection<CharFA<TAccept>>>,
        IEqualityComparer<IDictionary<char, CharFA<TAccept>>>
    {
        ...
See my article on value equality semantics in C# for more about how equality comparers work. You can see this class implements three of them, each handling a different type of container. We use these in our scratch/working hashsets and dictionaries during powerset construction but also in other areas where we need to key a dictionary or hashtable by a collection of states or we need to compare input transitions (used by duplicate state detection code).
We can't do much without being able to build our state machines in the first place. Fortunately, we have a bunch of methods for doing so in CharFA.ThompsonConstruction.cs:
Each of the methods takes either an input string or one or more expressions, possibly additional parameters, and finally, an optional accept symbol. This file provides the following constructions:
Literal(): This builds a literal given the input string.
Set(): This builds a character set given the input string or collection of character ranges.
Or(): This creates an alternation between two or more expressions. Basically, this will create a state machine that matches any of the specified expressions.
Concat(): This creates a concatenation of two or more expressions in a series.
Optional(): This creates a state machine that makes the passed in expression optional.
Repeat(): This repeats an expression an optionally specified minimum and maximum number of times.
CaseInsensitive(): This makes the specified expression case insensitive.
Each of these works by taking the passed in FSM/expression or string and creating a new FSM based on it. Typically, these methods connect the lines between the different states in order to fulfill the operation.
Consider Set():
public static CharFA<TAccept> Set(IEnumerable<char> set, TAccept accept = default(TAccept))
{
    var result = new CharFA<TAccept>();
    var final = new CharFA<TAccept>(true, accept);
    foreach (var ch in set)
        result.InputTransitions[ch] = final;
    return result;
}
What it does is create two new states, and then for each character in the set, it creates a transition from the first state to the final state, before returning the first state. This creates a graph as shown previously for character sets. Some of these can get quite complicated. Repeat() is an example of that, simply because there are so many corner cases.
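Putting a few of these together, the (foo|bar)+ machine from earlier can be built entirely in code. The exact overloads (and the optional accept-symbol parameters) are approximated here, but the shape of it is:

// compose the elemental constructions into (foo|bar)+
var foo = CharFA<string>.Literal("foo");
var bar = CharFA<string>.Literal("bar");
var either = CharFA<string>.Or(foo, bar);       // foo|bar
var expr = CharFA<string>.Repeat(either, 1);    // at least one occurrence: (foo|bar)+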
This is all well and good, but what if we need to look at the graph and examine what we just built? Sure we can dig through the dictionaries in the debugger, but what a pain! Instead, why not just be able to render the state machine to a jpeg or a png? Enter CharFA.GraphViz.cs:
This requires GraphViz, naturally, so make sure it is installed. This basically encompasses the whole grotty affair of using the GraphViz dot utility to render images from state machine graphs. Dot is pretty powerful, but also messy, a bit like Perl. The code to generate these dot specifications reflects this. Basically, how it works is it takes the state machine, writes a dot spec to a stream which it pipes to the dot utility, and then pipes in the image that comes back. This requires GraphViz to be in your path, but it should be, as long as it's installed. There are a number of options you can get to via CharFA<TAccept>.DotGraphOptions, which you would optionally pass to RenderToFile() or RenderToStream(). You shouldn't need to use WriteDotTo() directly, but you can if you want to see the dot specification output. Using this is simple:
var lit = CharFA<string>.Literal("ABC");
lit.RenderToFile("5251476/literal.jpg");
This code takes care of all the terrible details. You can specify the output format via the file extension. For example, to render PNG files you'd use ".png" instead of ".jpg". I used this facility to render all the graphs for this article.
Finally, none of this is useful if we can't actually use the regular expressions to lex/tokenize or search text. This is what CharFA.Lexer.cs and CharFA.Matcher.cs are for:
partial class CharFA<TAccept>
{
    /// <summary>
    /// Creates a lexer out of the specified FSM "expressions"
    /// </summary>
    /// <param name="exprs">The expressions to compose the lexer with</param>
    /// <returns>An FSM representing the lexer.</returns>
    public static CharFA<TAccept> ToLexer(params CharFA<TAccept>[] exprs)
    {
        var result = new CharFA<TAccept>();
        for (var i = 0; i < exprs.Length; i++)
            result.EpsilonTransitions.Add(exprs[i]);
        return result;
    }
    /// <summary>
    /// Lexes the next input from the parse context.
    /// </summary>
    /// <param name="context">The <see cref="ParseContext"/> to use.</param>
    /// <param name="errorSymbol">The symbol to report in the case of an error</param>
    /// <returns>The next symbol matched - <paramref name="context"/>
    /// contains the capture and line information</returns>
    public TAccept Lex(ParseContext context, TAccept errorSymbol = default(TAccept))
    {
        TAccept acc;
        // get the initial states
        var states = FillEpsilonClosure();
        // prepare the parse context
        context.EnsureStarted();
        while (true)
        {
            // if no more input
            if (-1 == context.Current)
            {
                // if we accept, return that
                if (TryGetAnyAcceptSymbol(states, out acc))
                    return acc;
                // otherwise return error
                return errorSymbol;
            }
            // move by current character
            var newStates = FillMove(states, (char)context.Current);
            // we couldn't match anything
            if (0 == newStates.Count)
            {
                // if we accept, return that
                if (TryGetAnyAcceptSymbol(states, out acc))
                    return acc;
                // otherwise error
                // store the current character
                context.CaptureCurrent();
                // advance the input
                context.Advance();
                return errorSymbol;
            }
            // store the current character
            context.CaptureCurrent();
            // advance the input
            context.Advance();
            // iterate to our next states
            states = newStates;
        }
    }
    /// <summary>
    /// Lexes the next input from the parse context.
    /// </summary>
    /// <param name="context">The <see cref="ParseContext"/> to use.</param>
    /// <param name="errorSymbol">The symbol to report in the case of an error</param>
    /// <returns>The next symbol matched - <paramref name="context"/>
    /// contains the capture and line information</returns>
    /// <remarks>This method will not work properly on an NFA but will not error
    /// in that case, so take care to only use this with a DFA</remarks>
    public TAccept LexDfa(ParseContext context, TAccept errorSymbol = default(TAccept))
    {
        // track our current state
        var state = this;
        // prepare the parse context
        context.EnsureStarted();
        while (true)
        {
            // if no more input
            if (-1 == context.Current)
            {
                // if we accept, return that
                if (state.IsAccepting)
                    return state.AcceptSymbol;
                // otherwise return error
                return errorSymbol;
            }
            // move by current character
            var newState = state.MoveDfa((char)context.Current);
            // we couldn't match anything
            if (null == newState)
            {
                // if we accept, return that
                if (state.IsAccepting)
                    return state.AcceptSymbol;
                // otherwise error
                // store the current character
                context.CaptureCurrent();
                // advance the input
                context.Advance();
                return errorSymbol;
            }
            // store the current character
            context.CaptureCurrent();
            // advance the input
            context.Advance();
            // iterate to our next states
            state = newState;
        }
    }
    public static int LexDfa(CharDfaEntry[] dfaTable,
        ParseContext context, int errorSymbol = -1)
    {
        // track our current state
        var state = 0;
        // prepare the parse context
        context.EnsureStarted();
        while (true)
        {
            // if no more input
            if (-1 == context.Current)
            {
                var sid = dfaTable[state].AcceptSymbolId;
                // if we accept, return that
                if (-1 != sid)
                    return sid;
                // otherwise return error
                return errorSymbol;
            }
            // move by current character
            var newState = MoveDfa(dfaTable, state, (char)context.Current);
            // we couldn't match anything
            if (-1 == newState)
            {
                // if we accept, return that
                if (-1 != dfaTable[state].AcceptSymbolId)
                    return dfaTable[state].AcceptSymbolId;
                // otherwise error
                // store the current character
                context.CaptureCurrent();
                // advance the input
                context.Advance();
                return errorSymbol;
            }
            // store the current character
            context.CaptureCurrent();
            // advance the input
            context.Advance();
            // iterate to our next states
            state = newState;
        }
    }
}
You can see we've introduced a new class, called ParseContext. This class handles cursor position tracking and capturing during a lex or a match. It can also be used to implement hand rolled parsers, which is what it was designed for. See my Code Project article on it for more information. A ParseContext can be created over any instance of IEnumerable<char>. If you want raw speed, you can substitute the 1.0 version of ParseContext, which doesn't support lookahead (we don't need it here) but is ever so slightly faster.
The matching functionality is very similar so we won't cover the implementation here.
The three methods here all do the same thing using different mechanisms. They are listed in order of performance, worst to best. The better ones take upfront prep - converting to a DFA or a DFA state table - but they perform better. They are fairly simple to use. Each one operates almost the same way, but lexing with a DFA table is slightly more complicated because the symbols are mapped to ints and you use a symbol table to resolve them.
// create a parse context over our test string
var pc = ParseContext.Create(test);
// while not end of input
while (-1 != pc.Current)
{
    // clear the capture so that we don't keep appending the token data
    pc.ClearCapture();
    // lex the next token, using #ERROR as our error symbol
    var acc = lexer.Lex(pc, "#ERROR");
    // write the result
    Console.WriteLine("{0}: {1}", acc, pc.GetCapture());
}
Using matching to search through a string looks like this:
CharFAMatch match;
var pc = ParseContext.Create(test);
while (null != (match = word.Match(pc)))
    Console.WriteLine("Found match at {0}: {1}", match.Position, match.Value);
We just keep calling Match() or MatchDfa() until it returns null.
What about parsing an actual textual regular expression? What if we want to create a CharFA<TAccept> instance from an expression like (ABC|DEF)+? This is why we have our DOM, which is composed of RegexXXXXExpression classes. These do more than parse. You can use them to analyze and manipulate the textual representation of a regular expression before converting it to a CharFA<TAccept> state machine.
RegexExpression.Parse() is how we get a regular expression DOM from its textual representation. It uses recursive descent parsing to build the DOM.
The following DOM expressions are available:
RegexExpression: The abstract base of all other expressions.
RegexUnaryExpression: An abstract base for an expression that has one target expression.
RegexBinaryExpression: An abstract base for an expression that has a left and a right target expression.
RegexCharsetExpression: Represents [] expressions. Supports many POSIX character classes.
RegexConcatExpression: Represents regex concatenation, such as ABC.
RegexLiteralExpression: Represents a single literal character such as A.
RegexOptionalExpression: Represents the ? modifier.
RegexOrExpression: Represents | alternation.
RegexRepeatExpression: Represents *, + and {,} modifiers.
Here's a brief example of how to use the DOM:
var test = "(ABC|DEF)*";
var dom = RegexExpression.Parse(test);
Console.WriteLine(dom.ToString());
var rep = dom as RegexRepeatExpression;
rep.MinOccurs = 1;
Console.WriteLine(dom.ToString());
var fa = dom.ToFA("Accept");
Which outputs:
(ABC|DEF)*
(ABC|DEF)+
And creates a state machine graph of:
Now, in order to transform something like the above to a DFA, we need to perform powerset construction.
The code for this is in CharFA.PowersetConstruction.cs:
partial class CharFA<TAccept>
{
    /// <summary>
    /// Transforms an NFA to a DFA
    /// </summary>
    /// <param name="progress">The optional progress object used to report
    /// the progress of the operation</param>
    /// <returns>A new finite state machine equivalent to this state machine
    /// but with no epsilon transitions</returns>
    public CharFA<TAccept> ToDfa(IProgress<CharFAProgress> progress = null)
    {
        // The DFA states are keyed by the set of NFA states they represent.
        var dfaMap = new Dictionary<List<CharFA<TAccept>>,
            CharFA<TAccept>>(_SetComparer.Default);
        var unmarked = new HashSet<CharFA<TAccept>>();
        // compute the epsilon closure of the initial state in the NFA
        var states = new List<CharFA<TAccept>>();
        FillEpsilonClosure(states);
        // create a new state to represent the current set of states. If one
        // of those states is accepting, set this whole state to be accepting.
        CharFA<TAccept> dfa = new CharFA<TAccept>();
        var al = new List<TAccept>();
        // find the accepting symbols for the current states
        foreach (var fa in states)
            if (fa.IsAccepting)
                if (!al.Contains(fa.AcceptSymbol))
                    al.Add(fa.AcceptSymbol);
        // here we assign the appropriate accepting symbol
        int ac = al.Count;
        if (1 == ac)
            dfa.AcceptSymbol = al[0];
        else if (1 < ac)
            dfa.AcceptSymbol = al[0]; // could throw, just choose the first one
        dfa.IsAccepting = 0 < ac;
        CharFA<TAccept> result = dfa; // store the initial state for later,
                                      // so we can return it.
        // add it to the dfa map
        dfaMap.Add(states, dfa);
        dfa.Tag = new List<CharFA<TAccept>>(states);
        // add it to the unmarked states, signalling that we still have work to do.
        unmarked.Add(dfa);
        bool done = false;
        var j = 0;
        while (!done)
        {
            // report our progress
            if (null != progress)
                progress.Report(new CharFAProgress(CharFAStatus.DfaTransform, j));
            done = true;
            // a new hashset used to hold our current key states
            var mapKeys = new HashSet<List<CharFA<TAccept>>>(dfaMap.Keys, _SetComparer.Default);
            foreach (var mapKey in mapKeys)
            {
                dfa = dfaMap[mapKey];
                if (unmarked.Contains(dfa))
                {
                    // when we get here, mapKey represents the epsilon closure of our
                    // current dfa state, which is indicated by kvp.Value
                    // build the transition list for the new state by combining the transitions
                    // from each of the old states
                    // retrieve every possible input for these states
                    var inputs = new HashSet<char>();
                    foreach (var state in mapKey)
                    {
                        var dtrns = state.InputTransitions as
                            IDictionary<CharFA<TAccept>, ICollection<char>>;
                        foreach (var trns in dtrns)
                            foreach (var inp in trns.Value)
                                inputs.Add(inp);
                    }
                    // for each input, create a new transition
                    foreach (var input in inputs)
                    {
                        var acc = new List<TAccept>();
                        var ns = new List<CharFA<TAccept>>();
                        foreach (var state in mapKey)
                        {
                            CharFA<TAccept> dst = null;
                            if (state.InputTransitions.TryGetValue(input, out dst))
                            {
                                foreach (var d in dst.FillEpsilonClosure())
                                {
                                    // add the accepting symbols
                                    if (d.IsAccepting)
                                        if (!acc.Contains(d.AcceptSymbol))
                                            acc.Add(d.AcceptSymbol);
                                    if (!ns.Contains(d))
                                        ns.Add(d);
                                }
                            }
                        }
                        CharFA<TAccept> ndfa;
                        if (!dfaMap.TryGetValue(ns, out ndfa))
                        {
                            ac = acc.Count;
                            ndfa = new CharFA<TAccept>(0 < ac);
                            // assign the appropriate accepting symbol
                            if (1 == ac)
                                ndfa.AcceptSymbol = acc[0];
                            else if (1 < ac)
                                ndfa.AcceptSymbol = acc[0]; // could throw, instead just set it
                                                            // to the first state's accept
                            dfaMap.Add(ns, ndfa);
                            // work on this new state
                            unmarked.Add(ndfa);
                            ndfa.Tag = new List<CharFA<TAccept>>(ns);
                            done = false;
                        }
                        dfa.InputTransitions.Add(input, ndfa);
                    }
                    // we're done with this state
                    unmarked.Remove(dfa);
                }
            }
            ++j;
        }
        return result;
    }
}
Ooooh, that's ugly. The sordid truth is I spent ages getting it working a long time ago in a previous incarnation of my regex engine. It has been evolved over time. I only vaguely understand the math involved, so I haven't tried to recode it from scratch even though it could probably use it. The good news is, it works. Using ToDfa() is straightforward, but the progress parameter deserves some explanation. It follows Microsoft's IProgress<T> pattern used for long running tasks. In this case, it holds a count and a status. The count is simply the number of iterations currently, and the status shows what operation is being performed.
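Using it looks something like the following. The nfa variable stands for whatever machine you built with the Thompson constructions; the progress callback is optional, and here we just print a dot per report rather than poke at the CharFAProgress properties.

// convert the NFA to an equivalent DFA, reporting progress as it goes
var dfa = nfa.ToDfa(new Progress<CharFAProgress>(p => Console.Write(".")));
Console.WriteLine();
dfa.RenderToFile("dfa.jpg"); // optional - requires GraphViz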
We have a related mechanism to generate a DFA state table in CharFA.DfaStateTable.cs. It contains ToDfaStateTable(), which takes an optional (but recommended) symbol table that is simply an array of TAccept. It maps the symbols to ids such that the id is the index of the TAccept symbol in the array. Since the table based DFA only uses integers, this table is used to map them back to symbols. The created table can be passed to the static XXXXDfa() methods.
Our engine is now feature complete, sans code generation, so that is what we'll cover next. Let's visit CharFA.CodeGeneration.cs:
This code supports two major styles of code generation; it can simply serialize the dfa table and symbol tables to code (table driven) or it can actually generate compilable state machine code using gotos. Generally, it's recommended to use the table driven approach as they are roughly the same speed unless the state machine is large, in which case the table driven form overtakes the compiled form. The compiled form is mainly included for curiosity and illustrative purposes. Keep in mind either way that the release builds will be much faster.
Serialization of the table driven arrays is performed via GenerateSymbolTableInitializer() and GenerateDfaStateTableInitializer(). Each returns a Code DOM expression that can be used to initialize a field or a variable.
Typically, what you'll do is use the Code DOM to build a class, then for each generated regular expression or lexer, you will serialize one or two read-only static member fields on that class: the symbol table, if using a lexer, and the DFA state table in either case. You can then use these fields at runtime like you would a normal runtime DFA state table, passing them to LexDfa() or MatchDfa() as appropriate. When using the lexer, you will use the symbol table to resolve the accept symbols as shown below:
pc = ParseContext.Create(test);
while (-1 != pc.Current)
{
    pc.ClearCapture();
    var acc = CharFA<string>.LexDfa(GeneratedLexDfaTable, pc, 3);
    // when we write this, we map our symbol id back to the
    // symbol using our symbol table.
    Console.WriteLine("{0}: {1}", GeneratedLexSymbols[acc], pc.GetCapture());
}
This is the easiest code generation option, and is best for general purpose use.
The other way to generate code is through the GenerateMatchMethod() and GenerateLexMethod() methods, which generate compile-ready code using gotos. These don't require a DFA table, but you will need to use a symbol table in tandem with the lex method, so use the aforementioned method to generate that. You'll have to name the methods and set the access modifiers appropriately before you use them.
Typically, as before, you'll use the Code DOM to build a class, and then for each generated regular expression or lexer, you will use the above to create a static method (and an associated symbol table field for the lexers) on the class. Often, you will generate both lex and match methods plus one symbol table field for each expression. That's three members in total, in order to have the full regex capabilities. If you don't need lexing, you can leave it out, and similarly for matching. To be clear, if all you need is matching, you'll just have the one static method per expression. If all you need is lexing, you'll have one static method and one static read-only field for each. If you need both lexing and matching, you'll have two methods and one field, for a total of three members per expression.
Be conservative in your use of this as the compiled methods can become large pretty quickly. The arrays wind up large in the source too, but it's mostly whitespace. Also keep in mind, there is a slight overhead on startup with the table driven versions because of the static field initializers, so if you have a large number of these, your app may stall a little on startup. This is still much quicker than building the state machine at runtime, no matter how you do it.
Let's explore the compiled version by revisiting our DFA lexer graph (this time without the extra clutter):
Keep this figure in mind. Now let's look at the default generated code for it:
internal static int Lex(RE.ParseContext context)
{
    context.EnsureStarted();
    // q0
    if (((context.Current >= '0')
        && (context.Current <= '9')))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q1;
    }
    if ((((context.Current >= 'A')
        && (context.Current <= 'Z'))
        || ((context.Current >= 'a')
        && (context.Current <= 'z'))))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q2;
    }
    if (((((context.Current == '\t')
        || ((context.Current >= '\n')
        && (context.Current <= '\f')))
        || (context.Current == '\r'))
        || (context.Current == ' ')))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q3;
    }
    goto error;
q1:
    if (((context.Current >= '0')
        && (context.Current <= '9')))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q1;
    }
    return 0;
q2:
    if ((((context.Current >= 'A')
        && (context.Current <= 'Z'))
        || ((context.Current >= 'a')
        && (context.Current <= 'z'))))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q2;
    }
    return 1;
q3:
    if (((((context.Current == '\t')
        || ((context.Current >= '\n')
        && (context.Current <= '\f')))
        || (context.Current == '\r'))
        || (context.Current == ' ')))
    {
        context.CaptureCurrent();
        context.Advance();
        goto q3;
    }
    return 2;
error:
    context.CaptureCurrent();
    context.Advance();
    return 3;
}
If you compare the previous graph figure to the code above, you can see how they line up: each state in the graph is represented by a label in the code - except the first state, which is indicated by a comment since it's never jumped to. Each transition range, meanwhile, is indicated by the relevant conditions in the if statements. There is one if for each possible destination state. CaptureCurrent() just stores the character at the cursor, while Advance() simply moves the cursor forward by one character. Each of the returned accept symbol ids is hard coded. The error condition knows to return 3 because we passed it into the generation method. Other than that, the code is very straightforward.
Usually, you'll want to change the name of the method and perhaps the access modifiers before you use it. The above are just the default.
The table generation methods aren't as interesting since they simply produce array initializer expressions. We won't cover the generated code here, but the included demo project does. The two table generation methods use _Serialize(), which recursively creates Code DOM expressions to instantiate the values in instances or arrays. In order to get CharDfaEntry and CharDfaTransitionEntry to serialize, we use Microsoft's component model type descriptor framework, and have two custom type converters in CharDfaEntry.cs which tell our code how to serialize the respective types. It uses InstanceDescriptor, which is a bit arcane, but see this article for details on how it works.
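If you haven't run into that technique before, the general shape of such a converter is sketched below. This is an illustrative example on a made-up Foo type, not the converter that ships in CharDfaEntry.cs - the real one describes the actual CharDfaEntry constructor arguments - but the InstanceDescriptor mechanics are the same.

using System;
using System.ComponentModel;
using System.ComponentModel.Design.Serialization;
using System.Globalization;

class Foo
{
    public int X;
    public int Y;
    public Foo(int x, int y) { X = x; Y = y; }
}

// tells the serializer "to recreate a Foo, call this constructor with these arguments"
class FooConverter : TypeConverter
{
    public override bool CanConvertTo(ITypeDescriptorContext context, Type destinationType)
    {
        return destinationType == typeof(InstanceDescriptor)
            || base.CanConvertTo(context, destinationType);
    }
    public override object ConvertTo(ITypeDescriptorContext context,
        CultureInfo culture, object value, Type destinationType)
    {
        if (destinationType == typeof(InstanceDescriptor))
        {
            var foo = (Foo)value;
            var ctor = typeof(Foo).GetConstructor(new[] { typeof(int), typeof(int) });
            return new InstanceDescriptor(ctor, new object[] { foo.X, foo.Y });
        }
        return base.ConvertTo(context, culture, value, destinationType);
    }
}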
Generally, you'll create a class with the Code DOM, and then add methods or fields you generate to it:
The process for generating both compiled and table driven lex code is shown below:
// create the symbol table (include the error symbol at index/id 3)
var symbolTable = new string[] { "Digits", "Word", "Whitespace", "#ERROR" };
// create the DFA table we'll use to generate code
var dfaTable = lexer.ToDfaStateTable(symbolTable);
// create our new class - in production we'd change the name
// to something more appropriate
var compClass = new CodeTypeDeclaration("RegexGenerated");
compClass.TypeAttributes = System.Reflection.TypeAttributes.Class;
compClass.Attributes = MemberAttributes.Final | MemberAttributes.Static;
// add the symbol table field - in production we'll change the name
var symtblField = new CodeMemberField(typeof(string[]), "LexSymbols");
symtblField.Attributes = MemberAttributes.Static | MemberAttributes.Public;
// generate the symbol table init code
symtblField.InitExpression = CharFA<string>.GenerateSymbolTableInitializer(symbolTable);
compClass.Members.Add(symtblField);
// Generate and add the compiled lex method code
compClass.Members.Add(CharFA<string>.GenerateLexMethod(dfaTable, 3));
// in production we'd change the name of the returned method
// above
// add the DFA table field - in production we'd change the name
var dfatblField = new CodeMemberField(typeof(CharDfaEntry[]), "LexDfaTable");
dfatblField.Attributes = MemberAttributes.Static | MemberAttributes.Public;
// generate the DFA state table init code
dfatblField.InitExpression = CharFA<string>.GenerateDfaStateTableInitializer(dfaTable);
compClass.Members.Add(dfatblField);
// create the C# provider and generate the code
// we'll usually want to put this in a namespace
// but we haven't here
var prov = CodeDomProvider.CreateProvider("cs");
prov.GenerateCodeFromType(compClass, Console.Out, new CodeGeneratorOptions());
You'd want to repeat this on this class for every expression. I recommend using a partial class for each expression as the source already gets long. The mechanism is similar, but slightly easier for generating match methods, because those do not need a symbol table field.
Using the compiled Lex() method is exactly like using the DFA state table version of LexDfa(), except easier. You do not have to pass a DFA state table or the error symbol, as those are already "baked in" to the method as constants. You still have to map symbol ids back to their symbols using the symbol table. This is for performance reasons when it comes to parsers, which tend to deal with symbols as int ids internally.
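Assuming you kept the default names from the generation example above (a RegexGenerated class with a Lex() method and a LexSymbols field), using the compiled lexer looks like this:

var pc = ParseContext.Create(test);
while (-1 != pc.Current)
{
    pc.ClearCapture();
    // no DFA table or error symbol to pass - they're baked into the generated method
    var acc = RegexGenerated.Lex(pc);
    // map the returned symbol id back to its symbol using the generated symbol table
    Console.WriteLine("{0}: {1}", RegexGenerated.LexSymbols[acc], pc.GetCapture());
}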
Generating a compiled match method using GenerateMatchMethod() is essentially the same, minus the error symbol id, which it doesn't need. Using the compiled match method is the same as using the static MatchDfa() method, except you don't pass the DFA state table. See the code well above for how to do matching. You also won't need to generate a symbol table.
We're now feature complete. The rest of the code files are either support or gold plating. This should provide you a solid foundation for crafting your own regular expression engine or modifying this one. Hopefully, you have enjoyed the journey.
Translated from: https://www.codeproject.com/Articles/5251476/How-to-Build-a-Regex-Engine-in-Csharp